Detailed explanation of AI+Web3 infrastructure

Intermediate

3/29/2024, 7:44:08 PM

The main projects at the infrastructure layer of the AI+Web3 industry basically take the decentralized computing network as the main narrative, low cost as the main advantage, token incentives as the main way to expand the network, and serving AI+Web3 customers as the main goal.

Forwarded original title: AI+Web3 Future Development Path (2): Infrastructure Chapter

Infrastructure is a sure growth direction in AI development

1. Surging AI Computing Demand

In recent years, demand for computing power has grown rapidly, particularly since the emergence of large language models (LLMs), and this surge has significantly impacted the high-performance computing market. Data from OpenAI shows that since 2012, the computing power used to train the largest AI models has grown exponentially, doubling roughly every 3.4 months on average, far faster than the pace predicted by Moore’s Law. Growing demand for AI applications has in turn driven a rapid increase in the need for computing hardware. Projections indicate that by 2025, demand for computing hardware driven by AI applications will rise by approximately 10% to 15%.

Driven by the demand for AI computing power, GPU manufacturer NVIDIA has seen continuous growth in data center revenue. In Q2 of 2023, data center revenue reached $10.32 billion, up 141% from Q1 of 2023 and 171% from the same period a year earlier. By the fourth quarter of fiscal year 2024, the data center segment accounted for over 83% of total revenue and grew 409% year over year, with 40% attributed to large model inference scenarios, indicating robust demand for high-performance computing power.

At the same time, the need for vast amounts of data places heavy demands on storage and hardware memory. The model training phase in particular requires large volumes of parameters and data to be stored. Memory chips used in AI servers mainly include high-bandwidth memory (HBM), DRAM, and SSD. AI server workloads call for greater capacity, higher performance, lower latency, and faster response times. According to Micron’s calculations, an AI server contains eight times as much DRAM and three times as much NAND as a traditional server.

2. Supply-Demand Imbalance Boosts Computing Power Costs

Typically, computing power is primarily utilized in the training, fine-tuning, and inference stages of AI models, especially during the training and fine-tuning phases. Due to the increased data parameter inputs, computational requirements, and the heightened demand for interconnectivity in parallel computing, there is a need for more powerful and interconnected GPU hardware, often in the form of high-performance GPU clusters. As large models evolve, the computational complexity increases linearly, necessitating more high-end hardware to meet the demands of model training.

Taking GPT-3 as an example, with a scenario involving around 13 million independent user visits, the corresponding chip demand would exceed 30,000 A100 GPUs. This initial investment cost would reach a staggering $800 million, with estimated daily model inference costs totaling around $700,000.

Meanwhile, industry reports indicate that in the fourth quarter of 2023, NVIDIA’s GPU supply was severely constrained globally, leading to a noticeable imbalance between supply and demand in markets worldwide. NVIDIA’s production capacity was limited by bottlenecks at TSMC and in HBM and CoWoS packaging, and the “severe shortage” of the H100 GPU is expected to persist at least until the end of 2024.

As a result, surging demand for high-end GPUs combined with supply constraints has driven up the prices of hardware components such as GPUs. For companies like NVIDIA that occupy a core position in the industry chain, near-monopoly dominance pushes prices higher still and lets them capture additional value. For instance, the material cost of NVIDIA’s H100 AI accelerator card is approximately $3,000, yet its selling price reached around $35,000 in mid-2023 and even surpassed $40,000 on eBay.

3. AI Infrastructure Drives Industry Chain Growth

A report by Grand View Research indicates that the global cloud AI market size was estimated to be $62.63 billion in 2023, projected to reach $647.6 billion by 2030, with a compound annual growth rate of 39.6%. These figures underscore the significant growth potential of cloud AI services and their substantial contribution to the overall AI industry chain.
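The growth rate quoted above can be sanity-checked from the two market-size figures. The following is a minimal sketch that assumes seven annual compounding periods (2023 to 2030); the function name is illustrative and does not come from the report.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over `years` periods."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the text, in billions of USD.
market_2023 = 62.63
market_2030 = 647.6
years = 2030 - 2023  # assumption: seven annual compounding periods

implied_rate = cagr(market_2023, market_2030, years)
print(f"Implied CAGR 2023-2030: {implied_rate:.1%}")
```

Under that assumption the implied rate comes out very close to the 39.6% cited in the report.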

According to estimates by a16z, a substantial portion of funds in the AIGC (AI-generated content) market ultimately flows to infrastructure companies. On average, application companies spend approximately 20-40% of their revenue on inference and per-customer fine-tuning. This money is typically paid to the cloud provider of the compute instance or to a third-party model provider, who in turn spends around half of that revenue on cloud infrastructure. It is therefore reasonable to assume that 10-20% of the total revenue generated by AIGC flows to cloud providers.
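The 10-20% figure follows from multiplying the two shares in the a16z estimate. A minimal sketch of that arithmetic is shown below; the constant and function names are illustrative, and "around half" is taken as exactly 0.5 for simplicity.

```python
# Shares quoted in the text (a16z estimate).
APP_SPEND_SHARE_RANGE = (0.20, 0.40)  # app revenue spent on inference and fine-tuning
CLOUD_PASS_THROUGH = 0.5              # assumption: "around half" treated as exactly 0.5


def cloud_share(app_spend_share: float, pass_through: float = CLOUD_PASS_THROUGH) -> float:
    """Share of application revenue that ends up with cloud infrastructure providers."""
    return app_spend_share * pass_through


low, high = (cloud_share(share) for share in APP_SPEND_SHARE_RANGE)
print(f"Share of AIGC revenue reaching cloud providers: {low:.0%} to {high:.0%}")
```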

Moreover, a significant portion of the demand for computing power comes from training large AI models, including various LLMs. For model startups in particular, 80-90% of costs go to AI computing power. Overall, AI computing infrastructure, encompassing cloud computing and hardware, is expected to account for more than 50% of the market’s initial value.

Decentralized AI computing

As previously discussed, the current cost of centralized AI computing remains high, primarily due to the escalating demand for high-performance infrastructure for AI training. However, a significant amount of idle computing power exists in the market, leading to a mismatch between supply and demand. The key factors contributing to this imbalance include:

Because of memory limits, the number of GPUs required does not grow linearly with model complexity: Current GPUs have ample computing power, but model training requires a large number of parameters to be held in memory. Training GPT-3, for example, with its 175 billion parameters, requires more than 1 terabyte of data to be held in memory, more than any single GPU available today can hold. More GPUs are therefore needed for parallel computing and storage, which in turn leaves GPU computing power idle. For example, from GPT-3 to GPT-4, the model parameter size increased by about 10 times, but the number of required GPUs increased by 24 times (not taking into account the increase in model training time). According to relevant analysis, OpenAI used approximately 2.15e25 FLOPs to train GPT-4, running on approximately 25,000 A100 GPUs for 90 to 100 days, with a computing power utilization of approximately 32% to 36%.
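The utilization range quoted for GPT-4 training can be roughly reproduced from the other figures in that estimate. The sketch below assumes a peak throughput of 312 TFLOPS per A100 (NVIDIA's published dense FP16/BF16 tensor figure); that value is not stated in the analysis itself, so treat the result as indicative only.

```python
# Assumption: A100 peak dense FP16/BF16 tensor throughput per NVIDIA's datasheet.
A100_PEAK_FLOPS_PER_SECOND = 312e12

# Figures quoted in the text.
TOTAL_TRAINING_FLOPS = 2.15e25
NUM_GPUS = 25_000
SECONDS_PER_DAY = 86_400


def utilization(total_flops: float, num_gpus: int, days: float,
                peak_per_gpu: float = A100_PEAK_FLOPS_PER_SECOND) -> float:
    """Fraction of the cluster's theoretical peak compute that was actually used."""
    available_flops = num_gpus * peak_per_gpu * days * SECONDS_PER_DAY
    return total_flops / available_flops


for days in (90, 100):
    share = utilization(TOTAL_TRAINING_FLOPS, NUM_GPUS, days)
    print(f"{days} days of training: about {share:.1%} utilization")
```

With these inputs the result falls between roughly 32% and 35%, in line with the quoted range: about two thirds of the cluster's theoretical compute went unused during training.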

In response to the challenges outlined above, the pursuit of designing high-performance chips or specialized ASIC chips tailored for AI tasks is a prominent avenue being explored by numerous developers and major enterprises. Another approach involves the comprehensive utilization of existing computing resources to establish a distributed computing network, aiming to reduce computing power costs through leasing, sharing, and efficient scheduling of resources. Additionally, the market currently hosts a surplus of idle consumer-grade GPUs and CPUs. While individual units may lack robust computing power, they can effectively meet existing computational requirements in specific scenarios or when integrated with high-performance chips. Crucially, ensuring an ample supply is essential, as costs can be further diminished through distributed network scheduling.

Consequently, the shift towards distributed computing power has emerged as a key direction in the advancement of AI infrastructure. Simultaneously, given the conceptual alignment between Web3 and distributed systems, decentralized computing power networks have become a primary focus in the Web3+AI infrastructure landscape. Presently, decentralized computing power platforms in the Web3 market generally offer prices that are 80%-90% lower than centralized cloud computing services.

While storage plays a vital role in AI infrastructure, centralized storage holds distinct advantages in terms of scale, usability, and low latency. However, due to the notable cost efficiencies they offer, distributed computing networks hold significant market potential and stand to reap substantial benefits from the burgeoning AI market expansion.

Model inference and small model training are the core scenarios for distributed computing power today. Dispersing computing resources across a distributed system inevitably introduces communication overhead among GPUs, which can reduce computing performance. Distributed computing power is therefore best suited to scenarios that need little communication and can run tasks in parallel, such as the inference phase of large AI models and the training of small models with relatively few parameters, where the performance impact is minor. Looking ahead, as AI applications evolve, inference is becoming the critical requirement at the application layer. Given that most companies lack the capacity to train large models independently, distributed computing power retains significant long-term market potential.
There is a rise in high-performance distributed training frameworks tailored for large-scale parallel computing. Innovative open-source distributed computing frameworks like PyTorch, Ray, and DeepSpeed are providing developers with robust foundational support for leveraging distributed computing power in model training. This advancement enhances the applicability of distributed computing power in the future AI market, facilitating its integration into various AI applications.
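To illustrate why inference suits distributed computing power, the sketch below splits a batch of inputs into independent shards and processes them in parallel. It uses Python's standard thread pool as a stand-in for remote nodes rather than any of the frameworks named above, and `toy_model` is a hypothetical placeholder for a real model's forward pass.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_model(x: float) -> float:
    """Hypothetical stand-in for a model forward pass."""
    return 2.0 * x + 1.0


def run_shard(shard: list[float]) -> list[float]:
    """Run inference on one shard; it needs no data from any other shard."""
    return [toy_model(x) for x in shard]


def split(items: list[float], num_shards: int) -> list[list[float]]:
    """Split items into at most `num_shards` contiguous chunks."""
    size = max(1, -(-len(items) // num_shards))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


inputs = [float(i) for i in range(10)]
shards = split(inputs, num_shards=3)

# Each worker stands in for an independent node. The only communication is
# sending a shard out and collecting its results; workers never talk to each other.
with ThreadPoolExecutor(max_workers=3) as pool:
    shard_outputs = list(pool.map(run_shard, shards))

outputs = [y for shard in shard_outputs for y in shard]
print(outputs)
```

Large-model training is different: workers must exchange gradients or activations at every step, which is why slow links between scattered nodes hurt training far more than inference.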

The narrative logic of AI+Web3 infrastructure projects

The distributed AI infrastructure sector exhibits robust demand and significant long-term growth prospects, making it an attractive area for investment capital. Currently, the primary projects within the AI+Web3 industry’s infrastructure layer predominantly center around decentralized computing networks. These projects emphasize low costs as a key advantage, utilize token incentives to expand their networks, and prioritize serving AI+Web3 clientele as their primary objective. This sector primarily comprises two key levels:

Relatively pure decentralized platforms for sharing and leasing cloud computing resources: Early projects such as Render Network and Akash Network fall into this category.

The primary competitive edge in this sector lies in computing power resources, enabling access to a diverse range of providers, rapid network establishment, and user-friendly product offerings. Early market participants such as cloud computing firms and miners are well-positioned to tap into this opportunity.
With low product thresholds and swift launch capabilities, established platforms like Render Network and Akash Network have demonstrated notable growth and hold a competitive edge.
However, new market entrants face the problem of product homogeneity. The current hype and low entry barriers have led to an influx of projects focused on sharing and leasing computing power. These offerings lack differentiation, so distinct competitive advantages are increasingly needed.
Providers typically target customers with basic computing requirements. For instance, Render Network specializes in rendering services, while Akash Network offers enhanced CPU resources. While simple computing resource leasing suffices for basic AI tasks, it falls short in meeting the comprehensive needs of complex AI processes like training, fine-tuning, and inference.

Platforms offering decentralized computing together with machine learning workflow services: Numerous emerging projects that have recently secured substantial funding, including Gensyn, io.net, and Ritual, fall into this category.

Decentralized computing raises the valuation floor. Because computing power is the most certain narrative in AI development, projects built on computing power tend to have more robust, higher-potential business models and command higher valuations than pure middle-layer projects.
Middle-tier services establish distinctive advantages. The services offered by the middle layer serve as competitive edges for these computing infrastructures, encompassing functions like oracles and verifiers facilitating the synchronization of on- and off-chain calculations on the AI chain, deployment and management tools supporting the overall AI workflow, and more. The AI workflow is characterized by collaboration, continuous feedback, and high complexity, necessitating computing power across various stages. Therefore, a middleware layer that is user-friendly, highly collaborative, and capable of meeting the intricate needs of AI developers emerges as a competitive asset, particularly in the Web3 domain, catering to the requirements of Web3 developers for AI. These services are better suited for potential AI application markets, going beyond basic computing support.
Project teams with professional ML field operation and maintenance expertise are typically essential. Teams offering middle-tier services must possess a comprehensive understanding of the entire ML workflow to effectively address developers’ full life cycle requirements. While such services often leverage existing open-source frameworks and tools without requiring significant technical innovation, they demand a team with extensive experience and robust engineering capabilities, serving as a competitive advantage for the project.

By offering services at lower prices than centralized cloud computing while maintaining comparable supporting facilities and user experience, projects of this type have won recognition from prominent investors. However, their greater technical complexity poses a significant challenge. At present, they remain in the narrative and development phase, with no fully launched product yet.

Representative projects

1. Render Network

Render Network is a global blockchain-based rendering platform that leverages distributed GPUs to offer creators cost-effective and efficient 3D rendering services. Upon the creator’s confirmation of the rendering results, the blockchain network dispatches token rewards to nodes. The platform features a distributed GPU scheduling and allocation network, assigning tasks based on node usage, reputation, and other factors to optimize computing efficiency, minimize idle resources, and reduce expenses.

The platform’s native token, RNDR, serves as the payment currency within the ecosystem. Users can utilize RNDR to settle rendering service fees, while service providers earn RNDR rewards by contributing computing power to complete rendering tasks. The pricing of rendering services is dynamically adjusted in response to current network usage and other relevant metrics.

Rendering proves to be a well-suited and established use case for distributed computing power architecture. The nature of rendering tasks allows for their segmentation into multiple subtasks executed in parallel, minimizing inter-task communication and interaction. This approach mitigates the drawbacks of distributed computing architecture while harnessing the extensive GPU node network to drive cost efficiencies.

The demand for Render Network is substantial, with users having rendered over 16 million frames and nearly 500,000 scenes on the platform since its inception in 2017. The volume of rendering jobs and active nodes continues to rise. Moreover, in Q1 of 2023, Render Network introduced a natively integrated Stability AI toolset, enabling users to incorporate Stable Diffusion operations. This expansion beyond rendering operations signifies a strategic move into the realm of AI applications.

2. Gensyn.ai

Gensyn operates as a global supercomputing cluster network specializing in deep learning computing, utilizing Polkadot’s L1 protocol. In 2023, the platform secured $43 million in Series A funding, spearheaded by a16z. Gensyn’s architectural framework extends beyond the infrastructure’s distributed computing power cluster to encompass an upper-layer verification system. This system ensures that extensive off-chain computations align with on-chain requirements through blockchain verification, establishing a trustless machine learning network.

Regarding distributed computing power, Gensyn accommodates a spectrum of devices, from data centers with surplus capacity to personal laptops with potential GPUs. It unites these devices into a unified virtual cluster accessible to developers for on-demand peer-to-peer usage. Gensyn aims to establish a market where pricing is dictated by market forces, fostering inclusivity and enabling ML computing costs to achieve equitable levels.

The verification system is a pivotal concept for Gensyn: it aims to verify that machine learning tasks have been carried out correctly as specified. It introduces an innovative verification approach that combines probabilistic proof-of-learning, a graph-based pinpoint protocol, and a Truebit-style incentive game. These core technical features offer greater efficiency than traditional blockchain verification methods. Network participants include submitters, solvers, verifiers, and whistleblowers, who together carry out the verification process.

Based on the extensive test data detailed in the Gensyn protocol’s white paper, notable advantages of the platform include:

Cost Reduction in AI Model Training: The Gensyn protocol offers NVIDIA V100 equivalent compute at an estimated cost of around $0.40 per hour, presenting an 80% cost savings compared to AWS on-demand compute.
Enhanced Efficiency in Trustless Verification Network: Test results in the white paper indicate that the time overhead the Gensyn protocol adds to model training represents a 1,350% improvement over Truebit-style replication and a 2,522,477% improvement over Ethereum.

However, distributed computing power inevitably increases training time compared to local training because of communication and network overhead. Based on test data, the Gensyn protocol incurs approximately a 46% time overhead in model training.
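The price advantage and the time overhead can be combined into a rough effective saving. The sketch below makes two assumptions that the white paper does not state: the centralized baseline price is inferred from the quoted 80% saving, and the 46% overhead is billed as extra hours at the same hourly rate.

```python
# Figures quoted in the text.
GENSYN_HOURLY_USD = 0.40  # V100-equivalent compute
QUOTED_SAVINGS = 0.80     # versus AWS on-demand compute
TIME_OVERHEAD = 0.46      # extra training time on the decentralized network

# Assumption: baseline price implied by the quoted saving, not an actual AWS list price.
implied_baseline_hourly = GENSYN_HOURLY_USD / (1 - QUOTED_SAVINGS)

# Assumption: a job that takes 1 hour centrally takes 1.46 billed hours here.
effective_cost_per_baseline_hour = GENSYN_HOURLY_USD * (1 + TIME_OVERHEAD)
effective_savings = 1 - effective_cost_per_baseline_hour / implied_baseline_hourly

print(f"Implied centralized price: ${implied_baseline_hourly:.2f} per hour")
print(f"Effective cost per centralized-equivalent hour: ${effective_cost_per_baseline_hour:.3f}")
print(f"Effective savings after overhead: {effective_savings:.1%}")
```

Under these assumptions the saving shrinks from 80% to roughly 71%, so the cost advantage largely survives the overhead.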

3. Akash Network

Akash Network functions as a distributed cloud computing platform that integrates various technical elements to enable users to efficiently deploy and manage applications within a decentralized cloud environment. In essence, it offers users the capability to lease distributed computing resources.

At the core of Akash lies a network of infrastructure service providers dispersed globally, offering CPU, GPU, memory, and storage resources. These providers furnish resources for user leasing through the upper Kubernetes cluster. Users can deploy applications as Docker containers to leverage cost-effective infrastructure services. Additionally, Akash implements a “reverse auction” approach to further drive down resource prices. As per estimates on the Akash official website, the platform’s service costs are approximately 80% lower than those of centralized servers.
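The "reverse auction" idea can be sketched in a few lines: the user states the most they are willing to pay, providers bid against each other, and the lowest eligible bid wins. This is a generic illustration under simplified assumptions, not Akash's actual matching logic; the `Bid` type, provider names, and prices are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bid:
    provider: str          # hypothetical provider name
    price_per_hour: float  # hypothetical price unit


def run_reverse_auction(bids: list[Bid], max_price: float) -> Optional[Bid]:
    """Return the cheapest bid at or below the user's price ceiling, if any."""
    eligible = [bid for bid in bids if bid.price_per_hour <= max_price]
    if not eligible:
        return None
    return min(eligible, key=lambda bid: bid.price_per_hour)


bids = [
    Bid("provider-a", 1.20),
    Bid("provider-b", 0.85),
    Bid("provider-c", 0.95),
]

winner = run_reverse_auction(bids, max_price=1.00)
print(winner)
```

Because providers compete downward toward the lowest price rather than users bidding upward, this mechanism tends to push prices toward providers' marginal costs.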

4. io.net

io.net is a decentralized computing network that links globally distributed GPUs to provide computing power for AI model training and inference. Having recently closed a $30 million Series A financing round, the platform is now valued at $1 billion.

Distinguished from platforms like Render and Akash, io.net emerges as a robust and scalable decentralized computing network, intricately linked to multiple tiers of developer tools. Its key features encompass:

Aggregation of Diverse Computing Resources: Access to GPUs from independent data centers, crypto miners, and projects like Filecoin and Render.
Core Support for AI Requirements: Essential service capabilities encompass batch inference and model serving, parallel training, hyperparameter tuning, and reinforcement learning.
Advanced Technology Stack for Enhanced Cloud Environment Workflows: Encompassing a range of orchestration tools, ML frameworks for computing resource allocation, algorithm execution, model training, inference operations, data storage solutions, GPU monitoring, and management tools.
Parallel Computing Capabilities: Integration of Ray, an open-source distributed computing framework, leveraging Ray’s inherent parallelism to effortlessly parallelize Python functions for dynamic task execution. Its in-memory storage facilitates rapid data sharing between tasks, eliminating serialization delays. Moreover, io.net extends beyond Python by integrating other prominent ML frameworks like PyTorch and TensorFlow, enhancing scalability.

Regarding pricing, the io.net official website estimates that its rates will be approximately 90% lower than those of centralized cloud computing services.

Furthermore, io.net’s native token, IO coin, will primarily serve as the payment and reward mechanism within the ecosystem. Alternatively, users on the demand side can adopt a model akin to Helium’s, converting IO coin into the stable-value “IOSD points” for transactions.

Disclaimer:

This article is reprinted from [Wanxiang Blockchain], the original title is “AI+Web3 Future Development Road (2): Infrastructure”, the copyright belongs to the original author [Wanxiang Blockchain]. If there are objections to this reprint, please contact the Gate Learn Team, and they will handle it promptly.
Liability Disclaimer: The views and opinions expressed in this article are solely those of the author and do not constitute any investment advice.
Translations of the article into other languages are done by the Gate Learn team. Without mentioning Gate.io, the translated article may not be reproduced, distributed or plagiarized.