The rapid development of artificial intelligence rests on complex infrastructure. The AI technology stack is a layered architecture of hardware and software that forms the cornerstone of the current AI revolution. Here, we delve into the main layers of the stack and explain what each contributes to AI development and deployment. Finally, we reflect on why mastering these fundamentals matters, especially when evaluating opportunities at the intersection of crypto and AI, such as DePIN (Decentralized Physical Infrastructure Networks) projects like GPU networks.
1. Hardware Layer: Silicon Foundation
At the bottom is the hardware, which provides the physical computing power for artificial intelligence.
CPU (Central Processing Unit): CPUs are the general-purpose processors for computation. They excel at sequential tasks and are essential for general computing, including data preprocessing, small-scale AI workloads, and coordinating the other components.
GPU (Graphics Processing Unit): Originally designed for graphics rendering, GPUs have become central to artificial intelligence because they can perform a huge number of simple calculations simultaneously. This parallel processing capability makes them ideal for training deep learning models; without the development of GPUs, modern GPT models would not be possible (a minimal sketch comparing CPU and GPU execution follows this list).
AI Accelerators: Chips designed specifically for AI workloads. They optimize common AI operations and deliver high performance and energy efficiency for both training and inference.
FPGA (Field-Programmable Gate Array): FPGAs offer flexibility through their reprogrammable nature and can be tailored to specific AI tasks, especially in scenarios that require low-latency inference.
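To make the CPU/GPU contrast concrete, here is a minimal PyTorch sketch that runs the same large matrix multiplication on both: the GPU's thousands of parallel cores finish it far faster, which is exactly the property deep learning training exploits. The matrix size and timing approach are illustrative assumptions, not a benchmark.

```python
# Toy comparison of the same matrix multiplication on CPU and GPU.
# Illustrative only: actual timings depend on the hardware and PyTorch build.
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

# Sequential-friendly CPU execution
t0 = time.time()
c_cpu = a @ b
print(f"CPU matmul: {time.time() - t0:.3f}s")

# Massively parallel GPU execution, if a CUDA device is available
if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()          # make sure the host-to-device copy has finished
    t0 = time.time()
    c_gpu = a_gpu @ b_gpu
    torch.cuda.synchronize()          # wait for the kernel to complete before timing
    print(f"GPU matmul: {time.time() - t0:.3f}s")
```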
2. Underlying Software: Middleware
This layer of the AI technology stack is crucial because it bridges high-level AI frameworks and the underlying hardware. Technologies such as CUDA, ROCm, OneAPI, and SNPE connect those frameworks to specific hardware architectures, enabling performance optimization.
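As a rough illustration of what this bridging means in practice, the PyTorch snippet below runs the same model code on whatever backend the middleware exposes: on NVIDIA hardware the calls dispatch through CUDA libraries such as cuBLAS, while ROCm builds of PyTorch reuse the same torch.cuda namespace on AMD GPUs. This is a generic sketch, not vendor documentation.

```python
# How a high-level framework rides on the middleware layer: the same
# PyTorch code dispatches to CUDA on NVIDIA GPUs or to ROCm/HIP on AMD
# GPUs (ROCm builds of PyTorch reuse the torch.cuda namespace).
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")     # backed by CUDA or ROCm, depending on the build
else:
    device = torch.device("cpu")

model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(32, 1024, device=device)
y = model(x)                          # the matmul runs on cuBLAS / rocBLAS underneath
print(y.shape, device)
```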
As NVIDIA’s proprietary software layer, CUDA is the cornerstone of the company’s rise in the AI hardware market. NVIDIA’s leadership stems not only from its hardware advantages but also from the powerful network effects of its software and ecosystem integration.
CUDA has such a huge impact because it is deeply embedded in the AI technology stack and provides a complete set of optimized libraries that have become the de facto standard in the field. This software ecosystem has built a strong network effect: AI researchers and developers trained on CUDA carry it into academia and industry.
The resulting virtuous cycle has strengthened NVIDIA’s market leadership, as the CUDA-based tool and library ecosystem has become increasingly indispensable for AI practitioners.
This symbiosis of hardware and software not only solidifies NVIDIA’s position at the forefront of AI computing, but also gives the company significant pricing power, which is rare in the typically commoditized hardware market.
The dominance of CUDA and the relative obscurity of its competitors can be attributed to a series of factors that create significant barriers to entry. NVIDIA’s first-mover advantage in GPU-accelerated computing allowed CUDA to establish a strong ecosystem before competitors could gain a foothold. Although rivals such as AMD and Intel offer excellent hardware, their software layers lack essential libraries and tools and do not integrate seamlessly with the existing technology stack, which is why a significant gap remains between NVIDIA/CUDA and everyone else.
3. Compiler: Translator
TVM (Tensor Virtual Machine), MLIR (Multi-Level Intermediate Representation), and PlaidML provide different solutions to optimize AI workloads across multiple hardware architectures.
TVM originated from research at the University of Washington and quickly gained attention for its ability to optimize deep learning models for a wide range of devices, from high-performance GPUs to resource-constrained edge devices. Its advantage lies in its end-to-end optimization process, which is particularly effective in inference scenarios. It abstracts away differences between vendors and hardware, allowing inference workloads to run seamlessly on different devices, whether from NVIDIA, AMD, Intel, or others.
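The sketch below shows roughly what this retargeting looks like with TVM's classic Relay flow: import a model, pick a target string, compile, and run. Exact APIs vary across TVM versions, and the model path and input shape here are placeholders.

```python
# Hedged sketch of TVM's "compile once, retarget anywhere" inference flow.
# Follows the classic Relay API; details differ between TVM releases.
import onnx
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("model.onnx")                      # hypothetical model file
mod, params = relay.frontend.from_onnx(onnx_model, shape={"input": (1, 3, 224, 224)})

# Swapping the target string is the main change needed to move between vendors:
# "llvm" for CPUs, "cuda" for NVIDIA GPUs, "rocm" for AMD GPUs, and so on.
target = "cuda"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
out = module.get_output(0).numpy()
```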
Beyond inference, however, the picture is more complicated: the ultimate goal of making AI training hardware-agnostic remains unsolved. Several notable initiatives are nonetheless working toward it.
MLIR, a project by Google, takes a more fundamental approach. By providing a unified intermediate representation for multiple abstraction levels, it aims to simplify the entire compiler infrastructure for both inference and training use cases.
PlaidML, now led by Intel, positions itself as a dark horse in this race. It focuses on portability across diverse hardware architectures, including those beyond traditional AI accelerators, envisioning a future in which AI workloads run seamlessly on all kinds of computing platforms.
If any of these compilers can be integrated into the technology stack without degrading model performance or requiring additional modifications from developers, it could threaten CUDA’s moat. At present, however, MLIR and PlaidML are not mature enough and not well integrated into the AI technology stack, so they do not pose a clear threat to CUDA’s leadership.
4. Distributed Computing: Coordinator
Ray and Horovod represent two different approaches to distributed computing in the AI field, each addressing key requirements for scalable processing in large-scale AI applications.
Ray, developed by RISELab at UC Berkeley, is a general-purpose distributed computing framework. It excels in flexibility, allowing many kinds of workloads beyond machine learning to be distributed. Ray’s actor-based model greatly simplifies the parallelization of Python code, making it particularly suitable for reinforcement learning and other AI tasks that require complex, heterogeneous workflows.
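A minimal sketch of Ray's actor model is shown below: a stateful Python class becomes a remote actor, stateless functions become parallel tasks, and futures tie them together. The parameter-server example and its toy "gradient" are illustrative assumptions, not a real training setup.

```python
# Minimal sketch of Ray's actor model: a stateful class becomes a distributed
# actor, and its methods are invoked asynchronously via futures.
import ray

ray.init()  # starts a local cluster or connects to an existing one

@ray.remote
class ParameterServer:
    def __init__(self):
        self.weights = 0.0

    def apply_gradient(self, grad):
        self.weights -= 0.1 * grad
        return self.weights

@ray.remote
def compute_gradient(data_shard):
    # stand-in for a real gradient computation on one data shard
    return sum(data_shard) / len(data_shard)

ps = ParameterServer.remote()
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
grads = [compute_gradient.remote(s) for s in shards]       # tasks run in parallel
updates = [ps.apply_gradient.remote(g) for g in grads]     # serialized on the actor
print(ray.get(updates))
```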
Horovod, originally designed at Uber, focuses on distributed deep learning. It provides a concise, efficient way to scale deep learning training across multiple GPUs and server nodes. Horovod’s strengths are its ease of use and its optimization for data-parallel neural network training: it integrates seamlessly with mainstream frameworks such as TensorFlow and PyTorch, allowing developers to scale existing training code without extensive modification.
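For comparison, here is a hedged sketch of the Horovod pattern with PyTorch: initialize, pin one GPU per process, wrap the optimizer so gradients are averaged by all-reduce, and broadcast the initial state. The model, data, and hyperparameters are placeholders; the point is how few lines change relative to single-GPU code.

```python
# Sketch of Horovod's data-parallel pattern with PyTorch. Each process owns
# one GPU and gradients are averaged across workers with all-reduce.
# Launched with something like: horovodrun -np 4 python train.py
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())        # pin each process to one GPU

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are all-reduced across workers,
# and make sure every worker starts from the same initial state.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

x = torch.randn(32, 1024).cuda()               # placeholder batch
y = torch.randint(0, 10, (32,)).cuda()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```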
5. Conclusion: From a Crypto Perspective
Integration with the existing AI stack is crucial for DePIN projects that aim to build distributed computing systems. Such integration ensures compatibility with current AI workflows and tools and lowers the barrier to adoption.
In crypto, today’s GPU networks are essentially decentralized GPU rental platforms, a first step toward more sophisticated distributed AI infrastructure. These platforms operate more like Airbnb-style marketplaces than distributed clouds. While useful for certain applications, they are not yet sufficient to support true distributed training, a key requirement for advancing large-scale AI development.
Current distributed computing standards like Ray and Horovod were not designed for globally distributed networks. For a truly functional decentralized network, another framework is needed at this layer. Some skeptics even argue that Transformer models are incompatible with distributed training because their learning process requires dense communication and global loss-function optimization. Optimists, on the other hand, are proposing new distributed computing frameworks that can work well with globally distributed hardware. Yotta is one of the startups trying to solve this problem.
NeuroMesh goes further and redesigns the machine learning process itself in a particularly innovative way. By using predictive coding networks (PCNs) to converge through local error minimization rather than directly optimizing a global loss function, NeuroMesh addresses a fundamental bottleneck in distributed AI training.
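To illustrate the general idea of local error minimization (a toy sketch only, not NeuroMesh's actual implementation, whose details the article does not specify), here is a small NumPy example of a three-layer predictive coding network: each layer relaxes and learns using only its own prediction error and its presynaptic activity, with no globally backpropagated gradient. Layer sizes, learning rates, and the inference schedule are arbitrary assumptions.

```python
# Toy predictive-coding sketch with purely local updates (NumPy).
# Illustrates the general PCN idea; NOT NeuroMesh's implementation.
import numpy as np

rng = np.random.default_rng(0)
f = np.tanh

def df(v):
    return 1.0 - np.tanh(v) ** 2

# layer sizes: input -> hidden -> output (arbitrary for the example)
n0, n1, n2 = 8, 16, 4
W1 = rng.normal(0, 0.1, (n1, n0))
W2 = rng.normal(0, 0.1, (n2, n1))

def train_step(W1, W2, x0, target, lr=0.01, infer_steps=20, dt=0.1):
    """One predictive-coding update; W1 and W2 are modified in place."""
    x1 = W1 @ f(x0)                    # initialize hidden activity with a forward pass
    x2 = target                        # output value nodes clamped to the target
    for _ in range(infer_steps):
        # each layer only needs its own local prediction error ...
        e1 = x1 - W1 @ f(x0)
        e2 = x2 - W2 @ f(x1)
        # ... and hidden units relax to reduce those errors
        x1 += dt * (-e1 + df(x1) * (W2.T @ e2))
    # weight updates are local: a layer's error times its presynaptic activity
    W1 += lr * np.outer(e1, f(x0))
    W2 += lr * np.outer(e2, f(x1))
    return 0.5 * (e1 @ e1 + e2 @ e2)   # local energy, not a global loss

x0 = rng.normal(size=n0)
target = rng.normal(size=n2)
for _ in range(200):
    energy = train_step(W1, W2, x0, target)
print("final local energy:", energy)
```

Because every update here is local, in principle different layers (or shards of a layer) could live on different machines and exchange only activations and errors at their boundaries, which is what makes this family of methods attractive for loosely connected hardware.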
This method not only enables unprecedented parallelization but also makes it possible to train models on consumer-grade GPUs such as the RTX 4090, democratizing AI training. Specifically, the raw compute of a 4090 is comparable to that of an H100, yet insufficient bandwidth leaves such cards underutilized in model training. By reducing the importance of bandwidth, the approach makes these lower-end GPUs usable, which could bring significant cost savings and efficiency gains.
GenSyn, another ambitious crypto AI startup, aims to build a compiler for training workloads: one that lets AI workloads run seamlessly on any type of computing hardware. In effect, what TVM does for inference, GenSyn is attempting to build for model training.
If successful, it could significantly expand the ability of decentralized AI computing networks to handle more complex and diverse AI tasks by making efficient use of varied hardware. The vision is challenging, given the complexity of optimizing across diverse hardware architectures and the high technical risk, but if the team can execute it and overcome obstacles such as maintaining performance on heterogeneous systems, it could weaken the CUDA and NVIDIA moat.
On inference: Hyperbolic’s approach combines verifiable inference with a decentralized network of heterogeneous computing resources, a relatively pragmatic strategy. By building on compiler standards such as TVM, Hyperbolic can support a wide range of hardware configurations while maintaining performance and reliability, aggregating chips from multiple vendors, from consumer-grade to high-performance hardware, across NVIDIA, AMD, Intel, and others.
Developments in crypto AI herald a future in which AI computing becomes more distributed, efficient, and accessible. The success of these projects will depend not only on their technical merits but also on how seamlessly they integrate with existing AI workflows and how well they address the practical concerns of AI practitioners and businesses.