Original Author: IOSG Ventures
The rapid development of artificial intelligence rests on a complex infrastructure. The AI technology stack is a layered architecture of hardware and software, and it is the cornerstone of the current AI revolution. Here we will examine the main layers of the stack and explain what each contributes to AI development and deployment. Finally, we will reflect on why mastering these fundamentals matters, especially when evaluating opportunities at the intersection of crypto and AI, such as DePIN (Decentralized Physical Infrastructure Networks) projects and GPU networks.
At the bottom of the stack is hardware, which provides the physical computing power for artificial intelligence.
Above the hardware sits the software layer that is crucial because it bridges high-level AI frameworks and the underlying hardware. Technologies such as CUDA, ROCm, OneAPI, and SNPE tie high-level frameworks to specific hardware architectures and enable performance optimization.
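The bridging role of this layer can be illustrated with a toy dispatch pattern: one high-level API, multiple hardware backends behind it. All class and function names below are illustrative assumptions, not any real framework's API; real frameworks like PyTorch do something analogous when a tensor lives on a "cuda" versus "cpu" device.

```python
# Toy sketch of the dispatch pattern that lets one high-level API
# target multiple hardware backends (CUDA, ROCm, ...). All names here
# are illustrative, not a real framework API.

class Backend:
    def matmul(self, a, b):
        raise NotImplementedError

class CPUBackend(Backend):
    def matmul(self, a, b):
        # naive pure-Python matmul standing in for an optimized BLAS call
        n, k, m = len(a), len(b), len(b[0])
        return [[sum(a[i][x] * b[x][j] for x in range(k)) for j in range(m)]
                for i in range(n)]

class FakeCUDABackend(CPUBackend):
    # a real CUDA backend would launch a cuBLAS kernel here; we reuse
    # the CPU math so the example stays runnable anywhere
    pass

REGISTRY = {"cpu": CPUBackend(), "cuda": FakeCUDABackend()}

def matmul(a, b, device="cpu"):
    # the high-level API stays identical; only the backend changes
    return REGISTRY[device].matmul(a, b)

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b, device="cpu") == matmul(a, b, device="cuda"))  # True
```

The user-facing call never changes; only the registry entry does — which is exactly why the quality of the backend software, not just the silicon, decides which hardware developers reach for.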
As NVIDIA’s proprietary software layer, CUDA is the cornerstone of the company’s rise in the AI hardware market. NVIDIA’s leadership stems not only from its hardware advantages, but also from the strong network effect of its software and ecosystem integration.
CUDA's outsized impact comes from its deep integration into the AI technology stack and a set of optimized libraries that have become the de facto standard in the field. This software ecosystem has built a powerful network effect: AI researchers and developers trained on CUDA use it in their work and carry it into academia and industry.
The resulting virtuous cycle reinforces NVIDIA’s market leadership, as the CUDA-based tool and library ecosystem becomes increasingly indispensable for AI practitioners.
The symbiosis of this software and hardware not only consolidates NVIDIA’s position at the forefront of AI computing, but also gives the company significant pricing power, which is rare in the usually commoditized hardware market.
CUDA’s dominance and the relative obscurity of its competitors can be attributed to a series of factors that create significant barriers to entry. NVIDIA’s early lead in GPU-accelerated computing allowed CUDA to establish a strong ecosystem before competitors could gain a foothold. Although competitors such as AMD and Intel offer strong hardware, their software layers lack the necessary libraries and tools and do not integrate seamlessly with existing technology stacks, which explains the wide gap between NVIDIA/CUDA and everyone else.
TVM (Tensor Virtual Machine), MLIR (Multi-Level Intermediate Representation), and PlaidML offer different answers to the challenge of optimizing AI workloads across multiple hardware architectures.
TVM originated from research at the University of Washington and quickly gained attention for its ability to optimize deep learning models for a wide range of devices, from high-performance GPUs to resource-constrained edge devices. Its advantage lies in an end-to-end optimization pipeline, which is particularly effective in inference scenarios: it fully abstracts away vendor and hardware differences, so inference workloads run seamlessly on different hardware, whether NVIDIA, AMD, or Intel devices.
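The "write once, lower to any target" idea can be sketched with a toy compiler: one graph description, several lowering targets, identical results. This is a deliberately simplified stand-in, not TVM's actual API (where you would build a module with a real target string and runtime).

```python
# Toy illustration of target-agnostic compilation in the spirit of TVM:
# one model description lowered to several "targets" that all compute
# the same function. A sketch only -- not TVM's real API.

GRAPH = [("mul", 2.0), ("add", 1.0)]  # y = x * 2 + 1

def lower(graph, target):
    # each target would emit very different machine code in reality;
    # here every backend lowers to the same plain-Python closure
    if target not in ("llvm", "cuda", "rocm"):
        raise ValueError(f"unknown target {target!r}")
    def compiled(x):
        for op, const in graph:
            x = x * const if op == "mul" else x + const
        return x
    return compiled

outputs = {t: lower(GRAPH, t)(3.0) for t in ("llvm", "cuda", "rocm")}
print(outputs)  # every target agrees: {'llvm': 7.0, 'cuda': 7.0, 'rocm': 7.0}
```

The guarantee that matters for DePIN-style networks is the one the dictionary check demonstrates: the same model produces the same numbers regardless of which vendor's hardware ran it.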
Beyond inference, however, the situation becomes more complex. The ultimate goal — making the hardware underneath AI training fully interchangeable — remains unsolved, but there are several noteworthy initiatives in this direction.
MLIR, which originated at Google, takes a more fundamental approach. By providing a unified intermediate representation across multiple abstraction levels, it aims to simplify the entire compiler infrastructure for both inference and training use cases.
PlaidML, now led by Intel, positions itself as a dark horse in this race. It focuses on portability across diverse hardware architectures, including those beyond traditional AI accelerators, envisioning a future in which AI workloads run seamlessly on all kinds of computing platforms.
If any of these compilers can be integrated into the technology stack without hurting model performance and without requiring extra work from developers, it could threaten CUDA's moat. For now, however, MLIR and PlaidML are not mature enough, nor well enough integrated into the AI stack, to pose a serious threat to CUDA's leadership.
Ray and Horovod represent two different approaches to distributed computing in the AI field, each addressing the key requirement of scalable processing in large-scale AI applications.
Ray, developed by UC Berkeley’s RISELab, is a general-purpose distributed computing framework. It excels in flexibility, distributing many kinds of workloads beyond machine learning. Ray's actor-based model greatly simplifies parallelizing Python code, making it especially suitable for reinforcement learning and other AI tasks with complex, heterogeneous workflows.
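The actor model at the heart of Ray can be approximated in plain Python: a stateful object whose method calls are dispatched to a dedicated worker and return future-like handles. This is a single-machine stand-in using a thread and queues; real Ray actors run in separate processes scheduled across a cluster, with serialization and fault handling this sketch omits.

```python
import queue
import threading

# Minimal stand-in for an actor framework: a stateful object whose
# methods execute on a dedicated worker thread and return "futures".
# Illustrative only -- real Ray actors run across a cluster.

class Actor:
    def __init__(self, cls, *args):
        self._obj = cls(*args)
        self._inbox = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            method, call_args, reply = self._inbox.get()
            reply.put(getattr(self._obj, method)(*call_args))

    def call(self, method, *args):
        reply = queue.Queue()
        self._inbox.put((method, args, reply))
        return reply  # a "future": .get() blocks until the result is ready

class Counter:
    def __init__(self):
        self.value = 0
    def increment(self):
        self.value += 1
        return self.value

counter = Actor(Counter)
futures = [counter.call("increment") for _ in range(5)]
print([f.get() for f in futures])  # [1, 2, 3, 4, 5]
```

Because state lives inside the actor and callers only exchange messages, many actors can run concurrently without shared-memory coordination — the property that makes this model attractive for complex RL-style workflows.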
Horovod, originally built at Uber, focuses on distributed deep learning. It provides a concise, efficient way to scale deep learning training across multiple GPUs and server nodes. Its highlight is its ease of use and its optimization for data-parallel neural network training, with seamless integration into mainstream deep learning frameworks such as TensorFlow and PyTorch, so developers can scale their existing training code without extensive modification.
Integration with the existing AI stack is crucial for DePIN projects that aim to build distributed computing systems: it ensures compatibility with current AI workflows and tools, lowering the barrier to adoption.
In crypto, today's GPU networks are essentially decentralized GPU rental platforms — a first step toward more sophisticated distributed AI infrastructure. They operate more like Airbnb-style marketplaces than distributed clouds. While useful for certain applications, these platforms cannot yet support true distributed training, a key requirement for advancing large-scale AI development.
Current distributed computing frameworks like Ray and Horovod were not designed for globally distributed networks; a truly functional decentralized network needs a new framework at this layer. Some skeptics even argue that Transformer models are incompatible with distributed training because of the intensive communication and global optimization they require during learning. Optimists, on the other hand, are proposing new distributed computing frameworks that work well with globally distributed hardware. Yotta is one startup trying to solve this problem.
NeuroMesh goes a step further, reimagining the machine learning process itself. By using predictive coding networks (PCNs) to converge on local error minimization rather than directly optimizing a global loss function, NeuroMesh attacks a fundamental bottleneck in distributed AI training.
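The locality property is the key point, and it can be sketched in a few lines: with a global loss, every layer's update depends on gradients propagated through the whole network, while in a predictive-coding-style scheme each layer minimizes only its own local prediction error against a locally available target. The toy below illustrates only that locality idea with made-up scalar "layers"; it is not NeuroMesh's actual algorithm.

```python
# Toy contrast: purely local error minimization, the property that
# makes predictive-coding-style training easy to parallelize.
# Illustrative scalars only -- not NeuroMesh's actual algorithm.

def local_update(weight, inp, target, lr=0.1):
    # each layer sees only its own input and its own target activity:
    # one gradient step on the local loss (weight * inp - target)^2
    pred = weight * inp
    return weight - lr * 2 * (pred - target) * inp

# two "layers" trained with only local information; since neither update
# needs gradients from the other, each could run on a different machine
w1, w2 = 0.0, 0.0
x, h_target, y_target = 1.0, 2.0, 6.0  # desired layer activities
for _ in range(50):
    w1 = local_update(w1, x, h_target)         # layer 1 matches h_target
    w2 = local_update(w2, h_target, y_target)  # layer 2 matches y_target
print(round(w1, 3), round(w2, 3))  # 2.0 3.0
```

Because no global gradient ever crosses the layer boundary, the per-step communication between machines shrinks from full backpropagated gradients to the (much smaller) target activities — the bandwidth saving discussed next.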
This approach not only enables unprecedented parallelization but also makes model training feasible on consumer-grade GPUs such as the RTX 4090, democratizing AI training. Specifically, the 4090's raw compute is comparable to the H100's, but its limited bandwidth leaves it underutilized in conventional model training. Because PCNs sharply reduce bandwidth requirements, these lower-end GPUs become usable, potentially yielding significant cost savings and efficiency gains.
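The bandwidth argument is easy to quantify with back-of-envelope numbers. Assuming a 7B-parameter model and fp16 gradients (both illustrative assumptions, not figures from the source), naive data-parallel training moves 14 GB of gradients per synchronization — trivial over datacenter interconnects, crippling over consumer links:

```python
# Back-of-envelope estimate of gradient-sync traffic in naive data
# parallelism. Model size and link speeds are illustrative assumptions.

params = 7e9           # assumed model size: 7B parameters
bytes_per_grad = 2     # fp16 gradients
grad_bytes = params * bytes_per_grad  # 14 GB per full sync

def sync_seconds(link_gbytes_per_s):
    # time to move one full set of gradients over the given link
    return grad_bytes / (link_gbytes_per_s * 1e9)

for name, gbs in [("NVLink-class (~450 GB/s)", 450),
                  ("PCIe 4.0 x16 (~32 GB/s)", 32),
                  ("1 Gbit/s internet (~0.125 GB/s)", 0.125)]:
    print(f"{name}: {sync_seconds(gbs):.2f} s per sync")
```

On the home-internet line the sync takes roughly 112 seconds per step — orders of magnitude longer than the compute itself — which is why any scheme that cuts per-step communication, rather than raw FLOPs, is what unlocks consumer GPUs for training.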
GenSyn, another ambitious crypto AI startup, aims to build a compiler that lets any kind of computing hardware be used seamlessly for AI workloads. By analogy: what TVM does for inference, GenSyn is trying to build for model training.
If successful, it could significantly expand the capabilities of decentralized AI computing networks by efficiently harnessing diverse hardware for more complex and varied AI tasks. The vision is ambitious — and challenging, given the complexity and technical risk of optimizing across heterogeneous architectures — but if GenSyn can execute and overcome obstacles such as maintaining performance on heterogeneous systems, it could weaken the CUDA and NVIDIA moat.
On the inference side, Hyperbolic combines verifiable inference with a decentralized network of heterogeneous computing resources — a comparatively pragmatic strategy. By leveraging compiler standards such as TVM, Hyperbolic can exploit a wide range of hardware configurations while maintaining performance and reliability, aggregating chips from multiple vendors (NVIDIA, AMD, Intel, and others), from consumer-grade to high-performance hardware.
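Aggregating heterogeneous chips implies some scheduler that matches inference requests to devices of very different throughput. One simple policy is greedy earliest-finish-time assignment, sketched below with hypothetical device names and throughput figures (all assumptions for illustration, not Hyperbolic's actual design):

```python
import heapq

# Toy greedy scheduler for a heterogeneous inference pool: send each
# request to the device that will finish it soonest given current load.
# Device names and throughputs are illustrative assumptions.

def schedule(requests, devices):
    # devices: {name: tokens_per_second}; heap entries: (time_free, name)
    heap = [(0.0, name) for name in devices]
    heapq.heapify(heap)
    assignment = []
    for tokens in requests:
        ready, name = heapq.heappop(heap)  # device that frees up first
        done = ready + tokens / devices[name]
        assignment.append((name, round(done, 2)))
        heapq.heappush(heap, (done, name))
    return assignment

devices = {"H100": 1500.0, "RTX4090": 600.0, "MI300X": 1200.0}
print(schedule([3000, 3000, 3000, 3000], devices))
```

Even this naive policy shows the payoff of heterogeneity: the fast datacenter card absorbs a second request before the consumer card finishes its first, so mixed fleets still deliver useful aggregate throughput.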
These developments at the intersection of crypto and AI point toward a future in which AI computing becomes more distributed, efficient, and accessible. The success of these projects will depend not only on their technical merits, but also on their ability to integrate seamlessly with existing AI workflows and to address the real concerns of AI practitioners and businesses.