H100 inference soars 8x! NVIDIA officially announces open-source TensorRT-LLM, supporting 10+ models

Original source: Xinzhiyuan


The "GPU poor" are about to bid farewell to their predicament!

Just now, NVIDIA released TensorRT-LLM, an open-source library that accelerates inference of large language models on the H100.

So how big is the speedup?

With TensorRT-LLM and its suite of optimizations (including in-flight batching), total model throughput increases by up to 8x.

Comparison of GPT-J-6B A100 and H100 with and without TensorRT-LLM

In addition, taking Llama 2 as an example, the H100 with TensorRT-LLM delivers 4.6x the inference performance of an A100 alone.

Comparison of Llama 2 70B, A100 and H100 with and without TensorRT-LLM

Netizens commented that the already-powerful H100, combined with TensorRT-LLM, will undoubtedly change the game for large language model inference!

## TensorRT-LLM: a powerful tool for accelerating large-model inference

Currently, because of the enormous parameter counts of large models, deployment and inference remain difficult and expensive.

TensorRT-LLM, developed by NVIDIA, aims to significantly improve LLM throughput and reduce cost on GPUs.

Specifically, TensorRT-LLM wraps TensorRT's deep learning compiler, FasterTransformer's optimized kernels, pre- and post-processing, and multi-GPU/multi-node communication in a simple, open-source Python API.

NVIDIA has further enhanced FasterTransformer to make it a productized solution.

As you can see, TensorRT-LLM offers an easy-to-use, open-source, and modular Python API.

Developers do not need deep C++ or CUDA expertise to deploy, run, and debug large language models, and can still get top performance and rapid customization.
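
For a rough sense of what this looks like in practice, the minimal sketch below uses the high-level `LLM`/`SamplingParams` Python interface. These names follow later public releases of TensorRT-LLM; the early-access API may differ in detail, and the checkpoint path is just an example.

```python
# Minimal sketch of the TensorRT-LLM Python API. Class and argument names
# follow later public releases; the early-access interface may differ.
from tensorrt_llm import LLM, SamplingParams

# Build or load an optimized engine for a supported checkpoint (example path).
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Decoding controls are exposed as plain Python parameters.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

for output in llm.generate(["What does TensorRT-LLM do?"], params):
    print(output.outputs[0].text)
```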

According to NVIDIA's official blog, TensorRT-LLM optimizes LLM inference performance on NVIDIA GPUs in four ways.

First, TensorRT-LLM ships with support for more than ten of today's popular large models, so developers can run them immediately.

Second, TensorRT-LLM, as an open-source software library, lets LLMs run inference across multiple GPUs and multiple GPU servers simultaneously.

These servers are connected via NVIDIA's NVLink and InfiniBand interconnects.

Third, "in-flight batching" is a brand-new scheduling technique that lets requests enter and leave the GPU batch independently of other requests.

Finally, TensorRT-LLM is optimized to utilize the H100 Transformer Engine to reduce memory usage and latency during model inference.

Next, let’s take a closer look at how TensorRT-LLM improves model performance.

### Support for a rich LLM ecosystem

TensorRT-LLM provides very good support for the open source model ecosystem.

The largest and most advanced language models, such as Llama 2-70B from Meta, require multiple GPUs working together to provide responses in real time.

Previously, to achieve optimal LLM inference performance, developers had to rewrite the model by hand, split it into multiple fragments, and coordinate execution across GPUs.

TensorRT-LLM uses tensor parallelism to distribute the weight matrices across devices, simplifying this process and enabling efficient inference at scale.

Each model can run in parallel on multiple GPUs and multiple servers connected via NVLink, without developer intervention or model changes.
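
The idea can be illustrated with a toy NumPy sketch (not TensorRT-LLM code): each "device" holds only a column slice of a weight matrix and computes a partial output independently; concatenating the partial outputs plays the role of the all-gather that real GPUs perform over NVLink.

```python
import numpy as np

# Toy illustration of tensor (column) parallelism -- not TensorRT-LLM code.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))          # activations: batch x hidden
w = rng.standard_normal((512, 2048))       # full weight matrix

n_devices = 4
w_shards = np.split(w, n_devices, axis=1)  # each shard: 512 x 512, one per "device"

partials = [x @ shard for shard in w_shards]   # computed independently per device
y_parallel = np.concatenate(partials, axis=1)  # plays the role of an all-gather

assert np.allclose(y_parallel, x @ w)      # matches the single-device result
```

In TensorRT-LLM itself, the developer only picks a tensor-parallel degree when building the engine; the sharding and inter-GPU communication are handled by the library.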

As new models and model architectures are introduced, developers can optimize them using the latest NVIDIA AI kernels open-sourced in TensorRT-LLM.

The supported kernel fusions include a state-of-the-art FlashAttention implementation and masked multi-head attention for the context and generation phases of GPT model execution, among others.

In addition, TensorRT-LLM includes fully optimized, ready-to-run versions of many large language models that are popular today.

These include Meta's Llama 2, OpenAI's GPT-2 and GPT-3, Falcon, Mosaic MPT, BLOOM, and others, more than ten models in total, all of which can be called through the simple and easy-to-use TensorRT-LLM Python API.

These features can help developers build customized large language models faster and more accurately to meet the different needs of various industries.

### In-flight batching

Large language models are used in a wide variety of applications today.

A single model can be used simultaneously for seemingly disparate tasks, from simple Q&A in a chatbot to document summarization or generation of long code blocks. Workloads are highly dynamic, and output sizes can differ by orders of magnitude depending on the task.

This diversity of tasks makes it hard to batch requests effectively and execute them in parallel efficiently, because some requests finish much earlier than others.

To manage these dynamic loads, TensorRT-LLM includes an optimized scheduling technology called "In-flight batching".

Its core principle is that the entire text-generation process of a large language model can be broken down into multiple iterations of model execution.

With in-flight batching, the TensorRT-LLM runtime releases completed sequences from the batch immediately, rather than waiting for the entire batch to complete before continuing to process the next set of requests.

While a new request is being executed, other requests from the previous batch that have not been completed are still being processed.
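
To convey the scheduling idea, the toy sketch below (an illustration of the policy, not the TensorRT-LLM runtime) evicts finished sequences after every iteration and immediately admits waiting requests into the freed slots.

```python
from collections import deque

# Toy illustration of in-flight (continuous) batching -- not the real runtime.
# Each request needs a different number of generated tokens; finished requests
# leave the batch at iteration granularity and queued requests take their slots.
def run_inflight_batching(requests, max_batch_size):
    waiting = deque(requests)          # (request_id, tokens_to_generate)
    active = {}                        # request_id -> tokens remaining
    step = 0
    while waiting or active:
        # Admit new requests into free slots before the next iteration.
        while waiting and len(active) < max_batch_size:
            req_id, n_tokens = waiting.popleft()
            active[req_id] = n_tokens
        # One model iteration produces one token for every active request.
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] == 0:    # sequence finished: evict immediately
                del active[req_id]
        step += 1
    return step

# Short and long requests share the batch without waiting on each other.
print(run_inflight_batching([("a", 2), ("b", 50), ("c", 3), ("d", 4)], max_batch_size=2))
```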

In-flight batching, together with additional kernel-level optimizations, improves GPU utilization and at least doubles throughput on real-world LLM request benchmarks on the H100.

### H100 Transformer Engine with FP8

TensorRT-LLM also provides a feature called H100 Transformer Engine, which can effectively reduce memory consumption and latency during large model inference.

Because LLMs contain billions of weights and activations, they are usually trained and represented with FP16 or BF16 values, each occupying 16 bits of memory.

However, at inference time, most models can be efficiently represented with lower precision using quantization techniques, such as 8-bit or even 4-bit integers (INT8 or INT4).

Quantization is the process of reducing the numerical precision of model weights and activations without sacrificing accuracy. Lower precision means each parameter is smaller, so the model takes up less space in GPU memory.

This enables inference on larger models using the same hardware while spending less time on memory operations during execution.
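
As a back-of-the-envelope illustration of the idea, plain NumPy symmetric per-tensor INT8 quantization of a single weight matrix looks like the sketch below; TensorRT-LLM's actual quantization toolchain is more sophisticated (per-channel scales, calibration, and so on).

```python
import numpy as np

# Toy symmetric per-tensor INT8 quantization -- an illustration of the idea,
# not the TensorRT-LLM quantization toolchain.
w = np.random.default_rng(0).standard_normal((4096, 4096)).astype(np.float16)

scale = float(np.abs(w).max()) / 127.0          # map the largest weight to 127
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float16) * scale   # recovered at compute time

print(w.nbytes / 2**20, "MiB in FP16")          # ~32 MiB
print(w_int8.nbytes / 2**20, "MiB in INT8")     # ~16 MiB, half the memory
print(np.abs(w - w_dequant).max())              # small quantization error
```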

Through H100 Transformer Engine technology, the H100 GPU with TensorRT-LLM allows users to easily convert model weights to the new FP8 format and automatically compile the model to take advantage of the optimized FP8 kernel.

And this process requires no coding at all! The FP8 data format introduced with the H100 lets developers quantize their models and dramatically reduce memory consumption without reducing model accuracy.

Compared with other data formats such as INT8 or INT4, FP8 quantization retains higher precision while achieving the fastest performance and is the most convenient to implement.
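
For a sense of scale, the rough arithmetic below estimates the weight memory of a 70B-parameter model in FP16 versus FP8; it counts weights only and ignores activations, the KV cache, and runtime overhead.

```python
# Rough weight-memory estimate for a 70B-parameter model (weights only;
# activations, KV cache, and runtime overhead are ignored).
params = 70e9

fp16_gb = params * 2 / 1e9   # 2 bytes per parameter -> ~140 GB
fp8_gb = params * 1 / 1e9    # 1 byte per parameter  -> ~70 GB

print(f"FP16 weights: ~{fp16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
```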

## How to obtain TensorRT-LLM

Although TensorRT-LLM has not yet been officially released, users can now have early access.

The application link is as follows:

NVIDIA also said that TensorRT-LLM will be integrated into the NVIDIA NeMo framework soon.

This framework is part of NVIDIA AI Enterprise, launched not long ago, which provides enterprise customers with a secure, stable, and highly manageable enterprise-grade AI software platform.

Developers and researchers can access TensorRT-LLM through the NeMo framework on NVIDIA NGC or as a project on GitHub.

However, it should be noted that users must register for the NVIDIA Developer Program to apply for the early access version.

## Heated discussion among netizens

Netizens on Reddit had a heated discussion about the release of TensorRT-LLM.

One commenter wrote that it is hard to imagine how much performance will improve once hardware is optimized specifically for LLMs.

But some netizens believe the point of all this is simply to help Jensen Huang sell more H100s.

Others disagree: they note that TensorRT is also useful for people running Stable Diffusion locally, so anyone with an RTX GPU should be able to benefit from similar products in the future.

From a broader perspective, LLMs may well see a whole series of hardware-level optimizations, and hardware designed specifically for LLMs may even appear in the future to boost their performance. This has already happened in many other applications, and LLMs will be no exception.
