
TOPICS & NEWS

2025.04.22

The Evolution of AI and the Shift to the Inference Phase

In March 2025, U.S.-based NVIDIA held its annual developer conference, “GTC,” where it announced “Dynamo,” new software designed specifically for inference processing. The announcement reflects a significant shift in AI’s evolution: the industry’s primary focus is moving from “training” models to running “inference” with them.

NVIDIA, a company that has historically excelled in technologies for training AI models, emphasized that its hardware and software are now essential for inference as well. CEO Jensen Huang stressed that accelerating inference processing is key to determining the quality of AI services.

Key Features of the New “Dynamo” Software

Dynamo will be available as open-source software and is designed to accelerate inference processing by efficiently coordinating multiple GPUs. When combined with the latest “Blackwell” GPU architecture, it can reportedly increase the processing speed of the “R1” AI model from the Chinese AI company DeepSeek by up to 30 times compared to previous methods.

A core feature is a technique called “disaggregated serving,” which significantly improves processing efficiency by splitting inference into two phases, “prefill” (processing the input prompt in a single pass) and “decode” (generating output tokens one at a time), and assigning each phase to different GPUs, as sketched below.
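
As a rough illustration of the idea, the toy sketch below separates prefill and decode into two worker classes that hand off a KV cache. All names here are hypothetical; this is not Dynamo’s actual API, just a minimal picture of the disaggregated-serving pattern.

```python
# Minimal sketch of disaggregated serving: prefill and decode run on
# separate worker pools, communicating via a handed-off KV cache.
# All names are illustrative; this is not the Dynamo API.

from dataclasses import dataclass

@dataclass
class KVCache:
    tokens: list[int]   # tokens already processed
    state: dict         # stand-in for per-layer key/value tensors

class PrefillWorker:
    """Compute-bound phase: processes the whole prompt in one pass."""
    def run(self, prompt_tokens: list[int]) -> KVCache:
        # In a real system this is one large batched forward pass on GPU A.
        return KVCache(tokens=list(prompt_tokens),
                       state={"layers": f"kv for {len(prompt_tokens)} tokens"})

class DecodeWorker:
    """Memory-bound phase: generates one token per step on GPU B."""
    def step(self, cache: KVCache) -> int:
        # Each step attends over the cached keys/values instead of
        # re-running the prompt, then appends the new token's KV entry.
        new_token = hash(tuple(cache.tokens)) % 50_000  # dummy "model"
        cache.tokens.append(new_token)
        return new_token

# Hand-off: prefill once, then decode repeatedly on a different worker.
cache = PrefillWorker().run([101, 2023, 2003, 1037, 3231])
decoded = [DecodeWorker().step(cache) for _ in range(4)]
print(decoded)
```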

Furthermore, Dynamo reduces computational load by leveraging the “KV cache,” which stores and reuses the key/value data already computed for past tokens, so generating each new token does not require reprocessing the entire sequence (illustrated below). The “KV Cache Manager” integrated into Dynamo manages these caches efficiently so they do not exceed GPU memory limits.
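
To see why this saves work, the following sketch (plain NumPy, with made-up dimensions and random projection matrices) shows a single decode step: only the newest token’s key and value are computed, while everything already in the cache is reused.

```python
# Toy illustration of why a KV cache cuts computation during decoding.
# Without a cache, step t recomputes keys/values for all earlier tokens;
# with a cache, each step only computes K/V for the single new token.
# Dimensions and projection matrices are invented for the example.

import numpy as np

d = 64                              # model/head dimension (illustrative)
Wk = np.random.randn(d, d) * 0.01   # key projection
Wv = np.random.randn(d, d) * 0.01   # value projection

def decode_step(query, k_cache, v_cache, new_hidden):
    """One decode step: project only the newest token, reuse the rest."""
    k_cache = np.vstack([k_cache, new_hidden @ Wk])   # append 1 new key
    v_cache = np.vstack([v_cache, new_hidden @ Wv])   # append 1 new value
    scores = (query @ k_cache.T) / np.sqrt(d)         # attend over all keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache

# Prefill fills the cache once for a 10-token prompt...
prompt = np.random.randn(10, d)
k_cache, v_cache = prompt @ Wk, prompt @ Wv
# ...then each generated token adds one row instead of reprojecting all 10.
out, k_cache, v_cache = decode_step(np.random.randn(d), k_cache, v_cache,
                                    np.random.randn(1, d))
print(out.shape, k_cache.shape)     # (64,) (11, 64)
```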

The Trade-off Problem and Hardware Evolution

In his keynote speech, CEO Huang highlighted the inference trade-off between “total tokens per second” (throughput across all users) and “tokens per second per user” (the response speed each user experiences). This is the dilemma that prioritizing fast responses for individual users limits how many users can be served concurrently, while serving more users at once slows each individual response. A toy cost model below illustrates the shape of this trade-off.
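
A back-of-the-envelope model makes the dilemma concrete. The cost numbers below are invented purely for illustration: as the batch of concurrent users grows, total throughput rises while each user’s token rate falls.

```python
# Toy model of the throughput/latency trade-off described in the keynote.
# The cost function is an assumption made up for this sketch, not a
# measurement: a fixed per-step overhead plus a per-request cost.

def step_time_ms(batch_size: int) -> float:
    # Assumed cost model; real GPU behavior is far more complex.
    return 5.0 + 0.8 * batch_size

for batch in (1, 8, 32, 128):
    t = step_time_ms(batch)        # one decode step for the whole batch
    per_user = 1000.0 / t          # tokens/s seen by each user
    total = per_user * batch       # aggregate tokens/s for the GPU
    print(f"batch={batch:>3}  per-user={per_user:6.1f} tok/s  "
          f"total={total:8.1f} tok/s")
```

Running it shows total tokens per second climbing with batch size while per-user tokens per second drops sharply, which is exactly the tension Huang described.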

To address this, NVIDIA has adopted a strategy of pushing past the trade-off through hardware improvements. The newly announced “Blackwell” architecture boasts up to 25 times the processing power of its predecessor, “Hopper,” allowing providers to improve per-user responsiveness and user scale at the same time.

Continued Strong Investment in AI-Related Data Centers

As the primary use case of AI shifts towards inference, the demand for computational processing is experiencing exponential growth. Following “Blackwell,” NVIDIA has unveiled development plans for even higher-performance GPUs, such as “Rubin” and “Feynman,” with Dynamo evolving as the corresponding software foundation.

To support such high-density, high-performance AI processing, large-scale distributed computing environments are essential. Consequently, as AI agents and generative AI continue to expand, investment in data centers as the underlying infrastructure is expected to remain robust.
