● AI Inference War
GTC 10-minute summary: “₩1,500 trillion in AI pre-orders” and NVIDIA’s shore-up-the-weak-spots strategy. This time, the inference (reasoning) war really has begun.
Three things you especially need to see in today’s post (the points that make you read to the end)
- NVIDIA unveiled “Vera Rubin + Grok LPU” as its strategy for breaking through the inference (reasoning) bottleneck, claiming cost-performance of up to 50x versus Blackwell (the previous generation).
- GTC strongly confirmed that AI demand has shifted rapidly from “training” to “inference (reasoning),” and that token consumption explodes especially as agentic AI spreads.
- The post also lays out the concentration risk among the top six hyperscalers (60% of revenue) that Wall Street cites as the reason the stock price is not rising immediately, and then summarizes the view that fundamentals (results) will ultimately pull it along.
1) The biggest message NVIDIA delivered at GTC: “In AI, inference is what makes money now”
1-1. Shifting the weight from training to inference
- If the older AI era was “almost everything is training,” the post says the structure has now changed: as conversation, coding, verification, and retry loops multiply, tokens are consumed continuously during the ‘reasoning process’.
- So NVIDIA’s emphasis this time is not just “training chips” but, more directly, improving inference speed and the inference cost structure.
1-2. The spread of agentic AI explodes token demand
- After ChatGPT, as models advanced, “answer-only AI” evolved into coding agents (agentic AI) that loop through plan → execute → verify.
- Because the cycle of correcting wrong results and re-running keeps repeating, the inference workload keeps growing, which feeds straight into overall market demand for GPUs and infrastructure; a rough back-of-the-envelope model of this token growth follows below.
2) Unveiling the “Vera Rubin architecture + Grok 3 LPU”: directly shoring up the weak spots
2-1. Vera Rubin (next-gen) with the Grok 3 LPU on board
- The core reveal at GTC is a system in which the new Vera Rubin architecture is paired with the Grok 3 LPU (inference processor).
- There was also a claim that, viewed on a 10-year basis, performance improves by “tens of millions of times,” and the message is clear: NVIDIA will no longer neglect its inference-efficiency weaknesses.
2-2. The key is removing the inference bottleneck: distributed inference (prefill / decode)
- The important concept here is distributed inference. Simply put, inference is split into two stages.
- Prefill: handled by the existing Rubin CPX
- Decode: handled by the newly released Grok LPU
- They also emphasized that this split is not fixed: it is applied dynamically, depending on the situation, via the software stack (Dynamo). A minimal sketch of that kind of routing follows below.
3) A complete “AI factory” built through extreme co-design: vertical integration of 7 chip types
3-1. Designed “from start to finish,” from CPU to GPU to networking
- The approach NVIDIA is pushing this time isn’t just “making the chips better.” It’s about optimizing the entire hardware-software ecosystem as one set.
- They describe linking together “almost seven chips” in a vertical lineup, combining the Vera CPU, the Rubin GPU, the Rubin CPX, and networking-related chips.
- Ultimately, the goal is extreme optimization of the “AI factory” at the data-center level.
3-2. Cost-performance: claims of a 35x (up to 50x) improvement versus Blackwell
- This is where the most aggressive number comes in.
- Analysts were cited as saying cost-performance improves by roughly 35x, and “up to 50x”; coming from credible institutions, it reads as a verdict that “the cost-performance is unbeatable.” A trivial illustration of what such a multiple means for cost per token follows below.
4) Features of the Grok 3 LPU: SRAM-based, with performance leaps at the rack level
4-1. SRAM-based means ‘small capacity but insane speed’
- One of the Grok 3 LPU’s highlighted features is that it is SRAM-based.
- Intuitively, it is explained as a chip with relatively little storage capacity but extremely fast access and computation speeds.
4-2. Eight Grok chips assembled into a ‘massive compute rack’
- They explain that the Grok 3 LPU is deployed by bundling 8 units into a “rack,” and at that scale its performance overwhelms competitors.
- The important comparison point: at minimum they cite “3x or more” versus Blackwell, and they believe that with NVIDIA’s full technology stack applied there is room for improvement of up to several thousand times. A back-of-the-envelope view of why SRAM bandwidth matters for decode follows below.
4-3. The message that they could even take the inference market away from competitors
- The point isn’t that a single chip is fast; it’s that combining distributed inference with the software stack can change the cost structure of the inference market itself.
5) “₩1,500 trillion in pre-orders” and the 2027 revenue outlook: growth on a scale the market can feel
5-1. Mention of 2027 revenue of $1 trillion+ and profit margins in the 70% range
- A forecast was cited that 2027 revenue will exceed $1 trillion, with figures putting overall profit margins in the 70%-plus range.
- Operating profit was also discussed at around the $700 billion level (described as “about $700 billion”), an emphasis on profitability rather than revenue alone; the simple arithmetic tying those figures together is shown below.
5-2. The trend of Wall Street’s estimates repeatedly proving too low
- The strong nuance is that estimates keep being revised upward, beyond what Wall Street had predicted.
- A picture keeps emerging of market expectations climbing, along the lines of “it was $500 billion a few months ago and has since been raised to nearly double.”
6) Why the stock price stays in place: the risk of concentration among 6 hyperscalers (60%)
6-1. The top 6 companies account for 60% of revenue
- The point Wall Street raises is simple.
- The concern is that 60% of revenue is concentrated in the top six hyperscalers: Amazon, Microsoft, Google, xAI, Oracle, and Meta.
6-2. So maintaining “investment intensity” is the key
- Even if those top customers are stable (high credit ratings, low default risk), the market’s concern is that if the additional-investment cycle so much as pauses, the valuation could wobble.
6-3. Comment from the blog’s perspective: still, results follow
- From the writer’s perspective, however, the “AI bubble” narrative is a fading trend; what matters is results (fundamentals), so in the end the stock price has no choice but to follow the results.
7) Reading the “AI bubble” controversy: geopolitics (Hormuz) is also a variable
- The original text mentions that AI demand/investment sentiment is influenced by geopolitical issues (e.g., wars/risks related to the Strait of Hormuz).
- They added scenarios such as: if geopolitical issues calm down/are resolved, an interpretation could emerge that “the AI bubble controversy wasn’t a big deal after all.”
8) Another NVIDIA weapon: open software (open source) + hardware lock-in (CUDA-style)
8-1. Open-sourcing Nemo clo (Nemo/NeMo): enterprise shift for local agents
- For the “open clo”-style lines recently known as local base agents, the nuance is that they have been weak for enterprise use, especially around personal-data protection.
- NVIDIA says it has started open-sourcing “Nemo clo,” an enterprise-oriented platform meant to patch that gap.
8-2. Strategy summary: “Openness expands the ecosystem; dependence turns into revenue”
- The assessment is that the approach repeats: keep hardware aligned with NVIDIA, and spread software through open access (the CUDA ecosystem model).
- It is easy to understand as a “razor-and-blades” kind of structure.
9) Roadmap: Rubin Ultra (next year), then the Feynman series (2028)
- A Rubin Ultra release next year was mentioned, followed by a roadmap of subsequent products such as the Feynman / Feynman “Erosa” in 2028.
- What matters here is not simply releasing products on a cadence, but simulating data-center design itself with AI and pushing it in ever more efficient directions.
- By strengthening vertical integration from power to cooling to networking, the view is that the efficiency gap between data centers can grow even larger.
Final conclusion from the writer’s perspective (the key takeaway of this post)
- The core of this GTC is ultimately this.
- NVIDIA accurately predicted the era of AI agents, and laid the matching infrastructure from start to finish to strengthen market dominance.
- It feels strongly like they built a “structure that is hard to compete with,” not only by hardware performance, but by bundling distributed inference + software stack + data-center efficiency together as one package.
A separate summary of only the ‘most important content’ that got less coverage in other news
- What NVIDIA is truly targeting this time isn’t “training speed,” but the bottleneck in the inference workload that agentic AI keeps increasing.
- The Grok LPU isn’t about boasting single-chip performance; the key is the system design of separating prefill/decode and applying them dynamically with Dynamo.
- The “60% concentration among 6 hyperscalers,” mentioned as the reason the stock price isn’t rising right away, is a short-term volatility point, but it was presented alongside the view that the structural trend is hard to break as long as shipments (supply) and results (demand) keep moving up.
- The strategy of widening the ecosystem with open source (Nemo clo) while locking in revenue through hardware dependence (CUDA-style) is also a natural part of this announcement flow.
Main takeaway the writer wants to convey (one-paragraph summary)
At GTC, NVIDIA strongly presented its direction for changing the inference cost structure of the agentic AI era, centered on the next-gen architecture (Vera Rubin) and the Grok 3 LPU. Combined with distributed inference (separating prefill/decode), the software stack (dynamic application), and a vertical data-center efficiency strategy spanning CPU, GPU, and networking, the conclusion of this summary is that they have turned it into a game competitors cannot easily catch up on with “chip performance” alone.
< Summary >
- AI demand focus shifts from training to inference (reasoning), and agentic AI causes token consumption to explode.
- NVIDIA unveiled a strategy to remove the inference bottleneck with Vera Rubin + the Grok 3 LPU (distributed inference: separating prefill/decode).
- Claims emerged that through extreme co-design across seven chip types and vertical integration down to the data center, cost-performance was boosted 35x, up to a maximum of 50x, versus Blackwell.
- While emphasis was placed on the 2027 revenue outlook of $1 trillion+ and an upward trend in results, the risk of concentration among the top six hyperscalers’ revenue (60%) was mentioned as a reason the stock price moved less.
- NVIDIA’s strategy continues: software spreads via open source (Nemo clo), while hardware locks in via dependence (CUDA-style) to keep revenue secured.
[Related article keywords: GTC / LPU / NVIDIA]
- AI infrastructure investment cycle after GTC: why GPU demand doesn’t break
- The inference cost structure changed by LPU: where the battle lies in the agentic AI era
*Source: [ 월텍남 – 월스트리트 테크남 ] – “GTC 10-minute summary: ₩1,500 trillion in AI pre-orders”


