
● GPU Slows, Memory Wars Ignite, HBM Peaks, HBF Disrupts

HBM, Next Is HBF? Why the Real Battleground in the “AI Speed War” Has Shifted from GPUs to “Memory”

Today’s post includes these key takeaways.

First, as AI shifts from ‘training’ to ‘inference,’ why memory bottlenecks end up determining performance.

Second, why metrics like TTFT and TPOT—which drive user-perceived quality—are directly tied to memory bandwidth and capacity.

Third, why HBM alone is not enough, and why ‘memory tiering (Hot/Warm/Cold)’ is becoming the standard design pattern for data centers.

Fourth, how HBF, a stacked-NAND-based intermediate-tier memory discussed as the next card after HBM, could reshape the market landscape.

Fifth, a re-interpretation—from an investment and industry perspective—of the structural shift in which “memory companies take the lead in AI computing design,” a point relatively undercovered in other news.

1) News Briefing: In the AI Performance Race, “Memory” Now Beats “Compute”

Key takeaway in one line

No matter how fast a GPU is, if it can’t fetch the needed data in time, AI stalls.

Why this is becoming a bigger topic right now

As generative AI becomes more advanced with Transformer (Attention)-based architectures, it must continuously reference “full-sentence context” and “previous conversation context.”

In that process, the KV Cache (a per-conversation scratchpad of attention keys and values) grows, and as sequences get longer, the memory burden snowballs.
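The KV cache growth above can be put in numbers with a back-of-the-envelope sketch. The model shape used here (80 layers, 8 grouped KV heads, head dimension 128, fp16) is a hypothetical 70B-class configuration; none of these figures come from the post.

```python
# Back-of-the-envelope KV cache size for a decoder-only Transformer.
# The model shape below is a hypothetical 70B-class config, not from the post.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for the K and V tensors kept per layer, per token, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

for ctx in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(80, 8, 128, ctx, batch=1) / 2**30
    print(f"{ctx:>7} tokens -> {gib:5.1f} GiB of KV cache")
```

The cache scales linearly with context length and batch size, which is why long conversations and agents "snowball" memory demand even though the model weights stay fixed.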

The change created by the era of inference

Training can scale to a degree by throwing more GPUs at parallel processing, but inference generates tokens sequentially.

So, more than “compute,” what determines perceived performance is “how fast you can load prior tokens and intermediate results.”

2) New KPIs That Change User Experience: TTFT and TPOT

AI service competitiveness is moving from “accuracy” to “latency”

Users don’t look at FLOPS or GPU core counts.

They look at “when the first character appears” and “whether it continues smoothly after that.”

TTFT (Time To First Token)

The time from pressing Enter until the first token appears on the screen.

It determines the first impression.

TPOT (Time Per Output Token)

After the first token, the average time each subsequent token takes to arrive, i.e., how smoothly the stream keeps coming.

The longer the conversation, the bigger the perceived quality gap becomes.

Summary

One of the most direct ways to reduce these two metrics is “faster memory access (bandwidth/latency)” and an architecture that “keeps needed data closer.”
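As a concrete illustration, both metrics can be measured from any token stream with a handful of timestamps. The `fake_stream` generator below is a stand-in for a real model's streaming API, which the post does not specify.

```python
import time

def measure_latency(stream):
    """Compute TTFT and mean TPOT from an iterator that yields tokens."""
    start = time.perf_counter()
    first = None
    stamps = []
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now - start          # TTFT: request start -> first token
        stamps.append(now)
    # TPOT: average gap between consecutive output tokens.
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return first, tpot

def fake_stream(n=5, delay=0.01):
    """Hypothetical stand-in for a model's token stream."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tpot = measure_latency(fake_stream())
print(f"TTFT = {ttft * 1000:.1f} ms, TPOT = {tpot * 1000:.1f} ms")
```

Both numbers are dominated by how fast weights and KV cache can be read, which is the post's point: they are memory metrics wearing a UX costume.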

3) Why “Just Buy More GPUs” Doesn’t Work: The Bottleneck Is the “Road”

Asymmetric pace of progress (original gist)

AI’s required compute has exploded, but expanding data movement (memory bandwidth) has been relatively much slower.

It’s like the engine (GPU) became a supercar, while the road (memory/interconnect) stayed the same.

The physical limits of the von Neumann architecture

Compute units and storage units are separated, and while data is being fetched, the compute unit sits idle.

Awareness of this GPU idle time in data centers ultimately leads to the idea of attaching memory directly to compute.

The key industry keyword here

In the end, this is a fight over data center investment efficiency (= performance per CAPEX), and an issue that changes the cloud computing cost structure.
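A rough roofline-style calculation shows why decode is bandwidth-bound: every generated token must stream the model weights (plus KV cache) from memory at least once. The model size and the H100-class bandwidth figure below are illustrative assumptions, not from the post.

```python
# Rough ceiling on single-stream decode speed set by memory bandwidth alone.
# Both figures below are illustrative assumptions.

def decode_tokens_per_s_ceiling(model_bytes, bandwidth_bytes_per_s):
    # Each token reads the full weights once => per-token time >= bytes / BW.
    return bandwidth_bytes_per_s / model_bytes

MODEL_BYTES = 140e9   # hypothetical 70B-parameter model stored in fp16
HBM_BW = 3.35e12      # roughly an H100-class HBM bandwidth, ~3.35 TB/s
print(f"ceiling: ~{decode_tokens_per_s_ceiling(MODEL_BYTES, HBM_BW):.0f} tokens/s")
```

Even a GPU with petaflop-scale compute cannot beat this ceiling for a single stream: the "supercar engine" is rate-limited by the "road," exactly as the analogy above says.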

4) Why HBM Took Off, and Why HBM Alone Isn’t Enough

The role of HBM (High Bandwidth Memory)

By stacking DRAM dies and placing them right next to the GPU, HBM greatly increases bandwidth and reduces bottlenecks.

It’s expensive, but it’s dominant where “latency/bandwidth” directly translates into money.

But the points where HBM alone still doesn’t work

1) The price is high, so you can’t add it endlessly.

2) Capacity is limited, so in workloads like agents that carry lots of long-term context, you hit limits quickly.

3) As inference scale grows, the amount of data you must keep both plentiful and close by explodes.

Conclusion

HBM is optimized for “hot data,” but putting all data into HBM is inefficient both economically and physically.

5) The Emerging Design Standard: “Memory Tiering (Hot / Warm / Cold)”

Hot (hot memory)

Data needed for compute right now.

Handled by HBM next to the GPU.

Fast but expensive, with limited space (capacity).

Cold (cold memory)

Massive data that isn’t used often but must be stored.

Handled by SSDs/storage.

Cheap and large, but slow.

The problem is the “gap” between Hot and Cold

This gap forces GPUs to wait and worsens TTFT/TPOT.

So you need Warm (warm memory)

An intermediate layer that keeps “large-capacity data you frequently pull” relatively close.

HBF emerges here as a candidate.
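The Hot/Warm/Cold idea can be sketched as a toy promotion/demotion cache. The tier capacities and the LRU policy here are illustrative choices, not a description of any real system.

```python
from collections import OrderedDict

class TieredCache:
    """Toy Hot/Warm/Cold hierarchy: a hit promotes data toward the hot tier,
    an overflow demotes the least-recently-used item one tier down."""

    def __init__(self, caps=(4, 16, 64)):      # illustrative capacities
        self.tiers = [OrderedDict() for _ in caps]  # 0=hot, 1=warm, 2=cold
        self.caps = caps

    def get(self, key):
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self.put(key, value)           # promote to the hot tier
                return level, value
        return None, None                      # miss: fetch from origin

    def put(self, key, value, level=0):
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.caps[level]:       # evict LRU item downward
            old_key, old_val = tier.popitem(last=False)
            if level + 1 < len(self.tiers):
                self.put(old_key, old_val, level + 1)

cache = TieredCache()
cache.put("kv_block_0", b"...")
print(cache.get("kv_block_0"))   # prints (0, b'...')
```

In a real data center, "hot" would map to HBM, "warm" to an HBF-like tier, and "cold" to SSDs; the promotion/demotion traffic between tiers is exactly the gap the post says HBF is meant to narrow.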

6) What Is HBF: The “Intermediate-Tier Memory” That Opens the Next Round After HBM

Concept (reframed from the original gist)

HBF is an approach that stacks NAND flash in an HBM-like packaging/stacking structure, aiming to raise bandwidth and density.

Why NAND suddenly becomes a key player in AI

NAND is far cheaper than DRAM and enables very large-capacity configurations.

It becomes a “low-cost, large-capacity tier” that can hold far more data than HBM.

Trade-off

It’s slower than HBM/DRAM.

But the core concept is reducing bottlenecks by configuring it “closer and thicker” than far-away storage like SSDs.

Meaning from a data center perspective

When designing AI infrastructure, it becomes an optimization problem of allocating budget not only to “more HBM,” but to “HBM + HBF (or a similar intermediate tier) + SSD.”

This connects directly to where pricing power and influence shift across the semiconductor supply chain.
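The "HBM + HBF + SSD" budget split really is an optimization problem, and a toy version can be brute-forced. All prices, latencies, and the square-root traffic-skew model below are invented for illustration only.

```python
# Toy CAPEX split: spend a fixed budget on HBM vs. an HBF-like warm tier
# to minimize average access time. Every number here is a made-up assumption.

def avg_latency(gb_hbm, gb_hbf, total_gb=1000):
    LAT = {"hbm": 1.0, "hbf": 10.0, "ssd": 100.0}   # relative access times
    hot_frac = min(gb_hbm / total_gb, 1.0)
    warm_frac = min(gb_hbf / total_gb, 1.0 - hot_frac)
    # Toy skew: traffic share grows as the square root of capacity share.
    p_hot = hot_frac ** 0.5
    p_warm = (hot_frac + warm_frac) ** 0.5 - p_hot
    p_cold = 1.0 - p_hot - p_warm                   # everything else hits SSD
    return p_hot * LAT["hbm"] + p_warm * LAT["hbf"] + p_cold * LAT["ssd"]

BUDGET, PRICE_HBM, PRICE_HBF = 10_000.0, 60.0, 6.0  # $ and $/GB, hypothetical
best = min(
    (avg_latency(x / PRICE_HBM, (BUDGET - x) / PRICE_HBF), x)
    for x in range(0, int(BUDGET) + 1, 500)
)
print(f"best average latency {best[0]:.1f} with ${best[1]} on HBM, rest on HBF")
```

The point is not the specific numbers but the shape of the answer: under any plausible access skew, a mixed allocation beats spending the whole budget on either the fastest or the cheapest tier.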

7) Market Landscape: The Center of Gravity Shifts to “Memory-Centric”

Before

GPU/accelerator companies set AI computing specs, and memory followed to match.

After

As inference-centric workloads expand, KV Cache grows, and agentization progresses, “memory architecture determines system performance.”

In other words, memory technology/packaging/tier design increasingly takes the lead in AI system design.

Where does the industry grow?

As AI semiconductor competition continues centered on HBM, memory semiconductor companies’ bargaining power and premium-product mix can rise.

If intermediate tiers (like HBF) grow, the NAND camp (e.g., SanDisk, Kioxia, etc.) may be re-rated from “simple storage” to an “AI performance component.”

A macro-level important point

This is not a specific company issue, but a story that changes the AI infrastructure investment cycle itself.

AI investment expansion → data center build-out → memory-tier redesign → higher share of high-value memory → a structure that can increase cyclicality.

8) The Really Important Points That Other YouTube Channels and News Outlets Relatively Under-Cover

1) HBF is less a “product” and more a signal of a change in “budget allocation”

In the market, it’s easy to consume it as merely “the next new product after HBM,” but the essence is the trend of rewriting the data center memory cost structure.

That is, the answer to CAPEX optimization shifts from “adding GPUs” to “memory tier design.”

2) Inference economics turns the memory supercycle into a “mega-cycle”

Training can concentrate around a few big tech players, but inference spreads across every industry as services proliferate.

As inference workloads expand, “expensive HBM alone is unaffordable → an intermediate tier becomes mandatory.”

Once this structure takes hold, the demand floor for memory rises and the cycle can lengthen.

3) The more “AI agents” spread, the sooner memory “capacity” blows up

Agents drive explosive growth in reference data: long context, personalized history, tool-use logs, vector search, and more.

In the end, for many services, the bottleneck is likely to hit first not in compute, but in “nearby large-capacity memory/storage tiers.”

4) Memory companies’ bargaining power comes from “packaging/standards”

This was already true for HBM, and going forward, the question of “in what form memory attaches to the system” becomes even more important.

To gain leadership, they must bundle not only chip performance but also packaging, interfaces, and the software stack (memory management).

9) Economic and Industry Outlook (Blog-Style Summary): Watch Points for 2026–2028

Watch point A: The growth rate of inference traffic

More than model performance competition, “how widely services get deployed” determines memory demand.

As enterprise agents/contact centers/commerce/search ramp up in earnest, demand grows more structurally.

Watch point B: Where memory-tier standards settle

Winners diverge depending on whether only HBM expands or whether an intermediate tier like HBF becomes standard.

In this process, benefits can ripple through the supply chain (materials, equipment, packaging).

Watch point C: Inflation/interest-rate environment and data center investment

Data centers are capital-intensive, so they are sensitive to interest rates and financing costs.

However, AI can simultaneously drive cost reduction (automation) and revenue expansion (new services), so investment may persist even during slowdowns.

SEO core keywords naturally included in this post

global economic outlook, rate cuts, inflation, semiconductor supply chain, data center investment

< Summary >

As AI shifts to prioritizing inference over training, memory bandwidth, capacity, and latency—not GPU compute—are increasingly determining perceived performance (TTFT/TPOT).

HBM solves hot data but is expensive and capacity-limited, so “memory tiering” that splits Hot/Warm/Cold becomes the core point of data center design.

HBF is an attempt to fill the intermediate (Warm) tier via NAND-based stacking, and it signals that memory companies can take the lead in AI computing design.


*Source: [ 티타임즈TV ]

– After HBM, Is It HBF? The Memory Semiconductor Competition Reshaping AI Computing

