● GPU Slows, Memory Wars Ignite, HBM Peaks, HBF Disrupts
HBM, Next Is HBF? Why the Real Battleground in the “AI Speed War” Has Shifted from GPUs to “Memory”
Today’s post includes these key takeaways.
First, as AI shifts from ‘training’ to ‘inference,’ why memory bottlenecks end up determining performance.
Second, why metrics like TTFT and TPOT—which drive user-perceived quality—are directly tied to memory bandwidth and capacity.
Third, why HBM alone is not enough, and why ‘memory tiering (Hot/Warm/Cold)’ is becoming the standard design pattern for data centers.
Fourth, how HBF (a stacked-NAND-based intermediate-tier memory), discussed as the next card after HBM, could reshape the market landscape.
Fifth, a re-interpretation—from an investment and industry perspective—of the structural shift in which “memory companies take the lead in AI computing design,” a point relatively undercovered in other news.
1) News Briefing: In the AI Performance Race, “Memory” Beats “Compute” Now
Key takeaway in one line
No matter how fast a GPU is, if it can’t fetch the needed data in time, AI stalls.
Why this is becoming a bigger topic right now
As generative AI becomes more advanced with Transformer (Attention)-based architectures, it must continuously reference “full-sentence context” and “previous conversation context.”
In that process, the KV Cache (an in-conversation scratchpad) grows, and as contexts get longer, the memory burden snowballs.
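How fast does that scratchpad grow? A minimal sketch below estimates KV cache size from the standard Transformer formula (2 tensors per layer, one each for keys and values). The model dimensions are illustrative assumptions, roughly a 7B-parameter model in fp16, not figures from the post:

```python
# Rough KV cache size estimate for a Transformer decoder.
# Model dimensions below are illustrative assumptions (~7B-class, fp16).

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Keys + values: 2 tensors per layer, each [batch, heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes per element)
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1)
ctx_32k   = kv_cache_bytes(32, 32, 128, seq_len=32_768, batch=1)

print(f"{per_token / 1024:.0f} KiB per token")        # → 512 KiB per token
print(f"{ctx_32k / 2**30:.1f} GiB for a 32k context")  # → 16.0 GiB for a 32k context
```

A single long conversation can thus consume tens of gigabytes of cache on top of the model weights, which is exactly why capacity, not just compute, becomes the constraint.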
The change created by the era of inference
Training can scale to a degree by throwing more GPUs at parallel processing, but inference generates tokens sequentially.
So, more than “compute,” what determines perceived performance is “how fast you can load prior tokens and intermediate results.”
2) New KPIs That Change User Experience: TTFT and TPOT
AI service competitiveness is moving from “accuracy” to “latency”
Users don’t look at FLOPS or GPU core counts.
They look at “when the first character appears” and “whether it continues smoothly after that.”
TTFT (Time To First Token)
The time from pressing Enter until the first token appears on the screen.
It determines the first impression.
TPOT (Time Per Output Token)
After the first token, how smoothly the next tokens keep coming.
The longer the conversation, the bigger the perceived quality gap becomes.
Summary
One of the most direct ways to reduce these two metrics is “faster memory access (bandwidth/latency)” and an architecture that “keeps needed data closer.”
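The two metrics are easy to instrument. A minimal sketch below measures TTFT and TPOT over any token iterator; `fake_stream` is a hypothetical stand-in for a streaming API response, with sleep times chosen only to mimic a slow prefill followed by fast decode:

```python
import time

def measure_ttft_tpot(token_stream):
    """Measure TTFT and mean TPOT over an iterable that yields tokens."""
    start = time.perf_counter()
    ttft, count, last = None, 0, start
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start       # time to first token
        count += 1
        last = now
    # mean gap between tokens after the first one
    tpot = (last - start - ttft) / (count - 1) if count > 1 else 0.0
    return ttft, tpot

def fake_stream():                   # hypothetical streaming response
    time.sleep(0.05)                 # prefill: compute-heavy, sets TTFT
    yield "Hello"
    for _ in range(9):
        time.sleep(0.005)            # decode: bandwidth-bound, sets TPOT
        yield "..."

ttft, tpot = measure_ttft_tpot(fake_stream())
print(f"TTFT ≈ {ttft*1000:.0f} ms, TPOT ≈ {tpot*1000:.1f} ms/token")
```

TTFT is dominated by prefill (compute plus loading the prompt), while TPOT is dominated by how fast weights and KV cache can be streamed from memory per step.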
3) Why “Just Buy More GPUs” Doesn’t Work: The Bottleneck Is the “Road”
Asymmetric pace of progress (original gist)
AI’s required compute has exploded, but expanding data movement (memory bandwidth) has been relatively much slower.
It’s like the engine (GPU) became a supercar, while the road (memory/interconnect) stayed the same.
The physical limits of the von Neumann architecture
Compute units and storage units are separated, and while fetching data, the compute unit waits.
Awareness of GPU idle time in data centers ultimately pushes the industry toward "bringing memory closer to compute."
The key industry keyword here
In the end, this is a fight over data center investment efficiency (= performance per CAPEX), and an issue that changes the cloud computing cost structure.
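The supercar-on-a-narrow-road analogy can be made concrete with a back-of-the-envelope calculation. In low-batch decoding, each generated token requires streaming essentially all model weights from memory, so the floor on per-token latency is set by bandwidth, not FLOPS. The GPU figures below are illustrative assumptions (roughly H100-class), not numbers from the post:

```python
# Memory-bound vs compute-bound floor for single-stream decoding.
# Hardware numbers are illustrative assumptions (~H100-class).

weights_gb    = 14.0      # 7e9 params * 2 bytes (fp16)
hbm_gb_s      = 3350.0    # HBM3 bandwidth, GB/s
peak_tflops   = 990.0     # fp16 peak compute, TFLOPS
flops_per_tok = 2 * 7e9   # ~2 FLOPs per parameter per token

t_mem  = weights_gb / hbm_gb_s              # bandwidth-bound floor (s/token)
t_comp = flops_per_tok / (peak_tflops * 1e12)  # compute-bound floor (s/token)

print(f"memory-bound:  {t_mem*1e3:.2f} ms/token -> ~{1/t_mem:.0f} tok/s max")
print(f"compute-bound: {t_comp*1e3:.4f} ms/token")
```

Under these assumptions the memory floor is roughly two orders of magnitude above the compute floor: the engine idles while the road is saturated, which is the post's core claim in numbers.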
4) Why HBM Took Off, and Why HBM Alone Isn’t Enough
The role of HBM (High Bandwidth Memory)
By stacking DRAM dies and placing them right next to the GPU, it greatly increases bandwidth and reduces bottlenecks.
It’s expensive, but it’s dominant where “latency/bandwidth” directly translates into money.
But the points where HBM alone still doesn’t work
1) The price is high, so you can’t add it endlessly.
2) Capacity is limited, so in workloads like agents that carry lots of long-term context, you hit limits quickly.
3) As inference scale grows, the volume of data that must be kept both "large" and "close by" explodes.
Conclusion
HBM is optimized for “hot data,” but putting all data into HBM is inefficient both economically and physically.
5) The Emerging Design Standard: “Memory Tiering (Hot / Warm / Cold)”
Hot (hot memory)
Data needed for compute right now.
Handled by HBM next to the GPU.
Fast but expensive, with limited space (capacity).
Cold (cold memory)
Massive data that isn’t used often but must be stored.
Handled by SSDs/storage.
Cheap and large, but slow.
The problem is the “gap” between Hot and Cold
This gap forces GPUs to wait and worsens TTFT/TPOT.
So you need Warm (warm memory)
An intermediate layer that keeps “large-capacity data you frequently pull” relatively close.
HBF emerges here as a candidate.
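The value of that middle layer can be shown with a toy expected-latency model. The latency figures below are illustrative assumptions (only their relative order, HBM ≪ HBF-like tier ≪ SSD, matters), not measured values:

```python
# Toy Hot/Warm/Cold hierarchy. Latencies are illustrative assumptions.
TIERS = {
    "HBM (hot)":  1e-7,   # access latency, seconds
    "HBF (warm)": 2e-6,
    "SSD (cold)": 1e-4,
}

def avg_access_latency(hit_rates):
    """Expected latency given the fraction of accesses served by each tier."""
    assert abs(sum(hit_rates.values()) - 1.0) < 1e-9
    return sum(hit_rates[t] * lat for t, lat in TIERS.items())

# Without a warm tier, every HBM miss falls all the way to SSD.
no_warm   = avg_access_latency({"HBM (hot)": 0.90, "HBF (warm)": 0.00, "SSD (cold)": 0.10})
with_warm = avg_access_latency({"HBM (hot)": 0.90, "HBF (warm)": 0.09, "SSD (cold)": 0.01})

print(f"hot + cold only  : {no_warm * 1e6:.2f} us/access")
print(f"hot + warm + cold: {with_warm * 1e6:.2f} us/access")
```

Even though the warm tier absorbs only 9% of accesses in this sketch, average latency drops several-fold, because the few SSD trips were dominating the total. That is the whole argument for filling the Hot-Cold gap.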
6) What Is HBF: The “Intermediate-Tier Memory” That Opens the Next Round After HBM
Concept (reframed from the original gist)
HBF takes the approach of stacking NAND Flash in an HBM-like packaging/stacking structure, aiming to raise both bandwidth and density.
Why NAND suddenly becomes a key player in AI
NAND is far cheaper than DRAM and enables very large-capacity configurations.
It becomes a “low-cost, large-capacity tier” that can hold far more data than HBM.
Trade-off
It’s slower than HBM/DRAM.
But the core concept is reducing bottlenecks by configuring it “closer and thicker” than far-away storage like SSDs.
Meaning from a data center perspective
When designing AI infrastructure, it becomes an optimization problem of allocating budget not only to “more HBM,” but to “HBM + HBF (or a similar intermediate tier) + SSD.”
This connects directly to where pricing power and influence shift across the semiconductor supply chain.
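That budget-allocation framing can be sketched as simple arithmetic. The per-GB prices below are hypothetical placeholders (only the ordering HBM ≫ HBF-like NAND tier ≫ SSD matters), chosen to illustrate the shape of the trade-off, not actual market prices:

```python
# Fixed memory budget split across tiers. Prices are hypothetical;
# only their relative order (HBM >> HBF >> SSD) matters here.
PRICE_PER_GB = {"HBM": 15.0, "HBF": 2.0, "SSD": 0.10}  # $/GB, illustrative
BUDGET = 100_000.0  # $

def capacity(alloc):
    """alloc: fraction of budget per tier -> GB bought per tier."""
    assert abs(sum(alloc.values()) - 1.0) < 1e-9
    return {t: BUDGET * f / PRICE_PER_GB[t] for t, f in alloc.items()}

hbm_only = capacity({"HBM": 1.0, "HBF": 0.0, "SSD": 0.0})
tiered   = capacity({"HBM": 0.5, "HBF": 0.4, "SSD": 0.1})

print("HBM only:", {t: f"{gb:,.0f} GB" for t, gb in hbm_only.items()})
print("Tiered  :", {t: f"{gb:,.0f} GB" for t, gb in tiered.items()})
```

Under these assumptions, the same budget buys either ~6.7 TB of hot memory alone, or ~3.3 TB hot plus 20 TB warm plus 100 TB cold; which mix wins depends on the workload's hit rates, which is exactly the optimization problem the post describes.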
7) Market Landscape: The Center of Gravity Shifts to “Memory-Centric”
Before
GPU/accelerator companies set AI computing specs, and memory followed to match.
After
As inference-centric workloads expand, KV Cache grows, and agentization progresses, “memory architecture determines system performance.”
In other words, memory technology/packaging/tier design increasingly takes the lead in AI system design.
Where does the industry grow?
As AI semiconductor competition continues centered on HBM, memory semiconductor companies’ bargaining power and premium-product mix can rise.
If intermediate tiers (like HBF) grow, the NAND camp (e.g., SanDisk and Kioxia) may be re-rated from "simple storage" to an "AI performance component."
A macro-level important point
This is not a specific company issue, but a story that changes the AI infrastructure investment cycle itself.
AI investment expansion → data center build-out → memory-tier redesign → higher share of high-value memory → a structure that can increase cyclicality.
8) The “Really Important Content” That Other YouTube/News Relatively Under-Talk About
1) HBF is less a “product” and more a signal of a change in “budget allocation”
In the market, it’s easy to consume it as merely “the next new product after HBM,” but the essence is the trend of rewriting the data center memory cost structure.
That is, the answer to CAPEX optimization shifts from “adding GPUs” to “memory tier design.”
2) Inference economics turns the memory supercycle into a “mega-cycle”
Training can concentrate around a few big tech players, but inference spreads across every industry as services proliferate.
As inference workloads expand, “expensive HBM alone is unaffordable → an intermediate tier becomes mandatory.”
Once this structure takes hold, the demand floor for memory rises and the cycle can lengthen.
3) The more “AI agents” spread, the sooner memory “capacity” blows up
Agents drive explosive growth in reference data: long context, personalized history, tool-use logs, vector search, and more.
In the end, for many services, the bottleneck is likely to hit first not in compute, but in “nearby large-capacity memory/storage tiers.”
4) Memory companies’ bargaining power comes from “packaging/standards”
That's true for HBM, and going forward, "in what form memory can attach to the system" becomes even more important.
To gain leadership, they must bundle not only chip performance but also packaging, interfaces, and the software stack (memory management).
9) Economic and Industry Outlook (Blog-Style Summary): Watch Points for 2026–2028
Watch point A: The growth rate of inference traffic
More than model performance competition, “how widely services get deployed” determines memory demand.
As enterprise agents/contact centers/commerce/search ramp up in earnest, demand grows more structurally.
Watch point B: Where memory-tier standards settle
Winners diverge depending on whether only HBM expands or whether an intermediate tier like HBF becomes standard.
In this process, benefits can ripple through the supply chain (materials, equipment, packaging).
Watch point C: Inflation/interest-rate environment and data center investment
Data centers are capital-intensive, so they are sensitive to interest rates and financing costs.
However, AI can simultaneously drive cost reduction (automation) and revenue expansion (new services), so investment may persist even during slowdowns.
< Summary >
As AI shifts to prioritizing inference over training, memory bandwidth, capacity, and latency—not GPU compute—are increasingly determining perceived performance (TTFT/TPOT).
HBM solves hot data but is expensive and capacity-limited, so “memory tiering” that splits Hot/Warm/Cold becomes the core point of data center design.
HBF is an attempt to fill the intermediate (Warm) tier via NAND-based stacking, and it signals that memory companies can take the lead in AI computing design.
[Related posts…]
- A Summary of HBM Demand Explosion and AI Data Center Investment Points
- In the AI Inference Era, Why the Cloud Cost Structure Is Changing
*Source: [ 티타임즈TV ]
– "HBM 다음은 HBF? AI 연산의 판도를 바꾸는 메모리 반도체 경쟁" (After HBM, HBF? The Memory-Chip Race Reshaping AI Compute)


