● HBM KV Cache Bottleneck Triggers VDPU Search Arms Race
HBM and TurboQuantum alone aren’t enough… In the age of agentic AI, the “memory war” is about to be reshaped by the incoming VDPU
A core point the market can easily miss right now: the bottleneck isn’t “computation,” it’s “data access”
As agentic AI rolls out in earnest, the era of boosting only “model performance,” as if for a paper, is ending.
Instead, the explosive growth of repeated search (more search iterations) + context (agent memory) + per-session caching shifts the bottleneck away from GPUs/compute and toward “the path to fetching data.”
As a result, the key keywords change like this.
HBM capacity/bandwidth limits → KV cache pressure → CPU bottleneck (search) → need for a dedicated accelerator (VDPU)
In this article (an interview with Jeong Mu-kyung, CEO of Denotasia), the single most important message is this.
The claim is that the “memory war” doesn’t end by simply stacking more HBM, and that a dedicated processor that accelerates data-access (search) workloads will change the game.
1) Why did the “memory war” suddenly get bigger? Agentic AI keeps ‘reading-finding-generating’ data
Agentic AI isn’t just about answering once—it repeats the loop of action-memory-search-generation.
That repeated loop is where the data explodes.
- The core of LLM answer quality is not only the “model,” but also the “retrieved data”
When it receives a question, it usually searches external data, brings it in, and then formulates the answer based on it.
- Whenever the agent loop runs, search/tool calls repeat
Because the process of obtaining the “information needed to craft the answer” happens multiple times, the search workload surges.
- Accumulation of context (work records) and agent memory
Because the AI must bring its memory back, session data keeps piling up, increasing the cache/memory burden.
In conclusion, the core point is this: “as AI scales, it isn’t only compute that grows.”
Because agentic AI repeats data access itself, memory, storage, and the CPU-driven pipeline all get shaken at the same time.
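To make that loop concrete, here is a minimal, self-contained Python sketch of the action-memory-search-generation cycle described above. The retrieve, generate, and agent_loop functions are illustrative stand-ins (not any specific framework’s API); the point is that data accesses scale with the number of loop iterations and with the accumulating memory, not with a single model call.

```python
# Minimal sketch of the agent loop described above (illustrative only;
# retrieve/generate/agent_loop are stand-ins, not a real framework's API).

def retrieve(query, memory):
    # Stand-in for a vector-DB / external search call (word overlap instead of embeddings).
    q = set(query.lower().split())
    return [m for m in memory if q & set(m.lower().split())][:3]

def generate(query, context):
    # Stand-in for an LLM call; in a real service this step reuses the KV cache.
    return f"answer({query!r}, using {len(context)} retrieved items)"

def agent_loop(task, memory, max_steps=4):
    data_accesses = 0
    answer = None
    for step in range(max_steps):
        context = retrieve(task, memory)          # search repeats on every step
        data_accesses += 1
        answer = generate(task, context)          # generation reads the growing context
        memory.append(f"step {step}: {answer}")   # agent memory keeps accumulating
        data_accesses += 1
    return answer, data_accesses

if __name__ == "__main__":
    memory = ["doc: HBM bandwidth notes", "doc: KV cache sizing"]
    answer, accesses = agent_loop("How big does the KV cache get?", memory)
    print(answer)
    print("data accesses in one task:", accesses)  # grows with loop count, not model size
```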
2) If we just add more HBM, will that solve it? The article says it’s “not that simple”
Here is where the debate emerges.
Approaches like TurboQuantum, which drew attention from Google, created expectations that “memory demand would be reduced.”
But the CEO pushes back with this logic.
- Reducing context could also destabilize “performance”
Since context length is tied to service quality, the market will move toward “using even longer context, even if memory is reduced.”
- In summary: even if compression happens, if total demand grows faster, the problem comes back
It’s hard to jump straight to the conclusion that HBM doesn’t need to increase.
So the key isn’t that “one technology ends the whole game,” but that
the continuously growing context, cache, and search must be handled at the system level.
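A back-of-the-envelope illustration of that system-level point, with purely hypothetical numbers (neither the compression ratio nor the context growth comes from the article): even a strong compression ratio is wiped out if total context demand grows faster.

```python
# Back-of-the-envelope check of the "compression vs. demand" argument.
# The 4x compression and 8x context-growth figures are hypothetical,
# chosen only to illustrate the direction of the effect.

baseline_kv_gb = 40          # KV cache footprint today (hypothetical)
compression_ratio = 4        # a KV-compression technique shrinks it 4x
context_growth = 8           # but services move to 8x longer context/sessions

after = baseline_kv_gb / compression_ratio * context_growth
print(f"KV footprint: {baseline_kv_gb} GB -> {after:.0f} GB")   # 40 GB -> 80 GB
```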
3) Why the KV cache becomes the bottleneck: it keeps taking up space per session, then has to be dropped and brought back
The bottleneck the CEO talks about most strongly is the KV cache.
In agentic AI/LLM services, when question-answering repeats, the “previous context” the model needs when generating tokens is kept in the form of KV cache.
The problem is that this KV cache grows—and in data center environments, it gets even more intense.
- It’s supposed to live in HBM for speed, but capacity is the limit
As the KV cache grows, you hit a point where it can’t all fit in HBM, and speed can then drop sharply.
- GPUs are shared across multiple users/sessions
Even if model parameters are shared, the context differs per session.
- If sessions pause briefly, you need scheduling that drops the cache and then brings it back
The process of moving the KV cache down to disk/storage and then bringing it back becomes a system bottleneck.
- Cache movement connects bottlenecks all the way to “network/storage/scheduling”
This directly leads to the bottleneck that determines overall latency.
So the “memory war” is not just a DRAM/HBM capacity issue.
It’s shifting into a problem of where the cache sits (tiers), how it moves (scheduling), and how it’s accelerated (accelerator).
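To see why the KV cache spills out of HBM, here is the standard transformer KV-cache sizing arithmetic, applied to a hypothetical 70B-class, GQA-style model shape (80 layers, 8 KV heads, head dimension 128, FP16). The specific numbers are assumptions for illustration, not figures from the interview.

```python
# Rough KV-cache sizing using the standard transformer formula:
#   bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per element
# The model shape below is a hypothetical 70B-class config with GQA.

def kv_cache_bytes(layers, kv_heads, head_dim, tokens, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

per_token = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=1)
session_128k = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=128_000)

print(f"{per_token / 1024:.0f} KiB per token")               # ~320 KiB
print(f"{session_128k / 1024**3:.1f} GiB per 128k session")  # ~39 GiB

# Ten concurrent long sessions alone would want close to 400 GiB of KV cache,
# which is why it spills out of HBM and has to be tiered and moved.
```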
4) Then why did the CPU become important? Agents increased the share of “search” running on the CPU
Many people think, “Isn’t AI a GPU thing?” but the explanation is that the landscape changes once agents appear.
The point emphasized in the article is this.
- The agent loop is a structure that amplifies CPU bottlenecks
While the CPU oversees the whole operation, data search/tool calls/orchestration get layered on top of it.
- The “CPU 50% / GPU 50%” split shown in reports
In reality it may differ depending on the service/optimization, but the concern is spreading that “the CPU is taking a larger share than people expect” (a quick calculation of what that split implies follows at the end of this section).
- As search workloads explode, it becomes a burden to handle everything with the CPU alone
That’s why you need dedicated chips that accelerate data search.
Here, the article naturally moves to its conclusion.
Dedicated semiconductors that sit beside the CPU (which handles orchestration) and accelerate search: this is where the VDPU becomes the key axis.
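Before moving on, a quick Amdahl’s-law style calculation shows why that CPU share matters: if CPU-side search/orchestration takes roughly half of end-to-end loop time (as in the 50/50 picture mentioned above), speeding up only the GPU barely moves latency. The speedup factors below are hypothetical.

```python
# Amdahl's-law style view of the "CPU 50% / GPU 50%" picture mentioned above.
# If half of end-to-end loop time is CPU-side search/orchestration, even a much
# faster GPU only modestly cuts latency; accelerating search is what moves it.

def end_to_end_speedup(cpu_share, gpu_speedup, cpu_speedup=1.0):
    gpu_share = 1.0 - cpu_share
    return 1.0 / (cpu_share / cpu_speedup + gpu_share / gpu_speedup)

print(end_to_end_speedup(cpu_share=0.5, gpu_speedup=4))                 # ~1.6x overall
print(end_to_end_speedup(cpu_share=0.5, gpu_speedup=4, cpu_speedup=4))  # 4.0x when search is accelerated too
```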
5) What does the VDPU accelerate? Dedicated acceleration of “vector DB search”
The role of VDPU, as explained by the CEO, is relatively clear.
A commonly used flow in agentic AI is
vectorization (transforming question/document meaning) → vector similarity search → sending top results back to the LLM.
In this process, the VDPU targets two aspects.
- The computation volume problem
Vectors must be compared, and when there is a lot of data, the “comparison computations” explode.
- The search structure problem
Simple comparison (exhaustive comparison) is slow, so DBs need to create efficiency through structure (such as graph-based indexes), and the system must be optimized to execute that search faster (see the sketch below).
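The sketch below shows what that exhaustive comparison looks like in plain NumPy: one dot product per stored vector, so the work grows linearly with the number of vectors and their dimension. It is a baseline illustration only; production vector DBs replace it with graph-structured indexes (HNSW-style), and the VDPU pitch is to accelerate exactly this compare-heavy step in hardware.

```python
# Minimal brute-force ("exhaustive comparison") vector search with NumPy.
# This is the slow baseline described above; production systems replace it
# with graph-structured indexes and, in the VDPU pitch, dedicated hardware
# for the distance computations.

import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(100_000, 768)).astype(np.float32)   # document embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

query = rng.normal(size=768).astype(np.float32)
query /= np.linalg.norm(query)

scores = docs @ query                    # one dot product per stored vector: O(N * d)
top_k = np.argsort(-scores)[:5]          # indices of the 5 most similar documents
print(top_k, scores[top_k])
```

Because the brute-force scoring is a single matrix-vector product, it is also the part of the pipeline that maps most naturally onto a dedicated accelerator.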
In other words, the VDPU isn’t just a “model computation accelerator.”
It’s aimed at accelerating the search pipeline that fetches the information AI needs.
And there’s another passage that this article connects naturally to.
When search gets faster, overall loop latency for the LLM goes down, and even if you run more loops, the service can still hold up.
6) “Semantic interfaces” are coming: the role of storage/DB shifts from file-centered to meaning-based
There’s one more perspective the article takes quite boldly.
In the past, storage strongly favored a file-centered interface.
But for AI, it’s less about “find me this document” and more about “find it based on meaning.”
So the CEO predicts this shift.
- Meaning-based discovery like vector DB/graph DB will take root as a storage interface
- Instead of “scrolling” to find stored data, AI calls it directly by meaning
- Ultimately, the data access approach itself changes
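To make the interface shift concrete, here is a small contrast between path-based and meaning-based access. SemanticStore, its read/query methods, and the word-overlap scoring are hypothetical stand-ins for illustration (a real system would run embedding plus vector/graph search underneath); they are not an API from the article or from any vendor.

```python
# Contrast between a file-centered interface and a meaning-based one.
# SemanticStore and its methods are hypothetical, for illustration only.

from dataclasses import dataclass, field

@dataclass
class SemanticStore:
    docs: dict = field(default_factory=dict)

    # File-centered access: the caller must already know *where* the data lives.
    def read(self, path: str) -> str:
        return self.docs[path]

    # Meaning-based access: the caller describes *what* it needs; a real system
    # would run an embedding + vector/graph search underneath instead of word overlap.
    def query(self, meaning: str, k: int = 1) -> list:
        words = set(meaning.lower().split())
        scored = sorted(self.docs.values(),
                        key=lambda text: -len(words & set(text.lower().split())))
        return scored[:k]

store = SemanticStore({"reports/q3.txt": "Q3 HBM shipment report",
                       "notes/kv.txt": "KV cache tiering notes for long sessions"})
print(store.read("notes/kv.txt"))                  # needs the exact path
print(store.query("how do we tier the kv cache"))  # needs only the meaning
```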
The economic/industry angle that connects here is this.
In the AI era, dark data (data that’s stored but not used) becomes a “source,” so
data value rises and
data access/search technologies become the center of competitiveness.
7) The product/solution direction Denotasia is talking about: a “system” approach bundling chips + software
Let’s also summarize the company direction described in the article.
- Hardware: VDPU/accelerators (prototype stage; the first chips are mentioned for this summer)
- Software: an engine including vector DB/graph capabilities, plus an integrated solution along the lines of an agent OS / agentic RAG
- Goal: reduce the bottlenecks that bounce between CPU/storage/DB with an “accelerated pipeline”
There’s also an important point.
The CEO says the reason large hyperscalers build their own chips isn’t the “number of transistors,” but rather
application understanding optimized for their own workloads/platform.
So, combined with domestic/memory strengths (data supply based on storage/memory infrastructure), the story continues toward winning in the AI-era data pipeline.
8) Investment points (mentioned in the article): market shifts connect directly to revenue/validation
The article also mentions an investment round.
- Series A of roughly 90 billion KRW is mentioned
- Persuasion logic: the context that “memory/data/domain-specific” trends are becoming real, and POC preparation and real-world validation with big tech/cloud are underway
The point to carry into the market perspective is this.
The AI semiconductor competition is expanding beyond a pure “GPU performance race,” and this is read as a sign that it is shifting toward accelerating data access (search, semantic exploration, cache movement).
The five-line, news-style core takeaway for readers
- Agentic AI repeats “search/access” more than raw computation and shifts the bottleneck to the data layer.
- KV cache burdens HBM, and the cost of dropping/reloading it due to session scheduling grows.
- Even with compression/optimization like TurboQuantum, demand to expand context and improve service quality keeps growing, so the “demand-bottleneck” doesn’t disappear.
- As CPU orchestration + search workloads grow, CPU bottlenecks move to the forefront.
- Search-dedicated accelerators like VDPU are likely to create a new arena that complements the limits of the “keep stacking more HBM” strategy.
The most important separate takeaway to deliver (a point less discussed elsewhere)
In this interview, there’s a passage that sounds “most important” yet is relatively less intuitive.
That the bottleneck in AI performance shifts from “model inference FLOPS” to “where per-session context/cache sits and how it gets handed over”.
So future investment is likely to go like this.
- It’s not just a competition for HBM capacity; performance improves when KV cache management (tiering) + search acceleration (VDPU) + meaning-based DB interfaces are bundled together.
- In other words, not a single “one technology,” but memory-centric (cache/movement) + data-centric (semantic search) + system-centric (pipeline) must evolve together.
Once this perspective clicks, the flows you’ll see in the news going forward will be far easier to organize.
For example:
“Even though GPU performance improved, latency doesn’t decrease” → it could be due to KV/search/cache movement, and
“Agentic is booming—why is there also talk about CPU?” → because search/orchestration is getting layered onto the CPU.
SEO keywords, inserted naturally (connecting to the article’s context)
The keywords that especially connect in this piece are AI semiconductors, HBM, agentic AI, data centers, and vector databases.
The key this time is that the weight is shifting from “compute acceleration” to “data access/search acceleration.”
< Summary >
- Agentic AI repeats search and context management during the answer process, making data access the bottleneck.
- KV cache grows per session, creating HBM pressure and latency costs from dropping and bringing the cache back up.
- Even with compression like TurboQuantum, demand to expand context keeps growing, so the bottleneck doesn’t go away.
- As search workloads increase, CPU bottlenecks become more prominent, and the need grows for dedicated accelerators to help the CPU.
- VDPU aims to reduce agent loop latency by accelerating search/matching operations in vector databases (semantic search).
- Also, storage/DB interfaces change from file-centered to meaning-based, making the competition for AI data pipelines more important.
[Related article…]
- Why HBM demand and the next-generation memory competition are changing
- The time when VDPU/DPU-class accelerators draw attention in data centers
*Source: [ 티타임즈TV ]
– “The memory war that can’t be solved with HBM or TurboQuantum: VDPU, the game-changing semiconductor, is coming” (Jeong Mu-kyung, CEO of Denotasia)


