The Era When AI “Thinks in English” and “Hides Its Thinking”: As the LLM Black Box Opens, Prompt Strategy Has Completely Changed
In this post, I’ll organize exactly four things at once.
1) Why the dissection result that “LLMs actually reason in two stages” turns prompts into “workflow documents”
2) How “AI thinks in English” splits quality in multilingual work (summarization/translation/research)
3) The shocking point that “a reasoning model’s chain-of-thought (CoT) may not be real thinking,” and how to design verification
4) Templates you can use immediately in real work: scratchpads, prompt chaining, and conclusion-backward design
1) [News Briefing] The LLM “Black Box” Is Starting to Open: Interpretability Changes Prompts
The core of the recent trend is that “mechanistic interpretability” research—trying to observe, even partially, why the model produced a given answer—is advancing rapidly.
In the past, we tuned by intuition while looking only at outcomes; now prompts are evolving toward reducing “output failure patterns” by observing internal model features and pathways.
Why this matters is simple.
From a company’s perspective, the more generative AI is adopted, the more hallucinations and security risks translate directly into costs.
In other words, we’re entering a phase where the competitive edge is not “how to get better answers,” but “how to systematically reduce wrong or risky answers.”
2) [Core Point Research 1] LLMs Do “Two-Stage Reasoning”: So Prompts Shift from “One-Shot Questions” to “Workflow Design”
One of the most important passages in the original text is this.
On the surface it looks like “question → answer,” but in reality, a path is observed where the model goes through intermediate steps internally and then reaches a conclusion.
2-1) Practical Takeaway: Shift from “Short Prompts” to “Workflow Prompts”
Traditional approach (inefficient):
“Which state is Dallas in, and what is that state’s capital?”
Improved approach (workflow):
“(1) Identify the state Dallas is in, (2) answer that state’s capital, and (3) show each step explicitly.”
The benefits are straightforward.
When you force the model to externalize the steps it would otherwise take implicitly, you can cut it off quickly if it goes down the wrong path, and it’s easier to re-instruct.
This also reduces token waste from a productivity standpoint.
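The workflow framing above can be captured in a small helper. This is a minimal sketch; the function name and the final “show each step” instruction are illustrative choices, not from the original article.

```python
# A minimal sketch: turn a one-shot question into an explicit workflow prompt.

def build_workflow_prompt(steps: list[str]) -> str:
    """Render a list of sub-tasks as a numbered workflow prompt."""
    lines = [f"({i}) {step}" for i, step in enumerate(steps, start=1)]
    # Force the model to externalize its intermediate steps.
    lines.append(f"({len(steps) + 1}) Show each step explicitly before the final answer.")
    return "\n".join(lines)

prompt = build_workflow_prompt([
    "Identify the state Dallas is in",
    "Answer that state's capital",
])
print(prompt)
```

Because each sub-step is now visible in the output, a wrong turn at step (1) can be caught and corrected before it contaminates step (2).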
3) [Core Point Research 1-2] LLMs “Plan the Conclusion (the Ending) in Advance”: So If You “Fix the Conclusion First and Make It Write Backward,” Text Density Improves
An interesting point in the original text is where the common belief that it’s “a machine that only emits the next token” breaks down.
To match rhyme, meter, and context, planned activations were observed—suggesting it keeps the next line and the next development in mind.
3-1) Practical Takeaway: If You Hate “Divergent Writing,” Fix the Conclusion and Design It to Converge
Traditional approach (easily becomes vague):
“Write a persuasive piece about AI safety.”
Improved approach (backward design):
“Fix the conclusion as: ‘Investing in AI safety is not optional; it is essential.’
Design a three-step argument backward toward this conclusion, then write the finished piece.”
This is especially impactful for documents that must have a conclusion, such as market/industry reports, company analyses, and investment memos.
In times like now—when markets swing due to variables like interest rates, FX rates, and earnings—if the conclusion wobbles, the document becomes useless immediately.
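The backward-design pattern is mechanical enough to template. A minimal sketch follows; the function name and exact wording are illustrative assumptions, not the article’s own template.

```python
# A sketch of conclusion-backward prompt construction.

def backward_design_prompt(conclusion: str, n_steps: int = 3) -> str:
    """Fix the conclusion first, then ask the model to argue backward toward it."""
    return (
        f'Fix the conclusion as: "{conclusion}"\n'
        f"Design a {n_steps}-step argument backward toward this conclusion, "
        "then write the finished piece. Do not change the conclusion."
    )

prompt = backward_design_prompt(
    "Investing in AI safety is not optional; it is essential."
)
print(prompt)
```

The explicit “do not change the conclusion” line is the convergence constraint: it keeps the argument from drifting mid-document.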
4) [Core Point Research 1-3] “AI Thinks in English”: For Multilingual Work, You Must Specify the “Intermediate Thinking Language” to Raise Quality
One of the most practical tips in the original text is this.
Even if input/output can be multilingual, an observation suggests that in the model’s intermediate layers, English may function like a “universal thinking language.”
4-1) Practical Takeaway: Even with Korean Instructions, Split It into “Intermediate Structuring in English, Final Output in Korean”
Example template (common to summarization/research/translation):
“Step 1) Analyze and extract key claims in English (bullet points).
Step 2) Verify potential weak points (sources needed / ambiguous claims) in English.
Step 3) Final answer in Korean with concise structure.”
Why this works is simple.
If the model first organizes concepts and develops logic once in its stronger language (English), and then outputs only in Korean, sentence quality and logical density often improve together.
The difference shows up clearly especially in industry analysis where terminology is complex, such as global supply chains, semiconductors, and energy.
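The three-step template above can be made reusable across tasks. This is a sketch under the assumption that the output language varies per job; the constant and function names are illustrative.

```python
# "English intermediate structuring, local-language output" as a reusable template.

INTERMEDIATE_STEPS = """\
Step 1) Analyze and extract the key claims in English (bullet points).
Step 2) Verify potential weak points (sources needed / ambiguous claims) in English.
Step 3) Write the final answer in {output_language} with a concise structure."""

def multilingual_prompt(task: str, output_language: str = "Korean") -> str:
    """Prepend the task, then force English intermediate structuring."""
    return task + "\n\n" + INTERMEDIATE_STEPS.format(output_language=output_language)

prompt = multilingual_prompt("Summarize this semiconductor supply-chain report.")
print(prompt)
```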
5) [News Briefing] The Trap of Reasoning Models: Even If They Show “Chain-of-Thought (CoT),” It May Not Be “Real Thinking”
The key point of the second piece of research is this.
We believed we could trust the model because it shows its reasoning process, but that displayed reasoning may be only partially faithful, or reconstructed after the fact into a plausible narrative.
In other words, CoT can be contaminated into a “persuasion tool” rather than serving as a “verification tool.”
What matters here is when you apply it to investment and business decision-making.
The longer and more plausible the reasoning text, the more readily people accept it, and that translates directly into decision risk.
6) [Practical Prescription] Three Prompt Designs That Turn “AI Hides It” to Your Advantage
6-1) Scratch Pad: Separate a “Notepad Space” to Reduce Answer Contamination
The core is separating the “thinking space” from the “final answer.”
People also make more mistakes if they try to do everything in their head without notes.
LLMs are similar: giving them a space to organize their thinking internally often improves accuracy.
Template example (XML tag format):
<scratchpad>
– Assumptions:
– Facts to verify:
– Draft reasoning steps:
</scratchpad>
<answer>
(Final answer only, concisely)
</answer>
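On the consuming side, the final answer can be pulled out mechanically so the scratchpad never leaks into downstream documents. A minimal sketch, assuming the XML-tag format shown above:

```python
import re

def extract_answer(response: str) -> str:
    """Return only the <answer> body, discarding the scratchpad."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    # Fall back to the whole response if the model skipped the tags.
    return match.group(1).strip() if match else response.strip()

raw = "<scratchpad>- Assumptions: none</scratchpad><answer>42</answer>"
print(extract_answer(raw))  # → 42
```

Keeping the scratchpad around (in logs, not in deliverables) is still useful: it is the first place to look when an answer goes wrong.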
6-2) Prompt Chaining: Don’t Finish in One Shot—Create “Step-by-Step, Verifiable Deliverables”
In practice, this structure—Step 1: extract 3 perspectives → Step 2: attach evidence to each perspective → Step 3: write the conclusion → Step 4: verify with a checklist—dramatically reduces hallucinations.
Chaining example (for real-world documents/research):
Step 1) Break into 3 key claims
Step 2) For each claim, add evidence/counterexamples/uncertainty flags
Step 3) Draft the conclusion (state conditions/assumptions)
Step 4) Run a verification checklist (remove unsupported sentences, re-check numbers/dates, add opposing scenarios)
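The four-step chain above can be sketched as a simple loop in which each step’s output feeds the next prompt. Everything here is illustrative: `call_llm` is a stand-in for whatever model client you actually use, and the step wording mirrors the list above.

```python
# A sketch of prompt chaining. `call_llm` is a placeholder, not a real API.

def call_llm(prompt: str) -> str:
    # In real use: send `prompt` to your model and return its text response.
    return f"[model output for: {prompt[:40]}...]"

STEPS = [
    "Break the following material into 3 key claims:\n{previous}",
    "For each claim below, add evidence, counterexamples, and uncertainty flags:\n{previous}",
    "Draft the conclusion, stating conditions and assumptions:\n{previous}",
    "Run a verification checklist on this draft: remove unsupported sentences, "
    "re-check numbers/dates, add opposing scenarios:\n{previous}",
]

def run_chain(material: str) -> list[str]:
    outputs, previous = [], material
    for template in STEPS:
        previous = call_llm(template.format(previous=previous))
        outputs.append(previous)  # keep every intermediate deliverable for review
    return outputs

results = run_chain("Quarterly semiconductor market report ...")
```

The point of returning every intermediate output, not just the last one, is that each step becomes a verifiable deliverable you can inspect or re-run in isolation.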
6-3) Instead of “Don’t Do It,” Provide “What You Can Do”: Prohibition-Style Instructions Can Worsen Performance
This is a repeated point in the original flow.
If you list “don’ts” at length, the model may keep activating the prohibited items, and answer quality can drop.
Instead, it’s safer to provide “allowed scope/accepted format/evidence standards.”
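To make the contrast concrete, here are the two instruction styles side by side as constants. The specific wording is illustrative, not quoted from the article.

```python
# Prohibition-style instructions keep the forbidden concepts in play.
PROHIBITION_STYLE = "Don't speculate. Don't invent sources. Don't use informal language."

# Allowed-scope instructions state what IS acceptable instead.
ALLOWED_SCOPE_STYLE = """\
Allowed scope: claims supported by the provided documents only.
Accepted format: formal prose with numbered citations.
Evidence standard: every numeric claim cites its source."""
```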
7) [The Truly Important Point Most News/YouTube Doesn’t Say] From Now On, It’s Not a “Prompt” Fight—It’s a “Context (Token) Budget” Fight
Most content ends at “you just need to write prompts well,” but I see this as the essence.
With today’s high-performance models, the more they “think,” the more costs (tokens) and time increase, and platforms are increasingly enforcing tighter usage limits.
So going forward, practical outcomes will split like this.
Even for the same task, Team A tries to do it in one long shot, hits token limits, and gets vague results, while Team B designs the context and produces verifiable results through short chains.
From a company perspective, this is an operating expense issue, and from an individual perspective, it’s a productivity issue.
In the end, generative AI ROI will be determined not only by “model capability,” but by “context engineering.”
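The budget framing can be sketched numerically. Note the heuristic here is a rough assumption: real token counts depend on the model’s tokenizer, and 4 characters per token is only a ballpark for English text.

```python
# A rough sketch of context-budget planning (heuristic, tokenizer-dependent).

def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_budget(prompt: str, expected_output_tokens: int, budget: int) -> bool:
    """Check whether prompt + expected output stays inside the context budget."""
    return estimate_tokens(prompt) + expected_output_tokens <= budget

# If a single long prompt blows the budget, split it into chained steps instead.
long_prompt = "word " * 4000
print(fits_budget(long_prompt, expected_output_tokens=1000, budget=4000))  # → False
```

When `fits_budget` fails, that is the signal to switch from Team A’s one long shot to Team B’s short chain.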
To summarize, prompts going forward are not “tips for writing well,” but are closer to
the ability to design business processes as documents.
This shift aligns perfectly with the digital transformation trend of standardizing and automating work.
8) [Connecting to an Economics/Market Perspective] Why This Connects to the Global Economic Outlook
When LLM utilization moves from “personal productivity” to “operational efficiency,” the effects begin to show up directly in the numbers: fixed-cost reduction, shorter research lead times, and faster decision-making.
Especially in a period where recession risk remains and funding costs swing with central bank policy changes,
the first thing companies cut is “knowledge-work processes that take a lot of time and money.”
When generative AI enters here, it changes not merely productivity but the cost structure itself.
Ultimately, the AI trend is starting to enter as a variable that changes next quarter’s earnings and valuation frameworks, not just tech news.
< Summary >
LLMs are not just “next-word predictors”; the evidence has strengthened that they internally pass through intermediate steps, a kind of two-stage reasoning.
So prompts are shifting from one-shot questions to workflow (step) design.
Even if LLMs can handle multilingual input/output, intermediate thinking may converge on English, so “intermediate structuring in English, final output in Korean” improves real-world quality.
A reasoning model’s CoT may not be “real thinking,” so you should use scratchpads and prompt chaining to produce verifiable deliverables step by step.
From now on, the decisive factor is not prompt skill but context engineering that designs the context (token) budget.
*Source: [ 티타임즈TV ]
– AI Thinks in English! It Also Hides Its Thinking! How Can You Use That to Your Advantage? (Dr. Kang Su-jin)


