BlogEngineering

Diffusion LLMs & performance: what happens when the model is no longer the bottleneck?

Mercury 2 is an early signal of what changes when model latency drops fast: the bottleneck moves, and suddenly the rest of your stack matters.

Nathan Booker/Co-Founder & CPO

March 27, 20268 min read

Diffusion LLMs & performance: what happens when the model is no longer the bottleneck?

At Runtype, we started routing chat experiences through Mercury 2, and the radically different performance profile of this new class of model had us looking closely at the rest of our stack.

Mercury 2 is one of the earliest production signals for what happens when model latency & throughput radically improves. Inception reports 1,009 tokens per second on NVIDIA Blackwell GPUs [1]. In our own production testing, we saw a separate but related shift: the model response stopped dominating the request timeline the way it used to. In the autoregressive world, the LLM call often felt like roughly 95% of total latency in our stack. With Mercury 2, it was closer to 50%.

This latency shift is a Runtype internal observation, not a vendor benchmark. But it changed what we optimized; other parts of the request flow were no longer second-order terms for performance, but back in the spotlight.

Illustrative Runtype internal observation, not an industry benchmark.

Sonnet 4.6 vs Mercury 2 in Runtype Evals — same prompt, compared on speed, quality, and cost.

That changes the product equation. Prompt assembly, context retrieval, response serialization, and frontend rendering stop being rounding errors and start shaping the user experience.

This is great news. AI products can finally feel instant — responsive enough for real-time voice, in-editor code completion, agentic loops that iterate dozens of times without the user going to make coffee. But it also means the stack around the model needs to keep up.

What product builders need to understand about diffusion

Traditional LLMs write one token at a time. Diffusion language models generate and refine text in parallel. You do not need the full denoising lecture to make a product decision. The important consequence is that latency moves.

Mercury 2 generating a complete response in real time. Source: Inception Labs [1].

When the model gets much faster, the user starts feeling delays in retrieval, orchestration, rendering, and tool calls instead. That is the real wake-up call here. Diffusion is not just "the same product, but nicer." It changes which product experiences become viable and which parts of your stack suddenly matter.

Inception describes Mercury 2 as converging over "a small number of steps" rather than hundreds of left-to-right decoding steps [1]. If you want a concrete mental model, Goedecke's explainer suggests thinking in terms of roughly a dozen denoising passes for a diffusion response, with fewer passes buying more speed at the cost of output quality [4]. That tunability is one reason diffusion changes the latency conversation so much.

The important caveat: the speedup is not uniform across every workload. Diffusion tends to look strongest on bounded responses where parallel refinement can amortize those denoising passes, and weaker on tiny answers, highly variable-length writing, or very long-context workloads where the architecture has to do more repeated attention work [4].

One technical clarification worth making explicit: diffusion is an architecture around the generation loop, not a wholesale replacement for transformers. Inception says Mercury Coder uses a transformer as its denoising network; the difference is that the model predicts how to improve a block of text, not just the next token logits [5].

Mercury 2 is the most concrete early production signal right now. Inception positions it for realtime voice, customer support, enterprise search, and workflow subagents, with a 128K context window, tool use, structured output, and tunable reasoning [1] [2]. Google's experimental Gemini Diffusion is another sign this is becoming a real product category rather than a one-company experiment [3].

These are not small gains. This is the kind of latency shift that changes where product teams spend their time.

What this unlocks

Chat, support, and voice that feel instant

If the user is waiting on every turn, speed is not a backend metric. It is the product experience. Faster models make chat feel snappier, support flows feel more responsive, and voice interactions feel less like talking over lag.

Agent loops people will actually wait for

An agent that chains 20 inference calls per task has its latency multiplied 20x. When per-call latency falls from multi-second waits toward sub-second turns, you go from "go get coffee" territory to something users will actually wait for.

Faster inference also means you can afford more steps. More retries, more tool calls, more self-correction. The quality ceiling rises because you're not burning your latency budget on the model.

Code assist that keeps momentum

At these speeds, suggestions can show up fast enough to stay inside the user's flow. That matters for code completion, operator copilots, and any interface where hesitation breaks concentration.

RAG that still feels responsive

Multi-hop retrieval is brutal with autoregressive models. Retrieve, summarize, retrieve again, synthesize — each step waiting on model inference. With diffusion models, you can add reasoning to the retrieval loop without blowing your latency budget.

Where we'd benchmark carefully before defaulting to diffusion

So if diffusion is this much faster, why not just move everything over to it? These are a new class of models with a different set of strengths and weaknesses; this is what we'd recommend evaluating. When speed matters more than accuracy or strong reasoning, Mercury really shines.

Mercury 2 PinchBench results showing competitive success rates against frontier models with significantly faster execution times — Mercury 2 on PinchBench — competitive agent performance, compared on success rate and execution time. Source: Inception [6].

Long-form content with variable length

Our current guidance is to benchmark long-form generation carefully. Diffusion's speed advantage is easiest to feel on short, bounded responses: chat turns, code completions, structured extractions. For reports, long documents, or open-ended writing, we would still compare against strong autoregressive models for coherence, output control, and how gracefully they handle variable-length answers. Goedecke's framing is useful here too: diffusion systems can still produce long outputs, but fixed-size block generation makes "how long should this answer be?" more of an architectural benchmark question than a default win [4].

Hard reasoning and planning

Mercury 2 includes tunable reasoning [1], and that will make it a good fit for plenty of real workloads. But if the task depends on deep multi-step reasoning, long troubleshooting chains, or the model revising its own plan as it goes, reasoning-oriented autoregressive models are still our safer default today. One reason to stay cautious is that "change your mind mid-generation" reasoning maps cleanly onto token-by-token decoding and less obviously onto block denoising. That does not mean diffusion cannot reason; it means the open question is more interesting than a simple yes/no. Could denoising passes themselves become a form of test-time reasoning? Could future models spend far more passes when the task demands it? That is still unsettled, and it is probably the most interesting research question in this category right now [4].

Tiny outputs like routing and classification

If you only need a label, a route, or a yes/no decision, the answer is so short that raw generation speed may not matter much. A diffusion model may still need to pay for its denoising passes even when the final answer is just a couple of tokens, so the architectural advantage can shrink or disappear on these micro-responses [4]. Benchmark the full workflow, not just the model architecture.

Huge contexts

If your product regularly pushes 50K+ tokens into the prompt, benchmark carefully. Long-context behavior can matter more than raw generation speed, and the win you see in chat may narrow or disappear. The crisp intuition is that autoregressive decoding gets to lean on KV cache as it steps forward token by token, while diffusion-style block updates cannot cache attention scores between denoising passes as cleanly because each pass can change every token in the current output block [4].

The playbook

1. Audit your stack before you swap models

The biggest lesson from our Mercury 2 integration: the model isn't your bottleneck anymore, so everything else becomes visible.

Before switching, instrument your full pipeline:

Prompt assembly time (context retrieval, template rendering)
Network latency to/from your inference provider
Response parsing and serialization
Database queries triggered by tool calls
Frontend time-to-first-paint for streamed responses

If your model does the work in 200ms but your retrieval layer adds 500ms, a faster model just makes that 500ms more painful.

2. Match the model to the job

This is our current routing heuristic based on vendor materials plus Runtype's own testing, not a market-wide benchmark table.

What you're building	Default choice	Why
Chat, support, copilots	Mercury 2	When speed is part of UX, latency matters more than perfect prose
Code completion	Mercury Coder	Fast suggestions protect flow state
Multi-step agents	Mercury 2 + Claude Sonnet	Use diffusion for fast turns, reasoning models for the hard parts
Reports, docs, long replies	Claude Sonnet / GPT-5	Our current safer default for long, variable-length output
Planning and hard reasoning	o3 / Claude with extended thinking	Our current safer default for deep multi-step logic
Classification and routing	Haiku 4.5 / GPT-5 Mini / Llama 4	The response is so short that diffusion may not matter much
RAG over moderate context	Mercury 2	Faster retrieve-answer loop
RAG over huge context (100K+)	Claude / Gemini	Benchmark long-context behavior first

3. Route between architectures

This is where platforms like Runtype earn their keep. The most effective production architectures aren't single-model — they route different tasks to different models based on the latency-quality-cost tradeoff each step needs. A customer support agent uses Mercury 2 for conversational turns (where speed is UX) and Claude for complex resolution logic (where reasoning depth matters). The user gets instant responses while the hard thinking happens on the right architecture for the job.

The practical adoption pattern is not "replace every model with diffusion." It is "move the latency-sensitive turns to diffusion and keep deep reasoning where it belongs."

4. Prepare your streaming infrastructure

This is the production gotcha product teams will actually feel. Diffusion models are fast enough that your streaming layer — not the model — becomes the bottleneck.

We hit this at Runtype. Our SSE implementation and frontend rendering were designed around much slower autoregressive streams. When Mercury 2 responses sped up enough through the same path, the UI stuttered. Responses arrived so quickly that they rendered in awkward jumps instead of feeling smooth. We had to batch chunks and pace rendering on the client side. That is a Runtype internal observation, not a vendor benchmark, but the lesson was simple: if your product was tuned for slower models, a faster model can expose UX problems you did not know you had.

If your streaming infrastructure was built for autoregressive speeds, test it at 10x throughput before going live. The model is no longer the slow part.

5. Watch workflow expansion, not just per-token price

Mercury 2 is listed at $0.25/1M input tokens and $0.75/1M output tokens [1]. But speed changes behavior.

Here's the napkin math: an agent running 20 Mercury 2 calls at 500 output tokens each uses 10,000 output tokens, or $0.0075 in output-token cost per run. If that workflow expands to 50 calls because the latency budget now allows it, you're at 25,000 output tokens, or about $0.019, before you even count input tokens. Lower latency invites more steps, more retries, and more tool calls.

The speed advantage changes what you can afford to do. Make sure you've decided what you should do first.

Diffusion models matter because they change product design, not because they make for a flashy benchmark chart. If your product lives or dies on responsiveness, you should test them. If your product depends on long-form generation, huge contexts, or hard reasoning, benchmark selectively rather than assuming diffusion is the new default.

The best AI products will route between architectures the way good backends route between databases.

We're building Runtype around this exact problem: helping product teams route fast conversational turns, deeper reasoning, and tool-heavy workflows across the right models. If you're evaluating diffusion models for production, start building on Runtype.

References

Inception Labs. "Introducing Mercury 2." 2026. https://www.inceptionlabs.ai/blog/introducing-mercury-2
Inception Labs. "Our Models." 2026. https://www.inceptionlabs.ai/models
Google. "Gemini Diffusion: Google DeepMind's experimental research model." 2026. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-diffusion/
Goedecke, Sean. "Strengths and limitations of diffusion language models." May 22, 2025. https://www.seangoedecke.com/limitations-of-text-diffusion-models/
Inception Labs. "Introducing Mercury, the World's First Commercial-Scale Diffusion Large Language Model." 2025. https://www.inceptionlabs.ai/blog/introducing-mercury
Inception Labs. "Mercury 2 on PinchBench: Fast Diffusion Models and the Personal Agent Era." 2026. https://www.inceptionlabs.ai/blog/mercury-2-on-pinchbench

diffusion-modelsperformancearchitectureagentsinference