Mercury 2
Fastest reasoning LLM built for instant production AI
Summary: Mercury 2 is a diffusion-based reasoning large language model that generates tokens in parallel, achieving over 1,000 tokens per second. It replaces sequential decoding with iterative parallel refinement to deliver low-latency, high-quality reasoning outputs suitable for real-time applications.
What it does
Mercury 2 uses a diffusion-based architecture to generate tokens in parallel rather than sequentially, enabling a 5x speed increase over traditional autoregressive models. This approach reduces latency in multi-step agentic loops and real-time voice applications.
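To make the contrast with sequential decoding concrete, here is a toy sketch of parallel refinement. This is an assumed, simplified mechanic for illustration only, not Mercury 2's actual algorithm: decoding starts from a fully masked sequence, and each pass scores every position at once and commits a fraction of them, so the number of passes is far smaller than the sequence length.

```python
# Toy sketch of diffusion-style parallel refinement (illustrative only,
# not Mercury 2's real decoder). An autoregressive model would need one
# forward pass per token; here, every masked position is filled by
# committing a fraction of them on each pass.
import random

MASK = "<mask>"

def refine_step(tokens, vocab, commit_fraction=0.5, rng=random):
    """One refinement pass: 'predict' all masked positions in parallel,
    then commit a fraction of them (a stand-in for confidence-based commits)."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    if not masked:
        return tokens
    n_commit = max(1, int(len(masked) * commit_fraction))
    for i in rng.sample(masked, n_commit):
        tokens[i] = rng.choice(vocab)  # stand-in for the model's prediction
    return tokens

def decode(length, vocab, rng=random):
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        tokens = refine_step(tokens, vocab, rng=rng)
        steps += 1
    return tokens, steps

tokens, steps = decode(16, ["the", "cat", "sat"])
# A 16-token sequence finishes in a handful of passes, versus the 16
# sequential passes a left-to-right decoder would need.
```

The speedup in this toy comes from the same structural property the real model exploits: latency scales with the number of refinement passes rather than with sequence length.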
Who it's for
Mercury 2 is designed for developers building applications that require fast, reasoning-grade language generation with minimal latency. It is exposed through an OpenAI-compatible API, so existing client code can be reused with minimal changes.
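Because the API follows the OpenAI wire format, calling it amounts to sending a standard `/chat/completions` request to a different base URL. The sketch below builds such a request with only the standard library; the base URL, model identifier, and API key are placeholders (assumptions for illustration), so consult the provider's documentation for real values.

```python
# Minimal sketch of an OpenAI-compatible chat request, assuming a
# hypothetical base URL and model name -- replace with real values
# from the provider's documentation.
import json
import urllib.request

BASE_URL = "https://api.example.com/v1"  # assumption: OpenAI-compatible endpoint
MODEL = "mercury-2"                      # assumption: hypothetical model identifier

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions POST request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer YOUR_API_KEY",  # placeholder credential
        },
        method="POST",
    )

req = build_chat_request("Summarize diffusion decoding in one sentence.")
# Sending the request is left out here; urllib.request.urlopen(req) would
# perform the actual call against a live endpoint.
```

Because the request body matches the OpenAI schema, official OpenAI SDKs that accept a custom base URL can be used the same way.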
Why it matters
By drastically reducing inference time, Mercury 2 enables real-time AI interactions in workflows where per-call latency compounds across many sequential model calls, such as multi-step agent loops.