What is first-token latency?

First-token latency (FTL) is the time between sending a prompt and receiving the first token of the response. It matters most in interactive applications such as chat and coding assistants, where the user waits on the first visible output.

What is throughput in LLMs?

Throughput measures how many tokens per second a model generates after the first token. Higher throughput means faster completion of long responses.

How often is the benchmark updated?

The benchmark is updated monthly with the latest public data from providers and independent tests. The current dataset was last updated on June 12, 2026.

LLM Speed Benchmark 2026

Compare first-token latency and throughput across 15 models, with provider, category, and price listed for each. Updated 2026-06-13.

Model	Modalities	Provider	Category	First token	Throughput	Context (tokens)	Price / 1M tokens
Mistral Small	text	Mistral	fast	150 ms	190 tok/s	32k	$0.80
Gemini 2.0 Flash	text, vision, audio	Google	fast	160 ms	180 tok/s	1000k	$0.50
Llama 4 Scout	text, vision	Meta	fast	170 ms	175 tok/s	256k	$0.80
GPT-4o mini	text, vision	OpenAI	fast	180 ms	165 tok/s	128k	$0.75
Claude 3.5 Haiku	text, vision	Anthropic	fast	210 ms	145 tok/s	200k	$4.80
DeepSeek V3	text	DeepSeek	balanced	260 ms	120 tok/s	64k	$1.30
Llama 3.3 70B	text	Meta	balanced	290 ms	115 tok/s	128k	$1.60
Amazon Nova Pro	text, vision	AWS	balanced	310 ms	110 tok/s	300k	$4.00
GPT-4o	text, vision	OpenAI	frontier	320 ms	108 tok/s	128k	$12.50
Mistral Large	text	Mistral	frontier	340 ms	100 tok/s	128k	$8.00
Grok 3	text, vision	xAI	frontier	360 ms	105 tok/s	128k	$18.00
Gemini 2.5 Pro	text, vision, audio	Google	frontier	380 ms	95 tok/s	1000k	$11.25
Claude 3.7 Sonnet	text, vision	Anthropic	frontier	450 ms	85 tok/s	200k	$18.00
o3-mini	text	OpenAI	reasoning	850 ms	70 tok/s	200k	$5.50
DeepSeek R1	text	DeepSeek	reasoning	1200 ms	55 tok/s	64k	$2.74
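
The First token and Throughput columns can be combined into a rough end-to-end estimate for a response of a given length. The sketch below assumes throughput stays constant after the first token and ignores network jitter and queueing, so treat the result as a ballpark figure rather than a prediction. The function name `estimated_response_time_s` and the 500-token response length are illustrative choices, not part of the benchmark.

```python
def estimated_response_time_s(
    first_token_ms: float, tokens_per_s: float, output_tokens: int
) -> float:
    """Rough total time to receive `output_tokens` tokens.

    Assumption: the first token arrives after `first_token_ms`, and every
    later token arrives at a steady `tokens_per_s`.
    """
    if output_tokens < 1:
        return 0.0
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return first_token_ms / 1000 + (output_tokens - 1) / tokens_per_s


if __name__ == "__main__":
    # (first-token latency in ms, throughput in tokens/s) taken from the table.
    rows = {
        "Mistral Small": (150, 190),
        "Claude 3.7 Sonnet": (450, 85),
        "DeepSeek R1": (1200, 55),
    }
    for model, (ftl_ms, tps) in rows.items():
        total = estimated_response_time_s(ftl_ms, tps, output_tokens=500)
        print(f"{model}: about {total:.1f} s for 500 tokens")
```

For short replies the first-token term dominates; for long outputs the throughput term does, which is why the two columns can rank models differently.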

How we measure speed

First-token latency (FTL)

The time from sending a prompt to receiving the first token. Lower is better for interactive applications like chat and coding assistants.

Throughput (tokens/sec)

The rate at which the model generates tokens after the first one. Higher is better for long-form content, summarization, and batch jobs.
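
Both metrics can be computed from a single streamed response. The following is a minimal sketch, not the benchmark's actual harness: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` merely simulates a provider's streaming response with `time.sleep`. In practice the iterable would be whatever token or chunk iterator a provider's client returns.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float], int]:
    """Return (first_token_latency_s, throughput_tokens_per_s, token_count).

    The timer starts when this function is called, standing in for the moment
    the prompt is sent. Throughput counts only tokens after the first one.
    """
    start = clock()
    first_at: Optional[float] = None
    last_at: Optional[float] = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now  # arrival time of the first token
        last_at = now
        count += 1

    if first_at is None or last_at is None:
        return None, None, 0  # the stream produced no tokens

    ftl = first_at - start
    decode_time = last_at - first_at
    # Throughput is undefined for a single token or a zero-length interval.
    throughput = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return ftl, throughput, count


def fake_stream(n: int, first_delay_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming API response."""
    time.sleep(first_delay_s)
    yield "tok"
    for _ in range(n - 1):
        time.sleep(gap_s)
        yield "tok"


if __name__ == "__main__":
    ftl, tps, n = measure_stream(fake_stream(20, first_delay_s=0.05, gap_s=0.005))
    print(f"first token: {ftl * 1000:.0f} ms, throughput: {tps:.0f} tok/s, tokens: {n}")
```

The `clock` parameter is injectable so the measurement logic can be checked with fixed timestamps instead of real delays.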

Numbers are representative benchmarks collected from public provider documentation and independent tests. Actual performance varies by region, load, and prompt length. Last updated: 2026-06-12.
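
Because individual runs vary, a single measurement says little; one common approach is to repeat the request and report a median alongside a tail percentile. The sketch below assumes a nearest-rank percentile and uses made-up sample latencies purely for illustration. It does not describe how the figures on this page were aggregated.

```python
import math
import statistics
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize repeated first-token latency measurements (in milliseconds)."""
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": percentile(samples_ms, 95),
        "runs": len(samples_ms),
    }


if __name__ == "__main__":
    # Hypothetical latencies from five runs of the same prompt; one run hit a slow path.
    runs_ms = [150, 160, 155, 400, 152]
    print(summarize(runs_ms))
```

The median is robust to the occasional slow run, while the 95th percentile shows how bad the slow runs get.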