Which benchmarks are included?

The visualizer covers MMLU (general knowledge), HumanEval (code generation), GPQA Diamond (graduate-level reasoning), and MATH-500 (competition mathematics).

Can I filter the chart by provider or model size?

Yes. Use the provider dropdown and size filter to narrow the chart to specific vendors or parameter ranges.

How do I export a chart?

Click the “Export image” button above the active chart to download a PNG image for presentations or reports.

Is this tool free to use?

Yes. The Model Benchmark Visualizer is completely free, requires no signup, and runs entirely in your browser.

Model Benchmark Visualizer

Compare MMLU, HumanEval, GPQA, and MATH scores across major LLMs. Filter by provider and model size, then export any chart as an image.

Last updated: 2026-06-13

Filters: Provider, Model size (16 models shown)

MMLU Pro / MMLU

Massive Multitask Language Understanding. Measures knowledge across dozens of academic subjects. Higher is better.

[Bar chart: MMLU score for each of the 16 models, ranked from OpenAI o3 (92.9%) down to Gemini 1.5 Flash (79.0%). The MMLU column of the table below lists every value.]

Model	Provider	Size (parameters)	MMLU	HumanEval	GPQA Diamond	MATH-500
OpenAI o3	OpenAI	300B	92.9%	95.0%	83.7%	91.0%
Gemini 2.5 Pro	Google	200B	91.5%	94.0%	78.0%	87.0%
OpenAI o4-mini	OpenAI	200B	91.3%	93.0%	74.0%	86.0%
DeepSeek-R1	DeepSeek	671B	90.8%	92.0%	71.5%	86.0%
OpenAI GPT-4.5	OpenAI	175B	90.4%	92.0%	71.0%	85.0%
OpenAI GPT-4o	OpenAI	200B	88.7%	90.2%	53.6%	76.6%
Claude 3.5 Sonnet	Anthropic	175B	88.5%	92.0%	65.0%	78.0%
DeepSeek V3	DeepSeek	671B	88.5%	90.0%	59.0%	82.0%
Claude 3 Opus	Anthropic	200B	86.8%	84.9%	50.4%	61.0%
Llama 3.1 405B	Meta	405B	85.2%	82.0%	45.0%	69.0%
Gemini 1.5 Pro	Google	100B	85.0%	80.0%	46.0%	72.0%
Qwen 2.5 72B	Qwen	72B	85.0%	86.0%	48.0%	75.0%
Mistral Large 2	Mistral	123B	84.0%	85.0%	45.0%	70.0%
Llama 3 70B	Meta	70B	82.0%	81.7%	40.0%	65.0%
Claude 3.5 Haiku	Anthropic	20B	81.9%	75.0%	40.0%	65.0%
Gemini 1.5 Flash	Google	15B	79.0%	74.0%	35.0%	60.0%

About these benchmarks

What the scores mean

Scores are reported as percentages; higher is better. MMLU measures broad knowledge, HumanEval measures code generation, GPQA Diamond tests graduate-level reasoning, and MATH-500 measures competition-level math problem solving.

How to use this visualizer

Filter by provider or model size, switch between benchmarks, and export any chart as a PNG for reports or presentations. Use the table to compare all four benchmarks side by side.
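If you would rather work with the numbers outside the browser, the same provider and size filtering can be approximated in a few lines of Python. This is a minimal sketch, not the visualizer's own code: the `ModelScore` class, the `filter_models` function, and the field names are hypothetical, and only a handful of rows are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    """One table row. All names here are illustrative, not the tool's API."""
    name: str
    provider: str
    size_b: int       # size in billions of parameters, as listed in the table
    mmlu: float       # all scores are percentages
    humaneval: float
    gpqa: float
    math500: float


# A few rows copied from the table on this page.
ROWS = [
    ModelScore("OpenAI o3", "OpenAI", 300, 92.9, 95.0, 83.7, 91.0),
    ModelScore("Gemini 2.5 Pro", "Google", 200, 91.5, 94.0, 78.0, 87.0),
    ModelScore("Claude 3.5 Sonnet", "Anthropic", 175, 88.5, 92.0, 65.0, 78.0),
    ModelScore("Qwen 2.5 72B", "Qwen", 72, 85.0, 86.0, 48.0, 75.0),
    ModelScore("Claude 3.5 Haiku", "Anthropic", 20, 81.9, 75.0, 40.0, 65.0),
    ModelScore("Gemini 1.5 Flash", "Google", 15, 79.0, 74.0, 35.0, 60.0),
]


def filter_models(rows, provider=None, min_size_b=None, max_size_b=None):
    """Return rows matching the provider and the inclusive size range.

    Any argument left as None is ignored, which mirrors leaving a
    dropdown on "all".
    """
    selected = []
    for row in rows:
        if provider is not None and row.provider != provider:
            continue
        if min_size_b is not None and row.size_b < min_size_b:
            continue
        if max_size_b is not None and row.size_b > max_size_b:
            continue
        selected.append(row)
    return selected


if __name__ == "__main__":
    print("Google models:")
    for row in filter_models(ROWS, provider="Google"):
        print(f"  {row.name}: MMLU {row.mmlu}%")

    print("Models at or under 100B parameters:")
    for row in filter_models(ROWS, max_size_b=100):
        print(f"  {row.name} ({row.size_b}B): MMLU {row.mmlu}%")
```

Filters combine, so `filter_models(ROWS, provider="Anthropic", min_size_b=100)` keeps only Claude 3.5 Sonnet from this sample.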

Why benchmark scores matter

Pick the right model for the task

A model that tops the coding leaderboard may not be the best choice for research summarization. Use benchmark filters to match capability strengths with your use case.
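To see how rankings shift between benchmarks, you can sort the same models by two different score columns. The sketch below is illustrative only: the `SCORES` dictionary holds six rows copied from the table on this page, and `rank_by` is a hypothetical helper, not part of the visualizer.

```python
# HumanEval and GPQA Diamond percentages for six models from the table.
SCORES = {
    "OpenAI GPT-4o": {"humaneval": 90.2, "gpqa": 53.6},
    "Claude 3.5 Sonnet": {"humaneval": 92.0, "gpqa": 65.0},
    "DeepSeek V3": {"humaneval": 90.0, "gpqa": 59.0},
    "Qwen 2.5 72B": {"humaneval": 86.0, "gpqa": 48.0},
    "Mistral Large 2": {"humaneval": 85.0, "gpqa": 45.0},
    "Claude 3 Opus": {"humaneval": 84.9, "gpqa": 50.4},
}


def rank_by(scores, benchmark):
    """Return model names ordered from highest to lowest on one benchmark."""
    return sorted(scores, key=lambda name: scores[name][benchmark], reverse=True)


if __name__ == "__main__":
    coding_rank = rank_by(SCORES, "humaneval")
    reasoning_rank = rank_by(SCORES, "gpqa")

    # Print each model's position on both benchmarks, in coding order.
    for name in coding_rank:
        coding_pos = coding_rank.index(name) + 1
        reasoning_pos = reasoning_rank.index(name) + 1
        print(f"{name}: HumanEval #{coding_pos}, GPQA #{reasoning_pos}")
```

Among these six rows, OpenAI GPT-4o places second on HumanEval but third on GPQA Diamond, while Claude 3 Opus is last on HumanEval yet fourth on GPQA Diamond, so the order depends on which capability you care about.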

Balance capability and cost

Larger models often score higher, but smaller models can be dramatically cheaper and faster. Compare these charts with our pricing comparison and speed benchmark to find the best trade-off.
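One rough way to frame the trade-off is to set a minimum acceptable score and then pick the smallest model that clears it. The sketch below assumes that listed model size is a usable stand-in for cost and speed, which is only an approximation; actual pricing and latency are not part of this page's data. The `MODELS` list and `smallest_model_meeting` function are hypothetical names, and the rows are copied from the table on this page.

```python
# (name, size in billions of parameters, MATH-500 %) from the table.
MODELS = [
    ("OpenAI o3", 300, 91.0),
    ("DeepSeek V3", 671, 82.0),
    ("Llama 3.1 405B", 405, 69.0),
    ("Qwen 2.5 72B", 72, 75.0),
    ("Mistral Large 2", 123, 70.0),
    ("Llama 3 70B", 70, 65.0),
    ("Gemini 1.5 Flash", 15, 60.0),
]


def smallest_model_meeting(models, min_score):
    """Return the smallest (name, size, score) row with score >= min_score.

    Size is used here as an assumed proxy for cost. Returns None when
    no model reaches the threshold.
    """
    candidates = [row for row in models if row[2] >= min_score]
    if not candidates:
        return None
    return min(candidates, key=lambda row: row[1])


if __name__ == "__main__":
    for threshold in (60.0, 70.0, 80.0, 95.0):
        match = smallest_model_meeting(MODELS, threshold)
        if match is None:
            print(f"MATH-500 >= {threshold}%: no model in this sample")
        else:
            name, size_b, score = match
            print(f"MATH-500 >= {threshold}%: {name} ({size_b}B, {score}%)")
```

With a 70% threshold on this sample, the function returns Qwen 2.5 72B rather than one of the much larger models that also clear the bar. Swapping the size column for real per-token prices would make the comparison more meaningful.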