VePrompts

Model Benchmark Visualizer

Compare MMLU, HumanEval, GPQA, and MATH scores across major LLMs. Filter by provider and model size, then export any chart as an image.

Last updated: 2026-06-13

16 models

MMLU Pro / MMLU

Massive Multitask Language Understanding. Measures knowledge across dozens of academic subjects. Higher is better.

OpenAI o3: 92.9%
Gemini 2.5 Pro: 91.5%
OpenAI o4-mini: 91.3%
DeepSeek-R1: 90.8%
OpenAI GPT-4.5: 90.4%
OpenAI GPT-4o: 88.7%
Claude 3.5 Sonnet: 88.5%
DeepSeek V3: 88.5%
Claude 3 Opus: 86.8%
Llama 3.1 405B: 85.2%
Gemini 1.5 Pro: 85.0%
Qwen 2.5 72B: 85.0%
Mistral Large 2: 84.0%
Llama 3 70B: 82.0%
Claude 3.5 Haiku: 81.9%
Gemini 1.5 Flash: 79.0%

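| Model             | Provider  | Size | MMLU  | Coding | GPQA  | MATH  |
|-------------------|-----------|------|-------|--------|-------|-------|
| OpenAI o3         | OpenAI    | 300B | 92.9% | 95.0%  | 83.7% | 91.0% |
| Gemini 2.5 Pro    | Google    | 200B | 91.5% | 94.0%  | 78.0% | 87.0% |
| OpenAI o4-mini    | OpenAI    | 200B | 91.3% | 93.0%  | 74.0% | 86.0% |
| DeepSeek-R1       | DeepSeek  | 671B | 90.8% | 92.0%  | 71.5% | 86.0% |
| OpenAI GPT-4.5    | OpenAI    | 175B | 90.4% | 92.0%  | 71.0% | 85.0% |
| OpenAI GPT-4o     | OpenAI    | 200B | 88.7% | 90.2%  | 53.6% | 76.6% |
| Claude 3.5 Sonnet | Anthropic | 175B | 88.5% | 92.0%  | 65.0% | 78.0% |
| DeepSeek V3       | DeepSeek  | 671B | 88.5% | 90.0%  | 59.0% | 82.0% |
| Claude 3 Opus     | Anthropic | 200B | 86.8% | 84.9%  | 50.4% | 61.0% |
| Llama 3.1 405B    | Meta      | 405B | 85.2% | 82.0%  | 45.0% | 69.0% |
| Gemini 1.5 Pro    | Google    | 100B | 85.0% | 80.0%  | 46.0% | 72.0% |
| Qwen 2.5 72B      | Qwen      | 72B  | 85.0% | 86.0%  | 48.0% | 75.0% |
| Mistral Large 2   | Mistral   | 123B | 84.0% | 85.0%  | 45.0% | 70.0% |
| Llama 3 70B       | Meta      | 70B  | 82.0% | 81.7%  | 40.0% | 65.0% |
| Claude 3.5 Haiku  | Anthropic | 20B  | 81.9% | 75.0%  | 40.0% | 65.0% |
| Gemini 1.5 Flash  | Google    | 15B  | 79.0% | 74.0%  | 35.0% | 60.0% |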

About these benchmarks

What the scores mean

Scores are reported as percentages; higher is better. MMLU measures broad knowledge, HumanEval (the Coding column in the table) measures code generation, GPQA tests graduate-level science reasoning, and MATH-500, a 500-problem subset of the MATH benchmark, measures competition-level math problem solving.

How to use this visualizer

Filter by provider or model size, switch between benchmarks, and export any chart as a PNG for reports or presentations. Use the table to compare all four benchmarks side by side.
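
If you prefer to work with the numbers outside the visualizer, the same filter-and-sort step can be approximated in a few lines of Python. This is only a sketch: the `ModelScore` class and the `filter_and_rank` function are hypothetical names made up for illustration, not part of this site, and the rows are a small sample copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    """One table row. Field names are illustrative, not a site API."""
    name: str
    provider: str
    size_b: int    # size in billions of parameters, as listed in the table
    mmlu: float
    coding: float
    gpqa: float
    math: float


# A sample of rows copied from the table (not the full 16 models).
MODELS = [
    ModelScore("OpenAI o3", "OpenAI", 300, 92.9, 95.0, 83.7, 91.0),
    ModelScore("Gemini 2.5 Pro", "Google", 200, 91.5, 94.0, 78.0, 87.0),
    ModelScore("OpenAI GPT-4o", "OpenAI", 200, 88.7, 90.2, 53.6, 76.6),
    ModelScore("Claude 3.5 Sonnet", "Anthropic", 175, 88.5, 92.0, 65.0, 78.0),
    ModelScore("Qwen 2.5 72B", "Qwen", 72, 85.0, 86.0, 48.0, 75.0),
    ModelScore("Gemini 1.5 Flash", "Google", 15, 79.0, 74.0, 35.0, 60.0),
]


def filter_and_rank(models, benchmark, provider=None, max_size_b=None):
    """Keep models matching the optional filters, sorted best-first."""
    selected = [
        m for m in models
        if (provider is None or m.provider == provider)
        and (max_size_b is None or m.size_b <= max_size_b)
    ]
    return sorted(selected, key=lambda m: getattr(m, benchmark), reverse=True)


# Example: Google models ranked by GPQA score.
for m in filter_and_rank(MODELS, "gpqa", provider="Google"):
    print(f"{m.name}: {m.gpqa}%")
```

Passing `max_size_b=100` instead of a provider mimics the model-size filter; both filters can be combined.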

Why benchmark scores matter

Pick the right model for the task

A model that tops the coding leaderboard may not be the best choice for research summarization. Use benchmark filters to match capability strengths with your use case.
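
To illustrate how the leader can change with the benchmark, the sketch below takes three rows from the table and looks up the top scorer per column. The `SCORES` dictionary and the `best_for` helper are hypothetical names used only for this example.

```python
# Three rows copied from the table; keys match the table's column headers.
SCORES = {
    "OpenAI GPT-4o": {"MMLU": 88.7, "Coding": 90.2, "GPQA": 53.6, "MATH": 76.6},
    "Claude 3.5 Sonnet": {"MMLU": 88.5, "Coding": 92.0, "GPQA": 65.0, "MATH": 78.0},
    "DeepSeek V3": {"MMLU": 88.5, "Coding": 90.0, "GPQA": 59.0, "MATH": 82.0},
}


def best_for(scores, benchmark):
    """Return the name of the model with the highest score on one benchmark."""
    return max(scores, key=lambda name: scores[name][benchmark])


for benchmark in ("MMLU", "Coding", "GPQA", "MATH"):
    print(f"{benchmark}: {best_for(SCORES, benchmark)}")
```

Among these three models, each one leads on at least one column, which is why it helps to rank by the benchmark closest to your actual workload rather than by a single headline score.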

Balance capability and cost

Larger models often score higher, but smaller models can be dramatically cheaper and faster. Compare these charts with our pricing comparison and speed benchmark to find the best trade-off.
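
One rough way to explore this trade-off with the table alone is to find the smallest listed model that still clears a target score. The sketch below assumes that listed size is a loose stand-in for cost and speed, which is a simplification; real pricing and latency figures are not part of this table. The `ROWS` list and `smallest_meeting` function are hypothetical names for illustration.

```python
# Sample rows copied from the table: name, size in billions of parameters,
# and the Coding score. Size is used here only as a rough proxy for cost.
ROWS = [
    {"name": "OpenAI o3", "size_b": 300, "coding": 95.0},
    {"name": "Claude 3.5 Sonnet", "size_b": 175, "coding": 92.0},
    {"name": "Mistral Large 2", "size_b": 123, "coding": 85.0},
    {"name": "Qwen 2.5 72B", "size_b": 72, "coding": 86.0},
    {"name": "Llama 3 70B", "size_b": 70, "coding": 81.7},
    {"name": "Gemini 1.5 Flash", "size_b": 15, "coding": 74.0},
]


def smallest_meeting(rows, benchmark, threshold):
    """Return the smallest row whose score is at least `threshold`, or None."""
    candidates = [r for r in rows if r[benchmark] >= threshold]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r["size_b"])


pick = smallest_meeting(ROWS, "coding", 85.0)
print(pick["name"] if pick else "no model meets the target")
```

Raising the threshold pushes the pick toward larger models, so it is worth deciding how much score you actually need before paying for the top of the chart.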