Model Benchmark Visualizer
Compare MMLU, HumanEval, GPQA, and MATH scores across major LLMs. Filter by provider and model size, then export any chart as an image.
Last updated: 2026-06-13
16 models
MMLU Pro / MMLU
Massive Multitask Language Understanding. Measures knowledge across dozens of academic subjects. Higher is better.
| Model | Provider | Size | MMLU | Coding (HumanEval) | GPQA | MATH |
|---|---|---|---|---|---|---|
| OpenAI o3 | OpenAI | 300B | 92.9% | 95.0% | 83.7% | 91.0% |
| Gemini 2.5 Pro | Google | 200B | 91.5% | 94.0% | 78.0% | 87.0% |
| OpenAI o4-mini | OpenAI | 200B | 91.3% | 93.0% | 74.0% | 86.0% |
| DeepSeek-R1 | DeepSeek | 671B | 90.8% | 92.0% | 71.5% | 86.0% |
| OpenAI GPT-4.5 | OpenAI | 175B | 90.4% | 92.0% | 71.0% | 85.0% |
| OpenAI GPT-4o | OpenAI | 200B | 88.7% | 90.2% | 53.6% | 76.6% |
| Claude 3.5 Sonnet | Anthropic | 175B | 88.5% | 92.0% | 65.0% | 78.0% |
| DeepSeek V3 | DeepSeek | 671B | 88.5% | 90.0% | 59.0% | 82.0% |
| Claude 3 Opus | Anthropic | 200B | 86.8% | 84.9% | 50.4% | 61.0% |
| Llama 3.1 405B | Meta | 405B | 85.2% | 82.0% | 45.0% | 69.0% |
| Gemini 1.5 Pro | Google | 100B | 85.0% | 80.0% | 46.0% | 72.0% |
| Qwen 2.5 72B | Qwen | 72B | 85.0% | 86.0% | 48.0% | 75.0% |
| Mistral Large 2 | Mistral | 123B | 84.0% | 85.0% | 45.0% | 70.0% |
| Llama 3 70B | Meta | 70B | 82.0% | 81.7% | 40.0% | 65.0% |
| Claude 3.5 Haiku | Anthropic | 20B | 81.9% | 75.0% | 40.0% | 65.0% |
| Gemini 1.5 Flash | Google | 15B | 79.0% | 74.0% | 35.0% | 60.0% |
About these benchmarks
What the scores mean
Scores are reported as percentages where higher is better. MMLU measures broad knowledge, HumanEval measures code generation, GPQA tests graduate-level reasoning, and MATH-500 measures competition-level math problem solving.
How to use this visualizer
Filter by provider or model size, switch between benchmarks, and export any chart as a PNG for reports or presentations. Use the table to compare all four benchmarks side-by-side.
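If you prefer to work with the numbers outside the visualizer, the same filter-then-rank workflow can be approximated in a few lines of Python. This is only a sketch: the `ModelScore` class, the `filter_and_rank` helper, and the field names are hypothetical and are not part of the visualizer itself. The five rows are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    """One table row. Field names are illustrative, not an official schema."""
    model: str
    provider: str
    size_b: int    # size in billions of parameters, as listed in the table
    mmlu: float
    coding: float
    gpqa: float
    math: float


# A handful of rows copied from the table (scores in percent).
ROWS = [
    ModelScore("OpenAI o3", "OpenAI", 300, 92.9, 95.0, 83.7, 91.0),
    ModelScore("DeepSeek-R1", "DeepSeek", 671, 90.8, 92.0, 71.5, 86.0),
    ModelScore("Claude 3.5 Sonnet", "Anthropic", 175, 88.5, 92.0, 65.0, 78.0),
    ModelScore("Qwen 2.5 72B", "Qwen", 72, 85.0, 86.0, 48.0, 75.0),
    ModelScore("Claude 3.5 Haiku", "Anthropic", 20, 81.9, 75.0, 40.0, 65.0),
]


def filter_and_rank(rows, benchmark, provider=None, max_size_b=None):
    """Keep rows matching the optional filters, sorted best-first on `benchmark`."""
    selected = [
        r for r in rows
        if (provider is None or r.provider == provider)
        and (max_size_b is None or r.size_b <= max_size_b)
    ]
    return sorted(selected, key=lambda r: getattr(r, benchmark), reverse=True)


if __name__ == "__main__":
    # Rank models of at most 200B parameters by their coding score.
    for row in filter_and_rank(ROWS, "coding", max_size_b=200):
        print(f"{row.model:<20} {row.coding:.1f}%")
```

Passing `provider="Anthropic"` or a different benchmark name (`"mmlu"`, `"gpqa"`, `"math"`) mirrors the filter and benchmark switches described above.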
Why benchmark scores matter
Pick the right model for the task
A model that tops the coding leaderboard may not be the best choice for research summarization. Use benchmark filters to match capability strengths with your use case.
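One rough way to match strengths to a use case is to weight the benchmarks that matter for it and compare the weighted averages. The sketch below uses three rows from the table. The weight profiles and the helper names (`weighted_score`, `best_model`) are assumptions chosen for illustration, not a recommended scoring method.

```python
# Scores (percent) copied from the table for three similarly ranked models.
SCORES = {
    "OpenAI GPT-4o": {"mmlu": 88.7, "coding": 90.2, "gpqa": 53.6, "math": 76.6},
    "Claude 3.5 Sonnet": {"mmlu": 88.5, "coding": 92.0, "gpqa": 65.0, "math": 78.0},
    "DeepSeek V3": {"mmlu": 88.5, "coding": 90.0, "gpqa": 59.0, "math": 82.0},
}

# Hypothetical weight profiles: how much each benchmark counts for a use case.
CODING_WEIGHTS = {"coding": 0.7, "mmlu": 0.3}
MATH_WEIGHTS = {"math": 0.7, "gpqa": 0.3}


def weighted_score(scores, weights):
    """Weighted average of the benchmarks named in `weights`."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


def best_model(all_scores, weights):
    """Name of the model with the highest weighted score for this profile."""
    return max(all_scores, key=lambda m: weighted_score(all_scores[m], weights))


if __name__ == "__main__":
    print("Coding-heavy use case:", best_model(SCORES, CODING_WEIGHTS))
    print("Math-heavy use case:  ", best_model(SCORES, MATH_WEIGHTS))
```

With these particular weights, the coding-heavy profile favors Claude 3.5 Sonnet while the math-heavy profile favors DeepSeek V3, even though the two models have the same MMLU score. Different weights can change the outcome, so treat the result as a starting point rather than a verdict.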
Balance capability and cost
Larger models often score higher, but smaller models can be dramatically cheaper and faster. Compare these charts with our pricing comparison and speed benchmark to find the best trade-off.