AutoBench LLM Leaderboard
Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. Includes performance, cost, and latency metrics. Data updated on April 25, 2025.
More info on this benchmark run: AutoBench Run 2 Results. To learn more about AutoBench itself: AutoBench Release.
Overall Model Performance
Models ranked by AutoBench score. Lower cost ($ Cents) and latency (s) are better.
(Cost, latency, and P99 values are the cross-domain averages from the breakdown tables below.)
Model | AB Score | Avg Cost ($ Cents) | Avg Latency (s) | P99 Latency (s) |
o4-mini-2025-04-16 | 4.57 | 0.793 | 19.1 | 52.3 |
gemini-2.5-pro-preview-03-25 | 4.46 | 1.225 | 36.57 | 64.18 |
claude-3.7-sonnet:thinking | 4.39 | 4.32 | 45.8 | 82.6 |
gpt-4.1-mini | 4.34 | 0.145 | 15.38 | 29.19 |
grok-3-beta | 4.34 | 1.695 | 33.94 | 69.79 |
llama-3_1-Nemotron-Ultra-253B-v1 | 4.26 | 0.316 | 43.84 | 94.45 |
deepSeek-R1 | 4.26 | 0.516 | 84.77 | 223.47 |
o3-mini-2025-01-31 | 4.26 | 0.613 | 10.69 | 23.67 |
gemma-3-27b-it | 4.2 | 0.025 | 30.03 | 79.12 |
claude-3.7-sonnet | 4.2 | 1.134 | 15.53 | 32.86 |
llama-3.1-Nemotron-70B-Instruct-HF | 4.18 | 0.039 | 25.04 | 48.74 |
qwen-plus | 4.17 | 0.095 | 34.73 | 66.7 |
deepSeek-V3-0324 | 4.16 | 0.102 | 42.28 | 140.54 |
gemini-2.0-flash-001 | 4.16 | 0.035 | 5.76 | 8.82 |
grok-2-1212 | 4.1 | 0.847 | 11.74 | 23.32 |
deepSeek-V3 | 4.09 | 0.094 | 34.57 | 106.53 |
mistral-large-2411 | 4.05 | 0.525 | 29.18 | 96.77 |
llama-3.3-70B-Instruct | 4.02 | 0.036 | 31.03 | 73.7 |
llama-4-Scout-17B-16E-Instruct | 4 | 0.048 | 8.49 | 13.82 |
llama-4-Maverick-17B-128E-Instruct-FP8 | 4 | 0.067 | 9.76 | 23.11 |
gpt-4o-mini | 4 | 0.039 | 12.17 | 21.75 |
claude-3.5-haiku-20241022 | 3.99 | 0.183 | 10.8 | 17.98 |
nova-lite-v1 | 3.89 | 0.016 | 5.22 | 12.47 |
mistral-small-24b-instruct-2501 | 3.88 | 0.012 | 13.99 | 29.62 |
nova-pro-v1 | 3.83 | 0.138 | 5.65 | 9.93 |
Benchmark Comparison
Comparison of AutoBench scores with other popular benchmarks. AutoBench shows an 82.51% correlation with Chatbot Arena, 83.74% with the Artificial Analysis Intelligence Index, and 71.51% with MMLU. Models are sorted by AutoBench score; a sketch of how such correlations can be computed follows the table.
Model | AB Score | Chatbot Arena | Artificial Analysis Intelligence Index | MMLU |
o4-mini-2025-04-16 | 4.57 | null | 69.83 | 0.832 |
gemini-2.5-pro-preview-03-25 | 4.46 | 1439 | 67.84 | 0.858 |
claude-3.7-sonnet:thinking | 4.39 | 1303 | 57.39 | 0.837 |
gpt-4.1-mini | 4.34 | null | 52.86 | 0.781 |
grok-3-beta | 4.34 | 1402 | 50.63 | 0.799 |
llama-3_1-Nemotron-Ultra-253B-v1 | 4.26 | null | null | 0.69 |
deepSeek-R1 | 4.26 | 1358 | 60.22 | 0.844 |
o3-mini-2025-01-31 | 4.26 | 1305 | 62.86 | 0.791 |
gemma-3-27b-it | 4.2 | 1342 | 37.62 | 0.669 |
claude-3.7-sonnet | 4.2 | 1293 | 48.15 | 0.803 |
llama-3.1-Nemotron-70B-Instruct-HF | 4.18 | 1269 | 37.28 | null |
qwen-plus | 4.17 | 1310 | null | null |
deepSeek-V3-0324 | 4.16 | 1372 | 53.24 | 0.819 |
gemini-2.0-flash-001 | 4.16 | 1356 | 48.09 | 0.779 |
grok-2-1212 | 4.1 | 1288 | 39.23 | 0.709 |
deepSeek-V3 | 4.09 | 1318 | 45.58 | 0.752 |
mistral-large-2411 | 4.05 | 1249 | 38.27 | 0.697 |
llama-3.3-70B-Instruct | 4.02 | 1257 | 41.11 | 0.713 |
llama-4-Scout-17B-16E-Instruct | 4 | null | 42.99 | 0.752 |
llama-4-Maverick-17B-128E-Instruct-FP8 | 4 | 1271 | 50.53 | 0.809 |
gpt-4o-mini | 4 | 1272 | 35.68 | 0.648 |
claude-3.5-haiku-20241022 | 3.99 | 1237 | 34.74 | 0.634 |
nova-lite-v1 | 3.89 | 1217 | 32.53 | 0.59 |
mistral-small-24b-instruct-2501 | 3.88 | 1217 | 35.28 | 0.652 |
nova-pro-v1 | 3.83 | 1245 | 37.08 | 0.691 |
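The correlation figures quoted above can in principle be recomputed from this table. Below is a minimal sketch, assuming Pearson correlation over the models that have a published score for each benchmark; AutoBench's exact method (correlation type, model subset, score scaling) is not specified here, and only a few rows are copied in for brevity.

```python
import pandas as pd

# A few rows copied from the comparison table above (None = score not available).
rows = [
    ("o4-mini-2025-04-16",           4.57, None,   69.83, 0.832),
    ("gemini-2.5-pro-preview-03-25", 4.46, 1439.0, 67.84, 0.858),
    ("claude-3.7-sonnet:thinking",   4.39, 1303.0, 57.39, 0.837),
    ("grok-3-beta",                  4.34, 1402.0, 50.63, 0.799),
    ("deepSeek-R1",                  4.26, 1358.0, 60.22, 0.844),
    ("o3-mini-2025-01-31",           4.26, 1305.0, 62.86, 0.791),
]
df = pd.DataFrame(rows, columns=["model", "autobench", "chatbot_arena", "aa_index", "mmlu"])

num_cols = ["autobench", "chatbot_arena", "aa_index", "mmlu"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)  # None -> NaN, guarantee numeric dtypes

for other in ["chatbot_arena", "aa_index", "mmlu"]:
    sub = df.dropna(subset=["autobench", other])  # skip models without a published score
    r = sub["autobench"].corr(sub[other])         # Pearson correlation coefficient
    print(f"AutoBench vs {other}: r = {r:.2%}")
```

Running this over the full table (all 25 models) rather than the subset shown would approximate the percentages reported above.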
Performance Visualizations
Exploring relationships between AutoBench Rank, Latency, and Cost.
Rank vs. Average Cost
Rank vs. Average Latency
Rank vs. P99 Latency
Performance vs. Cost/Latency Trade-offs
Cost Breakdown per Domain ($ Cents/Response)
Average cost per response for each model in each domain; the final column is the average across all domains.
claude-3.5-haiku-20241022 | 0.329 | 0.154 | 0.161 | 0.163 | 0.152 | 0.178 | 0.172 | 0.205 | 0.152 | 0.161 | 0.183 |
claude-3.7-sonnet | 2.26 | 0.883 | 0.904 | 1.128 | 0.853 | 1.138 | 0.836 | 1.336 | 1.003 | 0.998 | 1.134 |
claude-3.7-sonnet:thinking | 7.97 | 2.547 | 2.257 | 2.741 | 2.584 | 2.936 | 6.542 | 10.232 | 2.794 | 2.593 | 4.32 |
deepSeek-R1 | 0.823 | 0.318 | 0.289 | 0.285 | 0.37 | 0.282 | 0.862 | 1.36 | 0.291 | 0.28 | 0.516 |
deepSeek-V3 | 0.144 | 0.087 | 0.08 | 0.08 | 0.073 | 0.073 | 0.112 | 0.147 | 0.07 | 0.077 | 0.094 |
deepSeek-V3-0324 | 0.155 | 0.064 | 0.08 | 0.071 | 0.074 | 0.071 | 0.152 | 0.165 | 0.069 | 0.121 | 0.102 |
gemini-2.0-flash-001 | 0.078 | 0.015 | 0.027 | 0.027 | 0.027 | 0.026 | 0.038 | 0.057 | 0.03 | 0.03 | 0.035 |
gemini-2.5-pro-preview-03-25 | 2.948 | 0.636 | 0.976 | 1.222 | 1.205 | 1.157 | 0.784 | 1.085 | 1.128 | 1.113 | 1.225 |
gemma-3-27b-it | 0.044 | 0.016 | 0.021 | 0.021 | 0.02 | 0.019 | 0.023 | 0.041 | 0.024 | 0.023 | 0.025 |
gpt-4.1-mini | 0.253 | 0.089 | 0.096 | 0.094 | 0.12 | 0.098 | 0.194 | 0.309 | 0.094 | 0.109 | 0.145 |
gpt-4o-mini | 0.059 | 0.037 | 0.03 | 0.033 | 0.028 | 0.031 | 0.04 | 0.063 | 0.03 | 0.034 | 0.039 |
grok-2-1212 | 1.304 | 0.551 | 0.661 | 0.664 | 0.646 | 0.687 | 0.952 | 1.721 | 0.614 | 0.673 | 0.847 |
grok-3-beta | 2.638 | 1.027 | 1.24 | 1.299 | 1.412 | 1.317 | 2.143 | 2.827 | 1.218 | 1.829 | 1.695 |
llama-3.1-Nemotron-70B-Instruct-HF | 0.055 | 0.027 | 0.032 | 0.034 | 0.034 | 0.035 | 0.042 | 0.062 | 0.032 | 0.032 | 0.039 |
llama-3.3-70B-Instruct | 0.051 | 0.021 | 0.031 | 0.033 | 0.029 | 0.032 | 0.04 | 0.055 | 0.031 | 0.032 | 0.036 |
llama-3_1-Nemotron-Ultra-253B-v1 | 0.511 | 0.173 | 0.167 | 0.166 | 0.223 | 0.151 | 0.605 | 0.848 | 0.166 | 0.154 | 0.316 |
llama-4-Maverick-17B-128E-Instruct-FP8 | 0.108 | 0.042 | 0.052 | 0.055 | 0.055 | 0.055 | 0.092 | 0.109 | 0.053 | 0.052 | 0.067 |
llama-4-Scout-17B-16E-Instruct | 0.08 | 0.033 | 0.041 | 0.043 | 0.038 | 0.042 | 0.055 | 0.069 | 0.038 | 0.039 | 0.048 |
mistral-large-2411 | 0.831 | 0.459 | 0.471 | 0.463 | 0.415 | 0.458 | 0.468 | 0.716 | 0.469 | 0.497 | 0.525 |
mistral-small-24b-instruct-2501 | 0.018 | 0.011 | 0.011 | 0.01 | 0.009 | 0.01 | 0.013 | 0.018 | 0.01 | 0.011 | 0.012 |
nova-lite-v1 | 0.024 | 0.012 | 0.012 | 0.013 | 0.013 | 0.013 | 0.018 | 0.029 | 0.012 | 0.013 | 0.016 |
nova-pro-v1 | 0.248 | 0.149 | 0.105 | 0.115 | 0.109 | 0.104 | 0.139 | 0.21 | 0.096 | 0.101 | 0.138 |
o3-mini-2025-01-31 | 0.929 | 0.435 | 0.347 | 0.375 | 0.411 | 0.385 | 0.96 | 1.558 | 0.371 | 0.355 | 0.613 |
o4-mini-2025-04-16 | 1.199 | 0.602 | 0.628 | 0.634 | 0.798 | 0.606 | 0.987 | 1.304 | 0.612 | 0.559 | 0.793 |
qwen-plus | 0.148 | 0.108 | 0.074 | 0.073 | 0.074 | 0.074 | 0.102 | 0.148 | 0.068 | 0.078 | 0.095 |
Average Latency Breakdown per Domain (Seconds)
Average response time for each model in each domain; the final column is the average across all domains.
claude-3.5-haiku-20241022 | 15.48 | 11.25 | 10.95 | 10.37 | 9.97 | 10.82 | 8.59 | 11.27 | 9.6 | 9.76 | 10.8 |
claude-3.7-sonnet | 23.57 | 16.24 | 14.73 | 16.55 | 13.08 | 17.49 | 10.55 | 14.46 | 13.77 | 14.83 | 15.53 |
claude-3.7-sonnet:thinking | 71.38 | 35.11 | 30.36 | 34.95 | 34.29 | 39 | 58.43 | 85.34 | 34.01 | 35.17 | 45.8 |
deepSeek-R1 | 132.9 | 72.63 | 45.19 | 49.89 | 63.33 | 45.66 | 136.12 | 205.27 | 49.02 | 47.69 | 84.77 |
deepSeek-V3 | 71.53 | 28.24 | 28.23 | 32.61 | 27.73 | 26.69 | 32.57 | 44.95 | 23.61 | 29.47 | 34.57 |
deepSeek-V3-0324 | 60.1 | 29.61 | 38.87 | 31.23 | 31.3 | 30.26 | 49.36 | 70.08 | 34.02 | 47.99 | 42.28 |
gemini-2.0-flash-001 | 10.77 | 3.26 | 4.94 | 4.9 | 5.14 | 4.94 | 5.33 | 7.55 | 5.5 | 5.24 | 5.76 |
gemini-2.5-pro-preview-03-25 | 51.62 | 23.1 | 25.99 | 29.23 | 32.35 | 29.55 | 49.82 | 68.76 | 27.3 | 27.97 | 36.57 |
gemma-3-27b-it | 57.3 | 18.12 | 26.05 | 21.7 | 24.51 | 25.17 | 23.42 | 40.69 | 34.57 | 28.76 | 30.03 |
gpt-4.1-mini | 24.38 | 9.05 | 11.06 | 11.79 | 14.19 | 12.07 | 17.77 | 30.85 | 11.08 | 11.55 | 15.38 |
gpt-4o-mini | 16.86 | 11.6 | 10.77 | 11.06 | 10.29 | 10.93 | 11.29 | 18.05 | 10.2 | 10.68 | 12.17 |
grok-2-1212 | 16.88 | 8.21 | 9.83 | 10.24 | 9.54 | 10.44 | 12.2 | 20.29 | 9.47 | 10.32 | 11.74 |
grok-3-beta | 44.1 | 28.57 | 28.82 | 30.47 | 35.2 | 30.32 | 37.7 | 42.02 | 26.85 | 35.39 | 33.94 |
llama-3.1-Nemotron-70B-Instruct-HF | 35.44 | 17.3 | 21.43 | 23.43 | 23.41 | 23.64 | 24.97 | 37.67 | 21.89 | 21.21 | 25.04 |
llama-3.3-70B-Instruct | 42.84 | 19.57 | 26.71 | 33.2 | 26.5 | 27.23 | 31.8 | 42.4 | 32.56 | 27.52 | 31.03 |
llama-3_1-Nemotron-Ultra-253B-v1 | 70.17 | 23.7 | 23.43 | 24.11 | 31.86 | 21.37 | 80.5 | 116.39 | 24.5 | 22.37 | 43.84 |
llama-4-Maverick-17B-128E-Instruct-FP8 | 19.09 | 5.48 | 7.29 | 8.48 | 7.92 | 7.71 | 11.39 | 15.29 | 7.93 | 6.97 | 9.76 |
llama-4-Scout-17B-16E-Instruct | 15.31 | 5.74 | 7.21 | 7.8 | 6.92 | 7.76 | 8.66 | 11.21 | 7.25 | 7.04 | 8.49 |
mistral-large-2411 | 52.36 | 24.34 | 26.9 | 29.83 | 22.78 | 28.73 | 25.14 | 33.72 | 19.13 | 28.87 | 29.18 |
mistral-small-24b-instruct-2501 | 20.56 | 13.11 | 13.6 | 11.5 | 11.19 | 10.28 | 11.95 | 22.39 | 12.28 | 13.1 | 13.99 |
nova-lite-v1 | 6.84 | 4.6 | 5.93 | 4.74 | 4.62 | 4.55 | 4.67 | 6.3 | 4.74 | 5.24 | 5.22 |
nova-pro-v1 | 9.29 | 6.08 | 4.63 | 5.19 | 5.01 | 4.59 | 5.11 | 7.41 | 4.64 | 4.54 | 5.65 |
o3-mini-2025-01-31 | 15.17 | 7.85 | 7.25 | 8.29 | 8.8 | 7.57 | 15.26 | 22.95 | 7.32 | 6.45 | 10.69 |
o4-mini-2025-04-16 | 25.56 | 14.83 | 14.82 | 15.96 | 19.49 | 15.95 | 21.58 | 34.85 | 14.75 | 13.26 | 19.1 |
qwen-plus | 52.6 | 40.48 | 28.34 | 28.7 | 29.42 | 28.43 | 33.31 | 49.76 | 26.57 | 29.67 | 34.73 |
P99 Latency Breakdown per Domain (Seconds)
99th-percentile response time for each model in each domain; the final column is the average across all domains.
claude-3.5-haiku-20241022 | 23.7 | 19.75 | 19.06 | 20.65 | 15.82 | 17.31 | 15.7 | 21.4 | 13.3 | 13.16 | 17.98 |
claude-3.7-sonnet | 34.5 | 42.15 | 35.2 | 24.45 | 22 | 41.91 | 32.66 | 45.31 | 21 | 29.41 | 32.86 |
claude-3.7-sonnet:thinking | 122.38 | 86.02 | 45.39 | 56.93 | 59.81 | 58.5 | 137.09 | 145.15 | 53.35 | 61.38 | 82.6 |
deepSeek-R1 | 265.87 | 557.13 | 73 | 123.61 | 195.61 | 68.46 | 393.58 | 391.41 | 97.07 | 68.91 | 223.47 |
deepSeek-V3 | 489.39 | 52.4 | 48.32 | 63.66 | 57.52 | 64.71 | 72.51 | 89.72 | 35.64 | 91.46 | 106.53 |
deepSeek-V3-0324 | 132.59 | 60.61 | 92.63 | 98.62 | 64.37 | 68.91 | 202.2 | 230.86 | 136.28 | 318.33 | 140.54 |
gemini-2.0-flash-001 | 13.29 | 6.62 | 7.15 | 7.43 | 9.12 | 6.65 | 10.66 | 12.31 | 7.46 | 7.56 | 8.82 |
gemini-2.5-pro-preview-03-25 | 78.66 | 32.41 | 37.16 | 59.49 | 48.05 | 40.25 | 128.55 | 137.5 | 39.82 | 39.92 | 64.18 |
gemma-3-27b-it | 205.53 | 47.1 | 49.13 | 39.73 | 59.89 | 77.24 | 62.05 | 98.21 | 80.09 | 72.23 | 79.12 |
gpt-4.1-mini | 39.52 | 15.25 | 20.58 | 32.46 | 25.08 | 28.67 | 36.87 | 52.4 | 19.65 | 21.41 | 29.19 |
gpt-4o-mini | 31.65 | 18.1 | 16.75 | 19.07 | 20.07 | 22.09 | 17.86 | 37.33 | 17.63 | 16.96 | 21.75 |
grok-2-1212 | 25.85 | 12.72 | 13.91 | 17.66 | 17.06 | 17.54 | 28.11 | 72.73 | 13.08 | 14.55 | 23.32 |
grok-3-beta | 80.37 | 50.31 | 69.86 | 44.77 | 81.45 | 56.9 | 90.91 | 87.52 | 45.89 | 89.96 | 69.79 |
llama-3.1-Nemotron-70B-Instruct-HF | 65.57 | 29.35 | 29.17 | 33.06 | 35.17 | 29.16 | 44.7 | 165.37 | 28.79 | 27.03 | 48.74 |
llama-3.3-70B-Instruct | 83.59 | 53.04 | 79.05 | 70.62 | 70.25 | 57.85 | 69.79 | 117.32 | 77.75 | 57.72 | 73.7 |
llama-3_1-Nemotron-Ultra-253B-v1 | 157.34 | 41.92 | 48.86 | 54.35 | 55.23 | 43.9 | 205.29 | 236.74 | 51.17 | 49.75 | 94.45 |
llama-4-Maverick-17B-128E-Instruct-FP8 | 80.03 | 10.11 | 13.5 | 15.19 | 12.64 | 13.25 | 18.17 | 42.55 | 12.47 | 13.21 | 23.11 |
llama-4-Scout-17B-16E-Instruct | 26.4 | 9.28 | 10.96 | 11.67 | 9.58 | 12 | 14.52 | 20.09 | 10.46 | 13.2 | 13.82 |
mistral-large-2411 | 157.72 | 54.36 | 82.25 | 77.68 | 69.55 | 160.87 | 136.52 | 98.81 | 29.76 | 100.24 | 96.77 |
mistral-small-24b-instruct-2501 | 36.17 | 26.9 | 21.63 | 20.29 | 16.93 | 19.97 | 32.45 | 75.08 | 21.31 | 25.44 | 29.62 |
nova-lite-v1 | 12.92 | 5.83 | 32.65 | 6.61 | 6.51 | 5.7 | 9.03 | 19.28 | 7.8 | 18.38 | 12.47 |
nova-pro-v1 | 15.72 | 11.25 | 6.86 | 10.76 | 7.67 | 8.17 | 9.15 | 14.68 | 6.84 | 8.2 | 9.93 |
o3-mini-2025-01-31 | 35.06 | 16.4 | 16.14 | 20.01 | 18.56 | 14.95 | 39.78 | 52.4 | 13.09 | 10.28 | 23.67 |
o4-mini-2025-04-16 | 57.74 | 39.5 | 24.57 | 39.62 | 48.97 | 33.24 | 70.85 | 164.19 | 25.11 | 19.22 | 52.3 |
qwen-plus | 77.04 | 72.72 | 55.77 | 55.36 | 69.05 | 48.73 | 68.49 | 121.78 | 41.3 | 56.74 | 66.7 |
Performance Across Different Domains
Model ranks within specific knowledge or task areas, on a 1-5 scale (higher is better); the final column is the cross-domain average and corresponds to the overall AutoBench score.
claude-3.5-haiku-20241022 | 3.85 | 4 | 4.07 | 4.15 | 4.05 | 4.11 | 4.2 | 3.98 | 4.04 | 3.44 | 3.99 |
claude-3.7-sonnet | 3.96 | 4.27 | 4.3 | 4.34 | 4.29 | 4.31 | 4.41 | 4.14 | 4.15 | 3.87 | 4.2 |
claude-3.7-sonnet:thinking | 4.18 | 4.48 | 4.48 | 4.54 | 4.45 | 4.48 | 4.48 | 4.4 | 4.32 | 4.06 | 4.39 |
deepSeek-R1 | 3.97 | 4.05 | 4.39 | 4.39 | 4.35 | 4.35 | 4.46 | 4.32 | 4.29 | 3.95 | 4.26 |
deepSeek-V3 | 4.04 | 4.01 | 4.12 | 4.06 | 4.13 | 4.14 | 4.32 | 4.11 | 4.08 | 3.91 | 4.09 |
deepSeek-V3-0324 | 4.07 | 4.25 | 4.13 | 4.18 | 4.11 | 4.17 | 4.33 | 4.22 | 4.17 | 3.97 | 4.16 |
gemini-2.0-flash-001 | 3.97 | 4.18 | 4.29 | 4.3 | 4.25 | 4.28 | 3.99 | 4.24 | 4.18 | 3.85 | 4.16 |
gemini-2.5-pro-preview-03-25 | 4.17 | 4.5 | 4.59 | 4.6 | 4.56 | 4.59 | 4.42 | 4.53 | 4.48 | 4.17 | 4.46 |
gemma-3-27b-it | 3.9 | 3.98 | 4.34 | 4.38 | 4.33 | 4.36 | 4.35 | 4.33 | 4.29 | 3.7 | 4.2 |
gpt-4.1-mini | 4.3 | 4.42 | 4.4 | 4.32 | 4.3 | 4.3 | 4.41 | 4.44 | 4.22 | 4.34 | 4.34 |
gpt-4o-mini | 3.82 | 3.97 | 4.07 | 4.03 | 4.07 | 4.1 | 4.2 | 3.97 | 4 | 3.79 | 4 |
grok-2-1212 | 3.92 | 4.12 | 4.14 | 4.16 | 4.19 | 4.17 | 4.17 | 4.16 | 4.08 | 3.87 | 4.1 |
grok-3-beta | 4.05 | 4.33 | 4.36 | 4.45 | 4.42 | 4.43 | 4.47 | 4.54 | 4.36 | 4.07 | 4.34 |
llama-3.1-Nemotron-70B-Instruct-HF | 3.99 | 4.1 | 4.29 | 4.32 | 4.3 | 4.32 | 4.3 | 4.27 | 4.2 | 3.68 | 4.18 |
llama-3.3-70B-Instruct | 3.93 | 3.83 | 4.21 | 4.13 | 4.15 | 4.17 | 4.02 | 4.1 | 4.07 | 3.52 | 4.02 |
llama-3_1-Nemotron-Ultra-253B-v1 | 4.06 | 4.17 | 4.36 | 4.34 | 4.31 | 4.33 | 4.38 | 4.36 | 4.33 | 3.91 | 4.26 |
llama-4-Maverick-17B-128E-Instruct-FP8 | 3.86 | 3.98 | 4.1 | 4.1 | 4.1 | 4.05 | 4.04 | 4.1 | 3.99 | 3.64 | 4 |
llama-4-Scout-17B-16E-Instruct | 3.89 | 3.88 | 4.09 | 4.11 | 4.14 | 4.1 | 4.04 | 4.09 | 4.04 | 3.53 | 4 |
mistral-large-2411 | 3.87 | 3.98 | 4.18 | 4.09 | 4.17 | 4.08 | 4.19 | 4.05 | 4.07 | 3.88 | 4.05 |
mistral-small-24b-instruct-2501 | 3.66 | 3.86 | 4.08 | 4.02 | 4.09 | 4.05 | 3.42 | 3.94 | 4.01 | 3.59 | 3.88 |
nova-lite-v1 | 3.77 | 3.73 | 4.05 | 4.02 | 4.02 | 4.04 | 3.86 | 3.9 | 3.86 | 3.56 | 3.89 |
nova-pro-v1 | 3.74 | 3.81 | 3.86 | 3.82 | 3.86 | 3.9 | 4.06 | 3.91 | 3.78 | 3.56 | 3.83 |
o3-mini-2025-01-31 | 4.32 | 4.44 | 4.25 | 4.21 | 4.21 | 4.2 | 4.35 | 4.23 | 4.09 | 4.41 | 4.26 |
o4-mini-2025-04-16 | 4.48 | 4.55 | 4.61 | 4.61 | 4.67 | 4.59 | 4.51 | 4.6 | 4.57 | 4.57 | 4.57 |
qwen-plus | 4.1 | 4.23 | 4.24 | 4.21 | 4.22 | 4.17 | 4.3 | 4.19 | 4.06 | 4.03 | 4.17 |
About AutoBench
AutoBench is an LLM benchmark where Large Language Models (LLMs) evaluate and rank the responses generated by other LLMs. The questions themselves are also generated by LLMs across a diverse set of domains and ranked for quality.
Methodology
- Question Generation: High-quality questions across various domains (Coding, History, Science, etc.) are generated by selected LLMs.
- Response Generation: The models being benchmarked generate answers to these questions.
- Ranking: A set of ranking LLMs scores each model's response to every question on a 1-5 scale.
- Aggregation: Scores are averaged across questions and domains to produce the final AutoBench score (see the sketch below).
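As an illustration of the ranking and aggregation steps above, here is a minimal sketch. The record layout, the model and judge names, and the equal weighting of judges, questions, and domains are assumptions made for this example; the actual AutoBench pipeline may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judgement records: (answering_model, domain, question_id, ranking_model, rank 1-5)
judgements = [
    ("model-a", "coding",  "q1", "judge-1", 5),
    ("model-a", "coding",  "q1", "judge-2", 4),
    ("model-a", "history", "q2", "judge-1", 4),
    ("model-b", "coding",  "q1", "judge-1", 4),
    ("model-b", "history", "q2", "judge-2", 5),
]

# 1) Average the ranking models' scores per (answering model, domain, question).
per_question = defaultdict(list)
for model, domain, qid, judge, rank in judgements:
    per_question[(model, domain, qid)].append(rank)
question_scores = {key: mean(r) for key, r in per_question.items()}

# 2) Average question scores into a per-domain score for each model.
per_domain = defaultdict(list)
for (model, domain, _qid), score in question_scores.items():
    per_domain[(model, domain)].append(score)
domain_scores = {key: mean(s) for key, s in per_domain.items()}

# 3) Average domain scores into the final per-model score.
overall = defaultdict(list)
for (model, _domain), score in domain_scores.items():
    overall[model].append(score)
autobench_score = {model: round(mean(s), 2) for model, s in overall.items()}

print(autobench_score)  # {'model-a': 4.25, 'model-b': 4.5}
```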
Metrics
- AutoBench Score (AB): The average rank received by a model's responses across all questions/domains (higher is better).
- Avg Cost (USD Cents/response): Estimated average cost to generate one response, based on provider pricing for input and output tokens. Lower is better (see the computation sketch after this list).
- Avg Latency (s): Average time taken by the model to generate a response. Lower is better.
- P99 Latency (s): The 99th percentile of response time, indicating worst-case latency. Lower is better.
- Chatbot Arena / Artificial Analysis Intelligence Index / MMLU: Scores from other well-known benchmarks for comparison (where available).
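To make the cost and latency metrics concrete, the sketch below computes a per-response cost in USD cents and the average/P99 latency from a list of response times. The token counts, prices, and timings are illustrative placeholders, not values from this benchmark run.

```python
import numpy as np

def response_cost_cents(input_tokens, output_tokens, input_price_usd_per_m, output_price_usd_per_m):
    """Cost of a single response in USD cents, given provider prices per million tokens."""
    usd = (input_tokens / 1e6) * input_price_usd_per_m + (output_tokens / 1e6) * output_price_usd_per_m
    return usd * 100  # dollars -> cents

# e.g. a 400-token prompt and a 900-token answer at $0.15 / $0.60 per million tokens (hypothetical prices)
print(round(response_cost_cents(400, 900, 0.15, 0.60), 3))  # 0.06 cents

# Average and P99 latency over a run's per-response wall-clock times (seconds, illustrative)
latencies_s = np.array([8.2, 9.5, 10.1, 11.4, 12.0, 14.8, 18.3, 25.6, 31.2, 54.9])
print(round(latencies_s.mean(), 2))              # Avg Latency (s)
print(round(np.percentile(latencies_s, 99), 2))  # P99 Latency (s): 99th percentile of response times
```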
Data
This leaderboard reflects a run completed on April 23, 2025. The run includes recently released models such as o4-mini, GPT-4.1 mini, Gemini 2.5 Pro Preview, and Claude 3.7 Sonnet:thinking.
Disclaimer: Benchmark results provide one perspective on model capabilities. Performance can vary based on specific tasks, prompts, and API conditions. Costs are estimates and subject to change by providers. Latency depends on server load and geographic location.