AutoBench LLM Leaderboard
Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. Includes performance, cost, and latency metrics. Use the dropdown below to navigate between different benchmark runs.
Current Run: AutoBench Agentic Run 1 - April 2026 (2026-04-16) - 31 models
Overall Model Performance
Models ranked by AutoBench score. Lower cost (USD cents/response), latency (s), and fail rate (%) are better. Iterations shows the number of evaluations per model.
Overall Rankings
Benchmark Comparison
This run targets agentic performance; reference scores are drawn from agentic benchmarks alongside the Artificial Analysis Intelligence Index.
Comparison of AutoBench scores with reference benchmarks: AAI Index, Terminal-Bench Hard, GDPval-AA, and Tau2-Bench Telecom. Models are sorted by AutoBench score (higher is better).
Benchmark Correlations: AutoBench correlates at 82.71% with the Artificial Analysis Intelligence Index, 80.25% with Terminal-Bench Hard, 81.81% with GDPval-AA, and 66.45% with Tau2-Bench Telecom.
Model | AutoBench | AAI Index | Terminal-Bench Hard | GDPval-AA | Tau2-Bench Telecom |
Claude-opus-4.6 | 3.17 | 68 | 46 | 56 | 92% |
Claude-sonnet-4.6 | 3.13 | 63 | 53 | 58 | 76% |
Gemini-3.1-pro-preview | 3.1 | 59 | 54 | 41 | 96% |
GLM-5.1 | 3.06 | 67 | 43 | 52 | 98% |
Gpt-5.4 | 3.02 | 68 | 58 | 59 | 87% |
Qwen3.6-plus | 2.99 | 62 | 44 | 43 | 95% |
Mimo-V2-Pro | 2.99 | 63 | 41 | 46 | 95% |
Kimi-K2.5 | 2.92 | 59 | 35 | 39 | 96% |
Grok-4.20 | 2.92 | 54 | 38 | 27 | 93% |
Claude-haiku-4.5 | 2.92 | 40 | 27 | 34 | 55% |
GLM-4.7 | 2.92 | 55 | 32 | 35 | 96% |
Gpt-5.4-mini | 2.91 | 59 | 52 | 46 | 83% |
Minimax-m2.7 | 2.9 | 61 | 39 | 51 | 85% |
Gemini-3-flash-preview | 2.89 | 50 | 39 | 35 | 80% |
Grok-4.1-fast | 2.84 | 49 | 24 | 27 | 93% |
Qwen3.5-122b-a10b | 2.84 | 53 | 31 | 31 | 89% |
Qwen3.5-35b-a3b | 2.83 | 44 | 27 | 21 | 94% |
Gemini-3.1-flash-lite-preview | 2.82 | 26 | 24 | 21 | 31% |
Gpt-5.4-nano | 2.78 | 48 | 42 | 34 | 76% |
Gpt-oss-120b | 2.76 | 38 | 24 | 22 | 66% |
Minimax-m2.5 | 2.72 | 56 | 35 | 34 | 95% |
Gemma-4-31b-it | 2.7 | 41 | 36 | 31 | 60% |
Nemotron-3-super-120b-a12b | 2.7 | 40 | 29 | 25 | 68% |
Mistral-small-4 | 2.69 | 26 | 17 | 18 | 25% |
Nova-2-lite-v1 | 2.66 | 37 | 17 | 17 | 73% |
Gpt-oss-20b | 2.65 | 28 | 11 | 8 | 60% |
Nemotron-3-nano-30b-a3b | 2.63 | 19 | 14 | 4 | 41% |
Mistral-large-2512 | 2.62 | 22 | 16 | 18 | 41% |
Deepseek-v3.2 | 2.55 | 53 | 36 | 35 | 91% |
Gemma-4-26b-a4b-it | 2.53 | 32 | 14 | 26 | 44% |
Llama-4-maverick | 2.27 | 7 | 7 | 0 | 18% |
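The correlation figures quoted above can be reproduced from a table like this one. Below is a minimal sketch, assuming Pearson correlation computed over the models scored by both benchmarks; the model names and scores are illustrative placeholders, not the full table:

```python
# Minimal sketch: correlate AutoBench scores with a reference benchmark.
# Assumes Pearson correlation over models present in both score sets.
from scipy.stats import pearsonr

autobench = {"model-a": 3.17, "model-b": 3.13, "model-c": 2.92, "model-d": 2.55}
reference = {"model-a": 68.0, "model-b": 63.0, "model-c": 54.0, "model-d": 53.0}

common = sorted(set(autobench) & set(reference))  # models scored by both
r, _p = pearsonr([autobench[m] for m in common],
                 [reference[m] for m in common])
print(f"correlation: {r:.2%}")
```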
Performance Visualizations
Exploring relationships between AutoBench rank, latency, and cost across four charts (a sketch of the first follows the list):
- Rank vs. Average Cost
- Rank vs. Average Latency
- Rank vs. P99 Latency
- Performance vs. Cost/Latency Trade-offs
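As a rough illustration of how the "Rank vs. Average Cost" scatter is built, here is a minimal matplotlib sketch; the model names and values are placeholders, and the log-scaled cost axis is an assumption that suits costs spanning orders of magnitude:

```python
# Illustrative sketch of the "Rank vs. Average Cost" scatter chart.
import matplotlib.pyplot as plt

models = ["model-a", "model-b", "model-c"]
scores = [3.17, 3.13, 2.92]     # AutoBench score (higher is better)
costs = [2.563, 1.948, 0.126]   # average cost, USD cents per response

fig, ax = plt.subplots()
ax.scatter(costs, scores)
for name, x, y in zip(models, costs, scores):
    ax.annotate(name, (x, y))   # label each point with its model name
ax.set_xscale("log")            # assumed: costs span orders of magnitude
ax.set_xlabel("Avg Cost (USD cents/response)")
ax.set_ylabel("AutoBench Score")
ax.set_title("Rank vs. Average Cost")
plt.show()
```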
Cost Breakdown per Domain (USD Cents/Response)
Each row shows a model's average cost per response in each domain; the final column is the overall average.
Claude-haiku-4.5 | 0.809 | 0.876 | 1.241 | 1.353 | 0.611 | 0.783 | 0.615 | 0.777 | 0.667 | 0.824 | 0.86 |
Claude-opus-4.6 | 2.71 | 3.036 | 2.974 | 2.71 | 2.678 | 2.576 | 3.428 | 2.214 | 1.376 | 1.711 | 2.563 |
Claude-sonnet-4.6 | 1.848 | 1.958 | 2.999 | 2.051 | 2.137 | 1.71 | 2.699 | 1.397 | 1.082 | 1.46 | 1.948 |
Deepseek-v3.2 | 0.072 | 0.064 | 0.05 | 0.059 | 0.068 | 0.07 | 0.056 | 0.052 | 0.046 | 0.051 | 0.059 |
GLM-4.7 | 0.147 | 0.124 | 0.162 | 0.19 | 0.131 | 0.139 | 0.187 | 0.102 | 0.086 | 0.081 | 0.115 |
GLM-5.1 | 0.511 | 0.533 | 0.614 | 0.567 | 0.549 | 0.465 | 0.616 | 0.451 | 0.29 | 0.31 | 1.328 |
Gemini-3-flash-preview | 0.303 | 0.268 | 0.297 | 0.269 | 0.278 | 0.251 | 0.312 | 0.368 | 0.196 | 0.247 | 0.278 |
Gemini-3.1-flash-lite-preview | 0.119 | 0.121 | 0.11 | 0.11 | 0.11 | 0.095 | 0.121 | 0.145 | 0.113 | 0.114 | 0.02 |
Gemini-3.1-pro-preview | 1.22 | 1.239 | 1.307 | 1.408 | 1.195 | 1.343 | 1.416 | 1.647 | 1.055 | 1.579 | 0.024 |
Gemma-4-26b-a4b-it | 0.019 | 0.018 | 0.023 | 0.023 | 0.019 | 0.024 | 0.029 | 0.015 | 0.011 | 0.013 | 0.139 |
Gemma-4-31b-it | 0.026 | 0.029 | 0.027 | 0.028 | 0.025 | 0.027 | 0.028 | 0.021 | 0.013 | 0.018 | 0.499 |
Gpt-5.4 | 6.666 | 6.379 | 5.844 | 6.129 | 4.072 | 5.446 | 2.722 | 7.231 | 5.625 | 9.744 | 5.821 |
Gpt-5.4-mini | 3.09 | 1.763 | 1.512 | 1.276 | 2.596 | 1.935 | 1.181 | 3.533 | 2.447 | 0.616 | 2.03 |
Gpt-5.4-nano | 0.458 | 0.19 | 0.452 | 0.491 | 0.343 | 0.614 | 0.485 | 0.546 | 0.433 | 0.393 | 0.432 |
Gpt-oss-120b | 0.016 | 0.021 | 0.017 | 0.019 | 0.016 | 0.022 | 0.018 | 0.014 | 0.014 | 0.014 | 0.017 |
Gpt-oss-20b | 0.012 | 0.014 | 0.014 | 0.014 | 0.011 | 0.013 | 0.014 | 0.017 | 0.011 | 0.016 | 0.013 |
Grok-4.1-fast | 0.119 | 0.147 | 0.128 | 0.111 | 0.118 | 0.121 | 0.12 | 0.122 | 0.102 | 0.124 | 0.12 |
Grok-4.20 | 1.342 | 1.825 | 1.406 | 1.806 | 1.294 | 1.612 | 1.163 | 2.519 | 1.139 | 1.546 | 1.536 |
Kimi-K2.5 | 0.134 | 0.132 | 0.14 | 0.142 | 0.116 | 0.111 | 0.15 | 0.133 | 0.095 | 0.102 | 0.126 |
Llama-4-maverick | 0.024 | 0.027 | 0.03 | 0.024 | 0.027 | 0.029 | 0.027 | 0.038 | 0.026 | 0.031 | 0.028 |
Mimo-V2-Pro | 0.347 | 0.349 | 0.346 | 0.369 | 0.328 | 0.319 | 0.371 | 0.312 | 0.249 | 0.211 | 0.324 |
Minimax-m2.5 | 0.052 | 0.06 | 0.065 | 0.057 | 0.054 | 0.054 | 0.061 | 0.044 | 0.034 | 0.034 | 0.052 |
Minimax-m2.7 | 0.097 | 0.1 | 0.113 | 0.099 | 0.098 | 0.093 | 0.114 | 0.084 | 0.058 | 0.079 | 0.094 |
Mistral-large-2512 | 0.102 | 0.098 | 0.12 | 0.106 | 0.098 | 0.101 | 0.123 | 0.082 | 0.054 | 0.071 | 0.096 |
Mistral-small-4 | 0.049 | 0.051 | 0.059 | 0.042 | 0.08 | 0.044 | 0.056 | 0.038 | 0.029 | 0.044 | 0.05 |
Nemotron-3-nano-30b-a3b | 0.089 | 0.082 | 0.082 | 0.066 | 0.081 | 0.094 | 0.069 | 0.122 | 0.075 | 0.071 | 0.082 |
Nemotron-3-super-120b-a12b | 0.065 | 0.046 | 0.07 | 0.107 | 0.056 | 0.107 | 0.094 | 0.042 | 0.038 | 0.052 | 0.068 |
Nova-2-lite-v1 | 1.926 | 1.805 | 1.776 | 2.032 | 1.862 | 1.734 | 2.236 | 0.856 | 0.497 | 0.613 | 1.556 |
Qwen3.5-122b-a10b | 0.162 | 0.118 | 0.123 | 0.135 | 0.142 | 0.105 | 0.181 | 0.153 | 0.145 | 0.163 | 0.144 |
Qwen3.5-35b-a3b | 0.116 | 0.138 | 0.132 | 0.107 | 0.103 | 0.111 | 0.114 | 0.102 | 0.082 | 0.12 | 0.112 |
Qwen3.6-plus | 0.161 | 0.114 | 0.224 | 0.288 | 0.138 | 0.148 | 0.189 | 0.203 | 0.228 | 0.253 | 0.197 |
Average Latency Breakdown per Domain (Seconds)
Each row shows a model's average response latency in each domain; the final column is the overall average.
Claude-haiku-4.5 | 46.1336 | 46.1122 | 48.0425 | 60.335 | 36.1667 | 55.4396 | 39.5512 | 35.164 | 24.5206 | 52.1675 | 43.95979553 |
Claude-opus-4.6 | 42.1124 | 53.3346 | 45.461 | 43.0368 | 39.8221 | 37.2462 | 45.1839 | 28.905 | 16.9896 | 22.3545 | 37.85942643 |
Claude-sonnet-4.6 | 51.1068 | 52.8177 | 76.5625 | 50.1423 | 52.0923 | 42.691 | 52.3426 | 38.5508 | 18.7954 | 28.4199 | 46.45277089 |
Deepseek-v3.2 | 70.4417 | 60.617 | 46.1213 | 56.2128 | 79.4036 | 77.5038 | 55.4781 | 35.3541 | 26.0408 | 27.0283 | 54.31511897 |
GLM-4.7 | 50.4502 | 40.405 | 43.2821 | 66.6907 | 39.1247 | 48.016 | 52.613 | 36.519 | 23.0467 | 22.9732 | 23.16857653 |
GLM-5.1 | 70.5772 | 71.5014 | 86.1363 | 80.6663 | 75.4139 | 67.054 | 95.9933 | 32.9641 | 26.5569 | 35.1629 | 25.96268468 |
Gemini-3-flash-preview | 16.7706 | 13.8003 | 14.096 | 13.6193 | 14.8726 | 12.9421 | 15.4606 | 9.8464 | 8.0062 | 8.2219 | 12.95436087 |
Gemini-3.1-flash-lite-preview | 25.9681 | 15.6794 | 18.291 | 29.8768 | 31.6073 | 36.0709 | 29.2888 | 13.9176 | 12.3058 | 17.3742 | 13.64022252 |
Gemini-3.1-pro-preview | 25.2172 | 28.3457 | 25.5406 | 28.3863 | 26.0591 | 29.5616 | 35.0487 | 23.2719 | 16.5858 | 20.0105 | 51.69145245 |
Gemma-4-26b-a4b-it | 16.9871 | 10.217 | 13.0184 | 16.2429 | 16.5371 | 16.1576 | 20.9129 | 6.4718 | 7.1763 | 5.9912 | 43.54876781 |
Gemma-4-31b-it | 41.6696 | 53.5776 | 81.3055 | 54.9655 | 52.0971 | 59.0855 | 83.5473 | 22.6931 | 24.6705 | 49.91 | 66.08806343 |
Gpt-5.4 | 190.6972 | 125.9768 | 111.2079 | 140.3793 | 81.5197 | 131.3381 | 101.409 | 160.1292 | 108.8256 | 175.7158 | 129.263534 |
Gpt-5.4-mini | 153.3592 | 90.6251 | 81.1314 | 90.317 | 101.9265 | 103.8755 | 49.8503 | 91.5657 | 78.5646 | 26.5415 | 86.91001719 |
Gpt-5.4-nano | 128.935 | 68.2781 | 67.8651 | 145.881 | 77.745 | 140.7391 | 82.563 | 78.7779 | 80.9436 | 78.7016 | 93.33818571 |
Gpt-oss-120b | 18.207 | 12.6311 | 20.5413 | 25.8277 | 20.1329 | 22.4405 | 20.2597 | 14.1992 | 14.5309 | 6.511 | 18.02638485 |
Gpt-oss-20b | 43.7038 | 43.953 | 49.1998 | 60.1144 | 48.8196 | 37.083 | 49.6733 | 32.327 | 22.8055 | 30.2565 | 42.86725918 |
Grok-4.1-fast | 36.4589 | 39.2228 | 40.1179 | 31.148 | 35.6533 | 64.629 | 41.2036 | 30.4571 | 21.3232 | 29.431 | 36.20348296 |
Grok-4.20 | 35.1335 | 48.2891 | 37.4329 | 51.0896 | 35.0874 | 40.0045 | 36.0676 | 40.4575 | 19.2072 | 24.2121 | 36.77304221 |
Kimi-K2.5 | 87.0184 | 40.7341 | 54.4748 | 62.1407 | 52.2043 | 52.926 | 76.3145 | 40.2496 | 39.7803 | 31.1301 | 54.67825758 |
Llama-4-maverick | 44.0942 | 43.6807 | 36.7344 | 44.4963 | 42.2188 | 45.3713 | 34.3164 | 42.2113 | 36.3075 | 45.0926 | 41.27163619 |
Mimo-V2-Pro | 28.5039 | 28.9949 | 27.0924 | 30.5006 | 26.2146 | 27.5217 | 30.3201 | 19.0163 | 17.842 | 15.4146 | 25.60343568 |
Minimax-m2.5 | 93.1916 | 84.3699 | 103.6352 | 118.2817 | 98.4981 | 93.134 | 86.3949 | 45.7514 | 43.5987 | 60.6954 | 85.19656068 |
Minimax-m2.7 | 33.9994 | 25.9516 | 30.3238 | 34.9334 | 31.0331 | 37.5538 | 31.5957 | 17.6385 | 17.5101 | 15.506 | 28.14855769 |
Mistral-large-2512 | 11.0209 | 11.3873 | 12.334 | 9.9634 | 9.4495 | 9.7352 | 12.1589 | 5.7922 | 4.0873 | 5.7834 | 9.267698618 |
Mistral-small-4 | 10.4777 | 9.5762 | 11.4058 | 9.2103 | 22.2899 | 9.7938 | 12.1189 | 4.8819 | 4.3778 | 6.4862 | 10.55675923 |
Nemotron-3-nano-30b-a3b | 136.4421 | 120.1786 | 121.2343 | 103.1525 | 107.5256 | 131.6642 | 127.4537 | 68.5201 | 53.644 | 60.038 | 102.9642784 |
Nemotron-3-super-120b-a12b | 109.4218 | 65.4083 | 83.8886 | 86.9403 | 89.8686 | 82.8918 | 72.4423 | 33.2578 | 37.0007 | 44.3284 | 71.86705572 |
Nova-2-lite-v1 | 84.8424 | 69.59 | 61.1813 | 79.1771 | 69.1531 | 58.651 | 81.8847 | 23.9343 | 12.6582 | 16.2989 | 56.85229296 |
Qwen3.5-122b-a10b | 15.3923 | 14.5582 | 13.6746 | 13.4825 | 13.4436 | 14.0513 | 19.8363 | 8.3142 | 8.0972 | 8.4222 | 13.13992066 |
Qwen3.5-35b-a3b | 18.6065 | 17.7434 | 17.0379 | 17.1596 | 14.0943 | 16.2087 | 18.5391 | 6.9951 | 6.5958 | 8.4414 | 14.33161555 |
Qwen3.6-plus | 55.5548 | 40.2061 | 61.5001 | 60.691 | 43.9526 | 47.4626 | 60.6145 | 24.8841 | 32.3799 | 40.6555 | 47.51772478 |
P99 Latency Breakdown per Domain (Seconds)
Each row shows a model's 99th-percentile response latency in each domain; the final column aggregates across domains.
Claude-haiku-4.5 | 91.0773 | 110.2682 | 127.0755 | 276.1924 | 113.2281 | 243.8365 | 95.0228 | 97.5884 | 77.9215 | 290.4749 | 152.2685 |
Claude-opus-4.6 | 80.041 | 197.754 | 100.162 | 80.4153 | 86.5288 | 110.9398 | 150.5253 | 69.7232 | 44.9602 | 52.2452 | 97.3295 |
Claude-sonnet-4.6 | 154.3386 | 189.2712 | 278.3433 | 153.4241 | 207.9232 | 104.0079 | 141.6592 | 134.9327 | 43.7089 | 63.8222 | 147.1431 |
Deepseek-v3.2 | 188.9782 | 107.3153 | 112.5801 | 185.4105 | 259.0915 | 250.2864 | 125.7354 | 221.9516 | 57.3445 | 55.6669 | 156.436 |
GLM-4.7 | 129.6235 | 110.1411 | 95.4605 | 303.947 | 131.7817 | 89.6153 | 211.9996 | 160.5917 | 63.7457 | 49.3721 | 134.6278 |
GLM-5.1 | 167.4809 | 251.4347 | 285.7352 | 312.5488 | 233.0885 | 168.078 | 295.6721 | 104.8135 | 72.4833 | 86.1206 | 197.7456 |
Gemini-3-flash-preview | 27.2873 | 27.2941 | 25.3661 | 24.226 | 26.8827 | 19.5807 | 55.168 | 22.6091 | 15.7135 | 13.7824 | 25.791 |
Gemini-3.1-flash-lite-preview | 158.9503 | 43.2124 | 83.2036 | 191.2867 | 237.7518 | 119.9766 | 140.9744 | 86.1881 | 24.6225 | 57.7764 | 114.3943 |
Gemini-3.1-pro-preview | 40.5014 | 53.4536 | 43.3775 | 55.1287 | 87.1046 | 52.6185 | 113.2668 | 46.1479 | 43.2473 | 44.1445 | 57.8991 |
Gemma-4-26b-a4b-it | 41.4358 | 37.5319 | 33.5177 | 49.9711 | 58.6268 | 47.8931 | 109.4397 | 22.3967 | 23.3193 | 15.6139 | 43.9746 |
Gemma-4-31b-it | 94.5327 | 193.1804 | 250.3586 | 189.0602 | 168.4594 | 146.1042 | 288.4547 | 43.589 | 84.5247 | 313.5323 | 177.1796 |
Gpt-5.4 | 357.3064 | 304.6053 | 300.9001 | 395.2089 | 208.725 | 333.4138 | 313.7329 | 396.4187 | 330.2345 | 353.3553 | 329.3901 |
Gpt-5.4-mini | 333.0933 | 217.5008 | 303.0277 | 239.124 | 302.5534 | 198.3588 | 179.3315 | 267.1145 | 313.2008 | 55.5172 | 240.8822 |
Gpt-5.4-nano | 254.3095 | 188.8738 | 196.3221 | 291.0544 | 196.7436 | 301.9089 | 327.645 | 320.1109 | 320.8538 | 224.2656 | 262.2088 |
Gpt-oss-120b | 51.5995 | 33.3092 | 73.1303 | 66.8018 | 76.0803 | 43.4072 | 91.5204 | 56.4928 | 126.5068 | 14.7681 | 63.3616 |
Gpt-oss-20b | 185.9629 | 182.1386 | 169.189 | 208.175 | 174.7058 | 141.3277 | 205.7395 | 137.3125 | 145.4453 | 105.2351 | 165.5231 |
Grok-4.1-fast | 98.0764 | 86.5536 | 110.6121 | 54.0332 | 88.8177 | 236.0511 | 86.7175 | 73.692 | 56.2482 | 67.3797 | 95.8181 |
Grok-4.20 | 60.9856 | 180.7621 | 61.1629 | 297.1904 | 70.4811 | 86.8368 | 72.1829 | 67.2458 | 40.2727 | 41.1219 | 97.8242 |
Kimi-K2.5 | 334.1608 | 122.3857 | 189.8468 | 158.9584 | 191.315 | 104.8266 | 274.5484 | 124.984 | 162.1486 | 81.6347 | 174.4809 |
Llama-4-maverick | 64.5353 | 73.8019 | 67.0179 | 63.9908 | 82.1587 | 118.3985 | 76.4981 | 72.3758 | 68.6175 | 73.6123 | 76.1007 |
Mimo-V2-Pro | 45.9749 | 69.6757 | 43.2072 | 53.6061 | 57.1214 | 57.5479 | 72.8357 | 49.422 | 63.7333 | 28.8174 | 54.1942 |
Minimax-m2.5 | 239.6571 | 217.6127 | 267.7839 | 395.5708 | 245.0672 | 268.9264 | 239.7041 | 113.4059 | 200.6302 | 218.9969 | 240.7355 |
Minimax-m2.7 | 70.5145 | 63.2912 | 70.5855 | 73.8646 | 65.1628 | 147.4334 | 62.5303 | 45.931 | 39.8098 | 42.5468 | 68.167 |
Mistral-large-2512 | 26.3446 | 27.9479 | 30.0158 | 21.8918 | 21.6876 | 15.6656 | 29.0625 | 18.6591 | 10.0165 | 14.2327 | 21.5524 |
Mistral-small-4 | 22.5922 | 18.2761 | 26.9347 | 17.2356 | 243.0557 | 20.7109 | 26.2265 | 8.4539 | 11.0912 | 17.9602 | 41.2537 |
Nemotron-3-nano-30b-a3b | 312.2043 | 358.6536 | 290.4211 | 228.6499 | 278.3621 | 372.3838 | 282.7243 | 194.9471 | 267.3245 | 208.9367 | 279.4607 |
Nemotron-3-super-120b-a12b | 346.3498 | 232.4492 | 206.6109 | 306.4442 | 295.0199 | 227.9731 | 316.5712 | 169.1652 | 212.1549 | 141.617 | 245.4356 |
Nova-2-lite-v1 | 174.3589 | 185.7674 | 131.2926 | 142.2304 | 201.7181 | 135.5685 | 185.4033 | 118.5547 | 43.3698 | 70.0387 | 138.8302 |
Qwen3.5-122b-a10b | 40.9727 | 34.9177 | 36.3253 | 30.2049 | 46.4513 | 32.0967 | 78.2364 | 17.6285 | 25.204 | 22.4722 | 36.451 |
Qwen3.5-35b-a3b | 34.734 | 51.2018 | 34.9103 | 37.8562 | 30.8286 | 29.5903 | 43.737 | 13.6506 | 14.1181 | 16.6684 | 30.7295 |
Qwen3.6-plus | 99.6381 | 113.883 | 180.359 | 163.168 | 85.5265 | 90.039 | 114.3871 | 58.4227 | 90.5119 | 97.8585 | 109.3794 |
Performance Across Different Domains
Model scores within specific knowledge or task areas; higher is better. Each row shows per-domain AutoBench scores, with the overall score in the final column.
Claude-haiku-4.5 | 3.0614 | 2.8375 | 2.9419 | 3.07 | 3.221 | 2.9925 | 2.8819 | 2.4656 | 2.8192 | 2.6517 | 2.9151 |
Claude-opus-4.6 | 3.2254 | 3.2528 | 3.1685 | 3.1008 | 3.6112 | 3.143 | 3.3686 | 2.6688 | 2.972 | 2.9478 | 3.1682 |
Claude-sonnet-4.6 | 3.0719 | 2.8159 | 3.3619 | 3.1191 | 3.622 | 3.0645 | 3.3271 | 2.7493 | 2.9482 | 2.9773 | 3.1262 |
Deepseek-v3.2 | 2.6536 | 2.5936 | 2.5558 | 2.3426 | 2.6388 | 2.4324 | 2.5576 | 2.5272 | 2.6462 | 2.5903 | 2.5518 |
GLM-4.7 | 2.8156 | 2.7517 | 3.0237 | 2.8089 | 3.1913 | 3.1423 | 3.2166 | 2.4618 | 2.9373 | 2.6377 | 2.8201 |
GLM-5.1 | 2.9743 | 3.1265 | 3.1676 | 3.0665 | 3.2953 | 3.0381 | 3.2561 | 2.6182 | 3.0416 | 2.7205 | 3.0959 |
Gemini-3-flash-preview | 3.0713 | 3.0608 | 3.107 | 2.6955 | 3.0183 | 3.0539 | 2.9513 | 2.7208 | 2.6366 | 2.6772 | 2.8926 |
Gemini-3.1-flash-lite-preview | 2.7071 | 2.9008 | 2.7554 | 2.6945 | 3.0595 | 2.9101 | 3.1382 | 2.5291 | 2.6783 | 2.7213 | 2.5328 |
Gemini-3.1-pro-preview | 3.1791 | 3.2185 | 3.2664 | 2.9015 | 3.2574 | 2.8581 | 3.2808 | 2.5238 | 3.1878 | 3.1821 | 2.698 |
Gemma-4-26b-a4b-it | 2.2699 | 2.5902 | 2.6241 | 2.4963 | 2.5757 | 2.7545 | 2.8982 | 2.1141 | 2.4297 | 2.5124 | 2.9172 |
Gemma-4-31b-it | 2.8553 | 2.7678 | 2.9285 | 2.5916 | 2.7969 | 2.7997 | 2.7872 | 2.2097 | 2.6403 | 2.6407 | 3.0559 |
Gpt-5.4 | 3.102 | 3.0287 | 3.2705 | 3.0277 | 3.0711 | 3.2832 | 3.0165 | 2.5045 | 2.922 | 3.0751 | 3.0246 |
Gpt-5.4-mini | 3.3417 | 3.0365 | 2.8875 | 2.5461 | 3.2261 | 3.1224 | 2.791 | 2.393 | 2.9426 | 2.8658 | 2.9075 |
Gpt-5.4-nano | 2.9039 | 2.764 | 2.9425 | 2.269 | 2.913 | 2.7384 | 2.7862 | 2.5623 | 2.9551 | 2.9552 | 2.7823 |
Gpt-oss-120b | 2.8299 | 2.9759 | 2.8175 | 2.6792 | 2.9697 | 2.6809 | 2.6406 | 2.3854 | 2.8958 | 2.6491 | 2.7639 |
Gpt-oss-20b | 2.4834 | 2.9056 | 2.65 | 2.4672 | 2.7635 | 2.6706 | 2.5309 | 2.4596 | 2.8156 | 2.8078 | 2.6495 |
Grok-4.1-fast | 2.7256 | 2.7798 | 2.8956 | 2.7138 | 2.9297 | 2.9768 | 3.0699 | 2.4118 | 2.8696 | 3.0586 | 2.8449 |
Grok-4.20 | 2.8101 | 2.992 | 2.8084 | 2.8254 | 3.069 | 3.0622 | 2.9404 | 2.6096 | 3.0932 | 2.8865 | 2.9155 |
Kimi-K2.5 | 2.7529 | 2.9016 | 3.1202 | 2.8565 | 3.0409 | 3.2633 | 3.1816 | 2.5213 | 2.9282 | 2.7499 | 2.9215 |
Llama-4-maverick | 2.2207 | 2.1623 | 2.2359 | 2.2829 | 2.1537 | 2.4089 | 2.1973 | 2.2927 | 2.3858 | 2.4429 | 2.2695 |
Mimo-V2-Pro | 3.165 | 3.1059 | 3.0574 | 2.8061 | 3.2456 | 2.9782 | 3.2639 | 2.4309 | 2.8583 | 2.8374 | 2.9878 |
Minimax-m2.5 | 2.6819 | 2.5663 | 2.7079 | 2.49 | 3.1716 | 2.8785 | 2.9233 | 2.4104 | 2.6426 | 2.5862 | 2.7226 |
Minimax-m2.7 | 2.9146 | 3.1791 | 3.111 | 2.733 | 3.1253 | 2.7982 | 3.1518 | 2.5031 | 2.6978 | 2.7335 | 2.903 |
Mistral-large-2512 | 2.4911 | 2.9176 | 2.6082 | 2.4303 | 2.5595 | 2.5928 | 2.9121 | 2.4414 | 2.7411 | 2.5481 | 2.6249 |
Mistral-small-4 | 2.6428 | 2.7426 | 2.6579 | 2.5289 | 2.7196 | 2.669 | 2.785 | 2.6008 | 2.861 | 2.636 | 2.6874 |
Nemotron-3-nano-30b-a3b | 2.6756 | 2.6616 | 2.387 | 2.5064 | 2.8724 | 2.6711 | 2.6363 | 2.5052 | 2.7506 | 2.574 | 2.6313 |
Nemotron-3-super-120b-a12b | 2.5963 | 2.6885 | 2.7757 | 2.6367 | 2.5973 | 2.9963 | 2.5992 | 2.7177 | 2.7433 | 2.7751 | 2.6956 |
Nova-2-lite-v1 | 2.4537 | 2.6576 | 2.7964 | 2.5301 | 2.6316 | 2.8817 | 2.6522 | 2.4999 | 2.8745 | 2.6604 | 2.6603 |
Qwen3.5-122b-a10b | 2.6317 | 2.7519 | 2.9903 | 2.613 | 3.1279 | 2.8175 | 3.0919 | 2.5094 | 2.8085 | 2.9913 | 2.8363 |
Qwen3.5-35b-a3b | 2.9033 | 2.7141 | 2.9996 | 2.5789 | 2.9716 | 2.8106 | 2.9955 | 2.3478 | 2.9025 | 3.0803 | 2.8262 |
Qwen3.6-plus | 2.9985 | 3.0064 | 3.0355 | 2.8342 | 3.114 | 2.9565 | 3.192 | 2.7408 | 2.9821 | 3.041 | 2.9923 |
About AutoBench
AutoBench is an LLM benchmark where Large Language Models (LLMs) evaluate and rank the responses generated by other LLMs. The questions themselves are also generated by LLMs across a diverse set of domains and ranked for quality.
Methodology
- Question Generation: High-quality questions across various domains (Coding, History, Science, etc.) are generated by selected LLMs.
- Response Generation: The models being benchmarked generate answers to these questions.
- Ranking: The ranking LLMs score the responses from the different models for each question on a 1-5 scale.
- Aggregation: Scores are averaged across multiple questions and domains to produce the final AutoBench rank (see the sketch below).
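A minimal sketch of the aggregation step, assuming each response receives 1-5 ranks from the ranking LLMs and the final score is a plain mean, first within and then across domains; the data is illustrative and the exact weighting AutoBench applies may differ:

```python
# Sketch of rank aggregation: mean within each domain, then across domains.
from statistics import mean

# ranks[model][domain] -> list of 1-5 ranks that model's responses received
ranks = {
    "model-a": {"coding": [4, 5, 4], "history": [3, 4, 4]},
    "model-b": {"coding": [3, 3, 4], "history": [4, 3, 3]},
}

def autobench_score(per_domain):
    """Average within each domain, then across domains."""
    return mean(mean(domain_ranks) for domain_ranks in per_domain.values())

for model, per_domain in ranks.items():
    print(model, round(autobench_score(per_domain), 2))
```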
Metrics
- AutoBench Score (AB): The average rank received by a model's responses across all questions/domains (higher is better).
- Avg Cost (USD Cents/response): Estimated average cost to generate one response based on model provider pricing (input+output tokens). Lower is better.
- Avg Latency (s): Average time taken by the model to generate a response. Lower is better.
- P99 Latency (s): The 99th percentile of response time, indicating worst-case latency. Lower is better. (Cost and latency computations are sketched after this list.)
- Reference Benchmarks: Scores from other well-known benchmarks (AAI Index, Terminal-Bench Hard, GDPval-AA, Tau2-Bench Telecom) for comparison, where available.
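The latency and cost metrics can be derived from raw per-response measurements roughly as follows; the latencies, token counts, and per-million-token prices below are invented for illustration:

```python
# Sketch: derive avg latency, P99 latency, and per-response cost.
import numpy as np

latencies_s = np.array([12.4, 9.8, 14.1, 88.0, 11.2])  # seconds per response
avg_latency = latencies_s.mean()
p99_latency = np.percentile(latencies_s, 99)            # worst-case tail

# Cost per response from provider pricing (USD cents per million tokens).
input_tokens, output_tokens = 1_200, 850
in_price, out_price = 30.0, 120.0   # assumed cents per 1M input/output tokens
cost_cents = (input_tokens * in_price + output_tokens * out_price) / 1_000_000

print(f"avg {avg_latency:.1f}s, p99 {p99_latency:.1f}s, cost {cost_cents:.4f} cents")
```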
Data
This leaderboard reflects the selected run. The current run, AutoBench Agentic Run 1, was completed on April 16, 2026 and covers 31 models, including recently released models such as Gemini-3.1-pro-preview, Claude-opus-4.6, Gpt-5.4, and GLM-5.1.
Disclaimer: Benchmark results provide one perspective on model capabilities. Performance can vary based on specific tasks, prompts, and API conditions. Costs are estimates and subject to change by providers. Latency depends on server load and geographic location.