AutoBench LLM Leaderboard
Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. Includes performance, cost, and latency metrics. Use the dropdown below to navigate between different benchmark runs.
Choose a benchmark run to view its results
Current Run: AutoBench Run 3 - August 2025 (2025-08-14) - 34 models
Overall Model Performance
Models ranked by AutoBench score. Lower cost ($ Cents), latency (s), and fail rate (%) are better. Iterations shows the number of evaluations per model.
Benchmark Correlations: AutoBench scores correlate 86.85% with LMArena, 92.17% with the Artificial Analysis Intelligence Index, and 75.44% with MMLU.
Overall Rankings
|  Model  |  AutoBench Score  |  Avg Cost ($ Cents)  |  Avg Latency (s)  |  P99 Latency (s)  |  Fail Rate (%)  |  Iterations  |
|  ---  |  ---  |  ---  |  ---  |  ---  |  ---  |  ---  |
|  gpt-5  |  4.511567341  |  4.368  |  89.99818067  |  277.6722  |  19.27%  |  385  |
Benchmark Comparison
Comparison of AutoBench scores with other popular benchmarks. AutoBench shows 82.51% correlation with Chatbot Arena, 83.74% with the Artificial Analysis Intelligence Index, and 71.51% with MMLU. Models are sorted by AutoBench score. A small correlation sketch follows the table.
Benchmark Comparison
|  Model  |  AutoBench Score  |  Chatbot Arena  |  Artificial Analysis Intelligence Index  |  MMLU  |
|  ---  |  ---  |  ---  |  ---  |  ---  |
|  gpt-5   |  4.511567341   |  1481   |  68950   |  0.871   | 
|  gpt-5-mini   |  4.486571107   |  null   |  63700   |  0.828   | 
|  gpt-oss-120b   |  4.479287977   |  1356   |  61340   |  0.808   | 
|  gemini-2.5-pro   |  4.416904571   |  1458   |  64630   |  0.862   | 
|  o3   |  4.409851586   |  1451   |  67070   |  0.853   | 
|  Qwen3-235B-A22B-Thinking-2507   |  4.394399183   |  1401   |  63590   |  0.843   | 
|  gpt-5-nano   |  4.325926956   |  null   |  53780   |  0.772   | 
|  gemini-2.5-flash   |  4.32099389   |  1409   |  58430   |  0.759   | 
|  grok-4   |  4.308828831   |  1430   |  67520   |  0.866   | 
|  o4-mini   |  4.27410734   |  1398   |  65050   |  0.832   | 
|  claude-opus-4-1   |  4.239909895   |  1446   |  58830   |  null   | 
|  deepSeek-R1-0528   |  4.18112906   |  1418   |  58740   |  0.849   | 
|  Kimi-K2-Instruct   |  4.177138663   |  1420   |  48560   |  0.824   | 
|  GLM-4.5   |  4.176558031   |  1414   |  56080   |  0.835   | 
|  claude-sonnet-4   |  4.171968576   |  1399   |  61000   |  0.842   | 
|  gpt-4.1   |  4.165890881   |  1406   |  46770   |  0.806   | 
|  grok-3-mini   |  4.055940505   |  1360   |  58010   |  0.828   | 
|  llama-3_1-Nemotron-Ultra-253B-v1   |  4.020264345   |  1345   |  46420   |  0.825   | 
|  gemini-2.5-flash-lite   |  4.017202952   |  1351   |  44348   |  0.832   | 
|  GLM-4.5-Air   |  3.98464018   |  1379   |  49475   |  0.815   | 
|  Qwen3-14B   |  3.976245179   |  null   |  45235   |  0.774   | 
|  Qwen3-30B-A3B   |  3.952481327   |  1380   |  42340   |  0.777   | 
|  deepSeek-V3-0324   |  3.945669087   |  1390   |  43990   |  0.819   | 
|  llama-3_3-Nemotron-Super-49B-v1   |  3.883310532   |  1324   |  40473   |  0.698   | 
|  gemma-3-27b-it   |  3.881640548   |  1363   |  25220   |  0.669   | 
|  mistral-large-2411   |  3.714675671   |  1313   |  27013   |  0.697   | 
|  magistral-small-2506   |  3.713933337   |  1347   |  35950   |  0.746   | 
|  phi-4   |  3.657791802   |  1258   |  27950   |  0.714   | 
|  llama-4-maverick   |  3.640194992   |  1330   |  41730   |  0.809   | 
|  llama-4-Scout   |  3.614481399   |  1318   |  33060   |  0.752   | 
|  claude-3.5-haiku   |  3.586292962   |  1317   |  23326   |  0.634   | 
|  nova-lite-v1   |  3.538201832   |  1262   |  24540   |  0.59   | 
|  nova-pro-v1   |  3.490835422   |  1289   |  28830   |  0.691   | 
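For reference, correlation figures like the ones quoted above can be computed from this table with a few lines of Python. A minimal sketch, assuming the table has been exported to a CSV with hypothetical column names (`autobench`, `chatbot_arena`, `aa_intelligence_index`, `mmlu`):

```python
# Sketch: Pearson correlation between AutoBench and other benchmarks.
# Assumes a hypothetical CSV export of the comparison table above.
import pandas as pd

df = pd.read_csv("benchmark_comparison.csv")  # hypothetical filename

for other in ["chatbot_arena", "aa_intelligence_index", "mmlu"]:
    # Drop models with missing scores (shown as "null" in the table) before correlating.
    pair = df[["autobench", other]].dropna()
    r = pair["autobench"].corr(pair[other])  # Pearson correlation by default
    print(f"AutoBench vs {other}: r = {r:.4f} ({r:.2%})")
```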
Performance Visualizations
Exploring relationships between AutoBench Rank, Latency, and Cost. A minimal plotting sketch follows the chart list below.
Rank vs. Average Cost
Rank vs. Average Latency
Rank vs. P99 Latency
Performance vs. Cost/Latency Trade-offs
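The charts above are interactive. As a rough offline equivalent, a minimal sketch of a rank-vs-cost scatter plot (matplotlib, with a hypothetical CSV export and column names):

```python
# Sketch: AutoBench score vs. average cost per response, one point per model.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("overall_rankings.csv")  # hypothetical export of the overall table

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(df["avg_cost_cents"], df["autobench_score"])
for _, row in df.iterrows():
    ax.annotate(row["model"], (row["avg_cost_cents"], row["autobench_score"]), fontsize=7)
ax.set_xscale("log")  # costs span orders of magnitude (roughly 0.02 to 9 cents here)
ax.set_xlabel("Average cost ($ cents / response, log scale)")
ax.set_ylabel("AutoBench score")
ax.set_title("Rank vs. Average Cost")
plt.tight_layout()
plt.show()
```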
Cost Breakdown per Domain ($ Cents/Response)
Cost Breakdown (one column per domain; the final column is each model's overall average)
|  claude-3.5-haiku   |  1.433   |  0.596   |  0.783   |  0.688   |  0.701   |  0.829   |  0.67   |  0.865   |  0.762   |  0.771   |  0.826   | 
|  claude-opus-4-1   |  18.544   |  5.806   |  7.954   |  7.093   |  6.338   |  9.152   |  7.756   |  8.968   |  8.114   |  8.625   |  9.126   | 
|  claude-sonnet-4   |  3.736   |  0.987   |  1.471   |  1.164   |  1.134   |  1.594   |  1.523   |  1.812   |  1.523   |  1.549   |  1.71   | 
|  deepSeek-R1-0528   |  1.315   |  0.241   |  0.362   |  0.289   |  0.322   |  0.349   |  1.36   |  1.57   |  0.335   |  0.346   |  0.638   | 
|  deepSeek-V3-0324   |  0.18   |  0.077   |  0.108   |  0.077   |  0.096   |  0.102   |  0.177   |  0.162   |  0.102   |  0.101   |  0.12   | 
|  gemini-2.5-flash   |  1.009   |  0.15   |  0.386   |  0.235   |  0.328   |  0.373   |  0.4   |  0.651   |  0.42   |  0.426   |  0.451   | 
|  gemini-2.5-flash-lite   |  0.163   |  0.023   |  0.043   |  0.027   |  0.036   |  0.074   |  0.284   |  0.3   |  0.047   |  0.076   |  0.105   | 
|  gemini-2.5-pro   |  2.775   |  0.727   |  1.51   |  1.123   |  1.218   |  1.489   |  1.518   |  2.214   |  1.611   |  1.516   |  1.587   | 
|  gemma-3-27b-it   |  0.045   |  0.018   |  0.028   |  0.019   |  0.022   |  0.027   |  0.026   |  0.041   |  0.027   |  0.027   |  0.028   | 
|  GLM-4.5   |  1.215   |  0.225   |  0.374   |  0.3   |  0.374   |  0.376   |  1.21   |  1.441   |  0.379   |  0.408   |  0.63   | 
|  GLM-4.5-Air   |  0.647   |  0.115   |  0.197   |  0.137   |  0.31   |  0.189   |  0.668   |  0.811   |  0.244   |  0.27   |  0.361   | 
|  gpt-4.1   |  1.278   |  0.476   |  0.8   |  0.527   |  0.724   |  0.747   |  1.396   |  1.695   |  0.758   |  0.749   |  0.914   | 
|  gpt-5   |  6.007   |  2.748   |  3.787   |  3.068   |  3.329   |  3.684   |  6.199   |  7.589   |  3.761   |  3.823   |  4.368   | 
|  gpt-5-mini   |  0.837   |  0.361   |  0.562   |  0.423   |  0.503   |  0.521   |  0.885   |  1.099   |  0.558   |  0.585   |  0.632   | 
|  gpt-5-nano   |  0.322   |  0.188   |  0.18   |  0.158   |  0.258   |  0.217   |  0.351   |  0.382   |  0.171   |  0.193   |  0.241   | 
|  gpt-oss-120b   |  0.204   |  0.075   |  0.14   |  0.103   |  0.088   |  0.125   |  0.159   |  0.207   |  0.12   |  0.137   |  0.136   | 
|  grok-3-mini   |  0.135   |  0.054   |  0.074   |  0.052   |  0.075   |  0.074   |  0.146   |  0.131   |  0.071   |  0.074   |  0.09   | 
|  grok-4   |  5.103   |  1.489   |  2.296   |  1.615   |  2.758   |  2.201   |  5.123   |  5.97   |  2.106   |  2.263   |  2.92   | 
|  Kimi-K2-Instruct   |  0.288   |  0.175   |  0.236   |  0.222   |  0.169   |  0.289   |  0.276   |  0.216   |  0.241   |  0.23   |  0.238   | 
|  llama-3_1-Nemotron-Ultra-253B-v1   |  0.519   |  0.21   |  0.167   |  0.144   |  0.233   |  0.206   |  0.771   |  0.84   |  0.209   |  0.188   |  0.345   | 
|  Llama-3_3-Nemotron-Super-49B-v1   |  0.061   |  0.03   |  0.039   |  0.036   |  0.038   |  0.043   |  0.068   |  0.059   |  0.041   |  0.037   |  0.046   | 
|  llama-4-maverick   |  0.074   |  0.031   |  0.044   |  0.035   |  0.045   |  0.046   |  0.068   |  0.069   |  0.042   |  0.041   |  0.05   | 
|  llama-4-Scout-17B-16E-Instruct   |  0.053   |  0.029   |  0.041   |  0.037   |  0.038   |  0.043   |  0.044   |  0.049   |  0.038   |  0.037   |  0.041   | 
|  magistral-small-2506   |  0.21   |  0.099   |  0.101   |  0.074   |  0.106   |  0.102   |  0.614   |  0.52   |  0.101   |  0.1   |  0.198   | 
|  mistral-large-2411   |  0.964   |  0.461   |  0.609   |  0.46   |  0.456   |  0.609   |  0.568   |  0.73   |  0.577   |  0.586   |  0.61   | 
|  nova-lite-v1   |  0.029   |  0.013   |  0.017   |  0.013   |  0.015   |  0.017   |  0.021   |  0.025   |  0.016   |  0.016   |  0.018   | 
|  nova-pro-v1   |  0.287   |  0.146   |  0.17   |  0.127   |  0.182   |  0.161   |  0.167   |  0.248   |  0.151   |  0.146   |  0.18   | 
|  o3   |  1.83   |  0.943   |  1.516   |  1.015   |  1.051   |  1.34   |  3.759   |  5.158   |  1.259   |  1.355   |  1.85   | 
|  o4-mini   |  1.034   |  0.629   |  0.715   |  0.607   |  0.735   |  0.687   |  1.598   |  1.248   |  0.668   |  0.744   |  0.87   | 
|  phi-4   |  0.035   |  0.019   |  0.021   |  0.017   |  0.019   |  0.023   |  0.024   |  0.038   |  0.021   |  0.021   |  0.024   | 
|  Qwen3-14B   |  0.099   |  0.037   |  0.045   |  0.044   |  0.056   |  0.047   |  0.186   |  0.189   |  0.05   |  0.048   |  0.079   | 
|  Qwen3-235B-A22B-Thinking-2507   |  0.447   |  0.278   |  0.44   |  0.357   |  0.411   |  0.378   |  0.674   |  0.661   |  0.37   |  0.372   |  0.417   | 
|  Qwen3-30B-A3B   |  0.121   |  0.039   |  0.046   |  0.038   |  0.055   |  0.047   |  0.15   |  0.185   |  0.046   |  0.045   |  0.076   | 
Average Latency Breakdown per Domain (Seconds)
Average Latency Breakdown (one column per domain; the final column is each model's overall average)
|  claude-3.5-haiku   |  17.2096   |  9.5021   |  13.1592   |  11.2295   |  9.2722   |  13.7099   |  7.9069   |  9.5013   |  10.8395   |  11.2754   |  11.51902452   | 
|  claude-opus-4-1   |  97.988   |  31.7501   |  44.9092   |  36.3513   |  30.166   |  53.8647   |  38.4643   |  32.1546   |  48.1318   |  52.2831   |  48.62490598   | 
|  claude-sonnet-4   |  75.4533   |  18.119   |  34.977   |  23.5653   |  18.9687   |  35.1883   |  19.9602   |  24.1877   |  36.5889   |  34.5893   |  33.66639032   | 
|  deepSeek-R1-0528   |  238.2024   |  38.1537   |  62.9681   |  48.2227   |  51.7076   |  61.9942   |  302.2976   |  271.5234   |  62.5406   |  66.1298   |  119.174235   | 
|  deepSeek-V3-0324   |  55.0977   |  17.8165   |  32.0701   |  23.041   |  24.2812   |  31.001   |  88.9007   |  51.2641   |  29.3776   |  43.4225   |  40.30336432   | 
|  gemini-2.5-flash   |  95.8841   |  16.7289   |  32.1003   |  18.3471   |  24.8943   |  29.1208   |  133.8567   |  52.6201   |  31.5185   |  36.1925   |  48.7078753   | 
|  gemini-2.5-flash-lite   |  26.1777   |  2.8563   |  5.6249   |  3.6215   |  3.9845   |  6.8374   |  86.5258   |  41.5112   |  6.4054   |  9.5453   |  19.15509939   | 
|  gemini-2.5-pro   |  83.393   |  29.2572   |  49.1594   |  36.6978   |  36.9932   |  47.2089   |  166.9207   |  93.4631   |  49.4694   |  52.5929   |  65.03115036   | 
|  gemma-3-27b-it   |  62.2708   |  12.4686   |  27.5139   |  18.3155   |  19.2946   |  25.0176   |  24.3704   |  40.4719   |  35.3548   |  23.2773   |  29.7215   | 
|  GLM-4.5   |  147.0394   |  20.9164   |  43.4854   |  31.8055   |  38.4103   |  42.5121   |  224.8967   |  165.5218   |  43.1608   |  49.7589   |  80.74437254   | 
|  GLM-4.5-Air   |  105.9142   |  13.7172   |  31.2173   |  19.5371   |  45.6936   |  30.9208   |  206.3031   |  140.9465   |  38.1786   |  44.156   |  68.34050587   | 
|  gpt-4.1   |  58.0164   |  11.4923   |  24.717   |  14.4706   |  15.9619   |  23.1579   |  80.4132   |  46.8127   |  21.9879   |  23.3983   |  32.86274006   | 
|  gpt-5   |  120.9672   |  50.0373   |  78.8355   |  55.0956   |  56.6508   |  73.3966   |  156.5955   |  151.2006   |  83.2029   |  76.5455   |  89.99818067   | 
|  gpt-5-mini   |  102.5153   |  25.7197   |  48.4212   |  31.5236   |  35.4217   |  44.406   |  159.7168   |  84.2748   |  52.1692   |  56.7156   |  65.89701176   | 
|  gpt-5-nano   |  98.4218   |  38.2174   |  52.23   |  38.2242   |  48.2844   |  54.4789   |  136.6573   |  86.4832   |  48.3263   |  53.912   |  66.4959839   | 
|  gpt-oss-120b   |  49.9796   |  11.3648   |  28.9885   |  18.2911   |  14.4024   |  22.6241   |  25.4515   |  36.6005   |  27.8547   |  28.8135   |  27.00733404   | 
|  grok-3-mini   |  45.7707   |  12.9635   |  17.7188   |  13.6096   |  20.4385   |  20.4502   |  40.5232   |  32.54   |  25.8136   |  23.9   |  26.12147499   | 
|  grok-4   |  92.6961   |  28.4581   |  49.7663   |  33.7181   |  41.4523   |  48.4394   |  138.4335   |  124.3858   |  48.7183   |  50.103   |  60.95525411   | 
|  Kimi-K2-Instruct   |  69.4739   |  29.2635   |  65.4032   |  45.9564   |  46.6415   |  75.2439   |  114.3645   |  58.6696   |  72.5513   |  57.1711   |  65.0222057   | 
|  llama-3_1-Nemotron-Ultra-253B-v1   |  88.5095   |  28.8905   |  29.2174   |  22.7454   |  32.2121   |  36.1473   |  174.681   |  134.8929   |  38.2931   |  32.3295   |  61.53657957   | 
|  Llama-3_3-Nemotron-Super-49B-v1   |  46.3574   |  15.1811   |  28.7779   |  22.9057   |  21.5446   |  27.2551   |  67.4261   |  33.0439   |  31.8722   |  25.0215   |  32.63831081   | 
|  llama-4-maverick   |  21.5707   |  4.76   |  7.8189   |  5.8389   |  6.1681   |  8.3187   |  21.0604   |  11.3275   |  8.7158   |  6.8338   |  10.65014104   | 
|  llama-4-Scout-17B-16E-Instruct   |  20.2177   |  6.4339   |  9.1924   |  7.529   |  7.9851   |  9.8133   |  11.1171   |  13.0721   |  12.2545   |  8.2341   |  10.86684261   | 
|  magistral-small-2506   |  11.4551   |  7.1952   |  7.5178   |  5.7532   |  6.1051   |  8.6988   |  79.2722   |  37.9617   |  7.0901   |  7.2786   |  17.53939687   | 
|  mistral-large-2411   |  51.7739   |  14.2815   |  23.6025   |  17.3517   |  13.3736   |  25.8432   |  18.1355   |  24.6234   |  25.1484   |  21.9005   |  24.36368715   | 
|  nova-lite-v1   |  7.1014   |  4.7882   |  5.846   |  4.7061   |  4.3402   |  5.5093   |  4.8806   |  4.861   |  4.9134   |  5.3275   |  5.288625128   | 
|  nova-pro-v1   |  12.4833   |  7.5838   |  7.52   |  5.6658   |  6.5254   |  7.3418   |  5.8645   |  7.2712   |  6.6838   |  6.7792   |  7.528192069   | 
|  o3   |  70.4202   |  25.9427   |  46.4619   |  29.613   |  26.5293   |  42.6644   |  194.8085   |  112.9362   |  41.2548   |  46.7826   |  63.89621339   | 
|  o4-mini   |  56.98   |  16.3976   |  26.7274   |  19.8134   |  21.7084   |  23.2641   |  116.3436   |  41.5349   |  28.6513   |  26.4233   |  39.05469579   | 
|  phi-4   |  10.8373   |  5.9498   |  6.7808   |  6.3085   |  5.9981   |  7.1457   |  7.7569   |  12.1669   |  7.0431   |  7.4096   |  7.744667446   | 
|  Qwen3-14B   |  67.7342   |  19.9239   |  31.3204   |  32.178   |  32.2363   |  31.2024   |  197.4205   |  132.5492   |  40.1656   |  31.5221   |  61.11544056   | 
|  Qwen3-235B-A22B-Thinking-2507   |  180.1429   |  33.7386   |  65.2237   |  45.3004   |  54.6603   |  53.7109   |  122.6611   |  138.2941   |  60.427   |  72.9058   |  78.79346155   | 
|  Qwen3-30B-A3B   |  119.9895   |  27.907   |  34.7837   |  25.8461   |  38.8109   |  37.0577   |  204.2344   |  157.6709   |  38.1969   |  41.6341   |  72.64171253   | 
P99 Latency Breakdown per Domain (Seconds)
P99 Latency Breakdown (one column per domain; the final column is each model's overall P99)
|  GLM-4.5   |  336.5838   |  46.1539   |  115.4247   |  61.5081   |  183.8738   |  94.3667   |  955.208   |  354.6674   |  147.6845   |  164.993   |  246.0464   | 
|  GLM-4.5-Air   |  294.8103   |  49.5964   |  75.2865   |  47.3232   |  188.0317   |  122.7531   |  934.9063   |  326.1799   |  192.6793   |  173.4233   |  240.499   | 
|  Kimi-K2-Instruct   |  328.1743   |  127.0559   |  498.215   |  155.9084   |  411.2174   |  374.8414   |  919.1317   |  342.6293   |  464.4898   |  283.0215   |  390.4685   | 
|  Llama-3_3-Nemotron-Super-49B-v1   |  215.9947   |  28.1436   |  78.589   |  55.0476   |  69.6513   |  64.6465   |  672.344   |  82.6381   |  150.7472   |  96.6115   |  151.4413   | 
|  Qwen3-14B   |  291.1851   |  58.0622   |  88.5603   |  117.7344   |  88.0704   |  115.429   |  952.7636   |  353.0349   |  203.2735   |  124.2363   |  239.235   | 
|  Qwen3-235B-A22B-Thinking-2507   |  666.7428   |  62.6109   |  184.4234   |  141.1792   |  188.1752   |  130.8262   |  431.5208   |  488.4891   |  231.8124   |  312.6447   |  283.8425   | 
|  Qwen3-30B-A3B   |  302.3182   |  64.1154   |  92.9827   |  77.8548   |  121.5989   |  100.3155   |  973.6302   |  352.3788   |  165.0523   |  180.4958   |  243.0743   | 
|  claude-3.5-haiku   |  41.3124   |  14.9481   |  33.5752   |  18.5466   |  17.7297   |  36.6532   |  15.8231   |  40.0595   |  17.8959   |  16.9296   |  25.3473   | 
|  claude-opus-4-1   |  411.8235   |  66.277   |  99.7769   |  67.8099   |  73.13   |  140.073   |  240.5076   |  85.4884   |  194.4168   |  172.176   |  155.1479   | 
|  claude-sonnet-4   |  372.4893   |  48.2756   |  92.4428   |  50.91   |  54.3981   |  124.7965   |  53.7468   |  57.4402   |  181.6048   |  159.8677   |  119.5972   | 
|  deepSeek-R1-0528   |  516.7523   |  74.2918   |  114.0859   |  72.9274   |  112.0078   |  127.0512   |  839.4571   |  432.4486   |  182.3988   |  184.1239   |  265.5545   | 
|  deepSeek-V3-0324   |  186.8884   |  51.2986   |  88.1105   |  69.8525   |  65.7204   |  121.314   |  755.2438   |  202.0879   |  100.6333   |  355.9609   |  199.711   | 
|  gemini-2.5-flash   |  712.3231   |  38.3328   |  103.7981   |  43.5623   |  74.993   |  117.5583   |  944.4005   |  122.8572   |  135.3642   |  147.9211   |  244.1111   | 
|  gemini-2.5-flash-lite   |  190.8247   |  6.2927   |  19.3252   |  12.1569   |  13.2804   |  47.8915   |  602.4712   |  222.2597   |  49.1969   |  110.7355   |  127.4435   | 
|  gemini-2.5-pro   |  240.7699   |  52.3828   |  97.3472   |  57.9532   |  71.3161   |  111.1321   |  714.4278   |  300.57   |  170.1939   |  177.3255   |  199.3419   | 
|  gemma-3-27b-it   |  375.5933   |  26.3165   |  60.8123   |  46.1448   |  66.8804   |  99.0397   |  180.0912   |  228.1774   |  193.2272   |  68.8631   |  134.5146   | 
|  gpt-4.1   |  373.0435   |  20.9727   |  103.6544   |  34.6418   |  45.1164   |  68.0267   |  580.148   |  268.4173   |  151.755   |  161.6432   |  180.7419   | 
|  gpt-5   |  379.0229   |  104.1814   |  151.0171   |  109.2785   |  163.0493   |  141.3236   |  655.5064   |  536.5325   |  304.2283   |  232.5815   |  277.6722   | 
|  gpt-5-mini   |  420.1856   |  55.5139   |  107.6471   |  65.403   |  94.1406   |  96.4831   |  710.367   |  304.143   |  221.4093   |  238.5054   |  231.3798   | 
|  gpt-5-nano   |  452.6803   |  63.8721   |  123.3952   |  95.2713   |  98.3822   |  131.3145   |  649.7221   |  349.2375   |  145.5956   |  209.7373   |  231.9208   | 
|  gpt-oss-120b   |  219.9099   |  40.9979   |  88.2543   |  59.7898   |  50.0398   |  66.35   |  154.4168   |  213.958   |  154.396   |  143.3909   |  119.1503   | 
|  grok-3-mini   |  324.7266   |  28.8678   |  38.3573   |  27.4405   |  58.7259   |  56.7006   |  303.2076   |  79.7071   |  164.6423   |  78.6302   |  116.1006   | 
|  grok-4   |  330.6722   |  73.9998   |  112.4368   |  75.3662   |  148.4834   |  118.3984   |  908.2874   |  484.3592   |  205.0356   |  168.1656   |  262.5205   | 
|  llama-3_1-Nemotron-Ultra-253B-v1   |  299.2445   |  64.7177   |  80.6145   |  53.2406   |  111.7416   |  114.9641   |  677.2227   |  364.4893   |  179.9651   |  73.4696   |  201.967   | 
|  llama-4-Scout-17B-16E-Instruct   |  119.4443   |  17.908   |  21.721   |  15.137   |  15.5893   |  21.4368   |  21.6394   |  35.6977   |  109.5036   |  18.1442   |  39.6221   | 
|  llama-4-maverick   |  258.2904   |  12.3067   |  28.3245   |  14.1619   |  15.9541   |  23.4693   |  237.7464   |  50.0955   |  44.6007   |  26.4254   |  71.1375   | 
|  magistral-small-2506   |  50.6671   |  23.6896   |  23.0028   |  17.8342   |  14.257   |  27.6318   |  461.2929   |  227.3066   |  22.4139   |  27.139   |  89.5235   | 
|  mistral-large-2411   |  320.7227   |  28.5833   |  76.1094   |  50.5788   |  34.0307   |  104.2414   |  69.1314   |  52.9922   |  161.3657   |  71.1036   |  96.8859   | 
|  nova-lite-v1   |  17.362   |  9.0387   |  11.6702   |  10.1896   |  8.0435   |  9.778   |  8.2672   |  7.7491   |  9.4719   |  11.1956   |  10.2766   | 
|  nova-pro-v1   |  55.831   |  13.3815   |  14.2866   |  9.7714   |  24.3369   |  15.6141   |  14.2894   |  23.7601   |  15.2509   |  15.0352   |  20.1557   | 
|  o3   |  370.6262   |  215.2039   |  126.7157   |  84.4048   |  96.7106   |  130.4733   |  970.1118   |  427.7559   |  179.8835   |  165.4601   |  276.7346   | 
|  o4-mini   |  317.1998   |  49.1399   |  78.8689   |  45.2952   |  67.6997   |  52.9028   |  768.0834   |  246.0076   |  143.2397   |  86.9799   |  185.5417   | 
|  phi-4   |  28.1176   |  10.3654   |  12.3853   |  13.4812   |  13.4604   |  12.3491   |  14.116   |  39.9468   |  13.6159   |  34.0316   |  19.1869   | 
Performance Across Different Domains
Model scores within specific knowledge or task areas (1-5 scale; higher is better). The final column is each model's overall AutoBench score.
Domain Performance
|  claude-3.5-haiku   |  3.4733   |  3.8572   |  3.7443   |  3.995   |  3.7371   |  3.8701   |  2.8162   |  2.7809   |  3.736   |  3.8134   |  3.586292962   | 
|  claude-opus-4-1   |  4.2931   |  4.5071   |  4.3035   |  4.4302   |  4.2258   |  4.441   |  3.5738   |  3.5758   |  4.4164   |  4.4833   |  4.239909895   | 
|  claude-sonnet-4   |  4.1894   |  4.3647   |  4.3026   |  4.3258   |  4.2532   |  4.3497   |  3.5475   |  3.4817   |  4.3953   |  4.3884   |  4.171968576   | 
|  deepSeek-R1-0528   |  3.9481   |  4.3147   |  4.3493   |  4.4062   |  4.3139   |  4.4007   |  3.5649   |  3.6287   |  4.3876   |  4.4032   |  4.18112906   | 
|  deepSeek-V3-0324   |  3.8756   |  4.1946   |  4.0724   |  4.0561   |  4.0467   |  4.0888   |  3.3667   |  3.4401   |  4.1442   |  4.0976   |  3.945669087   | 
|  gemini-2.5-flash   |  4.4225   |  4.1694   |  4.3729   |  4.3287   |  4.3781   |  4.4165   |  4.0091   |  4.2283   |  4.4283   |  4.3877   |  4.32099389   | 
|  gemini-2.5-flash-lite   |  4.1092   |  4.1468   |  4.0836   |  4.1655   |  4.0513   |  4.1563   |  3.3399   |  3.546   |  4.2419   |  4.1944   |  4.017202952   | 
|  gemini-2.5-pro   |  4.5248   |  4.3916   |  4.4224   |  4.4873   |  4.4508   |  4.477   |  4.0868   |  4.2425   |  4.5154   |  4.516   |  4.416904571   | 
|  gemma-3-27b-it   |  3.5655   |  4.2891   |  4.112   |  4.1677   |  3.973   |  4.1788   |  3.0395   |  3.0841   |  4.1903   |  4.1951   |  3.881640548   | 
|  GLM-4.5   |  3.8921   |  4.2566   |  4.3827   |  4.4219   |  4.3206   |  4.4692   |  3.4848   |  3.4666   |  4.4781   |  4.4931   |  4.176558031   | 
|  GLM-4.5-Air   |  3.8049   |  3.993   |  4.1921   |  4.193   |  4.0296   |  4.2921   |  3.4191   |  3.3372   |  4.2467   |  4.2717   |  3.98464018   | 
|  gpt-4.1   |  4.2419   |  4.3243   |  4.2551   |  4.186   |  4.2089   |  4.2291   |  3.7372   |  3.7882   |  4.2677   |  4.3183   |  4.165890881   | 
|  gpt-5   |  4.5821   |  4.5178   |  4.5866   |  4.6213   |  4.359   |  4.6423   |  4.2072   |  4.1718   |  4.6466   |  4.6634   |  4.511567341   | 
|  gpt-5-mini   |  4.545   |  4.5442   |  4.4995   |  4.5239   |  4.442   |  4.5635   |  4.1788   |  4.2467   |  4.6203   |  4.6302   |  4.486571107   | 
|  gpt-5-nano   |  4.4143   |  4.3848   |  4.3524   |  4.403   |  4.3042   |  4.4475   |  3.88   |  4.134   |  4.3676   |  4.5159   |  4.325926956   | 
|  gpt-oss-120b   |  4.612   |  4.4161   |  4.5248   |  4.4229   |  4.4453   |  4.5703   |  4.162   |  4.2461   |  4.6282   |  4.634   |  4.479287977   | 
|  grok-3-mini   |  4.0184   |  4.1848   |  4.1622   |  4.2055   |  4.1589   |  4.2079   |  3.5142   |  3.4923   |  4.2585   |  4.263   |  4.055940505   | 
|  grok-4   |  4.3075   |  4.3543   |  4.3302   |  4.3823   |  4.3368   |  4.3983   |  4.0111   |  3.8461   |  4.4058   |  4.4033   |  4.308828831   | 
|  Kimi-K2-Instruct   |  4.119   |  4.5362   |  4.2929   |  4.3457   |  4.1929   |  4.5009   |  3.3958   |  3.5199   |  4.4592   |  4.4092   |  4.177138663   | 
|  llama-3_1-Nemotron-Ultra-253B-v1   |  3.7715   |  4.268   |  4.1844   |  4.2204   |  4.0862   |  4.2596   |  3.4729   |  3.434   |  4.2285   |  4.2351   |  4.020264345   | 
|  Llama-3_3-Nemotron-Super-49B-v1   |  3.8343   |  4.0472   |  4.0449   |  4.1217   |  3.9872   |  4.1624   |  3.0436   |  3.2692   |  4.1175   |  4.1322   |  3.883310532   | 
|  llama-4-maverick   |  3.5884   |  3.7355   |  3.7029   |  3.7833   |  3.8314   |  3.8241   |  3.1303   |  3.0961   |  3.8317   |  3.7879   |  3.640194992   | 
|  llama-4-Scout-17B-16E-Instruct   |  3.3725   |  3.8585   |  3.6597   |  3.8316   |  3.8462   |  3.8386   |  3.0535   |  3.0033   |  3.8224   |  3.8368   |  3.614481399   | 
|  magistral-small-2506   |  3.7448   |  3.2301   |  3.9232   |  3.8931   |  3.8409   |  3.9707   |  3.2159   |  3.2791   |  4.0028   |  3.941   |  3.713933337   | 
|  mistral-large-2411   |  3.4967   |  3.9724   |  3.8286   |  3.9329   |  3.7992   |  3.919   |  3.1123   |  3.1132   |  3.9871   |  3.9484   |  3.714675671   | 
|  nova-lite-v1   |  3.322   |  3.7767   |  3.7078   |  3.7683   |  3.5057   |  3.7565   |  2.9917   |  2.9507   |  3.8237   |  3.7503   |  3.538201832   | 
|  nova-pro-v1   |  3.3633   |  3.8403   |  3.5455   |  3.723   |  3.5315   |  3.633   |  2.9588   |  2.8514   |  3.7492   |  3.6051   |  3.490835422   | 
|  o3   |  4.4254   |  4.2963   |  4.5626   |  4.4951   |  4.3871   |  4.5722   |  3.9576   |  4.1618   |  4.579   |  4.6123   |  4.409851586   | 
|  o4-mini   |  4.3056   |  4.2389   |  4.3587   |  4.3787   |  4.2565   |  4.3495   |  3.8986   |  3.8362   |  4.4838   |  4.5318   |  4.27410734   | 
|  phi-4   |  3.4825   |  3.9651   |  3.7302   |  3.849   |  3.6624   |  3.8171   |  3.1286   |  3.1995   |  3.8704   |  3.8465   |  3.657791802   | 
|  Qwen3-14B   |  3.833   |  4.2818   |  4.1127   |  4.0911   |  4.0235   |  4.149   |  3.4441   |  3.3376   |  4.2349   |  4.1606   |  3.976245179   | 
|  Qwen3-235B-A22B-Thinking-2507   |  4.3112   |  4.4415   |  4.4798   |  4.5551   |  4.4452   |  4.5117   |  3.8366   |  3.9413   |  4.4376   |  4.5403   |  4.394399183   | 
|  Qwen3-30B-A3B   |  3.8019   |  4.1652   |  4.0949   |  4.1569   |  3.9359   |  4.145   |  3.4857   |  3.3944   |  4.1322   |  4.1468   |  3.952481327   | 
About AutoBench
AutoBench is an LLM benchmark where Large Language Models (LLMs) evaluate and rank the responses generated by other LLMs. The questions themselves are also generated by LLMs across a diverse set of domains and ranked for quality.
Methodology
- Question Generation: High-quality questions across various domains (Coding, History, Science, etc.) are generated by selected LLMs.
- Response Generation: The models being benchmarked generate answers to these questions.
- Ranking: A panel of ranking LLMs scores the responses from the different models for each question on a 1-5 scale.
- Aggregation: Scores are averaged across questions and domains to produce the final AutoBench score; a minimal aggregation sketch follows this list.
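The aggregation step can be pictured with a short sketch. This uses toy data and a simple unweighted mean; the actual pipeline may weight questions, domains, or ranking models differently:

```python
# Sketch: average per-question ranks (1-5) into per-domain and overall scores.
# Each record is one ranking LLM's grade of one model's answer to one question (toy data).
from collections import defaultdict

records = [
    ("model-a", "coding", 4.6), ("model-a", "coding", 4.2), ("model-a", "history", 4.8),
    ("model-b", "coding", 3.9), ("model-b", "history", 4.1), ("model-b", "history", 4.3),
]

per_domain = defaultdict(list)
for model, domain, rank in records:
    per_domain[(model, domain)].append(rank)

# Domain score = mean rank within the domain; overall score = mean of domain scores.
domain_scores = {k: sum(v) / len(v) for k, v in per_domain.items()}
overall = defaultdict(list)
for (model, domain), score in domain_scores.items():
    overall[model].append(score)

for model, scores in sorted(overall.items()):
    print(model, round(sum(scores) / len(scores), 3))
```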
Metrics
- AutoBench Score (AB): The average rank (on the 1-5 scale) received by a model's responses across all questions and domains. Higher is better.
- Avg Cost (USD Cents/response): Estimated average cost to generate one response, based on model provider pricing for input + output tokens. Lower is better. A cost-estimate sketch follows this list.
- Avg Latency (s): Average time taken by the model to generate a response. Lower is better.
- P99 Latency (s): The 99th percentile of response time, indicating worst-case latency. Lower is better.
- Chatbot Arena / Artificial Analysis Intelligence Index / MMLU: Scores from other well-known benchmarks for comparison (where available).
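As an illustration of the Avg Cost metric, a minimal sketch of a per-response cost estimate. The model name and prices below are placeholders, not any provider's actual rates:

```python
# Sketch: estimate the cost of one response, in USD cents, from token counts.
# Prices are illustrative placeholders in USD per million tokens, not real rates.
PRICING = {
    "example-model": {"input_per_mtok": 1.25, "output_per_mtok": 10.00},  # hypothetical
}

def response_cost_cents(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    usd = (input_tokens / 1_000_000) * p["input_per_mtok"] \
        + (output_tokens / 1_000_000) * p["output_per_mtok"]
    return usd * 100  # dollars -> cents

# Example: a 1,500-token prompt answered with 900 output tokens.
print(round(response_cost_cents("example-model", 1500, 900), 4))  # -> 1.0875
```

P99 latency can be obtained analogously from the recorded per-response latencies, e.g. `numpy.percentile(latencies, 99)`.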
Data
This leaderboard supports multiple runs (see the dropdown above); this description refers to a run completed on April 23, 2025, which included recently released models such as o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, and Claude 3.7 Sonnet (thinking).
Disclaimer: Benchmark results provide one perspective on model capabilities. Performance can vary based on specific tasks, prompts, and API conditions. Costs are estimates and subject to change by providers. Latency depends on server load and geographic location.