AutoBench LLM Leaderboard

Interactive leaderboard for AutoBench, where LLMs rank LLMs' responses. Includes performance, cost, and latency metrics.

Current Run: AutoBench Run 5 - December 2025 (2025-12-19) - 38 models

Overall Model Performance

Models ranked by AutoBench score. Lower cost ($ Cents), latency (s), and fail rate (%) are better. Iterations shows the number of evaluations per model.

Benchmark Correlations: AutoBench features 69.19% correlation with LMArena, 89.52% with Artificial Analysis Intelligence Index, and 82.64% with MMLU.

Overall Rankings

Model	AutoBench Score	Cost ($ Cents)	Avg Latency (s)	P99 Latency (s)	Fail Rate (%)	Iterations
Gpt-5.2-pro	4.476206	81.882	261.3839264	783.8191	null	303
Gpt-5.2	4.430061	7.356	130.0950453	434.1457	null	312
Gemini-3-pro-preview	4.405224	6.85	76.10772546	186.1538	null	312
Claude-opus-4.5	4.39496	17.26	144.0064821	373.3072	null	313
Gpt-5.1	4.38364	10.8	227.425965	627.3321	null	310
Kimi-k2-thinking	4.315342	1.856	247.9683578	729.1158	null	287
Gemini-3-flash-preview	4.303068	1.946	45.56031499	136.5714	null	313
Claude-sonnet-4.5	4.30218	11.393	169.7268622	476.7493	null	307
Gemini-2.5-pro	4.294065	6.479	86.79510542	222.4742	null	313
Gpt-5-mini	4.287269	0.914	93.48742858	257.8269	null	312
Grok-4.1-fast-thinking	4.206201	0.272	69.23819567	207.0219	null	306
Grok-4	4.197064	8.124	180.1110452	562.4568	null	293
Qwen3-235B-A22B-Thinking-2507	4.196769	0.317	316.8201599	810.8855	0.101587302	283
Gpt-oss-120b	4.181097	0.115	75.47582399	291.8432	null	292
Gemini-2.5-flash	4.171935	2.122	65.61706446	173.9386	null	312
Claude-haiku-4.5	4.170821	3.787	110.9457146	316.9223	null	312
Deepseek-v3.2-speciale	4.141433	0.467	310.3903417	832.6711	null	288
GLM-4.6	4.132794	1.254	187.4263836	630.4876	null	306
DeepSeek-R1-0528	4.118577	0.987	171.4999365	476.6504	null	308
Deepseek-v3.2	4.109586	0.089	124.5749929	410.4636	null	311
Kimi-k2-0905	4.107852	0.334	82.79755057	329.0056	null	312
Gpt-5-nano	4.060397	0.339	99.62428955	268.8315	null	309
Nova-2-lite-v1	4.059981	3.944	61.45748847	131.6013	null	277
Qwen3-next-80b-a3b-thinking	4.031744	0.749	77.75939135	226.6753	0.00952381	312
Nemotron-3-nano-30b-a3b	4.028261	0	30.08154062	97.8707	null	314
Minimax-m2	3.990557	0.712	136.9629944	472.8137	null	308
Qwen3-235b-a22b-2507	3.98095	0.192	104.7811018	337.2145	0.041269841	302
Gemini-2.5-flash-lite	3.94904	0.214	20.41722765	69.0879	null	313
Mistral-large-2512	3.935105	0.512	89.96343603	198.1299	null	307
Grok-4.1-fast	3.877365	0.084	23.59507062	57.4378	null	312
GLM-4.5-Air	3.864646	0.536	163.1509648	425.284	null	306
Olmo-3.1-32b-think	3.85021	0	122.4217364	270.4384	null	307
Mistral-medium-3.1	3.811798	0.375	52.24553215	146.9272	null	306
Llama-3.3-nemotron-super-49b-v1.5	3.783612	0.183	76.47567281	240.3163	null	311
Gpt-oss-20b	3.779105	0.068	38.76845801	183.0218	null	310
Ministral-8b-2512	3.570151	0.049	31.40083599	154.1382	null	306
Nemotron-nano-9b-v2	3.500291	0.082	66.77738031	211.7337	null	311
Nova-premier-v1	3.473742	1.295	51.84074232	134.5408	null	312
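
One way to read the score and cost columns together is as a Pareto frontier: a model stays on the frontier if no other model is at least as good on both score and cost and strictly better on one. The sketch below parses a handful of rows copied from the table above and keeps the non-dominated ones. The names `Row`, `parse_rows`, and `pareto_frontier` are illustrative, and the column order is an assumption inferred from the table description; none of this is part of AutoBench itself.

```python
from typing import NamedTuple, Optional

# A few rows copied from the Overall Rankings table (tab-separated).
# Assumed column order: model, score, cost (cents), avg latency (s),
# P99 latency (s), fail rate, iterations.
ROWS = """\
Gpt-5.2-pro\t4.476206\t81.882\t261.3839264\t783.8191\tnull\t303
Gpt-5.2\t4.430061\t7.356\t130.0950453\t434.1457\tnull\t312
Gemini-3-pro-preview\t4.405224\t6.85\t76.10772546\t186.1538\tnull\t312
Claude-opus-4.5\t4.39496\t17.26\t144.0064821\t373.3072\tnull\t313
Gpt-5-mini\t4.287269\t0.914\t93.48742858\t257.8269\tnull\t312
Grok-4\t4.197064\t8.124\t180.1110452\t562.4568\tnull\t293
Deepseek-v3.2\t4.109586\t0.089\t124.5749929\t410.4636\tnull\t311
"""


class Row(NamedTuple):
    model: str
    score: float
    cost_cents: float
    fail_rate: Optional[float]  # None when the table shows "null"


def parse_rows(text: str) -> list[Row]:
    """Turn tab-separated table lines into Row records."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        fail = None if fields[5] == "null" else float(fields[5])
        rows.append(Row(fields[0], float(fields[1]), float(fields[2]), fail))
    return rows


def pareto_frontier(rows: list[Row]) -> list[Row]:
    """Keep rows that no other row beats on score and cost together."""
    frontier = []
    for r in rows:
        dominated = any(
            o.score >= r.score
            and o.cost_cents <= r.cost_cents
            and (o.score > r.score or o.cost_cents < r.cost_cents)
            for o in rows
        )
        if not dominated:
            frontier.append(r)
    return frontier


frontier_models = [r.model for r in pareto_frontier(parse_rows(ROWS))]
print(frontier_models)
```

The same function works on the full table; swapping `cost_cents` for a latency column gives the score-versus-latency frontier instead.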

Benchmark Comparison

Comparison of AutoBench scores with other popular benchmarks. AutoBench features 82.51% correlation with Chatbot Arena, 83.74% with Artificial Analysis Intelligence Index, and 71.51% with MMLU. Models sorted by AutoBench score.

Model	AutoBench	Chatbot Arena	AAI Index	MMLU
Gpt-5.2-pro	4.476206	null	73	null
Gpt-5.2	4.430061	null	null	null
Gemini-3-pro-preview	4.405224	1492	73	null
Claude-opus-4.5	4.39496	1470	70	null
Gpt-5.1	4.38364	1457	70	0.87
Kimi-k2-thinking	4.315342	1429	67	null
Gemini-3-flash-preview	4.303068	null	71	null
Claude-sonnet-4.5	4.30218	1450	63	null
Gemini-2.5-pro	4.294065	1451	60	null
Gpt-5-mini	4.287269	1392	64	null
Grok-4.1-fast-thinking	4.206201	null	64	null
Grok-4	4.197064	1478	65	null
Qwen3-235B-A22B-Thinking-2507	4.196769	1397	57	0.84
Gpt-oss-120b	4.181097	1352	61	null
Gemini-2.5-flash	4.171935	1408	51	null
Claude-haiku-4.5	4.170821	1402	55	null
Deepseek-v3.2-speciale	4.141433	1418	59	null
GLM-4.6	4.132794	1425	56	null
DeepSeek-R1-0528	4.118577	1395	52	null
Deepseek-v3.2	4.109586	1414	52	null
Kimi-k2-0905	4.107852	1416	50	null
Gpt-5-nano	4.060397	1339	51	null
Nova-2-lite-v1	4.059981	1334	47	null
Qwen3-next-80b-a3b-thinking	4.031744	1367	54	0.82
Nemotron-3-nano-30b-a3b	4.028261	null	52	null
Minimax-m2	3.990557	1345	61	null
Qwen3-235b-a22b-2507	3.98095	1374	45	0.83
Gemini-2.5-flash-lite	3.94904	1378	40	null
Mistral-large-2512	3.935105	1415	38	null
Grok-4.1-fast	3.877365	null	38	null
GLM-4.5-Air	3.864646	1370	49	null
Olmo-3.1-32b-think	3.85021	null	null	null
Mistral-medium-3.1	3.811798	1411	35	null
Llama-3.3-nemotron-super-49b-v1.5	3.783612	1340	45	null
Gpt-oss-20b	3.779105	1318	52	null
Ministral-8b-2512	3.570151	null	28	null
Nemotron-nano-9b-v2	3.500291	null	37	null
Nova-premier-v1	3.473742	null	32	null
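
The page does not say how the correlation figures are computed. As a rough illustration only, the sketch below assumes Pearson's r over the models that have both an AutoBench score and an Artificial Analysis Intelligence Index value, skipping `null` entries. It uses a small subset of rows copied from the table above, so its result will not match the reported percentages; `PAIRS` and `pearson` are illustrative names.

```python
import math
from typing import Optional

# (model, AutoBench score, AAI Index) copied from the table above.
# None stands in for "null" (no published value).
PAIRS: list[tuple[str, float, Optional[float]]] = [
    ("Gpt-5.2-pro", 4.476206, 73),
    ("Gpt-5.2", 4.430061, None),
    ("Gemini-3-pro-preview", 4.405224, 73),
    ("Claude-opus-4.5", 4.39496, 70),
    ("Gemini-2.5-flash", 4.171935, 51),
    ("Mistral-medium-3.1", 3.811798, 35),
    ("Nova-premier-v1", 3.473742, 32),
]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Only models with both values can contribute to the correlation.
complete = [(score, other) for _, score, other in PAIRS if other is not None]
r = pearson([s for s, _ in complete], [o for _, o in complete])
print(f"{len(complete)} models, r = {r:.3f}")
```

Because each external benchmark covers a different set of models, the number of usable pairs differs per benchmark, which is worth keeping in mind when comparing the reported figures.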

Performance vs. Cost/Latency Trade-offs

Cost Breakdown per Domain ($ Cents/Response)

Cost Breakdown

Claude-haiku-4.5	4.797	3.239	3.108	2.638	3.073	2.659	6.255	5.932	2.863	3.348	3.787
Claude-opus-4.5	29.227	14.105	12.831	13.55	11.872	13.533	25.805	22.584	13.63	15.803	17.26
Claude-sonnet-4.5	14.974	7.718	10.377	9.328	7.578	8.188	19.307	17.325	8.588	10.121	11.393
DeepSeek-R1-0528	1.876	0.304	0.476	0.432	0.501	0.404	2.016	2.311	0.674	0.71	0.987
Deepseek-v3.2	0.109	0.052	0.082	0.07	0.075	0.076	0.153	0.117	0.079	0.072	0.089
Deepseek-v3.2-speciale	0.672	0.676	0.32	0.289	0.445	0.283	1.029	0.628	0.329	0.28	0.467
Gemini-2.5-flash	3.803	1.178	1.362	1.224	1.834	1.217	3.863	3.681	1.541	1.496	2.122
Gemini-2.5-flash-lite	0.277	0.084	0.132	0.11	0.168	0.119	0.468	0.45	0.172	0.134	0.214
Gemini-2.5-pro	8.474	3.17	4.509	4.218	4.99	4.269	13.245	11.265	5.586	4.711	6.479
Gemini-3-flash-preview	2.436	1.362	0.74	0.733	0.987	0.712	6.846	4.256	0.925	0.75	1.946
Gemini-3-pro-preview	9.642	4.151	4.069	4.042	5.292	3.837	16.871	12.024	5.093	4.228	6.85
GLM-4.5-Air	0.821	0.131	0.317	0.338	0.378	0.329	1.024	1.275	0.322	0.34	0.536
GLM-4.6	1.4	0.463	0.891	1.057	1.133	1.131	2.189	2.269	1.009	0.842	1.254
Gpt-5.1	13.882	7.678	9.722	10.773	9.562	9.245	16.249	11.152	8.273	12.519	10.8
Gpt-5.2	13.035	4.91	5.833	3.989	6.299	4.543	14.882	10.131	5.419	5.41	7.356
Gpt-5.2-pro	106.988	59.091	69.359	54.749	80.446	59.671	150.087	125.455	57.316	63.747	81.882
Gpt-5-mini	1.102	0.666	0.806	0.686	0.856	0.952	1.375	0.994	0.846	0.856	0.914
Gpt-5-nano	0.381	0.252	0.246	0.244	0.397	0.255	0.575	0.494	0.304	0.257	0.339
Gpt-oss-120b	0.117	0.051	0.282	0.084	0.077	0.094	0.147	0.104	0.089	0.105	0.115
Gpt-oss-20b	0.083	0.036	0.051	0.046	0.055	0.049	0.136	0.119	0.053	0.053	0.068
Grok-4	13.011	3.173	4.243	4.009	6.735	4.289	19.896	18.167	5.777	5.267	8.124
Grok-4.1-fast	0.114	0.044	0.09	0.081	0.082	0.08	0.099	0.076	0.082	0.09	0.084
Grok-4.1-fast-thinking	0.414	0.118	0.132	0.122	0.207	0.129	0.747	0.541	0.216	0.164	0.272
Kimi-k2-0905	0.4	0.201	0.369	0.327	0.22	0.368	0.514	0.312	0.287	0.342	0.334
Kimi-k2-thinking	2.401	1.381	1.217	1.033	1.515	1.052	3.772	3.795	1.289	1.325	1.856
Llama-3.3-nemotron-super-49b-v1.5	0.301	0.122	0.098	0.1	0.109	0.107	0.382	0.366	0.14	0.113	0.183
Minimax-m2	1.213	0.553	0.321	0.29	0.458	0.287	1.795	1.39	0.527	0.355	0.712
Ministral-8b-2512	0.046	0.026	0.054	0.05	0.042	0.049	0.08	0.051	0.042	0.049	0.049
Mistral-large-2512	0.472	0.196	0.595	0.555	0.493	0.553	0.543	0.508	0.54	0.568	0.512
Mistral-medium-3.1	0.336	0.168	0.473	0.417	0.33	0.419	0.487	0.327	0.349	0.414	0.375
Nemotron-3-nano-30b-a3b	0	0	0	0	0	0	0	0	0	0	0
Nemotron-nano-9b-v2	0.154	0.058	0.044	0.04	0.053	0.045	0.167	0.157	0.054	0.052	0.082
Nova-2-lite-v1	7.547	4.1	3.17	3.098	3.591	2.763	4.36	4.759	3.771	3.259	3.944
Nova-premier-v1	1.678	0.763	1.238	1.206	1.235	1.14	1.487	1.551	1.24	1.255	1.295
Olmo-3.1-32b-think	0	0	0	0	0	0	0	0	0	0	0
Qwen3-235b-a22b-2507	0.184	0.044	0.159	0.121	0.209	0.146	0.36	0.33	0.166	0.172	0.192
Qwen3-235B-A22B-Thinking-2507	0.55	0.113	0.202	0.202	0.223	0.211	0.761	0.691	0.217	0.207	0.317
Qwen3-next-80b-a3b-thinking	0.997	0.609	0.549	0.516	0.769	0.489	1.233	1.06	0.707	0.593	0.749

Average Latency Breakdown per Domain (Seconds)

Average Latency Breakdown

Claude-haiku-4.5	108.2169	77.9828	139.8157	119.3168	96.8819	118.5421	146.1804	93.1935	85.5203	128.0674	110.9457146
Claude-opus-4.5	205.1172	91.8738	160.2862	154.8248	103.0795	161.9298	168.8554	110.9826	119.3727	160.8702	144.0064821
Claude-sonnet-4.5	194.2746	97.991	221.7836	197.1376	123.5208	172.1705	224.1097	153.1433	126.5125	185.5291	169.7268622
DeepSeek-R1-0528	301.0484	49.4935	117.8214	101.4108	88.6982	100.734	306.8704	351.139	122.3086	142.4731	171.4999365
Deepseek-v3.2	170.3223	50.1187	130.1619	103.7197	86.2436	130.6725	219.3999	157.6304	86.1883	108.4466	124.5749929
Deepseek-v3.2-speciale	454.4097	373.5313	230.8446	213.1212	265.5541	201.2352	651.1625	388.4446	216.2238	251.4362	310.3903417
Gemini-2.5-flash	122.2311	28.9762	68.5994	58.9803	57.4608	53.4325	77.5255	70.8783	47.9233	65.6251	65.61706446
Gemini-2.5-flash-lite	27.6891	7.4738	22.1568	13.5681	17.9023	19.6096	31.563	27.4738	17.0569	16.9643	20.41722765
Gemini-2.5-pro	109.9939	37.5364	99.5693	87.053	65.1453	74.7684	125.0731	98.0517	75.3436	88.0257	86.79510542
Gemini-3-flash-preview	53.7734	32.3758	26.7758	22.85	25.7441	21.9813	151.4909	80.1863	24.0318	25.9412	45.56031499
Gemini-3-pro-preview	111.2193	41.256	68.7167	67.4612	63.8528	59.1828	138.2422	94.61	57.7557	63.3303	76.10772546
GLM-4.5-Air	250.8921	33.182	130.3234	118.605	102.7298	125.2574	275.6033	337.8757	103.0877	122.1762	163.1509648
GLM-4.6	187.9063	59.6837	192.016	172.9087	143.9757	161.7514	273.6536	325.9543	153.79	160.8635	187.4263836
Gpt-5.1	311.144	145.0151	239.826	201.6391	180.4055	222.0834	400.0297	232.0924	177.3329	190.6357	227.425965
Gpt-5.2	215.3879	72.5414	154.2086	108.3396	115.2593	120.9876	211.1343	111.2079	91.3683	111.1999	130.0950453
Gpt-5.2-pro	391.1151	208.1175	226.3653	193.8801	223.697	204.9777	576.6401	317.3142	156.8561	194.8504	261.3839264
Gpt-5-mini	130.2666	49.2126	109.1312	89.203	88.5769	85.6279	103.3383	82.3936	86.7269	102.7151	93.48742858
Gpt-5-nano	111.2797	56.9614	97.445	93.6769	104.4473	90.7587	149.9171	109.4526	89.6858	94.049	99.62428955
Gpt-oss-120b	85.5721	30.2059	107.3813	77.7369	54.7208	80.8943	119.5328	48.964	66.8172	85.8429	75.47582399
Gpt-oss-20b	48.2448	14.5999	37.9083	32.8224	38.5637	29.7853	64.6509	47.4871	31.6468	40.1316	38.76845801
Grok-4	298.5129	65.5425	144.6898	119.8769	154.5694	121.9038	391.5985	277.8614	135.1043	151.079	180.1110452
Grok-4.1-fast	22.5872	10.5228	30.5169	29.9234	22.2147	27.6289	22.9019	14.2277	23.0097	30.2953	23.59507062
Grok-4.1-fast-thinking	99.8118	27.5942	50.9788	55.0636	58.7765	53.4979	147.2731	93.0965	58.3833	58.0907	69.23819567
Kimi-k2-0905	96.0417	49.851	87.0009	91.0837	57.9927	103.5506	129.4537	56.8817	67.2098	94.3569	82.79755057
Kimi-k2-thinking	357.8956	211.4411	187.6676	174.5778	156.1618	192.8184	406.495	400.0537	189.5434	217.8827	247.9683578
Llama-3.3-nemotron-super-49b-v1.5	115.7753	45.765	54.3973	51.8132	44.5401	58.9116	150.4951	135.1039	57.9696	50.2643	76.47567281
Minimax-m2	258.3904	80.0205	108.2872	89.6492	87.1081	80.6304	210.7194	236.3157	116.3629	82.6208	136.9629944
Ministral-8b-2512	28.8863	22.0955	41.2578	28.7331	21.6026	29.6067	54.5365	24.612	32.0725	33.7791	31.40083599
Mistral-large-2512	79.2901	28.5084	126.2813	111.757	80.9586	115.3766	70.4993	54.8645	93.7809	116.9721	89.96343603
Mistral-medium-3.1	37.2227	15.6724	81.6154	70.4164	46.7696	67.9571	43.1978	23.2783	55.3481	69.7366	52.24553215
Nemotron-3-nano-30b-a3b	32.6547	32.7213	24.9018	19.692	22.4484	21.2114	61.7155	40.3934	22.7944	26.9905	30.08154062
Nemotron-nano-9b-v2	124.962	41.4807	47.0715	36.6855	47.2949	41.3541	127.955	107.3714	40.8651	55.1499	66.77738031
Nova-2-lite-v1	86.8579	56.826	67.0827	61.455	57.903	54.8436	51.9657	53.2645	60.4673	59.8649	61.45748847
Nova-premier-v1	59.2369	19.7099	61.1146	61.1911	38.9898	55.9562	49.0279	50.0873	49.4701	62.2549	51.84074232
Olmo-3.1-32b-think	177.482	130.5138	74.441	78.2192	108.5078	71.7146	215.9375	202.7508	97.1318	80.6868	122.4217364
Qwen3-235b-a22b-2507	133.3347	35.1341	78.9824	71.2207	115.1465	81.8462	163.3372	184.0751	89.4845	78.7041	104.7811018
Qwen3-235B-A22B-Thinking-2507	548.9492	102.3888	246.6013	235.7818	240.6059	257.2741	631.6219	589.9518	236.4708	208.5225	316.8201599
Qwen3-next-80b-a3b-thinking	105.9201	48.549	77.5874	69.3386	85.093	65.7956	98.0509	85.3588	71.0135	68.6137	77.75939135

P99 Latency Breakdown per Domain (Seconds)

P99 Latency Breakdown

Claude-haiku-4.5	303.6944	172.9266	283.9706	514.833	246.103	325.9682	418.1896	361.1314	252.934	289.4724	316.9223
Claude-opus-4.5	834.1169	210.0999	329.939	377.647	254.7644	382.0941	493.5189	264.7656	281.6953	304.4306	373.3072
Claude-sonnet-4.5	847.6184	216.3561	502.248	555.5108	406.2059	564.1001	526.6079	428.4763	340.1804	380.1887	476.7493
DeepSeek-R1-0528	844.7599	99.4614	344.6565	310.4039	362.7391	220.5598	737.7579	1043.4488	548.9713	253.7458	476.6504
Deepseek-v3.2	631.2052	193.2873	356.0249	313.1029	393.9501	395.7449	541.8973	691.5566	278.3706	309.4961	410.4636
Deepseek-v3.2-speciale	1311.9582	650.2337	528.4946	503.1869	607.8602	403.1232	1755.1338	1282.7077	701.6598	582.3526	832.6711
Gemini-2.5-flash	475.328	58.3303	184.8024	139.9909	178.2687	125.5033	154.3759	178.7707	112.877	131.1387	173.9386
Gemini-2.5-flash-lite	78.696	11.6849	128.7149	38.8007	44.9652	110.2367	86.6394	74.8385	84.655	31.6472	69.0879
Gemini-2.5-pro	306.2334	63.3816	314.2017	209.8994	160.4949	139.1723	328.0977	268.0161	228.2436	207.0017	222.4742
Gemini-3-flash-preview	151.2741	87.0896	51.96	38.3654	70.0249	37.0887	549.6887	257.3478	84.5966	38.2785	136.5714
Gemini-3-pro-preview	381.7002	98.8641	168.2995	140.8396	157.0393	98.373	342.9425	247.1862	129.185	97.1085	186.1538
GLM-4.5-Air	945.3131	88.5651	379.6257	236.088	270.8741	253.7883	632.8253	900.4997	274.4039	270.8565	425.284
GLM-4.6	726.6297	152.2205	537.891	591.0155	502.2694	618.9771	898.3246	1317.976	546.1143	413.4574	630.4876
Gpt-5.1	991.0433	369.4258	416.6634	430.177	560.7524	468.2914	1079.6459	1077.5916	512.7422	366.9878	627.3321
Gpt-5.2	983.2315	173.1477	458.684	401.1408	278.51	244.5978	743.5915	574.3997	283.5269	200.6274	434.1457
Gpt-5.2-pro	1424.653	435.1419	477.5699	586.2364	582.9081	431.5706	1706.3481	1289.0466	425.5168	479.1999	783.8191
Gpt-5-mini	560.1197	166.9813	239.9502	185.6802	267.7198	200.7012	246.3061	214.9289	269.2209	226.6608	257.8269
Gpt-5-nano	282.1782	98.8659	238.9069	231.8139	302.9173	185.8154	490.2644	440.4728	259.7014	157.3791	268.8315
Gpt-oss-120b	309.681	98.5909	308.8411	258.6252	313.478	194.0417	684.8875	180.4313	380.1071	189.748	291.8432
Gpt-oss-20b	279.1046	37.3872	221.6344	94.4435	235.0365	117.9081	253.5315	225.409	147.6625	218.1006	183.0218
Grok-4	1010.6335	281.4145	304.5292	211.6889	448.4332	189.3991	1420.3039	1052.7989	377.3943	327.9724	562.4568
Grok-4.1-fast	49.7465	20.1795	54.4293	78.2517	46.273	49.4893	77.6144	71.8666	48.1681	78.3591	57.4378
Grok-4.1-fast-thinking	285.175	73.4358	100.4626	138.1754	177.5368	99.1367	516.521	266.009	246.663	167.1038	207.0219
Kimi-k2-0905	537.4	227.5026	434.3624	300.1772	262.8326	233.5223	537.1856	355.6894	188.4261	212.9576	329.0056
Kimi-k2-thinking	1063.4055	599.8131	401.0736	574.3742	506.6306	437.6233	1223.912	1507.2595	520.558	456.5078	729.1158
Llama-3.3-nemotron-super-49b-v1.5	408.0621	271.4401	160.5405	133.0068	155.2934	142.2843	299.6682	423.4154	257.5173	151.9354	240.3163
Minimax-m2	991.8198	589.4947	238.879	288.4827	400.5846	201.5793	562.5588	746.3536	374.692	333.6929	472.8137
Ministral-8b-2512	108.0489	233.4817	256.4196	82.3999	54.9958	104.629	423.6525	68.7241	92.101	116.9296	154.1382
Mistral-large-2512	318.5243	67.4724	277.4699	230.3884	182.5314	212.874	132.5931	141.9202	204.2604	213.2644	198.1299
Mistral-medium-3.1	101.3717	25.4349	207.9478	212.7246	182.3555	175.8245	108.1197	63.577	166.3996	225.5168	146.9272
Nemotron-3-nano-30b-a3b	71.8239	159.5903	48.939	33.3004	107.5626	42.9702	177.9534	172.5468	110.8246	53.1962	97.8707
Nemotron-nano-9b-v2	338.6557	195.0248	111.1827	64.8506	178.3646	92.1459	473.7478	272.3648	169.0054	221.995	211.7337
Nova-2-lite-v1	173.4527	103.964	157.1997	159.4808	133.8659	108.7504	116.0531	119.6456	135.6125	107.9884	131.6013
Nova-premier-v1	177.9887	35.0321	225.562	188.5122	93.0407	103.8183	83.1841	94.8349	145.6836	197.7514	134.5408
Olmo-3.1-32b-think	345.2511	273.287	135.2763	136.9126	222.3814	107.5568	467.3806	507.0392	302.2275	207.0715	270.4384
Qwen3-235b-a22b-2507	388.6708	256.5265	274.0394	224.9808	386.1726	290.7797	524.4854	507.0452	284.5746	234.8702	337.2145
Qwen3-235B-A22B-Thinking-2507	1366.0363	248.1267	506.3298	540.1404	515.3773	580.0747	1677.411	1660.089	582.4473	432.8225	810.8855
Qwen3-next-80b-a3b-thinking	497.6184	107.4429	187.0656	154.3664	247.5986	120.2682	189.55	343.7824	279.0175	140.0428	226.6753
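
Dividing P99 latency by average latency gives a rough sense of how heavy a model's latency tail is: a ratio near 1 means the slowest responses are close to typical ones, while a large ratio points to occasional very slow responses. The sketch below does this for a few models, using the final column of the two latency tables above (taken here to be the overall figures, since they match the Overall Rankings table). The ratio itself is an illustrative metric, not one the leaderboard reports.

```python
# (model, overall average latency in s, overall P99 latency in s),
# copied from the last column of the two latency tables above.
LATENCIES = [
    ("Grok-4.1-fast", 23.59507062, 57.4378),
    ("Gpt-5.2-pro", 261.3839264, 783.8191),
    ("Gemini-3-pro-preview", 76.10772546, 186.1538),
    ("Gpt-oss-20b", 38.76845801, 183.0218),
]

# Tail ratio: how many times slower the P99 response is than the average.
tail_ratios = {model: p99 / avg for model, avg, p99 in LATENCIES}

# Heaviest tail first.
ranked = sorted(tail_ratios, key=tail_ratios.get, reverse=True)
for model in ranked:
    print(f"{model}: P99 is {tail_ratios[model]:.2f}x the average")
```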

Performance Across Different Domains

Model scores within specific knowledge or task areas. Higher is better.

Domain Performance

Claude-haiku-4.5	3.896	4.1798	4.4195	4.5162	4.3002	4.4807	3.7367	3.5258	4.1594	4.5444	4.170821
Claude-opus-4.5	4.3253	4.3694	4.1679	4.6497	4.4774	4.4952	4.1393	4.214	4.5653	4.5178	4.39496
Claude-sonnet-4.5	4.1231	4.5017	4.4807	4.5748	4.4423	4.4153	3.5709	3.8669	4.4844	4.5519	4.30218
DeepSeek-R1-0528	3.7053	4.0733	4.3708	4.4378	4.1089	4.4467	3.5519	3.6497	4.3068	4.4322	4.118577
Deepseek-v3.2	3.741	4.0014	4.5128	4.4394	4.1404	4.3174	3.4795	3.7235	4.201	4.3906	4.109586
Deepseek-v3.2-speciale	4.0964	3.7169	4.1727	4.2891	4.2365	4.3793	3.558	3.9619	4.3353	4.3067	4.141433
Gemini-2.5-flash	4.01	4.046	4.3665	4.215	4.3058	4.2863	3.8277	4.024	4.1208	4.4398	4.171935
Gemini-2.5-flash-lite	3.9154	4.0888	4.2033	4.15	3.7796	4.3708	3.2679	3.3294	4.0502	4.2879	3.94904
Gemini-2.5-pro	4.0197	4.1507	4.3678	4.4328	4.3717	4.4027	4.0175	4.2377	4.4451	4.3803	4.294065
Gemini-3-flash-preview	3.9148	4.2529	4.4172	4.5766	4.3414	4.5388	4.0164	4.0574	4.4399	4.4483	4.303068
Gemini-3-pro-preview	4.2254	4.352	4.5077	4.6974	4.2012	4.5579	4.123	4.2957	4.5101	4.4766	4.405224
GLM-4.5-Air	3.4011	3.6283	4.0442	4.4026	3.6135	4.1639	3.6893	3.466	3.8959	4.2471	3.864646
GLM-4.6	3.9502	4.154	4.1144	4.4669	4.1101	4.2879	3.4484	4.0146	4.2906	4.3261	4.132794
Gpt-5.1	4.1923	4.3363	4.4623	4.6766	4.5045	4.4358	4.0829	4.1333	4.4884	4.4885	4.38364
Gpt-5.2	4.2989	4.4394	4.5439	4.3957	4.3679	4.5977	4.1779	4.2614	4.5517	4.5929	4.430061
Gpt-5.2-pro	4.3745	4.4942	4.5626	4.6854	4.5249	4.5944	4.3153	4.2901	4.3928	4.5531	4.476206
Gpt-5-mini	4.18	4.3856	4.4526	4.4846	4.2918	4.2895	3.8962	4.0472	4.3236	4.5178	4.287269
Gpt-5-nano	3.9829	3.847	4.2798	4.2864	3.9098	4.1801	3.6059	3.873	4.2301	4.2294	4.060397
Gpt-oss-120b	4.2552	4.1312	4.0549	4.2073	4.0356	4.3422	3.8138	3.8927	4.4529	4.508	4.181097
Gpt-oss-20b	3.9278	3.8177	3.4139	3.777	3.5187	3.3964	3.8702	3.8189	4.1819	4.0654	3.779105
Grok-4	4.1127	4.2019	4.2275	4.258	4.2027	4.3413	4.1751	3.8096	4.2996	4.3325	4.197064
Grok-4.1-fast	3.2714	4.1437	4.1517	4.4933	4.0447	4.3487	3.1697	2.8397	4.162	4.22	3.877365
Grok-4.1-fast-thinking	3.8834	4.2013	4.3318	4.4323	4.178	4.3843	3.7149	3.8893	4.4811	4.3875	4.206201
Kimi-k2-0905	3.6093	4.2553	4.4514	4.6494	4.0671	4.5102	3.4043	3.4026	4.2795	4.4418	4.107852
Kimi-k2-thinking	4.0332	4.3655	4.4615	4.6033	4.2796	4.5443	3.8119	4.129	4.2333	4.5553	4.315342
Llama-3.3-nemotron-super-49b-v1.5	3.2916	3.8451	4.129	4.3489	3.6246	4.2599	3.1084	3.1014	3.8956	4.1353	3.783612
Minimax-m2	3.4588	3.6467	4.3288	4.2996	4.0075	4.1983	3.3369	3.9375	4.1415	4.2852	3.990557
Ministral-8b-2512	2.9477	3.5676	4.0966	4.1553	3.2352	3.9678	2.9408	2.7773	3.7756	4.1015	3.570151
Mistral-large-2512	3.5392	3.8875	4.135	4.4486	4.1053	4.3948	3.4126	3.1081	4.0114	4.2951	3.935105
Mistral-medium-3.1	3.025	3.9361	4.2094	4.0717	4.0296	4.4697	3.2035	2.8908	3.9827	4.2469	3.811798
Nemotron-3-nano-30b-a3b	3.8756	3.6321	4.1365	4.2908	3.8024	4.1842	3.6777	3.8097	4.1924	4.4894	4.028261
Nemotron-nano-9b-v2	3.0723	3.0018	3.9185	4.0366	2.9714	3.9167	2.8622	3.0697	3.7831	4.0139	3.500291
Nova-2-lite-v1	3.7409	3.8832	4.2516	4.2551	4.0813	4.3824	3.0429	3.641	4.2797	4.3358	4.059981
Nova-premier-v1	2.8353	3.5547	3.7279	3.9576	3.3375	3.9765	2.5511	2.8474	3.8154	3.9793	3.473742
Olmo-3.1-32b-think	3.3149	3.9322	4.1472	4.2858	3.5331	4.1608	3.2607	3.4826	4.0137	4.2631	3.85021
Qwen3-235b-a22b-2507	3.6516	3.842	4.0547	4.3213	3.9642	4.2631	3.2727	3.7025	4.289	4.2662	3.98095
Qwen3-235B-A22B-Thinking-2507	3.7017	4.2155	4.1925	4.5712	4.2408	4.4447	3.8836	3.7789	4.2573	4.4556	4.196769
Qwen3-next-80b-a3b-thinking	3.7023	3.7423	4.0543	4.4018	4.1316	4.2153	3.7258	3.7883	4.24	4.2008	4.031744

Performance Visualizations

Plots: Rank vs. Average Cost, Rank vs. Average Latency, Rank vs. P99 Latency.