AI model comparison — quality, price and open source

The leading AI models from the US, Europe and China, compared by quality (market benchmarks), cost in USD per million tokens and open-source status.

Data as of 2026-07-01 · automated research (Artificial Analysis, LMArena, official pricing) — verify before deciding.

📊 How quality is measured — three indices

We show quality in three complementary ways. Here is how each index is built, so you can read the charts that follow:

① Zendoric Quality (0-100) = equal-thirds average of SWE-bench-Pro (33% · real software development, cross-checked against the maker's figures) + LMArena (33% · human preference, normalised Elo) + Terminal-Bench (33% · agentic terminal capability). If a model lacks one of the three, its weight is shared among those present (at least two are required).

② AA Index (Artificial Analysis Intelligence Index, 0-100) = a broader composite index (reasoning, science, code, maths). It gives a second reading: the ranking of makers changes depending on how you measure.

③ Cybersecurity (0-100) = capability on expert cyber tasks (hard «unguided pass@1» protocol: vuln-research and realistic exploitation). We use a non-saturated metric (the top is around 71, not 100, leaving headroom) rather than the Cybench «pass@k», which the frontier already saturates. Sources: UK AISI, NIST-CAISI, CVE-Bench. We frame it as capability and risk, not an offensive ranking; where there is no direct evaluation, the value is estimated and marked «est.».
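To illustrate why a «pass@k» score saturates while «pass@1» keeps headroom, here is a minimal sketch. It assumes that attempts are independent and share the same per-attempt success probability `p`, which is a simplification and not a description of how Cybench or UK AISI compute their scores. The function name `pass_at_k` and the sample probabilities are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds.

    Assumption: every attempt succeeds with the same probability p,
    independently of the others (a simplification for illustration).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Hypothetical per-attempt success rates, not measured values.
    for p in (0.3, 0.5, 0.7):
        row = ", ".join(f"k={k}: {pass_at_k(p, k):.3f}" for k in (1, 5, 10))
        print(f"p={p:.1f} -> {row}")
```

Under these assumptions, per-attempt rates of 0.3 and 0.7 both land above 0.97 at k = 10, so the gap between models nearly disappears, while at k = 1 it is fully visible.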

📈 Zendoric Quality over time (frontier makers)

Quality index (0-100) of the top makers (their best model), last 24 months. Dashed line = quality estimated from the AA Index (labs without SWE-bench-Pro). Updated daily.

Chart legend (makers): Anthropic, OpenAI, xAI, Zhipu AI, Alibaba, Moonshot AI, DeepSeek, Google, Microsoft, Mistral AI, Meta

📈 AA Index over time (frontier makers)

AA Index (Artificial Analysis Intelligence Index, 0-100) of the top makers (their best model), last 24 months. It is a broader composite index (reasoning, science, code, maths) than ours. The historical series is reconstructed by anchoring each maker's trajectory to its current AA. Updated daily.

Chart legend (makers): Anthropic, OpenAI, xAI, Zhipu AI, Alibaba, DeepSeek, Moonshot AI, Google, Microsoft, Mistral AI, Meta

🛡️ Cybersecurity over time (frontier makers)

Cybersecurity index (0-100) of each maker's best model, last 24 months. Metric: expert cyber tasks under a hard «unguided pass@1» protocol (no hints, one attempt; vuln-research and realistic exploitation). We pick it because it is not saturated: the top is around 71, not 100, so it discriminates and shows headroom (we drop Cybench «pass@k», where the frontier already scores ~100%). Sources: UK AISI (GPT-5.5 71.4% vs Anthropic preview 68.6%), NIST-CAISI, CVE-Bench. Confidence is high only for OpenAI and Anthropic (measured by AISI); the rest are imputed by proximity, so the whole series is marked «est.». We frame it as capability and risk to govern, not an offensive ranking. Updated daily.

Chart legend (makers): OpenAI, Anthropic, Zhipu AI, Google, Moonshot AI, Microsoft, Alibaba, xAI, DeepSeek, Mistral AI, Meta

💰 Zendoric Quality vs cost

Flagship models of the top makers by quality (a maker may have several, e.g. Anthropic: Opus 4.8 and Fable 5). HIGHER = more quality; LEFT = cheaper (log axis). Hollow dot = quality estimated (AA Index). Colour by maker.

Chart legend (makers): OpenAI, Anthropic, Google, DeepSeek, Alibaba, Moonshot AI, Zhipu AI, xAI, Mistral AI, Meta

💰 AA Index vs cost

Same format as the quality/cost chart, but the vertical axis is the AA Index. HIGHER = more capability; LEFT = cheaper (log axis). Hollow dot = AA estimated from Terminal-Bench/SWE-bench-Pro. Colour by maker.

Chart legend (makers): OpenAI, Anthropic, Google, Meta, xAI, Mistral AI, DeepSeek, Alibaba, Moonshot AI, Zhipu AI

🏆 Zendoric Quality (SW dev + arena + agentic)

Model	Maker	Zendoric Quality	SWE-bench-Pro	LMArena	Terminal-Bench	LiveCodeBench	GPQA	ARC-AGI-2
🇺🇸 Claude Fable 5	Anthropic · USA	90.1	80.3	1515	—	89.78	92.6	—
🇺🇸 GPT-5.6 Sol (preview)	OpenAI · USA	78.9	63.0	1470	88.8	—	87	—
🇨🇳 GLM-5.2	Zhipu AI · China	76.9	62.1	1475	81.0	82.8	78	7
🇺🇸 Claude Opus 4.8	Anthropic · USA	76.5	69.2	1455	82.7	88.8	84	14
🇺🇸 GPT-5.5	OpenAI · USA	76.3	58.6	1475	82.7	—	85	16
🇨🇳 Qwen3.7-Max	Alibaba · China	72.6	60.6	1475	69.7	91.6	81	7
🇺🇸 Claude Sonnet 5	Anthropic · USA	71.8	63.2	—	80.4	—	83	12
🇨🇳 Kimi K2.6	Moonshot AI · China	68.4	58.6	1460	66.7	89.6	78	9
🇨🇳 DeepSeek V4-Pro	DeepSeek · China	66.1	55.4	1450	67.9	93.5	82	9
🇺🇸 Gemini 3 Pro	Google · USA	65.8	43.3	1501	54.2	—	84	15
🇺🇸 Claude Sonnet 4.6	Anthropic · USA	62.0	—	1430	59.1	—	80	9
🇺🇸 MAI-1-preview	Microsoft · USA	49.4	52.8	—	46.0	87.7	84.2	—
🇺🇸 Claude Mythos 5	Anthropic · USA	—	80.0	—	—	—	—	—
🇺🇸 Llama 4 Maverick (llama-4-maverick-17b-128e-instruct)	Meta · USA	—	—	1370	—	43.4	70	5
🇺🇸 Grok 4.3	xAI · USA	—	—	1496	—	79.4	84	16
🇪🇺 Mistral Large 3	Mistral AI · Europe	—	—	1418	—	34.4	72	6
🇪🇺 Magistral Medium 1.2	Mistral AI · Europe	—	—	—	—	75.0	76.26	4

Quality = equal-thirds average of SWE-bench-Pro (SW development) + LMArena (human preference) + Terminal-Bench (agentic capability), the three benchmarks with reliable sources (Zendoric Quality). If one is missing, its weight is shared among those present (at least two are required; otherwise «—»). LiveCodeBench and GPQA are shown for reference (indicative, may be incomplete) but are not part of the index. ARC-AGI-2 (arcprize.org) tracks progress towards AGI: models score very low, so they remain far from AGI. All values are percentages, except LMArena (Elo).
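The weighting rule can be sketched in a few lines. This is a minimal illustration, not the site's actual implementation: the function name `zendoric_quality` is hypothetical, and because the text does not say how the LMArena Elo is normalised to 0-100, the sketch takes an already-normalised value as input.

```python
from typing import Optional


def zendoric_quality(
    swe_bench_pro: Optional[float],
    lmarena_normalised: Optional[float],
    terminal_bench: Optional[float],
) -> Optional[float]:
    """Average the available components (each on a 0-100 scale).

    A missing component (None) has its weight shared equally among the
    others, which is the same as averaging the ones that are present.
    Returns None (shown as a dash in the table) with fewer than two.
    """
    components = (swe_bench_pro, lmarena_normalised, terminal_bench)
    present = [score for score in components if score is not None]
    if len(present) < 2:
        return None
    return round(sum(present) / len(present), 1)


if __name__ == "__main__":
    # MAI-1-preview: no LMArena score, so the two remaining
    # benchmarks count half each.
    print(zendoric_quality(52.8, None, 46.0))  # 49.4, as in the table
    # Claude Mythos 5: only SWE-bench-Pro is available, so no index.
    print(zendoric_quality(80.0, None, None))  # None
```

The MAI-1-preview row is a convenient check because it has no LMArena score, so the result does not depend on the unspecified Elo normalisation.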

💵 Economics (USD / 1M tokens)

Model	Maker	Input	Cache	Output
🇺🇸 Claude Fable 5	Anthropic · USA	$10.0	$1.0	$50.0
🇺🇸 GPT-5.6 Sol (preview)	OpenAI · USA	$5.0	$0.5	$30.0
🇨🇳 GLM-5.2	Zhipu AI · China	$0.6	$0.26	$2.2
🇺🇸 Claude Opus 4.8	Anthropic · USA	$5.0	$0.5	$25.0
🇺🇸 GPT-5.5	OpenAI · USA	$5.0	$0.5	$30.0
🇨🇳 Qwen3.7-Max	Alibaba · China	$1.2	$0.25	$6.0
🇺🇸 Claude Sonnet 5	Anthropic · USA	$2.0 until Aug 31, 2026; $3.0 from Sep 1, 2026	$0.2 until Aug 31, 2026; $0.3 from Sep 1, 2026	$10.0 until Aug 31, 2026; $15.0 from Sep 1, 2026
🇨🇳 Kimi K2.6	Moonshot AI · China	$0.6	$0.16	$2.5
🇨🇳 DeepSeek V4-Pro	DeepSeek · China	$0.28	$0.03	$0.87
🇺🇸 Gemini 3 Pro	Google · USA	$1.25	$0.31	$10.0
🇺🇸 Claude Sonnet 4.6	Anthropic · USA	$3.0	$0.3	$15.0
🇺🇸 MAI-1-preview	Microsoft · USA	—	—	—
🇺🇸 Claude Mythos 5	Anthropic · USA	$10.0	$1.0	$50.0
🇺🇸 Llama 4 Maverick (llama-4-maverick-17b-128e-instruct)	Meta · USA	$0.2	—	$0.6
🇺🇸 Grok 4.3	xAI · USA	$3.0	$0.75	$15.0
🇪🇺 Mistral Large 3	Mistral AI · Europe	$2.0	—	$6.0
🇪🇺 Magistral Medium 1.2	Mistral AI · Europe	$0.5	—	$1.5

Claude Sonnet 5: scheduled price increase (same model) — reduced pricing until Aug 31, 2026 and standard pricing from Sep 1, 2026.
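As a rough guide to turning per-million-token prices into a bill, the sketch below prices one workload before and after the Claude Sonnet 5 change. It assumes that the «Cache» column is the price of input tokens served from cache and that those tokens are billed at that rate instead of the normal input rate; the table does not state this. The names `Price`, `request_cost` and `sonnet5_price`, and the token counts, are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

TOKENS_PER_UNIT = 1_000_000  # prices are quoted per 1M tokens


@dataclass(frozen=True)
class Price:
    input_usd: float   # uncached input tokens
    cached_usd: float  # cached input tokens (assumed meaning of "Cache")
    output_usd: float  # generated tokens


def request_cost(price: Price, input_tokens: int,
                 cached_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a workload, given token counts per category."""
    total = (input_tokens * price.input_usd
             + cached_tokens * price.cached_usd
             + output_tokens * price.output_usd)
    return total / TOKENS_PER_UNIT


def sonnet5_price(on: date) -> Price:
    """Claude Sonnet 5 price in effect on a given day, per the table."""
    if on <= date(2026, 8, 31):
        return Price(2.0, 0.2, 10.0)
    return Price(3.0, 0.3, 15.0)


if __name__ == "__main__":
    # Hypothetical workload: 200k fresh input, 800k cached, 50k output.
    workload = (200_000, 800_000, 50_000)
    for day in (date(2026, 8, 31), date(2026, 9, 1)):
        cost = request_cost(sonnet5_price(day), *workload)
        print(day.isoformat(), round(cost, 2))
```

For this workload the cost goes from about $1.06 to about $1.59, the same 50% increase that applies to each of the three prices.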

🔓 Open source & type

Model	Maker	Open source	License	Type
🇺🇸 Claude Fable 5	Anthropic · USA	No	Proprietary	Proprietary (API only)
🇺🇸 GPT-5.6 Sol (preview)	OpenAI · USA	No	Proprietary	Proprietary (API only)
🇨🇳 GLM-5.2	Zhipu AI · China	Yes	MIT	Open-weight
🇺🇸 Claude Opus 4.8	Anthropic · USA	No	Proprietary	Proprietary (API only)
🇺🇸 GPT-5.5	OpenAI · USA	No	Proprietary	Proprietary (API only)
🇨🇳 Qwen3.7-Max	Alibaba · China	No	Proprietary	Proprietary (API only)
🇺🇸 Claude Sonnet 5	Anthropic · USA	No	Proprietary	Proprietary (API only)
🇨🇳 Kimi K2.6	Moonshot AI · China	Yes	Modified MIT	Open-weight
🇨🇳 DeepSeek V4-Pro	DeepSeek · China	Yes	MIT	Open-weight
🇺🇸 Gemini 3 Pro	Google · USA	No	Proprietary	Proprietary (API only)
🇺🇸 Claude Sonnet 4.6	Anthropic · USA	No	Proprietary	Proprietary (API only)
🇺🇸 MAI-1-preview	Microsoft · USA	No	Proprietary	Proprietary (API only)
🇺🇸 Claude Mythos 5	Anthropic · USA	No	Proprietary	Proprietary (API only)
🇺🇸 Llama 4 Maverick (llama-4-maverick-17b-128e-instruct)	Meta · USA	Yes	Llama 4 Community	Open-weight
🇺🇸 Grok 4.3	xAI · USA	No	Proprietary	Proprietary (API only)
🇪🇺 Mistral Large 3	Mistral AI · Europe	Yes	Apache-2.0	Open-weight
🇪🇺 Magistral Medium 1.2	Mistral AI · Europe	Yes	Apache-2.0	Open-weight

🖥️ Open source you can self-host

Small/medium models you can run on your own machine (laptop/PC/Mac). Quality = Artificial Analysis Intelligence Index (0-100; output quality), the measure with the best coverage of small open models (LMArena does not list models under 32B parameters). Memory is estimated at 4-bit (Q4) and 8-bit (Q8) quantization; on Apple Silicon it is unified memory (RAM = VRAM).

Model	Maker	Quality (AA Index)	GPQA	Params	RAM Q4	RAM Q8	GPU	CPU / Mac	License
Qwen3.5-27B	Alibaba	42	85.5	27B	17 GB	32 GB	≥24 GB	Limited (better with GPU/Mac ≥32 GB)	Apache-2.0
Gemma 4 31B	Google	39	84.3	31B	18 GB	35 GB	≥24 GB	Limited (better with GPU/Mac ≥32 GB)	Gemma
Qwen3.5-35B-A3B	Alibaba	37	84.2	35B	21 GB	40 GB	≥24 GB	Limited (better with GPU/Mac ≥32 GB)	Apache-2.0
Gemma 4 26B A4B	Google	31	82.3	26B	15 GB	29 GB	≥16 GB	Limited (better with GPU/Mac ≥32 GB)	Gemma
NVIDIA Nemotron-Cascade-2-30B-A3B	NVIDIA	28	76.1	30B	18 GB	34 GB	≥24 GB	Limited (better with GPU/Mac ≥32 GB)	NVIDIA Open Model
gpt-oss-20b	OpenAI	24	71.5	20B	13 GB	25 GB	≥16 GB	Limited (better with GPU/Mac ≥32 GB)	Apache-2.0
Gemma 4 12B	Google	22	78.8	12B	8 GB	15 GB	≥8 GB	Yes (slow on CPU · Mac 16 GB)	Gemma
Gemma 4 E4B	Google	19	58.6	4B	6 GB	10 GB	≥8 GB	Yes (CPU/Mac, smooth)	Gemma
Gemma 4 E2B	Google	15	43.4	2B	4 GB	7 GB	≥8 GB	Yes (CPU/Mac, smooth)	Gemma

🗄️ Large open source (server / multi-GPU)

Powerful open models that need a server or multiple GPUs. Quality = LMArena Elo (human preference over output, source lmarena.ai), which does cover large models. For MoE, memory counts total parameters (all experts are loaded). Memory estimated at 4-bit (Q4) and 8-bit (Q8) quantization; on Apple Silicon it is UNIFIED memory (RAM=VRAM).

Model	Maker	Quality (LMArena)	GPQA	Params	RAM Q4	RAM Q8	GPU	CPU / Mac	License
DeepSeek-V4-Pro	DeepSeek	1465	90.1	1600B	882 GB	1762 GB	12× 80 GB (server)	No (GPU server)	MIT
Kimi K2.6	Moonshot AI	1460	90.5	1100B	552 GB	1102 GB	7× 80 GB (server)	No (GPU server)	Modified MIT
Qwen3.5-397B-A17B	Alibaba	1450	88.4	397B	220 GB	438 GB	3× 80 GB (server)	No (GPU server)	Apache-2.0
Llama 4 Maverick (llama-4-maverick-17b-128e-instruct)	Meta	1420	69.8	400B	222 GB	442 GB	3× 80 GB (server)	No (GPU server)	Llama 4 Community
Mistral Large 3	Mistral AI	1416	43.9	675B	373 GB	744 GB	5× 80 GB (server)	No (GPU server)	Apache-2.0
GLM-5.2	Zhipu AI	1360	91.2	744B	411 GB	820 GB	6× 80 GB (server)	No (GPU server)	MIT
gpt-oss-120b	OpenAI	1353	80.1	117B	66 GB	130 GB	≥80 GB	No (GPU server)	Apache-2.0
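The memory and GPU columns can be approximated with simple arithmetic. The sketch below is an inference from the table, not a documented formula. It assumes raw weight size (total parameters × bits ÷ 8) plus a 10% allowance and a fixed 2 GB, rounded down, and a GPU count equal to the Q4 figure divided by 80 GB, rounded up. These constants reproduce most rows above, though not all of them (the Kimi K2.6 memory figures do not match its listed parameter count). The function names are hypothetical.

```python
import math


def estimated_memory_gb(params_billion: float, bits: int,
                        overhead: float = 1.1,
                        fixed_gb: float = 2.0) -> int:
    """Rough memory needed to load a model at a quantization width.

    Raw weights take params * bits / 8 bytes, i.e. GB when params are
    given in billions. `overhead` and `fixed_gb` are assumed allowances
    inferred from the table, not values stated by the source.
    """
    raw_gb = params_billion * bits / 8
    return math.floor(raw_gb * overhead + fixed_gb)


def gpus_needed(memory_gb: float, gpu_gb: int = 80) -> int:
    """Smallest number of GPUs whose combined memory holds the model."""
    return math.ceil(memory_gb / gpu_gb)


if __name__ == "__main__":
    # Qwen3.5-397B-A17B: MoE, so all 397B parameters are counted.
    q4 = estimated_memory_gb(397, bits=4)
    q8 = estimated_memory_gb(397, bits=8)
    print(q4, q8, gpus_needed(q4))  # 220 438 3
```

Because MoE memory is counted on total parameters, Qwen3.5-397B-A17B needs three 80 GB GPUs at Q4 even though only 17B parameters are active per token.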