A fresh wave of benchmark results is reshaping the AI landscape. Alibaba's Qwen3 family of large language models has claimed top spots across five leading performance tests, beating both Anthropic's Claude Opus 4 and DeepSeek-V3.1 in key areas. The results show Qwen3-Max and Qwen3-235B-A22B pushing Alibaba into the global top tier, delivering one of the strongest performances ever seen from a non-US model family.
Benchmark Results: Qwen3-Max Leads Across the Board
Recent data shared by DeepLearning.AI shows how Qwen3-Max-Instruct-Preview consistently outperforms its competitors across five major benchmark suites.
The comparison includes leading models from Anthropic, DeepSeek, and other top developers, measuring reasoning, coding skill, and overall task performance:
- SuperGPQA: Qwen3-Max scores 64.6, ahead of Qwen3-235B's 62.6 and DeepSeek-V3.1's 59.8
- AIME25: Qwen3-Max dominates with 80.6, far surpassing both Claude Opus 4 and DeepSeek at 49.8
- LiveCodeBench v6: Qwen3-Max posts 57.5, the strongest coding performance in the group
- Arena-Hard v2: Qwen3-Max achieves 86.1, well above DeepSeek's 61.5 and Kimi K2's 66.1
- LiveBench (2024): Qwen3-Max scores 79.3, ahead of Claude at 74.6 and DeepSeek at 71.3
The Qwen3-235B-A22B model follows closely behind, suggesting that Alibaba's open-weight models can compete directly with closed systems from companies like Anthropic and OpenAI.
What Makes Qwen3 Different
The Qwen3 lineup represents one of the most complete AI ecosystems outside the United States. Qwen3-Max is a trillion-parameter Mixture-of-Experts model built for cost-efficient inference with a 262k-token context window.
Qwen3-VL-235B-A22B handles text, images, and video with a context length reaching up to a million tokens. Meanwhile, Qwen3-Omni-30B-A3B leads 22 of 36 audio and audiovisual benchmarks globally. Together, these models combine massive computational efficiency with cutting-edge multimodal capabilities, offering researchers and developers a powerful alternative to Western systems.
A Shift in Global AI Power
For years, companies like OpenAI and Anthropic have dominated performance charts. Qwen3's latest scores suggest that China's AI research is now matching—and in several areas surpassing—its Western counterparts. The most striking result comes from AIME25, a challenging mathematical reasoning benchmark previously led by models like GPT-4 Turbo.
Qwen3-Max's score of 80.6 sets a new standard for efficiency-focused training. As multimodal and long-context capabilities converge, performance per compute dollar is becoming the new measure of success, and Alibaba appears to be optimizing precisely for that metric. These results hint at a broader trend where the global AI race is shifting from raw scale to strategic intelligence.
Alex Dudov