OpenAI Rolls Out GPT-5.4 Thinking With 92.8% on GPQA Diamond

OpenAI has launched GPT-5.4 Thinking and GPT-5.4 Pro across ChatGPT, the API, and Codex, posting standout scores on reasoning, coding, and computer-use benchmarks - and taking direct aim at rivals like Claude Opus 4.6 and Gemini 3.1 Pro.

OpenAI has begun rolling out GPT-5.4 Thinking and GPT-5.4 Pro across its full ecosystem - ChatGPT, the OpenAI API, and Codex. The new models combine advances in reasoning, coding performance, and agent-style workflows into a single frontier system, marking another step in the company's push toward AI capable of handling complex, multi-step tasks autonomously. The release arrives as the broader AI industry accelerates, with major labs competing not just on chat quality but on real-world task execution - the kind that involves clicking through websites, writing production code, and navigating operating systems.

75.0% on OSWorld, 83.0% on GDPval - the Numbers Behind the Hype

Benchmark results released alongside the announcement show where GPT-5.4 stands relative to the field. On OSWorld-Verified, a computer-use benchmark that tests how well a model interacts with real software environments, GPT-5.4 Thinking scored 75.0%. On WebArena-Verified, which evaluates web browsing task completion, it reached 67.3%. These benchmarks matter because they measure something beyond language fluency: whether a model can complete tasks in the same interfaces people use every day.

The model was also tested on knowledge work, software engineering, and scientific reasoning. GPT-5.4 Thinking posted 83.0% on GDPval, OpenAI's evaluation of real-world professional knowledge-work tasks, and 57.7% on SWE-Bench Pro, a software engineering benchmark that requires solving real GitHub issues. Most notably, it reached 92.8% on GPQA Diamond, an expert-level reasoning benchmark designed to challenge even domain specialists.

Pro Version Leads on Agentic Tasks as Competition With Claude and Gemini Heats Up

The Pro version scored higher still on GPQA Diamond, at 94.4% against 92.8% for GPT-5.4 Thinking, and posted 89.3% on BrowseComp, an agentic browsing benchmark. These figures place GPT-5.4 directly alongside Anthropic's Claude Opus 4.6 and Google's Gemini 3.1 Pro in the comparison tables that now shape how developers and enterprises choose AI platforms.
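For readers who want to line up the reported figures, the scores quoted in this article can be collected into a small lookup table. The sketch below is illustrative only: the dictionary layout and the helper name `gap_in_points` are assumptions made for this example, and the numbers are simply those reported in the article, not independently verified results.

```python
# Scores in percent, as reported in this article.
# The data structure itself is a hypothetical convenience, not an official format.
REPORTED_SCORES = {
    "GPT-5.4 Thinking": {
        "OSWorld-Verified": 75.0,
        "WebArena-Verified": 67.3,
        "GDPval": 83.0,
        "SWE-Bench Pro": 57.7,
        "GPQA Diamond": 92.8,
    },
    "GPT-5.4 Pro": {
        "GPQA Diamond": 94.4,
        "BrowseComp": 89.3,
    },
}


def gap_in_points(benchmark, model_a, model_b, scores=REPORTED_SCORES):
    """Return model_a's score minus model_b's in percentage points.

    Returns None when the article does not report the benchmark for both models,
    so a missing figure is never silently treated as zero.
    """
    a = scores.get(model_a, {}).get(benchmark)
    b = scores.get(model_b, {}).get(benchmark)
    if a is None or b is None:
        return None
    # Round to one decimal place to match the precision of the reported scores.
    return round(a - b, 1)


if __name__ == "__main__":
    for benchmark in ("GPQA Diamond", "BrowseComp"):
        gap = gap_in_points(benchmark, "GPT-5.4 Pro", "GPT-5.4 Thinking")
        if gap is None:
            print(f"{benchmark}: no like-for-like comparison in the article")
        else:
            print(f"{benchmark}: Pro leads Thinking by {gap} points")
```

On the figures quoted here, only GPQA Diamond allows a direct Pro-versus-Thinking comparison, a gap of 1.6 percentage points. The article gives no BrowseComp score for GPT-5.4 Thinking, so the helper reports that benchmark as not comparable.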

The rollout reflects a broader shift in how AI models are evaluated and deployed. Raw chat quality is no longer the primary differentiator - what matters increasingly is how well a model performs in coding pipelines, research workflows, and enterprise automation. GPT-5.4's expanded API access signals that OpenAI is positioning the model not just as a consumer product but as infrastructure for software teams and AI-native businesses building on top of its platform.

Eseandre Mordi

Eseandre Mordi - writer covering crypto, blockchain, and AI with a global perspective and a strong voice for women in tech.