Recent benchmark tests put two major language models head-to-head on long-running tasks, and the results show clear differences in staying power. The comparison between Claude Opus 4.5 and GPT-5.1-Codex-Max tracked success rates across task lengths ranging from seconds to several days. Claude Opus 4.5 keeps its performance more consistent as tasks drag on, while GPT-5.1-Codex-Max starts losing steam faster once tasks stretch past the four-hour mark.
Looking at the data, Claude Opus 4.5 manages to hover around 40 percent success probability even when tasks run between eight and sixteen hours. Performance still drops as tasks get longer, as you would expect, but the decline is gradual. GPT-5.1-Codex-Max tells a different story: once tasks push beyond the one-to-four-hour range, its success rate takes a much sharper nosedive than its competitor's.
When tasks are short or medium-length, both models perform similarly near the top of the scale. The real separation happens during marathon sessions. The trend lines make it plain: Opus 4.5 hangs in there longer, while GPT-5.1-Codex-Max shows earlier signs of fatigue, suggesting it has a harder time maintaining consistency when workloads stretch out.
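One way to read these charts is to treat success probability as a curve over task length and compare where each curve sits at a given duration. The sketch below illustrates that idea only; it is not the benchmark itself. Every duration/success pair is a placeholder loosely shaped like the trends described above, and the `success_at` helper is purely hypothetical.

```python
import numpy as np

def success_at(duration_hours, durations, rates):
    # Linear interpolation of success probability over log task length.
    # Illustrative helper, not part of any published benchmark code.
    return float(np.interp(np.log(duration_hours), np.log(durations), rates))

# Placeholder curve points: (task length in hours, success rate).
# These numbers are assumptions for illustration, not measured results.
opus_durations  = np.array([0.1, 1.0, 4.0, 8.0, 16.0])
opus_rates      = np.array([0.95, 0.80, 0.60, 0.45, 0.40])
codex_durations = np.array([0.1, 1.0, 4.0, 8.0, 16.0])
codex_rates     = np.array([0.95, 0.78, 0.50, 0.30, 0.20])

for hours in (2, 8, 16):
    print(f"{hours:>2}h  Opus 4.5 ~{success_at(hours, opus_durations, opus_rates):.0%}  "
          f"Codex-Max ~{success_at(hours, codex_durations, codex_rates):.0%}")
```

Running it prints roughly where each assumed curve would sit at two, eight, and sixteen hours, which is the kind of readout the benchmark trend lines summarize.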
This matters because AI is increasingly being used for work that takes hours: complex coding projects, deep research dives, extended reasoning chains, and automation workflows that don't wrap up in minutes. Models that keep higher success rates over long stretches are naturally better fits for work that needs continuous context and reliability. As benchmarks start measuring endurance alongside raw capability, data like this could shape how people choose models for agent-style tasks and long-form applications.
Eseandre Mordi