Fifty-six researchers from 32 universities across the US, China, and the UK have built something the AI field hasn't seen before: a video reasoning benchmark with roughly 2.015 million samples. They called it the Very Big Video Reasoning Suite - VBVR for short - and the results it's producing are hard to ignore.
The dataset covers five core cognitive areas: spatiality, transformation, knowledge, abstraction, and perception. At around 1,000 times the size of existing video reasoning collections, it's less a test and more a stress test - one designed to find out whether today's AI actually understands how physical objects behave, or whether it's just very good at faking it. Broader shifts in how AI processes and reasons about the world are explored in "Agent World Model Points to Next AI Architecture Shift."
AI Scores 54% Where Humans Score 97% - a Gap No One Can Ignore
When top commercial AI systems sat down with VBVR, they averaged about 54% accuracy. Human participants scored above 97%. That's not a small gap - it's a fundamental one.
Many video generation systems focus primarily on visual realism while lacking robust understanding of spatial rules and causality.
Training an open model directly on the VBVR dataset pushed scores up, but not nearly enough to close the distance. The benchmark evaluates models across multiple cognitive dimensions and tests how performance shifts between in-domain and out-of-domain conditions - exactly the kind of real-world variation that trips up systems optimized for controlled settings.
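For readers who want a concrete picture of that kind of breakdown, here is a minimal Python sketch of how per-category accuracy and an in-domain versus out-of-domain gap could be computed. It is not the VBVR authors' evaluation code: the record layout, the split names, and the toy results are all hypothetical, and only the five category names come from the benchmark description.

```python
from collections import defaultdict

# Hypothetical toy records: (category, split, is_correct).
# These are illustrative values, not real VBVR results.
RESULTS = [
    ("spatiality", "in_domain", True),
    ("spatiality", "out_of_domain", False),
    ("transformation", "in_domain", True),
    ("transformation", "out_of_domain", True),
    ("knowledge", "in_domain", False),
    ("knowledge", "out_of_domain", False),
    ("abstraction", "in_domain", True),
    ("abstraction", "out_of_domain", False),
    ("perception", "in_domain", True),
    ("perception", "out_of_domain", True),
]


def accuracy_by(records, key_index):
    """Return {group: fraction correct}, grouping on the given tuple index."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key_index]
        total[group] += 1
        correct[group] += int(record[2])
    return {group: correct[group] / total[group] for group in total}


# Accuracy per cognitive category (index 0) and per evaluation split (index 1).
per_category = accuracy_by(RESULTS, 0)
per_split = accuracy_by(RESULTS, 1)

# A positive gap means the model does worse on unfamiliar conditions.
domain_gap = per_split["in_domain"] - per_split["out_of_domain"]

if __name__ == "__main__":
    for category, acc in per_category.items():
        print(f"{category}: {acc:.0%}")
    print(f"in-domain vs out-of-domain gap: {domain_gap:.0%}")
```

A large positive gap in a table like this would suggest a model has learned the training conditions rather than the underlying rules.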
What VBVR Means for the Future of Video AI Development
The challenge connects to deeper questions about AI memory and efficiency discussed in "AI Memory Evolution: 10x Efficiency Gains as RAG Systems Become Obsolete." What VBVR makes clear is that neither scaling model size nor improving visual fidelity will fix causal reasoning by itself.
As video models continue competing on realism and performance - as seen in "Lightricks LTX2 Tops Open Video Model Rankings With 1198 ELO Score" - benchmarks like VBVR may increasingly define what "capable" actually means. The next generation of AI won't just need to look smart. It'll need to reason that way, too.
Usman Salis