Meituan's LongCat-Video Sets New Standard for Open AI Models

Chinese delivery giant Meituan has released LongCat-Video, a 13.6B-parameter video generation model that handles text-to-video, image-to-video, and video continuation in one system. Published on Hugging Face under an MIT license, it represents one of China's most significant open-source AI releases.

● In a recent post, Rohan Paul shared news that Meituan, often described as China's answer to DoorDash, has released LongCat-Video, a video generation model with 13.6 billion parameters. Released under the MIT license on Hugging Face, the model is turning heads in China's AI scene. It is a unified diffusion-based framework that handles Text-to-Video, Image-to-Video, and Video-Continuation in a single model.

● LongCat-Video is built on a Diffusion Transformer (DiT) architecture with 48 layers and a hidden width of 4096. The backbone combines 3D attention and cross-attention mechanisms, 3D rotary position embeddings, and RMSNorm. For compression it uses a Wan 2.1-style variational autoencoder (VAE), and it relies on umT5 for text encoding. The whole setup is designed to generate minutes-long, high-quality 720p videos at 30fps while staying temporally consistent, which is notoriously tricky in diffusion-based video work.
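
To make one of those building blocks concrete, here is a minimal sketch of RMSNorm in plain Python. It follows the standard textbook definition (divide by the root-mean-square of the vector, then apply a learned per-channel gain); the function name, the `eps` value, and the toy input are illustrative assumptions, not taken from Meituan's code.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Rescale x so its root-mean-square is ~1, then apply a per-channel gain.

    Generic RMSNorm definition; eps and the default gain of 1.0 are
    illustrative choices, not values from LongCat-Video.
    """
    if gain is None:
        gain = [1.0] * len(x)
    mean_sq = sum(v * v for v in x) / len(x)   # mean of squared activations
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)   # eps guards against division by zero
    return [v * inv_rms * g for v, g in zip(x, gain)]

# Toy activation vector (a real hidden state would have 4096 entries).
normed = rms_norm([3.0, 4.0])
print(normed)
```

Unlike LayerNorm, RMSNorm skips mean subtraction, which is one reason it is a popular choice in large transformer stacks.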

● On the performance side, Meituan's engineers developed a 3D Block Sparse Attention mechanism that cuts attention computation to less than 10% of what standard dense attention requires. The reported result is roughly 10× faster 720p inference on a single NVIDIA H800 GPU. Generation is coarse-to-fine: the model first renders a 480p, 15fps sequence, then upscales and refines it to 720p, 30fps using a LoRA-based refinement expert. It's a smart trade-off between speed and visual quality, and it helps avoid the flickering and motion drift that can plague long video sequences.
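
The idea behind block sparse attention can be sketched in a few lines of plain Python. This toy version groups tokens into blocks, scores each pair of blocks using mean-pooled queries and keys, and keeps only the top few key blocks per query block. The selection heuristic, the block sizes, and every name below are assumptions for illustration only; the article does not describe how LongCat-Video actually picks its blocks.

```python
import random

def mean_pool(vectors):
    """Average a list of equal-length vectors into one block summary."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_key_blocks(q_blocks, k_blocks, keep):
    """For each query block, keep the `keep` key blocks whose pooled summary
    scores highest against the pooled query; other block pairs are skipped."""
    q_summ = [mean_pool(b) for b in q_blocks]
    k_summ = [mean_pool(b) for b in k_blocks]
    selected = []
    for q in q_summ:
        scores = [dot(q, k) for k in k_summ]
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        selected.append(sorted(ranked[:keep]))
    return selected

def density(selected, n_key_blocks):
    """Fraction of (query block, key block) pairs that are actually computed."""
    kept = sum(len(row) for row in selected)
    return kept / (len(selected) * n_key_blocks)

# Toy setup: 64 blocks of 8 tokens, 16-dim tokens (all values illustrative).
rng = random.Random(0)

def make_blocks(n_blocks=64, tokens=8, dim=16):
    return [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
            for _ in range(n_blocks)]

q_blocks, k_blocks = make_blocks(), make_blocks()
selected = select_key_blocks(q_blocks, k_blocks, keep=6)
print(f"attention density: {density(selected, len(k_blocks)):.1%}")
```

Keeping 6 of 64 key blocks per query block means about 9.4% of block pairs are computed, which is the kind of budget the "less than 10%" figure implies. In a video model the blocks would be small 3D patches spanning time, height, and width rather than a flat list.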

● The training approach is equally interesting. The team used multi-reward GRPO (Group Relative Policy Optimization) reinforcement learning, which combines a frame quality scorer, a motion coherence evaluator, and a text-video match judge. Optimizing against several rewards at once helps boost realism and narrative flow while making it harder to fall into the "reward hacking" trap of overfitting to a single metric. On VBench 2.0, LongCat-Video scored 62.11% overall and 70.94% on the commonsense dimension, putting it roughly on par with leading closed-source models.
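
For readers new to GRPO, the core trick is to score each sample relative to the other samples generated for the same prompt, instead of training a separate value model. The sketch below shows one plausible way to fold three reward signals into a single group-relative advantage. The reward names, the equal weights, the toy scores, and the choice to normalize each reward before mixing are all assumptions for illustration, not details confirmed by Meituan.

```python
import statistics

def group_normalize(rewards, eps=1e-8):
    """GRPO-style advantage: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def multi_reward_advantages(reward_table, weights):
    """Normalize each reward within the group, then mix with fixed weights.

    reward_table maps a reward name to one score per sampled video.
    Normalizing per reward keeps a single large-scale scorer from dominating
    (an assumed design choice, not taken from the paper).
    """
    n_samples = len(next(iter(reward_table.values())))
    advantages = [0.0] * n_samples
    for name, rewards in reward_table.items():
        for i, a in enumerate(group_normalize(rewards)):
            advantages[i] += weights[name] * a
    return advantages

# Hypothetical scores for 4 videos sampled from the same prompt.
rewards = {
    "frame_quality":  [0.62, 0.70, 0.55, 0.81],
    "motion":         [0.40, 0.75, 0.52, 0.66],
    "text_alignment": [0.71, 0.68, 0.49, 0.77],
}
weights = {name: 1 / 3 for name in rewards}

adv = multi_reward_advantages(rewards, weights)
print([round(a, 3) for a in adv])
```

Samples with a positive advantage are reinforced and those with a negative one are discouraged. Because a video has to do well on all three scorers to come out on top, gaming any single reward pays off less.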

● Commenting on the launch, Vaibhav Srivastav pulled a quote from Meituan's paper:

"We introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across Text-to-Video, Image-to-Video, and Video-Continuation tasks."

● Meanwhile, Teortaxes▶️ called the release "Meituan's revenge on Alibaba," applauding the company for doing "serious research beyond me-too training" and noting how quickly they've ramped up their AI R&D game.

● With LongCat-Video, Meituan is staking its claim as a serious AI research player, going toe-to-toe with global leaders by releasing transparent, high-performance open-source models that balance efficiency, quality, and scale. It's a move that's reshaping how we think about China's role in the generative AI race.

Peter Smith - web3.0 projects expert and writer exploring the intersection of blockchain, AI, and online entertainment.