⬤ Researchers from Xiaomi and Tsinghua University have introduced a new way to train AI systems that helps reasoning models become more reliable and adaptable. The method, called Curriculum Advantage Policy Optimization (CAPO), uses reinforcement learning in a two-step process: it first shows models what correct responses look like, then gradually introduces examples of what to avoid. This approach establishes a stable foundation before the system encounters outputs of varying quality.
⬤ CAPO stands out because it works with existing reinforcement learning frameworks such as PPO, GRPO, or RLOO. Teams can integrate it into their current setups without rebuilding anything. The method generates outputs from the model, evaluates their quality, and separates them into positive and negative groups. In the initial training phase, CAPO uses only positive examples. This allows the model to learn preferred behavior without the confusion caused by conflicting signals. Once that baseline is secure, negative examples are introduced to refine the model's ability to distinguish strong responses from weak ones.
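To make the two-phase idea concrete, here is a minimal sketch of how such a curriculum could be layered on top of a GRPO-style group advantage computation. The function name `capo_advantages`, the `switch_step` schedule, and the hard zeroing of negative advantages are illustrative assumptions for this sketch, not the authors' exact formulation.

```python
import numpy as np

def capo_advantages(rewards, step, switch_step=1000):
    """Sketch of a curriculum over advantages for one group of sampled responses.

    rewards: per-response scalar rewards for rollouts from the same prompt.
    step:    current training step; before `switch_step` only positive
             (better-than-average) responses contribute to the policy update.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # GRPO-style group-normalized advantages: compare each response to the group mean.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    if step < switch_step:
        # Phase 1 (assumed schedule): keep only positive advantages, so the model
        # first reinforces what good responses look like, without penalty signals.
        adv = np.where(adv > 0, adv, 0.0)
    # Phase 2: after the switch, negative advantages are kept as well,
    # sharpening the contrast between strong and weak responses.
    return adv

# Early in training, only above-average rollouts receive a learning signal.
print(capo_advantages([1.0, 0.0, 1.0, 0.0], step=100))   # negatives masked to 0
print(capo_advantages([1.0, 0.0, 1.0, 0.0], step=5000))  # full advantages retained
```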
⬤ A frequent issue in AI training is that early negative feedback can disrupt learning and lead to unpredictable behavior. CAPO's step-by-step approach lowers this risk while keeping progress steady throughout training. The technique performs especially well in mathematical reasoning as well as in complex interface tasks, where consistency is as important as accuracy.
⬤ As AI systems handle more complex real-world tasks, training methods that balance stability with performance become crucial. CAPO's plug-and-play design can speed up both research and deployment, giving teams a simple way to improve model reliability without replacing their entire training setup.
Peter Smith