⬤ Ant Group just dropped LLaDA2.0, a new framework designed to push diffusion-based language models into frontier territory. The paper "LLaDA2.0: Scaling Up Diffusion Language Models to 100B" lays out a three-phase training scheme that takes existing auto-regressive models and converts them into discrete diffusion models. The approach keeps the knowledge already baked into those auto-regressive models while opening up new efficiencies in both training and inference.
⬤ What makes LLaDA2.0 interesting is how it sidesteps one of the biggest headaches in diffusion language modeling—training massive models from scratch. Instead of starting over, the method grabs pre-trained auto-regressive models and gradually shifts them into diffusion-based architectures through staged training. This lets models hang onto their original capabilities while unlocking parallel decoding, which can seriously cut down inference time compared to the old sequential token-by-token generation.
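To make the speed-up concrete, here is a minimal toy sketch (not from the paper) contrasting sequential token-by-token decoding with diffusion-style parallel decoding over masked positions. The `toy_model` function, the confidence-based unmasking rule, and all parameters are hypothetical stand-ins chosen purely for illustration; LLaDA2.0's actual decoding procedure is described in the paper itself.

```python
# Toy sketch: counts how many forward passes each decoding style needs.
# `toy_model` is a hypothetical stand-in; a real model would return logits.
import random

VOCAB = list("abcdefghij")
MASK = "_"

def toy_model(tokens):
    # One "forward pass": returns a (token, confidence) guess for every
    # masked position in the sequence.
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def sequential_decode(length):
    # Auto-regressive style: one forward pass per generated token.
    tokens, passes = [], 0
    for _ in range(length):
        guess = toy_model(tokens + [MASK])
        tokens.append(guess[len(tokens)][0])
        passes += 1
    return tokens, passes

def parallel_decode(length, steps):
    # Diffusion style: start fully masked, commit the most confident
    # predictions each step, so many tokens are filled per forward pass.
    tokens, passes = [MASK] * length, 0
    per_step = max(1, length // steps)
    while MASK in tokens:
        guesses = toy_model(tokens)
        passes += 1
        best = sorted(guesses.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _) in best[:per_step]:
            tokens[pos] = tok
    return tokens, passes

if __name__ == "__main__":
    _, ar_passes = sequential_decode(32)
    _, diff_passes = parallel_decode(32, steps=4)
    print(f"sequential passes: {ar_passes}, parallel passes: {diff_passes}")
```

With these toy settings the sequential decoder needs 32 forward passes for 32 tokens, while the parallel decoder finishes in 4, which is the basic intuition behind the inference-time savings the article describes.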
⬤ Ant Group rolled out two models using this approach: LLaDA2.0-mini with 16 billion parameters and LLaDA2.0-flash scaling up to 100 billion parameters. Both models reportedly beat earlier diffusion-based versions in performance and efficiency at similar scales. Getting diffusion modeling to work at the 100B parameter level is a big deal, since diffusion approaches have traditionally been tougher to scale efficiently compared to auto-regressive architectures.
⬤ This matters because scaling efficiency has become the name of the game in large language model development. Methods that reuse existing models, reduce inference latency, and improve computational efficiency directly impact how these massive systems get deployed and maintained in the real world. By showing a clear path to frontier-scale diffusion language models, LLaDA2.0 adds fuel to the ongoing experimentation with alternative architectures beyond traditional auto-regressive designs.
Artem Voloskovets