⬤ Microsoft has released VibeVoice, a real-time text-to-speech system that starts producing audio in roughly 300 milliseconds. The company published the code under an MIT license, along with a WebSocket demo that shows how the system handles live audio generation. What makes this interesting is that it's built specifically for streaming conversations, not just for reading pre-written text aloud.
⬤ The model can generate up to 90 minutes of continuous speech, and it handles as many as four different speakers in one session. That's a big deal for anyone building voice assistants or conversational AI, since most systems degrade when pushed into longer, multi-speaker sessions. VibeVoice keeps turn-taking natural throughout extended back-and-forth exchanges.
⬤ Under the hood, the system speeds things up by compressing audio into semantic and acoustic tokens at about 7.5 Hz, meaning roughly 7.5 token frames per second of audio, instead of processing every single audio frame. A language model tracks the conversation flow while a diffusion component fills in the acoustic details. When text arrives as a stream, the system can start speaking within that 300-millisecond window, which is fast enough to feel responsive.
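To get a feel for what a 7.5 Hz token rate buys, here is a small back-of-the-envelope sketch. It only does the arithmetic implied by the figures above (7.5 Hz, 90 minutes). The 24 kHz sample rate is an assumption made for illustration, not a number from this article, and the function names are hypothetical.

```python
# Rough arithmetic on the figures quoted above; not VibeVoice code.
TOKEN_RATE_HZ = 7.5               # token frame rate quoted in the article
SESSION_MINUTES = 90              # maximum session length quoted in the article
ASSUMED_SAMPLE_RATE_HZ = 24_000   # ASSUMPTION for illustration; not stated in the article


def token_frames(duration_seconds: float, rate_hz: float = TOKEN_RATE_HZ) -> float:
    """Number of token frames needed to cover the given duration of audio."""
    return duration_seconds * rate_hz


def samples_per_frame(sample_rate_hz: float, rate_hz: float = TOKEN_RATE_HZ) -> float:
    """How many raw audio samples a single token frame stands in for."""
    return sample_rate_hz / rate_hz


if __name__ == "__main__":
    frames = token_frames(SESSION_MINUTES * 60)
    print(f"{SESSION_MINUTES} minutes at {TOKEN_RATE_HZ} Hz: {frames:.0f} token frames")
    # prints "90 minutes at 7.5 Hz: 40500 token frames"
    per_frame = samples_per_frame(ASSUMED_SAMPLE_RATE_HZ)
    print(f"Samples per frame at the assumed sample rate: {per_frame:.0f}")
```

Under these assumptions, a full 90-minute session comes to about 40,500 token frames, which is why a language model can plausibly keep a whole conversation in view at this rate.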
⬤ Microsoft's decision to open-source VibeVoice signals a serious push on real-time AI audio. By releasing it for research, the company is handing developers and researchers a capable tool for experimenting with multi-speaker voice tech, something the AI communication space has been missing.
Saad Ullah