Microsoft Open-Sources VibeVoice: 60-Minute Audio Processing and 300ms Voice Output

Microsoft has open-sourced VibeVoice, a voice AI system that handles both long-form audio and real-time speech tasks. It supports structured transcription, more than 50 languages, and low-latency voice generation.

Contents

VibeVoice ASR: 60-Minute Audio in a Single Pass, 50+ Languages
Real-Time TTS with 300ms Latency and 90-Minute Multi-Speaker Output
What VibeVoice Supports: Key Capabilities at a Glance

Microsoft has introduced VibeVoice, an open-source voice AI framework that combines automatic speech recognition (ASR) and text-to-speech (TTS) into a single pipeline. As reported by Hasan Toor, the system is built to handle long-form audio inputs and return structured outputs - including speaker identification, timestamps, and full transcripts - without requiring users to stitch together multiple tools.

VibeVoice ASR: 60-Minute Audio in a Single Pass, 50+ Languages

The ASR component of VibeVoice can process up to 60 minutes of audio in one pass, producing structured outputs that capture who spoke, when, and exactly what was said. It supports more than 50 languages and eliminates the need for audio chunking - a common workaround that tends to introduce inconsistencies across longer recordings.
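To make "who spoke, when, and exactly what was said" concrete, here is a minimal sketch of how such a structured transcript could be represented and rendered. The `Segment` schema and the helper names are assumptions made for illustration. They are not VibeVoice's actual output format or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One diarized span of speech (hypothetical schema, not VibeVoice's)."""
    speaker: str   # who spoke
    start: float   # when it began, in seconds from the start of the recording
    end: float     # when it ended, in seconds
    text: str      # what was said


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, truncating fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Produce a readable transcript, one line per segment, in time order."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{format_timestamp(s.start)}-{format_timestamp(s.end)}] {s.speaker}: {s.text}"
        for s in ordered
    )


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Total speaking time per speaker, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for s in segments:
        totals[s.speaker] += s.end - s.start
    return dict(totals)
```

Because a single-pass transcript shares one timeline and one set of speaker labels, helpers like `talk_time` can aggregate over the whole recording without reconciling labels across separately processed chunks.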

For a point of comparison, developers can also look at Alibaba Tongyi Lab's Qwen3-TTS with 3-second voice cloning, which represents a parallel push in speech AI.

Real-Time TTS with 300ms Latency and 90-Minute Multi-Speaker Output

Beyond transcription, VibeVoice ships with real-time text-to-speech capabilities that reach approximately 300 milliseconds to first audio output - fast enough for near-instant voice generation in live applications.
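"Time to first audio" is the delay between requesting speech and receiving the first playable chunk. The sketch below shows one way to measure it against any streaming source. The function name, the injectable `clock` parameter, and the stand-in generator `fake_tts_stream` are all assumptions for illustration, and nothing here calls VibeVoice itself.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def time_to_first_audio_ms(
    stream: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Return milliseconds until the first non-empty chunk, or None if none arrives."""
    start = clock()
    for chunk in stream:
        if chunk:  # skip empty chunks; only real audio bytes count
            return (clock() - start) * 1000.0
    return None


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a streaming TTS engine (hypothetical): waits, then yields audio."""
    time.sleep(delay_s)          # simulated model warm-up before the first chunk
    yield b"\x00\x01" * 160      # first chunk of placeholder audio bytes
    yield b"\x00\x01" * 160      # later chunks do not affect the measurement


if __name__ == "__main__":
    latency = time_to_first_audio_ms(fake_tts_stream())
    print(f"time to first audio: {latency:.0f} ms")
```

In a real measurement the clock should start when the request is issued, so the stream object should be lazy, as a generator is here, or be created inside the timed region.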

The broader framework supports long-form multi-speaker speech synthesis, with models capable of generating up to 90 minutes of conversational audio in a single pass.

The open-source release details are covered in full in Microsoft's VibeVoice open-source voice AI announcement.

What VibeVoice Supports: Key Capabilities at a Glance

ASR with up to 60 minutes of audio per pass
Speaker diarization and timestamping in a unified pipeline
Support for 50+ languages without chunking
Real-time TTS with approximately 300ms to first audio
Multi-speaker synthesis for up to 90 minutes of continuous audio
Open-source release for developer integration

The launch reflects a broader industry shift toward scalable, long-context AI systems as demand for voice interfaces grows across enterprise and consumer products.

Advances in processing duration, latency, and multilingual support are becoming central to next-generation AI infrastructure - and open-source releases like this one accelerate adoption across the industry.

That trend runs alongside major infrastructure investments like OpenAI's $500B Stargate project, underscoring how efficient, large-scale AI deployment has moved to the center of the competitive landscape.


Saad Ullah - engineer and writer passionate about AI, blockchain, and the disruptive technologies driving fintech innovation.