
Microsoft’s VibeVoice Breaks Ground with Long-Form Multi-Speaker Speech Synthesis
Microsoft’s VibeVoice is an open-source text-to-speech model that pairs continuous speech tokenization, which compresses audio into a compact sequence, with next-token diffusion to generate high-quality long-form speech: up to 90 minutes with as many as four speakers. It gives developers and creators a way to produce expressive multi-speaker audio content.
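
To give a rough sense of why compressing audio matters for 90-minute generation, the sketch below compares the number of raw audio samples with the number of compressed frames a model would have to generate. The 24 kHz sample rate and 7.5 Hz tokenizer frame rate are assumptions brought in for illustration; neither figure is stated in this article, and the function names are hypothetical.

```python
# Back-of-the-envelope sequence-length arithmetic for long-form speech.
# ASSUMPTIONS (not stated in this article): 24 kHz audio, 7.5 Hz tokenizer frame rate.
SAMPLE_RATE_HZ = 24_000   # assumed raw audio sample rate
FRAME_RATE_HZ = 7.5       # assumed frames per second after tokenization


def raw_samples(minutes: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw waveform samples in `minutes` of audio."""
    return round(minutes * 60 * sample_rate_hz)


def latent_frames(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of compressed frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


def compression_ratio(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """How many raw samples each compressed frame stands in for."""
    return sample_rate_hz / frame_rate_hz


if __name__ == "__main__":
    minutes = 90  # the maximum length quoted in the article
    print("raw samples:   ", raw_samples(minutes))
    print("latent frames: ", latent_frames(minutes))
    print("compression:   ", compression_ratio())
```

Under these assumed rates, a 90-minute recording shrinks from over a hundred million samples to a few tens of thousands of frames, which is the kind of sequence length a next-token model can plausibly handle in one pass.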