

Image by Author | Canva
# Introduction
Open-source AI is having a big moment. With advances in large language models, general machine learning, and now speech technologies, open-source models are rapidly narrowing the gap with proprietary systems. One of the most exciting entrants in this space is Microsoft's open-source voice stack, VibeVoice. This model family is designed for natural, expressive, and interactive conversation, rivaling the quality of top-tier commercial offerings.
In this article, we'll explore VibeVoice, download the model, and run inference on Google Colab using the GPU runtime. We'll also cover troubleshooting for common issues that may arise while running model inference.
# Introduction to VibeVoice
VibeVoice is a next-generation text-to-speech (TTS) framework for creating expressive, long-form, multi-speaker audio such as podcasts and dialogues. Unlike traditional TTS systems, it excels at scalability, speaker consistency, and natural turn-taking.
Its core innovation lies in continuous acoustic and semantic tokenizers operating at 7.5 Hz, paired with a large language model (Qwen2.5-1.5B) and a diffusion head that generates high-fidelity audio. This design enables up to 90 minutes of speech with four distinct speakers, surpassing prior systems.
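To get an intuition for what that 7.5 Hz framing buys, here's a quick back-of-the-envelope calculation in plain Python (an illustration only, assuming one acoustic token per tokenizer step):
# Rough sequence-length budget implied by 7.5 Hz acoustic framing (illustrative)
frame_rate_hz = 7.5
minutes = 90
frames = frame_rate_hz * 60 * minutes
print(f"{minutes} min of audio ≈ {frames:,.0f} acoustic frames")  # ≈ 40,500 frames
A sequence of roughly forty thousand tokens is compact enough for a 1.5B-parameter LLM to attend over, which helps explain how a single model pass can cover a 90-minute session.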
VibeVoice is available as an open-source model on Hugging Face, with community-maintained code for easy experimentation and use.


# Getting Started with VibeVoice-1.5B
In this guide, we'll learn how to clone the VibeVoice repository and run the demo by providing it with a text file to generate multi-speaker natural speech. It only takes around 5 minutes from setup to producing the audio.
// 1. Clone the community repository & install
First, clone the community version of the VibeVoice repository (vibevoice-community/VibeVoice), install the required Python packages, and also install the Hugging Face Hub library to download the model using the Python API.
Note: Before starting the Colab session, ensure your runtime type is set to T4 GPU.
!git clone -q --depth 1 https://github.com/vibevoice-community/VibeVoice.git /content/VibeVoice
%pip install -q -e /content/VibeVoice
%pip install -q -U huggingface_hub
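Before downloading the model weights, it's worth confirming that the GPU runtime is actually active. A quick check with PyTorch (which ships with Colab by default):
import torch

# Confirm that Colab's GPU runtime is active before downloading the weights
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))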
// 2. Download the model snapshot from Hugging Face
Download the model repository using the Hugging Face snapshot API. This will download all of the files from the microsoft/VibeVoice-1.5B repository.
from huggingface_hub import snapshot_download

snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="/content/models/VibeVoice-1.5B",
    local_dir_use_symlinks=False
)
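Optionally, verify that the snapshot landed in the expected folder; a small sketch using the same local_dir as above:
from pathlib import Path

# Sanity-check the downloaded snapshot by listing its files
model_dir = Path("/content/models/VibeVoice-1.5B")
for p in sorted(model_dir.iterdir()):
    print(p.name)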
// 3. Create a transcript with speakers
We'll create a text file inside Google Colab. For that, we'll use the %%writefile magic command to provide the content. Below is a sample conversation between two speakers about KDnuggets.
%%writefile /content/my_transcript.txt
Speaker 1: Have you read the latest article on KDnuggets?
Speaker 2: Yes, it's one of the best resources for data science and AI.
Speaker 1: I like how KDnuggets always keeps up with the latest trends.
Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.
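For longer dialogues, you may prefer building the transcript programmatically rather than with a magic command. A minimal sketch using the same Speaker N: format the demo script expects (shown with just the first two lines; running it would overwrite the file above, so treat it as an alternative):
# Write the same kind of transcript with plain Python
lines = [
    "Speaker 1: Have you read the latest article on KDnuggets?",
    "Speaker 2: Yes, it's one of the best resources for data science and AI.",
]
with open("/content/my_transcript.txt", "w") as f:
    f.write("\n".join(lines) + "\n")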
// 4. Run inference (multi-speaker)
Now, we'll run the demo Python script from the VibeVoice repository. The script requires the model path, text file path, and speaker names.
Run #1: Map Speaker 1 → Alice, Speaker 2 → Frank
!python /content/VibeVoice/demo/inference_from_file.py \
  --model_path /content/models/VibeVoice-1.5B \
  --txt_path /content/my_transcript.txt \
  --speaker_names Alice Frank
As a result, you will see the following output. The model will use CUDA to generate the audio, with Frank and Alice as the two speakers. It will also print a summary that you can use for analysis.
Using device: cuda
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
  1. Speaker 1
     Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
  2. Speaker 2
     Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
  3. Speaker 1
     Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
  4. Speaker 2
     Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....
Speaker mapping:
  Speaker 2 -> Frank
  Speaker 1 -> Alice
Speaker 1 ('Alice') -> Voice: en-Alice_woman.wav
Speaker 2 ('Frank') -> Voice: en-Frank_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
==================================================
GENERATION SUMMARY
==================================================
Input file: /content/my_transcript.txt
Output file: ./outputs/my_transcript_generated.wav
Speaker names: ['Alice', 'Frank']
Number of unique speakers: 2
Number of segments: 4
Prefilling tokens: 368
Generated tokens: 118
Total tokens: 486
Generation time: 28.27 seconds
Audio duration: 15.47 seconds
RTF (Real Time Factor): 1.83x
==================================================
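Note that the RTF line is simply generation time divided by audio duration, so values above 1.0 mean generation runs slower than real time:
# Real Time Factor from the summary above
generation_time_s = 28.27
audio_duration_s = 15.47
print(f"RTF: {generation_time_s / audio_duration_s:.2f}x")  # -> RTF: 1.83x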
Play the audio in the notebook:
We'll now use the IPython Audio function to listen to the generated audio inside Colab.
from IPython.display import Audio, display

out_path = "/content/outputs/my_transcript_generated.wav"
display(Audio(out_path))
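If you want to keep the clip beyond the Colab session, the google.colab files helper can save it to your machine:
from google.colab import files

# Download the generated WAV from the Colab VM to your local machine
files.download("/content/outputs/my_transcript_generated.wav")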


It took 28 seconds to generate the audio, and it sounds clear, natural, and smooth. I love it!
Let's try again with different voice actors.
Run #2: Try different voices (Mary for Speaker 1, Carter for Speaker 2)
!python /content/VibeVoice/demo/inference_from_file.py \
  --model_path /content/models/VibeVoice-1.5B \
  --txt_path /content/my_transcript.txt \
  --speaker_names Mary Carter
The generated audio was even better, with background music at the start and a smooth transition between speakers.
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
  1. Speaker 1
     Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
  2. Speaker 2
     Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
  3. Speaker 1
     Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
  4. Speaker 2
     Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....
Speaker mapping:
  Speaker 2 -> Carter
  Speaker 1 -> Mary
Speaker 1 ('Mary') -> Voice: en-Mary_woman_bgm.wav
Speaker 2 ('Carter') -> Voice: en-Carter_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
Tip: If you are unsure which names are available, the script prints "Available voices:" on startup.
Common ones include:
en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
# Troubleshooting
// 1. Repo Doesn’t Have Demo Scripts?
The official Microsoft VibeVoice repository has been pulled and reset. Community reports indicate that some code and demos were removed or are no longer available in their original location. If you find that the official repository is missing inference examples, check a community mirror or archive that has preserved the original demos and instructions: https://github.com/vibevoice-community/VibeVoice
// 2. Slow Generation or CUDA Errors in Colab
Verify that you are on a GPU runtime: Runtime → Change runtime type → Hardware accelerator: GPU (T4 or any available GPU).
// 3. CUDA OOM (Out of Memory)
To reduce the load, you can take several steps. Start by shortening the input text and reducing the generation length. Consider lowering the audio sample rate and/or adjusting internal chunk sizes if the script allows it. Set the batch size to 1 and opt for a smaller model variant.
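Between runs, it can also help to check how much GPU memory is actually free and to clear PyTorch's cache; a minimal sketch using standard torch utilities:
import torch

# Inspect free vs. total GPU memory and release cached blocks from a failed run
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"Free: {free / 1e9:.2f} GB / Total: {total / 1e9:.2f} GB")
    torch.cuda.empty_cache()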
// 4. No Audio or Missing Outputs Folder
The script typically prints the final output path in the console; scroll up to find the exact location, or search for it directly:
!find /content -name "*generated.wav"
// 5. Voice Names Not Found?
Copy the exact names listed under Available voices, or use the alias names (Alice, Frank, Mary, Carter) shown in the demo; they correspond to the .wav assets.
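If in doubt, list the bundled voice presets directly; a quick sketch using the clone path from earlier:
import os

# List the voice preset files shipped with the community repo
voices_dir = "/content/VibeVoice/demo/voices"
for name in sorted(os.listdir(voices_dir)):
    print(name)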
# Final Thoughts
For many projects, I would choose an open-source stack like VibeVoice over paid APIs for several compelling reasons. First, it's easy to integrate and offers flexibility for customization, making it suitable for a wide range of applications. Additionally, it's surprisingly light on GPU requirements, which can be a significant advantage in resource-constrained environments.
VibeVoice is open source, which means that in the future you can expect better frameworks that enable even faster generation, including on CPUs.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.