I built a live, streaming voice assistant that lets Uzbek-speaking farmers ask
agriculture questions and get spoken answers. I then ran a full research track to find (or build)
the best Uzbek text-to-speech voice, since Uzbek is a low-resource language with very few options.
5 TTS models benchmarked
60 h studio corpus mined
7,080 fine-tune steps trained
<2 s target voice latency
🔊 Listen — Uzbek TTS demos
Same sentences synthesized by the best off-the-shelf model vs. my fine-tuned voice.
All audio is generated from text — no human recording.
Agronomist greeting
«Ассалому алайкум, соғ-саломатмисиз? Мен агрономман, қандайсиз?»
(English: "Hello, are you in good health? I am an agronomist. How are you?")
Best quality · MMS
My fine-tune · FeruzaSpeech
Wheat-rust disease
«Буғдой майдонида қўнғир занг касаллиги кўринди.»
(English: "Brown rust disease has appeared in the wheat field.")
MMS · Fine-tune
Dosage with numbers
«Фунгицидни йигирма беш фоиз концентрацияда, бир гектарга икки литр солинг.»
(English: "Apply the fungicide at twenty-five percent concentration, two liters per hectare.")
MMS · Fine-tune
🧭 The approach
Built the voice agent first. A streaming pipeline (speech-in → recognition → LLM →
speech-out) with voice-activity detection, instant filler words, and barge-in (the user can
interrupt), targeting under 2 s to first audio.
Benchmarked Uzbek TTS. Uzbek has almost no good open voices, so I tested 5 options
and A/B-compared the audio.
Built a 60-hour training corpus. I cloned and cleaned the FeruzaSpeech studio dataset
and automated end-to-end training on Kaggle GPUs via the API.
Fine-tuned a VITS/MMS model on 5.2 h of clean single-speaker Uzbek (7,080 steps,
~4 h on a Tesla P100), debugging a deep stack of environment issues (CUDA/GPU-arch
mismatches, library version conflicts, GAN training quirks).
Measured honestly — the strong base model actually beat my fine-tune on limited data,
a realistic and useful negative result.
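The barge-in behaviour in the pipeline above can be illustrated with a small turn-taking state machine. This is a minimal sketch, not the project's actual code: it assumes 16-bit little-endian mono PCM frames and substitutes a plain RMS energy threshold for a real voice-activity detector. The names `BargeInDetector`, `rms`, and `constant_frame`, and the default threshold and frame count, are hypothetical and would need tuning for a real microphone.

```python
import math
import struct
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()  # waiting for or capturing user speech
    SPEAKING = auto()   # assistant audio is playing


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit little-endian mono PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def constant_frame(amplitude: int, n: int = 320) -> bytes:
    """Synthetic frame of n identical samples, used only for demonstration."""
    return struct.pack(f"<{n}h", *([amplitude] * n))


class BargeInDetector:
    """Signals an interruption once speech persists for several frames."""

    def __init__(self, threshold: float = 500.0, min_speech_frames: int = 3):
        self.threshold = threshold                  # assumed energy threshold
        self.min_speech_frames = min_speech_frames  # debounce against clicks
        self.state = State.LISTENING
        self._run = 0                               # consecutive loud frames

    def start_speaking(self) -> None:
        """Call when assistant playback begins."""
        self.state = State.SPEAKING
        self._run = 0

    def feed(self, frame: bytes) -> bool:
        """Return True on the frame where the user barges in on playback."""
        self._run = self._run + 1 if rms(frame) >= self.threshold else 0
        if self.state is State.SPEAKING and self._run >= self.min_speech_frames:
            self.state = State.LISTENING  # caller should stop playback here
            self._run = 0
            return True
        return False
```

In use, the playback loop would call `start_speaking()` when the first audio chunk is sent and stop the output stream as soon as `feed()` returns `True`. Requiring several consecutive loud frames trades a few tens of milliseconds of reaction time for fewer false interruptions from short noises.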
Right tool > biggest tool. A small purpose-built VITS beat a 0.5B-parameter voice-cloning
model for a single-speaker language task.
Start from a model that already knows the language. Teaching Uzbek from zero failed;
adapting an Uzbek-capable base was the workable path.
Licensing is a first-class constraint. The best open weights are non-commercial
(CC-BY-NC), so production use needs a commercially licensed voice or owned data.
⚠️ Research/educational project. The open Uzbek datasets and models used here are under
non-commercial licenses (CC-BY-NC or academic-use); the audio demos are for evaluation only.