🎙️ Uzbek Voice AI

Real-time Uzbek-language voice assistant for farmers — TTS research, model fine-tuning & a streaming voice agent.

Text-to-Speech · VITS / MMS · Speech Recognition · LLM dialog · FastAPI + WebSocket · PyTorch · Kaggle GPU · Low-resource NLP

I built a live, streaming voice assistant that lets Uzbek-speaking farmers ask agriculture questions and get spoken answers — then ran a full research track to find (or build) the best Uzbek text-to-speech voice, since Uzbek is a low-resource language with very few options.

5 TTS models benchmarked
60 h studio corpus mined
7,080 fine-tune steps trained
<2 s target voice latency

🔊 Listen — Uzbek TTS demos

Same sentences synthesized by the best off-the-shelf model vs. my fine-tuned voice. All audio is generated from text — no human recording.

Agronomist greeting

«Ассалому алайкум, соғ-саломатмисиз? Мен агрономман, қандайсиз?» ("Peace be upon you, are you in good health? I am an agronomist, how are you?")

Best quality · MMS

My fine-tune · FeruzaSpeech

Wheat-rust disease

«Буғдой майдонида қўнғир занг касаллиги кўринди.» ("Brown rust disease has appeared in the wheat field.")

MMS

Fine-tune

Dosage with numbers

«Фунгицидни йигирма беш фоиз концентрацияда, бир гектарга икки литр солинг.» ("Apply the fungicide at twenty-five percent concentration, two liters per hectare.")

MMS

Fine-tune

🧭 The approach

Built the voice agent first: a streaming pipeline (speech-in → recognition → LLM → speech-out) with voice-activity detection, instant filler words, and barge-in (the user can interrupt), targeting under 2 s to first audio.
Benchmarked Uzbek TTS: Uzbek has almost no good open voices, so I tested 5 options and A/B-compared the audio.
Built a 60-hour training corpus: cloned and cleaned the FeruzaSpeech studio dataset, then automated end-to-end training on Kaggle GPUs via the Kaggle API.
Fine-tuned a VITS/MMS model on 5.2 h of clean single-speaker Uzbek (7,080 steps, ~4 h on a Tesla P100), debugging a deep stack of environment issues along the way (CUDA/GPU-arch mismatches, library version conflicts, GAN training quirks).
Measured honestly: the strong base model beat my fine-tune on limited data, which is a realistic and useful negative result.
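
The filler-word and barge-in behaviour from the first step can be sketched with asyncio. This is an illustrative sketch under stated assumptions, not the project's actual code: `fake_llm_tts`, `speak`, and `run_turn` are hypothetical names, the sleeps stand in for real model and playback latency, and the interruption is triggered by a timer where a real agent would react to its VAD.

```python
import asyncio
import time
from typing import Optional


async def fake_llm_tts(question: str):
    """Hypothetical stand-in for recognition -> LLM -> TTS; yields audio chunks."""
    await asyncio.sleep(0.05)  # pretend model latency before the first chunk
    for i in range(5):
        yield f"answer-chunk-{i}"
        await asyncio.sleep(0.02)  # pretend each chunk takes time to play


async def speak(question: str, played: list, timing: dict) -> None:
    """Emit an instant filler, then stream the answer chunk by chunk."""
    start = time.monotonic()
    played.append("filler")  # filler goes out before the model has answered
    timing["first_audio_s"] = time.monotonic() - start
    async for chunk in fake_llm_tts(question):
        played.append(chunk)


async def run_turn(question: str, barge_in_after: Optional[float]):
    """Run one turn; cancel playback if the user interrupts (barge-in)."""
    played: list = []
    timing: dict = {}
    task = asyncio.create_task(speak(question, played, timing))
    interrupted = False
    if barge_in_after is not None:
        await asyncio.sleep(barge_in_after)  # assumption: the VAD fires here
        if not task.done():
            task.cancel()  # stop speaking as soon as the user talks
            interrupted = True
    try:
        await task
    except asyncio.CancelledError:
        pass  # cancellation is the expected outcome of a barge-in
    return played, timing, interrupted


if __name__ == "__main__":
    played, timing, interrupted = asyncio.run(run_turn("question", barge_in_after=0.06))
    print(played, interrupted, f"{timing['first_audio_s']:.4f} s to first audio")
```

Running the turn as a cancellable task keeps the interruption path short: the agent does not wait for the current answer to finish before it starts listening again. The filler is emitted before any model call, which is what keeps time-to-first-audio low even when the LLM and TTS are slow.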

📊 Models evaluated

Model	Approach	Result
Meta MMS (Uzbek, Cyrillic)	Off-the-shelf VITS	✅ Best quality
MMS + FeruzaSpeech	My VITS fine-tune	◑ Adapted voice, slightly rougher
Chatterbox v3	LoRA fine-tune	✗ Poor (no Uzbek base)
Community Latin VITS	Off-the-shelf	✗ Garbled
Yandex SpeechKit «nigora»	Commercial API	✅ Production choice

🛠️ Tech stack

Python · PyTorch · HuggingFace Transformers · VITS / Meta MMS · Chatterbox · FastAPI + WebSockets · gRPC (Yandex SpeechKit STT/TTS v3) · Silero/energy VAD · Kaggle GPU automation (Kaggle API) · soundfile / datasets.
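
As a rough illustration of the energy-based half of the VAD entry above, here is a minimal sketch in plain Python. The `rms` and `energy_vad` helpers are hypothetical, and the threshold and hangover values are assumptions for the example, not the settings used in the project; Silero VAD is a neural model and works differently.

```python
import math
from typing import Iterable, List, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square energy of one frame of 16-bit PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(
    frames: Iterable[Sequence[int]],
    threshold: float = 500.0,  # assumed value; tune per microphone and noise floor
    hangover: int = 2,         # assumed value; quiet frames still counted as speech
) -> List[bool]:
    """Label each frame as speech (True) or silence (False).

    A frame counts as speech when its RMS exceeds `threshold`. After a loud
    frame, the next `hangover` quiet frames are still labelled speech so that
    short pauses inside a word do not end the utterance.
    """
    labels: List[bool] = []
    quiet_left = 0
    for frame in frames:
        if rms(frame) > threshold:
            quiet_left = hangover
            labels.append(True)
        elif quiet_left > 0:
            quiet_left -= 1
            labels.append(True)
        else:
            labels.append(False)
    return labels


if __name__ == "__main__":
    silence = [0] * 160            # 10 ms of silence at 16 kHz
    loud = [2000, -2000] * 80      # 10 ms of a loud square wave
    print(energy_vad([silence, loud, silence, silence, silence, silence]))
```

An energy gate like this is cheap enough to run on every incoming frame, which makes it a reasonable first-pass trigger for barge-in, with a neural VAD as the more robust check in noisy conditions.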

💡 Key takeaways

Right tool > biggest tool. A small purpose-built VITS beat a 0.5B-parameter voice-cloning model for a single-speaker language task.
Start from a model that already knows the language. Teaching Uzbek from zero failed; adapting an Uzbek-capable base was the workable path.
Licensing is a first-class constraint. The best open weights are non-commercial (CC-BY-NC) — production needs a commercial voice or owned data.

⚠️ Research/educational project. The open Uzbek datasets & models used here carry non-commercial licenses (CC-BY-NC / academic); the audio demos are for evaluation only.