Link: https://github.com/acatovic/ova
Models used:
- ASR: NVIDIA parakeet-tdt-0.6b-v3 (600M)
- LLM: Mistral Ministral-3 3B (4-bit quantized)
- TTS (simple): Hexgrad Kokoro (82M)
- TTS (with voice cloning): Qwen3-TTS
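For reference, loading these models in Python could look roughly like the sketch below. This is an assumption, not the repo's actual code: it assumes NVIDIA NeMo for Parakeet, llama-cpp-python with a 4-bit GGUF for Ministral, and the `kokoro` package for Kokoro; the voice-cloning path (Qwen3-TTS) is omitted, and the GGUF filename is hypothetical.

```python
# Rough sketch of loading the models listed above; the repo may use a
# different runtime (e.g. MLX instead of llama.cpp) or loading code.
import nemo.collections.asr as nemo_asr   # ASR: Parakeet via NVIDIA NeMo
from llama_cpp import Llama               # LLM: assumes a 4-bit GGUF of Ministral-3 3B
from kokoro import KPipeline              # TTS: Kokoro 82M

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v3"
)
llm = Llama(model_path="ministral-3-3b-q4.gguf", n_ctx=4096)  # hypothetical file path
tts_pipeline = KPipeline(lang_code="a")   # "a" = American English voices
```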
It implements a classic ASR -> LLM -> TTS architecture (a rough backend sketch follows the list below):
1. The frontend captures the user's audio and sends the raw bytes to the backend /chat endpoint
2. The backend parses the bytes, extracts the sample rate and channel count, and then:
2.1. Transcribes the audio to text using an automatic speech recognition (ASR) model
2.2. Sends the transcribed text to the LLM, i.e. "the brain"
2.3. Sends the LLM response to a text-to-speech (TTS) model
2.4. Normalizes the TTS output, converts it to bytes, and sends them back to the frontend
3. The frontend plays the response audio back to the user
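To make the flow concrete, here is a minimal, hypothetical sketch of the /chat endpoint. It assumes a FastAPI backend, a WAV blob from the frontend, soundfile/numpy for audio handling, and the model handles (`asr_model`, `llm`, `tts_pipeline`) from the loading sketch above; the actual implementation may differ (streaming, error handling, the voice-cloning path, etc.).

```python
# Hypothetical /chat endpoint tying ASR -> LLM -> TTS together.
import io

import numpy as np
import soundfile as sf
from fastapi import FastAPI, Request, Response

app = FastAPI()

@app.post("/chat")
async def chat(request: Request) -> Response:
    # 1. Parse the raw audio blob (assumes a WAV payload); soundfile gives us
    #    the samples, the sample rate, and the channel layout.
    blob = await request.body()
    samples, sample_rate = sf.read(io.BytesIO(blob), dtype="float32")
    if samples.ndim > 1:                      # down-mix multi-channel audio to mono
        samples = samples.mean(axis=1)

    # 2.1 ASR: write to a temp file and transcribe (NeMo also accepts arrays
    #     in recent versions, but the file-path API is the documented one).
    sf.write("/tmp/input.wav", samples, sample_rate)
    user_text = asr_model.transcribe(["/tmp/input.wav"])[0].text

    # 2.2 LLM ("the brain"): single-turn chat completion
    reply = llm.create_chat_completion(
        messages=[{"role": "user", "content": user_text}]
    )["choices"][0]["message"]["content"]

    # 2.3 TTS: Kokoro yields audio chunks at 24 kHz
    chunks = [np.asarray(audio) for _, _, audio in tts_pipeline(reply, voice="af_heart")]
    speech = np.concatenate(chunks)

    # 2.4 Normalize to [-1, 1], encode as WAV bytes, and return to the frontend
    speech = speech / max(float(np.abs(speech).max()), 1e-8)
    buf = io.BytesIO()
    sf.write(buf, speech, 24000, format="WAV")
    return Response(content=buf.getvalue(), media_type="audio/wav")
```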
I've had a number of people try it out with great success, and you can take it in any direction, e.g. give it more capabilities so it can offload "hard" tasks to larger models or agents, enable voice streaming, or give it skills and knowledge.
Enjoy!