Show HN: Local Voice Assistant

Several weeks ago I built a fully local voice assistant demo with a FastAPI backend and a simple HTML front-end. All the models (ASR / LLM / TTS) are open weight and run locally, i.e. no data is sent to the Internet or to any external API. It's intended to show how easy it is to run a fully local AI setup on affordable commodity hardware, while also illustrating the uncanny valley and teasing out the ethical considerations of such a setup - it lets you perform voice cloning.

Link: https://github.com/acatovic/ova

Models used:

ASR: NVIDIA parakeet-tdt-0.6b-v3 (600M)

LLM: Mistral ministral-3 3b (4-bit quantized)

TTS (Simple): Hexgrad Kokoro (82M)

TTS (With Voice Cloning): Qwen3-TTS
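As a taste of how little code the local setup takes, this is roughly how the Parakeet ASR model can be pulled down and run with NVIDIA NeMo. This is a sketch, not the repo's code; the exact model identifier and the return format of transcribe() depend on the NeMo version you have installed.

  # Sketch: loading and running the ASR model locally with NVIDIA NeMo.
  # The model identifier is taken from the list above; check the repo for the exact one it uses.
  import nemo.collections.asr as nemo_asr

  asr_model = nemo_asr.models.ASRModel.from_pretrained(
      model_name="nvidia/parakeet-tdt-0.6b-v3"
  )

  # transcribe() takes a list of audio file paths; depending on the NeMo version
  # it returns plain strings or Hypothesis objects with a .text attribute.
  results = asr_model.transcribe(["sample.wav"])
  print(results[0])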

It implements a classic ASR -> LLM -> TTS architecture:

1. The frontend captures the user's audio and sends it as a blob of bytes to the backend /chat endpoint

2. The backend parses the bytes, extracts the sample rate (SR) and channel count, then:

2.1. Transcribes the audio to text using an automatic speech recognition (ASR) model

2.2. Sends the transcribed text to the LLM, i.e. "the brain"

2.3. Sends the LLM response to a text-to-speech (TTS) model

2.4. Normalizes the TTS output, converts it to bytes, and sends the bytes back to the frontend

3. The frontend plays the response audio back to the user
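For the curious, here is a minimal sketch of what such a /chat endpoint can look like. It is not the repo's actual code: it assumes hypothetical transcribe(), generate_reply(), and synthesize() helpers wrapping the ASR, LLM, and TTS models, and uses soundfile for the byte parsing and encoding.

  # Minimal sketch of the /chat pipeline (hypothetical helper names, not the repo's actual code).
  import io

  import numpy as np
  import soundfile as sf
  from fastapi import FastAPI, Request, Response

  app = FastAPI()

  # Hypothetical wrappers around the local models; the real repo wires these up differently.
  def transcribe(audio: np.ndarray, sample_rate: int) -> str:
      """ASR: audio -> text."""
      raise NotImplementedError

  def generate_reply(text: str) -> str:
      """LLM ("the brain"): user text -> assistant reply."""
      raise NotImplementedError

  def synthesize(text: str) -> tuple[np.ndarray, int]:
      """TTS: text -> (waveform, sample_rate)."""
      raise NotImplementedError

  @app.post("/chat")
  async def chat(request: Request) -> Response:
      # 1. The frontend posts the recorded audio as a raw blob of bytes.
      audio_bytes = await request.body()

      # 2. Parse the bytes, extract sample rate and channels (downmix to mono).
      audio, sample_rate = sf.read(io.BytesIO(audio_bytes), dtype="float32")
      if audio.ndim > 1:
          audio = audio.mean(axis=1)

      # 2.1. ASR: speech -> text
      user_text = transcribe(audio, sample_rate)

      # 2.2. LLM: text -> reply
      reply_text = generate_reply(user_text)

      # 2.3. TTS: reply -> waveform
      wav, tts_sr = synthesize(reply_text)

      # 2.4. Normalize, convert to bytes, and send back to the frontend.
      wav = wav / max(np.abs(wav).max(), 1e-9)
      buf = io.BytesIO()
      sf.write(buf, wav, tts_sr, format="WAV")
      return Response(content=buf.getvalue(), media_type="audio/wav")

The frontend then just plays the returned WAV bytes back to the user, which is step 3 above.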

I've had a number of people try it out with great success, and you can take it in any direction, e.g. give it more capabilities so it can offload "hard" tasks to larger models or agents, enable voice streaming, or give it skills or knowledge.

Enjoy!
