How to Set Up Voice Chat with OpenClaw

Voice Changes Everything
Text-based AI assistants are powerful, but they have a friction problem. You have to open a chat window, type your question, wait for a response, then read it. That is fine for complex tasks, but for quick questions, reminders, or when your hands are busy, it sucks.
Voice chat fixes that. When you can just speak to your AI assistant and hear the response, everything changes. You can use it while cooking, driving, or when your hands are covered in dough. It also makes OpenClaw more accessible — voice is inherently more inclusive than typing.
This guide walks you through setting up voice chat with OpenClaw. We cover the full pipeline — speech-to-text (STT), text-to-speech (TTS), and how to wire them together. By the end, you will have a working voice loop where you speak, OpenClaw listens, thinks, and responds out loud.
Key Takeaways
- Voice chat adds a hands-free interaction mode to OpenClaw
- The pipeline is: microphone → STT → LLM → TTS → speaker
- Deepgram Nova-3 + ElevenLabs gives the best balance of speed and quality
- Local options (Whisper + Piper) are free but require more setup
- The Voice Companion skill handles all the wiring for you
How Voice Chat Works
The voice pipeline has four stages:
- You speak into your microphone
- STT (speech-to-text) converts your audio to text
- OpenClaw processes the text and generates a response
- TTS (text-to-speech) converts the response back to audio
Each stage adds latency. The goal is to keep total end-to-end time under one second. Above that, conversations feel sluggish; below it, they feel natural.
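To see how quickly a one-second budget disappears, you can sum rough per-stage latencies. The numbers below are illustrative assumptions, not measurements:

```python
# Rough end-to-end latency budget for the voice pipeline.
# Per-stage numbers are illustrative assumptions, not benchmarks.
STAGE_LATENCY_MS = {
    "stt_finalize": 150,     # streaming STT: finalize after you stop talking
    "llm_first_token": 400,  # time until the model starts responding
    "tts_first_audio": 200,  # time until the first audio chunk plays
    "network_overhead": 100,
}

def total_latency_ms(stages):
    """Sum per-stage latencies into an end-to-end estimate."""
    return sum(stages.values())

budget_ms = 1000  # the one-second target
total = total_latency_ms(STAGE_LATENCY_MS)
print(f"{total} ms total -> {'within' if total <= budget_ms else 'over'} budget")
```

With these assumed numbers the pipeline lands at 850 ms, which is why shaving a few hundred milliseconds off any single stage matters.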
The three big decisions are:
- Which STT provider to use
- Which TTS provider to use
- How to connect them to OpenClaw
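The four stages can be sketched as a loop of four functions. Everything here is a stand-in to show the wiring, not a real provider call; swap in your chosen STT, LLM, and TTS:

```python
# Skeleton of the four-stage loop: mic -> STT -> LLM -> TTS -> speaker.
# All four functions are placeholders for your chosen providers.
def record_audio() -> bytes:
    return b"\x00\x01"  # stand-in for raw microphone audio

def speech_to_text(audio: bytes) -> str:
    return "what time is it"  # stand-in for Deepgram / Whisper

def ask_openclaw(text: str) -> str:
    return f"You asked: {text}"  # stand-in for the OpenClaw response

def text_to_speech(text: str) -> bytes:
    return text.encode()  # stand-in for ElevenLabs / Piper

def voice_turn() -> bytes:
    """One full turn of the conversation loop."""
    audio_in = record_audio()
    transcript = speech_to_text(audio_in)
    reply = ask_openclaw(transcript)
    return text_to_speech(reply)

print(voice_turn())
```

Each provider decision below slots into exactly one of these placeholder functions.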
STT: Converting Speech to Text
Your first choice is how to turn your voice into text. You have four main options:
Deepgram Nova-3 (Recommended)
Deepgram Nova-3 is the fastest cloud STT option. Its streaming support means it starts transcribing while you are still talking — critical for keeping conversations feeling responsive.
Cost: ~$4.30 per 1,000 minutes. Free tier available.
Setup:
```bash
pip install deepgram-sdk
export DEEPGRAM_API_KEY="your-key-here"
```
Why it wins: Speed. The streaming capability alone saves 300-500ms compared to sending whole audio chunks.
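The saving can be sketched with a bit of arithmetic. Both timings below are illustrative assumptions: batch STT transcribes the whole clip after you finish speaking, while streaming STT transcribes as you talk and only finalizes the tail afterwards.

```python
# Batch STT waits for the whole clip, then transcribes; streaming STT
# transcribes while you talk and only finalizes the last words.
# Both timings are illustrative assumptions.
batch_transcribe_s = 0.45  # transcribe the full clip after speech ends
stream_finalize_s = 0.10   # finalize the tail after speech ends

saved_ms = (batch_transcribe_s - stream_finalize_s) * 1000
print(f"streaming saves ~{saved_ms:.0f} ms per turn")
```

That per-turn saving is pure perceived latency, which is why streaming matters more for conversation than for offline transcription.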
Local Whisper (Free, Private)
If you do not want to pay for cloud STT or want maximum privacy, run Whisper locally. The faster-whisper implementation makes it viable on GPUs.
Cost: $0. Requires a GPU for real-time speed.
Setup:
```bash
pip install faster-whisper
```
Model sizes:
- tiny: ~1GB VRAM, ~50ms latency
- base: ~1GB VRAM, ~100ms latency
- small: ~2GB VRAM, ~200ms latency
- medium: ~5GB VRAM, ~400ms latency
For conversation, base or small hits the sweet spot between speed and accuracy.
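A small helper (hypothetical, built on the rough numbers listed above) can pick the most accurate model that fits a given GPU and latency budget:

```python
# Pick the largest faster-whisper model that fits your VRAM and latency
# budget, using the rough numbers from the list above.
WHISPER_MODELS = [
    # (name, VRAM in GB, per-utterance latency in ms), most accurate first
    ("medium", 5, 400),
    ("small", 2, 200),
    ("base", 1, 100),
    ("tiny", 1, 50),
]

def pick_model(vram_gb: float, max_latency_ms: int) -> str:
    """Return the most accurate model that fits both constraints."""
    for name, vram, latency in WHISPER_MODELS:
        if vram <= vram_gb and latency <= max_latency_ms:
            return name
    return "tiny"  # smallest fallback

print(pick_model(vram_gb=4, max_latency_ms=250))
```

With 4 GB of VRAM and a 250 ms budget, this picks `small`, matching the sweet-spot recommendation above.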
Groq Whisper
Groq runs Whisper on their LPU hardware. It is fast and cheap — about $0.10 per 1,000 minutes.
Downside: No streaming support. You send the full audio chunk and wait.
OpenAI Whisper
You might already have an OpenAI key. It works fine but costs more ($6/1000 minutes) and has no streaming.
Which STT Should You Pick?
| Priority | Pick |
|---|---|
| Lowest latency | Deepgram Nova-3 |
| Lowest cost | Local Whisper or Groq |
| Maximum privacy | Local Whisper |
| Simplest setup | OpenAI Whisper |
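To compare real costs at your own usage level, plug the per-1,000-minute prices quoted above into a quick calculation (prices as listed in this guide; check each provider's current pricing page):

```python
# Monthly STT cost at a given usage level, using the prices quoted
# above (USD per 1,000 minutes of audio).
STT_PRICE_PER_1K_MIN = {
    "Deepgram Nova-3": 4.30,
    "Groq Whisper": 0.10,
    "OpenAI Whisper": 6.00,
    "Local Whisper": 0.00,
}

def monthly_cost(minutes_per_month: float) -> dict:
    return {name: round(price * minutes_per_month / 1000, 2)
            for name, price in STT_PRICE_PER_1K_MIN.items()}

costs = monthly_cost(500)  # roughly 17 minutes of voice chat per day
for name, dollars in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${dollars:.2f}/mo")
```

At casual usage levels, even the most expensive cloud option stays in the single digits per month, so latency and quality usually matter more than price.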
TTS: Giving OpenClaw a Voice
Now the fun part — giving OpenClaw a voice. You have several options:
ElevenLabs (Best Quality)
ElevenLabs produces the most natural-sounding TTS available. They have dozens of premade voices, or you can clone your own.
Cost: ~$0.18 per 1,000 characters. Free tier: 10,000 characters/month.
Setup:
```bash
pip install elevenlabs
export ELEVENLABS_API_KEY="your-key-here"
```
Example:
```python
from elevenlabs import ElevenLabs, play

client = ElevenLabs()
audio = client.text_to_speech.convert(
    text="Hello, I am your OpenClaw assistant.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # George
    model_id="eleven_turbo_v2_5",
)
play(audio)
```
Pro tip: Use eleven_turbo_v2_5 for voice chat. It is optimized for low latency. The multilingual models sound better but add ~200ms.
Cartesia Sonic (Lowest Latency)
Cartesia Sonic is designed specifically for real-time voice applications. It has the lowest latency of any cloud TTS option.
Cost: ~$0.10 per 1,000 characters. Free tier available.
Cool feature: Streaming output means you hear the first words before the full response is generated.
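You can approximate the same effect with any TTS by chunking the reply into sentences and sending the first one immediately. This is a rough sketch of the idea, not Cartesia's API:

```python
import re

# Split the LLM reply into sentences so the first one can be sent to
# TTS immediately instead of waiting for the full response. A simple
# sketch; real streaming would chunk tokens as they arrive.
def sentence_chunks(text: str):
    """Yield sentence-sized chunks suitable for incremental TTS."""
    for chunk in re.split(r"(?<=[.!?])\s+", text.strip()):
        if chunk:
            yield chunk

reply = "Sure. The oven needs ten more minutes. Want a timer?"
chunks = list(sentence_chunks(reply))
print(chunks[0])  # first chunk goes to TTS while the rest is still pending
```

Sentence boundaries are a natural chunk size because TTS prosody sounds wrong if you cut mid-phrase.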
Piper (Local, Free)
Piper runs entirely on your CPU. No API key, no cost. Quality is noticeably below ElevenLabs, but it is instant and free.
Setup:
```bash
pip install piper-tts
```
You also need to download a voice model (~60MB).
OpenAI TTS
Simple if you already have an OpenAI key. Six built-in voices. Costs $15 per million characters.
Which TTS Should You Pick?
| Priority | Pick |
|---|---|
| Best voice quality | ElevenLabs |
| Lowest latency | Cartesia Sonic |
| Maximum privacy | Piper (local) |
| Simplest setup | OpenAI TTS |
Prerequisites
Before you start, make sure you have:
- Microphone: Built-in laptop mics work, but a headset is better
- HTTPS or localhost: Browsers block microphone access on plain HTTP. Localhost works fine; remote deployments need HTTPS
- API keys: For your chosen cloud providers
- Python 3.10+: For the voice bridge server
- GPU (optional): Only needed for local STT/TTS models
The Voice Companion Skill
Setting up voice chat manually means configuring STT, TTS, WebSocket connections, latency tuning, and more. The Voice Companion skill handles all of that for you.
For $9, you get:
- Pre-configured STT and TTS wiring
- Optimized latency settings for conversation
- Support for all major providers
- Conversation memory across voice sessions
- Automatic silence detection tuning
You still need API keys for cloud providers, but the skill eliminates the manual configuration and includes optimizations for conversational latency out of the box.
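If you do configure things manually, silence detection is the piece that decides when you have stopped talking. A minimal energy-based version looks like this; the thresholds are illustrative guesses, which is exactly the kind of tuning the skill automates:

```python
# Minimal energy-based silence detection: end the user's turn after N
# consecutive quiet frames. Thresholds here are illustrative.
def is_silent(frame, threshold=0.02):
    """A frame is silent if its mean absolute amplitude is below threshold."""
    return sum(abs(s) for s in frame) / len(frame) < threshold

def end_of_turn(frames, quiet_frames_needed=3):
    """Return the frame index where the speaker is judged done, or None."""
    quiet = 0
    for i, frame in enumerate(frames):
        quiet = quiet + 1 if is_silent(frame) else 0
        if quiet >= quiet_frames_needed:
            return i
    return None

loud = [0.3, -0.4, 0.5]
quiet = [0.001, -0.002, 0.001]
print(end_of_turn([loud, loud, quiet, quiet, quiet]))
```

Set the quiet-frame count too low and the assistant interrupts your pauses; too high and every reply feels delayed, which is why this value is worth tuning per microphone.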
Next Steps
- Choose your providers. Cloud (Deepgram + ElevenLabs) for fastest setup and best quality. Local (Whisper + Piper) for free and private.
- Sign up and grab API keys. Most have free tiers for testing.
- Install the Voice Companion skill for the fast path, or configure manually for full control.
- Test with a simple conversation. Say "hello" and see what happens.
- Optimize once it works. Get a working pipeline first, then tweak latency settings.
Voice chat turns OpenClaw from something you type at into something you talk to. The setup takes 10-15 minutes with cloud providers, 30-45 with local. Either way, it is worth the time.