How to Set Up Voice Chat with OpenClaw

Voice Changes Everything
Text-based AI assistants are powerful, but they have a friction problem. You have to open a chat window, type your question, wait for a response, then read it. That is fine for complex tasks, but for quick questions, reminders, or when your hands are busy, it sucks.
Voice chat fixes that. When you can just speak to your AI assistant and hear the response, everything changes. You can use it while cooking, driving, or when your hands are covered in dough. It also makes OpenClaw more accessible — voice is inherently more inclusive than typing.
This guide walks you through setting up voice chat with OpenClaw. We cover the full pipeline — speech-to-text (STT), text-to-speech (TTS), and how to wire them together. By the end, you will have a working voice loop where you speak, OpenClaw listens, thinks, and responds out loud.
Key Takeaways
- Voice chat adds a hands-free interaction mode to OpenClaw
- The pipeline is: microphone → STT → LLM → TTS → speaker
- Deepgram Nova-3 + ElevenLabs gives the best balance of speed and quality
- Local options (Whisper + Piper) are free but require more setup
- The Voice Companion skill handles all the wiring for you
How Voice Chat Works
The voice pipeline has four stages:
- You speak into your microphone
- STT (speech-to-text) converts your audio to text
- OpenClaw processes the text and generates a response
- TTS (text-to-speech) converts the response back to audio
Each stage adds latency. The goal is to keep total end-to-end time under one second. Above that, conversations feel sluggish; below it, they feel natural.
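To see how quickly a one-second budget disappears, you can sum rough per-stage latencies. The numbers below are illustrative assumptions, not measurements:

```python
# Rough end-to-end latency budget for the voice pipeline.
# Per-stage numbers are illustrative assumptions, not benchmarks.
STAGE_LATENCY_MS = {
    "stt_finalize": 150,     # streaming STT: finalize after you stop talking
    "llm_first_token": 400,  # time until the model starts responding
    "tts_first_audio": 200,  # time until the first audio chunk plays
    "network_overhead": 100,
}

def total_latency_ms(stages):
    """Sum per-stage latencies into an end-to-end estimate."""
    return sum(stages.values())

budget_ms = 1000  # the one-second target
total = total_latency_ms(STAGE_LATENCY_MS)
print(f"{total} ms total -> {'within' if total <= budget_ms else 'over'} budget")
```

With these assumed numbers the pipeline lands at 850 ms, which is why shaving a few hundred milliseconds off any single stage matters.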
The three big decisions are:
- Which STT provider to use
- Which TTS provider to use
- How to connect them to OpenClaw
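The four stages can be sketched as a loop of four functions. Everything here is a stand-in to show the wiring, not a real provider call; swap in your chosen STT, LLM, and TTS:

```python
# Skeleton of the four-stage loop: mic -> STT -> LLM -> TTS -> speaker.
# All four functions are placeholders for your chosen providers.
def record_audio() -> bytes:
    return b"\x00\x01"  # stand-in for raw microphone audio

def speech_to_text(audio: bytes) -> str:
    return "what time is it"  # stand-in for Deepgram / Whisper

def ask_openclaw(text: str) -> str:
    return f"You asked: {text}"  # stand-in for the OpenClaw response

def text_to_speech(text: str) -> bytes:
    return text.encode()  # stand-in for ElevenLabs / Piper

def voice_turn() -> bytes:
    """One full turn of the conversation loop."""
    audio_in = record_audio()
    transcript = speech_to_text(audio_in)
    reply = ask_openclaw(transcript)
    return text_to_speech(reply)

print(voice_turn())
```

Each provider decision below slots into exactly one of these placeholder functions.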
STT: Converting Speech to Text
Your first choice is how to turn your voice into text. You have four main options:
Deepgram Nova-3 (Recommended)
Deepgram Nova-3 is the fastest cloud STT option. Its streaming support means it starts transcribing while you are still talking — critical for keeping conversations feeling responsive.
Cost: ~$4.30 per 1,000 minutes. Free tier available.
Setup:
```bash
pip install deepgram-sdk
export DEEPGRAM_API_KEY="your-key-here"
```
Why it wins: Speed. The streaming capability alone saves 300-500ms compared to sending whole audio chunks.
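The saving can be sketched with a bit of arithmetic. Both timings below are illustrative assumptions: batch STT transcribes the whole clip after you finish speaking, while streaming STT transcribes as you talk and only finalizes the tail afterwards.

```python
# Batch STT waits for the whole clip, then transcribes; streaming STT
# transcribes while you talk and only finalizes the last words.
# Both timings are illustrative assumptions.
batch_transcribe_s = 0.45  # transcribe the full clip after speech ends
stream_finalize_s = 0.10   # finalize the tail after speech ends

saved_ms = (batch_transcribe_s - stream_finalize_s) * 1000
print(f"streaming saves ~{saved_ms:.0f} ms per turn")
```

That per-turn saving is pure perceived latency, which is why streaming matters more for conversation than for offline transcription.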
Local Whisper (Free, Private)
If you do not want to pay for cloud STT or want maximum privacy, run Whisper locally. The faster-whisper implementation makes it viable on GPUs.
Cost: $0. Requires a GPU for real-time speed.
Setup:
```bash
pip install faster-whisper
```
Model sizes:
- tiny: ~1GB VRAM, ~50ms latency
- base: ~1GB VRAM, ~100ms latency
- small: ~2GB VRAM, ~200ms latency
- medium: ~5GB VRAM, ~400ms latency
For conversation, base or small hits the sweet spot between speed and accuracy.
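A small helper (hypothetical, built on the rough numbers listed above) can pick the most accurate model that fits a given GPU and latency budget:

```python
# Pick the largest faster-whisper model that fits your VRAM and latency
# budget, using the rough numbers from the list above.
WHISPER_MODELS = [
    # (name, VRAM in GB, per-utterance latency in ms), most accurate first
    ("medium", 5, 400),
    ("small", 2, 200),
    ("base", 1, 100),
    ("tiny", 1, 50),
]

def pick_model(vram_gb: float, max_latency_ms: int) -> str:
    """Return the most accurate model that fits both constraints."""
    for name, vram, latency in WHISPER_MODELS:
        if vram <= vram_gb and latency <= max_latency_ms:
            return name
    return "tiny"  # smallest fallback

print(pick_model(vram_gb=4, max_latency_ms=250))
```

With 4 GB of VRAM and a 250 ms budget, this picks `small`, matching the sweet-spot recommendation above.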
Groq Whisper
Groq runs Whisper on their LPU hardware. It is fast and cheap — about $0.10 per 1,000 minutes.
Downside: No streaming support. You send the full audio chunk and wait.
OpenAI Whisper
You might already have an OpenAI key. It works fine but costs more ($6/1000 minutes) and has no streaming.
Which STT Should You Pick?
| Priority | Pick |
|---|---|
| Lowest latency | Deepgram Nova-3 |
| Lowest cost | Local Whisper or Groq |
| Maximum privacy | Local Whisper |
| Simplest setup | OpenAI Whisper |
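To compare real costs at your own usage level, plug the per-1,000-minute prices quoted above into a quick calculation (prices as listed in this guide; check each provider's current pricing page):

```python
# Monthly STT cost at a given usage level, using the prices quoted
# above (USD per 1,000 minutes of audio).
STT_PRICE_PER_1K_MIN = {
    "Deepgram Nova-3": 4.30,
    "Groq Whisper": 0.10,
    "OpenAI Whisper": 6.00,
    "Local Whisper": 0.00,
}

def monthly_cost(minutes_per_month: float) -> dict:
    return {name: round(price * minutes_per_month / 1000, 2)
            for name, price in STT_PRICE_PER_1K_MIN.items()}

costs = monthly_cost(500)  # roughly 17 minutes of voice chat per day
for name, dollars in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${dollars:.2f}/mo")
```

At casual usage levels, even the most expensive cloud option stays in the single digits per month, so latency and quality usually matter more than price.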
TTS: Giving OpenClaw a Voice
Now the fun part — giving OpenClaw a voice. You have several options:
ElevenLabs (Best Quality)
ElevenLabs produces the most natural-sounding TTS available. They have dozens of premade voices, or you can clone your own.
Cost: ~$0.18 per 1,000 characters. Free tier: 10,000 characters/month.
Setup:
```bash
pip install elevenlabs
export ELEVENLABS_API_KEY="your-key-here"
```
Example:
```python
from elevenlabs import ElevenLabs, play

client = ElevenLabs()
audio = client.text_to_speech.convert(
    text="Hello, I am your OpenClaw assistant.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # George
    model_id="eleven_turbo_v2_5",
)
play(audio)
```
Pro tip: Use eleven_turbo_v2_5 for voice chat. It is optimized for low latency. The multilingual models sound better but add ~200ms.
Cartesia Sonic (Lowest Latency)
Cartesia Sonic is designed specifically for real-time voice applications. It has the lowest latency of any cloud TTS option.
Cost: ~$0.10 per 1,000 characters. Free tier available.
Cool feature: Streaming output means you hear the first words before the full response is generated.
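You can approximate the same effect with any TTS by chunking the reply into sentences and sending the first one immediately. This is a rough sketch of the idea, not Cartesia's API:

```python
import re

# Split the LLM reply into sentences so the first one can be sent to
# TTS immediately instead of waiting for the full response. A simple
# sketch; real streaming would chunk tokens as they arrive.
def sentence_chunks(text: str):
    """Yield sentence-sized chunks suitable for incremental TTS."""
    for chunk in re.split(r"(?<=[.!?])\s+", text.strip()):
        if chunk:
            yield chunk

reply = "Sure. The oven needs ten more minutes. Want a timer?"
chunks = list(sentence_chunks(reply))
print(chunks[0])  # first chunk goes to TTS while the rest is still pending
```

Sentence boundaries are a natural chunk size because TTS prosody sounds wrong if you cut mid-phrase.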
Piper (Local, Free)
Piper runs entirely on your CPU. No API key, no cost. Quality is noticeably below ElevenLabs, but it is instant and free.
Setup:
```bash
pip install piper-tts
```
You also need to download a voice model (~60MB).
OpenAI TTS
Simple if you already have an OpenAI key. Six built-in voices. Costs $15 per million characters.
Which TTS Should You Pick?
| Priority | Pick |
|---|---|
| Best voice quality | ElevenLabs |
| Lowest latency | Cartesia Sonic |
| Maximum privacy | Piper (local) |
| Simplest setup | OpenAI TTS |
Prerequisites
Before you start, make sure you have:
- Microphone: Built-in laptop mics work, but a headset is better
- HTTPS or localhost: Browsers block microphone access on plain HTTP. Localhost works fine; remote deployments need HTTPS
- API keys: For your chosen cloud providers
- Python 3.10+: For the voice bridge server
- GPU (optional): Only needed for local STT/TTS models
The Voice Companion Skill
Setting up voice chat manually means configuring STT, TTS, WebSocket connections, latency tuning, and more. The Voice Companion skill handles all of that for you.
For $9, you get:
- Pre-configured STT and TTS wiring
- Optimized latency settings for conversation
- Support for all major providers
- Conversation memory across voice sessions
- Automatic silence detection tuning
You still need API keys for cloud providers, but the skill eliminates the manual configuration and includes optimizations for conversational latency out of the box.
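If you do configure things manually, silence detection is the piece that decides when you have stopped talking. A minimal energy-based version looks like this; the thresholds are illustrative guesses, which is exactly the kind of tuning the skill automates:

```python
# Minimal energy-based silence detection: end the user's turn after N
# consecutive quiet frames. Thresholds here are illustrative.
def is_silent(frame, threshold=0.02):
    """A frame is silent if its mean absolute amplitude is below threshold."""
    return sum(abs(s) for s in frame) / len(frame) < threshold

def end_of_turn(frames, quiet_frames_needed=3):
    """Return the frame index where the speaker is judged done, or None."""
    quiet = 0
    for i, frame in enumerate(frames):
        quiet = quiet + 1 if is_silent(frame) else 0
        if quiet >= quiet_frames_needed:
            return i
    return None

loud = [0.3, -0.4, 0.5]
quiet = [0.001, -0.002, 0.001]
print(end_of_turn([loud, loud, quiet, quiet, quiet]))
```

Set the quiet-frame count too low and the assistant interrupts your pauses; too high and every reply feels delayed, which is why this value is worth tuning per microphone.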
Next Steps
- Choose your providers. Cloud (Deepgram + ElevenLabs) for fastest setup and best quality. Local (Whisper + Piper) for free and private.
- Sign up and grab API keys. Most have free tiers for testing.
- Install the Voice Companion skill for the fast path, or configure manually for full control.
- Test with a simple conversation. Say "hello" and see what happens.
- Optimize once it works. Get a working pipeline first, then tweak latency settings.
Voice chat turns OpenClaw from something you type at into something you talk to. The setup takes 10-15 minutes with cloud providers, 30-45 with local. Either way, it is worth the time.