February 13, 2026 · 5 min read · Claw Mart Team

How to Set Up Voice Chat with OpenClaw

Voice Changes Everything

Text-based AI assistants are powerful, but they have a friction problem. You have to open a chat window, type your question, wait for a response, then read it. That is fine for complex tasks, but for quick questions, reminders, or when your hands are busy, it sucks.

Voice chat fixes that. When you can just speak to your AI assistant and hear the response, everything changes. You can use it while cooking, driving, or when your hands are covered in dough. It also makes OpenClaw more accessible — voice is inherently more inclusive than typing.

This guide walks you through setting up voice chat with OpenClaw. We cover the full pipeline — speech-to-text (STT), text-to-speech (TTS), and how to wire them together. By the end, you will have a working voice loop where you speak, OpenClaw listens, thinks, and responds out loud.

Key Takeaways

  • Voice chat adds a hands-free interaction mode to OpenClaw
  • The pipeline is: microphone → STT → LLM → TTS → speaker
  • Deepgram Nova-3 + ElevenLabs gives the best balance of speed and quality
  • Local options (Whisper + Piper) are free but require more setup
  • The Voice Companion skill handles all the wiring for you

How Voice Chat Works

The voice pipeline has four stages:

  1. You speak into your microphone
  2. STT (speech-to-text) converts your audio to text
  3. OpenClaw processes the text and generates a response
  4. TTS (text-to-speech) converts the response back to audio

Each stage adds latency. The goal is to keep total end-to-end time under one second: above that, conversations feel sluggish; below it, they feel natural.
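To make the four stages concrete, here is a minimal sketch of the loop. Every helper name in it (record_audio, transcribe, ask_openclaw, synthesize, play_audio) is a hypothetical placeholder for whichever providers you wire up in the sections below — none of them is a real API.

def voice_loop():
    while True:
        audio = record_audio()      # 1. capture from the microphone
        text = transcribe(audio)    # 2. STT: audio -> text
        reply = ask_openclaw(text)  # 3. OpenClaw generates a response
        speech = synthesize(reply)  # 4. TTS: text -> audio
        play_audio(speech)          # back out through the speaker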

The three big decisions are:

  • Which STT provider to use
  • Which TTS provider to use
  • How to connect them to OpenClaw

STT: Converting Speech to Text

Your first choice is how to turn your voice into text. You have four main options:

Deepgram Nova-3 (Recommended)

Deepgram Nova-3 is the fastest cloud STT option. Its streaming support means it starts transcribing while you are still talking, which is critical for keeping conversations responsive.

Cost: ~$4.30 per 1,000 minutes. Free tier available.

Setup:

pip install deepgram-sdk
export DEEPGRAM_API_KEY="your-key-here"

Why it wins: Speed. The streaming capability alone saves 300-500ms compared to sending whole audio chunks.
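If you want to sanity-check your key before setting up streaming, here is a minimal one-shot (non-streaming) transcription sketch, assuming the v3 Python SDK — the exact client path and option names can shift between SDK releases, so confirm against Deepgram's current docs:

from deepgram import DeepgramClient, PrerecordedOptions

deepgram = DeepgramClient()  # picks up DEEPGRAM_API_KEY from the environment

with open("question.wav", "rb") as f:
    payload = {"buffer": f.read()}

options = PrerecordedOptions(model="nova-3", smart_format=True)
response = deepgram.listen.prerecorded.v("1").transcribe_file(payload, options)

print(response.results.channels[0].alternatives[0].transcript)

The real latency win comes from the live (WebSocket) interface, which returns partial transcripts while you are still speaking.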

Local Whisper (Free, Private)

If you do not want to pay for cloud STT or want maximum privacy, run Whisper locally. The faster-whisper implementation makes real-time transcription viable on a GPU.

Cost: $0. Requires a GPU for real-time speed.

Setup:

pip install faster-whisper

Model sizes:

  • tiny: ~1GB VRAM, ~50ms latency
  • base: ~1GB VRAM, ~100ms latency
  • small: ~2GB VRAM, ~200ms latency
  • medium: ~5GB VRAM, ~400ms latency

For conversation, base or small hits the sweet spot between speed and accuracy.
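A minimal transcription sketch with faster-whisper, assuming an NVIDIA GPU — swap in device="cpu", compute_type="int8" if you are running without one:

from faster_whisper import WhisperModel

# "base" is the speed/accuracy sweet spot for conversation
model = WhisperModel("base", device="cuda", compute_type="float16")

# transcribe() returns a generator of segments plus metadata
segments, info = model.transcribe("question.wav")
text = " ".join(segment.text.strip() for segment in segments)
print(text)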

Groq Whisper

Groq runs Whisper on their LPU hardware. It is fast and cheap — about $0.10 per 1,000 minutes.

Downside: No streaming support. You send the full audio chunk and wait.

OpenAI Whisper

You might already have an OpenAI key. It works fine but costs more (~$6 per 1,000 minutes) and has no streaming.
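A minimal sketch with the openai Python SDK. Groq's endpoint is OpenAI-compatible, so the same code works by pointing base_url at Groq and swapping in their Whisper model name — check Groq's docs for the exact model string.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("question.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcription.text)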

Which STT Should You Pick?

  • Lowest latency: Deepgram Nova-3
  • Lowest cost: Local Whisper or Groq
  • Maximum privacy: Local Whisper
  • Simplest setup: OpenAI Whisper

TTS: Giving OpenClaw a Voice

Now the fun part — giving OpenClaw a voice. You have several options:

ElevenLabs (Best Quality)

ElevenLabs produces the most natural-sounding TTS available. They have dozens of premade voices, or you can clone your own.

Cost: ~$0.18 per 1,000 characters. Free tier: 10,000 characters/month.

Setup:

pip install elevenlabs
export ELEVENLABS_API_KEY="your-key-here"

Example:

from elevenlabs import ElevenLabs, play

client = ElevenLabs()
audio = client.text_to_speech.convert(
    text="Hello, I am your OpenClaw assistant.",
    voice_id="JBFqnCBsd6RMkjVDRZzb",  # George
    model_id="eleven_turbo_v2_5",
)
play(audio)

Pro tip: Use eleven_turbo_v2_5 for voice chat. It is optimized for low latency. The multilingual models sound better but add ~200ms.

Cartesia Sonic (Lowest Latency)

Cartesia Sonic is designed specifically for real-time voice applications. It has the lowest latency of any cloud TTS option.

Cost: ~$0.10 per 1,000 characters. Free tier available.

Cool feature: Streaming output means you hear the first words before the full response is generated.

Piper (Local, Free)

Piper runs entirely on your CPU. No API key, no cost. Quality is noticeably below ElevenLabs, but it is instant and free.

Setup:

pip install piper-tts

You also need to download a voice model (~60MB).
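A minimal synthesis sketch, assuming the piper-tts Python API — the import path and method name have shifted between releases (older builds expose PiperVoice under piper.voice), so check the version you installed:

import wave

from piper import PiperVoice

# Assumes en_US-lessac-medium.onnx (and its .json config) sit next to this script
voice = PiperVoice.load("en_US-lessac-medium.onnx")

with wave.open("reply.wav", "wb") as wav_file:
    voice.synthesize("Hello, I am your OpenClaw assistant.", wav_file)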

OpenAI TTS

Simple if you already have an OpenAI key. Six built-in voices. Costs $15 per million characters.
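A minimal sketch with the openai Python SDK, streaming the generated speech straight to a file:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with client.audio.speech.with_streaming_response.create(
    model="tts-1",
    voice="alloy",
    input="Hello, I am your OpenClaw assistant.",
) as response:
    response.stream_to_file("reply.mp3")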

Which TTS Should You Pick?

  • Best voice quality: ElevenLabs
  • Lowest latency: Cartesia Sonic
  • Maximum privacy: Piper (local)
  • Simplest setup: OpenAI TTS

Prerequisites

Before you start, make sure you have:

  • Microphone: Built-in laptop mics work, but a headset is better
  • HTTPS or localhost: Browsers block microphone access on plain HTTP. Localhost works fine; remote deployments need HTTPS
  • API keys: For your chosen cloud providers
  • Python 3.10+: For the voice bridge server
  • GPU (optional): Only needed for local STT/TTS models

The Voice Companion Skill

Setting up voice chat manually means configuring STT, TTS, WebSocket connections, latency tuning, and more. The Voice Companion skill handles all of that for you.

For $9, you get:

  • Pre-configured STT and TTS wiring
  • Optimized latency settings for conversation
  • Support for all major providers
  • Conversation memory across voice sessions
  • Automatic silence detection tuning

You still need API keys for cloud providers, but the skill eliminates the manual configuration and includes optimizations for conversational latency out of the box.

Next Steps

  1. Choose your providers. Cloud (Deepgram + ElevenLabs) for fastest setup and best quality. Local (Whisper + Piper) for free and private.
  2. Sign up and grab API keys. Most have free tiers for testing.
  3. Install the Voice Companion skill for the fast path, or configure manually for full control.
  4. Test with a simple conversation. Say "hello" and see what happens.
  5. Optimize once it works. Get a working pipeline first, then tweak latency settings.

Voice chat turns OpenClaw from something you type at into something you talk to. The setup takes 10-15 minutes with cloud providers, 30-45 minutes with local ones. Either way, it is worth the time.
