Set up OpenClaw voice (STT/TTS)
Voice is the fastest way to dictate tasks. This guide covers the full setup for multi-language voice workflows with OpenClaw.
Choose STT provider
Three options:
- Whisper API (OpenAI): cheap (~€0.006/min), 99 languages, 1–2 s latency
- Deepgram: faster (streaming), better for live transcription, slightly more expensive
- Local Whisper: €0/min, runs well on M-series Macs, latency 2–5 s
Recommendation: start with the OpenAI Whisper API. If privacy is a concern, run Whisper locally.
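To put the per-minute price in perspective, here is a rough monthly cost estimate. It is a minimal sketch that assumes the approximate ~€0.006/min rate quoted above and ignores any billing rounding the provider may apply; the function name and the usage figures are illustrative only.

```python
# Approximate Whisper API rate from the comparison above (assumption: EUR per audio minute).
WHISPER_API_EUR_PER_MIN = 0.006


def monthly_stt_cost(
    memos_per_day: int,
    avg_seconds: float,
    days: int = 30,
    rate_per_min: float = WHISPER_API_EUR_PER_MIN,
) -> float:
    """Estimate the monthly transcription cost in EUR for a given memo volume."""
    total_minutes = memos_per_day * avg_seconds / 60 * days
    return round(total_minutes * rate_per_min, 2)


if __name__ == "__main__":
    # Hypothetical usage: 20 memos per day, 45 seconds each.
    print(monthly_stt_cost(memos_per_day=20, avg_seconds=45))
```

At this kind of volume the API cost stays in the low single-digit euros per month, so privacy rather than price is usually the reason to go local.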
TTS optional
OpenAI TTS is cheap and good; ElevenLabs sounds more natural but costs more. TTS is nice for voice-message replies in WhatsApp/Telegram, but not mandatory.
Voice-memo-to-task workflow
A user sends a voice message via WhatsApp:
- OpenClaw receives the ogg/opus file
- Whisper transcribes it
- The voice-to-task skill classifies the transcript: task, note, or question?
- If it is a task: it is created in the CRM and a confirmation reply is sent back
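The steps above can be sketched roughly as follows. This is not the actual voice-to-task skill: the keyword rules merely stand in for whatever classifier the skill uses, and `transcribe` and `create_task` are hypothetical callables representing the STT provider and the CRM integration. The reply for notes and questions is also an assumption, since only the task case is described above.

```python
import re
from typing import Callable

# Hypothetical trigger words; a real skill would likely use an LLM or a trained classifier.
TASK_WORDS = {"remind", "todo", "schedule", "call", "send", "book"}


def classify(transcript: str) -> str:
    """Return 'question', 'task', or 'note' using simple illustrative rules."""
    text = transcript.strip().lower()
    if text.endswith("?"):
        return "question"
    words = set(re.findall(r"[a-z]+", text))
    if words & TASK_WORDS:
        return "task"
    return "note"


def handle_voice_memo(
    audio: bytes,
    transcribe: Callable[[bytes], str],   # stand-in for the STT call
    create_task: Callable[[str], str],    # stand-in for the CRM call, returns a task id
) -> str:
    """Transcribe a voice memo, classify it, and build the reply text."""
    transcript = transcribe(audio)
    kind = classify(transcript)
    if kind == "task":
        task_id = create_task(transcript)
        return f"Task created ({task_id}): {transcript}"
    # Assumption: non-task memos are only acknowledged.
    return f"Received {kind}: {transcript}"


if __name__ == "__main__":
    reply = handle_voice_memo(
        b"fake-ogg-bytes",
        transcribe=lambda _audio: "Call the supplier tomorrow",
        create_task=lambda _text: "T-1",
    )
    print(reply)
```

Passing the STT and CRM calls in as parameters keeps the sketch independent of any specific provider and makes the flow easy to test with stubs.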
Multi-language
Whisper auto-detects the spoken language. German, English, Turkish, Arabic, and French all work well. For dialects (Bavarian, Swiss German), we benchmark before rollout.
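The text does not say which metric the dialect benchmark uses. One common choice for STT is word error rate (WER): the word-level edit distance between a human reference transcript and the STT output, divided by the number of reference words. The following is a minimal sketch under that assumption; the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must not be empty")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of four reference words.
    print(word_error_rate("das ist ein test", "das is ein test"))
```

Running this over a small set of hand-transcribed dialect recordings gives a comparable number per provider before committing to a rollout.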
Tune latency
Streaming STT (Deepgram) delivers the first words in under 500 ms. The OpenAI Whisper API delivers the full transcript after 1–2 s. For real-time conversation, use streaming; for async voice memos, Whisper is sufficient.
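The guidance in this guide can be condensed into a small decision helper. This is only a sketch: the provider labels are made-up identifiers, not OpenClaw configuration values, and letting privacy take priority over latency is an assumption.

```python
def choose_stt(realtime: bool, privacy_sensitive: bool = False) -> str:
    """Pick an STT option following the rules of thumb from this guide."""
    if privacy_sensitive:
        # Assumption: privacy wins even though local Whisper is slower (2-5 s).
        return "local-whisper"
    if realtime:
        # Streaming delivers the first words fastest.
        return "deepgram-streaming"
    # Default for async voice memos.
    return "openai-whisper-api"


if __name__ == "__main__":
    print(choose_stt(realtime=False))
    print(choose_stt(realtime=True))
    print(choose_stt(realtime=True, privacy_sensitive=True))
```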