
Giving Claude Code a Voice with ElevenLabs

AI Claude Code Productivity Software Engineering

About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.

Vintage radio emitting code-formed sound waves

I spend hours in Claude Code every day. Long sessions where I am reading, thinking, switching contexts, and occasionally glancing at the terminal to see if the agent finished a task. The problem: Claude Code is silent. It finishes a 10-minute build-and-deploy pipeline and just sits there, cursor blinking, waiting for me to notice. The whole concept here was inspired by J.A.R.V.I.S. from the Iron Man films, voiced by Paul Bettany. Tony Stark's AI assistant announces status, flags problems, and delivers dry commentary while Stark works on something else entirely. I wanted that. An AI assistant that speaks. That announces when it starts a task and summarizes what it accomplished when it finishes. Like a competent colleague who taps you on the shoulder and says "that deployment is done, here's what happened."

Thirty minutes of setup gave me exactly that. Claude Code now speaks through ElevenLabs text-to-speech, streaming audio through my speakers with ~300ms latency. A short bash script, an API key, and a prompt block in CLAUDE.md turned a silent terminal agent into one that announces its work. This article walks through the full implementation: the script, the prompt engineering, the voice selection, and the cost math. If you use Claude Code for extended sessions and want ambient awareness of what your agent is doing without watching the terminal, this is for you.

Why "Cooper"? Throughout this article you will see my agent address me as "Cooper" or "sir." That naming comes from a deliberate instruction in my CLAUDE.md file, and it is a nod to Interstellar. TARS addressing Cooper in that film captures exactly the dynamic I wanted: a dry, competent AI that treats you as the mission commander. The British butler tone and the name alternation create a surprisingly immersive collaboration feel after a few days of use.

Why Voice Output Changes the Workflow

The Attention Problem

Claude Code runs in a terminal. When it finishes a task, the only signal is that new text appears on screen. If you are in another window (reviewing a PR, reading documentation, responding to Slack), you miss it. You context-switch back to the terminal, realize the task finished three minutes ago, and lose those three minutes of idle time. Multiply that across a full workday of agent-assisted development and the accumulated dead time is significant.

What Voice Adds

Voice output solves the attention problem without requiring visual focus. I hear "That's done, Cooper. Five articles deployed to staging, all AI scores under threshold" from across the room, and I know the state of my work without looking at the terminal. Three specific benefits:

| Benefit | Without Voice | With Voice |
|---|---|---|
| Task completion awareness | Must watch terminal | Hear it from anywhere |
| Error notification | Discover on next glance | Hear immediately |
| Context retention | Re-read output to recall what happened | Spoken summary sticks in memory |
| Multi-task efficiency | Check terminal between tasks | Continue working, hear updates |

The psychological effect surprised me. Having the agent announce its work creates a sense of collaboration that a silent terminal lacks. It feels like pair programming with a colleague who happens to work at 100x speed.

The Architecture: Three Components

The entire implementation is three pieces: a bash script that calls the ElevenLabs streaming TTS API, an .env file with credentials, and a prompt block in CLAUDE.md that instructs Claude Code when and how to use the script.

Component Overview

[Diagram: Claude Code (agent) —Bash tool call→ speak.sh (script) —HTTPS POST→ ElevenLabs streaming API —audio stream→ mpv (player) —sound→ speakers]
Voice output architecture

Claude Code calls the script through its Bash tool, passing the text to speak as an argument. The script POSTs to the ElevenLabs streaming endpoint, which returns audio chunks progressively. Those chunks pipe directly into mpv, which starts playing before the full response arrives. End-to-end latency from Claude Code deciding to speak to audio hitting the speakers is roughly 300-400ms.

Dependencies

| Component | Purpose | Installation |
|---|---|---|
| curl | HTTP client for ElevenLabs API | Pre-installed on macOS/Linux |
| jq | JSON payload construction | brew install jq or apt install jq |
| mpv | Audio player with stdin streaming | brew install mpv or apt install mpv |
| ElevenLabs account | TTS API access | elevenlabs.io |

The script has no Python dependencies, no Node.js runtime, no Docker container. Three command-line tools and an API key. That simplicity matters because Claude Code invokes this script potentially dozens of times per session; startup overhead needs to be near zero.

The Script

Create the directory structure:

~/.claude/scripts/
├── .env          # API credentials
└── speak.sh      # TTS script

The .env File

ELEVENLABS_API_KEY=your_api_key_here
ELEVENLABS_VOICE_ID=your_voice_id_here

Store your ElevenLabs API key and voice ID here. The script sources this file at runtime. Keep it out of version control.

speak.sh

#!/bin/bash
# Claude Code TTS — streams ElevenLabs audio through mpv
# Falls back to macOS `say` if ElevenLabs is unreachable

# Without pipefail, the if-condition below would only see mpv's exit
# status; with it, a curl failure also triggers the fallback.
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/.env"

TEXT="$1"
[ -z "$TEXT" ] && exit 0  # nothing to say

if curl -sN --fail "https://api.elevenlabs.io/v1/text-to-speech/${ELEVENLABS_VOICE_ID}/stream" \
  -H "xi-api-key: ${ELEVENLABS_API_KEY}" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg text "$TEXT" '{
    text: $text,
    model_id: "eleven_turbo_v2",
    voice_settings: {
      stability: 0.5,
      similarity_boost: 0.75
    }
  }')" \
  | mpv --no-video --no-terminal --really-quiet - 2>/dev/null; then
  :
else
  say -v Daniel "ElevenLabs unavailable. Falling back to local voice."
  say -v Daniel "$TEXT"
fi

Make it executable:

chmod +x ~/.claude/scripts/speak.sh

How It Works

The script does five things, with steps 2-4 forming a single pipeline and step 5 as a fallback if the API call fails:

  1. Sources credentials from the .env file adjacent to the script.
  2. Constructs a JSON payload using jq with the text, model ID, and voice settings.
  3. POSTs to the ElevenLabs streaming endpoint with curl -sN --fail (silent mode, no-buffer for streaming, fail on HTTP errors).
  4. Pipes the audio stream to mpv which plays it in real time with no video window and no terminal output.
  5. Falls back to macOS say if any step fails (network issue, expired key, rate limit). The fallback announces that ElevenLabs is unavailable before speaking the original text through the local Daniel voice. You always hear the announcement; the only question is which voice delivers it.
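
Step 2 can be exercised on its own. This produces the same payload the script sends, using the settings from speak.sh (the sample TEXT is just an illustration):

```shell
# Build the TTS request payload exactly as speak.sh does.
TEXT='Deployment complete, sir.'
jq -n --arg text "$TEXT" '{
  text: $text,
  model_id: "eleven_turbo_v2",
  voice_settings: {stability: 0.5, similarity_boost: 0.75}
}'
```

The --arg flag escapes quotes and newlines in the announcement text, so the payload stays valid JSON whatever Claude Code decides to say.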

The eleven_turbo_v2 model delivers ~300ms time-to-first-byte. The voice settings control two parameters: stability (0.5 gives natural variation without wandering off-voice) and similarity_boost (0.75 keeps the output close to the selected voice's characteristics). I tuned these through experimentation; your preferences will vary.

Choosing a Voice

ElevenLabs offers three categories of voices:

| Voice Type | Description | Cost | Best For |
|---|---|---|---|
| Pre-made voices | Curated defaults optimized for reliability | Included in plan | Quick setup, consistent quality |
| Community voices | 10,000+ voices shared by users | Included in plan | Finding a specific character or accent |
| Cloned voices | Your own voice or a custom voice | Requires Pro plan+ | Brand consistency, personal preference |

I use a pre-made British male voice (the "butler" aesthetic fits the interaction model). Browse the ElevenLabs Voice Library to find one that suits your taste. Each voice has an ID string that goes in your .env file.

Voice Selection Tips

Pick a voice that is distinct from your own and from common notification sounds. The goal is instant recognition: when that voice speaks, you know it is your coding agent. I avoid voices that sound like podcast hosts or audiobook narrators because those blend into background audio. A slightly unusual accent or cadence cuts through ambient noise better.

Test your chosen voice with short, technical phrases. Some voices handle code terminology ("deployed to staging," "CI pipeline green," "three hundred millisecond latency") well. Others stumble on abbreviations, acronyms, or numbers. The turbo model handles technical language better than the older v1 models.

The CLAUDE.md Prompt

The script alone does nothing until Claude Code knows to call it. The prompt block in CLAUDE.md defines when to speak, what to say, and how to say it. Here is the exact prompt I use:

## Voice Announcements

Use the ElevenLabs TTS script for spoken announcements. Run in the background
so it doesn't block:

\`\`\`bash
~/.claude/scripts/speak.sh "Your message here" &
\`\`\`

### When starting a task

Speak a brief acknowledgement when beginning work. Address the user as "sir"
or "Cooper" (vary which one). The phrasing must vary every time but convey
"I'm on it." Never repeat the same wording twice in a session. Examples of
the *tone* (do NOT reuse these verbatim):
- "Right away, sir."
- "On it, Cooper."
- "Consider it done, sir."
- "Straightaway, Cooper."
- "I'll see to it at once, sir."

### When completing a task

Speak a brief 1-sentence summary of what was accomplished. Address the user
as "sir" or "Cooper" (vary which one). The phrasing must vary every time.
Keep it concise — what was done, key outcome. British butler tone. Examples
of the *tone* (do NOT reuse these verbatim):
- "All sorted, sir. The README has been updated and pushed."
- "That's done, Cooper. Terraform validates cleanly across all twelve files."
- "Taken care of, sir. Tests are green and the commit is pushed."

### General rules

- Always vary the phrasing — never use the same opening or structure
  consecutively
- Alternate between "sir" and "Cooper" naturally
- Skip only for: pure Q&A conversations with no code or file changes
- When a task has an exceptionally high leverage factor (50x+), occasionally
  mention it in the completion announcement. Keep it dry and understated —
  e.g. "That would have taken a human the better part of a week, sir." or
  "Roughly eighty hours of work in under ten minutes, Cooper." Don't do this
  every time — just when the leverage is genuinely striking.

Why This Prompt Structure Works

Several design decisions in the prompt are deliberate:

Background execution with &. The trailing ampersand runs the script without blocking Claude Code's execution. Without it, the agent waits for the audio to finish playing before continuing work. With it, the agent speaks and keeps working simultaneously.
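
The difference is easy to see with a stand-in for the TTS call — here sleep 1 substitutes for a few seconds of audio playback (an assumption for illustration; the real call is to speak.sh):

```shell
# A background launch returns immediately; the time is only paid at `wait`.
t0=$(date +%s)
( sleep 1 ) &          # stand-in for: ~/.claude/scripts/speak.sh "..." &
t1=$(date +%s)         # reached immediately — the agent keeps working here
wait                   # a foreground call would have blocked until now
t2=$(date +%s)
echo "launch: $((t1 - t0))s, total: $((t2 - t0))s"
```

Without the ampersand, every announcement would add its full playback duration to the agent's turnaround time.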

Forced variation. The instruction "never repeat the same wording twice in a session" prevents the robotic monotony of hearing the same phrase fifty times a day. Claude Code is good at varying phrasing when you explicitly ask for it. Without this instruction, it gravitates toward a small set of favorites.

Character consistency. The "British butler tone" instruction and the name/honorific alternation create a consistent personality. After a few days, the voice becomes a recognizable character rather than a generic TTS notification. This matters for the psychological benefit I mentioned earlier: collaboration feels more real when the collaborator has a consistent voice and manner.

Selective leverage mentions. The instruction to occasionally comment on high-leverage tasks adds a layer of awareness that reinforces the value of the AI-assisted workflow. Hearing "That would have been three weeks of work for a human team, sir" after watching a 12-minute task complete is a visceral reminder of what this tooling makes possible.

Prompt Placement

Put the voice announcement block in your global ~/.claude/CLAUDE.md if you want voice across all projects. Put it in a project-level CLAUDE.md if you only want voice for specific repositories. I use the global file because I want voice everywhere.

Cost Analysis

ElevenLabs bills per character. The turbo models cost 0.5 credits per character on self-serve plans.

Typical Usage

| Metric | Value |
|---|---|
| Average announcement length | 60 characters |
| Announcements per hour (active session) | 8-12 |
| Characters per hour | ~600 |
| Characters per 8-hour day | ~4,800 |
| Characters per month (22 working days) | ~105,600 |

The free tier provides 10,000 characters/month, which covers roughly two days of heavy use. The Starter plan ($5/month) provides 30,000 characters. The Creator plan ($22/month) provides 100,000 characters, which covers a typical month with room to spare.

| Plan | Monthly Characters | Monthly Cost | Coverage |
|---|---|---|---|
| Free | 10,000 | $0 | ~2 working days |
| Starter | 30,000 | $5 | ~6 working days |
| Creator | 100,000 | $22 | Full month with headroom |
| Pro | 500,000 | $99 | Heavy use across multiple projects |

For my usage pattern (6-10 hours of Claude Code per day, 5-6 days per week), the Creator plan covers it. The announcements are short. A typical completion announcement like "Taken care of, sir. Three articles deployed to production with all AI scores passing." is 78 characters. At 0.5 credits per character on turbo, that is 39 credits per announcement. The math works out to roughly $0.01-0.02 per announcement at Creator plan rates.
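
The per-announcement arithmetic, using the Creator plan numbers from the table above:

```shell
# Cost of one 78-character announcement on the Creator plan
# ($22 per 100,000 credits, turbo at 0.5 credits per character).
awk 'BEGIN {
  chars   = 78
  credits = chars * 0.5            # 39 credits
  cost    = credits * 22 / 100000  # dollars per announcement
  printf "%.0f credits, $%.4f per announcement\n", credits, cost
}'
```

That works out to 39 credits and a little under a cent per announcement; a dozen announcements an hour costs about a dime.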

Free Alternatives on macOS

If you want voice output without any recurring cost, macOS has built-in text-to-speech via the say command. No API key, no network dependency, zero latency to first audio. A minimal version of the script:

#!/bin/bash
say -v Daniel "$1"

The Daniel voice is a British English option that ships with macOS. Other voices are available in System Settings > Accessibility > Spoken Content > System Voice. You can download higher-quality voices there as well.

| Approach | Voice Quality | Latency | Cost | Offline Capable |
|---|---|---|---|---|
| ElevenLabs API | Excellent, near-human | ~300ms (network dependent) | $0-99/month | No |
| macOS say (default voices) | Functional, robotic | Instant | Free | Yes |
| macOS say (downloaded premium voices) | Good, natural cadence | Instant | Free | Yes |

I chose ElevenLabs because the voice quality makes a meaningful difference over hours of listening. The built-in voices work, but they sound like what they are: synthesized speech. After a full day of hearing announcements, the naturalness of ElevenLabs reduces fatigue. That said, say is a perfectly viable starting point, and you can always upgrade later.

Operational Notes

Latency Tuning

The eleven_turbo_v2 model targets ~300ms time-to-first-byte for streaming. In practice, I see 250-400ms depending on network conditions and text length. For the short announcements Claude Code produces, the entire audio clip typically finishes generating before the first sentence finishes playing. The perceived latency is the time between Claude Code's bash call and audible sound: roughly half a second.

If latency matters more than voice quality for your use case, ElevenLabs also offers eleven_flash_v2_5 which targets sub-200ms latency at slightly reduced quality. For short announcements, the quality difference is negligible. Swap the model_id in the script to try it.

Failure Handling

If the ElevenLabs API call fails (network issue, expired key, rate limit), the script falls back to the macOS say command. You hear a brief "ElevenLabs unavailable" notice followed by the original announcement in the local Daniel voice. No announcement is ever lost. The fallback adds ~1 second of overhead compared to ElevenLabs streaming, but the tradeoff is worth it: you always know what your agent just did. Claude Code continues working regardless because the script runs in the background with &.

Volume and Environment

I run this in a home office. The announcements play through my desk speakers at conversation volume. In a shared office, you would want headphones or a lower volume. The mpv player respects system volume, so adjusting macOS volume works without script changes. For per-script volume control, add --volume=50 to the mpv flags (50 = half volume).

Multiple Concurrent Agents

If you run multiple Claude Code sessions simultaneously (I sometimes do, using Task agents in parallel), the announcements overlap. Each agent invokes its own speak.sh call, and mpv instances play concurrently. The voices layer on top of each other, which is occasionally confusing. One solution: assign different voices to different project directories by using project-level .env files instead of a single global one.
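
A sketch of that routing — the pick_env helper and the project-level .claude/.env layout are assumptions for illustration, not part of the script above:

```shell
# Resolve which .env a session should use: a project-level
# .claude/.env (hypothetical layout) wins over the global one.
pick_env() {
  local project_dir="$1" global_env="$2"
  if [ -f "$project_dir/.claude/.env" ]; then
    echo "$project_dir/.claude/.env"
  else
    echo "$global_env"
  fi
}

# Usage inside speak.sh, replacing the fixed source line:
#   source "$(pick_env "$PWD" "$HOME/.claude/scripts/.env")"
```

Each project's .env can then carry its own ELEVENLABS_VOICE_ID, so overlapping announcements at least arrive in distinguishable voices.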

[Diagram: two Claude Code sessions each POST to the ElevenLabs API with their own voice (one British, one American) and pipe into separate mpv instances, both playing through the speakers]
Multi-session voice routing with per-project voices

Key Takeaways

  1. The setup is trivial. One bash script, one .env file, one prompt block in CLAUDE.md. Under thirty minutes from start to hearing your first announcement. No Python, no Node, no containers.
  2. The prompt engineering matters more than the script. The CLAUDE.md instructions that define when to speak, what tone to use, and how to vary phrasing turn a raw TTS call into a coherent interaction pattern. Invest time tuning the personality and the variation rules.
  3. Background execution is critical. Always append & to the speak command. Voice output should never block agent work. A silent, fast agent beats a vocal, slow one every time.
  4. Cost is negligible. Individual announcements cost roughly a penny each. Even heavy daily use runs $0.50-1.00 per day on the Creator plan. The free macOS say command works if you want zero cost.
  5. Voice creates presence. A silent terminal agent is easy to ignore. A speaking agent feels like a collaborator. That psychological shift changes how you structure your work: you delegate more freely, context-switch more confidently, and catch errors faster.

Let's Build Something!

I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.

Currently taking on select consulting engagements through Vantalect.