v0.2 · korean live · all-local voice · self-hosted · MIT

Pick up the phone.
Talk to your homelab.

Papercup is a Discord voice bot that calls a Claude Code session running on your own box. Press /pickup, talk like it's a phone call, get spoken answers. No cloud STT/TTS, no audio leaves your network.

$ install Papercup on your homelab
bash <(curl -fsSL https://raw.githubusercontent.com/powder-nomad/papercup/main/install.sh)
Need different engines / agent backend? Use the wizard ↓

What makes this different

Three things, all load-bearing.

01 All-local voice stack

Silero VAD → faster-whisper STT → Kokoro/MeloTTS TTS, all running in Python sidecars on your hardware. Audio never leaves your LAN.

~3–8s loop on 4 cores

02 Phone-call UX

Speak, pause, get a spoken reply. Hang up and resume by name later. Multilingual (English, Korean, Japanese, Chinese, Spanish, French, …), auto-routed per utterance.

9 languages today

03 Subagents do real work

The speaker delegates to sandboxed background Claude Code instances via an embedded MCP server. You hang up; they keep coding.

spawn → check → list

Configure your one-liner

Agent backend
Speech-to-text (Whisper)
Text-to-speech

The Korean engine pre-warms at startup so the first Korean call doesn't pause. Pick MeloTTS for a lighter footprint, or XTTS for voice variety.

Install location
Your one-liner
bash <(curl -fsSL https://raw.githubusercontent.com/powder-nomad/papercup/main/install.sh)

Paste it into your homelab terminal. The installer prompts interactively for your Discord token, client ID, and guild ID. Re-run it with different flags at any time to reconfigure.

How a call flows

┌─ Discord (phone / desktop) ─┐         ┌─────────── Homelab ───────────┐
│                             │  voice  │                               │
│  /pickup → speak → /hangup  │ ──────► │  Silero VAD → Whisper STT     │
│                             │         │       ↓                       │
│  bot speaks back            │ ◄────── │  Speaker agent (Haiku)        │
│                             │ Kokoro  │       ↓                       │
└─────────────────────────────┘         │  Kokoro / MeloTTS → audio     │
                                        │                               │
                                        │  spawn_extension(task) ───►   │
                                        │       Claude Code subagent    │
                                        │       in sandboxed dir        │
                                        └───────────────────────────────┘

The speaker handles the call directly. For anything bigger than a quick file read, it spawns a background extension — a full Claude Code instance in its own dir — and narrates progress while it works. You can hang up; resume the session by name later (/resume name:foo).
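
To make the "hang up and it keeps working" idea concrete, here is a minimal sketch of one common way to run a background worker in its own directory and process group, then cancel the whole group. It is not Papercup's implementation: the helper names `spawn_detached` and `cancel_group` and the placeholder workload are assumptions for illustration, and the approach is POSIX-only.

```python
import os
import signal
import subprocess
import sys
import tempfile


def spawn_detached(cmd, workdir):
    """Start cmd in workdir as the leader of a new session and process group."""
    return subprocess.Popen(
        cmd,
        cwd=workdir,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # runs setsid() in the child (POSIX only)
    )


def cancel_group(proc, timeout=5.0):
    """Send SIGTERM to the worker's whole process group; SIGKILL if it lingers."""
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()


if __name__ == "__main__":
    sandbox = tempfile.mkdtemp(prefix="extension-")
    # Placeholder workload standing in for a long-running agent process.
    worker = spawn_detached(
        [sys.executable, "-c", "import time; time.sleep(60)"], sandbox
    )
    print("spawned pid", worker.pid, "in", sandbox)
    print("exit status", cancel_group(worker))
```

Because the worker leads its own session, it is not tied to the lifetime of the call that started it, and signalling the group also reaches any children it started in that group.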

System requirements

Tested on a 4-core Linux homelab. macOS works for the base path; the MeloTTS (Korean) path is Linux-tested only.

Requirement                  | Minimum                                      | Recommended
OS                           | Linux x86_64, macOS (Intel or Apple Silicon) | Ubuntu 22.04+
Python                       | 3.10                                         | 3.12
Node                         | 20                                           | 20+
Disk (English-only, Kokoro)  | 2 GB free                                    | 4 GB free
Disk (with Korean / MeloTTS) | 4 GB free                                    | 8 GB free
RAM                          | 2 GB free                                    | 4 GB free
CPU                          | 2 cores                                      | 4+ cores (real-time STT/TTS)
Network                      | Outbound HTTPS for model downloads
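
If you want to check a machine against the minimums before installing, a small script along these lines may help. It is a sketch, not part of Papercup: the function name `preflight` and the profile names are made up for illustration. It only checks that `node` is on PATH (not its version) and does not check free RAM, since the standard library has no portable call for that.

```python
import os
import shutil
import sys

MIN_PYTHON = (3, 10)
MIN_CORES = 2
# Minimum free disk in GB, taken from the requirements table.
MIN_DISK_GB = {"english": 2, "korean": 4}


def preflight(profile="english", path="."):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python %d.%d or newer is required" % MIN_PYTHON)
    if shutil.which("node") is None:
        problems.append("node was not found on PATH (Node 20 is the minimum)")
    cores = os.cpu_count() or 1
    if cores < MIN_CORES:
        problems.append(f"{cores} CPU core(s) found, {MIN_CORES} required")
    free_gb = shutil.disk_usage(path).free / 1e9
    need_gb = MIN_DISK_GB[profile]
    if free_gb < need_gb:
        problems.append(f"{free_gb:.1f} GB free at {path!r}, {need_gb} GB required")
    return problems


if __name__ == "__main__":
    issues = preflight("english")
    print("\n".join(issues) if issues else "preflight OK")
```

Pass `"korean"` as the profile to apply the larger disk minimum for the MeloTTS path.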

apt (Linux)

# Base install (Kokoro TTS only)
sudo apt-get install -y espeak-ng python3-venv

# + Korean / MeloTTS path
sudo apt-get install -y libmecab-dev mecab-ipadic-utf8 libssl-dev pkg-config

brew (macOS)

brew install espeak-ng node python@3.12
# Korean path also needs:
brew install mecab mecab-ipadic openssl pkg-config

What you also need running

  • Discord bot with token, client ID, and a guild ID. Create the application in the Discord Developer Portal (New Application → Bot) and enable MESSAGE CONTENT INTENT under "Privileged Gateway Intents". Voice state is a standard, non-privileged intent, so it has no toggle on that page.
  • One agent backend: a logged-in claude CLI (Claude Code), codex CLI (ChatGPT), or an Anthropic API key. The wizard above lets you pick.

What gets downloaded on first run

  • Whisper model (~140 MB for base)
  • Kokoro voice + voices.bin (~100 MB)
  • Silero VAD (~1.7 MB)
  • If Korean enabled: PyTorch CPU (~700 MB) at install time; Korean BERT (~440 MB) lazy-loaded on first Korean utterance
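
For a rough sense of the first-run download, the approximate sizes listed above can simply be added up. The sketch below does only that arithmetic; the dictionary keys and the function name `total_mb` are labels chosen for illustration, and the figures are the approximate ones from the list (with the `base` Whisper model), not measured values.

```python
# Approximate download sizes in MB, copied from the list above.
BASE_MB = {"whisper_base": 140, "kokoro_voices": 100, "silero_vad": 1.7}
KOREAN_MB = {"pytorch_cpu": 700, "korean_bert": 440}


def total_mb(korean=False):
    """Sum the listed sizes, optionally including the Korean path."""
    total = sum(BASE_MB.values())
    if korean:
        total += sum(KOREAN_MB.values())
    return total


if __name__ == "__main__":
    print(f"base path:   about {total_mb():.0f} MB")
    print(f"with Korean: about {total_mb(korean=True):.0f} MB")
```

That comes to roughly 242 MB for the base path and roughly 1.4 GB with Korean enabled, which fits within the disk minimums listed under System requirements.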

Capabilities matrix

Component | Today | Notes
VAD | Silero | Only option
STT | Whisper | small (multilingual, default) auto-detects 99 languages; base / base.en / small.en available
TTS | Kokoro + MeloTTS + XTTS-v2 (auto) | Kokoro: en/ja/zh/es/fr/hi/it/pt. Korean → MeloTTS (light, monotone) or XTTS-v2 (~58 voices, voice cloning). Set via TTS_KO_ENGINE
Agent | 10 backends (7 CLI agents + 3 HTTP APIs) | claude-code · codex · aider · gemini-cli · opencode · crush · amp · anthropic-api · openai-compat · gemini-api. Switch via /backend at runtime.
Per-session config | model · effort · permissions · backend · streaming · reactivity · notify · mode | Set via /pickup flags or hot-swap mid-session via individual slash commands
Modes | Voice (phone-call prompt) + Text (vibecoding) | /pickup mode:voice or mode:text. Text mode drops the system prompt → normal Claude Code behavior
Reasoning effort | minimal · low · medium · high · xhigh · max | xhigh / max are Opus-only
Live progress | sticky message, optional event log | Text mode + /streaming summary|full. Anti-bomb: edit-throttled, auto-skips short turns
Budget tracking | per-day USD + tokens, daily cap | BOT_DAILY_BUDGET_USD or /budget set_usd:<n>; live on bot's rich-presence
Process hygiene | detached spawn, group-kill cancel, boot-time reaper | Each agent turn tracked in data/process-registry.json; orphans cleaned up on restart
Multi-bot | loop cap, reactivity modes, in-band roster | Multiple operators can co-host bots in one channel; cap prevents bot-to-bot loops
Transport | Discord voice + text | Bind a single channel via /bind, or @-mention anywhere
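
The TTS row describes routing by language: Kokoro for its listed languages, and Korean going to whichever engine `TTS_KO_ENGINE` selects. A minimal sketch of that lookup might look like the following. The function name `pick_tts_engine`, the value spellings `"melotts"` and `"xtts"`, and the MeloTTS default are assumptions; the source names the engines and the variable but not the exact strings or the default.

```python
import os

# Languages the matrix lists for Kokoro.
KOKORO_LANGS = {"en", "ja", "zh", "es", "fr", "hi", "it", "pt"}
# Assumed spellings for TTS_KO_ENGINE values (not confirmed by the source).
KO_ENGINES = {"melotts", "xtts"}


def pick_tts_engine(lang, env=None):
    """Map a detected language code to a TTS engine name."""
    env = os.environ if env is None else env
    lang = lang.lower()
    if lang == "ko":
        # Assumed default: MeloTTS when the variable is unset.
        engine = env.get("TTS_KO_ENGINE", "melotts").lower()
        if engine not in KO_ENGINES:
            raise ValueError(f"unknown TTS_KO_ENGINE value {engine!r}")
        return engine
    if lang in KOKORO_LANGS:
        return "kokoro"
    raise ValueError(f"no TTS engine listed for language {lang!r}")


if __name__ == "__main__":
    for code in ("en", "ja", "ko"):
        print(code, "->", pick_tts_engine(code))
```

Passing the environment in as a parameter keeps the lookup easy to exercise without touching real environment variables.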

See Slash commands for the runtime surface and Components for the deep dive.

Released under the MIT License.