Speaker agent + backends
The speaker agent owns the call. It transcribes the user's audio (via STT), thinks, and produces text that the TTS layer speaks back.
What it is
SpeakerAgent (packages/bot/src/agent/speaker.ts) is a thin shim around an AgentBackend. The shim owns the system prompt + config; the backend owns conversation state and the actual model call.
Backends
Ten backends ship today, all behind the same AgentBackend interface. Switch between them at runtime with /backend name:<x>.
CLI agents (7)
| Backend | AGENT_BACKEND= | Auth | Notes |
|---|---|---|---|
| Claude Code | claude-code | Existing claude login | Subscription tier. Streams tool_use / tool_result for /streaming. Owns the MCP integration. |
| Codex | codex | Existing codex login | OpenAI's CLI agent. Assigns its own thread UUID on first turn. |
| Aider | aider-cli | Env vars per config | aider --message ... --no-stream --yes-always. Per-cwd history in .aider.chat.history.md. |
| Gemini CLI | gemini-cli | Google CLI login | gemini -p ... --output-format json. Token usage extracted from JSON. |
| OpenCode | opencode-cli | OpenCode config | opencode run --session <id> --format json. Native session-resume. |
| Crush | crush-cli | Crush config | crush run from charmbracelet. Optional --yolo skips permission prompts. |
| Amp | amp-cli | Sourcegraph auth | amp -x (execute). Prompt piped via stdin. Optional in-prompt @T-<thread> resume. |
HTTP APIs (3)
| Backend | AGENT_BACKEND= | Auth | Coverage |
|---|---|---|---|
| Anthropic API | anthropic-api | ANTHROPIC_API_KEY | Direct API; in-memory history per session |
| OpenAI-compatible | openai-compat | OPENAI_COMPAT_* | One adapter, ~10 providers via base-URL config: OpenAI, Groq, Together, Fireworks, DeepSeek, OpenRouter, LiteLLM, Ollama, LM Studio, vLLM |
| Gemini API (native) | gemini-api | GEMINI_API_KEY | Google's generativelanguage.googleapis.com — native schema, not the OpenAI shim |
All ten accept --model/model: and a system prompt. claude-code additionally pipes through --allowedTools and --mcp-config. The CLI-agent backends share a BaseCliBackend (detached spawn, process-registry tracking, group-kill cancel, turn timeout); HTTP backends maintain in-memory history: Turn[].
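To make the shared contract concrete, here is a hedged sketch of what the AgentBackend interface might look like, with a minimal in-memory backend in the HTTP style (history kept as Turn[]). Only AgentBackend, Turn, and getBackendId() are named in this doc; every other method name and signature here is an illustrative assumption, not papercup's real API.

```typescript
// Illustrative Turn shape for in-memory history (assumption).
interface Turn {
  role: "user" | "assistant";
  text: string;
}

// Hypothetical backend contract; method names besides getBackendId() are guesses.
interface AgentBackend {
  start(opts: { model?: string; systemPrompt?: string }): Promise<void>;
  send(userText: string): Promise<string>; // one conversational turn
  getBackendId(): string | undefined;      // native session/thread id, if any
  stop(): Promise<void>;
}

// Minimal in-memory backend: no external session, history: Turn[].
class EchoBackend implements AgentBackend {
  private history: Turn[] = [];
  async start(_opts: { model?: string; systemPrompt?: string }): Promise<void> {}
  async send(userText: string): Promise<string> {
    this.history.push({ role: "user", text: userText });
    const reply = `echo: ${userText}`; // a real backend would call the model API here
    this.history.push({ role: "assistant", text: reply });
    return reply;
  }
  getBackendId(): string | undefined {
    return undefined; // in-memory backends have no external session id
  }
  async stop(): Promise<void> {}
}
```

A CLI-style backend would implement the same interface but spawn a process per turn instead of keeping history in memory.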
Plug-in registry
Backends self-register at module load:
```ts
import { registerBackend } from "@papercup/bot/agent/backend";

registerBackend("my-thing", () => new MyBackend());
```

Once registered, the new backend shows up in /backend's dropdown, listBackends(), and AGENT_BACKEND= env values. Built-ins do this from the bottom of each backend-*.ts file; third parties can add their own without touching papercup's source.
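The registry pattern above can be sketched as a plain name-to-factory map. This is not papercup's implementation, just a minimal illustration of how registerBackend/listBackends could be backed:

```typescript
// Minimal sketch of a backend registry (assumption; the real one lives in
// @papercup/bot/agent/backend and returns actual AgentBackend instances).
type BackendFactory = () => { name?: string };

const registry = new Map<string, BackendFactory>();

function registerBackend(name: string, factory: BackendFactory): void {
  registry.set(name, factory); // later registrations overwrite earlier ones
}

function listBackends(): string[] {
  return [...registry.keys()];
}

function createBackend(name: string): { name?: string } {
  const factory = registry.get(name);
  if (!factory) throw new Error(`unknown backend: ${name}`);
  return factory(); // instantiate lazily, only when selected
}

registerBackend("my-thing", () => ({ name: "my-thing" }));
```

Lazy factories mean an unused backend pays no startup cost; a backend only spins up when the operator actually selects it.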
Model catalog
agent/model-catalog.ts keeps a static map of model id → backend candidates (e.g. claude-opus-4-7 → ["claude-code","anthropic-api"]) and refreshes live from each provider's /models endpoint when API keys are set. /models and /models action:refresh expose this to the operator.
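A hedged sketch of the catalog lookup: a static model-id → backend-candidates map, with resolution picking the first candidate that is actually available. The map entry comes from the example above; resolveBackend and its signature are illustrative assumptions.

```typescript
// Static seed of the catalog (the real map in agent/model-catalog.ts is larger
// and refreshed live from each provider's /models endpoint).
const MODEL_CATALOG: Record<string, string[]> = {
  "claude-opus-4-7": ["claude-code", "anthropic-api"],
};

// Hypothetical resolver: first candidate backend that is currently available.
function resolveBackend(
  modelId: string,
  available: Set<string>,
): string | undefined {
  return (MODEL_CATALOG[modelId] ?? []).find((b) => available.has(b));
}
```

Candidate order doubles as preference order, so the same model id degrades gracefully, e.g. from the CLI agent to the raw API when the CLI isn't logged in.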
Tools
The speaker has read-only built-in tools plus MCP tools for delegating real work:
```
--allowedTools "Read Glob Grep mcp__papercup__spawn_extension mcp__papercup__check_extension mcp__papercup__list_extensions"
```

The mcp__papercup__* tools come from the embedded HTTP MCP server (see Extensions). Read/Glob/Grep are restricted to directories specified in PROJECT_DIRS via --add-dir.
System prompt — mode-aware
The speaker has two modes; the prompt depends on which one the session is in.
Voice mode (/pickup mode:voice, default)
Full phone-call persona prompt at the top of packages/bot/src/agent/speaker.ts. Key behaviors:
- Phone-call brevity (one or two sentences)
- No markdown / bullets / code formatting (plain prose for TTS)
- Don't read out URLs or long IDs
- Reply in the same language the user spoke in
- For Korean: ONE short sentence (~15 syllables), TTS is slow, ask before going long
- Use Read/Glob/Grep inline for quick file lookups
- Use spawn_extension for anything multi-step
- Narrate before tool calls so the user isn't sitting in silence
Text mode (/pickup mode:text)
No system prompt. The backend behaves as a normal Claude Code (or Codex / Anthropic) session — markdown OK, multi-paragraph OK, no "this is a phone call" framing. Designed for vibecoding via Discord text, where call brevity isn't a constraint.
Tools are still the same (Read/Glob/Grep + MCP extension tools), so spawn_extension works in text mode if you want the speaker itself to delegate long-running work.
Per-session knobs
SpeakerAgentOpts lets the bot pass per-session overrides at start:
| Opt | Source | Notes |
|---|---|---|
| model | Session.model (set via /model or /pickup model:) | Falls back to AGENT_MODEL env |
| effort | Session.effort (set via /effort or /pickup effort:) | --effort on Claude Code CLI; thinking.budget_tokens on Anthropic API; ignored by codex |
| mode | Session.mode (set via /pickup mode:) | Drives prompt selection (voice vs text) |
| permissionMode | Session.permissionMode (set via /permissions or /pickup permission-mode:) | --permission-mode on Claude Code CLI; default is mode-aware (text=bypassPermissions, voice=default) |
When you /model, /effort, or /permissions mid-conversation, the bot hot-swaps the agent: stops the current backend instance, starts a new one with the updated opts, and uses backend resume so history carries over.
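The hot-swap sequence can be sketched as: read the old backend's native id, stop it, start a replacement with the new opts plus that id. The interface and the resumeId opt below are assumptions for illustration, not papercup's real API.

```typescript
// Hypothetical subset of the backend contract needed for a hot-swap.
interface SwappableBackend {
  start(opts: { model?: string; resumeId?: string }): Promise<void>;
  stop(): Promise<void>;
  getBackendId(): string | undefined;
}

// Stop the current backend and start a fresh one, resuming from the old
// backend's native session id so conversation history carries over.
async function hotSwap(
  current: SwappableBackend,
  makeNext: () => SwappableBackend,
  opts: { model?: string },
): Promise<SwappableBackend> {
  const resumeId = current.getBackendId(); // capture before stopping
  await current.stop();
  const next = makeNext();
  await next.start({ ...opts, resumeId });
  return next;
}
```

Capturing the id before stop() matters: a backend that tears down its process may no longer be able to report its session id afterwards.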
Why not give the voice agent Bash directly
Voice mode is on the latency-critical path. Long-running tools degrade call UX badly — a 30-second Bash call freezes the conversation. Heavy work belongs in extensions, which run async and report back when done. Text mode has more latitude; if you want bash-on-the-call behavior, /pickup mode:text permission-mode:bypassPermissions is the path.
Session state
Each /pickup creates a SessionStore record with a friendly name. Each backend stores its own native session id (backendId field):
- claude-code: pre-allocated UUID, passed via --session-id (first turn) / --resume (subsequent)
- codex: backend assigns a thread UUID on first turn; bot syncs back via getBackendId()
- anthropic-api: no external session, history kept in-memory
/resume name:foo looks up the session, passes the right backendId to the backend, and continues. See Sessions.
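The /resume lookup described above can be sketched as a map from friendly name to a record carrying the backend's native id. The field names here are illustrative assumptions about the SessionStore record shape, not its real schema:

```typescript
// Assumed record shape: friendly name plus the backend's native session id.
interface SessionRecord {
  name: string;        // friendly name the operator types into /resume
  backend: string;     // e.g. "claude-code", "codex", "anthropic-api"
  backendId?: string;  // native session/thread id; absent for in-memory backends
}

const sessions = new Map<string, SessionRecord>();

// Look up a session by friendly name; the caller passes record.backendId
// to the backend's resume path.
function resumeSession(name: string): SessionRecord {
  const rec = sessions.get(name);
  if (!rec) throw new Error(`no session named ${name}`);
  return rec;
}

sessions.set("foo", { name: "foo", backend: "codex", backendId: "thread-123" });
```

Keeping the native id per record is what lets one /resume command work across backends with very different session models (pre-allocated UUIDs, backend-assigned thread ids, or none at all).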