Pipeline stages

Every conversational turn flows through these stages. Each has a distinctive log signature for diagnosis.

| # | Stage | Implementation | Failure signature |
|---|---|---|---|
| 1 | Discord gateway | discord.js `Client` | `Used disallowed intents`, `Invalid token` |
| 2 | Voice connect | @discordjs/voice + DAVE | stuck in connecting → signalling, never ready |
| 3 | Audio capture | receiver subscription, Opus → PCM via prism-media | no `[capture] first opus frame` after speaking |
| 4 | Resample | 48kHz s16 stereo → 16kHz mono f32 | s16 peak high but f32 peak ~0 |
| 5 | VAD | Silero ONNX, 32ms windows with 64-sample context | every utterance flagged noise-only |
| 6 | STT | faster-whisper (Python sidecar) | `[stt] sidecar exited`, empty transcripts |
| 7 | Speaker agent | claude-code / codex / anthropic-api | `[agent] llm failed`, agent claims it can't do things it should |
| 8 | TTS | Kokoro via kokoro-onnx (Python sidecar) | `[tts] synth failed` |
| 9 | Upsample | 24kHz mono → 48kHz stereo s16 | playback sounds half-speed or chipmunky |
| 10 | Audio playback | `createAudioPlayer` + Discord voice | `[player] error`, audio silent on the call |
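The resample stage (row 4) is plain sample math, which makes its failure signature easy to reason about. A minimal sketch of what that conversion involves, assuming a naive 3:1 decimation without an anti-aliasing filter (the function name and shape are illustrative, not the project's actual code):

```typescript
// Sketch of stage 4: 48kHz s16 interleaved stereo → 16kHz mono f32 in [-1, 1).
// Downmix L/R, take every 3rd frame (48k → 16k), and scale by 1/32768.
// Skipping the scale step is the kind of bug behind the
// "s16 peak high but f32 peak ~0" mismatch in the table above.
function resample48kS16StereoTo16kF32Mono(input: Int16Array): Float32Array {
  const frames = Math.floor(input.length / 2); // interleaved L,R pairs
  const outLen = Math.floor(frames / 3);       // 3:1 decimation
  const out = new Float32Array(outLen);
  for (let i = 0; i < outLen; i++) {
    const f = i * 3;                           // every 3rd input frame
    const mono = (input[2 * f] + input[2 * f + 1]) / 2;
    out[i] = mono / 32768;                     // s16 → f32 in [-1, 1)
  }
  return out;
}
```

A production resampler would low-pass before decimating to avoid aliasing; the point here is only the channel/format bookkeeping.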

Latency budget

Per-turn loop on a 4-core CPU homelab:

| Stage | Cost |
|---|---|
| End-of-utterance silence | `SILENCE_MS=600` |
| Whisper STT (small int8) | ~0.5-0.8 RTF (multilingual default) |
| Whisper STT (base.en int8) | ~0.3-0.5 RTF (English-only path) |
| Speaker agent | ~5-8s for CLI backends, ~0.5-1.5s for direct API |
| Kokoro TTS (en/ja/zh/es/fr/hi/it/pt) | ~0.5-0.85 RTF |
| MeloTTS (Korean, monotone) | ~2.3 RTF |
| XTTS-v2 (Korean, ~58 speakers) | ~2.5-3.0 RTF |
| Playback start | ~200ms |
| Total (English path) | ~3-8s typical |
| Total (Korean path) | ~10-25s typical (TTS dominates) |

The CLI backends dominate the budget. If you have an Anthropic API key, `AGENT_BACKEND=anthropic-api` cuts ~5s off every turn.
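RTF (real-time factor) is processing time per second of audio, so the stage costs above combine into a simple estimate. A hypothetical helper, using the budget's numbers as defaults:

```typescript
// Rough per-turn latency from the budget table (illustrative helper, not
// project code). RTF = processing seconds per second of audio, so an 8s
// utterance at 0.4 RTF costs ~3.2s of STT.
function estimateTurnSeconds(opts: {
  utteranceSec: number; // how long the user spoke
  replySec: number;     // duration of the synthesized reply
  sttRtf: number;
  ttsRtf: number;
  agentSec: number;
}): number {
  const silence = 0.6;       // SILENCE_MS=600 end-of-utterance wait
  const playbackStart = 0.2; // ~200ms to start playback
  return (
    silence +
    opts.utteranceSec * opts.sttRtf +
    opts.agentSec +
    opts.replySec * opts.ttsRtf +
    playbackStart
  );
}

// English path with a direct-API agent: 5s utterance, 4s reply → ~6.6s
estimateTurnSeconds({ utteranceSec: 5, replySec: 4, sttRtf: 0.4, ttsRtf: 0.7, agentSec: 1 });
```

Swapping in a CLI agent (~6s) or a Korean TTS path (~2.5 RTF) pushes the same call into the 10-25s range the table quotes.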

Capture loop self-healing

The receive subscription can wedge on DAVE decryption errors (Discord's E2E voice protocol) or network glitches. Both pcmStream.error and opusStream.error trigger an immediate re-subscribe via a one-shot guard so the loop doesn't double-subscribe. Without this, the bot would just go silent after one bad packet.
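The one-shot guard itself is a few lines. A sketch under assumed names (the streams are whatever emits `error`; `resubscribe` stands in for the real re-subscription path):

```typescript
import { EventEmitter } from "node:events";

// Sketch of the one-shot re-subscribe guard (names are illustrative, not the
// project's actual API). Both the opus and pcm streams can error; whichever
// fires first triggers exactly one re-subscribe. A fresh guard would be armed
// on the new subscription's streams.
function armResubscribeGuard(
  streams: EventEmitter[],
  resubscribe: () => void,
): void {
  let fired = false;
  const once = () => {
    if (fired) return; // second stream erroring must not double-subscribe
    fired = true;
    resubscribe();
  };
  for (const s of streams) s.on("error", once);
}
```

The shared `fired` flag is the whole trick: two independent error sources collapse into at most one recovery action per subscription.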

Concurrency

Capture, STT, agent, and TTS are async per-utterance. The capture loop doesn't block on inference — it re-subscribes immediately after passing the buffer to runAgent(). If you talk faster than the bot can respond, multiple turns can be in flight; the AudioPlayer is single-resource so only the latest synthesized response actually plays. Phase 4 will introduce a turn-management layer.
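One common way to get "only the latest response plays" without a full turn manager is a generation counter; a sketch of that pattern (illustrative, not the project's code — Phase 4's turn-management layer may work differently):

```typescript
// Latest-wins playback via a generation counter (illustrative sketch).
// Each turn bumps the counter; a synthesized response reaches the player
// only if no newer turn started while it was in flight.
let generation = 0;

async function speak(
  synthesize: () => Promise<Buffer>, // stands in for the STT→agent→TTS chain
  play: (audio: Buffer) => void,     // stands in for the single AudioPlayer
): Promise<void> {
  const myGen = ++generation;
  const audio = await synthesize();
  if (myGen !== generation) return;  // a newer turn superseded this one
  play(audio);
}
```

Stale turns still burn CPU on inference; they just never reach the player. Cancelling in-flight work, not only discarding its output, is the harder problem a real turn manager has to solve.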

Released under the MIT License.