Pipeline stages

Every conversational turn flows through the stages below, in order. Each stage has a distinctive failure signature, usually a log line, that tells you where to start diagnosing.

#	Stage	Implementation	Failure signature
1	Discord gateway	discord.js Client	`Used disallowed intents`, `Invalid token`
2	Voice connect	@discordjs/voice + DAVE	stuck in `connecting → signalling`, never `ready`
3	Audio capture	receiver subscription, Opus → PCM via prism-media	no `[capture] first opus frame` after speaking
4	Resample	48kHz s16 stereo → 16kHz mono f32	s16 peak high but f32 peak ~0
5	VAD	Silero ONNX, 32ms windows w/ 64-sample context	every utterance flagged `noise-only`
6	STT	faster-whisper (Python sidecar)	`[stt] sidecar exited`, empty transcripts
7	Speaker agent	claude-code / codex / anthropic-api	`[agent] llm failed`, agent claims it can't do things it should
8	TTS	Kokoro via kokoro-onnx (Python sidecar)	`[tts] synth failed`
9	Upsample	24kHz mono → 48kHz stereo s16	playback sounds half-speed or chipmunky
10	Audio playback	createAudioPlayer + Discord voice	`[player] error`, audio silent on the call
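Stages 4 and 9 are plain sample-format conversions, so they are easy to sketch. The code below is a minimal illustration, not the project's actual resampler: the function names are hypothetical, the downsampler uses a naive 3-frame average where a real implementation may use a proper low-pass filter, and the upsampler assumes the TTS sidecar hands back f32 samples. The `peak` helper mirrors the stage 4 failure signature: if the s16 peak is high while the f32 peak is near zero, the scaling or channel handling is the likely suspect.

```typescript
// Hypothetical helpers; names and the naive filters are assumptions.

/** 48 kHz interleaved stereo s16 -> 16 kHz mono f32 in [-1, 1). */
function downsampleForStt(stereo48k: Int16Array): Float32Array {
  const DECIMATE = 3; // 48000 / 16000
  const frames = Math.floor(stereo48k.length / 2); // one frame = L + R
  const out = new Float32Array(Math.floor(frames / DECIMATE));
  for (let i = 0; i < out.length; i++) {
    let sum = 0;
    for (let k = 0; k < DECIMATE; k++) {
      const f = (i * DECIMATE + k) * 2;
      sum += stereo48k[f] + stereo48k[f + 1];
    }
    // Average 3 frames x 2 channels, then scale s16 to float.
    out[i] = sum / (DECIMATE * 2) / 32768;
  }
  return out;
}

/** 24 kHz mono f32 -> 48 kHz interleaved stereo s16 (f32 input is assumed). */
function upsampleForPlayback(mono24k: Float32Array): Int16Array {
  const out = new Int16Array(mono24k.length * 2 * 2); // 2x rate, 2 channels
  const toS16 = (v: number): number =>
    Math.max(-32768, Math.min(32767, Math.round(v * 32767)));
  for (let i = 0; i < mono24k.length; i++) {
    const a = mono24k[i];
    const b = i + 1 < mono24k.length ? mono24k[i + 1] : a;
    const samples = [toS16(a), toS16((a + b) / 2)]; // original + midpoint
    for (let k = 0; k < 2; k++) {
      const frame = (i * 2 + k) * 2;
      out[frame] = samples[k]; // left
      out[frame + 1] = samples[k]; // right
    }
  }
  return out;
}

/** Largest absolute sample value, for comparing the s16 and f32 sides. */
function peak(samples: ArrayLike<number>): number {
  let max = 0;
  for (let i = 0; i < samples.length; i++) {
    max = Math.max(max, Math.abs(samples[i]));
  }
  return max;
}
```

A rate mismatch in either direction does not crash anything; it only shifts pitch and speed, which is why stage 9 failures show up as half-speed or chipmunk playback and not as a log line.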

Latency budget

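Per-turn costs on a 4-core CPU homelab box. RTF is the real-time factor: processing time divided by audio duration, so lower is faster and anything above 1.0 is slower than real time.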

Stage	Cost
End-of-utterance silence	`SILENCE_MS=600`
Whisper STT (`small` int8)	~0.5-0.8 RTF (multilingual default)
Whisper STT (`base.en` int8)	~0.3-0.5 RTF (English-only path)
Speaker agent	~5-8s for CLI backends, ~0.5-1.5s for direct API
Kokoro TTS (en/ja/zh/es/fr/hi/it/pt)	~0.5-0.85 RTF
MeloTTS (Korean, monotone)	~2.3 RTF
XTTS-v2 (Korean, ~58 speakers)	~2.5-3.0 RTF
Playback start	~200ms
Total (English path)	~3-8s typical
Total (Korean path)	~10-25s typical (TTS dominates)

The CLI backends dominate the English-path budget. If you have an Anthropic API key, setting AGENT_BACKEND=anthropic-api cuts roughly 5s off every turn.
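The table rows can be combined into a rough per-turn estimate. The sketch below is only back-of-the-envelope arithmetic: the interface and function names are hypothetical, the utterance and reply lengths are made-up inputs, and it assumes the stages run strictly one after another with no streaming overlap, so treat the result as a pessimistic estimate, not a measurement.

```typescript
// Illustrative only: field names are hypothetical, and the model assumes
// the stages run sequentially with no overlap between them.
interface TurnEstimate {
  utteranceSec: number; // length of the user's speech
  replySec: number; // length of the synthesized reply
  sttRtf: number; // STT processing time / audio duration
  agentSec: number; // wall-clock time spent in the speaker agent
  ttsRtf: number; // TTS processing time / audio duration
  silenceMs?: number; // end-of-utterance wait, defaults to SILENCE_MS=600
  playbackStartMs?: number; // defaults to the ~200ms from the table
}

function estimateTurnSeconds(t: TurnEstimate): number {
  const silence = (t.silenceMs ?? 600) / 1000;
  const stt = t.sttRtf * t.utteranceSec;
  const tts = t.ttsRtf * t.replySec;
  const playback = (t.playbackStartMs ?? 200) / 1000;
  return silence + stt + t.agentSec + tts + playback;
}

// Same 2s utterance and 2s reply, mid-range RTFs from the table;
// only the agent cost differs between the two backends.
const shared = { utteranceSec: 2, replySec: 2, sttRtf: 0.4, ttsRtf: 0.7 };
const viaCli = estimateTurnSeconds({ ...shared, agentSec: 6 });
const viaApi = estimateTurnSeconds({ ...shared, agentSec: 1 });
console.log({ viaCli, viaApi, saved: viaCli - viaApi });
```

Because every other term is identical, the difference between the two estimates is exactly the difference in agent time, which is where the saving from switching backends comes from.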

Capture loop self-healing

The receive subscription can wedge on DAVE decryption errors (DAVE is Discord's E2E voice encryption protocol) or on network glitches. An error event on either pcmStream or opusStream triggers an immediate re-subscribe, and a one-shot guard keeps the loop from double-subscribing. Without this, the bot would go silent after one bad packet.
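The one-shot guard can be sketched as follows. This is a minimal illustration under stated assumptions, not the bot's actual code: `startCapture`, `openStreams`, and `onResubscribe` are hypothetical names, `openStreams` stands in for whatever wraps the receiver subscription and the Opus decoder, and the streams are reduced to a tiny interface with the same `once("error", ...)` shape that Node streams expose.

```typescript
// Minimal shape of a stream we can listen on for errors.
interface ErrorSource {
  once(event: "error", listener: (err: Error) => void): unknown;
}

interface CaptureStreams {
  opusStream: ErrorSource;
  pcmStream: ErrorSource;
}

/**
 * Subscribe, then re-subscribe exactly once per failure, even if both
 * streams emit `error` for the same underlying problem.
 * `openStreams` is a hypothetical factory for a fresh pair of streams.
 */
function startCapture(
  openStreams: () => CaptureStreams,
  onResubscribe?: (source: string, err: Error) => void,
): void {
  const { opusStream, pcmStream } = openStreams();
  let healed = false; // one-shot guard, scoped to this subscription

  const heal = (source: string) => (err: Error): void => {
    if (healed) return; // the other stream already triggered recovery
    healed = true;
    if (onResubscribe) onResubscribe(source, err);
    startCapture(openStreams, onResubscribe); // fresh streams, fresh guard
  };

  opusStream.once("error", heal("opus"));
  pcmStream.once("error", heal("pcm"));
}
```

The guard lives inside each call, so every new subscription gets its own flag: a later failure on the new streams heals again, while a second error from the old pair is ignored. A real implementation would presumably also tear down the old streams before re-subscribing; that is left out here.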

Concurrency

Capture, STT, agent, and TTS run asynchronously per utterance. The capture loop doesn't block on inference: it re-subscribes immediately after passing the buffer to runAgent(). If you talk faster than the bot can respond, multiple turns can be in flight; the AudioPlayer holds a single resource at a time, so only the most recently synthesized response actually plays. Phase 4 will introduce a turn-management layer.
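A toy model makes the consequence of the single-resource player concrete. Everything below is hypothetical and touches no real voice API: `SingleSlotPlayer` stands in for a player whose `play()` replaces whatever is currently playing, and the simulation assumes the responses overlap in time, so each new one displaces the previous one.

```typescript
// Toy model of the behaviour described above; all names are hypothetical.
interface Turn {
  id: number; // order in which the utterance was captured
  finishedAtMs: number; // when its synthesized audio became ready
}

/** Stand-in for a player that can hold only one resource at a time. */
class SingleSlotPlayer {
  current: number | null = null;
  readonly replaced: number[] = [];

  play(turnId: number): void {
    if (this.current !== null) this.replaced.push(this.current);
    this.current = turnId; // a new play() displaces the previous resource
  }
}

/** Feeds turns to the player in the order their synthesis completed. */
function simulate(turns: Turn[]): SingleSlotPlayer {
  const player = new SingleSlotPlayer();
  turns
    .slice()
    .sort((a, b) => a.finishedAtMs - b.finishedAtMs)
    .forEach((t) => player.play(t.id));
  return player;
}

// Turn 1 needs a long answer; turn 2 is short and finishes first.
const player = simulate([
  { id: 1, finishedAtMs: 9000 },
  { id: 2, finishedAtMs: 4000 },
]);
console.log(player.current, player.replaced);
```

In this model the response that ends up on the call is decided by completion order, not by which utterance came last, so a slow answer to an older question can displace a quick answer to a newer one.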
