Add voice questions with QVAC transcription
Record audio in the browser, transcribe locally through @qvac/sdk, hand the text to the existing chat endpoint.
Where it belongs: the UI for recording, the QVAC gateway for the transcription call, and a thin server route that connects them.
A voice question is just a text question with one extra step at the front. QVAC ships Whisper out of the box, so there's no third-party model to wire up. The same SDK that powers chat and embeddings transcribes an audio chunk into the question string the chat endpoint expects.
The official QVAC transcription reference lives at
docs.qvac.tether.io/sdk/examples/ai-tasks/transcription.
The QVAC transcription surface
import {
  loadModel,
  transcribe,
  unloadModel,
  WHISPER_TINY,
} from "@qvac/sdk";
transcribe({ modelId, audioChunk }) returns the full
transcription as a single string. audioChunk accepts either a
file path or an in-memory buffer, so the gateway doesn't need to
write the recorded blob to disk before transcribing.
WHISPER_TINY is the smallest model and the right default for
voice questions. Accuracy is plenty for short prompts and load
time stays fast. For multilingual capture, swap in one of the
Parakeet TDT constants from the QVAC reference.
The four-step recipe
1. Add a transcribe method to the gateway
import {
  loadModel,
  transcribe,
  unloadModel,
  WHISPER_TINY,
} from "@qvac/sdk";

export class QvacGateway {
  // existing fields…
  private sttModelId: string | undefined;

  private async ensureSttReady(): Promise<void> {
    if (this.sttModelId) return;
    this.sttModelId = await loadModel({
      modelSrc: WHISPER_TINY,
      modelType: "whisper",
      modelConfig: { language: "en" },
    });
  }

  async transcribe(audio: Buffer | string): Promise<string> {
    await this.ensureSttReady();
    return transcribe({
      modelId: required(this.sttModelId, "QVAC transcription model is not loaded."),
      audioChunk: audio,
    });
  }
}
Same shape as the existing chat and embedding loaders: lazy load
on first use, a required guard that fails fast if the model ID is
missing, one method per task. The transcription model is
independent of the chat model, so loading it doesn't delay the
next chat round-trip.
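If ensureSttReady can be called concurrently, caching the load promise rather than the resolved ID keeps a single load in flight. The following is a minimal sketch of that pattern, not the gateway's actual code: singleFlight is a hypothetical helper that is not part of @qvac/sdk, and the fake loader stands in for loadModel.

```typescript
// Hypothetical helper (not part of @qvac/sdk): wraps an async loader so that
// concurrent callers share one in-flight promise.
function singleFlight<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = load().catch((error: unknown) => {
        // A failed load should not be cached; let the next call retry.
        pending = undefined;
        throw error;
      });
    }
    return pending;
  };
}

// Demo with a fake loader standing in for loadModel.
let loadCount = 0;
const ensureModel = singleFlight(async () => {
  loadCount += 1;
  return "fake-model-id";
});

// Both calls happen before the load resolves, yet the loader runs once.
void Promise.all([ensureModel(), ensureModel()]).then(([first, second]) => {
  console.log(first === second, loadCount); // logs: true 1
});
```

In the gateway, the same idea would mean storing the promise returned by loadModel in a field and awaiting it from transcribe, so every caller resolves to the same model ID.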
2. Expose it through LocalLensApp
A small forwarder keeps the gateway encapsulated:
async transcribe(audio: Buffer): Promise<string> {
  return this.qvac.transcribe(audio);
}
That's the whole workflow change. Voice questions reuse the
existing askBrain pipeline, so multi-step state transitions
stay in one place.
3. Add a /api/transcribe route
if (url.pathname === "/api/transcribe" && request.method === "POST") {
  const arrayBuffer = await request.arrayBuffer();
  const audio = Buffer.from(arrayBuffer);
  const text = await app.transcribe(audio);
  return json({ text });
}
A thin pass-through, exactly like the existing chat and brain
endpoints. Errors thrown by LocalLensApp.transcribe flow
through the shared errorResponse helper unchanged. AppErrors
keep their status codes; anything else surfaces as a 500.
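The route forwards whatever bytes arrive, so an empty or very large upload reaches the model unchecked. Below is a minimal sketch of a body guard and of the status mapping described above. StatusError, assertAudioBody, toErrorPayload, and the 10 MB cap are all hypothetical stand-ins; the real AppError and errorResponse in the codebase may be shaped differently.

```typescript
// Hypothetical stand-in for the codebase's AppError: an Error carrying an HTTP status.
class StatusError extends Error {
  status: number;
  constructor(status: number, message: string) {
    super(message);
    this.status = status;
  }
}

// Assumed upload cap; tune to the longest recording you expect.
const MAX_AUDIO_BYTES = 10 * 1024 * 1024;

// Reject empty or oversized uploads before any model work happens.
function assertAudioBody(body: ArrayBuffer, maxBytes: number = MAX_AUDIO_BYTES): ArrayBuffer {
  if (body.byteLength === 0) {
    throw new StatusError(400, "Expected recorded audio in the request body.");
  }
  if (body.byteLength > maxBytes) {
    throw new StatusError(413, "Audio upload is too large.");
  }
  return body;
}

// Errors that carry a numeric status keep it; anything else becomes a 500.
function toErrorPayload(error: unknown): { status: number; message: string } {
  if (error instanceof Error) {
    const status = (error as unknown as { status?: unknown }).status;
    if (typeof status === "number") {
      return { status, message: error.message };
    }
  }
  return { status: 500, message: "Internal server error" };
}
```

In the route, the guard would sit between request.arrayBuffer() and Buffer.from(...), leaving the rest of the handler as it is.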
4. Record and submit from the UI
Use the standard MediaRecorder API for capture:
async function startRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start();
  return {
    stop: () =>
      new Promise((resolve) => {
        recorder.onstop = () => {
          // Release the microphone once recording ends.
          stream.getTracks().forEach((t) => t.stop());
          // Label the blob with the container the browser actually produced
          // (typically WebM, Ogg, or MP4, not WAV).
          resolve(new Blob(chunks, { type: recorder.mimeType }));
        };
        recorder.stop();
      }),
  };
}
Add a microphone button next to the chat input. When the
recording finishes, POST the blob to /api/transcribe, drop the
returned text into the chat input, and submit the existing chat
form:
const recorder = await startRecording();
// … wait for the user to release the button …
const audio = await recorder.stop();
const response = await fetch("/api/transcribe", {
  method: "POST",
  body: audio,
});
// Surface server errors instead of parsing an error body as a transcript.
if (!response.ok) throw new Error(`Transcription failed: ${response.status}`);
const { text } = await response.json();
elements.chatInput.value = text;
elements.chatForm.requestSubmit();
The chat form already knows how to call /api/brains/:id/chat,
so the rest of the round-trip is unchanged. The user sees their
words appear as a question, and the answer streams back the same
way.
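The recording step leaves the container format to the browser. If the server side decodes only certain formats, one option is to request a specific type when constructing the recorder. This is a sketch under stated assumptions: pickRecorderMimeType and the preference list are hypothetical, and the predicate is injected so the selection logic can run outside a browser.

```typescript
// Preference order is an assumption; adjust it to whatever the server can decode.
const PREFERRED_AUDIO_TYPES: readonly string[] = [
  "audio/webm;codecs=opus",
  "audio/ogg;codecs=opus",
  "audio/mp4",
];

// Returns the first candidate the predicate accepts, or undefined to let
// the browser fall back to its own default.
function pickRecorderMimeType(
  candidates: readonly string[],
  isSupported: (mimeType: string) => boolean,
): string | undefined {
  for (const candidate of candidates) {
    if (isSupported(candidate)) return candidate;
  }
  return undefined;
}

// In the browser the predicate would be MediaRecorder.isTypeSupported:
//   const mimeType = pickRecorderMimeType(PREFERRED_AUDIO_TYPES, (t) => MediaRecorder.isTypeSupported(t));
//   const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
// Here a fake predicate simulates a browser that only records Ogg.
const chosen = pickRecorderMimeType(PREFERRED_AUDIO_TYPES, (t) => t.indexOf("audio/ogg") === 0);
console.log(chosen); // "audio/ogg;codecs=opus"
```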
Whisper expects 16 kHz WAV
MediaRecorder picks its own container and codec, typically
compressed WebM, Ogg, or MP4 depending on the browser, which may
not match what Whisper wants. The QVAC docs cover the
audio_format: "f32le" knob and recommend 16 kHz audio. If
transcription fails or quality is poor, decode and resample to
16 kHz on the client (or server) before calling transcribe.
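A minimal sketch of the resample step, assuming the recording has already been decoded to mono Float32 PCM (in the browser, AudioContext.decodeAudioData can produce that). resampleLinear and encodeWavPcm16 are illustrative helpers, not QVAC APIs. Linear interpolation is a simple technique that skips the low-pass filter a production resampler would apply, and whether transcribe accepts this exact WAV layout should be checked against the QVAC docs.

```typescript
// Resample mono PCM by linear interpolation. Adequate for a quick speech
// experiment; it does not low-pass filter, so downsampling can alias.
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  if (fromRate <= 0 || toRate <= 0) throw new RangeError("Sample rates must be positive.");
  if (input.length === 0) return new Float32Array(0);
  if (fromRate === toRate) return input.slice();

  const outLength = Math.max(1, Math.round((input.length * toRate) / fromRate));
  const output = new Float32Array(outLength);
  const step = fromRate / toRate;
  const lastIndex = input.length - 1;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const left = Math.min(Math.floor(pos), lastIndex);
    const right = Math.min(left + 1, lastIndex);
    const frac = pos - left;
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}

// Wrap mono samples in a 44-byte RIFF/WAVE header as 16-bit little-endian PCM.
function encodeWavPcm16(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const dataBytes = samples.length * 2;
  const buffer = new ArrayBuffer(44 + dataBytes);
  const view = new DataView(buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // format tag 1 = PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = rate * channels * 2 bytes
  view.setUint16(32, 2, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);

  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(44 + i * 2, Math.round(scaled), true);
  }
  return buffer;
}

// One second of silence at 48 kHz becomes 16000 samples and a 32044-byte WAV.
const at16k = resampleLinear(new Float32Array(48000), 48000, 16000);
const wav = encodeWavPcm16(at16k, 16000);
console.log(at16k.length, wav.byteLength); // 16000 32044
```

If the gateway is configured for raw f32le input instead, the Float32Array returned by resampleLinear could be sent as-is and the WAV wrapper skipped.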
What you don't need to change
- rag.ts: same prompt builder.
- The chat endpoint: same JSON body, same response shape.
- store.ts: no new fields.
- domain.ts: no new types beyond what the chat path already needs.
That's the payoff of routing voice through the existing chat path. Every improvement to the answer side benefits voice questions for free.