LocalLens

Add voice questions with QVAC transcription

Record audio in the browser, transcribe it locally through @qvac/sdk, and hand the text to the existing chat endpoint.

Where it belongs: the UI for recording, the QVAC gateway for the transcription call, and a thin server route that connects them.

A voice question is just a text question with one extra step at the front. QVAC ships Whisper out of the box, so there's no third-party model to wire up. The same SDK that powers chat and embeddings transcribes an audio chunk into the question string the chat endpoint expects.

The official QVAC transcription reference lives at docs.qvac.tether.io/sdk/examples/ai-tasks/transcription.

The QVAC transcription surface

import {
  loadModel,
  transcribe,
  unloadModel,
  WHISPER_TINY,
} from "@qvac/sdk";

transcribe({ modelId, audioChunk }) returns the full transcription as a single string. audioChunk accepts either a file path or an in-memory buffer, so the gateway doesn't need to write the recorded blob to disk before transcribing.

WHISPER_TINY is the smallest model and the right default for voice questions. Accuracy is plenty for short prompts and load time stays fast. For multilingual capture, swap in one of the Parakeet TDT constants from the QVAC reference.

The four-step recipe

1. Add a transcribe method to the gateway

src/qvac.ts
import {
  loadModel,
  transcribe,
  unloadModel,
  WHISPER_TINY,
} from "@qvac/sdk";

export class QvacGateway {
  // existing fields…
  private sttModelId: string | undefined;

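  // Load the Whisper model on first use; later calls return immediately.
  private async ensureSttReady(): Promise<void> {
    if (this.sttModelId) return;
    this.sttModelId = await loadModel({
      modelSrc: WHISPER_TINY,
      modelType: "whisper",
      modelConfig: { language: "en" },
    });
  }

  // Accepts an in-memory buffer or a file path and resolves to the transcript.
  async transcribe(audio: Buffer | string): Promise<string> {
    await this.ensureSttReady();
    return transcribe({
      // required throws with this message if the model ID is still unset.
      modelId: required(this.sttModelId, "QVAC transcription model is not loaded."),
      audioChunk: audio,
    });
  }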
}

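Same shape as the existing chat and embedding loaders: lazy load on first use, a required guard that fails loudly if the model ID is missing, one method per task. Note that ensureSttReady as written checks only the resolved ID, so two concurrent first calls can both reach loadModel; cache the load promise if that matters for your traffic. The transcription model is independent of the chat model, so loading it doesn't delay the next chat round-trip.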
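
The lazy-load pattern can be made safe for concurrent first requests by caching the load promise rather than the resolved ID. The following is a minimal sketch, not taken from the LocalLens source: LazyModel and its loader parameter are hypothetical names, and loader stands in for a call such as loadModel so the example runs without the SDK.

```typescript
// Hypothetical helper: caches the in-flight load so concurrent callers share it.
class LazyModel {
  private pending: Promise<string> | undefined;
  private readonly loader: () => Promise<string>;

  // `loader` is an assumption standing in for something like loadModel(...).
  constructor(loader: () => Promise<string>) {
    this.loader = loader;
  }

  get(): Promise<string> {
    if (!this.pending) {
      this.pending = this.loader().catch((error: unknown) => {
        // Clear the cache on failure so the next call can retry the load.
        this.pending = undefined;
        throw error;
      });
    }
    return this.pending;
  }
}
```

With this shape, ensureSttReady would reduce to awaiting get(), and a failed load does not leave the gateway stuck with a rejected promise.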

2. Expose it through LocalLensApp

A small forwarder keeps the gateway encapsulated:

src/locallens.ts
async transcribe(audio: Buffer): Promise<string> {
  return this.qvac.transcribe(audio);
}

That's the whole workflow change. Voice questions reuse the existing askBrain pipeline, so multi-step state transitions stay in one place.

3. Add a /api/transcribe route

src/server.ts
if (url.pathname === "/api/transcribe" && request.method === "POST") {
  const arrayBuffer = await request.arrayBuffer();
  const audio = Buffer.from(arrayBuffer);
  const text = await app.transcribe(audio);
  return json({ text });
}

A thin pass-through, exactly like the existing chat and brain endpoints. Errors thrown by LocalLensApp.transcribe flow through the shared errorResponse helper unchanged. AppErrors keep their status codes; anything else surfaces as a 500.
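
The error contract described above can be sketched as follows. The shapes of AppError and the mapping function here are assumed for illustration; the real helpers live in the LocalLens source and may differ. The sketch returns a plain object rather than a Response so it stays self-contained, and assertHasAudio is a hypothetical guard a route could run before transcribing.

```typescript
// Hypothetical stand-in for the app's AppError: an Error carrying an HTTP status.
class AppError extends Error {
  status: number;
  constructor(status: number, message: string) {
    super(message);
    this.name = "AppError";
    this.status = status;
  }
}

interface ErrorPayload {
  status: number;
  body: { error: string };
}

// Structural check, so it does not depend on how Error subclasses are compiled.
function isAppError(error: unknown): error is AppError {
  return (
    typeof error === "object" &&
    error !== null &&
    typeof (error as { status?: unknown }).status === "number"
  );
}

// AppErrors keep their status code; anything else becomes a generic 500.
function toErrorPayload(error: unknown): ErrorPayload {
  if (isAppError(error)) {
    return { status: error.status, body: { error: error.message } };
  }
  return { status: 500, body: { error: "Internal server error" } };
}

// Hypothetical guard: reject empty uploads before loading the model.
function assertHasAudio(audio: Uint8Array): void {
  if (audio.byteLength === 0) {
    throw new AppError(400, "Request body contained no audio.");
  }
}
```

Rejecting an empty body up front gives the client a 400 with a clear message instead of whatever the transcription call would do with zero bytes.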

4. Record and submit from the UI

Use the standard MediaRecorder API for capture:

src/ui/app.js
async function startRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];

  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start();

  return {
    stop: () =>
      new Promise((resolve) => {
        recorder.onstop = () => {
          stream.getTracks().forEach((t) => t.stop());
          // MediaRecorder does not produce WAV; label the blob with its real type.
          resolve(new Blob(chunks, { type: recorder.mimeType }));
        };
        recorder.stop();
      }),
  };
}

Add a microphone button next to the chat input. When the recording finishes, POST the blob to /api/transcribe, drop the returned text into the chat input, and submit the existing chat form:

const recorder = await startRecording();
// … wait for the user to release the button …
const audio = await recorder.stop();

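const response = await fetch("/api/transcribe", {
  method: "POST",
  body: audio,
});
// Surface server errors instead of submitting an undefined question.
if (!response.ok) throw new Error(`Transcription failed: ${response.status}`);
const { text } = await response.json();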

elements.chatInput.value = text;
elements.chatForm.requestSubmit();

The chat form already knows how to call /api/brains/:id/chat, so the rest of the round-trip is unchanged. The user sees their words appear as a question, and the answer streams back the same way.

Whisper expects 16 kHz WAV

MediaRecorder picks its own container and codec (commonly WebM or Ogg with Opus, depending on the browser), which may not match what Whisper wants. The QVAC docs cover the audio_format: "f32le" option and recommend 16 kHz audio. If transcription fails or quality is poor, decode and resample on the client (or server) before calling transcribe.
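
If resampling turns out to be necessary, the core of it is small. The following is a minimal sketch, assuming the recording has already been decoded to mono Float32 samples (for example with the Web Audio API's decodeAudioData). resampleLinear and toF32leBytes are hypothetical helpers, not part of @qvac/sdk, and linear interpolation is a simple stand-in for a proper low-pass resampler.

```typescript
const TARGET_RATE = 16000;

// Resample mono samples from `fromRate` to `toRate` by linear interpolation.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number = TARGET_RATE,
): Float32Array {
  if (fromRate <= 0 || toRate <= 0) {
    throw new RangeError("Sample rates must be positive.");
  }
  if (fromRate === toRate || input.length === 0) return input.slice();

  const outLength = Math.max(1, Math.round((input.length * toRate) / fromRate));
  const output = new Float32Array(outLength);
  const step = fromRate / toRate;

  for (let i = 0; i < outLength; i++) {
    const position = i * step;
    // Clamp both neighbours so the last output sample never reads past the end.
    const left = Math.min(Math.floor(position), input.length - 1);
    const right = Math.min(left + 1, input.length - 1);
    const weight = position - left;
    output[i] = input[left] * (1 - weight) + input[right] * weight;
  }
  return output;
}

// Serialize samples as little-endian 32-bit floats (the "f32le" layout).
function toF32leBytes(samples: Float32Array): Uint8Array {
  const bytes = new Uint8Array(samples.length * 4);
  const view = new DataView(bytes.buffer);
  samples.forEach((sample, i) => view.setFloat32(i * 4, sample, true));
  return bytes;
}
```

Whether the resulting bytes can be passed directly as audioChunk depends on the audio_format setting described in the QVAC docs, so treat the hand-off as something to verify against the reference.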

What you don't need to change

  • rag.ts — same prompt builder.
  • The chat endpoint — same JSON body, same response shape.
  • store.ts — no new fields.
  • domain.ts — no new types beyond what the chat path already needs.

That's the payoff of routing voice through the existing chat path. Every improvement to the answer side benefits voice questions for free.
