3. The QVAC gateway

src/qvac.ts is the single place in LocalLens that talks to the QVAC SDK. Everything else goes through QvacGateway. That's how the rest of the app stays SDK-free.

The shape of the gateway

src/qvac.ts

import {
  close,
  completion,
  GTE_LARGE_FP16,
  loadModel,
  QWEN3_1_7B_INST_Q4,
  QWEN3_600M_INST_Q4,
  ragCloseWorkspace,
  ragDeleteWorkspace,
  ragIngest,
  ragSearch,
} from "@qvac/sdk";
import type { ChatMessage, SearchHit, TextChunk } from "./domain.ts";

const chatModelConfig = { ctx_size: 4096, temp: 0.2, top_p: 0.9 };

export class QvacGateway {
  private chatModelId: string | undefined;
  private embeddingModelId: string | undefined;
  private readyPromise: Promise<void> | undefined;
  // … methods below
}

Three private fields, three public methods, plus lifecycle helpers. That's the whole surface.

Lazy model loading

private async ensureReady(): Promise<void> {
  if (this.chatModelId && this.embeddingModelId) return;
  this.readyPromise ??= this.loadModels().finally(() => {
    this.readyPromise = undefined;
  });
  await this.readyPromise;
}

private async loadModels(): Promise<void> {
  this.embeddingModelId ??= await loadModel({ modelSrc: GTE_LARGE_FP16 });

  if (this.chatModelId) return;

  try {
    this.chatModelId = await loadModel({
      modelSrc: QWEN3_1_7B_INST_Q4,
      modelConfig: chatModelConfig,
    });
  } catch {
    this.chatModelId = await loadModel({
      modelSrc: QWEN3_600M_INST_Q4,
      modelConfig: chatModelConfig,
    });
  }
}

Two lazy properties:

embeddingModelId loads first and never falls back. GTE_LARGE_FP16 is assumed small enough to load wherever LocalLens runs, so a failure here propagates to the caller.
chatModelId tries QWEN3_1_7B_INST_Q4 first and falls back to QWEN3_600M_INST_Q4 on any load failure.

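readyPromise makes concurrent calls share one in-flight load: two requests arriving in the same tick await the same promise instead of each calling loadModel for the same source. The finally handler clears the promise once the load settles, so a failed load can be retried on the next call.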
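The sharing behaviour can be seen in isolation with a small sketch. SingleFlight and slowLoad are hypothetical names used only for illustration; they are not part of LocalLens or the QVAC SDK, and a timer stands in for the model load.

```typescript
// Sketch of the single-flight pattern used by ensureReady().
// `SingleFlight` and `slowLoad` are hypothetical, for illustration only.
class SingleFlight {
  private value: string | undefined;
  private inFlight: Promise<void> | undefined;
  private readonly load: () => Promise<string>;
  loadCount = 0;

  constructor(load: () => Promise<string>) {
    this.load = load;
  }

  async get(): Promise<string> {
    if (this.value === undefined) {
      // Only the first caller starts the load; later callers await the same promise.
      this.inFlight ??= this.run().finally(() => {
        // Clear the slot once settled so a failed load can be retried.
        this.inFlight = undefined;
      });
      await this.inFlight;
    }
    if (this.value === undefined) throw new Error("load did not produce a value");
    return this.value;
  }

  private async run(): Promise<void> {
    this.loadCount += 1;
    this.value = await this.load();
  }
}

// Stand-in for a slow model load.
const slowLoad = (): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve("model-1"), 10));

const flight = new SingleFlight(slowLoad);
// Two calls issued in the same tick share one underlying load.
const both = Promise.all([flight.get(), flight.get()]);
```

Both callers receive the same value, and loadCount stays at 1 because the second call found the promise already in flight.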

Ingesting chunks

async ingestChunks(workspace: string, chunks: TextChunk[]): Promise<void> {
  await this.ensureReady();
  await this.closeWorkspace(workspace, true);

  if (chunks.length === 0) return;

  await ragIngest({
    modelId: required(this.embeddingModelId, "QVAC embedding model is not loaded."),
    workspace,
    documents: chunks.map((chunk) => formatChunkForRag(workspace, chunk)),
    chunk: false,
  });
}

Two details that matter:

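closeWorkspace(workspace, true) is called before ingest. That clears any prior workspace with the same name, so re-ingesting replaces the old index instead of appending to it, and running the same ingest twice leaves the same result. Note that it runs before the empty check, so ingesting zero chunks simply clears the workspace.
chunk: false tells QVAC the documents are already chunked. The splitting happened in rag.ts. Splitting again would cut across those chunk boundaries and could separate the envelope header from its content.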

The chunks get reformatted into a small text envelope:

source:<relativePath>
chunk:<chunkIndex>
id:<workspace>:<chunk.id>

<chunk content>

…so that ragSearch returns them with the metadata still attached.
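ingestChunks relies on two helpers that are not shown in this section, required and formatChunkForRag. The following is a minimal sketch of what they might look like, assuming the envelope is a plain string and that TextChunk carries id, relativePath, chunkIndex, and content fields. The real definitions in src/qvac.ts and ./domain.ts may differ.

```typescript
// Assumed shape of TextChunk; the real type lives in ./domain.ts and may differ.
interface TextChunk {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
}

// Possible `required`: narrow `T | undefined` to `T`, or fail with a clear message.
function required<T>(value: T | undefined, message: string): T {
  if (value === undefined) throw new Error(message);
  return value;
}

// Possible `formatChunkForRag`: three header lines, a blank line, then the chunk text.
function formatChunkForRag(workspace: string, chunk: TextChunk): string {
  return [
    `source:${chunk.relativePath}`,
    `chunk:${chunk.chunkIndex}`,
    `id:${workspace}:${chunk.id}`,
    "",
    chunk.content,
  ].join("\n");
}

// Sample values are made up for illustration.
const envelope = formatChunkForRag("notes", {
  id: "c7",
  relativePath: "docs/intro.md",
  chunkIndex: 2,
  content: "LocalLens indexes local files.",
});
```

Keeping the header to one key per line is what lets the search side recover each field with a simple anchored regular expression.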

Searching

async search(workspace: string, question: string, topK = 5): Promise<SearchHit[]> {
  await this.ensureReady();
  const results = await ragSearch({
    modelId: required(this.embeddingModelId, "QVAC embedding model is not loaded."),
    query: question,
    topK,
    n: 3,
    workspace,
  });
  return results.map(parseRagHit);
}

parseRagHit extracts the source: and chunk: headers we wrote during ingest and strips them from the chunk content:

function parseRagHit(hit: { id: string; content: string; score: number }): SearchHit {
  return {
    id: hit.id,
    relativePath: /^source:(.+)$/m.exec(hit.content)?.[1] ?? "unknown",
    chunkIndex: Number(/^chunk:(\d+)$/m.exec(hit.content)?.[1] ?? 0),
    content: hit.content.replace(/^(?:workspace:.+\n)?source:.+\nchunk:\d+\nid:.+\n\n/m, "").trim(),
    score: hit.score,
  };
}

That's where strings come back as structured SearchHit objects.
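To make the parsing concrete, here is parseRagHit applied to a sample envelope. The SearchHit interface is an assumed shape inferred from the fields the function returns, and the hit values are made up for illustration.

```typescript
// Assumed shape of SearchHit, inferred from the fields parseRagHit returns.
interface SearchHit {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score: number;
}

// Same logic as the gateway's parseRagHit, repeated so the example is self-contained.
function parseRagHit(hit: { id: string; content: string; score: number }): SearchHit {
  return {
    id: hit.id,
    relativePath: /^source:(.+)$/m.exec(hit.content)?.[1] ?? "unknown",
    chunkIndex: Number(/^chunk:(\d+)$/m.exec(hit.content)?.[1] ?? 0),
    content: hit.content
      .replace(/^(?:workspace:.+\n)?source:.+\nchunk:\d+\nid:.+\n\n/m, "")
      .trim(),
    score: hit.score,
  };
}

// A hit whose content still carries the envelope written at ingest time.
const parsed = parseRagHit({
  id: "hit-1",
  content: "source:docs/intro.md\nchunk:2\nid:notes:c7\n\nLocalLens indexes local files.",
  score: 0.87,
});

// A hit with no envelope falls back to the defaults and keeps its content as-is.
const bare = parseRagHit({ id: "hit-2", content: "just text", score: 0.1 });
```

Here parsed.relativePath is "docs/intro.md", parsed.chunkIndex is 2, and parsed.content is the chunk text without the header. The bare hit comes back with relativePath "unknown" and chunkIndex 0.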

Streaming completion

async *answer(history: ChatMessage[]): AsyncGenerator<string> {
  await this.ensureReady();
  const run = completion({
    modelId: required(this.chatModelId, "QVAC chat model is not loaded."),
    history,
    stream: true,
    captureThinking: true,
    kvCache: true,
  });

  for await (const event of run.events) {
    if (event.type === "contentDelta") yield event.text;
  }
  await run.final;
}

The gateway exposes inference as an AsyncGenerator<string>, so callers can stream directly to a console, an HTTP response, or a UI.

captureThinking: true keeps any reasoning tokens out of the user-visible output stream: the loop yields only contentDelta events.
kvCache: true lets QVAC reuse prefix attention state across follow-up questions.
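On the caller's side, consuming the generator could look like the sketch below. fakeAnswer is a hypothetical stand-in for QvacGateway.answer() that yields fixed deltas instead of calling a model, and streamTo is an illustrative helper, not part of LocalLens.

```typescript
// Hypothetical stand-in for QvacGateway.answer(): yields fixed deltas, no model involved.
async function* fakeAnswer(): AsyncGenerator<string> {
  for (const delta of ["Local", "Lens ", "runs ", "offline."]) yield delta;
}

// Forward each delta to a sink as it arrives and return the full text at the end.
async function streamTo(
  deltas: AsyncIterable<string>,
  write: (text: string) => void,
): Promise<string> {
  let full = "";
  for await (const delta of deltas) {
    write(delta);
    full += delta;
  }
  return full;
}

// The sink here is an array; it could equally be process.stdout.write or an HTTP response.
const seen: string[] = [];
const streamed = streamTo(fakeAnswer(), (text) => {
  seen.push(text);
});
```

Because the sink is just a function, the same loop serves a console, an HTTP response, or a UI without the gateway knowing which one it is feeding.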

Lifecycle

async closeWorkspace(workspace: string, deleteOnClose = false): Promise<void> {
  await ragCloseWorkspace({ workspace, deleteOnClose }).catch(async () => {
    if (deleteOnClose) await ragDeleteWorkspace({ workspace }).catch(() => undefined);
  });
}

async close(): Promise<void> {
  await close();
}

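closeWorkspace is forgiving. Any error from ragCloseWorkspace is swallowed; if the caller asked for deletion, the gateway falls back to ragDeleteWorkspace and ignores that error too. A workspace that doesn't exist is therefore fine, since the caller wanted it gone anyway. close() tears down the whole QVAC runtime and is called from LocalLensApp.close() on shutdown.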

Why a class and not a module?

The gateway holds state: two model IDs and an in-flight load promise. A module of free functions would push that state onto every caller, or hide it in module-level mutables. A small class is the cheapest way to encapsulate it.

Next: the JSON store, where brains and chunks get persisted.