Why QVAC

QVAC is the SDK that runs LocalLens's local AI loop. What makes it interesting for a project this size: one package handles five jobs (model loading, chunking, ingestion, search, and completion) that would otherwise be spread across several separate libraries.

The five SDK calls we use

import {
  loadModel,
  ragChunk,
  ragIngest,
  ragSearch,
  completion,
  // and the lifecycle helpers:
  ragCloseWorkspace,
  ragDeleteWorkspace,
  close,
} from "@qvac/sdk";

Call	Where it's used	What it does
`loadModel`	`src/qvac.ts → ensureReady`	Loads the embedding model (`GTE_LARGE_FP16`) and the chat model (`QWEN3_1_7B_INST_Q4` with a 600M fallback).
`ragChunk`	`src/rag.ts → chunkDocument`	Splits document text into ~220-token windows with 40-token overlap.
`ragIngest`	`src/qvac.ts → ingestChunks`	Embeds chunks and stores them in a named workspace.
`ragSearch`	`src/qvac.ts → search`	Embeds a query and returns top-K matching chunks.
`completion`	`src/qvac.ts → answer`	Streams a chat completion from the loaded model.
`ragCloseWorkspace`	`src/qvac.ts → closeWorkspace`	Closes (and optionally deletes) the workspace on disk.
`close`	`src/qvac.ts → close`	Tears down the QVAC runtime when the app exits.

That's the whole API surface LocalLens uses. No manual embedding loop, no separate vector database, no custom token splitter.
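
To make the "~220-token windows with 40-token overlap" row concrete, here is a minimal sketch of sliding-window chunking. It is not QVAC's `ragChunk` implementation: the function name `chunkWithOverlap` is hypothetical, and tokens are approximated by whitespace-separated words, whereas a real tokenizer would count differently.

```typescript
// Hypothetical illustration only: NOT how ragChunk is implemented.
// Tokens are approximated by whitespace-separated words.
function chunkWithOverlap(text: string, windowSize = 220, overlap = 40): string[] {
  if (overlap >= windowSize) {
    throw new Error("overlap must be smaller than windowSize");
  }
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  // With the defaults, each window starts 180 tokens after the previous one.
  const step = windowSize - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + windowSize).join(" "));
    // Stop once a window has reached the end of the text.
    if (start + windowSize >= tokens.length) break;
  }
  return chunks;
}

// 500 synthetic tokens produce windows starting at tokens 0, 180 and 360.
const demoText = Array.from({ length: 500 }, (_, i) => `t${i}`).join(" ");
const demoChunks = chunkWithOverlap(demoText);
console.log(demoChunks.map((c) => c.split(" ").length));
```

The overlap means the last 40 tokens of one window reappear at the start of the next, so a sentence that straddles a boundary is still seen whole by at least one chunk.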

Model lifecycle

LocalLens loads its QVAC models lazily, on first use:

private async ensureReady(): Promise<void> {
  if (this.chatModelId && this.embeddingModelId) return;
  this.readyPromise ??= this.loadModels().finally(() => {
    this.readyPromise = undefined;
  });
  await this.readyPromise;
}

Two consequences worth knowing about:

The cold-start cost lands on the first question, not on boot. That keeps bun run dev snappy and gives you a clear "loading model…" moment to hang a UI hint on.
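Concurrent calls share one in-flight load. Every caller awaits the same readyPromise until it settles, so two requests arriving in the same tick won't both fire loadModel for the same source. The finally handler clears readyPromise afterwards, so a failed load can be retried on the next call.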
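
The shared in-flight load can be observed in isolation with a small stand-in. This is a sketch under stated assumptions: `LazyGateway`, its `loadCalls` counter, and the injected `loader` are hypothetical names used here in place of the real gateway and `loadModel`.

```typescript
// Hypothetical stand-in for the gateway; `loader` replaces real model loading.
class LazyGateway {
  loadCalls = 0;
  private modelId: string | undefined;
  private readyPromise: Promise<void> | undefined;
  private readonly loader: () => Promise<string>;

  constructor(loader: () => Promise<string>) {
    this.loader = loader;
  }

  private async loadModels(): Promise<void> {
    this.loadCalls += 1; // counted synchronously, before the first await
    this.modelId = await this.loader();
  }

  async ensureReady(): Promise<void> {
    if (this.modelId) return; // already loaded: nothing to do
    // Reuse the in-flight load if there is one, otherwise start it.
    this.readyPromise ??= this.loadModels().finally(() => {
      this.readyPromise = undefined;
    });
    await this.readyPromise;
  }
}

// Two calls in the same tick start only one load.
const gateway = new LazyGateway(() => Promise.resolve("fake-model"));
const bothReady = Promise.all([gateway.ensureReady(), gateway.ensureReady()]);
bothReady.then(() => console.log(`loads started: ${gateway.loadCalls}`));
```

Because the body of an async function runs synchronously up to its first await, the second call already sees the promise that the first call stored.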

Why a fallback model

The default chat model is QWEN3_1_7B_INST_Q4. On older or smaller machines it can fail to load. The gateway catches that and falls back to QWEN3_600M_INST_Q4:

try {
  this.chatModelId = await loadModel({ modelSrc: QWEN3_1_7B_INST_Q4, modelConfig });
} catch {
  this.chatModelId = await loadModel({ modelSrc: QWEN3_600M_INST_Q4, modelConfig });
}

The fallback is invisible to callers. QvacGateway.answer keeps the same streaming signature either way.
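
The same try-then-fall-back shape can be exercised without any real models. The sketch below is illustrative only: `loadWithFallback`, the `Loader` type, the stub loader, and the model names "big-model" and "small-model" are all hypothetical and not part of LocalLens or the QVAC SDK. Unlike the snippet above, it also logs the primary failure and reports which model was used, which can help when debugging on a constrained machine.

```typescript
// Hypothetical loader signature: takes a model source, resolves to a model id.
type Loader = (modelSrc: string) => Promise<string>;

async function loadWithFallback(
  load: Loader,
  primary: string,
  fallback: string,
): Promise<{ modelId: string; usedFallback: boolean }> {
  try {
    return { modelId: await load(primary), usedFallback: false };
  } catch (error) {
    // Record why the primary failed before trying the smaller model.
    console.warn(`primary model failed to load, trying fallback: ${String(error)}`);
    return { modelId: await load(fallback), usedFallback: true };
  }
}

// Stub loader simulating a machine too small for the primary model.
const stubLoad: Loader = async (modelSrc) => {
  if (modelSrc === "big-model") throw new Error("out of memory");
  return `loaded:${modelSrc}`;
};

const fallbackResult = loadWithFallback(stubLoad, "big-model", "small-model");
fallbackResult.then((r) => console.log(r.modelId, r.usedFallback));
```

If the fallback load also fails, the error propagates to the caller, which matches the behaviour of the original try/catch.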

Why one SDK for everything?

LocalLens could have used @xenova/transformers for embeddings, chromadb or qdrant for vectors, and llama.cpp for completion. Each is a fine pick on its own. The cost of using all three is the integration glue you'd have to write: model lifecycle, workspace lifecycle, error mapping, async iteration. QVAC ships those for you. That's most of why this app fits in eight files.

Useful upstream pages

QVAC RAG reference — ragChunk, ragIngest, ragSearch.
QVAC completion reference — streaming, KV cache, thinking capture.
QVAC model catalog — supported model IDs.
