3. The QVAC gateway
Load models, ingest chunks, search, stream completions — all behind one class.
src/qvac.ts is the single place in LocalLens that talks to the
QVAC SDK. Everything else goes through QvacGateway. That's how
the rest of the app stays SDK-free.
The shape of the gateway
import {
  close,
  completion,
  GTE_LARGE_FP16,
  loadModel,
  QWEN3_1_7B_INST_Q4,
  QWEN3_600M_INST_Q4,
  ragCloseWorkspace,
  ragDeleteWorkspace,
  ragIngest,
  ragSearch,
} from "@qvac/sdk";
import type { ChatMessage, SearchHit, TextChunk } from "./domain.ts";

const chatModelConfig = { ctx_size: 4096, temp: 0.2, top_p: 0.9 };

export class QvacGateway {
  private chatModelId: string | undefined;
  private embeddingModelId: string | undefined;
  private readyPromise: Promise<void> | undefined;
  // … methods below
}

Three private fields, three public methods, plus lifecycle helpers. That's the whole surface.
Lazy model loading
  private async ensureReady(): Promise<void> {
    if (this.chatModelId && this.embeddingModelId) return;
    this.readyPromise ??= this.loadModels().finally(() => {
      this.readyPromise = undefined;
    });
    await this.readyPromise;
  }

  private async loadModels(): Promise<void> {
    this.embeddingModelId ??= await loadModel({ modelSrc: GTE_LARGE_FP16 });
    if (this.chatModelId) return;
    try {
      this.chatModelId = await loadModel({
        modelSrc: QWEN3_1_7B_INST_Q4,
        modelConfig: chatModelConfig,
      });
    } catch {
      this.chatModelId = await loadModel({
        modelSrc: QWEN3_600M_INST_Q4,
        modelConfig: chatModelConfig,
      });
    }
  }

Two lazy properties:
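- embeddingModelId loads first and never falls back. GTE_LARGE_FP16 is small enough that the gateway assumes it will load.
- chatModelId tries QWEN3_1_7B_INST_Q4 first and falls back to QWEN3_600M_INST_Q4 on any load failure. If the fallback also fails, the error propagates to the caller.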
readyPromise makes concurrent calls share one in-flight load.
Two requests arriving in the same tick won't both call loadModel
for the same source.
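The sharing behaviour can be checked in isolation. The sketch below is a minimal reproduction of the pattern, not the gateway itself: fakeLoadModel, LazyModel, and demoSingleFlight are hypothetical names, and the fake loader only counts calls and waits 10 ms instead of touching the QVAC SDK.

```typescript
// Hypothetical stand-in for the SDK's loadModel: counts calls, resolves after a delay.
let loadCalls = 0;
async function fakeLoadModel(source: string): Promise<string> {
  loadCalls += 1;
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  return `${source}-id`;
}

class LazyModel {
  private modelId: string | undefined;
  private readyPromise: Promise<void> | undefined;

  async ensureReady(): Promise<string> {
    if (this.modelId === undefined) {
      // The first caller creates the promise; later callers reuse it while it is pending.
      this.readyPromise ??= this.load().finally(() => {
        // Clear the slot so a failed load can be retried by the next call.
        this.readyPromise = undefined;
      });
      await this.readyPromise;
    }
    if (this.modelId === undefined) throw new Error("model failed to load");
    return this.modelId;
  }

  private async load(): Promise<void> {
    this.modelId = await fakeLoadModel("demo-model");
  }
}

// Three concurrent callers should trigger exactly one load.
async function demoSingleFlight(): Promise<{ ids: string[]; calls: number }> {
  const model = new LazyModel();
  const ids = await Promise.all([
    model.ensureReady(),
    model.ensureReady(),
    model.ensureReady(),
  ]);
  return { ids, calls: loadCalls };
}
```

Resetting readyPromise in finally is what keeps a failed load from being cached: the next call starts a fresh attempt instead of awaiting a rejected promise forever.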
Ingesting chunks
  async ingestChunks(workspace: string, chunks: TextChunk[]): Promise<void> {
    await this.ensureReady();
    await this.closeWorkspace(workspace, true);
    if (chunks.length === 0) return;
    await ragIngest({
      modelId: required(this.embeddingModelId, "QVAC embedding model is not loaded."),
      workspace,
      documents: chunks.map((chunk) => formatChunkForRag(workspace, chunk)),
      chunk: false,
    });
  }

Two details that matter:
- closeWorkspace(workspace, true) is called before ingest. That clears any prior workspace with the same name, so re-ingesting is destructive and idempotent.
- chunk: false tells QVAC the documents are already chunked. The splitting happened in rag.ts. Doing it twice would be wrong.
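The required helper used in the snippet above is not shown in this section. A minimal sketch of what such a guard could look like, assuming it does nothing more than throw when the value is undefined and narrow the type otherwise:

```typescript
// Hypothetical implementation: the source calls `required` but does not show its body.
function required<T>(value: T | undefined, message: string): T {
  if (value === undefined) throw new Error(message);
  return value;
}

// The return type is `string`, not `string | undefined`, so callers need no extra check.
const modelId: string = required<string>("embedding-model-id", "QVAC embedding model is not loaded.");
```

A guard like this turns an impossible-after-ensureReady state into a readable error instead of passing undefined into the SDK.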
The chunks get reformatted into a small text envelope:
source:<relativePath>
chunk:<chunkIndex>
id:<workspace>:<chunk.id>
<chunk content>

…so that ragSearch returns them with the metadata still attached.
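formatChunkForRag itself is not listed here. The following is a sketch of one way to produce the envelope above, not the actual helper. The TextChunk fields (id, relativePath, chunkIndex, content) are inferred from the envelope, and the blank line between the id: header and the content is an assumption chosen to match the header pattern that parseRagHit strips in the Searching section. That ragIngest accepts plain strings as documents is also an assumption.

```typescript
// Hypothetical shape, inferred from the envelope; the real type lives in domain.ts.
interface TextChunk {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
}

// Prefix the chunk text with three header lines and a blank separator line.
function formatChunkForRag(workspace: string, chunk: TextChunk): string {
  return [
    `source:${chunk.relativePath}`,
    `chunk:${chunk.chunkIndex}`,
    `id:${workspace}:${chunk.id}`,
    "",
    chunk.content,
  ].join("\n");
}

const envelope = formatChunkForRag("notes", {
  id: "c1",
  relativePath: "docs/a.md",
  chunkIndex: 2,
  content: "Hello world",
});
// envelope === "source:docs/a.md\nchunk:2\nid:notes:c1\n\nHello world"
```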
Searching
  async search(workspace: string, question: string, topK = 5): Promise<SearchHit[]> {
    await this.ensureReady();
    const results = await ragSearch({
      modelId: required(this.embeddingModelId, "QVAC embedding model is not loaded."),
      query: question,
      topK,
      n: 3,
      workspace,
    });
    return results.map(parseRagHit);
  }

parseRagHit extracts the source: and chunk: headers we wrote
during ingest and strips them from the chunk content:
function parseRagHit(hit: { id: string; content: string; score: number }): SearchHit {
  return {
    id: hit.id,
    relativePath: /^source:(.+)$/m.exec(hit.content)?.[1] ?? "unknown",
    chunkIndex: Number(/^chunk:(\d+)$/m.exec(hit.content)?.[1] ?? 0),
    content: hit.content.replace(/^(?:workspace:.+\n)?source:.+\nchunk:\d+\nid:.+\n\n/m, "").trim(),
    score: hit.score,
  };
}

That's where strings come back as structured SearchHit objects.
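A worked example makes the parsing concrete. The snippet repeats parseRagHit so it runs on its own. The SearchHit interface is a hypothetical reconstruction from the fields the function returns, and the sample hits are invented inputs. The first sample assumes the envelope has a blank line before the content, which is what the stripping pattern expects.

```typescript
// Hypothetical shape, reconstructed from parseRagHit's return value.
interface SearchHit {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score: number;
}

function parseRagHit(hit: { id: string; content: string; score: number }): SearchHit {
  return {
    id: hit.id,
    relativePath: /^source:(.+)$/m.exec(hit.content)?.[1] ?? "unknown",
    chunkIndex: Number(/^chunk:(\d+)$/m.exec(hit.content)?.[1] ?? 0),
    content: hit.content.replace(/^(?:workspace:.+\n)?source:.+\nchunk:\d+\nid:.+\n\n/m, "").trim(),
    score: hit.score,
  };
}

// A hit whose content still carries the ingest envelope.
const wrapped = parseRagHit({
  id: "notes:c1",
  content: "source:docs/a.md\nchunk:2\nid:notes:c1\n\nHello world",
  score: 0.9,
});
// wrapped.relativePath === "docs/a.md", wrapped.chunkIndex === 2, wrapped.content === "Hello world"

// A hit with no headers falls back to the defaults and keeps its text.
const bare = parseRagHit({ id: "x", content: "plain text", score: 0.1 });
// bare.relativePath === "unknown", bare.chunkIndex === 0, bare.content === "plain text"
```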
Streaming completion
  async *answer(history: ChatMessage[]): AsyncGenerator<string> {
    await this.ensureReady();
    const run = completion({
      modelId: required(this.chatModelId, "QVAC chat model is not loaded."),
      history,
      stream: true,
      captureThinking: true,
      kvCache: true,
    });
    for await (const event of run.events) {
      if (event.type === "contentDelta") yield event.text;
    }
    await run.final;
  }

The gateway exposes inference as an AsyncGenerator<string>, so
callers can stream directly to a console, an HTTP response, or a
UI.
- captureThinking: true keeps any reasoning tokens out of the user-visible output stream.
- kvCache: true lets QVAC reuse prefix attention state across follow-up questions.
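On the caller's side, consuming the generator is a plain for await loop. The sketch below assumes nothing about QVAC: fakeAnswer is a hypothetical stand-in for QvacGateway.answer() that yields a fixed token sequence, and streamTo is an illustrative helper that forwards each delta to any sink (console, HTTP response, UI state) while accumulating the full text.

```typescript
// Hypothetical stand-in for QvacGateway.answer(): yields a fixed token sequence.
async function* fakeAnswer(): AsyncGenerator<string> {
  for (const token of ["Local", "Lens ", "is ", "ready."]) yield token;
}

// Forward each delta to a sink as it arrives and return the concatenated text.
async function streamTo(
  source: AsyncIterable<string>,
  write: (text: string) => void,
): Promise<string> {
  let full = "";
  for await (const delta of source) {
    write(delta);
    full += delta;
  }
  return full;
}

// Example: collect the deltas in an array instead of printing them.
async function demoStream(): Promise<{ full: string; parts: string[] }> {
  const parts: string[] = [];
  const full = await streamTo(fakeAnswer(), (text) => parts.push(text));
  return { full, parts };
}
```

Because streamTo only needs an AsyncIterable<string>, the same function works against the real gateway and against a fake in tests.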
Lifecycle
  async closeWorkspace(workspace: string, deleteOnClose = false): Promise<void> {
    await ragCloseWorkspace({ workspace, deleteOnClose }).catch(async () => {
      if (deleteOnClose) await ragDeleteWorkspace({ workspace }).catch(() => undefined);
    });
  }

  async close(): Promise<void> {
    await close();
  }

closeWorkspace is forgiving. If QVAC says the workspace doesn't
exist, that's fine: the caller wanted it gone anyway. Note that the
catch swallows every error from ragCloseWorkspace, not only "not
found", and with deleteOnClose set it makes one best-effort
ragDeleteWorkspace call whose failure is ignored as well. close()
tears down the whole QVAC runtime and is called from
LocalLensApp.close() on shutdown.
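The error handling in closeWorkspace is easy to misread, so here is the same control flow with the SDK calls replaced by injectable functions. WorkspaceOps and closeForgiving are hypothetical names used only for this sketch. The point it illustrates is that the promise always resolves, and that the delete fallback runs only when deleteOnClose is set.

```typescript
// Hypothetical seam standing in for ragCloseWorkspace and ragDeleteWorkspace.
interface WorkspaceOps {
  close(workspace: string, deleteOnClose: boolean): Promise<void>;
  remove(workspace: string): Promise<void>;
}

async function closeForgiving(
  ops: WorkspaceOps,
  workspace: string,
  deleteOnClose = false,
): Promise<void> {
  await ops.close(workspace, deleteOnClose).catch(async () => {
    // Close failed: try a direct delete if deletion was requested, and ignore its failure too.
    if (deleteOnClose) await ops.remove(workspace).catch(() => undefined);
  });
}

// Both operations fail, yet closeForgiving still resolves.
async function demoCloseForgiving(): Promise<{ withDelete: string[]; withoutDelete: string[] }> {
  const calls: string[] = [];
  const failing: WorkspaceOps = {
    close: async () => {
      calls.push("close");
      throw new Error("workspace not found");
    },
    remove: async () => {
      calls.push("remove");
      throw new Error("delete failed");
    },
  };
  await closeForgiving(failing, "notes", true);
  const withDelete = [...calls];
  calls.length = 0;
  await closeForgiving(failing, "notes");
  return { withDelete, withoutDelete: [...calls] };
}
```

The trade-off of this design is that real failures (for example, a permissions error) are hidden along with the harmless "not found" case.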
Why a class and not a module?
The gateway holds state — model IDs, an in-flight load promise. A module of free functions would push that state onto every caller, or hide it in module-level mutables. A small class is the cheapest way to encapsulate it.
Next: the JSON store, where brains and chunks get persisted.