Request flow

The question flow is LocalLens's hot path. Every chat round-trip crosses four actors in a fixed order: the caller, the workflow class, the QVAC gateway, and the prompt builder.

The flow has two halves:

Retrieval (RAG): embed the question with GTE_LARGE_FP16, run ragSearch against the workspace, return the top-K chunks. This side never writes a single word of the answer.
Generation (LLM): hand the retrieved chunks plus the question to QWEN3_1_7B_INST_Q4 (or its QWEN3_600M_INST_Q4 fallback) via completion(). This is the chat model that actually composes the answer, streaming it token by token.

Step by step

1. Embed and search (RAG)

LocalLensApp.askBrain(id, question) (src/locallens.ts) calls QvacGateway.search(workspace, question, 5). The gateway embeds the question with GTE_LARGE_FP16 and runs ragSearch against the brain's QVAC workspace. The result is a list of SearchHit objects:

type SearchHit = {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score?: number;
};

Top-K is fixed at 5. That number trades prompt size against answer quality: more chunks give the model more evidence but consume more of its context window. It has worked well on small brains. If you change it, watch the chat model's context window, which is 4096 tokens by default.
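To make the trade-off concrete, a rough budget check might look like the sketch below. The ratio of about 4 characters per token is a common rule of thumb, not the behavior of the Qwen3 tokenizer. The names estimateTokens and fitsContext and the 512-token reserve for the answer are hypothetical, chosen for illustration rather than taken from LocalLens.

```typescript
// Hypothetical helper: rough token estimate at ~4 characters per token.
// A real check would use the chat model's own tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns true when the excerpts plus the question leave room for the answer.
// contextWindow defaults to the 4096 tokens mentioned above; answerReserve
// is an assumed allowance for the generated reply.
function fitsContext(
  excerpts: string[],
  question: string,
  contextWindow = 4096,
  answerReserve = 512,
): boolean {
  const promptTokens = excerpts.reduce(
    (sum, excerpt) => sum + estimateTokens(excerpt),
    estimateTokens(question),
  );
  return promptTokens + answerReserve <= contextWindow;
}
```

A check like this is only a guard against gross overflows when raising top-K; the system message and the numbering overhead also consume tokens and are not counted here.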

2. Build the grounded prompt (the seam)

buildGroundedHistory(question, hits) in src/rag.ts returns a two-message ChatMessage[]. This is the bridge between the RAG side (which produced the hits) and the LLM (which is about to read them):

a system message with the rules: only use facts from the excerpts, cite them with brackets, reply in the user's language, no hidden chain-of-thought;
a user message that lays out the source excerpts in a numbered block, followed by the question.

The numbered excerpts are what the model echoes back as [1] and [2] when it cites. The LLM never sees the embeddings or the search scores — only this packaged prompt.
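A minimal sketch of such a prompt builder follows. It is not the code in src/rag.ts: the ChatMessage shape, the function name buildGroundedHistorySketch, and the exact wording of both messages are assumptions made for illustration. Only the two-message structure and the numbered excerpts come from the description above.

```typescript
// Assumed shapes; the real definitions live in the LocalLens source.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type SearchHit = {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score?: number;
};

// Illustrative stand-in for buildGroundedHistory; the prompt wording is invented.
function buildGroundedHistorySketch(question: string, hits: SearchHit[]): ChatMessage[] {
  const system = [
    "Answer using only facts from the numbered excerpts.",
    "Cite excerpts with bracketed numbers such as [1].",
    "Reply in the language of the question.",
    "Do not include hidden chain-of-thought.",
  ].join(" ");

  // Number excerpts from 1 so a citation [n] maps back to hits[n - 1].
  // Scores and embeddings are deliberately left out of the prompt.
  const excerpts = hits
    .map((hit, i) => `[${i + 1}] ${hit.relativePath}\n${hit.content}`)
    .join("\n\n");

  return [
    { role: "system", content: system },
    { role: "user", content: `Source excerpts:\n\n${excerpts}\n\nQuestion: ${question}` },
  ];
}
```

Because the builder is a pure function of the question and the hits, it can be unit-tested without loading either model.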

3. Stream the LLM completion

QvacGateway.answer(history) is where the chat LLM does its work. It calls QVAC's completion() with stream: true on QWEN3_1_7B_INST_Q4 (the default 1.7B-parameter, Q4-quantized Qwen3 instruct model) and returns an AsyncGenerator<string> that yields each contentDelta as it arrives. askBrain accumulates those deltas into the final answer string.
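The accumulation step can be sketched as below. Here fakeAnswer is a hypothetical stand-in for the gateway's generator and collectAnswer is an illustrative name; neither is taken from the LocalLens source, and the real generator is backed by QVAC rather than a fixed array.

```typescript
// Stand-in for QvacGateway.answer: yields text deltas one at a time.
async function* fakeAnswer(deltas: string[]): AsyncGenerator<string> {
  for (const delta of deltas) {
    yield delta;
  }
}

// Drains the stream into the final answer string, optionally forwarding each
// delta to a callback (for example, to print tokens as they arrive).
async function collectAnswer(
  stream: AsyncIterable<string>,
  onDelta?: (delta: string) => void,
): Promise<string> {
  let answer = "";
  for await (const delta of stream) {
    onDelta?.(delta);
    answer += delta;
  }
  return answer;
}
```

Accepting any AsyncIterable<string> keeps the consumer independent of which model produced the stream.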

On machines that can't load the 1.7B model, the gateway transparently falls back to QWEN3_600M_INST_Q4 the first time models load (see QvacGateway.loadModels). The streaming generator has the same shape either way, so nothing upstream notices the swap.
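One way to express this kind of fallback is a loop over candidate models that keeps the first one to load. The sketch below assumes a hypothetical load function and the name loadFirstAvailable; it does not reproduce QvacGateway.loadModels, whose internals are not shown on this page.

```typescript
// Hypothetical loader signature: resolves to a model handle or rejects.
type Loader<M> = (modelName: string) => Promise<M>;

// Tries each candidate in order and returns the first model that loads.
async function loadFirstAvailable<M>(
  candidates: string[],
  load: Loader<M>,
): Promise<{ name: string; model: M }> {
  let lastError: unknown;
  for (const name of candidates) {
    try {
      return { name, model: await load(name) };
    } catch (error) {
      lastError = error; // remember why this candidate failed, then try the next
    }
  }
  throw new Error(`No chat model could be loaded: ${String(lastError)}`);
}
```

With the two model names from this page, the call would be loadFirstAvailable(["QWEN3_1_7B_INST_Q4", "QWEN3_600M_INST_Q4"], load), where load is whatever function actually loads a model.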

captureThinking: true keeps any internal reasoning tokens out of the visible output, and kvCache: true lets the runtime reuse prefix attention state for follow-ups.

This is the step where the actual answer gets written. Without it you'd have ranked chunks but no prose.

4. Return citations

The hits used to build the prompt go straight back to the caller as ChatAnswer.citations:

type ChatAnswer = {
  answer: string;
  citations: { id: string; relativePath: string; chunkIndex: number; score?: number }[];
};

That's what the CLI prints and what the UI renders under the answer. The answer is the LLM's output; the citations are the RAG side's evidence list. Both pieces ship together so a reader can verify the claim against the source.
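Deriving the citation list from the hits amounts to dropping the chunk text, as in the sketch below. The Citation alias and the toCitations name are hypothetical; the field names come from the SearchHit and ChatAnswer types above.

```typescript
type SearchHit = {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score?: number;
};

// Hypothetical alias: a citation is a hit without its chunk text.
type Citation = Omit<SearchHit, "content">;

// Copies the identifying fields of each hit and leaves the content behind,
// preserving the order used to number the excerpts in the prompt.
function toCitations(hits: SearchHit[]): Citation[] {
  return hits.map((hit) => {
    const citation: Citation = {
      id: hit.id,
      relativePath: hit.relativePath,
      chunkIndex: hit.chunkIndex,
    };
    if (hit.score !== undefined) {
      citation.score = hit.score;
    }
    return citation;
  });
}
```

Keeping the order intact matters: citation n in the list should correspond to the bracketed [n] the model was shown.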

What is not in this flow

No re-ingestion. Asking a question never touches the file system.
No store mutation. The JSON store is read-only on the question path.
No second LLM call. There is one completion request per question. The embedding model fires once per question (for the search step), and the chat LLM fires once per question (for the answer).

Keeping the read path this lean is why follow-up questions feel snappy, even on slim hardware. Two model calls, no disk writes, and the LLM never sees more than the top-K chunks plus the question.
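Put together, the read path can be sketched end to end as below. The Gateway interface, the askSketch name, and passing the workspace and prompt builder as arguments are assumptions made to keep the example self-contained; the real askBrain takes a brain id and resolves the workspace itself.

```typescript
type SearchHit = {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score?: number;
};
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type ChatAnswer = { answer: string; citations: Omit<SearchHit, "content">[] };

// Assumed gateway surface, reduced to the two calls the question path makes.
interface Gateway {
  search(workspace: string, question: string, topK: number): Promise<SearchHit[]>;
  answer(history: ChatMessage[]): AsyncIterable<string>;
}

async function askSketch(
  gateway: Gateway,
  workspace: string,
  question: string,
  buildHistory: (question: string, hits: SearchHit[]) => ChatMessage[],
): Promise<ChatAnswer> {
  // Model call 1: embed the question and search the workspace.
  const hits = await gateway.search(workspace, question, 5);

  // Pure prompt assembly: no model call, no I/O.
  const history = buildHistory(question, hits);

  // Model call 2: one streamed completion, accumulated into a string.
  let answer = "";
  for await (const delta of gateway.answer(history)) {
    answer += delta;
  }

  // The same hits, minus their text, travel back as the evidence list.
  const citations = hits.map(({ id, relativePath, chunkIndex, score }) => ({
    id,
    relativePath,
    chunkIndex,
    score,
  }));
  return { answer, citations };
}
```

Nothing in the function writes to disk or to the store, which mirrors the guarantees listed above.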
