Request flow
A single question, traced end to end through retrieval and the LLM.
The question flow is LocalLens's hot path. Every chat round-trip crosses four actors in a fixed order: the caller, the workflow class, the QVAC gateway, and the prompt builder.
The flow has two halves:
- Retrieval (RAG): embed the question with GTE_LARGE_FP16, run ragSearch against the workspace, return the top-K chunks. This side never writes a single word of the answer.
- Generation (LLM): hand the retrieved chunks plus the question to QWEN3_1_7B_INST_Q4 (or its QWEN3_600M_INST_Q4 fallback) via completion(). This is the chat model that actually composes the answer, streaming it token by token.
Step by step
1. Embed and search (RAG)
LocalLensApp.askBrain(id, question) (src/locallens.ts)
calls QvacGateway.search(workspace, question, 5). The gateway
embeds the question with GTE_LARGE_FP16 and runs ragSearch
against the brain's QVAC workspace. The result is a list of
SearchHit objects:
type SearchHit = {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score?: number;
};
Top-K is fixed at 5. That number trades prompt size against answer quality and has worked well on small brains. If you change it, watch the chat model's context window, which is 4096 tokens by default.
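To make the top-K versus context-window trade-off concrete, here is a rough budget check. It is a sketch and not part of LocalLens: the `estimateTokens` and `fitsContext` helpers, the four-characters-per-token heuristic, and the `reservedForAnswer` parameter are all assumptions for illustration. A real check would count tokens with the chat model's own tokenizer.

```typescript
type SearchHit = {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score?: number;
};

// Assumption: roughly 4 characters per token. This is a crude heuristic,
// not the Qwen3 tokenizer; it only flags obviously oversized prompts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical helper: do the question and all hits fit in the context
// window while leaving room for the generated answer?
function fitsContext(
  question: string,
  hits: SearchHit[],
  contextWindow = 4096,
  reservedForAnswer = 1024,
): boolean {
  const promptTokens =
    estimateTokens(question) +
    hits.reduce((sum, hit) => sum + estimateTokens(hit.content), 0);
  return promptTokens + reservedForAnswer <= contextWindow;
}
```

Under these assumptions, raising top-K is only safe while `fitsContext` stays true for typical chunk sizes in the workspace.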
2. Build the grounded prompt (the seam)
buildGroundedHistory(question, hits) in src/rag.ts returns a
two-message ChatMessage[]. This is the bridge between the RAG
side (which produced the hits) and the LLM (which is about to read
them):
- a system message with the rules: only use facts from the excerpts, cite them with brackets, reply in the user's language, no hidden chain-of-thought;
- a user message that lays out the source excerpts in a numbered block, followed by the question.
The numbered excerpts are what the model echoes back as [1] and
[2] when it cites. The LLM never sees the embeddings or the
search scores — only this packaged prompt.
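The shape of that packaged prompt can be sketched as follows. This is not the code in src/rag.ts: the function name `buildGroundedHistorySketch`, the `role`/`content` shape of `ChatMessage`, and the exact wording of the rules and excerpt block are assumptions made for illustration.

```typescript
type SearchHit = {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score?: number;
};

// Assumed message shape; the real ChatMessage type may differ.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

function buildGroundedHistorySketch(
  question: string,
  hits: SearchHit[],
): ChatMessage[] {
  // System message: the grounding rules described above (wording is illustrative).
  const system = [
    "Answer using only facts from the numbered excerpts.",
    "Cite excerpts with brackets, for example [1].",
    "Reply in the language of the user's question.",
    "Do not include hidden chain-of-thought.",
  ].join("\n");

  // User message: numbered excerpts first, then the question.
  // Scores and embeddings are deliberately left out.
  const excerpts = hits
    .map((hit, i) => `[${i + 1}] ${hit.relativePath}\n${hit.content}`)
    .join("\n\n");

  return [
    { role: "system", content: system },
    {
      role: "user",
      content: `Source excerpts:\n${excerpts}\n\nQuestion: ${question}`,
    },
  ];
}
```

Because the numbering is derived from the position of each hit, a `[2]` in the answer maps back to the second hit in the list.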
3. Stream the LLM completion
QvacGateway.answer(history) is where the chat LLM does its
work. It calls QVAC's completion() with stream: true on
QWEN3_1_7B_INST_Q4 (the default 1.7B-parameter Q4-quantized
Qwen3 instruct model), and yields each contentDelta event as an
AsyncGenerator<string>. askBrain accumulates those tokens into
the final string.
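The accumulation step can be sketched with a stand-in generator. Nothing here calls QVAC: `fakeAnswer` is a hypothetical substitute for QvacGateway.answer, and `collectAnswer` with its optional `onToken` callback is an illustrative helper, not the code in askBrain.

```typescript
// Stand-in for QvacGateway.answer(history): yields text deltas one at a time.
async function* fakeAnswer(deltas: string[]): AsyncGenerator<string> {
  for (const delta of deltas) {
    yield delta;
  }
}

// Accumulate streamed deltas into the final answer string, optionally
// forwarding each delta to a callback (for example, to render it live).
async function collectAnswer(
  stream: AsyncGenerator<string>,
  onToken?: (delta: string) => void,
): Promise<string> {
  let answer = "";
  for await (const delta of stream) {
    onToken?.(delta);
    answer += delta;
  }
  return answer;
}
```

A caller that wants live output passes an `onToken` callback; a caller that only needs the final string omits it.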
On machines that can't load the 1.7B model, the gateway transparently
falls back to QWEN3_600M_INST_Q4 the first time models load
(see QvacGateway.loadModels).
The streaming generator is the same shape either way, so nothing
upstream notices the swap.
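A fallback of this kind can be sketched as a try-then-retry around a loader. This is not the body of QvacGateway.loadModels: the `LoadModel` signature and the `loadWithFallback` helper are hypothetical, and the model names are passed as plain strings for illustration.

```typescript
// Hypothetical loader: resolves to a model handle, or rejects if the
// model cannot be loaded on this machine.
type LoadModel<H> = (modelName: string) => Promise<H>;

async function loadWithFallback<H>(
  load: LoadModel<H>,
  preferred: string,
  fallback: string,
): Promise<{ model: string; handle: H }> {
  try {
    return { model: preferred, handle: await load(preferred) };
  } catch {
    // The preferred model failed to load (for example, not enough memory),
    // so try the smaller one. If this also fails, the error propagates.
    return { model: fallback, handle: await load(fallback) };
  }
}
```

Since both outcomes return the same `{ model, handle }` shape, code that consumes the handle does not need to know which model was loaded.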
captureThinking: true keeps any internal reasoning tokens out of
the visible output, and kvCache: true lets the runtime reuse
prefix attention state for follow-ups.
This is the step where the actual answer gets written. Without it you'd have ranked chunks but no prose.
4. Return citations
The hits used to build the prompt go straight back to the caller as
ChatAnswer.citations:
type ChatAnswer = {
  answer: string;
  citations: { id: string; relativePath: string; chunkIndex: number; score?: number }[];
};
That's what the CLI prints and what the UI renders under the answer. The answer is the LLM's output; the citations are the RAG side's evidence list. Both pieces ship together so a reader can verify the claim against the source.
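Turning hits into citations amounts to dropping the chunk text and keeping the identifying fields. A minimal sketch, assuming the citation fields carry the same types as the matching `SearchHit` fields; the `Citation` alias and the `toChatAnswer` helper are hypothetical names, not LocalLens exports.

```typescript
type SearchHit = {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score?: number;
};

// Assumption: a citation is a hit without its chunk text.
type Citation = Omit<SearchHit, "content">;

type ChatAnswer = {
  answer: string;
  citations: Citation[];
};

function toChatAnswer(answer: string, hits: SearchHit[]): ChatAnswer {
  // Strip `content` from each hit; order is preserved so that citation
  // N still corresponds to excerpt [N] in the prompt.
  const citations = hits.map(({ content: _content, ...rest }) => rest);
  return { answer, citations };
}
```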
What is not in this flow
- No re-ingestion. Asking a question never touches the file system.
- No store mutation. The JSON store is read-only on the question path.
- No second LLM call. One completion request per question. The embedding model fires once per question (for the search step), and the chat LLM fires once per question (for the answer).
Keeping the read path this lean is why follow-up questions feel snappy, even on slim hardware. Two model calls, no disk writes, and the LLM never sees more than the top-K chunks plus the question.
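The whole read path, and the "two model calls" property, can be sketched against a counting fake. Everything here is illustrative: `GatewayLike` is a hypothetical interface modeled on the search and answer calls described above, `askSketch` is not the real askBrain, and the prompt wording is simplified.

```typescript
type SearchHit = {
  id: string;
  relativePath: string;
  chunkIndex: number;
  content: string;
  score?: number;
};
type ChatMessage = { role: "system" | "user"; content: string };

// Hypothetical gateway surface: one search call, one streamed answer call.
interface GatewayLike {
  search(workspace: string, question: string, topK: number): Promise<SearchHit[]>;
  answer(history: ChatMessage[]): AsyncGenerator<string>;
}

async function askSketch(gateway: GatewayLike, workspace: string, question: string) {
  // 1. Retrieval: a single embed-and-search call.
  const hits = await gateway.search(workspace, question, 5);
  // 2. Seam: package the hits and the question into a prompt (simplified).
  const excerpts = hits.map((h, i) => `[${i + 1}] ${h.content}`).join("\n");
  const history: ChatMessage[] = [
    { role: "system", content: "Use only the excerpts and cite them with brackets." },
    { role: "user", content: `${excerpts}\n\nQuestion: ${question}` },
  ];
  // 3. Generation: a single streamed completion, accumulated into a string.
  let answer = "";
  for await (const delta of gateway.answer(history)) answer += delta;
  // 4. Citations: the same hits, minus their chunk text.
  const citations = hits.map(({ content: _content, ...rest }) => rest);
  return { answer, citations };
}

// Counting fake used to check how often each side of the flow is invoked.
function makeCountingGateway() {
  const calls = { search: 0, answer: 0 };
  const gateway: GatewayLike = {
    async search() {
      calls.search += 1;
      return [
        { id: "h1", relativePath: "notes/a.md", chunkIndex: 0, content: "Cats purr.", score: 0.9 },
      ];
    },
    async *answer() {
      calls.answer += 1;
      yield "Cats purr ";
      yield "[1].";
    },
  };
  return { gateway, calls };
}
```

Running `askSketch` once against the fake leaves both counters at 1, which is the property the list above describes: one retrieval call and one completion per question.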