RAG vs LLM

The R and the G in RAG name two different jobs. Retrieval finds evidence. Generation writes the answer. In LocalLens those two jobs are split across two separate models, in separate modules, on purpose.

Side	Model	What it does
Retrieval (RAG)	`GTE_LARGE_FP16`	Embeds documents and the question. Returns the top-K closest chunks.
Generation (LLM)	`QWEN3_1_7B_INST_Q4` (fallback `QWEN3_600M_INST_Q4`)	Reads the question + retrieved chunks. Writes the answer with citations.

The RAG side is mechanical: vector similarity over the workspace, with no reasoning and no text generation. The LLM side is where the reasoning and the prose happen. Retrieved chunks are evidence; the answer is what the LLM does with that evidence.

Retrieval finds evidence

Retrieval is the QVAC RAG layer's job, driven by the embedding model:

ragChunk splits a document into overlapping token windows.
ragIngest writes those chunks into a workspace, embedding each one with GTE_LARGE_FP16.
ragSearch embeds a question with the same model and returns the top-K nearest chunks.
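To make "overlapping token windows" concrete, here is a minimal sketch of the idea behind the chunking step. It is not the ragChunk implementation: chunkByWindow, windowSize and overlap are hypothetical names, and whitespace-separated words stand in for real tokenizer output.

```typescript
// Hypothetical stand-in for the chunking step. Real chunkers count model
// tokens; this sketch counts whitespace-separated words instead.
function chunkByWindow(text: string, windowSize: number, overlap: number): string[] {
  if (windowSize <= 0 || overlap < 0 || overlap >= windowSize) {
    throw new Error("need windowSize > 0 and 0 <= overlap < windowSize");
  }
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const step = windowSize - overlap; // how far each new window advances
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + windowSize).join(" "));
    // Stop once a window has reached the end, so no chunk is a pure repeat.
    if (start + windowSize >= tokens.length) break;
  }
  return chunks;
}

const exampleChunks = chunkByWindow("a b c d e f g", 4, 2);
// exampleChunks is ["a b c d", "c d e f", "e f g"]
```

Each chunk repeats the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.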

Retrieval is not an answer. It's a list of excerpts that might contain the answer, ranked by semantic similarity to the question. An empty or weak result is still useful: the LLM can say that nothing matched instead of inventing something confident-sounding.

The embedding model never reads chat history, never produces text, and never sees the final answer. Its only output is vectors.
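Since the embedding model's only output is vectors, the ranking step reduces to arithmetic over those vectors. The sketch below assumes cosine similarity as the distance measure; the page only says "closest", so that choice, along with the names Embedded, cosine and topK, is an assumption and not the ragSearch implementation.

```typescript
// Hypothetical shape for a stored chunk: an id plus its embedding vector.
type Embedded = { id: string; vector: number[] };

// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query vector and keep the k best.
function topK(query: number[], chunks: Embedded[], k: number): { id: string; score: number }[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Nothing here reads or writes text. The question and the chunks only meet as numbers, which is why this side can be called mechanical.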

Generation writes the answer

Generation is the chat LLM's job. In LocalLens that's QWEN3_1_7B_INST_Q4 by default, with QWEN3_600M_INST_Q4 as a fallback when the 1.7B fails to load.
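The fallback described above can be sketched as a try-then-retry around whatever function loads a model. This is an illustration only: loadChatModel and the load callback are hypothetical, and the model names are passed as plain strings here, whereas the real project may use constants.

```typescript
// Try the primary model first; if loading throws, load the fallback instead.
// `load` is a hypothetical loader supplied by the caller.
async function loadChatModel<M>(
  load: (modelId: string) => Promise<M>,
  primary: string,
  fallback: string,
): Promise<{ model: M; id: string }> {
  try {
    return { model: await load(primary), id: primary };
  } catch {
    // The primary failed to load (for example, not enough memory).
    // An error from the fallback is not caught and reaches the caller.
    return { model: await load(fallback), id: fallback };
  }
}
```

Returning the id alongside the model lets the caller show which model actually answered.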

completion() takes the ChatMessage[] history that the prompt builder produced and streams text back as contentDelta tokens.
The system prompt requires the model to answer from the retrieved excerpts and cite them with brackets.
The model never hits disk and never embeds anything. It only reads its context window — system rules, numbered excerpts, the question.
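Consuming a stream of contentDelta tokens might look like the sketch below. The real completion() signature is not shown on this page, so fakeCompletion is a hypothetical stand-in that yields a canned reply, and the ChatMessage and StreamEvent shapes are assumptions.

```typescript
// Assumed shapes; the real types may carry more fields.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type StreamEvent = { contentDelta: string };

// Hypothetical stand-in for completion(): streams a canned reply word by word.
async function* fakeCompletion(history: ChatMessage[]): AsyncGenerator<StreamEvent> {
  const reply = `Answering ${history.length} message(s) [1].`;
  for (const piece of reply.split(" ")) {
    yield { contentDelta: piece + " " };
  }
}

// Append each delta as it arrives; optionally forward it to a UI callback.
async function collectAnswer(
  stream: AsyncIterable<StreamEvent>,
  onDelta?: (delta: string) => void,
): Promise<string> {
  let text = "";
  for await (const event of stream) {
    text += event.contentDelta;
    if (onDelta) onDelta(event.contentDelta);
  }
  return text.trim();
}
```

The collector only sees text deltas. It has no access to the workspace or the embedding model, which mirrors the rule that the LLM only reads its context window.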

This is the step where the actual answer exists. Everything before it has only produced evidence and a packaged prompt. Skip it, and you have a search engine, not a chat.

The split is what keeps the prompt builder small and the QVAC gateway clean. The prompt builder knows nothing about embeddings or generation. The gateway holds both models but exposes them through two unrelated methods (search and answer).
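One way to picture that split is a gateway interface with two methods that never reference each other, and a pure prompt builder passed between them. The names Gateway, Hit, PromptBuilder and ask are hypothetical and the signatures are assumptions; only the search and answer method names come from the text above.

```typescript
type Hit = { text: string; score: number };
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Two unrelated methods: one backed by the embedding model, one by the chat LLM.
interface Gateway {
  search(question: string, topK: number): Promise<Hit[]>;
  answer(history: ChatMessage[]): Promise<string>;
}

// The prompt builder is a plain function: no models, no I/O.
type PromptBuilder = (question: string, hits: Hit[]) => ChatMessage[];

// Retrieval, then packaging, then generation. Each step only sees the
// previous step's output.
async function ask(
  gateway: Gateway,
  build: PromptBuilder,
  question: string,
  topK = 3,
): Promise<string> {
  const hits = await gateway.search(question, topK);
  const history = build(question, hits);
  return gateway.answer(history);
}
```

Because the builder takes plain data in and returns plain data out, it can be unit-tested without loading either model.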

The seam between them

buildGroundedHistory(question, hits) is the seam. It takes a question and a list of hits and produces a prompt where:

excerpts are numbered ([1], [2], …) so the LLM can cite them;
the system rules are explicit and short;
there's exactly one user turn. No conversational history is replayed, because every question is treated as a fresh retrieval round.

If search returns nothing useful, the prompt still goes through the LLM, but with "No matching chunks were found." in place of the excerpts. The system prompt's first rule then takes over and the LLM says so plainly:

Only use facts that appear in the excerpts. If the answer is not in them, say so plainly.

That's the entire anti-hallucination strategy. Two models, one prompt rule, and it works.
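A minimal sketch of what such a prompt builder could look like follows. It is not the LocalLens source: the function is named buildGroundedHistorySketch to make that clear, the Hit shape is assumed, the citation rule's wording is invented for illustration, and placing the excerpts in the user message instead of the system message is also an assumption. Only the first rule and the "No matching chunks were found." placeholder are quoted from this page.

```typescript
type Hit = { text: string };
type ChatMessage = { role: "system" | "user"; content: string };

const SYSTEM_RULES = [
  // Quoted from the page.
  "Only use facts that appear in the excerpts. If the answer is not in them, say so plainly.",
  // Assumed wording for the citation rule.
  "Cite excerpts with their bracketed numbers, for example [1].",
].join("\n");

function buildGroundedHistorySketch(question: string, hits: Hit[]): ChatMessage[] {
  // Number the excerpts from 1 so the model can cite them as [1], [2], ...
  const excerpts =
    hits.length === 0
      ? "No matching chunks were found."
      : hits.map((hit, i) => `[${i + 1}] ${hit.text}`).join("\n");

  // One system message and exactly one user turn; no earlier turns are replayed.
  return [
    { role: "system", content: SYSTEM_RULES },
    { role: "user", content: `Excerpts:\n${excerpts}\n\nQuestion: ${question}` },
  ];
}
```

With an empty hit list the user turn still has the same shape, so the model meets the placeholder exactly where excerpts would normally be and the first rule applies.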

Why no chat history?

LocalLens treats every question as an independent retrieval round. Retrieval is fresh, the LLM context is fresh, and follow-ups can't inherit half-true facts from earlier turns. If you want multi-turn dialogue, add it above buildGroundedHistory, not inside it.
