RAG vs LLM
Two jobs, two models — retrieval finds the evidence, the LLM writes the answer.
The R and the G in RAG name two different jobs. Retrieval finds evidence. Generation writes the answer. In LocalLens those two jobs are split across two separate models, in separate modules, on purpose.
| Side | Model | What it does |
|---|---|---|
| Retrieval (RAG) | GTE_LARGE_FP16 | Embeds documents and the question. Returns the top-K closest chunks. |
| Generation (LLM) | QWEN3_1_7B_INST_Q4 (fallback QWEN3_600M_INST_Q4) | Reads the question + retrieved chunks. Writes the answer with citations. |
The RAG side is mechanical — vector similarity over the workspace, no language understanding. The LLM side is where the reasoning and the prose happen. Retrieved chunks are evidence; the answer is what the LLM does with that evidence.
Retrieval finds evidence
Retrieval is the QVAC RAG layer's job, driven by the embedding model:
- ragChunk splits a document into overlapping token windows.
- ragIngest writes those chunks into a workspace, embedding each one with GTE_LARGE_FP16.
- ragSearch embeds a question with the same model and returns the top-K nearest chunks.
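The three steps above can be sketched in plain TypeScript. This is a minimal illustration of overlapping windows and top-K cosine search, not the QVAC implementation: chunkTokens, cosine, topK, and the Chunk and Indexed types are hypothetical names, and the vectors are assumed to come from whatever embedding call the workspace uses.

```typescript
// Hypothetical helpers for illustration; these are not the QVAC signatures.
type Chunk = { id: number; tokens: string[] };
type Indexed = { id: number; vector: number[] };

// Split a token list into windows of `size` tokens that share `overlap` tokens.
function chunkTokens(tokens: string[], size: number, overlap: number): Chunk[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("need size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap;
  const chunks: Chunk[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push({ id: chunks.length, tokens: tokens.slice(start, start + size) });
    if (start + size >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Rank every indexed chunk against the query vector and keep the best k.
function topK(query: number[], index: Indexed[], k: number): { id: number; score: number }[] {
  return index
    .map((entry) => ({ id: entry.id, score: cosine(query, entry.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Nothing in this sketch produces text. Its output is a ranked list of chunk ids and scores.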
Retrieval is not an answer. It's a list of excerpts that might contain the answer, ranked by semantic similarity to the question. Even an empty result is useful: the LLM can say that nothing matched instead of inventing something confident-sounding.
The embedding model never reads chat history, never produces text, and never sees the final answer. Its only output is vectors.
Generation writes the answer
Generation is the chat LLM's job. In LocalLens that's
QWEN3_1_7B_INST_Q4 by default, with QWEN3_600M_INST_Q4 as a
fallback when the 1.7B fails to load.
- completion() takes the ChatMessage[] history that the prompt builder produced and streams text back as contentDelta tokens.
- The system prompt requires the model to answer from the retrieved excerpts and cite them with brackets.
- The model never hits disk and never embeds anything. It only reads its context window — system rules, numbered excerpts, the question.
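A rough sketch of the fallback and the streaming loop follows. The Loader type, the ChatModel shape, and the idea that each streamed event carries a contentDelta string are assumptions made for illustration. The real loader and completion() signatures may differ.

```typescript
// Hypothetical shapes; the real QVAC loader and completion() API may differ.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type ChatModel = {
  name: string;
  completion(history: ChatMessage[]): AsyncIterable<{ contentDelta: string }>;
};
type Loader = (name: string) => Promise<ChatModel>;

// Try the 1.7B model first and fall back to the 600M model if loading fails.
async function loadChatModel(load: Loader): Promise<ChatModel> {
  try {
    return await load("QWEN3_1_7B_INST_Q4");
  } catch {
    return await load("QWEN3_600M_INST_Q4");
  }
}

// Concatenate the streamed deltas into the final answer text.
async function collectAnswer(model: ChatModel, history: ChatMessage[]): Promise<string> {
  let answer = "";
  for await (const event of model.completion(history)) {
    answer += event.contentDelta;
  }
  return answer;
}
```

The model only ever receives the history array. Search results reach it as text inside that array, never as vectors.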
This is the step that produces the actual answer. Everything before it has only produced evidence and a packaged prompt. Skip it, and you have a search engine, not a chat.
The split is what keeps the prompt builder small and the QVAC
gateway clean. The prompt builder knows nothing about embeddings or
generation. The gateway holds both models but exposes them through
two unrelated methods (search and answer).
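One way to picture that wiring, as a sketch under stated assumptions: the Gateway interface, the Hit shape, the ask function, and the topK value of 5 are all hypothetical. Only the method names search and answer come from the description above.

```typescript
// Hypothetical interfaces for illustration; LocalLens's real types may differ.
type Hit = { text: string; score: number };
type ChatMessage = { role: "system" | "user"; content: string };

interface Gateway {
  search(question: string, topK: number): Promise<Hit[]>; // embedding model only
  answer(history: ChatMessage[]): Promise<string>; // chat model only
}

// The prompt builder is a plain function: it touches neither model.
type PromptBuilder = (question: string, hits: Hit[]) => ChatMessage[];

async function ask(gateway: Gateway, build: PromptBuilder, question: string): Promise<string> {
  const hits = await gateway.search(question, 5); // 5 is an arbitrary example value
  const history = build(question, hits);
  return gateway.answer(history);
}
```

Because the builder is passed in as a pure function, it can be tested with hand-written hits and no model loaded at all.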
The seam between them
buildGroundedHistory(question, hits) is the seam. It takes a
question and a list of hits and produces a prompt where:
- excerpts are numbered ([1], [2], …) so the LLM can cite them;
- the system rules are explicit and short;
- there's exactly one user turn. No conversational history is replayed, because every question is treated as fresh evidence retrieval.
If search returns nothing useful, the prompt still goes through the
LLM, but with "No matching chunks were found." in place of the
excerpts. The system prompt's first rule then takes over and the
LLM says so plainly:
Only use facts that appear in the excerpts. If the answer is not in them, say so plainly.
That's the entire anti-hallucination strategy. Two models, one prompt rule, and it works.
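A minimal sketch of what such a builder could look like, assuming a Hit is just a piece of text. The first system rule and the empty-result placeholder are quoted from this page. The Hit shape, the wording of the citation rule, and the "Excerpts:" / "Question:" layout are assumptions, not the actual LocalLens prompt.

```typescript
// Sketch only: the Hit shape and most of the prompt wording are assumptions.
type Hit = { text: string };
type ChatMessage = { role: "system" | "user"; content: string };

const SYSTEM_RULES = [
  "Only use facts that appear in the excerpts. If the answer is not in them, say so plainly.",
  "Cite excerpts with bracketed numbers such as [1].", // assumed wording
].join("\n");

function buildGroundedHistory(question: string, hits: Hit[]): ChatMessage[] {
  // Number the excerpts from 1, or fall back to the placeholder when search found nothing.
  const excerpts =
    hits.length === 0
      ? "No matching chunks were found."
      : hits.map((hit, i) => `[${i + 1}] ${hit.text}`).join("\n");

  // One system message and exactly one user turn; no earlier turns are replayed.
  return [
    { role: "system", content: SYSTEM_RULES },
    { role: "user", content: `Excerpts:\n${excerpts}\n\nQuestion: ${question}` },
  ];
}
```

The function is pure: same question and hits in, same messages out, with no model or workspace access.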
Why no chat history?
LocalLens treats every question as an independent retrieval round.
Retrieval is fresh, the LLM context is fresh, and follow-ups can't
inherit half-true facts from earlier turns. If you want multi-turn
dialogue, add it above buildGroundedHistory, not inside it.