5. File adapters
Walk a folder, read text files, normalize browser file picker input.
src/files.ts does two things and refuses to do anything else:
- Local path: walk a folder, read text files, return
LocalDocument[]. - Browser path: take the file picker's input and produce the
same
LocalDocument[]shape.
No chunking. No embedding. No QVAC. This is the seam between files on disk or in a browser and typed data the rest of the app understands.
Filters
Before the code, two constants set the policy:
const supportedExtensions = new Set(
".css .html .js .jsx .json .md .mdx .ts .tsx .txt .yaml .yml".split(" "),
);
const ignoredDirectories = new Set(
".git .locallens .next .turbo build coverage dist node_modules".split(" "),
);
const maxFileBytes = 2 * 1024 * 1024;These decide what becomes a LocalDocument:
- text-shaped extensions only;
- never recurse into caches, lockdirs, or build output;
- skip files larger than 2 MB.
Whatever gets through is then filtered for embedded null bytes
(\u0000) and empty content. Both indicate non-text payloads that
a markdown chunker shouldn't try to handle.
Walking a local folder
export async function discoverTextDocuments(rootPath: string): Promise<LocalDocument[]> {
const root = path.resolve(rootPath);
const rootStats = await stat(root).catch(() => null);
if (!rootStats?.isDirectory()) throw new AppError(`Folder not found: ${root}`, 404);
const documents: LocalDocument[] = [];
for (const absolutePath of await walk(root)) {
const fileStats = await stat(absolutePath);
if (fileStats.size > maxFileBytes) continue;
const content = await readFile(absolutePath, "utf8").catch(() => "");
if (!content.trim() || content.includes("\u0000")) continue;
const relativePath = path.relative(root, absolutePath);
documents.push({
relativePath,
content,
checksum: createHash("sha256").update(`${relativePath}\u0000${content}`).digest("hex"),
bytes: fileStats.size,
});
}
return documents.sort((a, b) => a.relativePath.localeCompare(b.relativePath));
}Two design choices worth flagging:
- The checksum is computed from
${relativePath}\u0000${content}. The null-byte separator can't appear inside either field, so a file moved within the brain produces a different checksum even if its content hasn't changed. That's correct: chunk IDs include the path, so the embeddings need to invalidate. - The result is sorted by relative path. Stable order means stable chunk indices, which keeps citations stable across reindexes.
Normalizing browser input
export function browserDocumentsFromInput(inputs: BrowserDocumentInput[]): LocalDocument[] {
return inputs
.map((input) => {
const relativePath = normalizeBrowserRelativePath(input.relativePath);
const content = typeof input.content === "string" ? input.content : "";
const bytes =
Number.isFinite(input.bytes) && input.bytes >= 0
? input.bytes
: new TextEncoder().encode(content).byteLength;
if (!relativePath || !isSupportedPath(relativePath) || !content.trim()) return undefined;
if (content.includes("\u0000") || bytes > maxFileBytes) return undefined;
return {
relativePath,
content,
checksum: createHash("sha256").update(`${relativePath}\u0000${content}`).digest("hex"),
bytes,
};
})
.filter((document): document is LocalDocument => Boolean(document))
.sort((a, b) => a.relativePath.localeCompare(b.relativePath));
}Same rules as the local walk:
- supported extensions only;
- no null bytes;
- no files >2 MB;
- empty content rejected.
The difference is that the browser doesn't tell us where on disk
the file came from — only what the user picked relative to the
root they chose. So the function rejects .. segments and
dot-prefixed paths to keep the relative-path space clean:
function normalizeBrowserRelativePath(value: string): string | undefined {
const parts = value
.replace(/\\/g, "/")
.split("/")
.map((part) => part.trim())
.filter(Boolean);
if (parts.length === 0 || parts.some((part) => part === "." || part === "..")) return undefined;
return parts.join("/");
}Sanitizing folder names
export function sanitizeFolderName(value: string): string {
return (
value
.trim()
.replace(/[\\/]+/g, "-")
.replace(/[^a-zA-Z0-9._ -]+/g, "")
.replace(/\s+/g, " ")
.slice(0, 80) || "selected-folder"
);
}This produces the browser://my-folder virtual path stored on the
brain. Deliberately strict: no slashes, no funny characters. Folder
names from the picker can never look like real paths.
Why a 2 MB cap?
Most documentation files are small. The 2 MB ceiling is a sanity guard against accidentally indexing a build artefact or a large data file that slipped into the folder. If you want to index larger files, raise the limit in one place — but think about the embedding cost first.
Next: LocalLensApp, where
all of this gets wired together.