LocalLens

5. File adapters

Walk a folder, read text files, normalize browser file picker input.

src/files.ts does two things and refuses to do anything else:

  • Local path: walk a folder, read text files, return LocalDocument[].
  • Browser path: take the file picker's input and produce the same LocalDocument[] shape.

No chunking. No embedding. No QVAC. This is the seam between files on disk or in a browser and typed data the rest of the app understands.

Filters

Before the code, three constants set the policy:

src/files.ts
const supportedExtensions = new Set(
  ".css .html .js .jsx .json .md .mdx .ts .tsx .txt .yaml .yml".split(" "),
);
const ignoredDirectories = new Set(
  ".git .locallens .next .turbo build coverage dist node_modules".split(" "),
);
const maxFileBytes = 2 * 1024 * 1024;

These decide what becomes a LocalDocument:

  • text-shaped extensions only;
  • never recurse into version-control metadata, caches, dependency folders, or build output;
  • skip files larger than 2 MB.

Whatever gets through is then checked for embedded null bytes (\u0000) and empty content. A null byte signals a binary payload, and a whitespace-only file has nothing to index, so a markdown chunker shouldn't see either.
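The walk helper that applies the first two rules isn't shown on this page. As a rough sketch only, a traversal that honours both sets might look like the following. walkSketch and hasSupportedExtension are hypothetical names, and the case-insensitive extension match and the skipping of symlinks are assumptions, not confirmed behaviour of src/files.ts.

```typescript
import { readdir } from "node:fs/promises";
import path from "node:path";

const supportedExtensions = new Set(
  ".css .html .js .jsx .json .md .mdx .ts .tsx .txt .yaml .yml".split(" "),
);
const ignoredDirectories = new Set(
  ".git .locallens .next .turbo build coverage dist node_modules".split(" "),
);

// Hypothetical helper: true when the file name ends in a supported extension.
// Lower-casing the extension is an assumption made for this sketch.
function hasSupportedExtension(filePath: string): boolean {
  return supportedExtensions.has(path.extname(filePath).toLowerCase());
}

// Hypothetical stand-in for walk(): returns absolute paths of candidate files.
export async function walkSketch(directory: string): Promise<string[]> {
  const found: string[] = [];
  for (const entry of await readdir(directory, { withFileTypes: true })) {
    const absolutePath = path.join(directory, entry.name);
    if (entry.isDirectory()) {
      // Ignored directories are pruned before recursing, so their contents are never read.
      if (!ignoredDirectories.has(entry.name)) {
        found.push(...(await walkSketch(absolutePath)));
      }
    } else if (entry.isFile() && hasSupportedExtension(entry.name)) {
      found.push(absolutePath);
    }
    // Symlinks and other entry types fall through and are skipped in this sketch.
  }
  return found;
}
```

Pruning at the directory level matters more than the extension check: it keeps the walk from ever listing node_modules, which is usually where most of the files are.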

Walking a local folder

export async function discoverTextDocuments(rootPath: string): Promise<LocalDocument[]> {
  const root = path.resolve(rootPath);
  const rootStats = await stat(root).catch(() => null);
  if (!rootStats?.isDirectory()) throw new AppError(`Folder not found: ${root}`, 404);

  const documents: LocalDocument[] = [];

  for (const absolutePath of await walk(root)) {
    const fileStats = await stat(absolutePath);
    if (fileStats.size > maxFileBytes) continue;

    const content = await readFile(absolutePath, "utf8").catch(() => "");
    if (!content.trim() || content.includes("\u0000")) continue;

    const relativePath = path.relative(root, absolutePath);
    documents.push({
      relativePath,
      content,
      checksum: createHash("sha256").update(`${relativePath}\u0000${content}`).digest("hex"),
      bytes: fileStats.size,
    });
  }

  return documents.sort((a, b) => a.relativePath.localeCompare(b.relativePath));
}

Two design choices worth flagging:

  • The checksum is computed from ${relativePath}\u0000${content}. Because the path is part of the hashed input, a file moved within the brain produces a different checksum even if its content hasn't changed. That's correct: chunk IDs include the path, so the embeddings need to invalidate. The null-byte separator can't appear inside either field, so two different path/content pairs can never concatenate to the same input.
  • The result is sorted by relative path. Stable order means stable chunk indices, which keeps citations stable across reindexes.
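To make the checksum behaviour concrete, here is a small sketch that uses the same hash input as the function above. documentChecksum is a hypothetical wrapper introduced only for this example, and the paths are made up.

```typescript
import { createHash } from "node:crypto";

// Hypothetical wrapper around the expression used in discoverTextDocuments.
function documentChecksum(relativePath: string, content: string): string {
  return createHash("sha256").update(`${relativePath}\u0000${content}`).digest("hex");
}

// Same content, different location: the checksum changes because the path is hashed too.
const beforeMove = documentChecksum("notes/a.md", "# Hello");
const afterMove = documentChecksum("archive/a.md", "# Hello");
console.log(beforeMove === afterMove); // false

// The separator keeps the path/content boundary unambiguous:
// without it, both pairs below would hash the same string "abc".
const first = documentChecksum("ab", "c");
const second = documentChecksum("a", "bc");
console.log(first === second); // false
```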

Normalizing browser input

export function browserDocumentsFromInput(inputs: BrowserDocumentInput[]): LocalDocument[] {
  return inputs
    .map((input) => {
      const relativePath = normalizeBrowserRelativePath(input.relativePath);
      const content = typeof input.content === "string" ? input.content : "";
      const bytes =
        Number.isFinite(input.bytes) && input.bytes >= 0
          ? input.bytes
          : new TextEncoder().encode(content).byteLength;

      if (!relativePath || !isSupportedPath(relativePath) || !content.trim()) return undefined;
      if (content.includes("\u0000") || bytes > maxFileBytes) return undefined;

      return {
        relativePath,
        content,
        checksum: createHash("sha256").update(`${relativePath}\u0000${content}`).digest("hex"),
        bytes,
      };
    })
    .filter((document): document is LocalDocument => Boolean(document))
    .sort((a, b) => a.relativePath.localeCompare(b.relativePath));
}
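One detail in the function above is easy to miss: when input.bytes is missing or invalid, the size falls back to the UTF-8 encoded length of the content, not its character count. A minimal sketch of that fallback, where utf8ByteLength is a hypothetical name used only for illustration:

```typescript
// Hypothetical helper mirroring the fallback expression in browserDocumentsFromInput.
function utf8ByteLength(content: string): number {
  return new TextEncoder().encode(content).byteLength;
}

// ASCII text: one byte per character.
const asciiBytes = utf8ByteLength("hello"); // 5

// "caf\u00e9" is four characters, but the accented letter takes two bytes in UTF-8.
const accented = "caf\u00e9";
const accentedBytes = utf8ByteLength(accented); // 5
const accentedLength = accented.length; // 4

console.log({ asciiBytes, accentedBytes, accentedLength });
```

Comparing content.length against maxFileBytes would under-count any file with non-ASCII text, which is why the encoded length is the safer measure for the size cap.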

Same rules as the local walk:

  • supported extensions only;
  • no null bytes;
  • no files >2 MB;
  • empty content rejected.

The difference is that the browser doesn't tell us where on disk the file came from, only what the user picked relative to the root they chose. So the helper below converts backslashes to forward slashes, drops empty segments, and rejects any path containing a . or .. segment to keep the relative-path space clean:

function normalizeBrowserRelativePath(value: string): string | undefined {
  const parts = value
    .replace(/\\/g, "/")
    .split("/")
    .map((part) => part.trim())
    .filter(Boolean);
  if (parts.length === 0 || parts.some((part) => part === "." || part === "..")) return undefined;
  return parts.join("/");
}
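A few sample inputs show what this does in practice. The function body is copied from above so the snippet runs on its own; the paths are invented for illustration.

```typescript
function normalizeBrowserRelativePath(value: string): string | undefined {
  const parts = value
    .replace(/\\/g, "/")
    .split("/")
    .map((part) => part.trim())
    .filter(Boolean);
  if (parts.length === 0 || parts.some((part) => part === "." || part === "..")) return undefined;
  return parts.join("/");
}

// Windows-style separators are converted to forward slashes.
console.log(normalizeBrowserRelativePath("docs\\guide\\intro.md")); // "docs/guide/intro.md"

// Repeated slashes and padded segments are cleaned up.
console.log(normalizeBrowserRelativePath("//docs// a.md ")); // "docs/a.md"

// Any "." or ".." segment rejects the whole path.
console.log(normalizeBrowserRelativePath("../secrets.txt")); // undefined
console.log(normalizeBrowserRelativePath("./a.md")); // undefined

// Nothing left after cleanup is also a rejection.
console.log(normalizeBrowserRelativePath(" / ")); // undefined
```

Note that the path is rejected outright, not repaired: docs/../a.md comes back as undefined and is never collapsed to a.md.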

Sanitizing folder names

export function sanitizeFolderName(value: string): string {
  return (
    value
      .trim()
      .replace(/[\\/]+/g, "-")
      .replace(/[^a-zA-Z0-9._ -]+/g, "")
      .replace(/\s+/g, " ")
      .slice(0, 80) || "selected-folder"
  );
}

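This produces the browser://my-folder virtual path stored on the brain. Deliberately strict: slashes become hyphens, anything outside letters, digits, dots, underscores, spaces, and hyphens is dropped, and the result is capped at 80 characters, with selected-folder as the fallback when nothing survives. Folder names from the picker can never look like real paths.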
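Some sample inputs, again with the function copied from above so the snippet is self-contained. virtualPathFor is a hypothetical helper: this page doesn't show how the browser:// prefix is attached, so treat that part as an assumption.

```typescript
function sanitizeFolderName(value: string): string {
  return (
    value
      .trim()
      .replace(/[\\/]+/g, "-")
      .replace(/[^a-zA-Z0-9._ -]+/g, "")
      .replace(/\s+/g, " ")
      .slice(0, 80) || "selected-folder"
  );
}

// Hypothetical: one way the sanitized name could become the stored virtual path.
function virtualPathFor(folderName: string): string {
  return `browser://${sanitizeFolderName(folderName)}`;
}

console.log(sanitizeFolderName("  My Notes/2024  ")); // "My Notes-2024"
console.log(sanitizeFolderName("C:\\Users\\me\\docs")); // "C-Users-me-docs"
console.log(sanitizeFolderName("a   b")); // "a b"
console.log(sanitizeFolderName("???")); // "selected-folder"
console.log(sanitizeFolderName("x".repeat(200)).length); // 80
console.log(virtualPathFor("my-folder")); // "browser://my-folder"
```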

Why a 2 MB cap?

Most documentation files are small. The 2 MB ceiling is a sanity guard against accidentally indexing a build artefact or a large data file that slipped into the folder. If you want to index larger files, raise maxFileBytes, the one place the limit is defined, but think about the embedding cost first.

Next: LocalLensApp, where all of this gets wired together.
