[diffs] Byte arena parsing experiments by clemg · Pull Request #835 · pierrecomputer/pierre

clemg · 2026-06-18T13:52:20Z

Description

This PR brings the two proposed experiments from #760 together in one cleaner change on top of beta-1.3.

The goal is to store each file's diff lines as a per-file UTF-8 byte arena (meaning that it's just contiguous bytes in a UTF-8 byte array), and parsing the patch straight into that arena as it streams, instead of building tons of JS strings per line.

The suggested approach in this PR produces byte-identical parsed output to today's string approach

So whats in there:

lines are stored as bytes, not as string[]. Each side of a file (before/after) packs all its lines into one UTF-8 Uint8Array (bytes) + a small Int32Array (offsets) of per line offsets; so a line is just decode(bytes[offsets[i]..offsets[i+1]])
The API is a bit changed, string we go from having a string[] to a custom type DiffLines data object. So reads are lineAt(lines, i) and joinLines(lines) instead of lines[i] and lines.join('')
Its stored as plain data, with helpers I said above. We could use a class to mimic the current behavior, but classes can't be thrown or read as is from/to workers, and that's annoying
The patch is parsed directly into the arena. As bytes stream in the parser only peeks at the tiny diff --git / @@ headers to build the structure. This makes parsing noticeably faster and steadier (see benchmarks below)
A line becomes a real string only when it scrolls into view (the viewer already virtualizes today, decoding on-screen lines through one shared TextDecoder). A nice side effect on huge diffs (on my hardware, Air M2): fast-scrolling to a far-away file no longer flashes blank rows for a beat before the text paints. I'm farily sure that's because the whole tab is so much lighter now, not the decode itself
Detaching per line is gone, because the arena already copies each line into its own buffer, so its detached from the source patch for free
Its rebased on top of beta-1.3 from yesterday, so I think I took everything you've already merged about the parsing into account

Note: I left a bunch of // Adapted from ... comments here and there, for functions and stuff I just adapted from before. I don't think its worth keeping but that might help for a review

I also did some opinionated stuff about how I organized my functions and chose to represent data, please let me know if things should be changed!

Results:

The after-GC renderer footprint (of after a stream is settled for long enough), in MB, mean of 3 runs, measured on my VPS:

diff	main	beta-1.3	this PR	Δ vs main
node PR (11 MB)	324	324	293	−9%
ghostty PR (3.4 MB)	248	248	244	−1%
bun PR (41 MB)	441	445	332	−25%
linux v6.0…v7.0 (678 MB)	3179	3185	1711	−46%

For the linux comparison, the peak memory usage while streaming also drops from 4259 (avg) to 2162 (avg), about half less. This is also true for the parsing time (in CPU time used), dropping from 107s to 42.5s (-2.5x).

Motivation & Context

As I said in #775, I'm getting OOM errors on diffshub with linux v6...v7's compare. This fixes it for me (and should for a bunch of different hardware), and also makes the parsing/memory usage more stable

The results of my testing are clear: for instance on linux's compare, its saving about 50% of memory usage, drastically reduces the peak memory usage, and makes the parsing ~2.5x faster. I (my clanked) made this page diffshub-bench.clemg.fr that allows you to start benchmarks on any PR of compare on 3 versions of diffshub: diffshub's main, beta branch and this proposed branch. It runs on my slow as fuck VPS but allows to compare remotely a lot of tests on the same hardware, without setup. It also has some already ran tests for you to check if you want

From my testings, I haven't found any issue on rendering the CodeView component, the docs, the editor or anything already existing. You should obviously try it for yourself because you know this much better than I do

Since you (Amadeus) mentioned that it was easier for you to test everything in one go, this PR shadows the other other one (that I should close?).

Anyway, happy to discuss the whys and the hows of this, and suggestions if you have any

Type of changes

Bug fix (non-breaking change which fixes an issue)
Refactoring (non-breaking change)
New feature (non-breaking change which adds functionality). You must have
first discussed with the dev team and they should be aware that this PR is
being opened
Breaking change (fix or feature that would change existing functionality).
You must have first discussed with the dev team and they should be aware
that this PR is being opened
Documentation update

Checklist

I have read the
contributing guidelines
My code follows the code style of the project (bun run lint)
My code is formatted properly (bun run format)
I have updated the documentation accordingly (if applicable)
I have added tests to cover my changes (if applicable)
All new and existing tests pass (bun run diffs:test)

How was AI used in generating this PR

I used AI to write all the tests (and reviewed them), parts of the commit messages and to rebase on top of beta-1.3 from my old experiment branch to this new one (a lot changed since)

Related issues

See: #760

vercel · 2026-06-18T13:52:29Z

@clemg is attempting to deploy a commit to the Pierre Computer Company Team on Vercel.

A member of the Team first needs to authorize it.

`additionLines`/`deletionLines` change from `string[]` to `DiffLines`: a plain data object holding a file's lines as one UTF-8 byte arena plus an offset table, decoded on demand via `lineAt` / `joinLines`. On a huge diff (linux v6..v7, ~22.8M lines across ~77k files) this avoids tens of millions of tiny `String` objects, so the V8 heap drops ~33% on that compare and the parser is faster: it no longer encode+decode-detaches every line, it encodes once on seal and decodes only the visible (virtualized) lines. It is plain data on purpose, so it survives structured clone (the highlight worker), `structuredClone`, and IndexedDB without a revive step (no class, no prototype to drop). `.length` stays a field, so the many `.length` consumers are unchanged; only content reads migrate (`x[i]` -> `lineAt(x, i)`, `x.join('')` -> `joinLines(x)`). Per-file offsets use the smallest int width that fits the file. A file with a lone surrogate keeps exact strings as a fallback, and merge-conflict diffs keep plain strings (no encode) so their parse stays at parity. The parsed model is byte-identical to before (snapshot + content-hash). The editor's realtime-update path edits addition lines in place (DiffHunksRenderer.updateRenderCache / updateDiffHunks), so diffLines also exposes mutableLines (decode an arena side into the plain form once, on the first edit) and joinLineRange (read a hunk's line range as one contiguous byte slice for the partial reparse). Whole-document changes reassign the side with plainLines(splitFileContents(...)), keeping the editor on plain strings while it is editing.

Adds diffLines.test.ts (arena round-trip, multibyte, emoji-keeps-arena, lone-surrogate fallback, BOM, offset-width, plainLines, joinLines, isWellFormed) and a withPlainLines snapshot converter so the existing parsed-model snapshots assert byte-identical line content. The audited tests' direct string[] reads move to the DiffLines helpers (lineAt / linesToArray / plainLines), and the updateDiffHunks fixtures edit lines through the plain form like the editor does.

processFileBytes parses a single file's diff straight from its UTF-8 bytes: only the file header and the @@ hunk header lines are decoded into JS strings, every content line's bytes are copied verbatim into the per-side arenas and decoded on demand by lineAt, so a parse allocates no per-line strings and no per-line garbage. The per-side byte arena is filled by a small SideBuilder (appendLine / sealSide) that sits in diffLines.ts next to finishLines, so every way of building a DiffLines lives in one module, while parsePatchFiles keeps only the byte scanners that walk the patch structure. It is the only hunk-content parser: processFile encodes its string once up front and hands the bytes over, and the full-file path (parseDiffFromFile, merge conflicts) rides the same loop: the patch bytes drive the hunk structure while the sides keep coming from the caller's contents strings through finishLines, like before. The previous per-line string loop and its helpers go away. Two byte-only behaviors to know about: invalid UTF-8 stays as-is in the arena and decodes to U+FFFD on read (the same text a stream decode produces), and a patch string holding a lone surrogate (which the up-front encode would corrupt) gets its exact line strings rebuilt from the original text into the plain-string form, so the surrogate-preservation behavior is unchanged. The parsed model is hash-identical to the previous parser on linux v6.0..v7.0 (76,872 files, 22.8M lines, every line compared).

streamGitPatchFiles moves into the package as a byte-level splitter: the patch stream buffers as bytes and each file is handed over as a Uint8Array slice at its diff --git boundary, ready for processFileBytes, so the bulk of the patch is never decoded into JS strings. Format-patch commit metadata splits on the From <hash> boundary like the string splitter did, a stream-leading BOM is stripped like a whole-stream TextDecoder would, and a boundary-less patch falls back to one decoded string for parsePatchFiles. The byte scanners and ASCII byte codes shared by the splitter and the file parser (matchesAscii, lineEndExclusive, a generic findNextLineStartingWith, isBlankLine, hasNonWhitespace, and the byte constants) now live in one byteScan module, defined once and kept out of the package's public exports. diffshub's viewer feeds the slices straight to processFileBytes; its own string splitter goes away, and it imports COMMIT_HASH_METADATA_PATTERN from the package instead of keeping a second copy.

Chunk-size sweeps over the file boundary splitting (including one-byte chunks), per-file model parity between the streamed bytes and the string entry, format-patch metadata attachment, stream-leading vs in-file BOM, the unified-diff fallback, and the invalid-UTF-8 read-back.

clemg · 2026-06-25T19:09:40Z

rebased, set ready to review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4c1f6db212

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-25T19:17:21Z

+   * count with `deletionLines.length` (see {@link DiffLines}).
   */
-  deletionLines: string[];
+  deletionLines: DiffLines;


Convert empty-document recompute to DiffLines

When a non-partial diff is edited down to an empty document, DiffHunksRenderer.applyDocumentChange calls recomputeEmptyDocumentDiff; that helper still does diff.deletionLines.join('') and returns additionLines: ['']. With this field now a DiffLines object, .join is undefined at runtime (and the return shape is not readable by lineAt), so emptying an editable diff throws instead of preserving the caret row.

Useful? React with 👍 / 👎.

actually a good point

amadeus · 2026-06-26T01:02:41Z

Awesome! We are right now in the process of trying to get the editor branch merged into main (targeting next week some time). So as soon as we land that, i'll look into this closer. One kinda interesting this is we might start looking at doing some byte arena stuff for our semantic highlighting APIs, so something like this might also have a natural fit in landing with that (which is still a bit further out)

clemg changed the title ~~Clemg/byte arena parsing~~ [diffs] Byte arena parsing experiments Jun 18, 2026

amadeus force-pushed the beta-1.3 branch 4 times, most recently from 94717e0 to da2129f Compare June 25, 2026 00:39

clemg added 5 commits June 25, 2026 21:06

clemg force-pushed the clemg/byte-arena-parsing branch from 1007982 to 4c1f6db Compare June 25, 2026 19:07

clemg marked this pull request as ready for review June 25, 2026 19:09

chatgpt-codex-connector Bot reviewed Jun 25, 2026

View reviewed changes

fix(diffs): convert empty doc recompute to DiffLines

f11a6d0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[diffs] Byte arena parsing experiments#835

[diffs] Byte arena parsing experiments#835
clemg wants to merge 6 commits into
pierrecomputer:beta-1.3from
clemg:clemg/byte-arena-parsing

clemg commented Jun 18, 2026

Uh oh!

vercel Bot commented Jun 18, 2026

Uh oh!

clemg commented Jun 25, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 25, 2026

Uh oh!

clemg Jun 29, 2026

Uh oh!

amadeus commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

clemg commented Jun 18, 2026

Description

Results:

Motivation & Context

Type of changes

Checklist

How was AI used in generating this PR

Related issues

Uh oh!

vercel Bot commented Jun 18, 2026

Uh oh!

clemg commented Jun 25, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

clemg Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

amadeus commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants