Skip to content

[diffs] Byte arena parsing experiments#835

Open
clemg wants to merge 6 commits into
pierrecomputer:beta-1.3from
clemg:clemg/byte-arena-parsing
Open

[diffs] Byte arena parsing experiments#835
clemg wants to merge 6 commits into
pierrecomputer:beta-1.3from
clemg:clemg/byte-arena-parsing

Conversation

@clemg

@clemg clemg commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Description

This PR brings the two proposed experiments from #760 together in one cleaner change on top of beta-1.3.

The goal is to store each file's diff lines as a per-file UTF-8 byte arena (meaning that it's just contiguous bytes in a UTF-8 byte array), and parsing the patch straight into that arena as it streams, instead of building tons of JS strings per line.

The suggested approach in this PR produces byte-identical parsed output to today's string approach

So whats in there:

  • lines are stored as bytes, not as string[]. Each side of a file (before/after) packs all its lines into one UTF-8 Uint8Array (bytes) + a small Int32Array (offsets) of per line offsets; so a line is just decode(bytes[offsets[i]..offsets[i+1]])
  • The API is a bit changed, string we go from having a string[] to a custom type DiffLines data object. So reads are lineAt(lines, i) and joinLines(lines) instead of lines[i] and lines.join('')
  • Its stored as plain data, with helpers I said above. We could use a class to mimic the current behavior, but classes can't be thrown or read as is from/to workers, and that's annoying
  • The patch is parsed directly into the arena. As bytes stream in the parser only peeks at the tiny diff --git / @@ headers to build the structure. This makes parsing noticeably faster and steadier (see benchmarks below)
  • A line becomes a real string only when it scrolls into view (the viewer already virtualizes today, decoding on-screen lines through one shared TextDecoder). A nice side effect on huge diffs (on my hardware, Air M2): fast-scrolling to a far-away file no longer flashes blank rows for a beat before the text paints. I'm farily sure that's because the whole tab is so much lighter now, not the decode itself
  • Detaching per line is gone, because the arena already copies each line into its own buffer, so its detached from the source patch for free
  • Its rebased on top of beta-1.3 from yesterday, so I think I took everything you've already merged about the parsing into account

Note: I left a bunch of // Adapted from ... comments here and there, for functions and stuff I just adapted from before. I don't think its worth keeping but that might help for a review

I also did some opinionated stuff about how I organized my functions and chose to represent data, please let me know if things should be changed!

Results:

The after-GC renderer footprint (of after a stream is settled for long enough), in MB, mean of 3 runs, measured on my VPS:

diff main beta-1.3 this PR Δ vs main
node PR (11 MB) 324 324 293 −9%
ghostty PR (3.4 MB) 248 248 244 −1%
bun PR (41 MB) 441 445 332 −25%
linux v6.0…v7.0 (678 MB) 3179 3185 1711 −46%

For the linux comparison, the peak memory usage while streaming also drops from 4259 (avg) to 2162 (avg), about half less. This is also true for the parsing time (in CPU time used), dropping from 107s to 42.5s (-2.5x).

Motivation & Context

As I said in #775, I'm getting OOM errors on diffshub with linux v6...v7's compare. This fixes it for me (and should for a bunch of different hardware), and also makes the parsing/memory usage more stable

The results of my testing are clear: for instance on linux's compare, its saving about 50% of memory usage, drastically reduces the peak memory usage, and makes the parsing ~2.5x faster. I (my clanked) made this page diffshub-bench.clemg.fr that allows you to start benchmarks on any PR of compare on 3 versions of diffshub: diffshub's main, beta branch and this proposed branch. It runs on my slow as fuck VPS but allows to compare remotely a lot of tests on the same hardware, without setup. It also has some already ran tests for you to check if you want

From my testings, I haven't found any issue on rendering the CodeView component, the docs, the editor or anything already existing. You should obviously try it for yourself because you know this much better than I do

Since you (Amadeus) mentioned that it was easier for you to test everything in one go, this PR shadows the other other one (that I should close?).

Anyway, happy to discuss the whys and the hows of this, and suggestions if you have any

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Refactoring (non-breaking change)
  • New feature (non-breaking change which adds functionality). You must have
    first discussed with the dev team and they should be aware that this PR is
    being opened
  • Breaking change (fix or feature that would change existing functionality).
    You must have first discussed with the dev team and they should be aware
    that this PR is being opened
  • Documentation update

Checklist

  • I have read the
    contributing guidelines
  • My code follows the code style of the project (bun run lint)
  • My code is formatted properly (bun run format)
  • I have updated the documentation accordingly (if applicable)
  • I have added tests to cover my changes (if applicable)
  • All new and existing tests pass (bun run diffs:test)

How was AI used in generating this PR

I used AI to write all the tests (and reviewed them), parts of the commit messages and to rebase on top of beta-1.3 from my old experiment branch to this new one (a lot changed since)

Related issues

See: #760

@vercel

vercel Bot commented Jun 18, 2026

Copy link
Copy Markdown

@clemg is attempting to deploy a commit to the Pierre Computer Company Team on Vercel.

A member of the Team first needs to authorize it.

@clemg clemg changed the title Clemg/byte arena parsing [diffs] Byte arena parsing experiments Jun 18, 2026
@amadeus amadeus force-pushed the beta-1.3 branch 4 times, most recently from 94717e0 to da2129f Compare June 25, 2026 00:39
clemg added 5 commits June 25, 2026 21:06
`additionLines`/`deletionLines` change from `string[]` to `DiffLines`: a plain
data object holding a file's lines as one UTF-8 byte arena plus an offset table,
decoded on demand via `lineAt` / `joinLines`. On a huge diff (linux v6..v7,
~22.8M lines across ~77k files) this avoids tens of millions of tiny `String`
objects, so the V8 heap drops ~33% on that compare and the parser is faster: it
no longer encode+decode-detaches every line, it encodes once on seal and decodes
only the visible (virtualized) lines.

It is plain data on purpose, so it survives structured clone (the highlight
worker), `structuredClone`, and IndexedDB without a revive step (no class, no
prototype to drop). `.length` stays a field, so the many `.length` consumers are
unchanged; only content reads migrate (`x[i]` -> `lineAt(x, i)`,
`x.join('')` -> `joinLines(x)`). Per-file offsets use the smallest int width that
fits the file. A file with a lone surrogate keeps exact strings as a fallback,
and merge-conflict diffs keep plain strings (no encode) so their parse stays at
parity. The parsed model is byte-identical to before (snapshot + content-hash).

The editor's realtime-update path edits addition lines in place
(DiffHunksRenderer.updateRenderCache / updateDiffHunks), so diffLines also
exposes mutableLines (decode an arena side into the plain form once, on the
first edit) and joinLineRange (read a hunk's line range as one contiguous byte
slice for the partial reparse). Whole-document changes reassign the side with
plainLines(splitFileContents(...)), keeping the editor on plain strings while
it is editing.
Adds diffLines.test.ts (arena round-trip, multibyte, emoji-keeps-arena,
lone-surrogate fallback, BOM, offset-width, plainLines, joinLines, isWellFormed)
and a withPlainLines snapshot converter so the existing parsed-model snapshots
assert byte-identical line content. The audited tests' direct string[] reads
move to the DiffLines helpers (lineAt / linesToArray / plainLines), and the
updateDiffHunks fixtures edit lines through the plain form like the editor
does.
processFileBytes parses a single file's diff straight from its UTF-8 bytes:
only the file header and the @@ hunk header lines are decoded into JS
strings, every content line's bytes are copied verbatim into the per-side
arenas and decoded on demand by lineAt, so a parse allocates no per-line
strings and no per-line garbage.

The per-side byte arena is filled by a small SideBuilder (appendLine /
sealSide) that sits in diffLines.ts next to finishLines, so every way of
building a DiffLines lives in one module, while parsePatchFiles keeps only
the byte scanners that walk the patch structure.

It is the only hunk-content parser: processFile encodes its string once up
front and hands the bytes over, and the full-file path (parseDiffFromFile,
merge conflicts) rides the same loop: the patch bytes drive the hunk
structure while the sides keep coming from the caller's contents strings
through finishLines, like before. The previous per-line string loop and its
helpers go away.

Two byte-only behaviors to know about: invalid UTF-8 stays as-is in the
arena and decodes to U+FFFD on read (the same text a stream decode
produces), and a patch string holding a lone surrogate (which the up-front
encode would corrupt) gets its exact line strings rebuilt from the original
text into the plain-string form, so the surrogate-preservation behavior is
unchanged.

The parsed model is hash-identical to the previous parser on linux
v6.0..v7.0 (76,872 files, 22.8M lines, every line compared).
streamGitPatchFiles moves into the package as a byte-level splitter: the
patch stream buffers as bytes and each file is handed over as a Uint8Array
slice at its diff --git boundary, ready for processFileBytes, so the bulk of
the patch is never decoded into JS strings. Format-patch commit metadata
splits on the From <hash> boundary like the string splitter did, a
stream-leading BOM is stripped like a whole-stream TextDecoder would, and a
boundary-less patch falls back to one decoded string for parsePatchFiles.

The byte scanners and ASCII byte codes shared by the splitter and the file
parser (matchesAscii, lineEndExclusive, a generic findNextLineStartingWith,
isBlankLine, hasNonWhitespace, and the byte constants) now live in one
byteScan module, defined once and kept out of the package's public exports.

diffshub's viewer feeds the slices straight to processFileBytes; its own
string splitter goes away, and it imports COMMIT_HASH_METADATA_PATTERN from
the package instead of keeping a second copy.
Chunk-size sweeps over the file boundary splitting (including one-byte
chunks), per-file model parity between the streamed bytes and the string
entry, format-patch metadata attachment, stream-leading vs in-file BOM,
the unified-diff fallback, and the invalid-UTF-8 read-back.
@clemg clemg force-pushed the clemg/byte-arena-parsing branch from 1007982 to 4c1f6db Compare June 25, 2026 19:07
@clemg clemg marked this pull request as ready for review June 25, 2026 19:09
@clemg

clemg commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

rebased, set ready to review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4c1f6db212

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

* count with `deletionLines.length` (see {@link DiffLines}).
*/
deletionLines: string[];
deletionLines: DiffLines;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Convert empty-document recompute to DiffLines

When a non-partial diff is edited down to an empty document, DiffHunksRenderer.applyDocumentChange calls recomputeEmptyDocumentDiff; that helper still does diff.deletionLines.join('') and returns additionLines: ['']. With this field now a DiffLines object, .join is undefined at runtime (and the return shape is not readable by lineAt), so emptying an editable diff throws instead of preserving the caret row.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually a good point

@amadeus

amadeus commented Jun 26, 2026

Copy link
Copy Markdown
Member

Awesome! We are right now in the process of trying to get the editor branch merged into main (targeting next week some time). So as soon as we land that, i'll look into this closer. One kinda interesting this is we might start looking at doing some byte arena stuff for our semantic highlighting APIs, so something like this might also have a natural fit in landing with that (which is still a bit further out)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants