[diffs] Byte arena parsing experiments#835
Conversation
|
@clemg is attempting to deploy a commit to the Pierre Computer Company Team on Vercel. A member of the Team first needs to authorize it. |
94717e0 to
da2129f
Compare
`additionLines`/`deletionLines` change from `string[]` to `DiffLines`: a plain
data object holding a file's lines as one UTF-8 byte arena plus an offset table,
decoded on demand via `lineAt` / `joinLines`. On a huge diff (linux v6..v7,
~22.8M lines across ~77k files) this avoids tens of millions of tiny `String`
objects, so the V8 heap drops ~33% on that compare and the parser is faster: it
no longer encode+decode-detaches every line, it encodes once on seal and decodes
only the visible (virtualized) lines.
It is plain data on purpose, so it survives structured clone (the highlight
worker), `structuredClone`, and IndexedDB without a revive step (no class, no
prototype to drop). `.length` stays a field, so the many `.length` consumers are
unchanged; only content reads migrate (`x[i]` -> `lineAt(x, i)`,
`x.join('')` -> `joinLines(x)`). Per-file offsets use the smallest int width that
fits the file. A file with a lone surrogate keeps exact strings as a fallback,
and merge-conflict diffs keep plain strings (no encode) so their parse stays at
parity. The parsed model is byte-identical to before (snapshot + content-hash).
The editor's realtime-update path edits addition lines in place
(DiffHunksRenderer.updateRenderCache / updateDiffHunks), so diffLines also
exposes mutableLines (decode an arena side into the plain form once, on the
first edit) and joinLineRange (read a hunk's line range as one contiguous byte
slice for the partial reparse). Whole-document changes reassign the side with
plainLines(splitFileContents(...)), keeping the editor on plain strings while
it is editing.
Adds diffLines.test.ts (arena round-trip, multibyte, emoji-keeps-arena, lone-surrogate fallback, BOM, offset-width, plainLines, joinLines, isWellFormed) and a withPlainLines snapshot converter so the existing parsed-model snapshots assert byte-identical line content. The audited tests' direct string[] reads move to the DiffLines helpers (lineAt / linesToArray / plainLines), and the updateDiffHunks fixtures edit lines through the plain form like the editor does.
processFileBytes parses a single file's diff straight from its UTF-8 bytes: only the file header and the @@ hunk header lines are decoded into JS strings, every content line's bytes are copied verbatim into the per-side arenas and decoded on demand by lineAt, so a parse allocates no per-line strings and no per-line garbage. The per-side byte arena is filled by a small SideBuilder (appendLine / sealSide) that sits in diffLines.ts next to finishLines, so every way of building a DiffLines lives in one module, while parsePatchFiles keeps only the byte scanners that walk the patch structure. It is the only hunk-content parser: processFile encodes its string once up front and hands the bytes over, and the full-file path (parseDiffFromFile, merge conflicts) rides the same loop: the patch bytes drive the hunk structure while the sides keep coming from the caller's contents strings through finishLines, like before. The previous per-line string loop and its helpers go away. Two byte-only behaviors to know about: invalid UTF-8 stays as-is in the arena and decodes to U+FFFD on read (the same text a stream decode produces), and a patch string holding a lone surrogate (which the up-front encode would corrupt) gets its exact line strings rebuilt from the original text into the plain-string form, so the surrogate-preservation behavior is unchanged. The parsed model is hash-identical to the previous parser on linux v6.0..v7.0 (76,872 files, 22.8M lines, every line compared).
streamGitPatchFiles moves into the package as a byte-level splitter: the patch stream buffers as bytes and each file is handed over as a Uint8Array slice at its diff --git boundary, ready for processFileBytes, so the bulk of the patch is never decoded into JS strings. Format-patch commit metadata splits on the From <hash> boundary like the string splitter did, a stream-leading BOM is stripped like a whole-stream TextDecoder would, and a boundary-less patch falls back to one decoded string for parsePatchFiles. The byte scanners and ASCII byte codes shared by the splitter and the file parser (matchesAscii, lineEndExclusive, a generic findNextLineStartingWith, isBlankLine, hasNonWhitespace, and the byte constants) now live in one byteScan module, defined once and kept out of the package's public exports. diffshub's viewer feeds the slices straight to processFileBytes; its own string splitter goes away, and it imports COMMIT_HASH_METADATA_PATTERN from the package instead of keeping a second copy.
Chunk-size sweeps over the file boundary splitting (including one-byte chunks), per-file model parity between the streamed bytes and the string entry, format-patch metadata attachment, stream-leading vs in-file BOM, the unified-diff fallback, and the invalid-UTF-8 read-back.
1007982 to
4c1f6db
Compare
|
rebased, set ready to review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4c1f6db212
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| * count with `deletionLines.length` (see {@link DiffLines}). | ||
| */ | ||
| deletionLines: string[]; | ||
| deletionLines: DiffLines; |
There was a problem hiding this comment.
Convert empty-document recompute to DiffLines
When a non-partial diff is edited down to an empty document, DiffHunksRenderer.applyDocumentChange calls recomputeEmptyDocumentDiff; that helper still does diff.deletionLines.join('') and returns additionLines: ['']. With this field now a DiffLines object, .join is undefined at runtime (and the return shape is not readable by lineAt), so emptying an editable diff throws instead of preserving the caret row.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
actually a good point
|
Awesome! We are right now in the process of trying to get the editor branch merged into main (targeting next week some time). So as soon as we land that, i'll look into this closer. One kinda interesting this is we might start looking at doing some byte arena stuff for our semantic highlighting APIs, so something like this might also have a natural fit in landing with that (which is still a bit further out) |
Description
This PR brings the two proposed experiments from #760 together in one cleaner change on top of
beta-1.3.The goal is to store each file's diff lines as a per-file UTF-8 byte arena (meaning that it's just contiguous bytes in a UTF-8 byte array), and parsing the patch straight into that arena as it streams, instead of building tons of JS strings per line.
The suggested approach in this PR produces byte-identical parsed output to today's string approach
So whats in there:
string[]. Each side of a file (before/after) packs all its lines into one UTF-8Uint8Array(bytes) + a smallInt32Array(offsets) of per line offsets; so a line is justdecode(bytes[offsets[i]..offsets[i+1]])string[]to a custom typeDiffLinesdata object. So reads arelineAt(lines, i)andjoinLines(lines)instead oflines[i]andlines.join('')diff --git/@@headers to build the structure. This makes parsing noticeably faster and steadier (see benchmarks below)TextDecoder). A nice side effect on huge diffs (on my hardware, Air M2): fast-scrolling to a far-away file no longer flashes blank rows for a beat before the text paints. I'm farily sure that's because the whole tab is so much lighter now, not the decode itselfbeta-1.3from yesterday, so I think I took everything you've already merged about the parsing into accountNote: I left a bunch of
// Adapted from ...comments here and there, for functions and stuff I just adapted from before. I don't think its worth keeping but that might help for a reviewI also did some opinionated stuff about how I organized my functions and chose to represent data, please let me know if things should be changed!
Results:
The after-GC renderer footprint (of after a stream is settled for long enough), in MB, mean of 3 runs, measured on my VPS:
For the linux comparison, the peak memory usage while streaming also drops from 4259 (avg) to 2162 (avg), about half less. This is also true for the parsing time (in CPU time used), dropping from 107s to 42.5s (-2.5x).
Motivation & Context
As I said in #775, I'm getting OOM errors on diffshub with linux v6...v7's compare. This fixes it for me (and should for a bunch of different hardware), and also makes the parsing/memory usage more stable
The results of my testing are clear: for instance on linux's compare, its saving about 50% of memory usage, drastically reduces the peak memory usage, and makes the parsing ~2.5x faster. I (my clanked) made this page diffshub-bench.clemg.fr that allows you to start benchmarks on any PR of compare on 3 versions of diffshub: diffshub's main, beta branch and this proposed branch. It runs on my slow as fuck VPS but allows to compare remotely a lot of tests on the same hardware, without setup. It also has some already ran tests for you to check if you want
From my testings, I haven't found any issue on rendering the CodeView component, the docs, the editor or anything already existing. You should obviously try it for yourself because you know this much better than I do
Since you (Amadeus) mentioned that it was easier for you to test everything in one go, this PR shadows the other other one (that I should close?).
Anyway, happy to discuss the whys and the hows of this, and suggestions if you have any
Type of changes
first discussed with the dev team and they should be aware that this PR is
being opened
You must have first discussed with the dev team and they should be aware
that this PR is being opened
Checklist
contributing guidelines
bun run lint)bun run format)bun run diffs:test)How was AI used in generating this PR
I used AI to write all the tests (and reviewed them), parts of the commit messages and to rebase on top of beta-1.3 from my old experiment branch to this new one (a lot changed since)
Related issues
See: #760