
[CSV-329] Fix byte tracking for supplementary delimiters #613

Open
OldTruckDriver wants to merge 2 commits into apache:master from OldTruckDriver:fix/CSV-329_trackbytes_supplementary_delimiter

Conversation

@OldTruckDriver

Contributor

[CSV-329] Fix byte tracking for supplementary delimiters

CSVParser with trackBytes enabled could throw CharacterCodingException when a multi-character delimiter contained a supplementary Unicode character. The failure happened while delimiter lookahead read a surrogate pair through ExtendedBufferedReader.read(char[]).

This change updates the byte-length accounting in ExtendedBufferedReader for char-buffer reads so that each surrogate pair is evaluated against the correct preceding character before lastChar is updated. Byte tracking therefore stays metadata-only and does not affect parsing.
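
The ordering issue can be sketched in isolation. The following is a simplified illustration, not the actual ExtendedBufferedReader code: the class name, method name, and signature are hypothetical, and it uses String.getBytes (which substitutes malformed input rather than throwing CharacterCodingException) to keep the example short. It pairs each low surrogate with the char that precedes it in the buffer, seeded with the last char of the previous read so that a pair split across two reads is still counted once.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Commons CSV.
final class SurrogateByteCount {

    /**
     * Counts the encoded bytes for buf[offset, offset + len).
     * prev is the last char of the previous read, or -1 if there was none.
     */
    static long encodedLength(char[] buf, int offset, int len, int prev, Charset charset) {
        long total = 0;
        int previous = prev;
        for (int i = offset; i < offset + len; i++) {
            char current = buf[i];
            if (Character.isHighSurrogate(current)) {
                // Defer: the bytes are counted when the matching low surrogate arrives.
            } else if (Character.isLowSurrogate(current)
                    && previous >= 0
                    && Character.isHighSurrogate((char) previous)) {
                // Encode the pair as one code point, using the preceding char.
                total += new String(new char[] {(char) previous, current}).getBytes(charset).length;
            } else {
                total += String.valueOf(current).getBytes(charset).length;
            }
            // Update the "last char" only after the bytes for current are counted.
            previous = current;
        }
        return total;
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        // 'a', U+1F600 (a surrogate pair), 'b'
        char[] whole = "a\uD83D\uDE00b".toCharArray();
        System.out.println(encodedLength(whole, 0, whole.length, -1, utf8));

        // The same text split in the middle of the surrogate pair.
        char[] first = {'a', '\uD83D'};
        char[] second = {'\uDE00', 'b'};
        long split = encodedLength(first, 0, first.length, -1, utf8)
                + encodedLength(second, 0, second.length, first[first.length - 1], utf8);
        System.out.println(split);
    }
}
```

Both calls in main should report the same total, because the second read is seeded with the high surrogate that ended the first read.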

Tests cover trackBytes=true with a multi-character delimiter containing an emoji, including byte-position tracking across records.

Tests run:

  • mvn -q -Dtest=org.apache.commons.csv.CSVParserTest#testGetBytePositionMultiCharacterDelimiterWithSupplementaryCharacter test
  • mvn -q -Dtest=org.apache.commons.csv.CSVParserTest,org.apache.commons.csv.ExtendedBufferedReaderTest test
  • mvn -q

OldTruckDriver and others added 2 commits June 19, 2026 02:20
ExtendedBufferedReader.read(char[], int, int) updated lastChar before computing the
encoded byte length, so a surrogate pair in the delimiter lookahead buffer was paired
against the post-update lastChar and threw CharacterCodingException. Count bytes before
updating lastChar, and pair each char against the preceding char in the buffer (seeded
from lastChar so pairs split across reads still count). Also fix the loop bound.

Reviewed-by: OpenAI Codex
Reviewed-by: Anthropic Claude Code
@garydgregory

Member

Jira ticket is https://issues.apache.org/jira/browse/CSV-329

@garydgregory garydgregory left a comment

Member

@OldTruckDriver
Thank you for the PR.
Please add a test to ExtendedBufferedReaderTest to help future maintenance.
TY!

2 participants