Skip to content

Keep abbreviate and truncate off surrogate pair boundaries#1719

Merged
garydgregory merged 2 commits into
apache:masterfrom
alhudz:abbreviate-truncate-surrogate-split
Jun 20, 2026
Merged

Keep abbreviate and truncate off surrogate pair boundaries#1719
garydgregory merged 2 commits into
apache:masterfrom
alhudz:abbreviate-truncate-surrogate-split

Conversation

@alhudz

@alhudz alhudz commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Repro: StringUtils.abbreviate("😀abcdef", 4) returns a lone high surrogate followed by ..., and StringUtils.truncate("a😀b", 0, 2) returns a plus a lone high surrogate.
Cause: both choose the cut position in char units and slice with String.substring, so a boundary that lands between the two halves of a surrogate pair leaves an unpaired surrogate behind.
Fix: nudge the boundary to the nearest code-point edge before slicing, so a pair is never split. The result still never exceeds maxWidth.

@garydgregory garydgregory changed the title keep abbreviate and truncate off surrogate pair boundaries Keep abbreviate and truncate off surrogate pair boundaries Jun 19, 2026
@garydgregory

Copy link
Copy Markdown
Member

@alhudz Please address #1718 (review)

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alhudz
Thank you for your PR.
Please see my 2 comments.

assertEquals(grin, StringUtils.truncate("a" + grin + "b", 1, 2));
assertEquals("ab", StringUtils.truncate("ab" + grin, 0, 3));
// an offset that lands inside a pair skips the orphaned low surrogate
assertEquals("a", StringUtils.truncate(grin + "ab", 1, 2));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're missing assertions for an input string that only contains the grin, with permutations of the other parameters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. truncate(grin, ...) is now covered across offset/maxWidth permutations: the whole pair when it fits (0,2, 0,3), and empty when the cut or the offset lands inside it (0,0, 0,1, 1,1, 1,2, 2,2). Never a lone surrogate.

}
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're missing assertions for an input string that only contains the grin, with permutations of the other parameters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added abbreviate(grin, ...) across the width, offset and marker permutations. The pair is kept whole in the normal cases, and the empty-marker case (grin, "", 0, 1) exercises the head cut backing off to empty instead of emitting a lone surrogate.

@alhudz

alhudz commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

The 1718 review is already handled, I pushed the POSITIVE_INFINITY/NEGATIVE_INFINITY tests for double and float there. For this PR I added the grin-only assertions across the parameter permutations and replied inline on both threads.

@garydgregory garydgregory left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @alhudz

Thank you for this PR.

With this PR applied locally, and with the assertions in org.apache.commons.lang3.StringUtilsAbbreviateTest.testEmoji() commented in, that test fails.

Would you please review testEmoji() failures in light of this PR.

Thank you!

@alhudz

alhudz commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

Had a look at testEmoji(). With the expectedResultsFox/expectedResultsFamilyWithCodepoints assertions commented in, it fails both before and after this PR, so they have never been green on master:

abbreviate("🦊…", 4)

  • before: <lone high surrogate>... (the malformed output this PR targets)
  • this PR: ...
  • testEmoji expects: 🦊...

The gap is the contract. testEmoji counts maxWidth in code points: the marker is 3, then maxWidth - 3 whole code points (width 4 → 1 fox, width 5 → 2 foxes, and the family case counts each skin-tone modifier and ZWJ as one). So 🦊... is 5 chars but 4 code points. This PR keeps the documented char-based contract (result.length() <= maxWidth) and only nudges the cut to a code-point boundary, which is always shorter, so it drops the partial emoji rather than keeping it whole and over the numeric width.

So this change is strictly the lone-surrogate fix. Making testEmoji pass is the larger LANG-1770 job of re-basing abbreviate on code points, which redefines what maxWidth means and the length() <= maxWidth guarantee. Happy to take that on as the actual LANG-1770 fix if you want it in this PR, or keep this one scoped to removing the malformed surrogate. Which would you prefer?

@garydgregory

Copy link
Copy Markdown
Member

@alhudz
We can take it one task at time. I'll merge this PR and you can continue in another.
TY!

@garydgregory garydgregory merged commit 75d85a8 into apache:master Jun 20, 2026
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants