fix(thread_pool): re-read outstanding work under lock before stopping#339
Merged
sgerbino merged 1 commit intoJun 25, 2026
Merged
Conversation
testJoinDrainsWork could intermittently fail: join() returned with posted tasks still queued and never run. on_work_finished() decremented outstanding_work_ lock-free and decided to stop from that decrement alone. A worker could observe the count transiently reach zero, get preempted before taking the mutex, and then latch stop_ after more work had been posted and join() had begun waiting; join() woke and abandoned the still-outstanding work. The same hole strands a task that suspends and is resumed after the count briefly hits zero, since its run queue is empty while it is in flight. Keep outstanding_work_ atomic and lock-free on the start path, but have the worker that drives the count to zero re-read it under mutex_ before latching stop_. The re-read observes any on_work_started() that landed in the window after the lock-free decrement, so work started before the decision is never stranded; work whose count is raised after the decision is post-drain and abandoned as before. join() still blocks until the count reaches zero. Also correct the class example: a bare post() does not register outstanding work, so join() does not wait for it. Use run_async, which holds a work guard for the operation, and document the contract.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop-2 #339 +/- ##
============================================
Coverage ? 98.39%
============================================
Files ? 83
Lines ? 4234
Branches ? 0
============================================
Hits ? 4166
Misses ? 68
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report in Codecov by Harness.
🚀 New features to boost your workflow:
|
|
An automated preview of the documentation is available at https://339.capy.prtest3.cppalliance.org/index.html If more commits are pushed to the pull request, the docs will rebuild at the same URL. 2026-06-25 19:38:01 UTC |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
testJoinDrainsWork could intermittently fail: join() returned with posted tasks still queued and never run.
on_work_finished() decremented outstanding_work_ lock-free and decided to stop from that decrement alone. A worker could observe the count transiently reach zero, get preempted before taking the mutex, and then latch stop_ after more work had been posted and join() had begun waiting; join() woke and abandoned the still-outstanding work. The same hole strands a task that suspends and is resumed after the count briefly hits zero, since its run queue is empty while it is in flight.
Keep outstanding_work_ atomic and lock-free on the start path, but have the worker that drives the count to zero re-read it under mutex_ before latching stop_. The re-read observes any on_work_started() that landed in the window after the lock-free decrement, so work started before the decision is never stranded; work whose count is raised after the decision is post-drain and abandoned as before. join() still blocks until the count reaches zero.
Also correct the class example: a bare post() does not register outstanding work, so join() does not wait for it. Use run_async, which holds a work guard for the operation, and document the contract.