-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
usb/cdc: fix RP2 USB CDC TX race with cores scheduler #5391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
rdon-key
wants to merge
5
commits into
tinygo-org:dev
Choose a base branch
from
rdon-key:fix-rp2-usb-cdc-tx-race
base: dev
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+58
−20
Open
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Some comments aren't visible on the classic Files Changed page.
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is odd to me this is not guarded with a txActive compare-swap. I think my issue with this PR as it stands is it is not clear to me what txActive is guarding. From the looks of it sendFromRing can be called concurrently from kickTx and txhandler and we've added a bunch of logic in sendFromRing to deal with this. I feel like we can maybe keep sendFromRing as a simple function that does what it says and guard it from executing concurrently as I suspect that concurrent execution is not necessary for this to work correctly.
maybe we can even delete kickTx and simply add the guard at the start of sendFromRing and do a simple Store(0) immediately when sendFromRing ends it's process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds very reasonable, and less complex. What do you think @rdon-key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
txActive is not only guarding the lexical execution of sendFromRing. It represents ownership of the asynchronous TX pump.
This is the usual queue + active-flag missed-wakeup pattern: while the active flag is set, producers may enqueue more work but do not start another worker. Therefore, when the current owner is about to go idle, it must clear the active flag and then re-check the queue before returning.
In this case, sendFromRing submits one USB IN packet and then returns, but the TX pump must still be considered active while that packet is in flight. The ownership is continued by txhandler when the TX completion arrives.
That is why txhandler does not acquire txActive with CAS before calling sendFromRing: it is continuing the ownership that was acquired by kickTx.
If we put the CAS at the start of sendFromRing, the completion path would normally see txActive already set and fail to continue the pump. If we clear txActive immediately when sendFromRing returns after submitting a packet, Write could acquire it and call SendUSBInPacket again while the previous packet is still in flight, which is the race this PR is trying to avoid.
The intended ownership model:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a really hard time working through the logic. I've asked google AI to create an abstraction taking into account your approach. I do like the idea of defining a type that encapsulates the task pump functionality. Here's what the AI came up with. Let me know if any of these abstractions work. Note there are two abstractions provided, one uses a dynamic call, the other embeds the task pump logic more cleanly (second one). Let me know if this sounds like a good path forward.
https://share.google/aimode/CPm4rcMLZXj16sTsA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @soypat — I did look at both sketches.
The main issue is that they don't capture the TX pump's IRQ boundary. The TX pump isn't a run-to-completion loop: it sends one packet, then has to suspend while keeping ownership, and resume from the TX-completion IRQ (
txhandler) without re-acquiring.As written, both sketches seem to collapse that boundary: they put
Kick()/CAS intxhandler, so the pump can stall after the first packet whiletxActiveis still held. To make the abstraction actually span the IRQ boundary, it would need a separate continuation entry that skips the CAS, an explicit yield / in-flight state, and the release-with-recheck path.At that point, I don't think the abstraction is simpler than what's here now — it just spreads the same state machine across a generic type and the driver.
So for this PR I'd like to keep the scope minimal. It's a regression fix for #5377, and I'd rather land the smallest change that closes the race. If we want to revisit a TaskPump-style abstraction, I think that should be a separate follow-up issue with its own review and on-hardware testing.