Skip to content

For sensors glink lpaicp crash#1408

Open
quic-vishsant wants to merge 3 commits into
qualcomm-linux:early/hwe/shikra/driversfrom
quic-vishsant:for-sensors-glink-lpaicp-crash
Open

For sensors glink lpaicp crash#1408
quic-vishsant wants to merge 3 commits into
qualcomm-linux:early/hwe/shikra/driversfrom
quic-vishsant:for-sensors-glink-lpaicp-crash

Conversation

@quic-vishsant

Copy link
Copy Markdown

GLINK channels can crash a remote subsystem when open close sequences races. This occurred during island mode. Island mode is a low-power state where the remote suspends IPC queue processing while continuing other execution -- queued GLINK commands are dispatched in a burst when the subsystem exits island mode.

The root cause is the absence of a formal channel state machine. The host processes GLINK_CMD_OPEN_ACK in interrupt context while CLOSE and CLOSE_ACK are deferred to a worker thread. When two close acknowledgments for the same channel arrive back-to-back, the first clears lcid to 0; the second close request goes out with lcid 0, which the remote silently ignores. The host believes the channel is closed; the remote keeps it open. On the next open attempt the remote sees a duplicate open and asserts.

The GLINK native driver has no formal channel state machine, making it
prone to race conditions during concurrent open/close sequences.

Consider a remote subsystem operating in island mode -- a low-power
state where the subsystem suspends processing of its IPC message queue
while continuing other execution. Incoming GLINK commands accumulate
and are dispatched only when the subsystem exits island mode.

The host processes GLINK_CMD_OPEN_ACK in interrupt context via
qcom_glink_native_rx(), while GLINK_CMD_OPEN, GLINK_CMD_CLOSE, and
GLINK_CMD_CLOSE_ACK are deferred to a worker thread. This asymmetry
creates a window where two close acknowledgments for the same channel,
but carrying different local channel IDs (lcids), are processed
together.

The first close acknowledgment clears the lcid to 0. Any subsequent
close request then goes out with lcid 0, which the remote silently
ignores. The host considers the channel closed while the remote keeps
it open. On the next open attempt the remote sees a duplicate open and
crashes.

Introduce an explicit channel state machine to eliminate these races:
- enum glink_channel_state covers local state: CLOSED, OPENING,
  OPENED, CLOSING.
- remote_opened flag tracks the remote endpoint's state.
- state_lock spinlock protects both state variables.
- Helper functions validate and perform state transitions atomically.

IDR entries (lcids, rcids) and the reference count are deferred until
both local and remote sides are fully closed, preventing stale IDR
lookups and use-after-free regardless of which close ordering occurs.

Signed-off-by: Vishnu Santhosh <vishnu.santhosh@oss.qualcomm.com>
Debugging channel state issues during open/close sequences currently
requires adding temporary instrumentation to the driver. Add two
tracepoints that provide persistent, zero-overhead-when-disabled
visibility into the state machine:

- qcom_glink_channel_state: fires on every local state transition,
  recording old and new states and an error flag that indicates an
  invalid transition attempt.
- qcom_glink_channel_info: fires on remote endpoint state changes,
  recording whether the remote side opened or closed.

Together these allow a complete channel lifecycle to be reconstructed
from a trace log without any driver modification.

Signed-off-by: Vishnu Santhosh <vishnu.santhosh@oss.qualcomm.com>
Wire the qcom_glink_channel_state and qcom_glink_channel_info
tracepoints into the channel state machine helpers.

Signed-off-by: Vishnu Santhosh <vishnu.santhosh@oss.qualcomm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant