drm/msm: GPU recovery, IRQ, and state capture fixes #771
Merge Check Failed: No CR Numbers Found Error: No Change Request numbers were found. Please add Change Request numbers to your pull request description in the format CRs-Fixed: 12345 or link GitHub issues that are associated with Change Requests. |
PR #771 — validate-patchPR: #771
Final Summary
|
PR #771 — checker-log-analyzerPR: #771
Detailed report: Full report
b549e1e to d9d8de8
d9d8de8 to 62bd82f
Merge Check Failed: No Change Task Found No associated change tasks found for CR 4434857 on any of the following entities: Entities:
CR: 4434857 Please ensure the CR has a change task associated with at least one of the entities for this branch. |
62bd82f to aa1d3f7
Previously, if there was no more work to do, recover_worker would not trigger recovery and would instead rely on the GPU going to sleep and then resuming when more work was submitted. recover_worker first increments the fence of the hung ring, so if only one job was submitted to a ring and that job caused a hang, it would early out. There is no guarantee that the GPU will suspend and resume before more work is submitted, and if the GPU is in a hung state it will stay in that state and probably trigger a timeout again. Just stop checking and always recover the GPU.

Signed-off-by: Anna Maniscalco <anna.maniscalco2000@gmail.com>
Link: https://lore.kernel.org/linux-arm-msm/20260210-recovery_suspend_fix-v1-1-00ed9013da04@gmail.com/
Message-ID: <20260210-recovery_suspend_fix-v1-1-00ed9013da04@gmail.com>
Signed-off-by: Rob Clark <robin.clark@oss.qualcomm.com>
Signed-off-by: Veeresh Bagale <vbagale@hu-vbagale-hyd.qualcomm.com>
(cherry picked from commit 01a0d6c)
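The early-out described in this commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the `model_*` names, the single-ring layout, and the `always_recover` flag are hypothetical and are not the driver's real types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model; not the driver's real data structures. */
struct model_ring {
    uint32_t submitted_fence; /* last fence handed to the ring */
    uint32_t completed_fence; /* last fence the GPU has signalled */
};

struct model_gpu {
    struct model_ring ring;
    bool hung;
};

/* The GPU counts as "active" while some submitted work is still incomplete. */
static bool model_gpu_active(const struct model_gpu *gpu)
{
    return gpu->ring.completed_fence < gpu->ring.submitted_fence;
}

/* Returns true if the hardware was actually recovered. */
static bool model_recover(struct model_gpu *gpu, bool always_recover)
{
    /* Skip past the hung submit by advancing the completed fence. */
    gpu->ring.completed_fence++;

    /* Old behaviour: with no work left after the bump, skip recovery. */
    if (!always_recover && !model_gpu_active(gpu))
        return false;

    gpu->hung = false;
    return true;
}

/* One job submitted, none completed, and the GPU is hung on it. */
static struct model_gpu model_single_hung_job(void)
{
    struct model_gpu gpu = {
        .ring = { .submitted_fence = 1, .completed_fence = 0 },
        .hung = true,
    };
    return gpu;
}
```

In this model, calling `model_recover()` with `always_recover` set to false on the single-job case leaves `hung` set, which mirrors the situation the commit message describes; with the flag set to true the hardware is always reset.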
During recovery, it is not safe to retire the hung submit before we recover the GPU. Retiring the submit triggers BO free and that can result in GPU pagefaults since the GPU may be actively accessing those BOs. To fix this, retire the submits after gpu recovery is complete in recover_worker().

Fixes: 1a370be ("drm/msm: restart queued submits after hang")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-2-2caa04f7287c@oss.qualcomm.com
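The ordering issue can be sketched with a toy model in which freeing a buffer while the GPU may still touch it counts as a pagefault. All names here (`model_gpu`, `model_retire_submit`, `model_recover_hw`, `model_recover_worker`) are hypothetical, and the model assumes that recovering the hardware stops all GPU accesses.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one hung submit that references one buffer object. */
struct model_gpu {
    bool accessing_bo; /* the hung GPU may still read or write the BO */
    bool bo_mapped;    /* the BO is still mapped for the GPU */
    int pagefaults;    /* faults caused by unmapping a BO that is in use */
};

/* Retiring the submit frees (unmaps) its BO. */
static void model_retire_submit(struct model_gpu *gpu)
{
    if (gpu->accessing_bo)
        gpu->pagefaults++; /* GPU touches memory that just went away */
    gpu->bo_mapped = false;
}

/* Assumption: recovering the hardware stops every outstanding access. */
static void model_recover_hw(struct model_gpu *gpu)
{
    gpu->accessing_bo = false;
}

/* Runs both steps in the requested order and returns the fault count. */
static int model_recover_worker(bool retire_first)
{
    struct model_gpu gpu = { .accessing_bo = true, .bo_mapped = true };

    if (retire_first) {
        model_retire_submit(&gpu);
        model_recover_hw(&gpu);
    } else {
        model_recover_hw(&gpu);
        model_retire_submit(&gpu);
    }
    return gpu.pagefaults;
}
```

Only the order of the two calls differs between the branches, which is the point of the fix: the BO is released in both cases, but only after the hardware has been recovered is it safe to do so.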
The GPUCC register list for A663 is incorrect, which can cause out-of-bounds register access during GPU state capture. Update it to use the correct register ranges.

Fixes: 5773cce ("drm/msm/a6xx: Add support for A663")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-3-2caa04f7287c@oss.qualcomm.com
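To show why a wrong register list leads to out-of-bounds access, the sketch below treats a register list as a flat array of inclusive (first, last) offset pairs and checks each pair against the size of the mapped block. The helper names, the offsets, and the block size are all made up for illustration; they are not the real A663 GPUCC values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of registers in the inclusive pair that starts at index i. */
static uint32_t model_range_len(const uint32_t *pairs, size_t i)
{
    return pairs[i + 1] - pairs[i] + 1;
}

/*
 * Returns true if every (first, last) pair lies inside a block that
 * holds block_regs registers. A capture loop that walks a list failing
 * this check would read past the end of the mapped block.
 */
static bool model_list_fits(const uint32_t *pairs, size_t count,
                            uint32_t block_regs)
{
    for (size_t i = 0; i + 1 < count; i += 2) {
        if (pairs[i] > pairs[i + 1] || pairs[i + 1] >= block_regs)
            return false;
    }
    return true;
}

/* Made-up offsets; the second list runs past a 0x100-register block. */
static const uint32_t model_good_list[] = { 0x0000, 0x0004, 0x0010, 0x001f };
static const uint32_t model_bad_list[]  = { 0x0000, 0x0004, 0x0400, 0x0410 };
```

A check of this kind is not part of the commit described above; it only makes the failure mode concrete.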
A621 uses an incorrect GPUCC register list during state capture. The existing list matches A623/A663. Rename it accordingly and add a dedicated A621 GPUCC register list.

Fixes: 11cdb81 ("drm/msm/a6xx: Fix gpucc register block for A621")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-4-2caa04f7287c@oss.qualcomm.com
Once a hang is triggered by the msm_recovery test, the gpu error irq remains asserted and triggers an interrupt storm. In the worst case, this IRQ storm lands on the CPU core where the hangcheck timer is scheduled, blocking it from running. This eventually leads to CPU watchdog timeouts. To fix this, mask the gpu error irqs during msm_recovery test and enable them back during the recovery.

Fixes: 5edf275 ("drm/msm: Add debugfs to disable hw err handling")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-5-2caa04f7287c@oss.qualcomm.com
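The effect of masking can be sketched with a tiny model of a level-triggered interrupt line: as long as a status bit is set and unmasked, every poll delivers another interrupt. The names, the `MODEL_ERR_IRQS` bit layout, and the assumption that a hardware reset clears the status are hypothetical and only illustrate the mask-then-restore idea.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_ERR_IRQS 0x0000000fu /* hypothetical error IRQ bits */

struct model_gpu {
    uint32_t irq_status; /* level-triggered: stays set until the HW is reset */
    uint32_t irq_mask;   /* only bits set here reach the CPU */
};

/* Counts how many interrupts the CPU would take over a number of polls. */
static int model_count_irqs(const struct model_gpu *gpu, int polls)
{
    int delivered = 0;

    for (int i = 0; i < polls; i++)
        if (gpu->irq_status & gpu->irq_mask)
            delivered++;
    return delivered;
}

/* While the recovery test runs, keep the error IRQs away from the CPU. */
static void model_begin_recovery_test(struct model_gpu *gpu)
{
    gpu->irq_mask &= ~MODEL_ERR_IRQS;
}

/* Assumption: the reset clears the stuck status; then unmask again. */
static void model_recover(struct model_gpu *gpu)
{
    gpu->irq_status = 0;
    gpu->irq_mask |= MODEL_ERR_IRQS;
}
```

With an asserted error bit and the mask left open, every poll in the model delivers an interrupt, which is the storm; masking first brings the count to zero until recovery re-enables the bits.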
get_pid_task() increments the task reference count, but the corresponding put_task_struct() was missing in the else branch, leaking a reference on every GPU hang recovery.

Fixes: 25654a1 ("drm/msm: Update global fault counter when faulty process has already ended")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-6-2caa04f7287c@oss.qualcomm.com
Signed-off-by: Veeresh Bagale <vbagale@hu-vbagale-hyd.qualcomm.com>
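The leak pattern can be reproduced in a userspace model with a plain reference counter. `model_get_task()` and `model_put_task()` stand in for get_pid_task() and put_task_struct(); the branch condition and the `fixed` flag are hypothetical, since the commit message does not say what the if/else tests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference-counted task. */
struct model_task {
    int usage;
};

/* Like get_pid_task(): hands out the task with an extra reference. */
static struct model_task *model_get_task(struct model_task *t)
{
    t->usage++;
    return t;
}

/* Like put_task_struct(): drops one reference. */
static void model_put_task(struct model_task *t)
{
    t->usage--;
}

/*
 * Takes a reference, then handles the hang in one of two branches.
 * With fixed == false the else branch forgets to drop the reference.
 */
static void model_handle_hang(struct model_task *t, bool first_branch,
                              bool fixed)
{
    struct model_task *task = model_get_task(t);

    if (first_branch) {
        /* ... use task ... */
        model_put_task(task);
    } else {
        /* ... use task ... */
        if (fixed)
            model_put_task(task);
    }
}
```

Each pass through the unfixed else branch leaves the counter one higher than before, so the task can never be freed; with the put in both branches the count returns to its starting value.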
aa1d3f7 to 9c2fc0c
|
LGTM. The failing kernel checkers are known and can be ignored for merging.
Please say something like this: "Commit picked up from the mainline kernel. This is needed because the current fix depends on code that is present in mainline but not on the 6.18 branch, so that commit is backported as well."
|
Also, please number each commit; it is more presentable and easier on the eyes. Alternatively, there is no need to repeat the commit messages; a short description is enough.
|
Thanks, updated accordingly. |
Bug fixes for the Adreno GPU driver covering recovery correctness, IRQ handling, and state capture.
Summary of changes:
1. Always recover GPU in hang scenarios
2. Recover HW before retiring hung submit
3. Fix GPUCC register list for A663
4. Fix GPUCC register list for A621
5. Fix IRQ storm during msm_recovery test
6. Fix task_struct reference leak in recover_worker
CRs-Fixed: 4434857, 4434916