Skip to content

Test QR rules with CUDA#241

Open
kshyatt wants to merge 8 commits into
mainfrom
ksh/cuqr
Open

Test QR rules with CUDA#241
kshyatt wants to merge 8 commits into
mainfrom
ksh/cuqr

Conversation

@kshyatt

@kshyatt kshyatt commented Jun 1, 2026

Copy link
Copy Markdown
Member

No description provided.

@kshyatt

kshyatt commented Jun 4, 2026

Copy link
Copy Markdown
Member Author

OK, this is now working for QR on full CuMatrix, but there are still scalar indexing errors for Diagonal because dispatch for SubArray{T, 2, Diagonal{T, CuVector{T}} is awful. I'd be ok with merging this as-is if the pullback modifications look ok, and then come back for another swing at the Diagonal case.

@kshyatt kshyatt requested review from Jutho and lkdvos and removed request for Jutho June 4, 2026 15:48
Comment thread src/pullbacks/qr.jl Outdated

@lkdvos lkdvos left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to introduce a project_triu(!) and/or norm_triu function for this, and just overload the GPU ones to allocate?

As a side note, we might at some point try and not generate any of the gauge dependency checks if the tolerance is set to 0, as a way to disable the warnings, or alternatively put the entire thing in a @warn block so we can disable that, since these checks are starting to feel a bit expensive if we need to start allocating.

@kshyatt

kshyatt commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

Would it make sense to introduce a project_triu(!) and/or norm_triu function for this, and just overload the GPU ones to allocate?

Yeah I think that could be a nice way to handle it, however I'm unsure if it's that helpful if it's only used in these checks.

@codecov

codecov Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines Coverage Δ
src/pullbacks/qr.jl 97.00% <100.00%> (+1.44%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread src/pullbacks/qr.jl Outdated
Comment thread src/pullbacks/qr.jl

Q₁ = view(Q, :, 1:p)
R₁₁ = UpperTriangular(view(R, 1:p, 1:p))
R₁₁ = UpperTriangular(R[1:p, 1:p])

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a subtle and impactful change. The UpperTriangular wrapper is really only necessary to enable the rdiv! call below. If GPUs cannot deal with UpperTriangular of a view of a GPUArray, then maybe we need to call the corresponding BLAS/LAPACK methods directly, or have some intermediate wrapper like rdiv_uppertriangular!.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If GPUs cannot deal with UpperTriangular of a view of a GPUArray

Indeed they can't :(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I even wonder how rdiv!(::Matrix, ::UpperTriangular) is evaluated on the GPU, since you need cuSOLVERDx to access TRSM.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW cuSOLVERDx is only for device side code, so it can only be called by running CUDA kernels, not from host side code as we are doing here...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, this is internally in dividing using \ by a general matrix, which is then evaluated by computing its QR decomposition and then directly calling cuBLAS.trsm! on the triangular factor. Here we already have the triangular factor, so I indeed want to call trsm!, but using generic code, which is why I was using ldiv!/rdiv!. And I don't see where rdiv!(first_arg, second_arg::UpperTrangiular) is then actually lowered to cuBLAS.trsm! in the CuArray case (and why that only works for a pure CuMatrix and not for a view over it.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To unblock this: shall we indeed create a helper rdiv!_uppertriangular! that for now just avoids the copy on the CPU and simply copies on the GPU with a # TODO: dispatch to trsm! directly

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But Then I don't understand why it doesn't work. The argument B::StridedCuMatrix in https://github.com/JuliaGPU/CUDA.jl/blob/fbb90981cbde21d979087ad518a510f5b38f95b3/lib/cublas/src/linalg.jl#L443 should accept a view of a CuMatrix, no? My apologies for being annoying, my lack of access to a GPU to test these things myself makes me ask these questions.

@kshyatt

kshyatt commented Jun 30, 2026

Copy link
Copy Markdown
Member Author

The Enzyme windows fail is happening on main and seems unrelated

Comment thread src/pullbacks/qr.jl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants