The cublasStrsm() function allows us to solve linear systems in parallel, but only when a triangular matrix is involved.
For the general case, we need to perform an LU decomposition of the matrix beforehand.
Blocked LU decomposition works like standard LU decomposition, but instead of applying the algorithm to individual scalar entries, each step is performed on block matrices.
Let's start with the LU decomposition of a sample matrix before introducing the blocked LU decomposition algorithm.

Standard LU decomposition:
$$(1)\quad \begin{bmatrix}8&2&9\\4&9&4\\6&7&9\end{bmatrix}=\begin{bmatrix}1&0&0\\L_{21}&1&0\\L_{31}&L_{32}&1\end{bmatrix}\begin{bmatrix}U_{11}&U_{12}&U_{13}\\0&U_{22}&U_{23}\\0&0&U_{33}\end{bmatrix}$$
$$8=\begin{bmatrix}1&0&0\end{bmatrix}\begin{bmatrix}U_{11}&0&0\end{bmatrix}^T \Leftrightarrow U_{11}=8$$

$$2=\begin{bmatrix}1&0&0\end{bmatrix}\begin{bmatrix}U_{12}&U_{22}&0\end{bmatrix}^T \Leftrightarrow U_{12}=2$$

$$9=\begin{bmatrix}1&0&0\end{bmatrix}\begin{bmatrix}U_{13}&U_{23}&U_{33}\end{bmatrix}^T \Leftrightarrow U_{13}=9$$

$$4=\begin{bmatrix}L_{21}&1&0\end{bmatrix}\begin{bmatrix}U_{11}&0&0\end{bmatrix}^T \Leftrightarrow L_{21}=4/U_{11}=1/2$$

$$6=\begin{bmatrix}L_{31}&L_{32}&0\end{bmatrix}\begin{bmatrix}U_{11}&0&0\end{bmatrix}^T \Leftrightarrow L_{31}=6/U_{11}=3/4$$
$$(2)\quad \begin{bmatrix}8&2&9\\4&9&4\\6&7&9\end{bmatrix}=\begin{bmatrix}1&0&0\\0.5&1&0\\0.75&L_{32}&1\end{bmatrix}\begin{bmatrix}8&2&9\\0&U_{22}&U_{23}\\0&0&U_{33}\end{bmatrix}$$
The remaining unknowns live in the trailing $2 \times 2$ sub-matrix:

$$A_1=\begin{bmatrix}9&4\\7&9\end{bmatrix}\qquad L_1=\begin{bmatrix}1&0\\L_{32}&1\end{bmatrix}\qquad U_1=\begin{bmatrix}U_{22}&U_{23}\\0&U_{33}\end{bmatrix}$$

$$A_1=\begin{bmatrix}0.5&0.75\end{bmatrix}^T\begin{bmatrix}2&9\end{bmatrix}+L_1 U_1$$

$$\Leftrightarrow L_1 U_1 = A_1-\begin{bmatrix}0.5&0.75\end{bmatrix}^T\begin{bmatrix}2&9\end{bmatrix}$$

$$\Leftrightarrow L_1 U_1 = \begin{bmatrix}8&-0.5\\5.5&2.25\end{bmatrix}$$
We now apply the same process to the product $L_1 U_1 = \begin{bmatrix}8&-0.5\\5.5&2.25\end{bmatrix}$, and so on recursively until the remaining product $L_n U_n$ is a $1 \times 1$ matrix. For our example this gives:

$$\begin{bmatrix}8&2&9\\4&9&4\\6&7&9\end{bmatrix}=\begin{bmatrix}1&0&0\\0.5&1&0\\0.75&0.6875&1\end{bmatrix}\begin{bmatrix}8&2&9\\0&8&-0.5\\0&0&2.59375\end{bmatrix}$$
Note that this process only works when every pivot encountered is non-zero: $A_{ii} \neq 0\ \forall i$. This is why the implementation below uses partial pivoting.
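Before moving on, the result above is easy to verify on the CPU. The snippet below (a small verification helper, not part of the original GPU code) multiplies the computed factors and checks that they reproduce the original matrix exactly:

```cpp
#include <cassert>
#include <cmath>

// Verify that the L and U factors derived step by step above
// reproduce A = L * U (all values are exact in binary floating point).
inline bool exampleFactorsAreValid() {
    const double A[3][3] = {{8, 2, 9}, {4, 9, 4}, {6, 7, 9}};
    const double L[3][3] = {{1, 0, 0}, {0.5, 1, 0}, {0.75, 0.6875, 1}};
    const double U[3][3] = {{8, 2, 9}, {0, 8, -0.5}, {0, 0, 2.59375}};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            double s = 0;
            for (int k = 0; k < 3; ++k) s += L[i][k] * U[k][j];
            if (std::fabs(s - A[i][j]) > 1e-12) return false;
        }
    return true;
}
```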
Block LU decomposition is very similar to the algorithm we just described; the only difference is that we introduce block matrices at the beginning of the process. We choose a sub-matrix block size $r$ and partition the $M \times N$ matrix into blocks, where $A_{00}$ is an $r \times r$ sub-matrix:

$$\begin{bmatrix}A_{00}&A_{01}\\A_{10}&A_{11}\end{bmatrix}=\begin{bmatrix}L_{00}&0\\L_{10}&L_{11}\end{bmatrix}\begin{bmatrix}U_{00}&U_{01}\\0&U_{11}\end{bmatrix}$$

Developing the right-hand side gives:
$$\begin{cases}(1)\;A_{00}=L_{00}U_{00}\\(2)\;A_{10}=L_{10}U_{00}\\(3)\;A_{01}=L_{00}U_{01}\\(4)\;A_{11}=L_{10}U_{01}+L_{11}U_{11}\end{cases}$$
From (1) and (2), $L_{00}$, $U_{00}$ and $L_{10}$ are obtained by applying the standard LU decomposition to the $M \times r$ panel formed by $A_{00}$ and $A_{10}$. From (3) we get:

$$U_{01}=L_{00}^{-1}A_{01}$$

and from (4):

$$A_{11}'=A_{11}-L_{10}U_{01}=L_{11}U_{11}$$
Finally, $L_{11}$ and $U_{11}$ are obtained by applying the same decomposition recursively to the $(M-r) \times (N-r)$ matrix $A_{11}'$.
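To make the block identities concrete, here is a small CPU sketch (hypothetical helpers, not part of the original post) that factors a $4 \times 4$ matrix with plain Doolittle LU, without pivoting, and then checks the Schur-complement relation (4) for a block size $r = 2$:

```cpp
#include <cmath>
#include <vector>

// Doolittle LU without pivoting (row-major; assumes non-zero pivots).
inline void doolittleLU(const std::vector<double>& A, int n,
                        std::vector<double>& L, std::vector<double>& U) {
    L.assign(n * n, 0.0);
    U.assign(n * n, 0.0);
    std::vector<double> W = A;                    // working copy, gets eliminated
    for (int k = 0; k < n; ++k) {
        for (int j = k; j < n; ++j) U[k * n + j] = W[k * n + j];
        L[k * n + k] = 1.0;
        for (int i = k + 1; i < n; ++i) {
            L[i * n + k] = W[i * n + k] / W[k * n + k];
            for (int j = k; j < n; ++j)
                W[i * n + j] -= L[i * n + k] * W[k * n + j];
        }
    }
}

// Check identity (4): A11 = L10*U01 + L11*U11 for a given block size r.
inline bool schurIdentityHolds(const std::vector<double>& A, int n, int r) {
    std::vector<double> L, U;
    doolittleLU(A, n, L, U);
    for (int i = r; i < n; ++i)
        for (int j = r; j < n; ++j) {
            double l10u01 = 0, l11u11 = 0;
            for (int k = 0; k < r; ++k) l10u01 += L[i * n + k] * U[k * n + j];
            for (int k = r; k < n; ++k) l11u11 += L[i * n + k] * U[k * n + j];
            if (std::fabs(l10u01 + l11u11 - A[i * n + j]) > 1e-9) return false;
        }
    return true;
}
```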
*(Figure: the blocked decomposition step, repeated on the successive trailing sub-matrices $A_{11}'$.)*
```cpp
//***************************************************************************************
// void DecomposeLU( int M, int N, int lda, float* A,
//                   int* P, float epsilon, InfoStat& stat)
//
// M       : Number of rows of A
// N       : Number of columns of A
// A       : Float matrix of size M*N (column-major, on the device);
//           on output, contains the result of the LU decomposition.
//           The diagonal elements of L are not stored in A (they are all 1)
// lda     : Leading dimension of A, lda >= std::max(1,M)
// P       : Permutation vector of size M
// epsilon : Epsilon (used to test for singularities)
// stat    : Return status
//***************************************************************************************
void DecomposeLU(int M, int N, int lda, float* A, int* P, float epsilon, InfoStat& stat)
{
    cublasStatus cuStat;

    // Preconditions
    if (M <= 0 || N <= 0 || lda < std::max(1, M))
    {
        stat._info = -1;
        if (M <= 0)
            stat._str = "M<=0";
        if (N <= 0)
            stat._str = "N<=0";
        if (lda < std::max(1, M))
            stat._str = "lda < std::max(1,M)";
        return;
    }

    int minDim = std::min(M, N);
    for (int k = 0; k < minDim - 1; k++)
    {
        // cublasIsamax returns a 1-based index, hence the k-1 offset:
        // pivotRow is relative to the whole matrix
        int pivotRow = k - 1 + cublasIsamax(M - k, A + k + k * lda, 1);
        int kp1 = k + 1;
        P[k] = pivotRow;
        if (pivotRow != k)
        {
            cublasSswap(N, A + pivotRow, lda, A + k, lda);
        }

        float valcheck;
        cublasGetVector(1, sizeof(float), A + k + k * lda, 1, &valcheck, 1);
        if (fabs(valcheck) < epsilon)
        {
            stat._info = k + 1;
            stat._str = " Matrix is Singular ";
            return;
        }

        if (kp1 < M)
        {
            // Scale the sub-diagonal part of column k by 1/pivot
            cublasSscal(M - kp1, 1.0f / valcheck, A + kp1 + k * lda, 1);
        }
        if (kp1 < minDim)
        {
            // Rank-1 update of the trailing sub-matrix
            cublasSger(M - kp1, N - kp1, -1.0f, A + kp1 + k * lda, 1,
                       A + k + kp1 * lda, lda, A + kp1 * lda + kp1, lda);
        }
    }
    CHECK_CUBLAS("decomposeLU pb");
}
```
As a further improvement, it would be nice to avoid the cublasGetVector() call, which performs a device-to-host transfer inside the loop. One solution would be to write this function as a kernel instead of using Cublas. Any other ideas are welcome!
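For testing purposes, a plain CPU mirror of the routine is also handy: it follows the same steps (isamax pivot search, row swap, column scale, rank-1 update) on column-major data, so its output can be compared against the cuBLAS version on small inputs. This is only a sketch, not part of the original post:

```cpp
#include <algorithm>
#include <cmath>
#include <utility>

// CPU mirror of DecomposeLU (column-major, partial pivoting).
// Returns false if a pivot smaller than epsilon is found (singular matrix).
inline bool decomposeLUHost(int M, int N, int lda, float* A, int* P, float epsilon) {
    int minDim = std::min(M, N);
    for (int k = 0; k < minDim - 1; ++k) {
        // isamax equivalent: row with the largest |A(i,k)| for i in [k, M)
        int pivotRow = k;
        for (int i = k + 1; i < M; ++i)
            if (std::fabs(A[i + k * lda]) > std::fabs(A[pivotRow + k * lda]))
                pivotRow = i;
        P[k] = pivotRow;
        if (pivotRow != k)                          // swap full rows k and pivotRow
            for (int j = 0; j < N; ++j)
                std::swap(A[k + j * lda], A[pivotRow + j * lda]);
        float pivot = A[k + k * lda];
        if (std::fabs(pivot) < epsilon) return false;
        for (int i = k + 1; i < M; ++i)             // scale the column below the pivot
            A[i + k * lda] /= pivot;
        for (int i = k + 1; i < M; ++i)             // rank-1 update of trailing block
            for (int j = k + 1; j < N; ++j)
                A[i + j * lda] -= A[i + k * lda] * A[k + j * lda];
    }
    return true;
}
```

Running it on the worked 3×3 example from the beginning of the post reproduces exactly the factors derived by hand.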
We can now use DecomposeLU() while performing the block decomposition:

```cpp
//***************************************************************************************
// void DecomposeBlockedLU( int M, int N, int lda,
//                          float* A,
//                          int* P, int blockSize, float epsilon, InfoStat& stat)
//
// M         : Number of rows of A
// N         : Number of columns of A
// A         : Float matrix of size M*N (column-major, on the device);
//             on output, contains the result of the LU decomposition.
//             The diagonal elements of L are not stored in A (they are all 1)
// lda       : Leading dimension of A, lda >= std::max(1,M)
// P         : Permutation vector of size M
// blockSize : Size of the sub-matrices;
//             if blockSize >= M or blockSize == 1, the unblocked decomposition is called
// epsilon   : Epsilon (used to test for singularities)
// stat      : Return status
//***************************************************************************************
void DecomposeBlockedLU(int M, int N, int lda,
                        float* A,
                        int* P, int blockSize, float epsilon, InfoStat& stat)
{
    cublasStatus cuStat;

    // Preconditions
    if (M <= 0 || N <= 0 || lda < std::max(1, M))
    {
        stat._info = -1;
        if (M <= 0)
            stat._str = "M<=0";
        if (N <= 0)
            stat._str = "N<=0";
        if (lda < std::max(1, M))
            stat._str = "lda < std::max(1,M)";
        return;
    }

    int minSize = std::min(M, N);
    if (blockSize > minSize || blockSize == 1)
    {
        // Straight LU decomposition
        DecomposeLU(M, N, lda, A, P, epsilon, stat);
    }
    else
    {
        // Blocked decomposition
        for (int i = 0; i < minSize; i += blockSize)
        {
            int realBlockSize = std::min(minSize - i, blockSize);

            // Decompose the current rectangular panel (A00 stacked on A10)
            DecomposeLU(M - i, realBlockSize, lda, A + i + i * lda, P + i, epsilon, stat);

            // Adjust pivot infos
            // Todo: write a kernel for that
            for (int p = i; p < std::min(M, i + realBlockSize) - 1; p++)
            {
                P[p] = P[p] + i;
                if (P[p] != p)
                {
                    // Apply interchanges to columns 0:i
                    cublasSswap(i, A + p, lda, A + P[p], lda);
                    // Apply interchanges to columns i+blockSize:N
                    cublasSswap(N - i - realBlockSize, A + p + (i + realBlockSize) * lda, lda,
                                A + P[p] + (i + realBlockSize) * lda, lda);
                }
            }

            // Compute the block row of U:  U01 = L00^-1 * A01
            cublasStrsm('l', 'l', 'n', 'u', realBlockSize, N - i - realBlockSize, 1.0f,
                        A + i + i * lda, lda, A + i + (i + realBlockSize) * lda, lda);
            CHECK_CUBLAS("decomposeBlockedLU cublasStrsm");

            // Update the trailing sub-matrix:  A11' = A11 - L10 * U01
            if (i + realBlockSize < M)
            {
                cublasSgemm('n', 'n', M - i - realBlockSize, N - i - realBlockSize, realBlockSize,
                            -1.0f,
                            A + i + realBlockSize + i * lda, lda,
                            A + i + (realBlockSize + i) * lda, lda,
                            1.0f,
                            A + i + realBlockSize + (realBlockSize + i) * lda, lda);
                CHECK_CUBLAS("decomposeBlockedLU cublasSgemm");
            }
        }
    }
}
```
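To make the structure of the blocked loop explicit without the cuBLAS calls, here is a CPU sketch of the same scheme for a square matrix, without pivoting (a hypothetical helper for illustration only): step 1 is the panel factorization, step 2 the triangular solve (the trsm), and step 3 the Schur-complement update (the gemm):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// CPU sketch of the blocked scheme (column-major, square matrix, NO pivoting).
// Overwrites A with its combined L\U factors, like the GPU routine.
inline void blockedLUHost(int n, float* A, int blockSize) {
    for (int i = 0; i < n; i += blockSize) {
        int r = std::min(n - i, blockSize);
        // 1) Unblocked LU of the (n-i) x r panel formed by A00 and A10
        for (int k = i; k < i + r; ++k) {
            for (int row = k + 1; row < n; ++row)
                A[row + k * n] /= A[k + k * n];
            for (int col = k + 1; col < i + r; ++col)
                for (int row = k + 1; row < n; ++row)
                    A[row + col * n] -= A[row + k * n] * A[k + col * n];
        }
        // 2) trsm: U01 = L00^-1 * A01 (forward substitution, unit lower triangular)
        for (int col = i + r; col < n; ++col)
            for (int k = i; k < i + r; ++k)
                for (int row = k + 1; row < i + r; ++row)
                    A[row + col * n] -= A[row + k * n] * A[k + col * n];
        // 3) gemm: A11' = A11 - L10 * U01 (Schur-complement update)
        for (int col = i + r; col < n; ++col)
            for (int k = i; k < i + r; ++k)
                for (int row = i + r; row < n; ++row)
                    A[row + col * n] -= A[row + k * n] * A[k + col * n];
    }
}
```

With blockSize equal to the matrix size, steps 2 and 3 are empty and the routine degenerates into the plain unblocked decomposition, mirroring the fallback in DecomposeBlockedLU.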
Performance improvements:
For a random 10,000×10,000 matrix, DecomposeBlockedLU runs in about 3 seconds on my Quadro FX 4800, versus 98 seconds if we use DecomposeLU alone.
This code is used as a basis for Radial Basis Function interpolation computed on the GPU. I'm still eager to get better performance, and I'm considering writing some parts of it as a Cuda kernel. For any suggestion, or if you know of any alternative implementation available, feel free to drop me a mail or a comment.
Hello,
Thanks for this code.
Under which license is it released ?
thanks
S
Hi Sylvestre
This code is released under BSD license.
Cheers.
Hello,
I need to use a function like this in order to solve a linear system whose matrix doesn't fit in the device memory.
Do you think it is possible to use the same approach for this task?
Do you have any suggestions for me?
Thanks a lot in advance for your help
Best regards
Andrea
Hi Andrea,
Yes, I think it would not be too difficult to adapt the code to a hybrid GPU/CPU solution.
Start with a CPU blocked LU (choosing a block size that fits in the device memory), then solve each block using the GPU blocked LU as shown in this post.
You can implement a CPU blocked LU using LAPACK.
In that implementation, instead of calling DecomposeLU(), the CPU blocked LU would call the GPU blocked LU.
Cheers,
Hi,
First, thanks for your article – it’s really clear and helpful.
I think I've found a small mistake in the code, though: in DecomposeBlockedLU(), line 67, shouldn't the first dimension argument be "M-i-realBlockSize" (and not "N-i-realBlockSize", which appears twice)?
# cublasSgemm('n','n', M-i-realBlockSize, N-i-realBlockSize, realBlockSize,
Best regards,
Clément
Hi Clement,
Thanks for your feedback.
You are perfectly right, it is # cublasSgemm('n','n', M-i-realBlockSize, N-i-realBlockSize, realBlockSize
I updated the article.
Cheers.
F.
Hi,
another small question about the pivoting: in the code, it doesn’t seem to me that P[minDim] is ever set – is that normal?
It’s up to the client code to initialize the vector with a sequence (0,1,2 … ,M-1)
OK. Anyway, unless I'm mistaken, using the resulting P as a permutation matrix does not require ever considering the last component (swapping the rows iteratively, from 0 to M-2, should put the M rows into their "correct" positions). Or am I mistaken?
It sounds good. If B is a "column" device vector of size M, you can do:

```cpp
for (int i = 0; i < M-1; i++)
    if (i != P[i]) cublasSswap(1, B+i, 1, B+P[i], 1);
```
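Building on that loop, the complete solve can be sketched on the CPU as follows (row-major storage here, and a hypothetical helper name; on the GPU the two substitutions would map to cublasStrsm calls):

```cpp
#include <algorithm>
#include <cmath>
#include <utility>

// Solve A x = b given the combined L\U factors and the pivot vector P:
// apply the row swaps to b, then forward substitution with the unit-lower L,
// then back substitution with U. The solution overwrites b.
inline void luSolveHost(int M, const double* LU, const int* P, double* b) {
    for (int i = 0; i < M - 1; ++i)             // apply P, as in the loop above
        if (P[i] != i) std::swap(b[i], b[P[i]]);
    for (int i = 1; i < M; ++i)                 // L y = P b  (unit diagonal)
        for (int k = 0; k < i; ++k)
            b[i] -= LU[i * M + k] * b[k];
    for (int i = M - 1; i >= 0; --i) {          // U x = y
        for (int k = i + 1; k < M; ++k)
            b[i] -= LU[i * M + k] * b[k];
        b[i] /= LU[i * M + i];
    }
}
```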