LU Matrix Decomposition in parallel with CUDA

The cublasStrsm() function allows us to solve linear systems in parallel, but only when triangular matrices are involved.
For the general case, we need to perform an LU decomposition of the matrix beforehand.

For now, the Cublas library lacks this feature, but almost all the Level 3 BLAS functions needed for a blocked LU decomposition are already available in Cublas.
Blocked LU decomposition works like standard LU decomposition, but instead of operating on individual matrix entries, each step is performed on block matrices.

Let's start with the LU decomposition of a sample matrix before introducing the blocked LU decomposition algorithm.

Standard LU decomposition:

(1)   \begin{bmatrix} 8 & 2 & 9 \\ 4 & 9 & 4 \\ 6 & 7 & 9 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ L_{21} & 1 & 0 \\ L_{31} & L_{32} & 1 \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} & U_{13} \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{bmatrix}

From that we can easily compute the first row of U and the first column of L:

8 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} U_{11} & 0 & 0 \end{bmatrix}^T \Leftrightarrow U_{11} = 8

2 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} U_{12} & U_{22} & 0 \end{bmatrix}^T \Leftrightarrow U_{12} = 2

9 = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} U_{13} & U_{23} & U_{33} \end{bmatrix}^T \Leftrightarrow U_{13} = 9

4 = \begin{bmatrix} L_{21} & 1 & 0 \end{bmatrix} \begin{bmatrix} U_{11} & 0 & 0 \end{bmatrix}^T \Leftrightarrow L_{21} = 4/U_{11} = 1/2

6 = \begin{bmatrix} L_{31} & L_{32} & 1 \end{bmatrix} \begin{bmatrix} U_{11} & 0 & 0 \end{bmatrix}^T \Leftrightarrow L_{31} = 6/U_{11} = 3/4

We can now rewrite (1):

(2)   \begin{bmatrix} 8 & 2 & 9 \\ 4 & 9 & 4 \\ 6 & 7 & 9 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ 0.75 & L_{32} & 1 \end{bmatrix} \begin{bmatrix} 8 & 2 & 9 \\ 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{bmatrix}

If we regroup terms by blocks:

\left[\begin{array}{c|cc} 8 & 2 & 9 \\ \hline 4 & 9 & 4 \\ 6 & 7 & 9 \end{array}\right] = \left[\begin{array}{c|cc} 1 & 0 & 0 \\ \hline 0.5 & 1 & 0 \\ 0.75 & L_{32} & 1 \end{array}\right] \left[\begin{array}{c|cc} 8 & 2 & 9 \\ \hline 0 & U_{22} & U_{23} \\ 0 & 0 & U_{33} \end{array}\right]

If we name: A_1 = \begin{bmatrix} 9 & 4 \\ 7 & 9 \end{bmatrix},   L_1 = \begin{bmatrix} 1 & 0 \\ L_{32} & 1 \end{bmatrix},   U_1 = \begin{bmatrix} U_{22} & U_{23} \\ 0 & U_{33} \end{bmatrix}

Applying the block matrix product leads to: A_1 = \begin{bmatrix} 0.5 \\ 0.75 \end{bmatrix} \begin{bmatrix} 2 & 9 \end{bmatrix} + L_1 U_1

\Leftrightarrow L_1 U_1 = A_1 - \begin{bmatrix} 0.5 \\ 0.75 \end{bmatrix} \begin{bmatrix} 2 & 9 \end{bmatrix}

\Leftrightarrow L_1 U_1 = \begin{bmatrix} 8 & -0.5 \\ 5.5 & 2.25 \end{bmatrix}

Then, to compute L_1 and U_1, we can apply the same steps to \begin{bmatrix} 8 & -0.5 \\ 5.5 & 2.25 \end{bmatrix}.

If we repeat the process until L_n U_n is of size 1*1, we get the LU decomposition of A.

Finally:   \begin{bmatrix} 8 & 2 & 9 \\ 4 & 9 & 4 \\ 6 & 7 & 9 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ 0.75 & 0.6875 & 1 \end{bmatrix} \begin{bmatrix} 8 & 2 & 9 \\ 0 & 8 & -0.5 \\ 0 & 0 & 2.59375 \end{bmatrix}

I deliberately used a sample where we never need to divide by 0 ( A_{ii} != 0 for all i ). If that is not the case, a permutation matrix must be introduced to keep track of row pivoting.

Block LU decomposition:
Block LU decomposition is very similar to the algorithm we just described; the only difference is that we introduce the block structure at the beginning of the process.

We choose a submatrix block size r and rewrite the matrix A of size M*N as a block matrix:

A = \begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{bmatrix} = \begin{bmatrix} L_{00} & 0 \\ L_{10} & L_{11} \end{bmatrix} \begin{bmatrix} U_{00} & U_{01} \\ 0 & U_{11} \end{bmatrix}

where A_{00} is r*r.

As in the standard LU decomposition section, we can easily compute the first block row of U and the first block column of L, and apply the process iteratively to L_{11} U_{11}.

As the usual rules of matrix multiplication hold for block matrices, we can write:

(1)   A_{00} = L_{00} U_{00}
(2)   A_{10} = L_{10} U_{00}
(3)   A_{01} = L_{00} U_{01}
(4)   A_{11} = L_{10} U_{01} + L_{11} U_{11}

From Eqs. (1) and (2), we can get L_{00}, U_{00} and L_{10} by applying the LU decomposition of the previous section to the matrix of size M*r composed of A_{00} and A_{10}.

From Eq. (3) we get U_{01} = L_{00}^{-1} A_{01}, computed in practice with a triangular solve (cublasStrsm) rather than an explicit inverse.

Finally, we rearrange Eq. (4) as (5)   A_{11}' = A_{11} - L_{10} U_{01} = L_{11} U_{11}

From this equation, we see that the problem of finding L_{11} and U_{11} reduces to finding the LU factorization of the (M-r)*(N-r) matrix A_{11}'. This can be done by applying the steps outlined above to A_{11}' instead of A.

For an in-place algorithm, A is overwritten by L and U – the 1s on the diagonal of L do not need to be stored explicitly. Similarly, when A is updated by Eq. (5), this may also be done in place. After k of these steps, the first kr columns of L and the first kr rows of U have been evaluated, and only the trailing (M-kr)*(N-kr) block of A remains to be factored.

Implementation:

As stressed in the previous section, we first need to implement the standard LU decomposition.

  1. //***************************************************************************************  
  2. //void DecomposeLU( int M, int N, int lda , float* A,  
  3. //                    int* P, float epsilon, InfoStat& stat)  
  4. //  
  5. // M         :     Num of rows of A  
  6. // N         :     Num of column of A  
  7. // A         :     Float Matrix of size M*N  
  8. //                on the output contains the result of the LU decomposition  
  9. //                The diagonal elements for L are not stored in A ( assuming they are all 1)  
  10. //lda        :    Leading dim of A, lda >= std::max(1,M)  
  11. //P          :      Permutation vector of size M  
  12. //epsilon    :     Epsilon (used to test for singularities)  
  13. //stat          :  return status  
  14. // **************************************************************************************  
  15. void DecomposeLU(int M, int N, int lda , float* A, int* P, float epsilon, InfoStat& stat)  
  16. {  
  17.      cublasStatus cuStat;  
  18.      //Preconditions  
  19.      if ( M<=0 || N<=0 || lda < std::max(1,M) )  
  20.      {  
  21.           stat._info = -1;  
  22.           if (M<=0)  
  23.               stat._str = "M<=0";  
  24.           if (N<=0)  
  25.               stat._str = "N<=0";  
  26.           if (lda < std::max(1,M))  
  27.               stat._str = "lda < std::max(1,M)";  
  28.           return;  
  29.      }  
  30.      int minDim = std::min( M, N );  
  31.      for (int k=0; k<minDim-1; k++)  
  32.      {  
  33.           int pivotRow = k-1+cublasIsamax(M-k,A+k + k*lda, 1); // absolute pivot row (cublasIsamax returns a 1-based index)  
  34.           int kp1 = k+1;  
  35.           P[k] = pivotRow;  
  36.           if (pivotRow!=k)  
  37.           {  
  38.                cublasSswap(N, A+pivotRow, lda, A+k, lda);  
  39.           }  
  40.           float valcheck;  
  41.           cublasGetVector(1,sizeof(float),A+k+ k*lda, 1, &valcheck, 1);  
  42.           if (fabs(valcheck) < epsilon)  
  43.           {  
  44.                stat._info =k+1;  
  45.                stat._str = " Matrix is Singular ";  
  46.                return;  
  47.           }  
  48.           if (kp1 < M)  
  49.          {  
  50.               cublasSscal(M-kp1, 1.0f/valcheck,A+kp1+ k*lda, 1);  
  51.          }  
  52.          if ( kp1 < minDim )  
  53.          {  
  54.               cublasSger (M-kp1, N-kp1, -1.0f,A+kp1+ k*lda, 1, A+k+ kp1*lda, lda,A+ kp1*lda+kp1, lda);  
  55.          }  
  56.      }  
  57.      CHECK_CUBLAS("decomposeLU pb");  
  58. }  

As a further improvement, it would be nice to avoid the cublasGetVector() call. One solution would be to write this function as a kernel instead of using Cublas. Any other ideas are welcome!

We can now use DecomposeLU() while performing the blocked decomposition:

  1. //***************************************************************************************  
  2. //void DecomposeBlockedLU ( int M, int N,int lda,  
  3. //                          float *A,  
  4. //                          int* P, int blockSize,float epsilon, InfoStat &stat )  
  5. //  
  6. // M            :   Num of rows of A  
  7. // N            :   Num of column of A  
  8. // A            :   Float Matrix of size M*N  
  9. //                  on the output contains the result of the LU decomposition  
  10. //                  The diagonal elements for L are not stored in A ( assuming they are all 1)  
  11. //lda           :   Leading dim of A, lda >= std::max(1,M)  
      //P             :   Permutation vector of size M  
      //blockSize     :   Size of the submatrices  
      //                  if blockSize>=M || blockSize==1 unblocked decomposition is called  
  12. //epsilon       :   Epsilon (used to test for singularities)  
  13. //stat          :  return status  
  14. // **************************************************************************************  
  15. void DecomposeBlockedLU (   int M, int N,int lda,  
  16.                             float *A,  
  17.                             int* P, int blockSize,float epsilon, InfoStat &stat )  
  18. {  
  19.   
  20.     cublasStatus cuStat;  
  21.     //Preconditions  
  22.     if (M <= 0 || N <= 0 || lda < std::max(1,M) )  
  23.     {  
  24.         stat._info = -1;  
  25.         if (M<=0)  
  26.             stat._str = "M<=0";  
  27.         if (N<=0)  
  28.             stat._str = "N<=0";  
  29.         if (lda < std::max(1,M))  
  30.             stat._str = "lda < std::max(1,M)";  
                return;  
        }  
        int minSize = std::min(M,N);  
        if ( blockSize > minSize || blockSize == 1)  
  31.     {  
  32.         //straight LU decomposition  
  33.         DecomposeLU( M, N, lda, A, P, epsilon, stat);  
  34.     }  
  35.     else  
  36.     {  
  37.         //blocked decomposition  
  38.         for (int i =0; i< minSize ; i+=blockSize)  
  39.         {  
  40.             int realBlockSize  = std::min(minSize - i, blockSize);  
  41.   
  42.             //decompose the current rectangular block  
  43.             DecomposeLU( M-i, realBlockSize, lda, A+i+i*lda, P+i, epsilon, stat);  
  44.   
  45.             //adjust pivot infos  
  46.             //Todo : write a kernel for that  
  47.             for (int p = i; p< std::min( M, i+realBlockSize)-1; p++)  
  48.             {  
  49.                     P[p] = P[p]+i;  
  50.                     if (P[p] != p)  
  51.                     {  
  52.                         // Apply interchanges to columns 0:i.  
  53.                         cublasSswap(i, A+p , lda, A+ P[p], lda);  
  54.                         // Apply interchanges to columns i+blockSize:N.  
  55.                         cublasSswap(N-i-realBlockSize, A+p+(i+realBlockSize)*lda , lda, A+ P[p]+(i+realBlockSize)*lda, lda);  
  56.                     }  
  57.   
  58.             }  
  59.   
  60.             // Compute block row of U.  
  61.             cublasStrsm( 'l','l','n','u', realBlockSize, N-i-realBlockSize, 1.0f,  
  62.                          A +i +i*lda, lda, A +i + (i+realBlockSize)*lda, lda);  
  63.             CHECK_CUBLAS("decomposeBlockedLU cublasStrsm");  
  64.   
  65.             if (i+realBlockSize < M)  
  66.             {  
  67.                  cublasSgemm('n','n',  M-i-realBlockSize, N-i-realBlockSize, realBlockSize,  
  68.                              -1.0f,  
  69.                              A+i+realBlockSize+i*lda,lda,  
  70.                              A+i+(realBlockSize+i)*lda,lda,  
  71.                              1.0f,  
  72.                              A+i+realBlockSize+(realBlockSize+i)*lda,lda );  
  73.                  CHECK_CUBLAS("decomposeBlockedLU cublasSgemm");  
  74.             }  
  75.         }  
  76.     }  
  77.   
  78. }  

Performance improvements:

For a random 10 000*10 000 matrix, DecomposeBlockedLU() runs in about 3 seconds on my Quadro FX 4800, versus 98 seconds if we use DecomposeLU() alone.

This code is used as a basis for Radial Basis Function interpolation computed on the GPU. I’m still eager to get better performance. I’m considering writing some parts of it as a Cuda kernel. If you have any suggestion, or if you know of any alternative implementation, please drop me a mail or a comment.

11 responses to “LU Matrix Decomposition in parallel with CUDA”

  1. Hello,

    Thanks for this code.
    Under which license is it released ?

    thanks
    S

  2. andrea chiariello

    Hello,
    I need to use a function like this in order to solve a linear system problem with matrix dimensions that don’t fit in the device memory.

    Do you think it is possible to use the same approach for this task?

    Do you have any suggestions for me?

    Thanks a lot in advance for your help
    Best regards
    Andrea

    • Hi Andrea,

      Yes, I think it would not be too difficult to adapt the code to a hybrid GPU/CPU solution.
      Start with a CPU blocked LU (choosing a block size that fits in the device memory), then solve each block using the GPU blocked LU as shown in this post.
      You can implement a CPU blocked LU using LAPACK.
      For this implementation, instead of calling DecomposeLU(), the CPU blocked LU will call the GPU blocked LU.
      Cheers,

  3. Clément Canonne

    Hi,

    First, thanks for your article – it’s really clear and helpful.
    I think I’ve found a small mistake in the code, though: in DecomposeBlockedLU(), line 67, shouldn’t the first dimension argument be “M-i-realBlockSize” (and not “N-i-realBlockSize”, which appears twice)?

    # cublasSgemm(‘n’,’n’, M-i-realBlockSize, N-i-realBlockSize, realBlockSize,

    Best regards,

    Clément

    • Hi Clement,

      Thanks for your feedback.
      You are perfectly right, it is # cublasSgemm(‘n’,’n’, M-i-realBlockSize, N-i-realBlockSize, realBlockSize
      I updated the article.
      Cheers.
      F.

  4. Clément Canonne

    Hi,
    another small question about the pivoting: in the code, it doesn’t seem to me that P[minDim] is ever set – is that normal?

  5. Clément Canonne

    OK. Anyway, unless I’m mistaken, using the resulting P as a permutation matrix does not require ever considering the last component (swapping the rows iteratively, from 0 to M-2, should put the M rows in their “correct” positions) – or am I mistaken?

    • It sounds good. If B is a “column” device vector of size M, you can do:
      for (int i=0;i<M-1;i++)
      if(i != P[i] ) cublasSswap(1,B+i,1,B+P[i],1);
