# The Various Matrix Classes

PETSc provides a variety of matrix implementations, since no single
matrix format is appropriate for all problems. This section first
discusses various matrix blocking strategies and then describes the
assortment of matrix types in PETSc.

## Matrix Blocking Strategies

In today’s computers, the time to perform an arithmetic operation is
dominated by the time to move the data into position, not the time to
compute the arithmetic result. For example, the time to perform a
multiplication operation may be one clock cycle, while the time to move
the floating-point number from memory to the arithmetic unit may take 10
or more cycles. To help manage this difference in time scales,
most processors have at least three levels of memory: registers, cache,
and random access memory. (In addition, some processors have external
caches, and the complications of paging introduce another level to the
hierarchy.)

Thus, to achieve high performance, a code should first move data into
cache and from there move it into registers and use it repeatedly while
it remains in the cache or registers before returning it to main memory.
If a floating-point number is reused 50 times while it is in registers,
then the “hit” of 10 clock cycles to bring it into the register is not
important.
But if the floating-point number is used only once, the “hit”
of 10 clock cycles becomes noticeable, resulting in disappointing flop
rates.

Unfortunately, the compiler controls the use of the registers, and the
hardware controls the use of the cache. Since the user has essentially
no direct control, code must be written in such a way that the compiler
and hardware cache system can perform well. Good-quality code is then
said to respect the memory hierarchy.

The standard approach to improving hardware utilization is to use
blocking. That is, rather than working with individual elements in the
matrices, you employ blocks of elements. Since the use of implicit
methods in PDE-based simulations leads to matrices with a naturally
blocked structure (with a block size equal to the number of degrees of
freedom per cell), blocking is advantageous. The PETSc sparse matrix
representations use a variety of techniques for blocking, including the
following:

- Storing the matrices using a generic sparse matrix format, but
  storing additional information about adjacent rows with identical
  nonzero structure (so-called I-nodes); this I-node information is
  used in the key computational routines to improve performance (the
  default for the `MATSEQAIJ` and `MATMPIAIJ` formats).
- Storing the matrices using a fixed (problem-dependent) block size
  (via the `MATSEQBAIJ` and `MATMPIBAIJ` formats).
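
To make the fixed-block-size idea concrete, here is a minimal, self-contained sketch of a BAIJ-style block-CSR matrix-vector product. This is not PETSc code; the names `block_multadd` and `baij_matvec` are invented for illustration. The block size is hardwired to 3 at compile time, so the inner 3×3 loops can be fully unrolled by the compiler:

```c
#include <stddef.h>

/* Block size fixed at compile time, as in the MATSEQBAIJ idea: the
   compiler can fully unroll the 3x3 loops below. */
#define NB 3

/* y += B*x for one dense NB-by-NB block B (stored row-major). */
static void block_multadd(const double *B, const double *x, double *y)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            y[i] += B[i*NB + j] * x[j];
}

/* Block-CSR matrix-vector product y = A*x: ai/aj index block rows and
   block columns; a stores the dense NB*NB blocks contiguously. */
static void baij_matvec(int mb, const int *ai, const int *aj,
                        const double *a, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {
        for (int i = 0; i < NB; i++) y[ib*NB + i] = 0.0;
        for (int k = ai[ib]; k < ai[ib+1]; k++)
            block_multadd(a + (size_t)k*NB*NB, x + aj[k]*NB, y + ib*NB);
    }
}
```

Because `NB` is a compile-time constant, `block_multadd` compiles to straight-line code with no loop overhead, whereas a generic CSR kernel pays that overhead on every individual nonzero.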

The advantage of the first approach is that it is a minimal change from
a standard sparse matrix format and brings a large percentage of the
improvement obtained via blocking. Using a fixed block size gives the
best performance, since the code can be hardwired with that particular
size (for example, in some problems the size may be 3, in others 5, and
so on), so that the compiler will then optimize for that size, removing
the overhead of small loops entirely.

The following table presents the floating-point performance for a basic
matrix-vector product using three approaches: a basic compressed row
storage format (using the PETSc runtime options
`-mat_seqaij -mat_nounroll`); the same compressed row format using
I-nodes (with the option `-mat_seqaij`); and a fixed block size code,
with a block size of 3 for these problems (using the option
`-mat_seqbaij`). The rates were computed on one node of an older IBM
Power processor-based system, using two test matrices. The first matrix
(ARCO1), courtesy of Rick Dean of Arco, arises in multiphase flow
simulation; it has 1,501 degrees of freedom, 26,131 matrix nonzeros, a
natural block size of 3, and a small number of well terms. The second
matrix (CFD) arises in a three-dimensional Euler flow simulation and
has 15,360 degrees of freedom, 496,000 nonzeros, and a natural block
size of 5.
In addition to displaying the flop rates for matrix-vector
products, we display them for triangular solves obtained from an ILU(0)
factorization.

```{eval-rst}
+------------+------------+-------+-----------------+------------------+
| Problem    | Block size | Basic | I-node version  | Fixed block size |
+============+============+=======+=================+==================+
| Matrix-Vector Product (Mflop/sec)                                    |
+------------+------------+-------+-----------------+------------------+
| Multiphase | 3          | 27    | 43              | 70               |
+------------+------------+-------+-----------------+------------------+
| Euler      | 5          | 28    | 58              | 90               |
+------------+------------+-------+-----------------+------------------+
| Triangular Solves from ILU(0) (Mflop/sec)                            |
+------------+------------+-------+-----------------+------------------+
| Multiphase | 3          | 22    | 31              | 49               |
+------------+------------+-------+-----------------+------------------+
| Euler      | 5          | 22    | 39              | 65               |
+------------+------------+-------+-----------------+------------------+
```

These examples demonstrate that careful implementations of the basic
sequential kernels in PETSc can dramatically improve overall
floating-point performance, and users can immediately benefit from such
enhancements without altering a single line of their application codes.
Note that the speeds of the I-node and fixed block size operations are
several times that of the basic sparse implementations.

## Assorted Matrix Types

PETSc offers a variety of both sparse and dense matrix types.

### Sequential AIJ Sparse Matrices

The default matrix representation within PETSc is the general sparse AIJ
format (also called the compressed sparse row format, CSR).

### Parallel AIJ Sparse Matrices

The AIJ sparse matrix type is the default parallel matrix format;
additional implementation details are given in {cite}`petsc-efficient`.

### Sequential Block AIJ Sparse Matrices

The sequential and parallel block AIJ formats, which are extensions of
the AIJ formats described above, are intended especially for use with
multicomponent PDEs. The block variants store matrix elements by
fixed-sized dense `nb` $\times$ `nb` blocks. The stored row and column
indices begin at zero.

The routine for creating a sequential block AIJ matrix with `m` rows,
`n` columns, and a block size of `nb` is

```
MatCreateSeqBAIJ(MPI_Comm comm,int nb,int m,int n,int nz,int *nnz,Mat *A)
```

The arguments `nz` and `nnz` can be used to preallocate matrix
memory by indicating the number of *block* nonzeros per row. For good
performance during matrix assembly, preallocation is crucial; however,
you can set `nz=0` and `nnz=NULL` for PETSc to dynamically allocate
matrix memory as needed. The PETSc users manual discusses preallocation
for the AIJ format; extension to the block AIJ format is
straightforward.

Note that the routine `MatSetValuesBlocked()` can be used for more
efficient matrix assembly when using the block AIJ format.

### Parallel Block AIJ Sparse Matrices

Parallel block AIJ matrices with block size `nb` can be created with the
command `MatCreateBAIJ()`

```
MatCreateBAIJ(MPI_Comm comm,int nb,int m,int n,int M,int N,int d_nz,int *d_nnz,int o_nz,int *o_nnz,Mat *A);
```

`A` is the newly created matrix, while the arguments `m`, `n`,
`M`, and `N` indicate the number of local rows and columns and the
number of global rows and columns, respectively.
Either the local or
global parameters can be replaced with `PETSC_DECIDE`, so that PETSc
will determine them. The matrix is stored with a fixed number of rows on
each processor, given by `m`, or determined by PETSc if `m` is
`PETSC_DECIDE`.

If `PETSC_DECIDE` is not used for `m` and `n`, then you must ensure
that they are chosen to be compatible with the vectors. To do so, first
consider the product $y = A x$. The `m` that is used in
`MatCreateBAIJ()` must match the local size used in the
`VecCreateMPI()` for `y`. The `n` used must match that used as the
local size in `VecCreateMPI()` for `x`.

You can set `d_nz=0`, `o_nz=0`, `d_nnz=NULL`, and `o_nnz=NULL` for
PETSc to control dynamic allocation of matrix memory space. Analogous to
`nz` and `nnz` for the routine `MatCreateSeqBAIJ()`, these
arguments optionally specify block nonzero information for the diagonal
(`d_nz` and `d_nnz`) and off-diagonal (`o_nz` and `o_nnz`) parts of
the matrix. For a square global matrix, we define each processor’s
diagonal portion to be its local rows and the corresponding columns (a
square submatrix); each processor’s off-diagonal portion encompasses the
remainder of the local matrix (a rectangular submatrix). The PETSc users
manual gives an example of preallocation for the parallel AIJ matrix
format; extension to the block parallel AIJ case is straightforward.
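
To make the diagonal/off-diagonal split concrete, the following self-contained sketch computes the `d_nnz` and `o_nnz` counts for one process, given the block column indices of its local block rows and the range `[cstart, cend)` of block columns it owns. This is plain C, not a PETSc routine; the function name and data layout are invented for illustration:

```c
/* For each local block row, classify each block nonzero as "diagonal"
   (its block column lies in this process's owned range [cstart, cend),
   i.e. the square on-process submatrix) or "off-diagonal" (everything
   else). The resulting counts are what d_nnz and o_nnz preallocate. */
static void count_block_nnz(int nrows, const int *ncols,
                            const int *const *cols,
                            int cstart, int cend,
                            int *d_nnz, int *o_nnz)
{
    for (int i = 0; i < nrows; i++) {
        d_nnz[i] = 0;
        o_nnz[i] = 0;
        for (int k = 0; k < ncols[i]; k++) {
            if (cols[i][k] >= cstart && cols[i][k] < cend)
                d_nnz[i]++;
            else
                o_nnz[i]++;
        }
    }
}
```

Arrays filled this way would be passed as the `d_nnz` and `o_nnz` arguments of `MatCreateBAIJ()` (with `d_nz` and `o_nz` ignored when the arrays are supplied).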

### Sequential Dense Matrices

PETSc provides both sequential and parallel dense matrix formats, where
each processor stores its entries in a column-major array in the usual
Fortran style.

### Parallel Dense Matrices

The parallel dense matrices are partitioned by rows across the
processors, so that each local rectangular submatrix is stored in the
dense format described above.

## References

```{bibliography} /petsc.bib
:filter: docname in docnames
```