# The Various Matrix Classes

PETSc provides a variety of matrix implementations, since no single
matrix format is appropriate for all problems. This section first
discusses various matrix blocking strategies and then describes the
assortment of matrix types in PETSc.

## Matrix Blocking Strategies

In today’s computers, the time to perform an arithmetic operation is
dominated by the time to move the data into position, not the time to
compute the arithmetic result. For example, the time to perform a
multiplication operation may be one clock cycle, while the time to move
the floating-point number from memory to the arithmetic unit may take 10
or more cycles. To help manage this difference in time scales, most
processors have at least three levels of memory: registers, cache, and
random access memory. (In addition, some processors have external
caches, and the complications of paging introduce another level to the
hierarchy.)

Thus, to achieve high performance, a code should first move data into
cache and from there move it into registers and use it repeatedly while
it remains in the cache or registers before returning it to main memory.
If a floating-point number is reused 50 times while it is in registers,
then the “hit” of 10 clock cycles to bring it into the register is not
important. But if the floating-point number is used only once, the “hit”
of 10 clock cycles becomes noticeable, resulting in disappointing flop
rates.

Unfortunately, the compiler controls the use of the registers, and the
hardware controls the use of the cache. Since the user has essentially
no direct control, code must be written in such a way that the compiler
and hardware cache system can perform well. Good-quality code is then
said to respect the memory hierarchy.

The standard approach to improving hardware utilization is to use
blocking.
That is, rather than working with individual elements in the
matrices, you employ blocks of elements. Since the use of implicit
methods in PDE-based simulations leads to matrices with a naturally
blocked structure (with a block size equal to the number of degrees of
freedom per cell), blocking is advantageous. The PETSc sparse matrix
representations use a variety of techniques for blocking, including the
following:

- Storing the matrices using a generic sparse matrix format, but
  storing additional information about adjacent rows with identical
  nonzero structure (so-called I-nodes); this I-node information is
  used in the key computational routines to improve performance (the
  default for the `MATSEQAIJ` and `MATMPIAIJ` formats).
- Storing the matrices using a fixed (problem-dependent) block size
  (via the `MATSEQBAIJ` and `MATMPIBAIJ` formats).

The advantage of the first approach is that it is a minimal change from
a standard sparse matrix format and brings a large percentage of the
improvement obtained via blocking. Using a fixed block size gives the
best performance, since the code can be hardwired with that particular
size (for example, in some problems the size may be 3, in others 5, and
so on), so that the compiler will then optimize for that size, removing
the overhead of small loops entirely.

The following table presents the floating-point performance for a basic
matrix-vector product using three approaches: a basic compressed row
storage format (using the PETSc runtime options
`-mat_seqaij -mat_nounroll`); the same compressed row format using
I-nodes (with the option `-mat_seqaij`); and a fixed-block-size code,
with a block size of 3 for these problems (using the option
`-mat_seqbaij`). The rates were computed on one node of an older IBM
Power processor-based system, using two test matrices.
The first matrix
(ARCO1), courtesy of Rick Dean of Arco, arises in multiphase flow
simulation; it has 1,501 degrees of freedom, 26,131 matrix nonzeros, a
natural block size of 3, and a small number of well terms. The second
matrix (CFD) arises in a three-dimensional Euler flow simulation and
has 15,360 degrees of freedom, 496,000 nonzeros, and a natural block
size of 5. In addition to displaying the flop rates for matrix-vector
products, we display them for the triangular solves obtained from an
ILU(0) factorization.

```{eval-rst}
+------------+------------+-------+-----------------+------------------+
| Problem    | Block size | Basic | I-node version  | Fixed block size |
+============+============+=======+=================+==================+
| Matrix-Vector Product (Mflop/sec)                                    |
+------------+------------+-------+-----------------+------------------+
| Multiphase | 3          | 27    | 43              | 70               |
+------------+------------+-------+-----------------+------------------+
| Euler      | 5          | 28    | 58              | 90               |
+------------+------------+-------+-----------------+------------------+
| Triangular Solves from ILU(0) (Mflop/sec)                            |
+------------+------------+-------+-----------------+------------------+
| Multiphase | 3          | 22    | 31              | 49               |
+------------+------------+-------+-----------------+------------------+
| Euler      | 5          | 22    | 39              | 65               |
+------------+------------+-------+-----------------+------------------+
```

These examples demonstrate that careful implementations of the basic
sequential kernels in PETSc can dramatically improve overall
floating-point performance, and users can immediately benefit from such
enhancements without altering a single line of their application codes.
Note that the speeds of the I-node and fixed-block operations are
several times those of the basic sparse implementations.

## Assorted Matrix Types

PETSc offers a variety of both sparse and dense matrix types.

### Sequential AIJ Sparse Matrices

The default matrix representation within PETSc is the general sparse AIJ
format (also called the compressed sparse row format, CSR).

### Parallel AIJ Sparse Matrices

The AIJ sparse matrix type is the default parallel matrix format;
additional implementation details are given in {cite}`petsc-efficient`.

### Sequential Block AIJ Sparse Matrices

The sequential and parallel block AIJ formats, which are extensions of
the AIJ formats described above, are intended especially for use with
multicomponent PDEs. The block variants store matrix elements by
fixed-size dense `nb` $\times$ `nb` blocks. The stored row and column
indices begin at zero.

The routine for creating a sequential block AIJ matrix with `m` rows,
`n` columns, and a block size of `nb` is

```
MatCreateSeqBAIJ(MPI_Comm comm,int nb,int m,int n,int nz,int *nnz,Mat *A)
```

The arguments `nz` and `nnz` can be used to preallocate matrix
memory by indicating the number of *block* nonzeros per row. For good
performance during matrix assembly, preallocation is crucial; however,
you can set `nz=0` and `nnz=NULL` for PETSc to dynamically allocate
matrix memory as needed. The PETSc users manual discusses preallocation
for the AIJ format; the extension to the block AIJ format is
straightforward.

Note that the routine `MatSetValuesBlocked()` can be used for more
efficient matrix assembly when using the block AIJ format.

### Parallel Block AIJ Sparse Matrices

Parallel block AIJ matrices with block size `nb` can be created with the
command `MatCreateBAIJ()`

```
MatCreateBAIJ(MPI_Comm comm,int nb,int m,int n,int M,int N,int d_nz,int *d_nnz,int o_nz,int *o_nnz,Mat *A);
```

`A` is the newly created matrix, while the arguments `m`, `n`,
`M`, and `N` indicate the number of local rows and columns and the
number of global rows and columns, respectively. Either the local or
global parameters can be replaced with `PETSC_DECIDE`, so that PETSc
will determine them. The matrix is stored with a fixed number of rows on
each processor, given by `m`, or determined by PETSc if `m` is
`PETSC_DECIDE`.

If `PETSC_DECIDE` is not used for `m` and `n`, then you must ensure
that they are chosen to be compatible with the vectors. To do so, you
first consider the product $y = A x$. The `m` used in
`MatCreateBAIJ()` must match the local size used in the
`VecCreateMPI()` for `y`, and the `n` used must match the local size
used in `VecCreateMPI()` for `x`.

You can set `d_nz=0`, `o_nz=0`, `d_nnz=NULL`, and `o_nnz=NULL` for
PETSc to control dynamic allocation of matrix memory space. Analogous to
`nz` and `nnz` for the routine `MatCreateSeqBAIJ()`, these
arguments optionally specify block nonzero information for the diagonal
(`d_nz` and `d_nnz`) and off-diagonal (`o_nz` and `o_nnz`) parts of
the matrix. For a square global matrix, we define each processor’s
diagonal portion to be its local rows and the corresponding columns (a
square submatrix); each processor’s off-diagonal portion encompasses the
remainder of the local matrix (a rectangular submatrix). The PETSc users
manual gives an example of preallocation for the parallel AIJ matrix
format; the extension to the block parallel AIJ case is straightforward.

### Sequential Dense Matrices

PETSc provides both sequential and parallel dense matrix formats, where
each processor stores its entries in a column-major array in the usual
Fortran style.

### Parallel Dense Matrices

The parallel dense matrices are partitioned by rows across the
processors, so that each local rectangular submatrix is stored in the
dense format described above.

## References

```{bibliography} /petsc.bib
:filter: docname in docnames
```