# The Various Matrix Classes

PETSc provides a variety of matrix implementations, since no single
matrix format is appropriate for all problems. This section first
discusses various matrix blocking strategies and then describes the
assortment of matrix types in PETSc.

## Matrix Blocking Strategies

In today’s computers, the time to perform an arithmetic operation is
dominated by the time to move the data into position, not the time to
compute the arithmetic result. For example, the time to perform a
multiplication operation may be one clock cycle, while the time to move
the floating-point number from memory to the arithmetic unit may take 10
or more cycles. To help manage this difference in time scales,
most processors have at least three levels of memory: registers, cache,
and random access memory. (In addition, some processors have external
caches, and the complications of paging introduce another level to the
hierarchy.)

Thus, to achieve high performance, a code should first move data into
cache and from there move it into registers and use it repeatedly while
it remains in the cache or registers before returning it to main memory.
If a floating-point number is reused 50 times while it is in registers,
then the “hit” of 10 clock cycles to bring it into the register is not
important.
But if the floating-point number is used only once, the “hit”
of 10 clock cycles becomes noticeable, resulting in disappointing flop
rates.

Unfortunately, the compiler controls the use of the registers, and the
hardware controls the use of the cache. Since the user has essentially
no direct control, code must be written in such a way that the compiler
and hardware cache system can perform well. Good-quality code is then
said to respect the memory hierarchy.

The standard approach to improving hardware utilization is to use
blocking. That is, rather than working with individual elements in the
matrices, you employ blocks of elements. Since the use of implicit
methods in PDE-based simulations leads to matrices with a naturally
blocked structure (with a block size equal to the number of degrees of
freedom per cell), blocking is advantageous. The PETSc sparse matrix
representations use a variety of techniques for blocking, including the
following:

- Storing the matrices using a generic sparse matrix format, but
  storing additional information about adjacent rows with identical
  nonzero structure (so-called I-nodes); this I-node information is
  used in the key computational routines to improve performance (the
  default for the `MATSEQAIJ` and `MATMPIAIJ` formats).
- Storing the matrices using a fixed (problem-dependent) block size
  (via the `MATSEQBAIJ` and `MATMPIBAIJ` formats).
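
To make the fixed-block-size idea concrete, here is a minimal, self-contained sketch of a BAIJ-style block-CSR matrix-vector product. This is not PETSc code; the names `block_multadd` and `baij_matvec` are invented for illustration. The block size is hardwired to 3 at compile time, so the inner 3×3 loops can be fully unrolled by the compiler:

```c
#include <stddef.h>

/* Block size fixed at compile time, as in the MATSEQBAIJ idea: the
   compiler can fully unroll the 3x3 loops below. */
#define NB 3

/* y += B*x for one dense NB-by-NB block B (stored row-major). */
static void block_multadd(const double *B, const double *x, double *y)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            y[i] += B[i*NB + j] * x[j];
}

/* Block-CSR matrix-vector product y = A*x: ai/aj index block rows and
   block columns; a stores the dense NB*NB blocks contiguously. */
static void baij_matvec(int mb, const int *ai, const int *aj,
                        const double *a, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {
        for (int i = 0; i < NB; i++) y[ib*NB + i] = 0.0;
        for (int k = ai[ib]; k < ai[ib+1]; k++)
            block_multadd(a + (size_t)k*NB*NB, x + aj[k]*NB, y + ib*NB);
    }
}
```

Because `NB` is a compile-time constant, `block_multadd` compiles to straight-line code with no loop overhead, whereas a generic CSR kernel pays that overhead on every individual nonzero.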

The advantage of the first approach is that it is a minimal change from
a standard sparse matrix format and brings a large percentage of the
improvement obtained via blocking. Using a fixed block size gives the
best performance, since the code can be hardwired with that particular
size (for example, in some problems the size may be 3, in others 5, and
so on), so that the compiler will then optimize for that size, removing
the overhead of small loops entirely.

The following table presents the floating-point performance for a basic
matrix-vector product using three approaches: a basic compressed row
storage format (using the PETSc runtime options
`-mat_seqaij -mat_nounroll`); the same compressed row format using
I-nodes (with the option `-mat_seqaij`); and a fixed block size code,
with a block size of 3 for these problems (using the option
`-mat_seqbaij`). The rates were computed on one node of an older IBM
Power processor-based system, using two test matrices. The first matrix
(ARCO1), courtesy of Rick Dean of Arco, arises in multiphase flow
simulation; it has 1,501 degrees of freedom, 26,131 matrix nonzeros, a
natural block size of 3, and a small number of well terms. The second
matrix (CFD) arises in a three-dimensional Euler flow simulation and
has 15,360 degrees of freedom, 496,000 nonzeros, and a natural block
size of 5.
In addition to displaying the flop rates for matrix-vector
products, we display them for triangular solves obtained from an ILU(0)
factorization.

```{eval-rst}
+------------+------------+-------+-----------------+------------------+
| Problem    | Block size | Basic | I-node version  | Fixed block size |
+============+============+=======+=================+==================+
| Matrix-Vector Product (Mflop/sec)                                    |
+------------+------------+-------+-----------------+------------------+
| Multiphase | 3          | 27    | 43              | 70               |
+------------+------------+-------+-----------------+------------------+
| Euler      | 5          | 28    | 58              | 90               |
+------------+------------+-------+-----------------+------------------+
| Triangular Solves from ILU(0) (Mflop/sec)                            |
+------------+------------+-------+-----------------+------------------+
| Multiphase | 3          | 22    | 31              | 49               |
+------------+------------+-------+-----------------+------------------+
| Euler      | 5          | 22    | 39              | 65               |
+------------+------------+-------+-----------------+------------------+
```

These examples demonstrate that careful implementations of the basic
sequential kernels in PETSc can dramatically improve overall
floating-point performance, and users can immediately benefit from such
enhancements without altering a single line of their application codes.
Note that the speeds of the I-node and fixed block size operations are
several times that of the basic sparse implementations.

## Assorted Matrix Types

PETSc offers a variety of both sparse and dense matrix types.

### Sequential AIJ Sparse Matrices

The default matrix representation within PETSc is the general sparse AIJ
format (also called the compressed sparse row format, CSR).

### Parallel AIJ Sparse Matrices

The AIJ sparse matrix type is the default parallel matrix format;
additional implementation details are given in {cite}`petsc-efficient`.

### Sequential Block AIJ Sparse Matrices

The sequential and parallel block AIJ formats, which are extensions of
the AIJ formats described above, are intended especially for use with
multicomponent PDEs. The block variants store matrix elements by
fixed-sized dense `nb` $\times$ `nb` blocks. The stored row and column
indices begin at zero.

The routine for creating a sequential block AIJ matrix with `m` rows,
`n` columns, and a block size of `nb` is

```
MatCreateSeqBAIJ(MPI_Comm comm,int nb,int m,int n,int nz,int *nnz,Mat *A)
```

The arguments `nz` and `nnz` can be used to preallocate matrix
memory by indicating the number of *block* nonzeros per row. For good
performance during matrix assembly, preallocation is crucial; however,
you can set `nz=0` and `nnz=NULL` for PETSc to dynamically allocate
matrix memory as needed. The PETSc users manual discusses preallocation
for the AIJ format; extension to the block AIJ format is
straightforward.

Note that the routine `MatSetValuesBlocked()` can be used for more
efficient matrix assembly when using the block AIJ format.

### Parallel Block AIJ Sparse Matrices

Parallel block AIJ matrices with block size `nb` can be created with the
command `MatCreateBAIJ()`

```
MatCreateBAIJ(MPI_Comm comm,int nb,int m,int n,int M,int N,int d_nz,int *d_nnz,int o_nz,int *o_nnz,Mat *A);
```

`A` is the newly created matrix, while the arguments `m`, `n`,
`M`, and `N` indicate the number of local rows and columns and the
number of global rows and columns, respectively.
Either the local or
global parameters can be replaced with `PETSC_DECIDE`, so that PETSc
will determine them. The matrix is stored with a fixed number of rows on
each processor, given by `m`, or determined by PETSc if `m` is
`PETSC_DECIDE`.

If `PETSC_DECIDE` is not used for `m` and `n`, then you must ensure
that they are chosen to be compatible with the vectors. To do so, first
consider the product $y = A x$. The `m` that is used in
`MatCreateBAIJ()` must match the local size used in the
`VecCreateMPI()` for `y`. The `n` used must match that used as the
local size in `VecCreateMPI()` for `x`.

You can set `d_nz=0`, `o_nz=0`, `d_nnz=NULL`, and `o_nnz=NULL` for
PETSc to control dynamic allocation of matrix memory space. Analogous to
`nz` and `nnz` for the routine `MatCreateSeqBAIJ()`, these
arguments optionally specify block nonzero information for the diagonal
(`d_nz` and `d_nnz`) and off-diagonal (`o_nz` and `o_nnz`) parts of
the matrix. For a square global matrix, we define each processor’s
diagonal portion to be its local rows and the corresponding columns (a
square submatrix); each processor’s off-diagonal portion encompasses the
remainder of the local matrix (a rectangular submatrix). The PETSc users
manual gives an example of preallocation for the parallel AIJ matrix
format; extension to the block parallel AIJ case is straightforward.
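
To make the diagonal/off-diagonal split concrete, the following self-contained sketch computes the `d_nnz` and `o_nnz` counts for one process, given the block column indices of its local block rows and the range `[cstart, cend)` of block columns it owns. This is plain C, not a PETSc routine; the function name and data layout are invented for illustration:

```c
/* For each local block row, classify each block nonzero as "diagonal"
   (its block column lies in this process's owned range [cstart, cend),
   i.e. the square on-process submatrix) or "off-diagonal" (everything
   else). The resulting counts are what d_nnz and o_nnz preallocate. */
static void count_block_nnz(int nrows, const int *ncols,
                            const int *const *cols,
                            int cstart, int cend,
                            int *d_nnz, int *o_nnz)
{
    for (int i = 0; i < nrows; i++) {
        d_nnz[i] = 0;
        o_nnz[i] = 0;
        for (int k = 0; k < ncols[i]; k++) {
            if (cols[i][k] >= cstart && cols[i][k] < cend)
                d_nnz[i]++;
            else
                o_nnz[i]++;
        }
    }
}
```

Arrays filled this way would be passed as the `d_nnz` and `o_nnz` arguments of `MatCreateBAIJ()` (with `d_nz` and `o_nz` ignored when the arrays are supplied).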

### Sequential Dense Matrices

PETSc provides both sequential and parallel dense matrix formats, where
each processor stores its entries in a column-major array in the usual
Fortran style.

### Parallel Dense Matrices

The parallel dense matrices are partitioned by rows across the
processors, so that each local rectangular submatrix is stored in the
dense format described above.

## References

```{bibliography} /petsc.bib
:filter: docname in docnames
```