# The Various Matrix Classes

PETSc provides a variety of matrix implementations, since no single
matrix format is appropriate for all problems. This section first
discusses various matrix blocking strategies and then describes the
assortment of matrix types in PETSc.

## Matrix Blocking Strategies

In today’s computers, the time to perform an arithmetic operation is
dominated by the time to move the data into position, not the time to
compute the arithmetic result. For example, the time to perform a
multiplication operation may be one clock cycle, while the time to move
the floating-point number from memory to the arithmetic unit may take 10
or more cycles. In order to help manage this difference in time scales,
most processors have at least three levels of memory: registers, cache,
and random access memory. (In addition, some processors have external
caches, and the complications of paging introduce another level to the
hierarchy.)

Thus, to achieve high performance, a code should first move data into
cache and from there move it into registers and use it repeatedly while
it remains in the cache or registers before returning it to main memory.
If a floating-point number is reused 50 times while it is in registers,
then the “hit” of 10 clock cycles to bring it into the register is not
important. But if the floating-point number is used only once, the “hit”
of 10 clock cycles becomes noticeable, resulting in disappointing flop
rates.

Unfortunately, the compiler controls the use of the registers, and the
hardware controls the use of the cache. Since the user has essentially
no direct control, code must be written in such a way that the compiler
and hardware cache system can perform well. Good-quality code is then
said to respect the memory hierarchy.

The standard approach to improving the hardware utilization is to use
blocking. That is, rather than working with individual elements in the
matrices, you employ blocks of elements. Since the use of implicit
methods in PDE-based simulations leads to matrices with a naturally
blocked structure (with a block size equal to the number of degrees of
freedom per cell), blocking is advantageous. The PETSc sparse matrix
representations use a variety of techniques for blocking, including the
following:

- Storing the matrices using a generic sparse matrix format, but
  storing additional information about adjacent rows with identical
  nonzero structure (so-called I-nodes); this I-node information is
  used in the key computational routines to improve performance (the
  default for the `MATSEQAIJ` and `MATMPIAIJ` formats).
- Storing the matrices using a fixed (problem dependent) block size
  (via the `MATSEQBAIJ` and `MATMPIBAIJ` formats).

The advantage of the first approach is that it is a minimal change from
a standard sparse matrix format and brings a large percentage of the
improvement obtained via blocking. Using a fixed block size gives the
best performance, since the code can be hardwired with that particular
size (for example, in some problems the size may be 3, in others 5, and
so on), so that the compiler will then optimize for that size, removing
the overhead of small loops entirely.
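
As a concrete illustration (plain C, not PETSc source code; the function
names are hypothetical), compare a generic small-block multiply, whose
inner loop length is unknown at compile time, with the same operation
unrolled for 3-by-3 blocks:

```c
/* y += A*x for one dense nb-by-nb block stored by rows. The trip count
   nb is a runtime value, so the compiler cannot fully unroll the loops. */
void block_mult_generic(int nb, const double *A, const double *x, double *y)
{
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            y[i] += A[i * nb + j] * x[j];
}

/* The same operation hardwired for nb = 3: no loop overhead remains, and
   the compiler can keep x[0..2] and the running sums in registers. */
void block_mult_3(const double *A, const double *x, double *y)
{
    y[0] += A[0] * x[0] + A[1] * x[1] + A[2] * x[2];
    y[1] += A[3] * x[0] + A[4] * x[1] + A[5] * x[2];
    y[2] += A[6] * x[0] + A[7] * x[1] + A[8] * x[2];
}
```

The second form is what a fixed-block-size matrix type can generate for
every block it stores, which is the source of the speedups in the table
below.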

The following table presents the floating-point performance for a basic
matrix-vector product using three approaches: a basic compressed row
storage format (using the PETSc runtime options
`-mat_seqaij -mat_nounroll`); the same compressed row format using
I-nodes (with the option `-mat_seqaij`); and a fixed block size code,
with a block size of 3 for these problems (using the option
`-mat_seqbaij`). The rates were computed on one node of an older IBM
Power processor-based system, using two test matrices. The first matrix
(ARCO1), courtesy of Rick Dean of Arco, arises in multiphase flow
simulation; it has 1,501 degrees of freedom, 26,131 matrix nonzeros, a
natural block size of 3, and a small number of well terms. The second
matrix (CFD) arises in a three-dimensional Euler flow simulation and
has 15,360 degrees of freedom, 496,000 nonzeros, and a natural block
size of 5. In addition to displaying the flop rates for matrix-vector
products, we display them for triangular solves obtained from an ILU(0)
factorization.

```{eval-rst}
+------------+------------+-------+-----------------+------------------+
| Problem    | Block size | Basic | I-node version  | Fixed block size |
+============+============+=======+=================+==================+
| Matrix-Vector Product (Mflop/sec)                                    |
+------------+------------+-------+-----------------+------------------+
| Multiphase | 3          | 27    | 43              | 70               |
+------------+------------+-------+-----------------+------------------+
| Euler      | 5          | 28    | 58              | 90               |
+------------+------------+-------+-----------------+------------------+
| Triangular Solves from ILU(0) (Mflop/sec)                            |
+------------+------------+-------+-----------------+------------------+
| Multiphase | 3          | 22    | 31              | 49               |
+------------+------------+-------+-----------------+------------------+
| Euler      | 5          | 22    | 39              | 65               |
+------------+------------+-------+-----------------+------------------+
```

These examples demonstrate that careful implementations of the basic
sequential kernels in PETSc can dramatically improve overall
floating-point performance, and users can immediately benefit from such
enhancements without altering a single line of their application codes.
Note that the speeds of the I-node and fixed block operations are
several times that of the basic sparse implementations.

## Assorted Matrix Types

PETSc offers a variety of both sparse and dense matrix types.

### Sequential AIJ Sparse Matrices

The default matrix representation within PETSc is the general sparse AIJ
format (also called the compressed sparse row format, CSR).
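
As a rough sketch (plain C, not the actual PETSc data structure), CSR
stores the nonzeros of each row contiguously, together with their column
indices and an array of row offsets; a matrix-vector product then reads:

```c
/* y = A*x for an m-row matrix in compressed sparse row (CSR) form:
   the nonzeros of row i are val[rowptr[i] .. rowptr[i+1]-1], with
   matching column indices in colind[]. */
void csr_matvec(int m, const int *rowptr, const int *colind,
                const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];
        y[i] = sum;
    }
}
```

The I-node optimization mentioned above keys off adjacent rows of such a
structure that share identical `colind` patterns.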

### Parallel AIJ Sparse Matrices

The AIJ sparse matrix type is the default parallel matrix format;
additional implementation details are given in {cite}`petsc-efficient`.

### Sequential Block AIJ Sparse Matrices

The sequential and parallel block AIJ formats, which are extensions of
the AIJ formats described above, are intended especially for use with
multicomponent PDEs. The block variants store matrix elements by
fixed-sized dense `nb` $\times$ `nb` blocks. The stored row and column
indices begin at zero.

The routine for creating a sequential block AIJ matrix with `m` rows,
`n` columns, and a block size of `nb` is

```
MatCreateSeqBAIJ(MPI_Comm comm,int nb,int m,int n,int nz,int *nnz,Mat *A)
```

The arguments `nz` and `nnz` can be used to preallocate matrix
memory by indicating the number of *block* nonzeros per row. For good
performance during matrix assembly, preallocation is crucial; however,
you can set `nz=0` and `nnz=NULL` for PETSc to dynamically allocate
matrix memory as needed. The PETSc users manual discusses preallocation
for the AIJ format; extension to the block AIJ format is
straightforward.

Note that the routine `MatSetValuesBlocked()` can be used for more
efficient matrix assembly when using the block AIJ format.
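
The layout can be sketched in plain C (again, an illustration rather
than PETSc's actual internals; all names here are made up): a block AIJ
matrix is simply a CSR structure over `nb`-by-`nb` dense blocks instead
of scalars.

```c
/* y = A*x for a matrix of mb block rows with dense nb-by-nb blocks.
   browptr[] and bcolind[] index blocks, exactly as rowptr/colind index
   scalars in ordinary CSR; each block is stored row-major in val[]. */
void baij_matvec(int mb, int nb, const int *browptr, const int *bcolind,
                 const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {
        for (int i = 0; i < nb; i++)
            y[ib * nb + i] = 0.0;
        for (int k = browptr[ib]; k < browptr[ib + 1]; k++) {
            const double *blk = val + k * nb * nb;    /* k-th stored block */
            const double *xb  = x + bcolind[k] * nb;  /* matching slice of x */
            for (int i = 0; i < nb; i++)
                for (int j = 0; j < nb; j++)
                    y[ib * nb + i] += blk[i * nb + j] * xb[j];
        }
    }
}
```

Because the inner two loops always run exactly `nb` times, PETSc can
provide hand-unrolled versions of them for common block sizes.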

### Parallel Block AIJ Sparse Matrices

Parallel block AIJ matrices with block size `nb` can be created with the
command `MatCreateBAIJ()`:

```
MatCreateBAIJ(MPI_Comm comm,int nb,int m,int n,int M,int N,int d_nz,int *d_nnz,int o_nz,int *o_nnz,Mat *A);
```

`A` is the newly created matrix, while the arguments `m`, `n`,
`M`, and `N` indicate the number of local rows and columns and the
number of global rows and columns, respectively. Either the local or
global parameters can be replaced with `PETSC_DECIDE`, so that PETSc
will determine them. The matrix is stored with a fixed number of rows on
each processor, given by `m`, or determined by PETSc if `m` is
`PETSC_DECIDE`.

If `PETSC_DECIDE` is not used for `m` and `n`, then you must ensure
that they are chosen to be compatible with the vectors. To do so, first
consider the matrix-vector product $y = A x$. The `m` that is used in
`MatCreateBAIJ()` must match the local size used in the
`VecCreateMPI()` for `y`, and the `n` used must match the local size
used in `VecCreateMPI()` for `x`.

You can set `d_nz=0`, `o_nz=0`, `d_nnz=NULL`, and `o_nnz=NULL` for
PETSc to control dynamic allocation of matrix memory space. Analogous to
`nz` and `nnz` for the routine `MatCreateSeqBAIJ()`, these
arguments optionally specify block nonzero information for the diagonal
(`d_nz` and `d_nnz`) and off-diagonal (`o_nz` and `o_nnz`) parts of
the matrix. For a square global matrix, we define each processor’s
diagonal portion to be its local rows and the corresponding columns (a
square submatrix); each processor’s off-diagonal portion encompasses the
remainder of the local matrix (a rectangular submatrix). The PETSc users
manual gives an example of preallocation for the parallel AIJ matrix
format; extension to the block parallel AIJ case is straightforward.
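
A minimal sketch of computing these counts, assuming the caller already
knows each local row's (block) column indices in CSR-style arrays and
that this process owns the diagonal column range `[cstart, cend)` (the
function and array names are illustrative, not a PETSc API):

```c
/* Split each local row's nonzero count into the part whose columns fall
   in this process's diagonal range [cstart, cend) and the remainder;
   the resulting arrays can serve as d_nnz and o_nnz preallocation hints. */
void count_diag_offdiag(int nrows, const int *rowptr, const int *colind,
                        int cstart, int cend, int *d_nnz, int *o_nnz)
{
    for (int i = 0; i < nrows; i++) {
        d_nnz[i] = 0;
        o_nnz[i] = 0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            if (colind[k] >= cstart && colind[k] < cend)
                d_nnz[i]++;
            else
                o_nnz[i]++;
        }
    }
}
```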

### Sequential Dense Matrices

PETSc provides both sequential and parallel dense matrix formats, where
each processor stores its entries in a column-major array in the usual
Fortran style.
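
In column-major order, entry $(i,j)$ sits at offset `i + j*lda`, where
`lda` is the leading dimension (at least the number of rows); a
hypothetical accessor makes the convention concrete:

```c
/* Return entry (i,j) of a dense column-major array: consecutive memory
   locations run down a column, and columns start lda entries apart. */
double dense_get(const double *a, int lda, int i, int j)
{
    return a[i + j * lda];
}
```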

### Parallel Dense Matrices

The parallel dense matrices are partitioned by rows across the
processors, so that each local rectangular submatrix is stored in the
dense format described above.

## References

```{bibliography} /petsc.bib
:filter: docname in docnames
```