xref: /petsc/doc/developers/matrices.md (revision a982d5546cc9bcf96044945e3157033f4bde0259)
# The Various Matrix Classes

PETSc provides a variety of matrix implementations, since no single
matrix format is appropriate for all problems. This section first
discusses various matrix blocking strategies and then describes the
assortment of matrix types in PETSc.

## Matrix Blocking Strategies

In today’s computers, the time to perform an arithmetic operation is
dominated by the time to move the data into position, not the time to
compute the arithmetic result. For example, the time to perform a
multiplication operation may be one clock cycle, while the time to move
the floating-point number from memory to the arithmetic unit may take 10
or more cycles. In order to help manage this difference in time scales,
most processors have at least three levels of memory: registers, cache,
and random access memory. (In addition, some processors have external
caches, and the complications of paging introduce another level to the
hierarchy.)

Thus, to achieve high performance, a code should first move data into
cache and from there move it into registers and use it repeatedly while
it remains in the cache or registers before returning it to main memory.
If a floating-point number is reused 50 times while it is in registers,
then the “hit” of 10 clock cycles to bring it into the register is not
important. But if the floating-point number is used only once, the “hit”
of 10 clock cycles becomes noticeable, resulting in disappointing flop
rates.

Unfortunately, the compiler controls the use of the registers, and the
hardware controls the use of the cache. Since the user has essentially
no direct control, code must be written in such a way that the compiler
and hardware cache system can perform well. Good-quality code is then
said to respect the memory hierarchy.

The standard approach to improving hardware utilization is to use
blocking. That is, rather than working with individual elements in the
matrices, you employ blocks of elements. Since the use of implicit
methods in PDE-based simulations leads to matrices with a naturally
blocked structure (with a block size equal to the number of degrees of
freedom per cell), blocking is advantageous. The PETSc sparse matrix
representations use a variety of techniques for blocking, including the
following:

- Storing the matrices using a generic sparse matrix format, but
  storing additional information about adjacent rows with identical
  nonzero structure (so-called I-nodes); this I-node information is
  used in the key computational routines to improve performance (the
  default for the `MATSEQAIJ` and `MATMPIAIJ` formats).
- Storing the matrices using a fixed (problem-dependent) block size
  (via the `MATSEQBAIJ` and `MATMPIBAIJ` formats).

The advantage of the first approach is that it is a minimal change from
a standard sparse matrix format and brings a large percentage of the
improvement obtained via blocking. Using a fixed block size gives the
best performance, since the code can be hardwired with that particular
size (for example, in some problems the size may be 3, in others 5, and
so on), so that the compiler will then optimize for that size, removing
the overhead of small loops entirely.

The following table presents the floating-point performance for a basic
matrix-vector product using three approaches: a basic compressed row
storage format (using the PETSc runtime options
`-mat_seqaij -mat_nounroll`); the same compressed row format using
I-nodes (with the option `-mat_seqaij`); and a fixed block size code
(using the option `-mat_seqbaij`), with the block size hardwired to the
natural block size of each problem. The rates were computed on one node
of an older IBM Power processor-based system, using two test matrices.
The first matrix (ARCO1), courtesy of Rick Dean of Arco, arises in
multiphase flow simulation; it has 1,501 degrees of freedom, 26,131
matrix nonzeros, a natural block size of 3, and a small number of well
terms. The second matrix (CFD) arises in a three-dimensional Euler flow
simulation and has 15,360 degrees of freedom, 496,000 nonzeros, and a
natural block size of 5. In addition to displaying the flop rates for
matrix-vector products, we display them for triangular solves obtained
from an ILU(0) factorization.

```{eval-rst}
+------------+------------+-------+-----------------+------------------+
| Problem    | Block size | Basic | I-node version  | Fixed block size |
+============+============+=======+=================+==================+
| Matrix-Vector Product (Mflop/sec)                                    |
+------------+------------+-------+-----------------+------------------+
| Multiphase | 3          | 27    | 43              | 70               |
+------------+------------+-------+-----------------+------------------+
| Euler      | 5          | 28    | 58              | 90               |
+------------+------------+-------+-----------------+------------------+
| Triangular Solves from ILU(0) (Mflop/sec)                            |
+------------+------------+-------+-----------------+------------------+
| Multiphase | 3          | 22    | 31              | 49               |
+------------+------------+-------+-----------------+------------------+
| Euler      | 5          | 22    | 39              | 65               |
+------------+------------+-------+-----------------+------------------+
```

These examples demonstrate that careful implementations of the basic
sequential kernels in PETSc can dramatically improve overall
floating-point performance, and users can immediately benefit from such
enhancements without altering a single line of their application codes.
Note that the speeds of the I-node and fixed block operations are
several times those of the basic sparse implementations.

## Assorted Matrix Types

PETSc offers a variety of both sparse and dense matrix types.

### Sequential AIJ Sparse Matrices

The default matrix representation within PETSc is the general sparse AIJ
format (also called the compressed sparse row format, CSR).

### Parallel AIJ Sparse Matrices

The AIJ sparse matrix type is the default parallel matrix format;
additional implementation details are given in {cite}`petsc-efficient`.

### Sequential Block AIJ Sparse Matrices

The sequential and parallel block AIJ formats, which are extensions of
the AIJ formats described above, are intended especially for use with
multicomponent PDEs. The block variants store matrix elements by
fixed-sized dense `nb` $\times$ `nb` blocks. The stored row and column
indices begin at zero.

The routine for creating a sequential block AIJ matrix with `m` rows,
`n` columns, and a block size of `nb` is

```
MatCreateSeqBAIJ(MPI_Comm comm,int nb,int m,int n,int nz,int *nnz,Mat *A)
```

The arguments `nz` and `nnz` can be used to preallocate matrix
memory by indicating the number of *block* nonzeros per row. For good
performance during matrix assembly, preallocation is crucial; however,
you can set `nz=0` and `nnz=NULL` for PETSc to dynamically allocate
matrix memory as needed. The PETSc users manual discusses preallocation
for the AIJ format; extension to the block AIJ format is
straightforward.

Note that the routine `MatSetValuesBlocked()` can be used for more
efficient matrix assembly when using the block AIJ format.

### Parallel Block AIJ Sparse Matrices

Parallel block AIJ matrices with block size `nb` can be created with the
command `MatCreateBAIJ()`

```
MatCreateBAIJ(MPI_Comm comm,int nb,int m,int n,int M,int N,int d_nz,int *d_nnz,int o_nz,int *o_nnz,Mat *A);
```

`A` is the newly created matrix, while the arguments `m`, `n`,
`M`, and `N` indicate the number of local rows and columns and the
number of global rows and columns, respectively. Either the local or
global parameters can be replaced with `PETSC_DECIDE`, so that PETSc
will determine them. The matrix is stored with a fixed number of rows on
each processor, given by `m`, or determined by PETSc if `m` is
`PETSC_DECIDE`.

If `PETSC_DECIDE` is not used for `m` and `n`, then you must ensure
that they are chosen to be compatible with the vectors. To do so, you
first consider the product $y = A x$. The `m` used in
`MatCreateBAIJ()` must match the local size used in the
`VecCreateMPI()` for `y`, and the `n` used must match the local size
used in `VecCreateMPI()` for `x`.

As with `MatCreateSeqBAIJ()`, you can set `d_nz=0`, `o_nz=0`,
`d_nnz=NULL`, and `o_nnz=NULL` so that PETSc dynamically allocates
matrix memory as needed. Analogous to `nz` and `nnz` for the routine
`MatCreateSeqBAIJ()`, these arguments optionally specify block nonzero
information for the diagonal (`d_nz` and `d_nnz`) and off-diagonal
(`o_nz` and `o_nnz`) parts of the matrix. For a square global matrix, we
define each processor’s diagonal portion to be its local rows and the
corresponding columns (a square submatrix); each processor’s
off-diagonal portion encompasses the remainder of the local matrix (a
rectangular submatrix). The PETSc users manual gives an example of
preallocation for the parallel AIJ matrix format; extension to the block
parallel AIJ case is straightforward.

### Sequential Dense Matrices

PETSc provides both sequential and parallel dense matrix formats, where
each processor stores its entries in a column-major array in the usual
Fortran style.

### Parallel Dense Matrices

The parallel dense matrices are partitioned by rows across the
processors, so that each local rectangular submatrix is stored in the
dense format described above.

## References

```{bibliography} /petsc.bib
:filter: docname in docnames
```