(ch_performance)=

# Hints for Performance Tuning

This chapter provides hints on how to achieve the best performance
with PETSc, particularly on distributed-memory machines with multiple
CPU sockets per node. We focus on machine-related performance
optimization here; algorithmic aspects such as preconditioner selection
are not the focus of this section.

## Maximizing Memory Bandwidth

Most operations in PETSc deal with large datasets (typically vectors and
sparse matrices) and perform relatively few arithmetic operations for
each byte loaded or stored from global memory. Therefore, the
*arithmetic intensity*, expressed as the ratio of floating point
operations to the number of bytes loaded and stored, is usually well
below unity for typical PETSc operations. Modern CPUs, on the other
hand, are able to execute on the order of 10 floating point operations
for each byte loaded or stored. As a consequence, almost all PETSc
operations are limited by the rate at which data can be loaded or stored
(*memory bandwidth limited*) rather than by the rate of floating point
operations.

This section discusses ways to maximize the memory bandwidth achieved by
applications based on PETSc. Where appropriate, we include benchmark
results in order to provide quantitative results on typical performance
gains one can achieve through parallelization, both on a single compute
node and across nodes. In particular, we start with the answer to the
common question of why performance generally does not increase 20-fold
with a 20-core CPU.

(subsec_bandwidth_vs_processes)=

### Memory Bandwidth vs. Processes

Consider the addition of two large vectors, with the result written to a
third vector. Because there are no dependencies across the different
entries of each vector, the operation is embarrassingly parallel.

:::{figure} /images/manual/stream-results-intel.*
:alt: Memory bandwidth obtained on Intel hardware (dual socket except KNL) over the
:  number of processes used. One can get close to peak memory bandwidth with only a
:  few processes.
:name: fig_stream_intel
:width: 80.0%

Memory bandwidth obtained on Intel hardware (dual socket except KNL)
over the number of processes used. One can get close to peak memory
bandwidth with only a few processes.
:::

As {numref}`fig_stream_intel` shows, the performance gains due to
parallelization on different multi- and many-core CPUs quickly
saturate. The reason is that only a fraction of the total number of CPU
cores is required to saturate the memory channels. For example, a
dual-socket system equipped with Haswell 12-core Xeon CPUs achieves more
than 80 percent of the achievable peak memory bandwidth with only four
processes per socket (8 total), cf. {numref}`fig_stream_intel`.
Consequently, running with more than 8 MPI ranks on such a system will
not increase performance substantially. For the same reason, PETSc-based
applications usually do not benefit from hyper-threading.

PETSc provides a simple way to measure memory bandwidth for different
numbers of processes via the target `make streams` executed from
`$PETSC_DIR`. The output provides an overview of the possible speedup
one can obtain on the given machine (not necessarily a shared memory
system). For example, the following is the most relevant output obtained
on a dual-socket system equipped with two six-core CPUs with
hyperthreading:

```none
np  speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
...
21 3.82
22 3.49
23 3.79
24 3.71
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
```

On this machine, one should expect a speedup of typical memory
bandwidth-bound PETSc applications of at most 4x when running multiple
MPI ranks on the node. Most of the gains are already obtained when
running with only 4-6 ranks. Because a smaller number of MPI ranks
usually implies better preconditioners and better performance for
smaller problems, the best performance for PETSc applications may be
obtained with fewer ranks than there are physical CPU cores available.

Following the results from the above run of `make streams`, we
recommend using additional nodes instead of placing additional MPI
ranks on the nodes. In particular, weak scaling (i.e. constant load per
process, increasing the number of processes) and strong scaling
(i.e. constant total work, increasing the number of processes) studies
should keep the number of processes per node constant.

### Non-Uniform Memory Access (NUMA) and Process Placement

CPUs in nodes with more than one CPU socket are internally connected via
a high-speed fabric, cf. {numref}`fig_numa`, to enable data
exchange as well as cache coherency. Because main memory on modern
systems is connected via the integrated memory controllers on each CPU,
memory is accessed in a non-uniform way: A process running on one socket
has direct access to the memory channels of the respective CPU, whereas
requests for memory attached to a different CPU socket need to go
through the high-speed fabric. Consequently, the best aggregate memory
bandwidth on the node is obtained when the memory controllers on each
CPU are fully saturated. However, full saturation of memory channels is
only possible if the data is distributed across the different memory
channels.

:::{figure} /images/manual/numa.*
:alt: Schematic of a two-socket NUMA system. Processes should be spread across both
:  CPUs to obtain full bandwidth.
:name: fig_numa
:width: 90.0%

Schematic of a two-socket NUMA system. Processes should be spread
across both CPUs to obtain full bandwidth.
:::

Data in memory on modern machines is allocated by the operating system
based on a first-touch policy. That is, memory is not allocated at the
point of issuing `malloc()`, but at the point when the respective
memory segment is actually touched (read or write). Upon first touch,
memory is allocated on the memory channel associated with the
CPU the process is running on. Only if all memory on the respective CPU
is already in use (either allocated or as I/O cache) is memory available
through other sockets considered.
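
The first-touch policy can be illustrated with a small C sketch (the
helper name below is hypothetical, not a PETSc routine). In a real
parallel code, the touch loop would be executed by the same threads or
processes that later operate on the data, e.g. an OpenMP parallel loop
with the same schedule as the compute loops:

```c
#include <stdlib.h>

/* malloc() only reserves address space; physical pages are assigned to
   a NUMA node when they are first written.  Touching the array here
   places its pages on the memory channels of the CPU this code runs on. */
double *alloc_and_touch(size_t n)
{
  double *a = malloc(n * sizeof *a);
  if (!a) return NULL;
  /* First touch: in a threaded code, split this loop among the threads
     exactly as the later compute loops are split, so each thread's
     portion of the array lands in its local memory. */
  for (size_t i = 0; i < n; i++) a[i] = 0.0;
  return a;
}
```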

Maximum memory bandwidth can be achieved by ensuring that processes are
spread over all sockets in the respective node. For example, the
recommended placement of an 8-way parallel run on a four-socket machine
is to assign two processes to each CPU socket. To do so, one needs to
know the enumeration of cores and pass the requested information to
`mpirun`. Consider the hardware topology information returned by
`lstopo` (part of the hwloc package) for the following two-socket
machine, in which each CPU consists of six cores and supports
hyperthreading:

```none
Machine (126GB total)
  NUMANode L#0 (P#0 63GB)
    Package L#0 + L3 L#0 (15MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#12)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#13)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#14)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#15)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#16)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#17)
  NUMANode L#1 (P#1 63GB)
    Package L#1 + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#18)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#19)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#20)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#21)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#22)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#23)
```

The relevant physical processor IDs are shown in parentheses prefixed by
`P#`. Here, IDs 0 and 12 share the same physical core and have a
common L2 cache. IDs 0, 12, 1, 13, 2, 14, 3, 15, 4, 16, 5, 17 share the
same socket and have a common L3 cache.

A good placement for a run with six processes is to locate three
processes on the first socket and three processes on the second socket.
Unfortunately, mechanisms for process placement vary across MPI
implementations, so make sure to consult the manual of your MPI
implementation. The following discussion is based on how processor
placement is done with MPICH and Open MPI, where one needs to pass
`--bind-to core --map-by socket` to `mpirun`:

```console
$ mpirun -n 6 --bind-to core --map-by socket ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 000000100000000000100000
process 2 binding: 010000000000010000000000
process 3 binding: 000000010000000000010000
process 4 binding: 001000000000001000000000
process 5 binding: 000000001000000000001000
Triad:        45403.1949   Rate (MB/s)
```

In this configuration, process 0 is bound to the first physical core on
the first socket (with IDs 0 and 12), process 1 is bound to the first
core on the second socket (IDs 6 and 18), and similarly for the
remaining processes. The achieved bandwidth of 45 GB/sec is close to the
practical peak of about 50 GB/sec available on the machine. If, however,
all MPI processes are located on the same socket, memory bandwidth drops
significantly:

```console
$ mpirun -n 6 --bind-to core --map-by core ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 010000000000010000000000
process 2 binding: 001000000000001000000000
process 3 binding: 000100000000000100000000
process 4 binding: 000010000000000010000000
process 5 binding: 000001000000000001000000
Triad:        25510.7507   Rate (MB/s)
```

All processes are now mapped to cores on the same socket. As a result,
only the first memory channel is fully saturated at 25.5 GB/sec.

One must not assume that `mpirun` uses good defaults. To
demonstrate, compare the full output of `make streams` from
{any}`subsec_bandwidth_vs_processes` first, followed by
the results obtained by passing `--bind-to core --map-by socket`:

```console
$ make streams
np  speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
7 2.31
8 2.42
9 2.37
10 2.65
11 2.3
12 2.53
13 2.43
14 2.63
15 2.74
16 2.7
17 3.28
18 3.66
19 3.95
20 3.07
21 3.82
22 3.49
23 3.79
24 3.71
```

```console
$ make streams MPI_BINDING="--bind-to core --map-by socket"
np  speedup
1 1.0
2 1.59
3 2.66
4 3.5
5 3.56
6 4.23
7 3.95
8 4.39
9 4.09
10 4.46
11 4.15
12 4.42
13 3.71
14 3.83
15 4.08
16 4.22
17 4.18
18 4.31
19 4.22
20 4.28
21 4.25
22 4.23
23 4.28
24 4.22
```

For the non-optimized version shown first, the speedup obtained when
using any number of processes between 3 and 13 is essentially constant
up to fluctuations, indicating that all processes were by default
executed on the same socket. Only with 14 or more processes does the
speedup increase again. In contrast, the second run of `make streams`,
with proper processor placement, resulted in a slightly higher overall
parallel speedup (identical baselines), smaller performance
fluctuations, and more than 90 percent of peak bandwidth with only six
processes.

Machines with job submission systems such as SLURM usually provide
similar mechanisms for processor placement through options specified in
job submission scripts. Please consult the respective manuals.
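
For example, with SLURM one might request an even distribution of ranks
across nodes and bind them to cores along the following lines. Treat
this only as a sketch: the exact flags available, and their defaults,
depend on your site's SLURM version and configuration.

```console
$ srun -N 2 --ntasks-per-node=8 --cpu-bind=cores ./app
```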

#### Additional Process Placement Considerations and Details

For a typical, memory bandwidth-limited PETSc application, the primary
consideration in placing MPI processes is ensuring that processes are
evenly distributed among sockets, and hence use all available memory
channels. Increasingly complex processor designs and cache hierarchies,
however, mean that performance may also be sensitive to how processes
are bound to the resources within each socket. Performance on the
two-processor machine in the preceding example may be relatively
insensitive to such placement decisions, because one L3 cache is shared
by all cores within a NUMA domain, and each core has its own L2 and L1
caches. However, processors that are less “flat”, with more complex
hierarchies, may be more sensitive. In many AMD Opterons or the
second-generation “Knights Landing” Intel Xeon Phi, for instance, L2
caches are shared between two cores. On these processors, placing
consecutive MPI ranks on cores that share the same L2 cache may benefit
performance if the two ranks communicate frequently with each other,
because the latency between cores sharing an L2 cache may be roughly
half that of two cores not sharing one. There may be a benefit, however,
in placing consecutive ranks on cores that do not share an L2 cache,
because (if there are fewer MPI ranks than cores) this increases the
total L2 cache capacity and bandwidth available to the application.
There is a trade-off to be considered between placing processes close
together (in terms of shared resources) to optimize for efficient
communication and synchronization vs. farther apart to maximize
available resources (memory channels, caches, I/O channels, etc.), and
the best strategy will depend on the application and the software and
hardware stack.

Different process placement strategies can affect performance at least
as much as some commonly explored settings, such as compiler
optimization levels. Unfortunately, exploration of this space is
complicated by two factors: First, processor and core numberings may be
completely arbitrary, changing with BIOS version, etc., and second—as
already noted—there is no standard mechanism used by MPI implementations
(or job schedulers) to specify process affinity. To overcome the first
issue, we recommend using the `lstopo` utility of the Portable
Hardware Locality (`hwloc`) software package (which can be installed
by configuring PETSc with `--download-hwloc`) to understand the
processor topology of your machine. We cannot fully address the second
issue—consult the documentation for your MPI implementation and/or job
scheduler—but we offer some general observations on understanding
placement options:

- An MPI implementation may support a notion of *domains* in which a
  process may be pinned. A domain may simply correspond to a single
  core; however, the MPI implementation may allow a great deal of
  flexibility in specifying domains that encompass multiple cores, span
  sockets, etc. Some implementations, such as Intel MPI, provide means
  to specify whether domains should be “compact”—composed of cores
  sharing resources such as caches—or “scatter”-ed, with little resource
  sharing (possibly even spanning sockets).
- Separate from the specification of domains, MPI implementations often
  support different *orderings* in which MPI ranks should be bound to
  these domains. Intel MPI, for instance, supports “compact” ordering
  to place consecutive ranks close in terms of shared resources,
  “scatter” to place them far apart, and “bunch” to map proportionally
  to sockets while placing ranks as close together as possible within
  the sockets.
- An MPI implementation that supports process pinning should offer some
  way to view the rank assignments. Use this output in conjunction with
  the topology obtained via `lstopo` or a similar tool to determine
  if the placements correspond to something you believe is reasonable
  for your application. Do not assume that the MPI implementation is
  doing something sensible by default!

## Performance Pitfalls and Advice

This section looks into a potpourri of performance pitfalls encountered
by users in the past. Many of these pitfalls require a deeper
understanding of the system and experience to detect. The purpose of
this section is to summarize and share our experience so that these
pitfalls can be avoided in the future.

### Debug vs. Optimized Builds

PETSc’s `configure` defaults to building PETSc with debug mode
enabled. Any code development should be done in this mode, because it
provides handy debugging facilities such as accurate stack traces,
memory leak checks, and memory corruption checks. Note that PETSc has no
reliable way of knowing whether a particular run is a production or
debug run. In the case that a user requests profiling information via
`-log_view`, a debug build of PETSc issues the following warning:

```none
##########################################################
#                                                        #
#                          WARNING!!!                    #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run configure                  #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################
```

Conversely, one way of checking whether a particular build of PETSc has
debugging enabled is to inspect the output of `-log_view`.

Debug mode will generally be most useful for code development if
appropriate compiler options are set to facilitate debugging. The
compiler should be instructed to generate binaries with debug symbols
(command line option `-g` for most compilers), and the optimization
level chosen should either completely disable optimizations (`-O0` for
most compilers) or enable only optimizations that do not interfere with
debugging (GCC, for instance, supports a `-Og` optimization level that
does this).
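
For instance, a debug build with debugger-friendly optimization could be
configured along these lines (a sketch: `COPTFLAGS` and `CXXOPTFLAGS`
override PETSc's default compiler optimization flags, and the `-Og`
level assumes GCC):

```console
$ ./configure --with-debugging=yes COPTFLAGS='-g -Og' CXXOPTFLAGS='-g -Og'
```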

Only once the new code is thoroughly tested and ready for production
should one disable debugging facilities by passing
`--with-debugging=no` to `configure`. One should also ensure that an
appropriate compiler optimization level is set. Note that some compilers
(e.g., Intel) default to fairly comprehensive optimization levels, while
others (e.g., GCC) default to no optimization at all. The best
optimization flags will depend on your code, the compiler, and the
target architecture, but we offer a few guidelines for finding those
that will offer the best performance:

- Most compilers have a number of optimization levels (with level n
  usually specified via `-On`) that provide a quick way to enable
  sets of several optimization flags. We suggest trying the higher
  optimization levels (the highest level is not guaranteed to produce
  the fastest executable, so some experimentation may be merited). With
  most recent processors now supporting some form of SIMD or vector
  instructions, it is important to choose a level that enables the
  compiler’s auto-vectorizer; many compilers do not enable
  auto-vectorization at lower optimization levels (e.g., GCC does not
  enable it below `-O3` and the Intel compiler does not enable it
  below `-O2`).
- For processors supporting newer vector instruction sets, such as
  Intel AVX2 and AVX-512, it is also important to direct the compiler
  to generate code that targets these processors (e.g.,
  `-march=native`); otherwise, the executables built will not
  utilize the newer instruction sets and will not take advantage of
  the vector processing units.
- Beyond choosing the optimization levels, some value-unsafe
  optimizations (such as using reciprocals of values instead of
  dividing by those values, or allowing re-association of operands in a
  series of calculations) for floating point calculations may yield
  significant performance gains. Compilers often provide flags (e.g.,
  `-ffast-math` in GCC) to enable a set of these optimizations, and
  they may be turned on when using options for very aggressive
  optimization (`-fast` or `-Ofast` in many compilers). These are
  worth exploring to maximize performance, but, if employed, it is
  important to verify that these do not cause erroneous results with
  your code, since calculations may violate the IEEE standard for
  floating-point arithmetic.

### Profiling

Users should not spend time optimizing a code until after having
determined where it spends the bulk of its time on realistically sized
problems. As discussed in detail in {any}`ch_profiling`, the
PETSc routines automatically log performance data if certain runtime
options are specified.
475*7f296bb3SBarry Smith
To obtain a summary of where and how much time is spent in different
sections of the code, use one of the following options:

- Run the code with the option `-log_view` to print a performance
  summary for various phases of the code.
- Run the code with the option `-log_mpe [logfilename]`, which
  creates a logfile of events suitable for viewing with Jumpshot (part
  of MPICH).
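
For example, a profiling run of a hypothetical application `./app` on four MPI processes could look like:

```none
mpiexec -n 4 ./app -log_view
```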

Then, focus on the sections where most of the time is spent. If you
provided your own callback routines, e.g. for residual evaluations,
search the profiling output for routines such as `SNESFunctionEval` or
`SNESJacobianEval`. If their relative time is significant (say, more
than 30 percent), consider optimizing these routines first. It is
difficult to give generic advice for optimizing callback functions; you
may start by reading performance optimization guides for your system's
hardware.

### Aggregation

Performing operations on chunks of data rather than a single element at
a time can significantly enhance performance through better cache reuse
and reduced data motion. Typical examples are:

- Insert several (many) elements of a matrix or vector at once, rather
  than looping and inserting a single value at a time. To access the
  elements of a vector repeatedly, employ `VecGetArray()` to allow
  direct manipulation of the vector elements.
- When possible, use `VecMDot()` rather than a series of calls to
  `VecDot()`.
- If you require a sequence of matrix-vector products with the same
  matrix, consider packing your vectors into a single matrix and using
  matrix-matrix multiplication.
- Users should employ a reasonable number of `PetscMalloc()` calls in
  their codes. Hundreds or thousands of memory allocations may be
  appropriate; however, if tens of thousands are being used, then
  reducing the number of `PetscMalloc()` calls may be warranted. For
  example, reusing space or allocating a large chunk and dividing it
  into pieces can yield significant savings in allocation overhead.
  {any}`sec_dsreuse` gives details.
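
The chunk-allocation idea in the last bullet can be sketched in plain C. The helper below is illustrative only (a hypothetical name; standard `malloc()` is used here for self-containment, whereas a PETSc code would use `PetscMalloc1()` and `PetscFree()`): it replaces `n` small allocations with a single large one that is carved into pieces.

```c
#include <assert.h>
#include <stdlib.h>

/* Carve one large allocation into n pieces of m doubles each.
   Returns an array of n piece pointers; all data lives in *chunk_out,
   so cleanup is two frees regardless of n. */
double **alloc_pieces(size_t n, size_t m, double **chunk_out)
{
  double  *chunk = malloc(n * m * sizeof(*chunk));
  double **piece = malloc(n * sizeof(*piece));
  if (!chunk || !piece) {
    free(chunk);
    free(piece);
    return NULL;
  }
  for (size_t i = 0; i < n; i++) piece[i] = chunk + i * m; /* one allocation serves all pieces */
  *chunk_out = chunk;
  return piece;
}
```

Freeing is then two calls (one for the pointer array, one for the chunk), independent of the number of pieces.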

Aggressive aggregation of data may result in inflexible data structures
and code that is hard to maintain. We advise users to keep these
competing goals in mind and not optimize blindly for performance alone.

(sec_symbolfactor)=

### Memory Allocation for Sparse Matrix Factorization

When symbolically factoring an AIJ matrix, PETSc has to guess how much
fill there will be. Careful use of the fill parameter in the
`MatFactorInfo` structure when calling `MatLUFactorSymbolic()` or
`MatILUFactorSymbolic()` can greatly reduce the number of mallocs and
copies required, and thus greatly improve the performance of the
factorization. One way to determine a good value for the fill parameter
is to run a program with the option `-info`. The symbolic
factorization phase will then print information such as

```none
Info:MatILUFactorSymbolic_SeqAIJ:Reallocs 12 Fill ratio:given 1 needed 2.16423
```

This indicates that the user should have used a fill estimate factor of
about 2.17 (instead of 1) to prevent the 12 required mallocs and copies.
The command line option

```none
-pc_factor_fill 2.17
```

will cause PETSc to preallocate the correct amount of space for
the factorization.

(detecting_memory_problems)=

### Detecting Memory Allocation Problems and Memory Usage

PETSc provides tools to aid in understanding PETSc memory usage and detecting problems with
memory allocation, including leaks and use of uninitialized space. Internally, PETSc uses
the routines `PetscMalloc()` and `PetscFree()` for memory allocation instead of directly calling `malloc()` and `free()`.
This allows PETSc to track its memory usage and perform error checking. Users are urged to use these routines as well when
appropriate.

- The option `-malloc_debug` turns on PETSc's extensive runtime error checking for memory corruption.
  This checking can be expensive, so it should not be used for
  production runs. The option `-malloc_test` is equivalent to `-malloc_debug`,
  but only takes effect when PETSc is configured with `--with-debugging` (the default configuration).
  We suggest setting the environment variable `PETSC_OPTIONS=-malloc_test`
  in your shell startup file so that runtime memory checking is enabled automatically while
  developing code but not when running optimized code. Using `-malloc_debug` or `-malloc_test` for large runs can slow them significantly, so we
  recommend turning them off if your code is painfully slow and you do not need the testing. In addition, you can use
  `-check_pointer_intensity 0` for long debug runs that do not need extensive memory corruption testing. This option
  is occasionally added to the `PETSC_OPTIONS` environment variable by some users.
- The option
  `-malloc_dump` will print a list of memory locations that have not been freed at the
  conclusion of a program. If all memory has been freed, no message
  is printed. Note that
  the option `-malloc_dump` activates a call to
  `PetscMallocDump()` during `PetscFinalize()`. The user can also
  call `PetscMallocDump()` elsewhere in a program.
- Another useful option
  is `-malloc_view`, which reports memory usage in all routines at the conclusion of the program.
  Note that this option
  activates logging by calling `PetscMallocViewSet()` in
  `PetscInitialize()` and then prints the log by calling
  `PetscMallocView()` in `PetscFinalize()`. The user can also call
  these routines elsewhere in a program.
- When finer granularity is
  desired, the user can call `PetscMallocGetCurrentUsage()` and
  `PetscMallocGetMaximumUsage()` for memory allocated by PETSc, or
  `PetscMemoryGetCurrentUsage()` and `PetscMemoryGetMaximumUsage()`
  for the total memory used by the program. Note that
  `PetscMemorySetGetMaximumUsage()` must be called before
  `PetscMemoryGetMaximumUsage()` (typically at the beginning of the
  program).
- The option `-memory_view` provides a high-level view of all memory usage,
  not just the memory used by `PetscMalloc()`, at the conclusion of the program.
- When running with `-log_view`, the additional option `-log_view_memory`
  causes the display of additional columns of information about how much
  memory was allocated and freed during each logged event. This is useful
  for understanding which phases of a computation require the most memory.
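
For instance, to enable memory checking automatically during development, one might add a line such as the following to a shell startup file:

```none
export PETSC_OPTIONS=-malloc_test
```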

One can also use [Valgrind](http://valgrind.org) to track memory usage and find bugs; see {any}`FAQ: Valgrind usage<valgrind>`.

(sec_dsreuse)=

### Data Structure Reuse

Data structures should be reused whenever possible. For example, if a
code often creates new matrices or vectors, there may often be a way to
reuse some of them. Very significant performance improvements can be
achieved by reusing matrix data structures with the same nonzero
pattern. If a code creates thousands of matrix or vector objects,
performance will be degraded. For example, when solving a nonlinear
problem or timestepping, reusing the matrices and their nonzero
structure for many steps when appropriate can make the code run
significantly faster.

A simple technique for saving work vectors, matrices, etc. is employing
a user-defined context. In C and C++ such a context is merely a
structure in which various objects can be stashed; in Fortran a user
context can be an integer array that contains both parameters and
pointers to PETSc objects. See
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5.c.html">SNES Tutorial ex5</a>
and
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5f90.F90.html">SNES Tutorial ex5f90</a>
for examples of user-defined application contexts in C and Fortran,
respectively.
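
As a plain-C sketch of this idea (the names here are hypothetical; in an actual PETSc application the members would be PETSc objects such as `Vec` work vectors or a `Mat`, and the routine would be a residual or Jacobian callback), such a context might look like:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-defined application context: parameters plus
   work storage that is allocated once and reused by every callback. */
typedef struct {
  double  lambda; /* problem parameter */
  double *work;   /* reusable work array */
  size_t  n;      /* length of work */
} AppCtx;

/* A callback-style routine that reuses ctx->work rather than
   allocating and freeing a temporary array on every call. */
double weighted_sum(AppCtx *ctx, const double *x)
{
  double s = 0.0;
  for (size_t i = 0; i < ctx->n; i++) {
    ctx->work[i] = ctx->lambda * x[i];
    s += ctx->work[i];
  }
  return s;
}
```

The context is created once, passed to every callback, and destroyed once, so no allocation occurs in the inner loops.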

### Numerical Experiments

PETSc users should run a variety of tests. For example, there are a
large number of options for the linear and nonlinear equation solvers in
PETSc, and different choices can make a *very* big difference in
convergence rates and execution times. PETSc employs defaults that are
generally reasonable for a wide range of problems, but clearly these
defaults cannot be best for all cases. Users should experiment with many
combinations to determine what is best for a given problem and customize
the solvers accordingly.

- Use the options `-snes_view`, `-ksp_view`, etc. (or the routines
  `KSPView()`, `SNESView()`, etc.) to view the options that have
  been used for a particular solver.
- Run the code with the option `-help` for a list of the available
  runtime commands.
- Use the option `-info` to print details about the solvers'
  operation.
- Use the PETSc monitoring discussed in {any}`ch_profiling`
  to evaluate the performance of various numerical methods.

(sec_slestips)=

### Tips for Efficient Use of Linear Solvers

As discussed in {any}`ch_ksp`, the default linear
solvers are

- uniprocess: GMRES(30) with ILU(0) preconditioning
- multiprocess: GMRES(30) with block Jacobi preconditioning, where
  there is 1 block per process, and each block is solved with ILU(0)

One should experiment to determine alternatives that may be better for
various applications. Recall that one can specify the `KSP` methods
and preconditioners at runtime via the options:

```none
-ksp_type <ksp_name> -pc_type <pc_name>
```

One can also specify a variety of runtime customizations for the
solvers, as discussed throughout the manual.

In particular, note that the default restart parameter for GMRES is 30,
which may be too small for some large-scale problems. One can alter this
parameter with the option `-ksp_gmres_restart <restart>` or by calling
`KSPGMRESSetRestart()`. {any}`sec_ksp` gives
information on setting alternative GMRES orthogonalization routines,
which may provide much better parallel performance.
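
For example, to select GMRES explicitly while increasing the restart length, one could run with:

```none
-ksp_type gmres -ksp_gmres_restart 100
```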

For elliptic problems one often obtains good performance and scalability
with multigrid solvers. Consult {any}`sec_amg` for
available options. Our experience is that GAMG works particularly well
for elasticity problems, whereas hypre does well for scalar problems.
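
For example, the algebraic multigrid preconditioners can be selected at runtime with:

```none
-pc_type gamg
```

or, if PETSc was configured with hypre,

```none
-pc_type hypre
```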

### System-Related Problems

The performance of a code can be affected by a variety of factors,
including the cache behavior, other users on the machine, etc. Below we
briefly describe some common problems and possibilities for overcoming
them.

- **Problem too large for physical memory size**: When timing a
  program, one should always leave at least a ten percent margin
  between the total memory a process is using and the physical size of
  the machine's memory. One way to estimate the amount of memory used
  by a given process is with the Unix `getrusage` system routine.
  The PETSc option `-malloc_view` reports all
  memory usage, including any Fortran arrays in an application code.
- **Effects of other users**: If other users are running jobs on the
  same physical processor nodes on which a program is being profiled,
  the timing results are essentially meaningless.
- **Overhead of timing routines on certain machines**: On certain
  machines, even calling the system clock in order to time routines is
  slow; this skews all of the flop rates and timing results. The file
  `$PETSC_DIR/src/benchmarks/PetscTime.c` (<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/benchmarks/PetscTime.c.html">source</a>)
  contains a simple test problem that will approximate the amount of
  time required to get the current time in a running program. On good
  systems it will be on the order of $10^{-6}$ seconds or less.
- **Problem too large for good cache performance**: Certain machines
  with lower memory bandwidths (slow memory access) attempt to
  compensate by having a very large cache. Thus, if a significant
  portion of an application fits within the cache, the program will
  achieve very good performance; if the code is too large, the
  performance can degrade markedly. To analyze whether this situation
  affects a particular code, one can try plotting the total flop rate
  as a function of problem size. If the flop rate decreases rapidly at
  some point, then the problem is likely too large for the cache
  size.
- **Inconsistent timings**: Inconsistent timings are likely due to
  other users on the machine, thrashing (using more virtual memory than
  is physically available), or paging in of the initial executable.
  {any}`sec_profaccuracy` provides information on
  overcoming paging overhead when profiling a code. We have found on
  all systems that if you follow all the advice above, your timings will
  be consistent to within a variation of less than five percent.