(ch_performance)=

# Hints for Performance Tuning

This chapter provides hints on how to achieve the best performance
with PETSc, particularly on distributed-memory machines with multiple
CPU sockets per node. We focus on machine-related performance
optimization here; algorithmic aspects such as preconditioner selection
are not the focus of this section.

## Maximizing Memory Bandwidth

Most operations in PETSc deal with large datasets (typically vectors and
sparse matrices) and perform relatively few arithmetic operations for
each byte loaded or stored from global memory. Therefore, the
*arithmetic intensity*, expressed as the ratio of floating point
operations to the number of bytes loaded and stored, is usually well
below unity for typical PETSc operations. On the other hand, modern CPUs
are able to execute on the order of 10 floating point operations for
each byte loaded or stored. As a consequence, almost all PETSc
operations are limited by the rate at which data can be loaded or stored
(*memory bandwidth limited*) rather than by the rate of floating point
operations.
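
As a concrete illustration (assuming double precision), consider the vector addition $z = x + y$: each entry costs one floating point operation while moving three doubles (two loads and one store):

```{math}
\text{arithmetic intensity} = \frac{1~\text{flop}}{3 \times 8~\text{bytes}} \approx 0.04~\text{flops/byte}
```

This is more than two orders of magnitude below the roughly 10 flops per byte that modern CPUs can sustain, which is why the memory subsystem, not the floating point units, is the bottleneck.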

This section discusses ways to maximize the memory bandwidth achieved by
applications based on PETSc. Where appropriate, we include benchmark
results in order to quantify the typical performance
gains one can achieve through parallelization, both on a single compute
node and across nodes. In particular, we start with the answer to the
common question of why performance generally does not increase 20-fold
with a 20-core CPU.

(subsec_bandwidth_vs_processes)=

### Memory Bandwidth vs. Processes

Consider the addition of two large vectors, with the result written to a
third vector. Because there are no dependencies across the different
entries of each vector, the operation is embarrassingly parallel.

:::{figure} /images/manual/stream-results-intel.*
:alt: Memory bandwidth obtained on Intel hardware (dual socket except KNL) over the
:  number of processes used. One can get close to peak memory bandwidth with only a
:  few processes.
:name: fig_stream_intel
:width: 80.0%

Memory bandwidth obtained on Intel hardware (dual socket except KNL)
over the number of processes used. One can get close to peak memory
bandwidth with only a few processes.
:::

As {numref}`fig_stream_intel` shows, the performance gains due to
parallelization on different multi- and many-core CPUs quickly
saturate. The reason is that only a fraction of the total number of CPU
cores is required to saturate the memory channels. For example, a
dual-socket system equipped with Haswell 12-core Xeon CPUs achieves more
than 80 percent of achievable peak memory bandwidth with only four
processes per socket (8 total), cf. {numref}`fig_stream_intel`.
Consequently, running with more than 8 MPI ranks on such a system will
not increase performance substantially. For the same reason, PETSc-based
applications usually do not benefit from hyperthreading.

PETSc provides a simple way to measure memory bandwidth for different
numbers of processes via the target `make streams` executed from
`$PETSC_DIR`. The output provides an overview of the possible speedup
one can obtain on the given machine (not necessarily a shared memory
system). For example, the following is the most relevant output obtained
on a dual-socket system equipped with two six-core CPUs with
hyperthreading:

```none
np  speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
...
21 3.82
22 3.49
23 3.79
24 3.71
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
```

On this machine, one should expect a speed-up of typical memory
bandwidth-bound PETSc applications of at most 4x when running multiple
MPI ranks on the node. Most of the gains are already obtained when
running with only 4-6 ranks. Because a smaller number of MPI ranks
usually implies better preconditioners and better performance for
smaller problems, the best performance for PETSc applications may be
obtained with fewer ranks than there are physical CPU cores available.

Following the results from the above run of `make streams`, we
recommend using additional nodes instead of placing additional MPI
ranks on the nodes. In particular, weak scaling (i.e. constant load per
process, increasing the number of processes) and strong scaling
(i.e. constant total work, increasing the number of processes) studies
should keep the number of processes per node constant.
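
With Open MPI, for example, a fixed number of processes per node can be requested through the mapping policy (here eight processes on each of four nodes; `./app` is a placeholder for your application):

```console
$ mpiexec -n 32 --map-by ppr:8:node --bind-to core ./app
```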

### Non-Uniform Memory Access (NUMA) and Process Placement

CPUs in nodes with more than one CPU socket are internally connected via
a high-speed fabric, cf. {numref}`fig_numa`, to enable data
exchange as well as cache coherency. Because main memory on modern
systems is connected via the integrated memory controllers on each CPU,
memory is accessed in a non-uniform way: A process running on one socket
has direct access to the memory channels of the respective CPU, whereas
requests for memory attached to a different CPU socket need to go
through the high-speed fabric. Consequently, best aggregate memory
bandwidth on the node is obtained when the memory controllers on each
CPU are fully saturated. However, full saturation of memory channels is
only possible if the data is distributed across the different memory
channels.

:::{figure} /images/manual/numa.*
:alt: Schematic of a two-socket NUMA system. Processes should be spread across both
:  CPUs to obtain full bandwidth.
:name: fig_numa
:width: 90.0%

Schematic of a two-socket NUMA system. Processes should be spread
across both CPUs to obtain full bandwidth.
:::

Data in memory on modern machines is allocated by the operating system
based on a first-touch policy. That is, memory is not allocated at the
point of issuing `malloc()`, but at the point when the respective
memory segment is actually touched (read or write). Upon first touch,
memory is allocated on the memory channel associated with the respective
CPU the process is running on. Only if all memory on the respective CPU
is already in use (either allocated or as I/O cache) is memory available
through other sockets considered.

Maximum memory bandwidth can be achieved by ensuring that processes are
spread over all sockets in the respective node. For example, the
recommended placement of an 8-way parallel run on a four-socket machine
is to assign two processes to each CPU socket. To do so, one needs to
know the enumeration of cores and pass the requested information to
`mpiexec`. Consider the hardware topology information returned by
`lstopo` (part of the hwloc package) for the following two-socket
machine, in which each CPU consists of six cores and supports
hyperthreading:

```none
Machine (126GB total)
  NUMANode L#0 (P#0 63GB)
    Package L#0 + L3 L#0 (15MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#12)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#13)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#14)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#15)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#16)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#17)
  NUMANode L#1 (P#1 63GB)
    Package L#1 + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#18)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#19)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#20)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#21)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#22)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#23)
```

The relevant physical processor IDs are shown in parentheses prefixed by
`P#`. Here, IDs 0 and 12 share the same physical core and have a
common L2 cache. IDs 0, 12, 1, 13, 2, 14, 3, 15, 4, 16, 5, 17 share the
same socket and have a common L3 cache.

A good placement for a run with six processes is to locate three
processes on the first socket and three processes on the second socket.
Unfortunately, mechanisms for process placement vary across MPI
implementations, so make sure to consult the manual of your MPI
implementation. The following discussion is based on how processor
placement is done with MPICH and Open MPI, where one needs to pass
`--bind-to core --map-by socket` to `mpiexec`:

```console
$ mpiexec -n 6 --bind-to core --map-by socket ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 000000100000000000100000
process 2 binding: 010000000000010000000000
process 3 binding: 000000010000000000010000
process 4 binding: 001000000000001000000000
process 5 binding: 000000001000000000001000
Triad:        45403.1949   Rate (MB/s)
```

In this configuration, process 0 is bound to the first physical core on
the first socket (with IDs 0 and 12), process 1 is bound to the first
core on the second socket (IDs 6 and 18), and similarly for the
remaining processes. The achieved bandwidth of 45 GB/sec is close to the
practical peak of about 50 GB/sec available on the machine. If, however,
all MPI processes are located on the same socket, memory bandwidth drops
significantly:

```console
$ mpiexec -n 6 --bind-to core --map-by core ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 010000000000010000000000
process 2 binding: 001000000000001000000000
process 3 binding: 000100000000000100000000
process 4 binding: 000010000000000010000000
process 5 binding: 000001000000000001000000
Triad:        25510.7507   Rate (MB/s)
```

All processes are now mapped to cores on the same socket. As a result,
only the first memory channel is fully saturated at 25.5 GB/sec.

One must not assume that `mpiexec` uses good defaults. To
demonstrate, compare the full output of `make streams` from {any}`subsec_bandwidth_vs_processes` first, followed by
the results obtained by passing `--bind-to core --map-by socket`:

```console
$ make streams
np  speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
7 2.31
8 2.42
9 2.37
10 2.65
11 2.3
12 2.53
13 2.43
14 2.63
15 2.74
16 2.7
17 3.28
18 3.66
19 3.95
20 3.07
21 3.82
22 3.49
23 3.79
24 3.71
```

```console
$ make streams MPI_BINDING="--bind-to core --map-by socket"
np  speedup
1 1.0
2 1.59
3 2.66
4 3.5
5 3.56
6 4.23
7 3.95
8 4.39
9 4.09
10 4.46
11 4.15
12 4.42
13 3.71
14 3.83
15 4.08
16 4.22
17 4.18
18 4.31
19 4.22
20 4.28
21 4.25
22 4.23
23 4.28
24 4.22
```

For the non-optimized version shown first, the speedup obtained when
using any number of processes between 3 and 13 is essentially constant
up to fluctuations, indicating that all processes were by default
executed on the same socket. Only with 14 or more processes does the
speedup increase again. In contrast, the results of `make streams` with
proper processor placement shown second exhibit a slightly higher
overall parallel speedup (identical baselines), smaller performance
fluctuations, and more than 90 percent of peak bandwidth with only six
processes.

Machines with job submission systems such as SLURM usually provide
similar mechanisms for processor placement through options specified in
job submission scripts. Please consult the respective manuals.

#### Additional Process Placement Considerations and Details

For a typical, memory bandwidth-limited PETSc application, the primary
consideration in placing MPI processes is ensuring that processes are
evenly distributed among sockets, and hence using all available memory
channels. Increasingly complex processor designs and cache hierarchies,
however, mean that performance may also be sensitive to how processes
are bound to the resources within each socket. Performance on the
two-processor machine in the preceding example may be relatively
insensitive to such placement decisions, because one L3 cache is shared
by all cores within a NUMA domain, and each core has its own L2 and L1
caches. However, processors that are less “flat”, with more complex
hierarchies, may be more sensitive. In many AMD Opterons or the
second-generation “Knights Landing” Intel Xeon Phi, for instance, L2
caches are shared between two cores. On these processors, placing
consecutive MPI ranks on cores that share the same L2 cache may benefit
performance if the two ranks communicate frequently with each other,
because the latency between cores sharing an L2 cache may be roughly
half that of two cores not sharing one. There may be benefit, however,
in placing consecutive ranks on cores that do not share an L2 cache,
because (if there are fewer MPI ranks than cores) this increases the
total L2 cache capacity and bandwidth available to the application.
There is a trade-off to be considered between placing processes close
together (in terms of shared resources) to optimize for efficient
communication and synchronization vs. farther apart to maximize
available resources (memory channels, caches, I/O channels, etc.), and
the best strategy will depend on the application and the software and
hardware stack.

Different process placement strategies can affect performance at least
as much as some commonly explored settings, such as compiler
optimization levels. Unfortunately, exploration of this space is
complicated by two factors: First, processor and core numberings may be
completely arbitrary, changing with BIOS version, etc., and second—as
already noted—there is no standard mechanism used by MPI implementations
(or job schedulers) to specify process affinity. To overcome the first
issue, we recommend using the `lstopo` utility of the Portable
Hardware Locality (`hwloc`) software package (which can be installed
by configuring PETSc with `--download-hwloc`) to understand the
processor topology of your machine. We cannot fully address the second
issue—consult the documentation for your MPI implementation and/or job
scheduler—but we offer some general observations on understanding
placement options:

- An MPI implementation may support a notion of *domains* in which a
  process may be pinned. A domain may simply correspond to a single
  core; however, the MPI implementation may allow a great deal of
  flexibility in specifying domains that encompass multiple cores, span
  sockets, etc. Some implementations, such as Intel MPI, provide means
  to specify whether domains should be “compact”—composed of cores
  sharing resources such as caches—or “scatter”-ed, with little resource
  sharing (possibly even spanning sockets).
- Separate from the specification of domains, MPI implementations often
  support different *orderings* in which MPI ranks should be bound to
  these domains. Intel MPI, for instance, supports “compact” ordering
  to place consecutive ranks close in terms of shared resources,
  “scatter” to place them far apart, and “bunch” to map proportionally
  to sockets while placing ranks as close together as possible within
  the sockets.
- An MPI implementation that supports process pinning should offer some
  way to view the rank assignments. Use this output in conjunction with
  the topology obtained via `lstopo` or a similar tool to determine
  if the placements correspond to something you believe is reasonable
  for your application. Do not assume that the MPI implementation is
  doing something sensible by default!
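
With Open MPI, for instance, the chosen bindings can be printed at startup via `--report-bindings`, which writes one line per rank to standard error (`./app` is a placeholder for your application):

```console
$ mpiexec -n 6 --bind-to core --map-by socket --report-bindings ./app
```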

## Performance Pitfalls and Advice

This section looks into a potpourri of performance pitfalls encountered
by users in the past. Many of these pitfalls require a deeper
understanding of the system and experience to detect. The purpose of
this section is to summarize and share our experience so that these
pitfalls can be avoided in the future.

### Debug vs. Optimized Builds

PETSc’s `configure` defaults to building PETSc with debug mode
enabled. Any code development should be done in this mode, because it
provides handy debugging facilities such as accurate stack traces,
memory leak checks, and memory corruption checks. Note that PETSc has no
reliable way of knowing whether a particular run is a production or
debug run. If a user requests profiling information via
`-log_view`, a debug build of PETSc issues the following warning:

```none
##########################################################
#                                                        #
#                          WARNING!!!                    #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run configure                  #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################
```

Conversely, one way of checking whether a particular build of PETSc has
debugging enabled is to inspect the output of `-log_view`.

Debug mode will generally be most useful for code development if
appropriate compiler options are set to facilitate debugging. The
compiler should be instructed to generate binaries with debug symbols
(command line option `-g` for most compilers), and the optimization
level chosen should either completely disable optimizations (`-O0` for
most compilers) or enable only optimizations that do not interfere with
debugging (GCC, for instance, supports a `-Og` optimization level that
does this).

Only once the new code is thoroughly tested and ready for production
should one disable debugging facilities by passing
`--with-debugging=no` to `configure`. One should also ensure that an
appropriate compiler optimization level is set. Note that some compilers
(e.g., Intel) default to fairly comprehensive optimization levels, while
others (e.g., GCC) default to no optimization at all. The best
optimization flags will depend on your code, the compiler, and the
target architecture, but we offer a few guidelines for finding those
that will offer the best performance:

- Most compilers have a number of optimization levels (with level n
  usually specified via `-On`) that provide a quick way to enable
  sets of several optimization flags. We suggest trying the higher
  optimization levels (the highest level is not guaranteed to produce
  the fastest executable, so some experimentation may be merited). With
  most recent processors now supporting some form of SIMD or vector
  instructions, it is important to choose a level that enables the
  compiler’s auto-vectorizer; many compilers do not enable
  auto-vectorization at lower optimization levels (e.g., GCC does not
  enable it below `-O3` and the Intel compiler does not enable it
  below `-O2`).
- For processors supporting newer vector instruction sets, such as
  Intel AVX2 and AVX-512, it is also important to direct the compiler
  to generate code that targets these processors (e.g., `-march=native`);
  otherwise, the executables built will not
  utilize the newer instruction sets and will not take advantage of
  the vector processing units.
- Beyond choosing the optimization levels, some value-unsafe
  optimizations (such as using reciprocals of values instead of
  dividing by those values, or allowing re-association of operands in a
  series of calculations) for floating point calculations may yield
  significant performance gains. Compilers often provide flags (e.g.,
  `-ffast-math` in GCC) to enable a set of these optimizations, and
  they may be turned on when using options for very aggressive
  optimization (`-fast` or `-Ofast` in many compilers). These are
  worth exploring to maximize performance, but, if employed, it is
  important to verify that these do not cause erroneous results with
  your code, since calculations may violate the IEEE standard for
  floating-point arithmetic.
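
For example, an optimized build with architecture-specific flags might be configured as follows (a sketch: `-march=native` assumes the build machine matches the machine the code will run on, and the flags shown are GCC-style; adjust for your compiler):

```console
$ ./configure --with-debugging=no COPTFLAGS='-O3 -march=native' \
    CXXOPTFLAGS='-O3 -march=native' FOPTFLAGS='-O3 -march=native'
```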

### Profiling

Users should not spend time optimizing a code until after having
determined where it spends the bulk of its time on realistically sized
problems. As discussed in detail in {any}`ch_profiling`, the
PETSc routines automatically log performance data if certain runtime
options are specified.

To obtain a summary of where and how much time is spent in different
sections of the code, use one of the following options:

- Run the code with the option `-log_view` to print a performance
  summary for various phases of the code.
- Run the code with the option `-log_mpe [logfilename]`, which
  creates a logfile of events suitable for viewing with Jumpshot (part
  of MPICH).

Then, focus on the sections where most of the time is spent. If you
provided your own callback routines, e.g. for residual evaluations,
search the profiling output for routines such as `SNESFunctionEval` or
`SNESJacobianEval`. If their relative time is significant (say, more
than 30 percent), consider optimizing these routines first. It is
difficult to give generic instructions for optimizing callback
functions; a good starting point is the performance optimization
guides for your system’s hardware.

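
For example, assuming an application executable named `./app` (a placeholder name), a profiled run on four MPI processes might look like:

```none
mpiexec -n 4 ./app -log_view
```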
### Aggregation

Performing operations on chunks of data rather than a single element at
a time can significantly enhance performance because of cache reuse or
lower data motion. Typical examples are:

- Insert several (many) elements of a matrix or vector at once, rather
  than looping and inserting a single value at a time. In order to
  access the elements of a vector repeatedly, employ `VecGetArray()`
  to allow direct manipulation of the vector elements.
- When possible, use `VecMDot()` rather than a series of calls to
  `VecDot()`.
- If you require a sequence of matrix-vector products with the same
  matrix, consider packing your vectors into a single matrix and using
  matrix-matrix products.
- Users should employ a reasonable number of `PetscMalloc()` calls in
  their codes. Hundreds or thousands of memory allocations may be
  appropriate; however, if tens of thousands are being used, then
  reducing the number of `PetscMalloc()` calls may be warranted. For
  example, reusing space or allocating a large chunk and dividing it
  into pieces can produce significant savings in allocation overhead.
  {any}`sec_dsreuse` gives details.

Aggressive aggregation of data may result in inflexible data structures
and code that is hard to maintain. We advise users to keep these
competing goals in mind and not blindly optimize for performance only.

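
As a minimal sketch of the first point, the following fragment inserts a full matrix row with a single call to `MatSetValues()` rather than one call per entry; the matrix `A` and the arrays `cols` and `vals` are assumed to have been created and filled elsewhere:

```c
/* Sketch only: assumes a properly created and preallocated Mat A, and
   that row, ncols, cols[], and vals[] have been filled in elsewhere. */
PetscInt     row, ncols;
const PetscInt    *cols;   /* global column indices for this row */
const PetscScalar *vals;   /* the ncols values to insert */

/* One call inserts the whole row, which is far cheaper than ncols
   calls inserting one entry each. */
PetscCall(MatSetValues(A, 1, &row, ncols, cols, vals, INSERT_VALUES));

/* Assemble once, after all insertions are done. */
PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
```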
(sec_symbolfactor)=

### Memory Allocation for Sparse Matrix Factorization

When symbolically factoring an AIJ matrix, PETSc has to guess how much
fill there will be. Careful use of the fill parameter in the
`MatFactorInfo` structure when calling `MatLUFactorSymbolic()` or
`MatILUFactorSymbolic()` can greatly reduce the number of mallocs and
copies required, and thus greatly improve the performance of the
factorization. One way to determine a good value for the fill parameter
is to run a program with the option `-info`. The symbolic
factorization phase will then print information such as

```none
Info:MatILUFactorSymbolic_SeqAIJ:Reallocs 12 Fill ratio:given 1 needed 2.16423
```

This indicates that the user should have used a fill estimate factor of
about 2.17 (instead of 1) to prevent the 12 required mallocs and copies.
The command line option

```none
-pc_factor_fill 2.17
```

will cause PETSc to preallocate the correct amount of space for
the factorization.

(detecting_memory_problems)=

### Detecting Memory Allocation Problems and Memory Usage

PETSc provides tools to aid in understanding PETSc memory usage and detecting problems with
memory allocation, including leaks and use of uninitialized space. Internally, PETSc uses
the routines `PetscMalloc()` and `PetscFree()` for memory allocation instead of directly calling `malloc()` and `free()`.
This allows PETSc to track its memory usage and perform error checking. Users are urged to use these routines as well when
appropriate.

- The option `-malloc_debug` turns on PETSc's extensive runtime error checking of memory for corruption.
  This checking can be expensive, so it should not be used for
  production runs. The option `-malloc_test` is equivalent to `-malloc_debug`
  but only takes effect when PETSc is configured with `--with-debugging` (the default configuration).
  We suggest setting the environment variable `PETSC_OPTIONS=-malloc_test`
  in your shell startup file to automatically enable runtime memory checking when developing code but not
  when running optimized code. Using `-malloc_debug` or `-malloc_test` for large runs can slow them significantly, thus we
  recommend turning them off if your code is painfully slow and you don't need the testing. In addition, you can use
  `-check_pointer_intensity 0` for long debug runs that do not need extensive memory corruption testing. This option
  is occasionally added to the `PETSC_OPTIONS` environment variable by some users.
- The option
  `-malloc_dump` will print a list of memory locations that have not been freed at the
  conclusion of a program. If all memory has been freed, no message
  is printed. Note that
  the option `-malloc_dump` activates a call to
  `PetscMallocDump()` during `PetscFinalize()`. The user can also
  call `PetscMallocDump()` elsewhere in a program.
- Another useful option
  is `-malloc_view`, which reports memory usage in all routines at the conclusion of the program.
  Note that this option
  activates logging by calling `PetscMallocViewSet()` in
  `PetscInitialize()` and then prints the log by calling
  `PetscMallocView()` in `PetscFinalize()`. The user can also call
  these routines elsewhere in a program.
- When finer granularity is
  desired, the user can call `PetscMallocGetCurrentUsage()` and
  `PetscMallocGetMaximumUsage()` for memory allocated by PETSc, or
  `PetscMemoryGetCurrentUsage()` and `PetscMemoryGetMaximumUsage()`
  for the total memory used by the program. Note that
  `PetscMemorySetGetMaximumUsage()` must be called before
  `PetscMemoryGetMaximumUsage()` (typically at the beginning of the
  program).
- The option `-memory_view` provides a high-level view of all memory usage,
  not just the memory used by `PetscMalloc()`, at the conclusion of the program.
- When running with `-log_view`, the additional option `-log_view_memory`
  causes the display of additional columns of information about how much
  memory was allocated and freed during each logged event. This is useful
  for understanding which phases of a computation require the most memory.

One can also use [Valgrind](http://valgrind.org) to track memory usage and find bugs; see {any}`FAQ: Valgrind usage<valgrind>`.
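
For example, during development one might combine several of these options on a single run (the executable name `./app` is a placeholder):

```none
./app -malloc_test -malloc_dump -memory_view
```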

(sec_dsreuse)=

### Data Structure Reuse

Data structures should be reused whenever possible. For example, if a
code often creates new matrices or vectors, there may be a way to
reuse some of them. Very significant performance improvements can be
achieved by reusing matrix data structures with the same nonzero
pattern. If a code creates thousands of matrix or vector objects,
performance will be degraded. For example, when solving a nonlinear
problem or timestepping, reusing the matrices and their nonzero
structure for many steps when appropriate can make the code run
significantly faster.

A simple technique for saving work vectors, matrices, etc. is employing
a user-defined context. In C and C++ such a context is merely a
structure in which various objects can be stashed; in Fortran a user
context can be an integer array that contains both parameters and
pointers to PETSc objects. See
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5.c.html">SNES Tutorial ex5</a>
and
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5f90.F90.html">SNES Tutorial ex5f90</a>
for examples of user-defined application contexts in C and Fortran,
respectively.

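
A minimal sketch of such a context in C might look as follows; the structure name and members are hypothetical, and the context is passed through the usual user-context argument of `SNESSetFunction()`:

```c
/* Hypothetical user-defined application context: problem parameters
   plus reusable work objects stashed in one structure. */
typedef struct {
  PetscReal param; /* a problem parameter */
  Vec       work;  /* a work vector created once and reused */
} AppCtx;

/* Callback receiving the context; the signature matches the function
   pointer expected by SNESSetFunction(). */
PetscErrorCode FormFunction(SNES snes, Vec x, Vec f, void *ptr)
{
  AppCtx *user = (AppCtx *)ptr;

  PetscFunctionBeginUser;
  /* ... use user->param and user->work to evaluate f(x) ... */
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* Registration elsewhere: SNESSetFunction(snes, r, FormFunction, &user); */
```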
### Numerical Experiments

PETSc users should run a variety of tests. For example, there are a
large number of options for the linear and nonlinear equation solvers in
PETSc, and different choices can make a *very* big difference in
convergence rates and execution times. PETSc employs defaults that are
generally reasonable for a wide range of problems, but clearly these
defaults cannot be best for all cases. Users should experiment with many
combinations to determine what is best for a given problem and customize
the solvers accordingly.

- Use the options `-snes_view`, `-ksp_view`, etc. (or the routines
  `KSPView()`, `SNESView()`, etc.) to view the options that have
  been used for a particular solver.
- Run the code with the option `-help` for a list of the available
  runtime commands.
- Use the option `-info` to print details about the solvers’
  operation.
- Use the PETSc monitoring discussed in {any}`ch_profiling`
  to evaluate the performance of various numerical methods.

(sec_slestips)=

### Tips for Efficient Use of Linear Solvers

As discussed in {any}`ch_ksp`, the default linear
solvers are

- uniprocess: GMRES(30) with ILU(0) preconditioning
- multiprocess: GMRES(30) with block Jacobi preconditioning, where
  there is 1 block per process, and each block is solved with ILU(0)

One should experiment to determine alternatives that may be better for
various applications. Recall that one can specify the `KSP` methods
and preconditioners at runtime via the options:

```none
-ksp_type <ksp_name> -pc_type <pc_name>
```

One can also specify a variety of runtime customizations for the
solvers, as discussed throughout the manual.

In particular, note that the default restart parameter for GMRES is 30,
which may be too small for some large-scale problems. One can alter this
parameter with the option `-ksp_gmres_restart <restart>` or by calling
`KSPGMRESSetRestart()`. {any}`sec_ksp` gives
information on setting alternative GMRES orthogonalization routines,
which may provide much better parallel performance.

For elliptic problems one often obtains good performance and scalability
with multigrid solvers. Consult {any}`sec_amg` for
available options. Our experience is that GAMG works particularly well
for elasticity problems, whereas hypre does well for scalar problems.

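
For example, to increase the restart length at runtime one might use (the value 100 is illustrative):

```none
-ksp_type gmres -ksp_gmres_restart 100
```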
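
For instance, for a symmetric positive-definite elliptic problem one might try (illustrative options):

```none
-ksp_type cg -pc_type gamg
```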
### System-Related Problems

The performance of a code can be affected by a variety of factors,
including the cache behavior, other users on the machine, etc. Below we
briefly describe some common problems and possibilities for overcoming
them.

- **Problem too large for physical memory size**: When timing a
  program, one should always leave at least a ten percent margin
  between the total memory a process is using and the physical size of
  the machine’s memory. One way to estimate the amount of memory used
  by a given process is with the Unix `getrusage` system routine.
  The PETSc option `-memory_view` reports all
  memory usage, including any Fortran arrays in an application code.
- **Effects of other users**: If other users are running jobs on the
  same physical processor nodes on which a program is being profiled,
  the timing results are essentially meaningless.
- **Overhead of timing routines on certain machines**: On certain
  machines, even calling the system clock in order to time routines is
  slow; this skews all of the flop rates and timing results. The file
  `$PETSC_DIR/src/benchmarks/PetscTime.c` (<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/benchmarks/PetscTime.c.html">source</a>)
  contains a simple test problem that will approximate the amount of
  time required to get the current time in a running program. On good
  systems it will be on the order of $10^{-6}$ seconds or less.
- **Problem too large for good cache performance**: Certain machines
  with lower memory bandwidths (slow memory access) attempt to
  compensate by having a very large cache. Thus, if a significant
  portion of an application fits within the cache, the program will
  achieve very good performance; if the code is too large, the
  performance can degrade markedly. To analyze whether this situation
  affects a particular code, one can try plotting the total flop rate
  as a function of problem size. If the flop rate decreases rapidly at
  some point, then the problem is likely too large for the cache.
- **Inconsistent timings**: Inconsistent timings are likely due to
  other users on the machine, thrashing (using more virtual memory than
  is physically available), or paging in of the initial executable.
  {any}`sec_profaccuracy` provides information on
  overcoming paging overhead when profiling a code. We have found on
  all systems that if you follow all the advice above, your timings will
  be consistent to within a variation of less than five percent.