(ch_performance)=

# Hints for Performance Tuning

This chapter provides hints on how to achieve the best performance
with PETSc, particularly on distributed-memory machines with multiple
CPU sockets per node. We focus on machine-related performance
optimization here; algorithmic aspects like preconditioner selection are
not the focus of this section.

## Maximizing Memory Bandwidth

Most operations in PETSc deal with large datasets (typically vectors and
sparse matrices) and perform relatively few arithmetic operations for
each byte loaded or stored from global memory. Therefore, the
*arithmetic intensity*, expressed as the ratio of floating point
operations to the number of bytes loaded and stored, is usually well
below unity for typical PETSc operations. On the other hand, modern CPUs
are able to execute on the order of 10 floating point operations for
each byte loaded or stored. As a consequence, almost all PETSc
operations are limited by the rate at which data can be loaded or stored
(*memory bandwidth limited*) rather than by the rate of floating point
operations.

This section discusses ways to maximize the memory bandwidth achieved by
applications based on PETSc.
Where appropriate, we include benchmark
results in order to provide quantitative results on typical performance
gains one can achieve through parallelization, both on a single compute
node and across nodes. In particular, we start with the answer to the
common question of why performance generally does not increase 20-fold
with a 20-core CPU.

(subsec_bandwidth_vs_processes)=

### Memory Bandwidth vs. Processes

Consider the addition of two large vectors, with the result written to a
third vector. Because there are no dependencies across the different
entries of each vector, the operation is embarrassingly parallel.

:::{figure} /images/manual/stream-results-intel.*
:alt: Memory bandwidth obtained on Intel hardware (dual socket except KNL) over the
: number of processes used. One can get close to peak memory bandwidth with only a
: few processes.
:name: fig_stream_intel
:width: 80.0%

Memory bandwidth obtained on Intel hardware (dual socket except KNL)
over the number of processes used. One can get close to peak memory
bandwidth with only a few processes.
:::

As {numref}`fig_stream_intel` shows, the performance gain due to
parallelization on different multi- and many-core CPUs quickly
saturates.
The reason is that only a fraction of the total number of CPU
cores is required to saturate the memory channels. For example, a
dual-socket system equipped with Haswell 12-core Xeon CPUs achieves more
than 80 percent of the achievable peak memory bandwidth with only four
processes per socket (8 total), cf. {numref}`fig_stream_intel`.
Consequently, running with more than 8 MPI ranks on such a system will
not increase performance substantially. For the same reason, PETSc-based
applications usually do not benefit from hyper-threading.

PETSc provides a simple way to measure memory bandwidth for different
numbers of processes via the target `make streams` executed from
`$PETSC_DIR`. The output provides an overview of the possible speedup
one can obtain on the given machine (not necessarily a shared memory
system). For example, the following is the most relevant output obtained
on a dual-socket system equipped with two six-core CPUs with
hyperthreading:

```none
np speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
...
21 3.82
22 3.49
23 3.79
24 3.71
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
```

On this machine, one should expect a speedup of typical memory
bandwidth-bound PETSc applications of at most 4x when running multiple
MPI ranks on the node. Most of the gains are already obtained when
running with only 4-6 ranks. Because a smaller number of MPI ranks
usually implies better preconditioners and better performance for
smaller problems, the best performance for PETSc applications may be
obtained with fewer ranks than there are physical CPU cores available.

Following the results from the above run of `make streams`, we
recommend using additional nodes instead of placing additional MPI
ranks on the nodes. In particular, weak scaling (i.e. constant load per
process, increasing the number of processes) and strong scaling
(i.e. constant total work, increasing the number of processes) studies
should keep the number of processes per node constant.

### Non-Uniform Memory Access (NUMA) and Process Placement

CPUs in nodes with more than one CPU socket are internally connected via
a high-speed fabric, cf. {numref}`fig_numa`, to enable data
exchange as well as cache coherency.
Because main memory on modern
systems is connected via the integrated memory controllers on each CPU,
memory is accessed in a non-uniform way: a process running on one socket
has direct access to the memory channels of the respective CPU, whereas
requests for memory attached to a different CPU socket need to go
through the high-speed fabric. Consequently, the best aggregate memory
bandwidth on the node is obtained when the memory controllers on each
CPU are fully saturated. However, full saturation of memory channels is
only possible if the data is distributed across the different memory
channels.

:::{figure} /images/manual/numa.*
:alt: Schematic of a two-socket NUMA system. Processes should be spread across both
: CPUs to obtain full bandwidth.
:name: fig_numa
:width: 90.0%

Schematic of a two-socket NUMA system. Processes should be spread
across both CPUs to obtain full bandwidth.
:::

Data in memory on modern machines is allocated by the operating system
based on a first-touch policy. That is, memory is not allocated at the
point of issuing `malloc()`, but at the point when the respective
memory segment is actually touched (read or write). Upon first touch,
memory is allocated on the memory channel associated with the
CPU the process is running on.
Memory available
through other sockets is considered only if all memory on the respective
CPU is already in use (either allocated or used as I/O cache).

Maximum memory bandwidth can be achieved by ensuring that processes are
spread over all sockets in the respective node. For example, the
recommended placement of an 8-way parallel run on a four-socket machine
is to assign two processes to each CPU socket. To do so, one needs to
know the enumeration of cores and pass the requested information to
`mpirun`. Consider the hardware topology information returned by
`lstopo` (part of the hwloc package) for the following two-socket
machine, in which each CPU consists of six cores and supports
hyperthreading:

```none
Machine (126GB total)
  NUMANode L#0 (P#0 63GB)
    Package L#0 + L3 L#0 (15MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#12)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#13)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#14)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#15)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#16)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#17)
  NUMANode L#1 (P#1 63GB)
    Package L#1 + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#18)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#19)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#20)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#21)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#22)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#23)
```

The relevant physical processor IDs are shown in parentheses prefixed by
`P#`. Here, IDs 0 and 12 share the same physical core and have a
common L2 cache. IDs 0, 12, 1, 13, 2, 14, 3, 15, 4, 16, 5, 17 share the
same socket and have a common L3 cache.

A good placement for a run with six processes is to locate three
processes on the first socket and three processes on the second socket.
Unfortunately, mechanisms for process placement vary across MPI
implementations, so make sure to consult the manual of your MPI
implementation. The following discussion is based on how processor
placement is done with MPICH and Open MPI, where one needs to pass
`--bind-to core --map-by socket` to `mpirun`:

```console
$ mpirun -n 6 --bind-to core --map-by socket ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 000000100000000000100000
process 2 binding: 010000000000010000000000
process 3 binding: 000000010000000000010000
process 4 binding: 001000000000001000000000
process 5 binding: 000000001000000000001000
Triad: 45403.1949 Rate (MB/s)
```

In this configuration, process 0 is bound to the first physical core on
the first socket (with IDs 0 and 12), process 1 is bound to the first
core on the second socket (IDs 6 and 18), and similarly for the
remaining processes. The achieved bandwidth of 45 GB/sec is close to the
practical peak of about 50 GB/sec available on the machine.
If, however,
all MPI processes are located on the same socket, memory bandwidth drops
significantly:

```console
$ mpirun -n 6 --bind-to core --map-by core ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 010000000000010000000000
process 2 binding: 001000000000001000000000
process 3 binding: 000100000000000100000000
process 4 binding: 000010000000000010000000
process 5 binding: 000001000000000001000000
Triad: 25510.7507 Rate (MB/s)
```

All processes are now mapped to cores on the same socket. As a result,
only the first memory channel is fully saturated at 25.5 GB/sec.

One must not assume that `mpirun` uses good defaults.
To
demonstrate, compare the full output of `make streams` from
{any}`subsec_bandwidth_vs_processes` first, followed by
the results obtained by passing `--bind-to core --map-by socket`:

```console
$ make streams
np speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
7 2.31
8 2.42
9 2.37
10 2.65
11 2.3
12 2.53
13 2.43
14 2.63
15 2.74
16 2.7
17 3.28
18 3.66
19 3.95
20 3.07
21 3.82
22 3.49
23 3.79
24 3.71
```

```console
$ make streams MPI_BINDING="--bind-to core --map-by socket"
np speedup
1 1.0
2 1.59
3 2.66
4 3.5
5 3.56
6 4.23
7 3.95
8 4.39
9 4.09
10 4.46
11 4.15
12 4.42
13 3.71
14 3.83
15 4.08
16 4.22
17 4.18
18 4.31
19 4.22
20 4.28
21 4.25
22 4.23
23 4.28
24 4.22
```

For the non-optimized version shown first, the speedup obtained when
using any number of processes between 3 and 13 is essentially constant
up to fluctuations, indicating that all processes were by default
executed on the same socket. Only with 14 or more processes does the
speedup increase again. In contrast, the second run of `make streams`,
with proper processor placement, resulted in a slightly higher overall
parallel speedup (with identical baselines), smaller performance
fluctuations, and more than 90 percent of peak bandwidth with only six
processes.

Machines with job submission systems such as SLURM usually provide
similar mechanisms for processor placement through options specified in
job submission scripts. Please consult the respective manuals.

#### Additional Process Placement Considerations and Details

For a typical, memory bandwidth-limited PETSc application, the primary
consideration in placing MPI processes is ensuring that processes are
evenly distributed among sockets, and hence using all available memory
channels.
Increasingly complex processor designs and cache hierarchies,
however, mean that performance may also be sensitive to how processes
are bound to the resources within each socket. Performance on the
two-processor machine in the preceding example may be relatively
insensitive to such placement decisions, because one L3 cache is shared
by all cores within a NUMA domain, and each core has its own L2 and L1
caches. However, processors that are less “flat”, with more complex
hierarchies, may be more sensitive. In many AMD Opterons or the
second-generation “Knights Landing” Intel Xeon Phi, for instance, L2
caches are shared between two cores. On these processors, placing
consecutive MPI ranks on cores that share the same L2 cache may benefit
performance if the two ranks communicate frequently with each other,
because the latency between cores sharing an L2 cache may be roughly
half that of two cores not sharing one. There may be a benefit, however,
in placing consecutive ranks on cores that do not share an L2 cache,
because (if there are fewer MPI ranks than cores) this increases the
total L2 cache capacity and bandwidth available to the application.
There is a trade-off to be considered between placing processes close
together (in terms of shared resources) to optimize for efficient
communication and synchronization vs.
farther apart to maximize available resources (memory channels,
caches, I/O channels, etc.), and the best strategy will depend on the
application and the software and hardware stack.

Different process placement strategies can affect performance at least
as much as some commonly explored settings, such as compiler
optimization levels. Unfortunately, exploration of this space is
complicated by two factors: First, processor and core numberings may be
completely arbitrary, changing with BIOS version, etc., and second—as
already noted—there is no standard mechanism used by MPI implementations
(or job schedulers) to specify process affinity. To overcome the first
issue, we recommend using the `lstopo` utility of the Portable
Hardware Locality (`hwloc`) software package (which can be installed
by configuring PETSc with `--download-hwloc`) to understand the
processor topology of your machine. We cannot fully address the second
issue—consult the documentation for your MPI implementation and/or job
scheduler—but we offer some general observations on understanding
placement options:

- An MPI implementation may support a notion of *domains* in which a
  process may be pinned. A domain may simply correspond to a single
  core; however, the MPI implementation may allow a great deal of
  flexibility in specifying domains that encompass multiple cores, span
  sockets, etc.
Some implementations, such as Intel MPI, provide means to
  specify whether domains should be “compact”—composed of cores sharing
  resources such as caches—or “scatter”-ed, with little resource
  sharing (possibly even spanning sockets).
- Separate from the specification of domains, MPI implementations often
  support different *orderings* in which MPI ranks should be bound to
  these domains. Intel MPI, for instance, supports “compact” ordering
  to place consecutive ranks close in terms of shared resources,
  “scatter” to place them far apart, and “bunch” to map proportionally
  to sockets while placing ranks as close together as possible within
  the sockets.
- An MPI implementation that supports process pinning should offer some
  way to view the rank assignments. Use this output in conjunction with
  the topology obtained via `lstopo` or a similar tool to determine
  if the placements correspond to something you believe is reasonable
  for your application. Do not assume that the MPI implementation is
  doing something sensible by default!

## Performance Pitfalls and Advice

This section looks into a potpourri of performance pitfalls encountered
by users in the past. Many of these pitfalls require a deeper
understanding of the system and experience to detect. The purpose of
this section is to summarize and share our experience so that these
pitfalls can be avoided in the future.

### Debug vs. Optimized Builds

PETSc’s `configure` defaults to building PETSc with debug mode
enabled. Any code development should be done in this mode, because it
provides handy debugging facilities such as accurate stack traces,
memory leak checks, and memory corruption checks. Note that PETSc has no
reliable way of knowing whether a particular run is a production or
debug run. In the case that a user requests profiling information via
`-log_view`, a debug build of PETSc issues the following warning:

```none
##########################################################
#                                                        #
#                      WARNING!!!                        #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run configure                  #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################
```

Conversely, one way of checking whether a particular build of PETSc has
debugging enabled is to inspect the output of `-log_view`.

Debug mode will generally be most useful for code development if
appropriate compiler options are set to facilitate debugging.
The
compiler should be instructed to generate binaries with debug symbols
(command line option `-g` for most compilers), and the optimization
level chosen should either completely disable optimizations (`-O0` for
most compilers) or enable only optimizations that do not interfere with
debugging (GCC, for instance, supports a `-Og` optimization level that
does this).

Only once the new code is thoroughly tested and ready for production
should one disable debugging facilities by passing
`--with-debugging=no` to `configure`. One should also ensure that an
appropriate compiler optimization level is set. Note that some compilers
(e.g., Intel) default to fairly comprehensive optimization levels, while
others (e.g., GCC) default to no optimization at all. The best
optimization flags will depend on your code, the compiler, and the
target architecture, but we offer a few guidelines for finding those
that will offer the best performance:

- Most compilers have a number of optimization levels (with level n
  usually specified via `-On`) that provide a quick way to enable
  sets of several optimization flags. We suggest trying the higher
  optimization levels (the highest level is not guaranteed to produce
  the fastest executable, so some experimentation may be merited).
With
  most recent processors now supporting some form of SIMD or vector
  instructions, it is important to choose a level that enables the
  compiler’s auto-vectorizer; many compilers do not enable
  auto-vectorization at lower optimization levels (e.g., GCC does not
  enable it below `-O3` and the Intel compiler does not enable it
  below `-O2`).
- For processors supporting newer vector instruction sets, such as
  Intel AVX2 and AVX-512, it is also important to direct the compiler
  to generate code that targets these processors (e.g.,
  `-march=native`); otherwise, the executables built will not
  utilize the newer instruction sets and will not take advantage of
  the vector processing units.
- Beyond choosing the optimization levels, some value-unsafe
  optimizations (such as using reciprocals of values instead of
  dividing by those values, or allowing re-association of operands in a
  series of calculations) for floating point calculations may yield
  significant performance gains. Compilers often provide flags (e.g.,
  `-ffast-math` in GCC) to enable a set of these optimizations, and
  they may be turned on when using options for very aggressive
  optimization (`-fast` or `-Ofast` in many compilers).
These are
  worth exploring to maximize performance, but, if employed, it is
  important to verify that these do not cause erroneous results with
  your code, since calculations may violate the IEEE standard for
  floating-point arithmetic.

### Profiling

Users should not spend time optimizing a code until after having
determined where it spends the bulk of its time on realistically sized
problems. As discussed in detail in {any}`ch_profiling`, the
PETSc routines automatically log performance data if certain runtime
options are specified.

To obtain a summary of where and how much time is spent in different
sections of the code, use one of the following options:

- Run the code with the option `-log_view` to print a performance
  summary for various phases of the code.
- Run the code with the option `-log_mpe [logfilename]`, which
  creates a logfile of events suitable for viewing with Jumpshot (part
  of MPICH).

Then, focus on the sections where most of the time is spent. If you
provided your own callback routines, e.g. for residual evaluations,
search the profiling output for routines such as `SNESFunctionEval` or
`SNESJacobianEval`. If their relative time is significant (say, more
than 30 percent), consider optimizing these routines first.
Generic
instructions on how to optimize your callback functions are difficult;
you may start by reading performance optimization guides for your
system’s hardware.

### Aggregation

Performing operations on chunks of data rather than a single element at
a time can significantly enhance performance because of cache reuse or
lower data motion. Typical examples are:

- Insert several (many) elements of a matrix or vector at once, rather
  than looping and inserting a single value at a time. In order to
  access elements of a vector repeatedly, employ `VecGetArray()` to
  allow direct manipulation of the vector elements.
- When possible, use `VecMDot()` rather than a series of calls to
  `VecDot()`.
- If you require a sequence of matrix-vector products with the same
  matrix, consider packing your vectors into a single matrix and using
  matrix-matrix multiplications.
- Users should employ a reasonable number of `PetscMalloc()` calls in
  their codes. Hundreds or thousands of memory allocations may be
  appropriate; however, if tens of thousands are being used, then
  reducing the number of `PetscMalloc()` calls may be warranted. For
  example, reusing space or allocating a large chunk and dividing it
  into pieces can produce significant savings in allocation overhead.
  {any}`sec_dsreuse` gives details.

Aggressive aggregation of data may result in inflexible data structures
and code that is hard to maintain. We advise users to keep these
competing goals in mind and not blindly optimize for performance only.

(sec_symbolfactor)=

### Memory Allocation for Sparse Matrix Factorization

When symbolically factoring an AIJ matrix, PETSc has to guess how much
fill there will be. Careful use of the fill parameter in the
`MatFactorInfo` structure when calling `MatLUFactorSymbolic()` or
`MatILUFactorSymbolic()` can greatly reduce the number of mallocs and
copies required, and thus greatly improve the performance of the
factorization. One way to determine a good value for the fill parameter
is to run a program with the option `-info`. The symbolic
factorization phase will then print information such as

```none
Info:MatILUFactorSymbolic_SeqAIJ:Reallocs 12 Fill ratio:given 1 needed 2.16423
```

This indicates that the user should have used a fill estimate factor of
about 2.17 (instead of 1) to prevent the 12 required mallocs and copies.
The command line option

```none
-pc_factor_fill 2.17
```

will cause PETSc to preallocate the correct amount of space for
the factorization.
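For codes that call the factorization routines directly rather than
through a `PC`, the same estimate can be supplied programmatically via
the `fill` field of `MatFactorInfo`. The fragment below is a sketch,
not a complete program: it assumes an already assembled AIJ matrix `A`
and omits the usual error checking.

```c
Mat           A, F;          /* A: assembled AIJ matrix; F: its factor */
IS            rowperm, colperm;
MatFactorInfo info;

MatFactorInfoInitialize(&info);
info.fill = 2.17;            /* fill estimate suggested by the -info output */

MatGetOrdering(A, MATORDERINGND, &rowperm, &colperm);
MatGetFactor(A, MATSOLVERPETSC, MAT_FACTOR_LU, &F);
MatLUFactorSymbolic(F, A, rowperm, colperm, &info);
MatLUFactorNumeric(F, A, &info);
```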

(detecting_memory_problems)=

### Detecting Memory Allocation Problems and Memory Usage

PETSc provides tools to aid in understanding PETSc memory usage and detecting problems with
memory allocation, including leaks and use of uninitialized space. Internally, PETSc uses
the routines `PetscMalloc()` and `PetscFree()` for memory allocation rather than directly calling `malloc()` and `free()`.
This allows PETSc to track its memory usage and perform error checking. Users are urged to use these routines as well when
appropriate.

- The option `-malloc_debug` turns on PETSc's extensive runtime checking of memory for corruption.
  This checking can be expensive, so it should not be used for
  production runs. The option `-malloc_test` is equivalent to `-malloc_debug`
  but only works when PETSc is configured with `--with-debugging` (the default configuration).
  We suggest setting the environment variable `PETSC_OPTIONS=-malloc_test`
  in your shell startup file to automatically enable runtime memory checking when developing code but not
  when running optimized code. Using `-malloc_debug` or `-malloc_test` for large runs can slow them significantly, thus we
  recommend turning them off if your code is painfully slow and you don't need the testing. In addition, you can use
  `-check_pointer_intensity 0` for long debugging runs that do not need extensive memory corruption testing. This option
  is occasionally added to the `PETSC_OPTIONS` environment variable by some users.
- The option
  `-malloc_dump` will print a list of memory locations that have not been freed at the
  conclusion of a program. If all memory has been freed, no message
  is printed. Note that
  the option `-malloc_dump` activates a call to
  `PetscMallocDump()` during `PetscFinalize()`. The user can also
  call `PetscMallocDump()` elsewhere in a program.
- Another useful option
  is `-malloc_view`, which reports memory usage in all routines at the conclusion of the program.
  Note that this option
  activates logging by calling `PetscMallocViewSet()` in
  `PetscInitialize()` and then prints the log by calling
  `PetscMallocView()` in `PetscFinalize()`. The user can also call
  these routines elsewhere in a program.
- When finer granularity is
  desired, the user can call `PetscMallocGetCurrentUsage()` and
  `PetscMallocGetMaximumUsage()` for memory allocated by PETSc, or
  `PetscMemoryGetCurrentUsage()` and `PetscMemoryGetMaximumUsage()`
  for the total memory used by the program. Note that
  `PetscMemorySetGetMaximumUsage()` must be called before
  `PetscMemoryGetMaximumUsage()` (typically at the beginning of the
  program).
- The option `-memory_view` provides a high-level view of all memory usage,
  not just the memory used by `PetscMalloc()`, at the conclusion of the program.
- When running with `-log_view`, the additional option `-log_view_memory`
  causes the display of additional columns of information about how much
  memory was allocated and freed during each logged event. This is useful
  for understanding which phases of a computation require the most memory.

One can also use [Valgrind](http://valgrind.org) to track memory usage and find bugs; see {any}`FAQ: Valgrind usage<valgrind>`.

(sec_dsreuse)=

### Data Structure Reuse

Data structures should be reused whenever possible. For example, if a
code often creates new matrices or vectors, there often may be a way to
reuse some of them. Very significant performance improvements can be
achieved by reusing matrix data structures with the same nonzero
pattern. If a code creates thousands of matrix or vector objects,
performance will be degraded. For example, when solving a nonlinear
problem or timestepping, reusing the matrices and their nonzero
structure for many steps when appropriate can make the code run
significantly faster.

A simple technique for saving work vectors, matrices, etc. is employing
a user-defined context. In C and C++ such a context is merely a
structure in which various objects can be stashed; in Fortran a user
context can be an integer array that contains both parameters and
pointers to PETSc objects.
See
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5.c.html">SNES Tutorial ex5</a>
and
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5f90.F90.html">SNES Tutorial ex5f90</a>
for examples of user-defined application contexts in C and Fortran,
respectively.

### Numerical Experiments

PETSc users should run a variety of tests. For example, there are a
large number of options for the linear and nonlinear equation solvers in
PETSc, and different choices can make a *very* big difference in
convergence rates and execution times. PETSc employs defaults that are
generally reasonable for a wide range of problems, but clearly these
defaults cannot be best for all cases. Users should experiment with many
combinations to determine what is best for a given problem and customize
the solvers accordingly.

- Use the options `-snes_view`, `-ksp_view`, etc. (or the routines
  `KSPView()`, `SNESView()`, etc.) to view the options that have
  been used for a particular solver.
- Run the code with the option `-help` for a list of the available
  runtime commands.
- Use the option `-info` to print details about the solvers’
  operation.
- Use the PETSc monitoring discussed in {any}`ch_profiling`
  to evaluate the performance of various numerical methods.
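For example, a single run can combine several of these options to
compare a solver variant against the default (the executable name
`./app` below is a placeholder for your application):

```none
mpiexec -n 4 ./app -ksp_type bcgs -pc_type asm -ksp_view -log_view
```

Because everything is selected at runtime, such experiments require no
recompilation.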

(sec_slestips)=

### Tips for Efficient Use of Linear Solvers

As discussed in {any}`ch_ksp`, the default linear
solvers are

- uniprocess: GMRES(30) with ILU(0) preconditioning
- multiprocess: GMRES(30) with block Jacobi preconditioning, where
  there is 1 block per process, and each block is solved with ILU(0)

One should experiment to determine alternatives that may be better for
various applications. Recall that one can specify the `KSP` methods
and preconditioners at runtime via the options:

```none
-ksp_type <ksp_name> -pc_type <pc_name>
```

One can also specify a variety of runtime customizations for the
solvers, as discussed throughout the manual.

In particular, note that the default restart parameter for GMRES is 30,
which may be too small for some large-scale problems. One can alter this
parameter with the option `-ksp_gmres_restart <restart>` or by calling
`KSPGMRESSetRestart()`. {any}`sec_ksp` gives
information on setting alternative GMRES orthogonalization routines,
which may provide much better parallel performance.

For elliptic problems one often obtains good performance and scalability
with multigrid solvers. Consult {any}`sec_amg` for
available options.
Our experience is that GAMG works particularly well
for elasticity problems, whereas hypre does well for scalar problems.

### System-Related Problems

The performance of a code can be affected by a variety of factors,
including the cache behavior, other users on the machine, etc. Below we
briefly describe some common problems and possibilities for overcoming
them.

- **Problem too large for physical memory size**: When timing a
  program, one should always leave at least a ten percent margin
  between the total memory a process is using and the physical size of
  the machine’s memory. One way to estimate the amount of memory used
  by a given process is with the Unix `getrusage` system routine.
  The PETSc option `-malloc_view` reports all
  memory usage, including any Fortran arrays in an application code.
- **Effects of other users**: If other users are running jobs on the
  same physical processor nodes on which a program is being profiled,
  the timing results are essentially meaningless.
- **Overhead of timing routines on certain machines**: On certain
  machines, even calling the system clock in order to time routines is
  slow; this skews all of the flop rates and timing results.
The file
`$PETSC_DIR/src/benchmarks/PetscTime.c` (<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/benchmarks/PetscTime.c.html">source</a>)
contains a simple test problem that will approximate the amount of
time required to get the current time in a running program. On good
systems it will be on the order of $10^{-6}$ seconds or less.
- **Problem too large for good cache performance**: Certain machines
  with lower memory bandwidths (slow memory access) attempt to
  compensate by having a very large cache. Thus, if a significant
  portion of an application fits within the cache, the program will
  achieve very good performance; if the code is too large, the
  performance can degrade markedly. To analyze whether this situation
  affects a particular code, one can try plotting the total flop rate
  as a function of problem size. If the flop rate decreases rapidly at
  some point, then the problem is likely too large for the cache
  size.
- **Inconsistent timings**: Inconsistent timings are likely due to
  other users on the machine, thrashing (using more virtual memory than
  available physical memory), or paging in of the initial executable.
  {any}`sec_profaccuracy` provides information on
  overcoming paging overhead when profiling a code. We have found on
  all systems that if you follow all the advice above your timings will
  be consistent to within a variation of less than five percent.