(ch_performance)=

# Hints for Performance Tuning

This chapter provides hints on how to achieve the best performance
with PETSc, particularly on distributed-memory machines with multiple
CPU sockets per node. We focus on machine-related performance
optimization here; algorithmic aspects like preconditioner selection are
not the focus of this section.

## Maximizing Memory Bandwidth

Most operations in PETSc deal with large datasets (typically vectors and
sparse matrices) and perform relatively few arithmetic operations for
each byte loaded or stored from global memory. Therefore, the
*arithmetic intensity*, expressed as the ratio of floating point
operations to the number of bytes loaded and stored, is usually well
below unity for typical PETSc operations. On the other hand, modern CPUs
are able to execute on the order of 10 floating point operations for
each byte loaded or stored. As a consequence, almost all PETSc
operations are limited by the rate at which data can be loaded or stored
(*memory bandwidth limited*) rather than by the rate of floating point
operations.

This section discusses ways to maximize the memory bandwidth achieved by
applications based on PETSc.
Where appropriate, we include benchmark
results in order to provide quantitative results on typical performance
gains one can achieve through parallelization, both on a single compute
node and across nodes. In particular, we start with the answer to the
common question of why performance generally does not increase 20-fold
with a 20-core CPU.

(subsec_bandwidth_vs_processes)=

### Memory Bandwidth vs. Processes

Consider the addition of two large vectors, with the result written to a
third vector. Because there are no dependencies across the different
entries of each vector, the operation is embarrassingly parallel.

:::{figure} /images/manual/stream-results-intel.*
:alt: Memory bandwidth obtained on Intel hardware (dual socket except KNL) over the
: number of processes used. One can get close to peak memory bandwidth with only a
: few processes.
:name: fig_stream_intel
:width: 80.0%

Memory bandwidth obtained on Intel hardware (dual socket except KNL)
over the number of processes used. One can get close to peak memory
bandwidth with only a few processes.
:::

As {numref}`fig_stream_intel` shows, the performance gains due to
parallelization on different multi- and many-core CPUs quickly
saturate. The reason is that only a fraction of the total number of CPU
cores is required to saturate the memory channels.
For example, a
dual-socket system equipped with Haswell 12-core Xeon CPUs achieves more
than 80 percent of the achievable peak memory bandwidth with only four
processes per socket (8 total), cf. {numref}`fig_stream_intel`.
Consequently, running with more than 8 MPI ranks on such a system will
not increase performance substantially. For the same reason, PETSc-based
applications usually do not benefit from hyper-threading.

PETSc provides a simple way to measure memory bandwidth for different
numbers of processes via the target `make streams` executed from
`$PETSC_DIR`. The output provides an overview of the possible speedup
one can obtain on the given machine (not necessarily a shared memory
system). For example, the following is the most relevant output obtained
on a dual-socket system equipped with two six-core CPUs with
hyperthreading:

```none
np  speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
...
21 3.82
22 3.49
23 3.79
24 3.71
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
```

On this machine, one should expect a speedup of typical memory
bandwidth-bound PETSc applications of at most 4x when running multiple
MPI ranks on the node.
Most of the gains are already obtained when
running with only 4-6 ranks. Because a smaller number of MPI ranks
usually implies better preconditioners and better performance for
smaller problems, the best performance for PETSc applications may be
obtained with fewer ranks than there are physical CPU cores available.

Following the results from the above run of `make streams`, we
recommend using additional nodes instead of placing additional MPI
ranks on the nodes. In particular, weak scaling (i.e. constant load per
process, increasing the number of processes) and strong scaling
(i.e. constant total work, increasing the number of processes) studies
should keep the number of processes per node constant.

### Non-Uniform Memory Access (NUMA) and Process Placement

CPUs in nodes with more than one CPU socket are internally connected via
a high-speed fabric, cf. {numref}`fig_numa`, to enable data
exchange as well as cache coherency. Because main memory on modern
systems is connected via the integrated memory controllers on each CPU,
memory is accessed in a non-uniform way: a process running on one socket
has direct access to the memory channels of the respective CPU, whereas
requests for memory attached to a different CPU socket need to go
through the high-speed fabric. Consequently, the best aggregate memory
bandwidth on the node is obtained when the memory controllers on each
CPU are fully saturated.
However, full saturation of memory channels is
only possible if the data is distributed across the different memory
channels.

:::{figure} /images/manual/numa.*
:alt: Schematic of a two-socket NUMA system. Processes should be spread across both
: CPUs to obtain full bandwidth.
:name: fig_numa
:width: 90.0%

Schematic of a two-socket NUMA system. Processes should be spread
across both CPUs to obtain full bandwidth.
:::

Data in memory on modern machines is allocated by the operating system
based on a first-touch policy. That is, memory is not allocated at the
point of issuing `malloc()`, but at the point when the respective
memory segment is actually touched (read or write). Upon first touch,
memory is allocated on the memory channel associated with the
CPU the process is running on. Only if all memory on that CPU
is already in use (either allocated or as IO cache) is memory available
through other sockets considered.

Maximum memory bandwidth can be achieved by ensuring that processes are
spread over all sockets in the respective node. For example, the
recommended placement of an 8-way parallel run on a four-socket machine
is to assign two processes to each CPU socket. To do so, one needs to
know the enumeration of cores and pass the requested information to
`mpiexec`.
Consider the hardware topology information returned by
`lstopo` (part of the hwloc package) for the following two-socket
machine, in which each CPU consists of six cores and supports
hyperthreading:

```none
Machine (126GB total)
  NUMANode L#0 (P#0 63GB)
    Package L#0 + L3 L#0 (15MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#12)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#13)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#14)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#15)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#16)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#17)
  NUMANode L#1 (P#1 63GB)
    Package L#1 + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#18)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#19)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#20)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#21)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#22)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#23)
```

The relevant physical processor IDs are shown in parentheses prefixed by
`P#`. Here, IDs 0 and 12 share the same physical core and have a
common L2 cache. IDs 0, 12, 1, 13, 2, 14, 3, 15, 4, 16, 5, 17 share the
same socket and have a common L3 cache.

A good placement for a run with six processes is to locate three
processes on the first socket and three processes on the second socket.
Unfortunately, mechanisms for process placement vary across MPI
implementations, so make sure to consult the manual of your MPI
implementation.
The following discussion is based on how processor
placement is done with MPICH and Open MPI, where one needs to pass
`--bind-to core --map-by socket` to `mpiexec`:

```console
$ mpiexec -n 6 --bind-to core --map-by socket ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 000000100000000000100000
process 2 binding: 010000000000010000000000
process 3 binding: 000000010000000000010000
process 4 binding: 001000000000001000000000
process 5 binding: 000000001000000000001000
Triad: 45403.1949 Rate (MB/s)
```

In this configuration, process 0 is bound to the first physical core on
the first socket (with IDs 0 and 12), process 1 is bound to the first
core on the second socket (IDs 6 and 18), and similarly for the
remaining processes. The achieved bandwidth of 45 GB/sec is close to the
practical peak of about 50 GB/sec available on the machine.
If, however,
all MPI processes are located on the same socket, memory bandwidth drops
significantly:

```console
$ mpiexec -n 6 --bind-to core --map-by core ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 010000000000010000000000
process 2 binding: 001000000000001000000000
process 3 binding: 000100000000000100000000
process 4 binding: 000010000000000010000000
process 5 binding: 000001000000000001000000
Triad: 25510.7507 Rate (MB/s)
```

All processes are now mapped to cores on the same socket. As a result,
only the memory channels of the first socket are saturated, yielding
25.5 GB/sec.

One must not assume that `mpiexec` uses good defaults.
To
demonstrate, compare the full output of `make streams` from
{any}`subsec_bandwidth_vs_processes` first, followed by
the results obtained by passing `--bind-to core --map-by socket`:

```console
$ make streams
np  speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
7 2.31
8 2.42
9 2.37
10 2.65
11 2.3
12 2.53
13 2.43
14 2.63
15 2.74
16 2.7
17 3.28
18 3.66
19 3.95
20 3.07
21 3.82
22 3.49
23 3.79
24 3.71
```

```console
$ make streams MPI_BINDING="--bind-to core --map-by socket"
np  speedup
1 1.0
2 1.59
3 2.66
4 3.5
5 3.56
6 4.23
7 3.95
8 4.39
9 4.09
10 4.46
11 4.15
12 4.42
13 3.71
14 3.83
15 4.08
16 4.22
17 4.18
18 4.31
19 4.22
20 4.28
21 4.25
22 4.23
23 4.28
24 4.22
```

For the non-optimized version shown first, the speedup obtained when
using any number of processes between 3 and 13 is essentially constant
up to fluctuations, indicating that all processes were by default
executed on the same socket. Only with 14 or more processes does the
speedup increase again. In contrast, the results of `make streams`
with proper processor placement shown second exhibit slightly higher
overall parallel speedup (identical baselines), smaller performance
fluctuations, and more than 90 percent of peak bandwidth with only six
processes.

Machines with job submission systems such as SLURM usually provide
similar mechanisms for processor placement through options specified in
job submission scripts. Please consult the respective manuals.

#### Additional Process Placement Considerations and Details

For a typical, memory bandwidth-limited PETSc application, the primary
consideration in placing MPI processes is ensuring that processes are
evenly distributed among sockets, and hence using all available memory
channels. Increasingly complex processor designs and cache hierarchies,
however, mean that performance may also be sensitive to how processes
are bound to the resources within each socket.
Performance on the
two-processor machine in the preceding example may be relatively
insensitive to such placement decisions, because one L3 cache is shared
by all cores within a NUMA domain, and each core has its own L2 and L1
caches. However, processors that are less “flat”, with more complex
hierarchies, may be more sensitive. In many AMD Opterons or the
second-generation “Knights Landing” Intel Xeon Phi, for instance, L2
caches are shared between two cores. On these processors, placing
consecutive MPI ranks on cores that share the same L2 cache may benefit
performance if the two ranks communicate frequently with each other,
because the latency between cores sharing an L2 cache may be roughly
half that of two cores not sharing one. There may be benefit, however,
in placing consecutive ranks on cores that do not share an L2 cache,
because (if there are fewer MPI ranks than cores) this increases the
total L2 cache capacity and bandwidth available to the application.
There is a trade-off to be considered between placing processes close
together (in terms of shared resources) to optimize for efficient
communication and synchronization vs. farther apart to maximize
available resources (memory channels, caches, I/O channels, etc.), and
the best strategy will depend on the application and the software and
hardware stack.
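
The mechanics depend on the MPI implementation. With Open MPI, for
instance, the trade-off described above can be sketched directly on the
command line (the supported `--map-by` object names and the exact
bindings reported vary between MPI versions and machines, so treat the
following as illustrative):

```console
$ mpiexec -n 4 --report-bindings --map-by l2cache ./stream  # consecutive ranks share resources
$ mpiexec -n 4 --report-bindings --map-by socket ./stream   # consecutive ranks spread apart
```

Comparing the reported bindings and the resulting bandwidth for both
mappings is a quick way to see which strategy suits a given application.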

Different process placement strategies can affect performance at least
as much as some commonly explored settings, such as compiler
optimization levels. Unfortunately, exploration of this space is
complicated by two factors: First, processor and core numberings may be
completely arbitrary, changing with BIOS version, etc., and second—as
already noted—there is no standard mechanism used by MPI implementations
(or job schedulers) to specify process affinity. To overcome the first
issue, we recommend using the `lstopo` utility of the Portable
Hardware Locality (`hwloc`) software package (which can be installed
by configuring PETSc with `--download-hwloc`) to understand the
processor topology of your machine. We cannot fully address the second
issue—consult the documentation for your MPI implementation and/or job
scheduler—but we offer some general observations on understanding
placement options:

- An MPI implementation may support a notion of *domains* in which a
  process may be pinned. A domain may simply correspond to a single
  core; however, the MPI implementation may allow a great deal of
  flexibility in specifying domains that encompass multiple cores, span
  sockets, etc. Some implementations, such as Intel MPI, provide means
  to specify whether domains should be “compact”—composed of cores
  sharing resources such as caches—or “scatter”-ed, with little resource
  sharing (possibly even spanning sockets).
3697f296bb3SBarry Smith- Separate from the specification of domains, MPI implementations often 3707f296bb3SBarry Smith support different *orderings* in which MPI ranks should be bound to 3717f296bb3SBarry Smith these domains. Intel MPI, for instance, supports “compact” ordering 3727f296bb3SBarry Smith to place consecutive ranks close in terms of shared resources, 3737f296bb3SBarry Smith “scatter” to place them far apart, and “bunch” to map proportionally 3747f296bb3SBarry Smith to sockets while placing ranks as close together as possible within 3757f296bb3SBarry Smith the sockets. 3767f296bb3SBarry Smith- An MPI implementation that supports process pinning should offer some 3777f296bb3SBarry Smith way to view the rank assignments. Use this output in conjunction with 3787f296bb3SBarry Smith the topology obtained via `lstopo` or a similar tool to determine 3797f296bb3SBarry Smith if the placements correspond to something you believe is reasonable 3807f296bb3SBarry Smith for your application. Do not assume that the MPI implementation is 3817f296bb3SBarry Smith doing something sensible by default! 3827f296bb3SBarry Smith 3837f296bb3SBarry Smith## Performance Pitfalls and Advice 3847f296bb3SBarry Smith 3857f296bb3SBarry SmithThis section looks into a potpourri of performance pitfalls encountered 3867f296bb3SBarry Smithby users in the past. Many of these pitfalls require a deeper 3877f296bb3SBarry Smithunderstanding of the system and experience to detect. The purpose of 3887f296bb3SBarry Smiththis section is to summarize and share our experience so that these 3897f296bb3SBarry Smithpitfalls can be avoided in the future. 3907f296bb3SBarry Smith 3917f296bb3SBarry Smith### Debug vs. Optimized Builds 3927f296bb3SBarry Smith 3937f296bb3SBarry SmithPETSc’s `configure` defaults to building PETSc with debug mode 3947f296bb3SBarry Smithenabled. 
Any code development should be done in this mode, because it
provides handy debugging facilities such as accurate stack traces,
memory leak checks, and memory corruption checks. Note that PETSc has no
reliable way of knowing whether a particular run is a production or
debug run. In the case that a user requests profiling information via
`-log_view`, a debug build of PETSc issues the following warning:

```none
##########################################################
#                                                        #
#                       WARNING!!!                       #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run configure                  #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################
```

Conversely, one way of checking whether a particular build of PETSc has
debugging enabled is to inspect the output of `-log_view`.

Debug mode will generally be most useful for code development if
appropriate compiler options are set to facilitate debugging.
The
compiler should be instructed to generate binaries with debug symbols
(command line option `-g` for most compilers), and the optimization
level chosen should either completely disable optimizations (`-O0` for
most compilers) or enable only optimizations that do not interfere with
debugging (GCC, for instance, supports a `-Og` optimization level that
does this).

Only once the new code is thoroughly tested and ready for production
should one disable the debugging facilities by passing
`--with-debugging=no` to `configure`. One should also ensure that an
appropriate compiler optimization level is set. Note that some compilers
(e.g., Intel) default to fairly comprehensive optimization levels, while
others (e.g., GCC) default to no optimization at all. The best
optimization flags will depend on your code, the compiler, and the
target architecture, but we offer a few guidelines for finding those
that will offer the best performance:

- Most compilers have a number of optimization levels (with level n
  usually specified via `-On`) that provide a quick way to enable
  sets of several optimization flags. We suggest trying the higher
  optimization levels (the highest level is not guaranteed to produce
  the fastest executable, so some experimentation may be merited).
  With
  most recent processors now supporting some form of SIMD or vector
  instructions, it is important to choose a level that enables the
  compiler’s auto-vectorizer; many compilers do not enable
  auto-vectorization at lower optimization levels (e.g., GCC does not
  enable it below `-O3` and the Intel compiler does not enable it
  below `-O2`).
- For processors supporting newer vector instruction sets, such as
  Intel AVX2 and AVX-512, it is also important to direct the compiler
  to generate code that targets these processors (e.g.,
  `-march=native`); otherwise, the executables built will not
  utilize the newer instruction sets and will not take advantage of
  the vector processing units.
- Beyond choosing the optimization levels, some value-unsafe
  optimizations (such as using reciprocals of values instead of
  dividing by those values, or allowing re-association of operands in a
  series of calculations) for floating point calculations may yield
  significant performance gains. Compilers often provide flags (e.g.,
  `-ffast-math` in GCC) to enable a set of these optimizations, and
  they may be turned on when using options for very aggressive
  optimization (`-fast` or `-Ofast` in many compilers). These are
  worth exploring to maximize performance, but, if employed, it is
  important to verify that these do not cause erroneous results with
  your code, since calculations may violate the IEEE standard for
  floating-point arithmetic.
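
Putting these guidelines together: with PETSc, optimization flags are
typically passed at configure time via `COPTFLAGS`, `CXXOPTFLAGS`, and
`FOPTFLAGS`. The following is a sketch only; the right flags depend on
your compiler and target machine:

```console
$ ./configure --with-debugging=no COPTFLAGS='-O3 -march=native' \
              CXXOPTFLAGS='-O3 -march=native' FOPTFLAGS='-O3 -march=native'
```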

### Profiling

Users should not spend time optimizing a code until after having
determined where it spends the bulk of its time on realistically sized
problems. As discussed in detail in {any}`ch_profiling`, the
PETSc routines automatically log performance data if certain runtime
options are specified.

To obtain a summary of where and how much time is spent in different
sections of the code, use one of the following options:

- Run the code with the option `-log_view` to print a performance
  summary for various phases of the code.
- Run the code with the option `-log_mpe [logfilename]`, which
  creates a logfile of events suitable for viewing with Jumpshot (part
  of MPICH).

Then, focus on the sections where most of the time is spent. If you
provided your own callback routines, e.g. for residual evaluations,
search the profiling output for routines such as `SNESFunctionEval` or
`SNESJacobianEval`. If their relative time is significant (say, more
than 30 percent), consider optimizing these routines first. Generic
instructions on how to optimize your callback functions are difficult;
you may start by reading performance optimization guides for your
system’s hardware.
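
For example, assuming a hypothetical application binary `./app`, the
callback-related lines of the summary can be extracted as follows:

```console
$ ./app -log_view | grep -E "SNESFunctionEval|SNESJacobianEval"
```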

### Aggregation

Performing operations on chunks of data rather than a single element at
a time can significantly enhance performance because of cache reuse and
lower data motion. Typical examples are:

- Insert several (many) elements of a matrix or vector at once, rather
  than looping and inserting a single value at a time. In order to
  access elements of a vector repeatedly, employ `VecGetArray()` to
  allow direct manipulation of the vector elements.
- When possible, use `VecMDot()` rather than a series of calls to
  `VecDot()`.
- If you require a sequence of matrix-vector products with the same
  matrix, consider packing your vectors into a single matrix and using
  matrix-matrix multiplications.
- Users should employ a reasonable number of `PetscMalloc()` calls in
  their codes. Hundreds or thousands of memory allocations may be
  appropriate; however, if tens of thousands are being used, then
  reducing the number of `PetscMalloc()` calls may be warranted. For
  example, reusing space or allocating large chunks and dividing them
  into pieces can produce significant savings in allocation overhead.
  {any}`sec_dsreuse` gives details.

Aggressive aggregation of data may result in inflexible data structures
and code that is hard to maintain. We advise users to keep these
competing goals in mind and not blindly optimize for performance only.

(sec_symbolfactor)=

### Memory Allocation for Sparse Matrix Factorization

When symbolically factoring an AIJ matrix, PETSc has to guess how much
fill there will be. Careful use of the fill parameter in the
`MatFactorInfo` structure when calling `MatLUFactorSymbolic()` or
`MatILUFactorSymbolic()` can greatly reduce the number of mallocs and
copies required, and thus greatly improve the performance of the
factorization. One way to determine a good value for the fill parameter
is to run a program with the option `-info`. The symbolic
factorization phase will then print information such as

```none
Info:MatILUFactorSymbolic_SeqAIJ:Reallocs 12 Fill ratio:given 1 needed 2.16423
```

This indicates that the user should have used a fill estimate factor of
about 2.17 (instead of 1) to prevent the 12 required mallocs and copies.
The command line option

```none
-pc_factor_fill 2.17
```

will cause PETSc to preallocate the correct amount of space for
the factorization.

(detecting_memory_problems)=

### Detecting Memory Allocation Problems and Memory Usage

PETSc provides tools to aid in understanding PETSc memory usage and detecting problems with
memory allocation, including leaks and use of uninitialized space. Internally, PETSc uses
the routines `PetscMalloc()` and `PetscFree()` for memory allocation instead of directly calling `malloc()` and `free()`.
This allows PETSc to track its memory usage and perform error checking. Users are urged to use these routines as well when
appropriate.

- The option `-malloc_debug` turns on PETSc's extensive runtime error checking of memory for corruption.
  This checking can be expensive, so it should not be used for
  production runs. The option `-malloc_test` is equivalent to `-malloc_debug`
  but only works when PETSc is configured with `--with-debugging` (the default configuration).
  We suggest setting the environment variable `PETSC_OPTIONS=-malloc_test`
  in your shell startup file to automatically enable runtime memory checking when developing code but not
  when running optimized code. Using `-malloc_debug` or `-malloc_test` for large runs can slow them significantly, thus we
  recommend turning them off if your code is painfully slow and you don't need the testing. In addition, you can use
  `-check_pointer_intensity 0` for long debug runs that do not need extensive memory corruption testing. This option
  is occasionally added to the `PETSC_OPTIONS` environment variable by some users.
- The option `-malloc_dump` will print a list of memory locations that have not been freed at the
  conclusion of a program. If all memory has been freed, no message
  is printed. Note that
  the option `-malloc_dump` activates a call to
  `PetscMallocDump()` during `PetscFinalize()`. The user can also
  call `PetscMallocDump()` elsewhere in a program.
- Another useful option
  is `-malloc_view`, which reports memory usage in all routines at the conclusion of the program.
  Note that this option
  activates logging by calling `PetscMallocViewSet()` in
  `PetscInitialize()` and then prints the log by calling
  `PetscMallocView()` in `PetscFinalize()`. The user can also call
  these routines elsewhere in a program.
- When finer granularity is
  desired, the user can call `PetscMallocGetCurrentUsage()` and
  `PetscMallocGetMaximumUsage()` for memory allocated by PETSc, or
  `PetscMemoryGetCurrentUsage()` and `PetscMemoryGetMaximumUsage()`
  for the total memory used by the program. Note that
  `PetscMemorySetGetMaximumUsage()` must be called before
  `PetscMemoryGetMaximumUsage()` (typically at the beginning of the
  program).
- The option `-memory_view` provides a high-level view of all memory usage,
  not just the memory used by `PetscMalloc()`, at the conclusion of the program.
- When running with `-log_view`, the additional option `-log_view_memory`
  causes the display of additional columns of information about how much
  memory was allocated and freed during each logged event. This is useful
  for understanding which phases of a computation require the most memory.

One can also use [Valgrind](http://valgrind.org) to track memory usage and find bugs; see {any}`FAQ: Valgrind usage<valgrind>`.

(sec_dsreuse)=

### Data Structure Reuse

Data structures should be reused whenever possible. For example, if a
code often creates new matrices or vectors, there often may be a way to
reuse some of them. Very significant performance improvements can be
achieved by reusing matrix data structures with the same nonzero
pattern. If a code creates thousands of matrix or vector objects,
performance will be degraded. For example, when solving a nonlinear
problem or timestepping, reusing the matrices and their nonzero
structure for many steps when appropriate can make the code run
significantly faster.

A simple technique for saving work vectors, matrices, etc. is employing
a user-defined context. In C and C++ such a context is merely a
structure in which various objects can be stashed; in Fortran a user
context can be an integer array that contains both parameters and
pointers to PETSc objects.
See
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5.c.html">SNES Tutorial ex5</a>
and
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5f90.F90.html">SNES Tutorial ex5f90</a>
for examples of user-defined application contexts in C and Fortran,
respectively.

### Numerical Experiments

PETSc users should run a variety of tests. For example, there are a
large number of options for the linear and nonlinear equation solvers in
PETSc, and different choices can make a *very* big difference in
convergence rates and execution times. PETSc employs defaults that are
generally reasonable for a wide range of problems, but clearly these
defaults cannot be best for all cases. Users should experiment with many
combinations to determine what is best for a given problem and customize
the solvers accordingly.

- Use the options `-snes_view`, `-ksp_view`, etc. (or the routines
  `KSPView()`, `SNESView()`, etc.) to view the options that have
  been used for a particular solver.
- Run the code with the option `-help` for a list of the available
  runtime commands.
- Use the option `-info` to print details about the solvers’
  operation.
- Use the PETSc monitoring discussed in {any}`ch_profiling`
  to evaluate the performance of various numerical methods.
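
For example (the specific solver choices below are only illustrative), the
defaults can be compared against another Krylov method and preconditioner
entirely from the command line:

```none
./myapp -ksp_type bcgs -pc_type asm -ksp_monitor -log_view
```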

(sec_slestips)=

### Tips for Efficient Use of Linear Solvers

As discussed in {any}`ch_ksp`, the default linear
solvers are

- uniprocess: GMRES(30) with ILU(0) preconditioning
- multiprocess: GMRES(30) with block Jacobi preconditioning, where
  there is 1 block per process, and each block is solved with ILU(0)

One should experiment to determine alternatives that may be better for
various applications. Recall that one can specify the `KSP` methods
and preconditioners at runtime via the options:

```none
-ksp_type <ksp_name> -pc_type <pc_name>
```

One can also specify a variety of runtime customizations for the
solvers, as discussed throughout the manual.

In particular, note that the default restart parameter for GMRES is 30,
which may be too small for some large-scale problems. One can alter this
parameter with the option `-ksp_gmres_restart <restart>` or by calling
`KSPGMRESSetRestart()`. {any}`sec_ksp` gives
information on setting alternative GMRES orthogonalization routines,
which may provide much better parallel performance.

For elliptic problems one often obtains good performance and scalability
with multigrid solvers. Consult {any}`sec_amg` for
available options.
Our experience is that GAMG works particularly well
for elasticity problems, whereas hypre does well for scalar problems.

### System-Related Problems

The performance of a code can be affected by a variety of factors,
including the cache behavior, other users on the machine, etc. Below we
briefly describe some common problems and possibilities for overcoming
them.

- **Problem too large for physical memory size**: When timing a
  program, one should always leave at least a ten percent margin
  between the total memory a process is using and the physical size of
  the machine’s memory. One way to estimate the amount of memory used
  by a given process is with the Unix `getrusage` system routine.
  The PETSc option `-malloc_view` reports all
  memory usage, including any Fortran arrays in an application code.
- **Effects of other users**: If other users are running jobs on the
  same physical processor nodes on which a program is being profiled,
  the timing results are essentially meaningless.
- **Overhead of timing routines on certain machines**: On certain
  machines, even calling the system clock in order to time routines is
  slow; this skews all of the flop rates and timing results.
The file
  `$PETSC_DIR/src/benchmarks/PetscTime.c` (<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/benchmarks/PetscTime.c.html">source</a>)
  contains a simple test problem that will approximate the amount of
  time required to get the current time in a running program. On good
  systems it will be on the order of $10^{-6}$ seconds or less.
- **Problem too large for good cache performance**: Certain machines
  with lower memory bandwidths (slow memory access) attempt to
  compensate by having a very large cache. Thus, if a significant
  portion of an application fits within the cache, the program will
  achieve very good performance; if the code is too large, the
  performance can degrade markedly. To analyze whether this situation
  affects a particular code, one can try plotting the total flop rate
  as a function of problem size. If the flop rate decreases rapidly at
  some point, then the problem is likely too large for the cache
  size.
- **Inconsistent timings**: Inconsistent timings are likely due to
  other users on the machine, thrashing (using more virtual memory than
  available physical memory), or paging in of the initial executable.
  {any}`sec_profaccuracy` provides information on
  overcoming paging overhead when profiling a code. We have found on
  all systems that if you follow all the advice above, your timings will
  be consistent, with a variation of less than five percent.