(ch_performance)=

# Hints for Performance Tuning

This chapter provides hints on how to achieve the best performance
with PETSc, particularly on distributed-memory machines with multiple
CPU sockets per node. We focus on machine-related performance
optimization here; algorithmic aspects like preconditioner selection are
not the focus of this section.

## Maximizing Memory Bandwidth

Most operations in PETSc deal with large datasets (typically vectors and
sparse matrices) and perform relatively few arithmetic operations for
each byte loaded or stored from global memory. Therefore, the
*arithmetic intensity*, expressed as the ratio of floating point
operations to the number of bytes loaded and stored, is usually well
below unity for typical PETSc operations. On the other hand, modern CPUs
are able to execute on the order of 10 floating point operations for
each byte loaded or stored. As a consequence, almost all PETSc
operations are limited by the rate at which data can be loaded or stored
(*memory bandwidth limited*) rather than by the rate of floating point
operations.

This section discusses ways to maximize the memory bandwidth achieved by
applications based on PETSc.
Where appropriate, we include benchmark
results in order to provide quantitative results on typical performance
gains one can achieve through parallelization, both on a single compute
node and across nodes. In particular, we start with the answer to the
common question of why performance generally does not increase 20-fold
with a 20-core CPU.

(subsec_bandwidth_vs_processes)=

### Memory Bandwidth vs. Processes

Consider the addition of two large vectors, with the result written to a
third vector. Because there are no dependencies across the different
entries of each vector, the operation is embarrassingly parallel.

:::{figure} /images/manual/stream-results-intel.*
:alt: Memory bandwidth obtained on Intel hardware (dual socket except KNL) over the
: number of processes used. One can get close to peak memory bandwidth with only a
: few processes.
:name: fig_stream_intel
:width: 80.0%

Memory bandwidth obtained on Intel hardware (dual socket except KNL)
over the number of processes used. One can get close to peak memory
bandwidth with only a few processes.
:::

As {numref}`fig_stream_intel` shows, the performance gains due to
parallelization on different multi- and many-core CPUs quickly
saturate. The reason is that only a fraction of the total number of CPU
cores is required to saturate the memory channels.
For example, a
dual-socket system equipped with Haswell 12-core Xeon CPUs achieves more
than 80 percent of the achievable peak memory bandwidth with only four
processes per socket (8 total), cf. {numref}`fig_stream_intel`.
Consequently, running with more than 8 MPI ranks on such a system will
not increase performance substantially. For the same reason, PETSc-based
applications usually do not benefit from hyper-threading.

PETSc provides a simple way to measure memory bandwidth for different
numbers of processes via the target `make streams` executed from
`$PETSC_DIR`. The output provides an overview of the possible speedup
one can obtain on the given machine (not necessarily a shared memory
system). For example, the following is the most relevant output obtained
on a dual-socket system equipped with two six-core CPUs with
hyperthreading:

```none
np  speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
...
21 3.82
22 3.49
23 3.79
24 3.71
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
```

On this machine, one should expect a speedup of typical memory
bandwidth-bound PETSc applications of at most 4x when running multiple
MPI ranks on the node.
Most of the gains are already obtained when
running with only 4-6 ranks. Because a smaller number of MPI ranks
usually implies better preconditioners and better performance for
smaller problems, the best performance for PETSc applications may be
obtained with fewer ranks than there are physical CPU cores available.

Following the results from the above run of `make streams`, we
recommend using additional nodes instead of placing additional MPI
ranks on the nodes. In particular, weak scaling (i.e. constant load per
process, increasing the number of processes) and strong scaling
(i.e. constant total work, increasing the number of processes) studies
should keep the number of processes per node constant.

### Non-Uniform Memory Access (NUMA) and Process Placement

CPUs in nodes with more than one CPU socket are internally connected via
a high-speed fabric, cf. {numref}`fig_numa`, to enable data
exchange as well as cache coherency. Because main memory on modern
systems is connected via the integrated memory controllers on each CPU,
memory is accessed in a non-uniform way: a process running on one socket
has direct access to the memory channels of the respective CPU, whereas
requests for memory attached to a different CPU socket need to go
through the high-speed fabric. Consequently, the best aggregate memory
bandwidth on the node is obtained when the memory controllers on each
CPU are fully saturated.
However, full saturation of memory channels is
only possible if the data is distributed across the different memory
channels.

:::{figure} /images/manual/numa.*
:alt: Schematic of a two-socket NUMA system. Processes should be spread across both
: CPUs to obtain full bandwidth.
:name: fig_numa
:width: 90.0%

Schematic of a two-socket NUMA system. Processes should be spread
across both CPUs to obtain full bandwidth.
:::

Data in memory on modern machines is allocated by the operating system
based on a first-touch policy. That is, memory is not allocated at the
point of issuing `malloc()`, but at the point when the respective
memory segment is actually touched (read or write). Upon first touch,
memory is allocated on the memory channel associated with the
CPU the process is running on. Only if all memory on that CPU
is already in use (either allocated or as IO cache) is memory available
through other sockets considered.

Maximum memory bandwidth can be achieved by ensuring that processes are
spread over all sockets in the respective node. For example, the
recommended placement of an 8-way parallel run on a four-socket machine
is to assign two processes to each CPU socket. To do so, one needs to
know the enumeration of cores and pass the requested information to
`mpiexec`.
Consider the hardware topology information returned by
`lstopo` (part of the hwloc package) for the following two-socket
machine, in which each CPU consists of six cores and supports
hyperthreading:

```none
Machine (126GB total)
  NUMANode L#0 (P#0 63GB)
    Package L#0 + L3 L#0 (15MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#12)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#13)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#14)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#15)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#16)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#17)
  NUMANode L#1 (P#1 63GB)
    Package L#1 + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#18)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#19)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#20)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#21)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#22)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#23)
```

The relevant physical processor IDs are shown in parentheses prefixed by
`P#`. Here, IDs 0 and 12 share the same physical core and have a
common L2 cache. IDs 0, 12, 1, 13, 2, 14, 3, 15, 4, 16, 5, 17 share the
same socket and have a common L3 cache.

A good placement for a run with six processes is to locate three
processes on the first socket and three processes on the second socket.
Unfortunately, mechanisms for process placement vary across MPI
implementations, so make sure to consult the manual of your MPI
implementation.
The following discussion is based on how processor
placement is done with MPICH and Open MPI, where one needs to pass
`--bind-to core --map-by socket` to `mpiexec`:

```console
$ mpiexec -n 6 --bind-to core --map-by socket ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 000000100000000000100000
process 2 binding: 010000000000010000000000
process 3 binding: 000000010000000000010000
process 4 binding: 001000000000001000000000
process 5 binding: 000000001000000000001000
Triad: 45403.1949 Rate (MB/s)
```

In this configuration, process 0 is bound to the first physical core on
the first socket (with IDs 0 and 12), process 1 is bound to the first
core on the second socket (IDs 6 and 18), and similarly for the
remaining processes. The achieved bandwidth of 45 GB/sec is close to the
practical peak of about 50 GB/sec available on the machine.
If, however,
all MPI processes are located on the same socket, memory bandwidth drops
significantly:

```console
$ mpiexec -n 6 --bind-to core --map-by core ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 010000000000010000000000
process 2 binding: 001000000000001000000000
process 3 binding: 000100000000000100000000
process 4 binding: 000010000000000010000000
process 5 binding: 000001000000000001000000
Triad: 25510.7507 Rate (MB/s)
```

All processes are now mapped to cores on the same socket. As a result,
only the memory channels of the first socket are saturated, yielding
25.5 GB/sec.

One must not assume that `mpiexec` uses good defaults.
To
demonstrate, compare the full output of `make streams` from
{any}`subsec_bandwidth_vs_processes` first, followed by
the results obtained by passing `--bind-to core --map-by socket`:

```console
$ make streams
np  speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
7 2.31
8 2.42
9 2.37
10 2.65
11 2.3
12 2.53
13 2.43
14 2.63
15 2.74
16 2.7
17 3.28
18 3.66
19 3.95
20 3.07
21 3.82
22 3.49
23 3.79
24 3.71
```

```console
$ make streams MPI_BINDING="--bind-to core --map-by socket"
np  speedup
1 1.0
2 1.59
3 2.66
4 3.5
5 3.56
6 4.23
7 3.95
8 4.39
9 4.09
10 4.46
11 4.15
12 4.42
13 3.71
14 3.83
15 4.08
16 4.22
17 4.18
18 4.31
19 4.22
20 4.28
21 4.25
22 4.23
23 4.28
24 4.22
```

For the non-optimized version shown first, the speedup obtained when
using any number of processes between 3 and 13 is essentially constant
up to fluctuations, indicating that all processes were by default
executed on the same socket. Only with 14 or more processes does the
speedup increase again. In contrast, the results of `make streams`
with proper processor placement shown second exhibit slightly higher
overall parallel speedup (identical baselines), smaller performance
fluctuations, and more than 90 percent of peak bandwidth with only six
processes.

Machines with job submission systems such as SLURM usually provide
similar mechanisms for processor placement through options specified in
job submission scripts. Please consult the respective manuals.

#### Additional Process Placement Considerations and Details

For a typical, memory bandwidth-limited PETSc application, the primary
consideration in placing MPI processes is ensuring that processes are
evenly distributed among sockets, and hence using all available memory
channels. Increasingly complex processor designs and cache hierarchies,
however, mean that performance may also be sensitive to how processes
are bound to the resources within each socket.
Performance on the
two-processor machine in the preceding example may be relatively
insensitive to such placement decisions, because one L3 cache is shared
by all cores within a NUMA domain, and each core has its own L2 and L1
caches. However, processors that are less “flat”, with more complex
hierarchies, may be more sensitive. In many AMD Opterons or the
second-generation “Knights Landing” Intel Xeon Phi, for instance, L2
caches are shared between two cores. On these processors, placing
consecutive MPI ranks on cores that share the same L2 cache may benefit
performance if the two ranks communicate frequently with each other,
because the latency between cores sharing an L2 cache may be roughly
half that of two cores not sharing one. There may be benefit, however,
in placing consecutive ranks on cores that do not share an L2 cache,
because (if there are fewer MPI ranks than cores) this increases the
total L2 cache capacity and bandwidth available to the application.
There is a trade-off to be considered between placing processes close
together (in terms of shared resources) to optimize for efficient
communication and synchronization vs. farther apart to maximize
available resources (memory channels, caches, I/O channels, etc.), and
the best strategy will depend on the application and the software and
hardware stack.
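
The mechanics depend on the MPI implementation. With Open MPI, for
instance, the trade-off described above can be sketched directly on the
command line (the supported `--map-by` object names and the exact
bindings reported vary between MPI versions and machines, so treat the
following as illustrative):

```console
$ mpiexec -n 4 --report-bindings --map-by l2cache ./stream  # consecutive ranks share resources
$ mpiexec -n 4 --report-bindings --map-by socket ./stream   # consecutive ranks spread apart
```

Comparing the reported bindings and the resulting bandwidth for both
mappings is a quick way to see which strategy suits a given application.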

Different process placement strategies can affect performance at least
as much as some commonly explored settings, such as compiler
optimization levels. Unfortunately, exploration of this space is
complicated by two factors: First, processor and core numberings may be
completely arbitrary, changing with BIOS version, etc., and second—as
already noted—there is no standard mechanism used by MPI implementations
(or job schedulers) to specify process affinity. To overcome the first
issue, we recommend using the `lstopo` utility of the Portable
Hardware Locality (`hwloc`) software package (which can be installed
by configuring PETSc with `--download-hwloc`) to understand the
processor topology of your machine. We cannot fully address the second
issue—consult the documentation for your MPI implementation and/or job
scheduler—but we offer some general observations on understanding
placement options:

- An MPI implementation may support a notion of *domains* in which a
  process may be pinned. A domain may simply correspond to a single
  core; however, the MPI implementation may allow a great deal of
  flexibility in specifying domains that encompass multiple cores, span
  sockets, etc. Some implementations, such as Intel MPI, provide means
  to specify whether domains should be “compact”—composed of cores
  sharing resources such as caches—or “scatter”-ed, with little resource
  sharing (possibly even spanning sockets).
3697f296bb3SBarry Smith- Separate from the specification of domains, MPI implementations often 3707f296bb3SBarry Smith support different *orderings* in which MPI ranks should be bound to 3717f296bb3SBarry Smith these domains. Intel MPI, for instance, supports “compact” ordering 3727f296bb3SBarry Smith to place consecutive ranks close in terms of shared resources, 3737f296bb3SBarry Smith “scatter” to place them far apart, and “bunch” to map proportionally 3747f296bb3SBarry Smith to sockets while placing ranks as close together as possible within 3757f296bb3SBarry Smith the sockets. 3767f296bb3SBarry Smith- An MPI implementation that supports process pinning should offer some 3777f296bb3SBarry Smith way to view the rank assignments. Use this output in conjunction with 3787f296bb3SBarry Smith the topology obtained via `lstopo` or a similar tool to determine 3797f296bb3SBarry Smith if the placements correspond to something you believe is reasonable 3807f296bb3SBarry Smith for your application. Do not assume that the MPI implementation is 3817f296bb3SBarry Smith doing something sensible by default! 3827f296bb3SBarry Smith 3837f296bb3SBarry Smith## Performance Pitfalls and Advice 3847f296bb3SBarry Smith 3857f296bb3SBarry SmithThis section looks into a potpourri of performance pitfalls encountered 3867f296bb3SBarry Smithby users in the past. Many of these pitfalls require a deeper 3877f296bb3SBarry Smithunderstanding of the system and experience to detect. The purpose of 3887f296bb3SBarry Smiththis section is to summarize and share our experience so that these 3897f296bb3SBarry Smithpitfalls can be avoided in the future. 3907f296bb3SBarry Smith 3917f296bb3SBarry Smith### Debug vs. Optimized Builds 3927f296bb3SBarry Smith 3937f296bb3SBarry SmithPETSc’s `configure` defaults to building PETSc with debug mode 3947f296bb3SBarry Smithenabled. 
Any code development should be done in this mode, because it
provides handy debugging facilities such as accurate stack traces,
memory leak checks, and memory corruption checks. Note that PETSc has no
reliable way of knowing whether a particular run is a production or
debug run. In the case that a user requests profiling information via
`-log_view`, a debug build of PETSc issues the following warning:

```none
##########################################################
#                                                        #
#                       WARNING!!!                       #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run configure                  #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################
```

Conversely, one way of checking whether a particular build of PETSc has
debugging enabled is to inspect the output of `-log_view`.

Debug mode will generally be most useful for code development if
appropriate compiler options are set to facilitate debugging.
The
compiler should be instructed to generate binaries with debug symbols
(command line option `-g` for most compilers), and the optimization
level chosen should either completely disable optimizations (`-O0` for
most compilers) or enable only optimizations that do not interfere with
debugging (GCC, for instance, supports a `-Og` optimization level that
does this).

Only once the new code is thoroughly tested and ready for production
should one disable the debugging facilities by passing
`--with-debugging=no` to `configure`. One should also ensure that an
appropriate compiler optimization level is set. Note that some compilers
(e.g., Intel) default to fairly comprehensive optimization levels, while
others (e.g., GCC) default to no optimization at all. The best
optimization flags will depend on your code, the compiler, and the
target architecture, but we offer a few guidelines for finding those
that will offer the best performance:

- Most compilers have a number of optimization levels (with level n
  usually specified via `-On`) that provide a quick way to enable
  sets of several optimization flags. We suggest trying the higher
  optimization levels (the highest level is not guaranteed to produce
  the fastest executable, so some experimentation may be merited).
  With
  most recent processors now supporting some form of SIMD or vector
  instructions, it is important to choose a level that enables the
  compiler’s auto-vectorizer; many compilers do not enable
  auto-vectorization at lower optimization levels (e.g., GCC does not
  enable it below `-O3` and the Intel compiler does not enable it
  below `-O2`).
- For processors supporting newer vector instruction sets, such as
  Intel AVX2 and AVX-512, it is also important to direct the compiler
  to generate code that targets these processors (e.g.,
  `-march=native`); otherwise, the executables built will not
  utilize the newer instruction sets and will not take advantage of
  the vector processing units.
- Beyond choosing the optimization levels, some value-unsafe
  optimizations (such as using reciprocals of values instead of
  dividing by those values, or allowing re-association of operands in a
  series of calculations) for floating point calculations may yield
  significant performance gains. Compilers often provide flags (e.g.,
  `-ffast-math` in GCC) to enable a set of these optimizations, and
  they may be turned on when using options for very aggressive
  optimization (`-fast` or `-Ofast` in many compilers). These are
  worth exploring to maximize performance, but, if employed, it is
  important to verify that these do not cause erroneous results with
  your code, since calculations may violate the IEEE standard for
  floating-point arithmetic.
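
Putting these guidelines together: with PETSc, optimization flags are
typically passed at configure time via `COPTFLAGS`, `CXXOPTFLAGS`, and
`FOPTFLAGS`. The following is a sketch only; the right flags depend on
your compiler and target machine:

```console
$ ./configure --with-debugging=no COPTFLAGS='-O3 -march=native' \
              CXXOPTFLAGS='-O3 -march=native' FOPTFLAGS='-O3 -march=native'
```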

### Profiling

Users should not spend time optimizing a code until after having
determined where it spends the bulk of its time on realistically sized
problems. As discussed in detail in {any}`ch_profiling`, the
PETSc routines automatically log performance data if certain runtime
options are specified.

To obtain a summary of where and how much time is spent in different
sections of the code, use one of the following options:

- Run the code with the option `-log_view` to print a performance
  summary for various phases of the code.
- Run the code with the option `-log_mpe [logfilename]`, which
  creates a logfile of events suitable for viewing with Jumpshot (part
  of MPICH).

Then, focus on the sections where most of the time is spent. If you
provided your own callback routines, e.g. for residual evaluations,
search the profiling output for routines such as `SNESFunctionEval` or
`SNESJacobianEval`. If their relative time is significant (say, more
than 30 percent), consider optimizing these routines first. Generic
instructions on how to optimize your callback functions are difficult;
you may start by reading performance optimization guides for your
system’s hardware.
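
For example, assuming a hypothetical application binary `./app`, the
callback-related lines of the summary can be extracted as follows:

```console
$ ./app -log_view | grep -E "SNESFunctionEval|SNESJacobianEval"
```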

### Aggregation

Performing operations on chunks of data rather than a single element at
a time can significantly enhance performance because of cache reuse and
lower data motion. Typical examples are:

- Insert several (many) elements of a matrix or vector at once, rather
  than looping and inserting a single value at a time. In order to
  access elements of a vector repeatedly, employ `VecGetArray()` to
  allow direct manipulation of the vector elements.
- When possible, use `VecMDot()` rather than a series of calls to
  `VecDot()`.
- If you require a sequence of matrix-vector products with the same
  matrix, consider packing your vectors into a single matrix and using
  matrix-matrix multiplications.
- Users should employ a reasonable number of `PetscMalloc()` calls in
  their codes. Hundreds or thousands of memory allocations may be
  appropriate; however, if tens of thousands are being used, then
  reducing the number of `PetscMalloc()` calls may be warranted. For
  example, reusing space or allocating large chunks and dividing them
  into pieces can produce significant savings in allocation overhead.
  {any}`sec_dsreuse` gives details.

Aggressive aggregation of data may result in inflexible data structures
and code that is hard to maintain. We advise users to keep these
competing goals in mind and not blindly optimize for performance only.

(sec_symbolfactor)=

### Memory Allocation for Sparse Matrix Factorization

When symbolically factoring an AIJ matrix, PETSc has to guess how much
fill there will be. Careful use of the fill parameter in the
`MatFactorInfo` structure when calling `MatLUFactorSymbolic()` or
`MatILUFactorSymbolic()` can greatly reduce the number of mallocs and
copies required, and thus greatly improve the performance of the
factorization. One way to determine a good value for the fill parameter
is to run a program with the option `-info`. The symbolic
factorization phase will then print information such as

```none
Info:MatILUFactorSymbolic_SeqAIJ:Reallocs 12 Fill ratio:given 1 needed 2.16423
```

This indicates that the user should have used a fill estimate factor of
about 2.17 (instead of 1) to prevent the 12 required mallocs and copies.
The command line option

```none
-pc_factor_fill 2.17
```

will cause PETSc to preallocate the correct amount of space for
the factorization.

(detecting_memory_problems)=

### Detecting Memory Allocation Problems and Memory Usage

PETSc provides tools to aid in understanding PETSc memory usage and detecting problems with
memory allocation, including leaks and use of uninitialized space. Internally, PETSc uses
the routines `PetscMalloc()` and `PetscFree()` for memory allocation instead of directly calling `malloc()` and `free()`.
This allows PETSc to track its memory usage and perform error checking. Users are urged to use these routines as well when
appropriate.

- The option `-malloc_debug` turns on PETSc's extensive runtime error checking of memory for corruption.
  This checking can be expensive, so it should not be used for
  production runs. The option `-malloc_test` is equivalent to `-malloc_debug`
  but only works when PETSc is configured with `--with-debugging` (the default configuration).
  We suggest setting the environment variable `PETSC_OPTIONS=-malloc_test`
  in your shell startup file to automatically enable runtime memory checking when developing code but not
  when running optimized code. Using `-malloc_debug` or `-malloc_test` for large runs can slow them significantly, thus we
  recommend turning them off if your code is painfully slow and you don't need the testing. In addition, you can use
  `-check_pointer_intensity 0` for long debug runs that do not need extensive memory corruption testing. This option
  is occasionally added to the `PETSC_OPTIONS` environment variable by some users.
- The option `-malloc_dump` will print a list of memory locations that have not been freed at the
  conclusion of a program. If all memory has been freed, no message
  is printed. Note that
  the option `-malloc_dump` activates a call to
  `PetscMallocDump()` during `PetscFinalize()`. The user can also
  call `PetscMallocDump()` elsewhere in a program.
- Another useful option
  is `-malloc_view`, which reports memory usage in all routines at the conclusion of the program.
  Note that this option
  activates logging by calling `PetscMallocViewSet()` in
  `PetscInitialize()` and then prints the log by calling
  `PetscMallocView()` in `PetscFinalize()`. The user can also call
  these routines elsewhere in a program.
- When finer granularity is
  desired, the user can call `PetscMallocGetCurrentUsage()` and
  `PetscMallocGetMaximumUsage()` for memory allocated by PETSc, or
  `PetscMemoryGetCurrentUsage()` and `PetscMemoryGetMaximumUsage()`
  for the total memory used by the program. Note that
  `PetscMemorySetGetMaximumUsage()` must be called before
  `PetscMemoryGetMaximumUsage()` (typically at the beginning of the
  program).
- The option `-memory_view` provides a high-level view of all memory usage,
  not just the memory used by `PetscMalloc()`, at the conclusion of the program.
- When running with `-log_view`, the additional option `-log_view_memory`
  causes the display of additional columns of information about how much
  memory was allocated and freed during each logged event. This is useful
  for understanding which phases of a computation require the most memory.

One can also use [Valgrind](http://valgrind.org) to track memory usage and find bugs; see {any}`FAQ: Valgrind usage<valgrind>`.

(sec_dsreuse)=

### Data Structure Reuse

Data structures should be reused whenever possible. For example, if a
code often creates new matrices or vectors, there often may be a way to
reuse some of them. Very significant performance improvements can be
achieved by reusing matrix data structures with the same nonzero
pattern. If a code creates thousands of matrix or vector objects,
performance will be degraded. For example, when solving a nonlinear
problem or timestepping, reusing the matrices and their nonzero
structure for many steps when appropriate can make the code run
significantly faster.

A simple technique for saving work vectors, matrices, etc. is employing
a user-defined context. In C and C++ such a context is merely a
structure in which various objects can be stashed; in Fortran a user
context can be an integer array that contains both parameters and
pointers to PETSc objects.
See
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5.c.html">SNES Tutorial ex5</a>
and
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5f90.F90.html">SNES Tutorial ex5f90</a>
for examples of user-defined application contexts in C and Fortran,
respectively.

### Numerical Experiments

PETSc users should run a variety of tests. For example, there are a
large number of options for the linear and nonlinear equation solvers in
PETSc, and different choices can make a *very* big difference in
convergence rates and execution times. PETSc employs defaults that are
generally reasonable for a wide range of problems, but clearly these
defaults cannot be best for all cases. Users should experiment with many
combinations to determine what is best for a given problem and customize
the solvers accordingly.

- Use the options `-snes_view`, `-ksp_view`, etc. (or the routines
  `KSPView()`, `SNESView()`, etc.) to view the options that have
  been used for a particular solver.
- Run the code with the option `-help` for a list of the available
  runtime commands.
- Use the option `-info` to print details about the solvers’
  operation.
- Use the PETSc monitoring discussed in {any}`ch_profiling`
  to evaluate the performance of various numerical methods.
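
For example (the specific solver choices below are only illustrative), the
defaults can be compared against another Krylov method and preconditioner
entirely from the command line:

```none
./myapp -ksp_type bcgs -pc_type asm -ksp_monitor -log_view
```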

(sec_slestips)=

### Tips for Efficient Use of Linear Solvers

As discussed in {any}`ch_ksp`, the default linear
solvers are

- uniprocess: GMRES(30) with ILU(0) preconditioning
- multiprocess: GMRES(30) with block Jacobi preconditioning, where
  there is 1 block per process, and each block is solved with ILU(0)

One should experiment to determine alternatives that may be better for
various applications. Recall that one can specify the `KSP` methods
and preconditioners at runtime via the options:

```none
-ksp_type <ksp_name> -pc_type <pc_name>
```

One can also specify a variety of runtime customizations for the
solvers, as discussed throughout the manual.

In particular, note that the default restart parameter for GMRES is 30,
which may be too small for some large-scale problems. One can alter this
parameter with the option `-ksp_gmres_restart <restart>` or by calling
`KSPGMRESSetRestart()`. {any}`sec_ksp` gives
information on setting alternative GMRES orthogonalization routines,
which may provide much better parallel performance.

For elliptic problems one often obtains good performance and scalability
with multigrid solvers. Consult {any}`sec_amg` for
available options.
Our experience is that GAMG works particularly well
for elasticity problems, whereas hypre does well for scalar problems.

### System-Related Problems

The performance of a code can be affected by a variety of factors,
including the cache behavior, other users on the machine, etc. Below we
briefly describe some common problems and possibilities for overcoming
them.

- **Problem too large for physical memory size**: When timing a
  program, one should always leave at least a ten percent margin
  between the total memory a process is using and the physical size of
  the machine’s memory. One way to estimate the amount of memory used
  by a given process is with the Unix `getrusage` system routine.
  The PETSc option `-malloc_view` reports all
  memory usage, including any Fortran arrays in an application code.
- **Effects of other users**: If other users are running jobs on the
  same physical processor nodes on which a program is being profiled,
  the timing results are essentially meaningless.
- **Overhead of timing routines on certain machines**: On certain
  machines, even calling the system clock in order to time routines is
  slow; this skews all of the flop rates and timing results.
The file
  `$PETSC_DIR/src/benchmarks/PetscTime.c` (<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/benchmarks/PetscTime.c.html">source</a>)
  contains a simple test problem that will approximate the amount of
  time required to get the current time in a running program. On good
  systems it will be on the order of $10^{-6}$ seconds or less.
- **Problem too large for good cache performance**: Certain machines
  with lower memory bandwidths (slow memory access) attempt to
  compensate by having a very large cache. Thus, if a significant
  portion of an application fits within the cache, the program will
  achieve very good performance; if the code is too large, the
  performance can degrade markedly. To analyze whether this situation
  affects a particular code, one can try plotting the total flop rate
  as a function of problem size. If the flop rate decreases rapidly at
  some point, then the problem is likely too large for the cache
  size.
- **Inconsistent timings**: Inconsistent timings are likely due to
  other users on the machine, thrashing (using more virtual memory than
  available physical memory), or paging in of the initial executable.
  {any}`sec_profaccuracy` provides information on
  overcoming paging overhead when profiling a code. We have found on
  all systems that if you follow all the advice above, your timings will
  be consistent, with a variation of less than five percent.