(ch_performance)=

# Hints for Performance Tuning

This chapter provides hints on how to achieve the best performance
with PETSc, particularly on distributed-memory machines with multiple
CPU sockets per node. We focus on machine-related performance
optimization here; algorithmic aspects like preconditioner selection are
not the focus of this section.

## Maximizing Memory Bandwidth

Most operations in PETSc deal with large datasets (typically vectors and
sparse matrices) and perform relatively few arithmetic operations for
each byte loaded or stored from global memory. Therefore, the
*arithmetic intensity*, expressed as the ratio of floating point
operations to the number of bytes loaded and stored, is usually well
below unity for typical PETSc operations. On the other hand, modern CPUs
are able to execute on the order of 10 floating point operations for
each byte loaded or stored. As a consequence, almost all PETSc
operations are limited by the rate at which data can be loaded or stored
(*memory bandwidth limited*) rather than by the rate of floating point
operations.

This section discusses ways to maximize the memory bandwidth achieved by
applications based on PETSc.
Where appropriate, we include benchmark
results in order to provide quantitative results on typical performance
gains one can achieve through parallelization, both on a single compute
node and across nodes. In particular, we start with the answer to the
common question of why performance generally does not increase 20-fold
with a 20-core CPU.

(subsec_bandwidth_vs_processes)=

### Memory Bandwidth vs. Processes

Consider the addition of two large vectors, with the result written to a
third vector. Because there are no dependencies across the different
entries of each vector, the operation is embarrassingly parallel.

:::{figure} /images/manual/stream-results-intel.*
:alt: Memory bandwidth obtained on Intel hardware (dual socket except KNL) over the
: number of processes used. One can get close to peak memory bandwidth with only a
: few processes.
:name: fig_stream_intel
:width: 80.0%

Memory bandwidth obtained on Intel hardware (dual socket except KNL)
over the number of processes used. One can get close to peak memory
bandwidth with only a few processes.
:::

As {numref}`fig_stream_intel` shows, the performance gain due to
parallelization on different multi- and many-core CPUs quickly
saturates.
The reason is that only a fraction of the total number of CPU
cores is required to saturate the memory channels. For example, a
dual-socket system equipped with Haswell 12-core Xeon CPUs achieves more
than 80 percent of the achievable peak memory bandwidth with only four
processes per socket (8 total), cf. {numref}`fig_stream_intel`.
Consequently, running with more than 8 MPI ranks on such a system will
not increase performance substantially. For the same reason, PETSc-based
applications usually do not benefit from hyper-threading.

PETSc provides a simple way to measure memory bandwidth for different
numbers of processes via the target `make streams` executed from
`$PETSC_DIR`. The output provides an overview of the possible speedup
one can obtain on the given machine (not necessarily a shared memory
system). For example, the following is the most relevant output obtained
on a dual-socket system equipped with two six-core CPUs with
hyperthreading:

```none
np speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
...
21 3.82
22 3.49
23 3.79
24 3.71
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
```

On this machine, one should expect a speedup of typical memory
bandwidth-bound PETSc applications of at most 4x when running multiple
MPI ranks on the node. Most of the gains are already obtained when
running with only 4-6 ranks. Because a smaller number of MPI ranks
usually implies better preconditioners and better performance for
smaller problems, the best performance for PETSc applications may be
obtained with fewer ranks than there are physical CPU cores available.

Following the results from the above run of `make streams`, we
recommend using additional nodes instead of placing additional MPI
ranks on the nodes. In particular, weak scaling (i.e. constant load per
process, increasing the number of processes) and strong scaling
(i.e. constant total work, increasing the number of processes) studies
should keep the number of processes per node constant.

### Non-Uniform Memory Access (NUMA) and Process Placement

CPUs in nodes with more than one CPU socket are internally connected via
a high-speed fabric, cf. {numref}`fig_numa`, to enable data
exchange as well as cache coherency.
Because main memory on modern
systems is connected via the integrated memory controllers on each CPU,
memory is accessed in a non-uniform way: a process running on one socket
has direct access to the memory channels of the respective CPU, whereas
requests for memory attached to a different CPU socket need to go
through the high-speed fabric. Consequently, the best aggregate memory
bandwidth on the node is obtained when the memory controllers on each
CPU are fully saturated. However, full saturation of memory channels is
only possible if the data is distributed across the different memory
channels.

:::{figure} /images/manual/numa.*
:alt: Schematic of a two-socket NUMA system. Processes should be spread across both
: CPUs to obtain full bandwidth.
:name: fig_numa
:width: 90.0%

Schematic of a two-socket NUMA system. Processes should be spread
across both CPUs to obtain full bandwidth.
:::

Data in memory on modern machines is allocated by the operating system
based on a first-touch policy. That is, memory is not allocated at the
point of issuing `malloc()`, but at the point when the respective
memory segment is actually touched (read or write). Upon first touch,
memory is allocated on the memory channel associated with the
CPU the process is running on.
Memory available
through other sockets is considered only if all memory on the respective
CPU is already in use (either allocated or used as I/O cache).

Maximum memory bandwidth can be achieved by ensuring that processes are
spread over all sockets in the respective node. For example, the
recommended placement of an 8-way parallel run on a four-socket machine
is to assign two processes to each CPU socket. To do so, one needs to
know the enumeration of cores and pass the requested information to
`mpirun`. Consider the hardware topology information returned by
`lstopo` (part of the hwloc package) for the following two-socket
machine, in which each CPU consists of six cores and supports
hyperthreading:

```none
Machine (126GB total)
  NUMANode L#0 (P#0 63GB)
    Package L#0 + L3 L#0 (15MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#12)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#13)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#14)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#15)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#16)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#17)
  NUMANode L#1 (P#1 63GB)
    Package L#1 + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#18)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#19)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#20)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#21)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#22)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#23)
```

The relevant physical processor IDs are shown in parentheses prefixed by
`P#`. Here, IDs 0 and 12 share the same physical core and have a
common L2 cache. IDs 0, 12, 1, 13, 2, 14, 3, 15, 4, 16, 5, 17 share the
same socket and have a common L3 cache.

A good placement for a run with six processes is to locate three
processes on the first socket and three processes on the second socket.
Unfortunately, mechanisms for process placement vary across MPI
implementations, so make sure to consult the manual of your MPI
implementation. The following discussion is based on how processor
placement is done with MPICH and Open MPI, where one needs to pass
`--bind-to core --map-by socket` to `mpirun`:

```console
$ mpirun -n 6 --bind-to core --map-by socket ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 000000100000000000100000
process 2 binding: 010000000000010000000000
process 3 binding: 000000010000000000010000
process 4 binding: 001000000000001000000000
process 5 binding: 000000001000000000001000
Triad: 45403.1949 Rate (MB/s)
```

In this configuration, process 0 is bound to the first physical core on
the first socket (with IDs 0 and 12), process 1 is bound to the first
core on the second socket (IDs 6 and 18), and similarly for the
remaining processes. The achieved bandwidth of 45 GB/sec is close to the
practical peak of about 50 GB/sec available on the machine.
If, however,
all MPI processes are located on the same socket, memory bandwidth drops
significantly:

```console
$ mpirun -n 6 --bind-to core --map-by core ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 010000000000010000000000
process 2 binding: 001000000000001000000000
process 3 binding: 000100000000000100000000
process 4 binding: 000010000000000010000000
process 5 binding: 000001000000000001000000
Triad: 25510.7507 Rate (MB/s)
```

All processes are now mapped to cores on the same socket. As a result,
only the first memory channel is fully saturated at 25.5 GB/sec.

One must not assume that `mpirun` uses good defaults.
To
demonstrate, compare the full output of `make streams` from
{any}`subsec_bandwidth_vs_processes` first, followed by
the results obtained by passing `--bind-to core --map-by socket`:

```console
$ make streams
np speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
7 2.31
8 2.42
9 2.37
10 2.65
11 2.3
12 2.53
13 2.43
14 2.63
15 2.74
16 2.7
17 3.28
18 3.66
19 3.95
20 3.07
21 3.82
22 3.49
23 3.79
24 3.71
```

```console
$ make streams MPI_BINDING="--bind-to core --map-by socket"
np speedup
1 1.0
2 1.59
3 2.66
4 3.5
5 3.56
6 4.23
7 3.95
8 4.39
9 4.09
10 4.46
11 4.15
12 4.42
13 3.71
14 3.83
15 4.08
16 4.22
17 4.18
18 4.31
19 4.22
20 4.28
21 4.25
22 4.23
23 4.28
24 4.22
```

For the non-optimized version shown first, the speedup obtained when
using any number of processes between 3 and 13 is essentially constant
up to fluctuations, indicating that all processes were by default
executed on the same socket. Only with 14 or more processes does the
speedup increase again. In contrast, the second run of `make streams`,
with proper processor placement, resulted in a slightly higher overall
parallel speedup (with identical baselines), smaller performance
fluctuations, and more than 90 percent of peak bandwidth with only six
processes.

Machines with job submission systems such as SLURM usually provide
similar mechanisms for processor placement through options specified in
job submission scripts. Please consult the respective manuals.

#### Additional Process Placement Considerations and Details

For a typical, memory bandwidth-limited PETSc application, the primary
consideration in placing MPI processes is ensuring that processes are
evenly distributed among sockets, and hence using all available memory
channels.
Increasingly complex processor designs and cache hierarchies,
however, mean that performance may also be sensitive to how processes
are bound to the resources within each socket. Performance on the
two-processor machine in the preceding example may be relatively
insensitive to such placement decisions, because one L3 cache is shared
by all cores within a NUMA domain, and each core has its own L2 and L1
caches. However, processors that are less “flat”, with more complex
hierarchies, may be more sensitive. In many AMD Opterons or the
second-generation “Knights Landing” Intel Xeon Phi, for instance, L2
caches are shared between two cores. On these processors, placing
consecutive MPI ranks on cores that share the same L2 cache may benefit
performance if the two ranks communicate frequently with each other,
because the latency between cores sharing an L2 cache may be roughly
half that of two cores not sharing one. There may be a benefit, however,
in placing consecutive ranks on cores that do not share an L2 cache,
because (if there are fewer MPI ranks than cores) this increases the
total L2 cache capacity and bandwidth available to the application.
There is a trade-off to be considered between placing processes close
together (in terms of shared resources) to optimize for efficient
communication and synchronization vs.
farther apart to maximize available resources (memory channels,
caches, I/O channels, etc.), and the best strategy will depend on the
application and the software and hardware stack.

Different process placement strategies can affect performance at least
as much as some commonly explored settings, such as compiler
optimization levels. Unfortunately, exploration of this space is
complicated by two factors: First, processor and core numberings may be
completely arbitrary, changing with BIOS version, etc., and second—as
already noted—there is no standard mechanism used by MPI implementations
(or job schedulers) to specify process affinity. To overcome the first
issue, we recommend using the `lstopo` utility of the Portable
Hardware Locality (`hwloc`) software package (which can be installed
by configuring PETSc with `--download-hwloc`) to understand the
processor topology of your machine. We cannot fully address the second
issue—consult the documentation for your MPI implementation and/or job
scheduler—but we offer some general observations on understanding
placement options:

- An MPI implementation may support a notion of *domains* in which a
  process may be pinned. A domain may simply correspond to a single
  core; however, the MPI implementation may allow a great deal of
  flexibility in specifying domains that encompass multiple cores, span
  sockets, etc.
Some implementations, such as Intel MPI, provide means to
  specify whether domains should be “compact”—composed of cores sharing
  resources such as caches—or “scatter”-ed, with little resource
  sharing (possibly even spanning sockets).
- Separate from the specification of domains, MPI implementations often
  support different *orderings* in which MPI ranks should be bound to
  these domains. Intel MPI, for instance, supports “compact” ordering
  to place consecutive ranks close in terms of shared resources,
  “scatter” to place them far apart, and “bunch” to map proportionally
  to sockets while placing ranks as close together as possible within
  the sockets.
- An MPI implementation that supports process pinning should offer some
  way to view the rank assignments. Use this output in conjunction with
  the topology obtained via `lstopo` or a similar tool to determine
  if the placements correspond to something you believe is reasonable
  for your application. Do not assume that the MPI implementation is
  doing something sensible by default!

## Performance Pitfalls and Advice

This section looks into a potpourri of performance pitfalls encountered
by users in the past. Many of these pitfalls require a deeper
understanding of the system and experience to detect. The purpose of
this section is to summarize and share our experience so that these
pitfalls can be avoided in the future.

### Debug vs. Optimized Builds

PETSc’s `configure` defaults to building PETSc with debug mode
enabled. Any code development should be done in this mode, because it
provides handy debugging facilities such as accurate stack traces,
memory leak checks, and memory corruption checks. Note that PETSc has no
reliable way of knowing whether a particular run is a production or
debug run. In the case that a user requests profiling information via
`-log_view`, a debug build of PETSc issues the following warning:

```none
##########################################################
#                                                        #
#                      WARNING!!!                        #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run configure                  #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################
```

Conversely, one way of checking whether a particular build of PETSc has
debugging enabled is to inspect the output of `-log_view`.

Debug mode will generally be most useful for code development if
appropriate compiler options are set to facilitate debugging.
The
compiler should be instructed to generate binaries with debug symbols
(command line option `-g` for most compilers), and the optimization
level chosen should either completely disable optimizations (`-O0` for
most compilers) or enable only optimizations that do not interfere with
debugging (GCC, for instance, supports a `-Og` optimization level that
does this).

Only once the new code is thoroughly tested and ready for production
should one disable debugging facilities by passing
`--with-debugging=no` to `configure`. One should also ensure that an
appropriate compiler optimization level is set. Note that some compilers
(e.g., Intel) default to fairly comprehensive optimization levels, while
others (e.g., GCC) default to no optimization at all. The best
optimization flags will depend on your code, the compiler, and the
target architecture, but we offer a few guidelines for finding those
that will offer the best performance:

- Most compilers have a number of optimization levels (with level n
  usually specified via `-On`) that provide a quick way to enable
  sets of several optimization flags. We suggest trying the higher
  optimization levels (the highest level is not guaranteed to produce
  the fastest executable, so some experimentation may be merited).
With
  most recent processors now supporting some form of SIMD or vector
  instructions, it is important to choose a level that enables the
  compiler’s auto-vectorizer; many compilers do not enable
  auto-vectorization at lower optimization levels (e.g., GCC does not
  enable it below `-O3` and the Intel compiler does not enable it
  below `-O2`).
- For processors supporting newer vector instruction sets, such as
  Intel AVX2 and AVX-512, it is also important to direct the compiler
  to generate code that targets these processors (e.g.,
  `-march=native`); otherwise, the executables built will not
  utilize the newer instruction sets and will not take advantage of
  the vector processing units.
- Beyond choosing the optimization levels, some value-unsafe
  optimizations (such as using reciprocals of values instead of
  dividing by those values, or allowing re-association of operands in a
  series of calculations) for floating point calculations may yield
  significant performance gains. Compilers often provide flags (e.g.,
  `-ffast-math` in GCC) to enable a set of these optimizations, and
  they may be turned on when using options for very aggressive
  optimization (`-fast` or `-Ofast` in many compilers).
These are
  worth exploring to maximize performance, but, if employed, it is
  important to verify that these do not cause erroneous results with
  your code, since calculations may violate the IEEE standard for
  floating-point arithmetic.

### Profiling

Users should not spend time optimizing a code until after having
determined where it spends the bulk of its time on realistically sized
problems. As discussed in detail in {any}`ch_profiling`, the
PETSc routines automatically log performance data if certain runtime
options are specified.

To obtain a summary of where and how much time is spent in different
sections of the code, use one of the following options:

- Run the code with the option `-log_view` to print a performance
  summary for various phases of the code.
- Run the code with the option `-log_mpe [logfilename]`, which
  creates a logfile of events suitable for viewing with Jumpshot (part
  of MPICH).

Then, focus on the sections where most of the time is spent. If you
provided your own callback routines, e.g. for residual evaluations,
search the profiling output for routines such as `SNESFunctionEval` or
`SNESJacobianEval`. If their relative time is significant (say, more
than 30 percent), consider optimizing these routines first.
Generic
instructions on how to optimize your callback functions are difficult;
you may start by reading performance optimization guides for your
system’s hardware.

### Aggregation

Performing operations on chunks of data rather than a single element at
a time can significantly enhance performance because of cache reuse or
lower data motion. Typical examples are:

- Insert several (many) elements of a matrix or vector at once, rather
  than looping and inserting a single value at a time. In order to
  access elements of a vector repeatedly, employ `VecGetArray()` to
  allow direct manipulation of the vector elements.
- When possible, use `VecMDot()` rather than a series of calls to
  `VecDot()`.
- If you require a sequence of matrix-vector products with the same
  matrix, consider packing your vectors into a single matrix and using
  matrix-matrix multiplications.
- Users should employ a reasonable number of `PetscMalloc()` calls in
  their codes. Hundreds or thousands of memory allocations may be
  appropriate; however, if tens of thousands are being used, then
  reducing the number of `PetscMalloc()` calls may be warranted. For
  example, reusing space or allocating a large chunk and dividing it
  into pieces can produce significant savings in allocation overhead.
  {any}`sec_dsreuse` gives details.

Aggressive aggregation of data may result in inflexible data structures
and code that is hard to maintain. We advise users to keep these
competing goals in mind and not blindly optimize for performance only.

(sec_symbolfactor)=

### Memory Allocation for Sparse Matrix Factorization

When symbolically factoring an AIJ matrix, PETSc has to guess how much
fill there will be. Careful use of the fill parameter in the
`MatFactorInfo` structure when calling `MatLUFactorSymbolic()` or
`MatILUFactorSymbolic()` can greatly reduce the number of mallocs and
copies required, and thus greatly improve the performance of the
factorization. One way to determine a good value for the fill parameter
is to run a program with the option `-info`. The symbolic
factorization phase will then print information such as

```none
Info:MatILUFactorSymbolic_SeqAIJ:Reallocs 12 Fill ratio:given 1 needed 2.16423
```

This indicates that the user should have used a fill estimate factor of
about 2.17 (instead of 1) to prevent the 12 required mallocs and copies.
The command line option

```none
-pc_factor_fill 2.17
```

will cause PETSc to preallocate the correct amount of space for
the factorization.
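For codes that call the factorization routines directly rather than
through a `PC`, the same estimate can be supplied programmatically via
the `fill` field of `MatFactorInfo`. The fragment below is a sketch,
not a complete program: it assumes an already assembled AIJ matrix `A`
and omits the usual error checking.

```c
Mat           A, F;          /* A: assembled AIJ matrix; F: its factor */
IS            rowperm, colperm;
MatFactorInfo info;

MatFactorInfoInitialize(&info);
info.fill = 2.17;            /* fill estimate suggested by the -info output */

MatGetOrdering(A, MATORDERINGND, &rowperm, &colperm);
MatGetFactor(A, MATSOLVERPETSC, MAT_FACTOR_LU, &F);
MatLUFactorSymbolic(F, A, rowperm, colperm, &info);
MatLUFactorNumeric(F, A, &info);
```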

(detecting_memory_problems)=

### Detecting Memory Allocation Problems and Memory Usage

PETSc provides tools to aid in understanding PETSc memory usage and detecting problems with
memory allocation, including leaks and use of uninitialized space. Internally, PETSc uses
the routines `PetscMalloc()` and `PetscFree()` for memory allocation rather than directly calling `malloc()` and `free()`.
This allows PETSc to track its memory usage and perform error checking. Users are urged to use these routines as well when
appropriate.

- The option `-malloc_debug` turns on PETSc's extensive runtime checking of memory for corruption.
  This checking can be expensive, so it should not be used for
  production runs. The option `-malloc_test` is equivalent to `-malloc_debug`
  but only works when PETSc is configured with `--with-debugging` (the default configuration).
  We suggest setting the environment variable `PETSC_OPTIONS=-malloc_test`
  in your shell startup file to automatically enable runtime memory checking when developing code but not
  when running optimized code. Using `-malloc_debug` or `-malloc_test` for large runs can slow them significantly, thus we
  recommend turning them off if your code is painfully slow and you don't need the testing. In addition, you can use
  `-check_pointer_intensity 0` for long debugging runs that do not need extensive memory corruption testing. This option
  is occasionally added to the `PETSC_OPTIONS` environment variable by some users.
- The option
  `-malloc_dump` will print a list of memory locations that have not been freed at the
  conclusion of a program. If all memory has been freed, no message
  is printed. Note that
  the option `-malloc_dump` activates a call to
  `PetscMallocDump()` during `PetscFinalize()`. The user can also
  call `PetscMallocDump()` elsewhere in a program.
- Another useful option
  is `-malloc_view`, which reports memory usage in all routines at the conclusion of the program.
  Note that this option
  activates logging by calling `PetscMallocViewSet()` in
  `PetscInitialize()` and then prints the log by calling
  `PetscMallocView()` in `PetscFinalize()`. The user can also call
  these routines elsewhere in a program.
- When finer granularity is
  desired, the user can call `PetscMallocGetCurrentUsage()` and
  `PetscMallocGetMaximumUsage()` for memory allocated by PETSc, or
  `PetscMemoryGetCurrentUsage()` and `PetscMemoryGetMaximumUsage()`
  for the total memory used by the program. Note that
  `PetscMemorySetGetMaximumUsage()` must be called before
  `PetscMemoryGetMaximumUsage()` (typically at the beginning of the
  program).
- The option `-memory_view` provides a high-level view of all memory usage,
  not just the memory used by `PetscMalloc()`, at the conclusion of the program.
- When running with `-log_view`, the additional option `-log_view_memory`
  causes the display of additional columns of information about how much
  memory was allocated and freed during each logged event. This is useful
  for understanding which phases of a computation require the most memory.

One can also use [Valgrind](http://valgrind.org) to track memory usage and find bugs; see {any}`FAQ: Valgrind usage<valgrind>`.

(sec_dsreuse)=

### Data Structure Reuse

Data structures should be reused whenever possible. For example, if a
code often creates new matrices or vectors, there often may be a way to
reuse some of them. Very significant performance improvements can be
achieved by reusing matrix data structures with the same nonzero
pattern. If a code creates thousands of matrix or vector objects,
performance will be degraded. For example, when solving a nonlinear
problem or timestepping, reusing the matrices and their nonzero
structure for many steps when appropriate can make the code run
significantly faster.

A simple technique for saving work vectors, matrices, etc. is employing
a user-defined context. In C and C++ such a context is merely a
structure in which various objects can be stashed; in Fortran a user
context can be an integer array that contains both parameters and
pointers to PETSc objects.
See
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5.c.html">SNES Tutorial ex5</a>
and
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5f90.F90.html">SNES Tutorial ex5f90</a>
for examples of user-defined application contexts in C and Fortran,
respectively.

### Numerical Experiments

PETSc users should run a variety of tests. For example, there are a
large number of options for the linear and nonlinear equation solvers in
PETSc, and different choices can make a *very* big difference in
convergence rates and execution times. PETSc employs defaults that are
generally reasonable for a wide range of problems, but clearly these
defaults cannot be best for all cases. Users should experiment with many
combinations to determine what is best for a given problem and customize
the solvers accordingly.

- Use the options `-snes_view`, `-ksp_view`, etc. (or the routines
  `KSPView()`, `SNESView()`, etc.) to view the options that have
  been used for a particular solver.
- Run the code with the option `-help` for a list of the available
  runtime commands.
- Use the option `-info` to print details about the solvers’
  operation.
- Use the PETSc monitoring discussed in {any}`ch_profiling`
  to evaluate the performance of various numerical methods.
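For example, a single run can combine several of these options to
compare a solver variant against the default (the executable name
`./app` below is a placeholder for your application):

```none
mpiexec -n 4 ./app -ksp_type bcgs -pc_type asm -ksp_view -log_view
```

Because everything is selected at runtime, such experiments require no
recompilation.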

(sec_slestips)=

### Tips for Efficient Use of Linear Solvers

As discussed in {any}`ch_ksp`, the default linear
solvers are

- uniprocess: GMRES(30) with ILU(0) preconditioning
- multiprocess: GMRES(30) with block Jacobi preconditioning, where
  there is 1 block per process, and each block is solved with ILU(0)

One should experiment to determine alternatives that may be better for
various applications. Recall that one can specify the `KSP` methods
and preconditioners at runtime via the options:

```none
-ksp_type <ksp_name> -pc_type <pc_name>
```

One can also specify a variety of runtime customizations for the
solvers, as discussed throughout the manual.

In particular, note that the default restart parameter for GMRES is 30,
which may be too small for some large-scale problems. One can alter this
parameter with the option `-ksp_gmres_restart <restart>` or by calling
`KSPGMRESSetRestart()`. {any}`sec_ksp` gives
information on setting alternative GMRES orthogonalization routines,
which may provide much better parallel performance.

For elliptic problems one often obtains good performance and scalability
with multigrid solvers. Consult {any}`sec_amg` for
available options.
Our experience is that GAMG works particularly well
for elasticity problems, whereas hypre does well for scalar problems.

### System-Related Problems

The performance of a code can be affected by a variety of factors,
including the cache behavior, other users on the machine, etc. Below we
briefly describe some common problems and possibilities for overcoming
them.

- **Problem too large for physical memory size**: When timing a
  program, one should always leave at least a ten percent margin
  between the total memory a process is using and the physical size of
  the machine’s memory. One way to estimate the amount of memory used
  by a given process is with the Unix `getrusage` system routine.
  The PETSc option `-malloc_view` reports all
  memory usage, including any Fortran arrays in an application code.
- **Effects of other users**: If other users are running jobs on the
  same physical processor nodes on which a program is being profiled,
  the timing results are essentially meaningless.
- **Overhead of timing routines on certain machines**: On certain
  machines, even calling the system clock in order to time routines is
  slow; this skews all of the flop rates and timing results.
The file
`$PETSC_DIR/src/benchmarks/PetscTime.c` (<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/benchmarks/PetscTime.c.html">source</a>)
contains a simple test problem that will approximate the amount of
time required to get the current time in a running program. On good
systems it will be on the order of $10^{-6}$ seconds or less.
- **Problem too large for good cache performance**: Certain machines
  with lower memory bandwidths (slow memory access) attempt to
  compensate by having a very large cache. Thus, if a significant
  portion of an application fits within the cache, the program will
  achieve very good performance; if the code is too large, the
  performance can degrade markedly. To analyze whether this situation
  affects a particular code, one can try plotting the total flop rate
  as a function of problem size. If the flop rate decreases rapidly at
  some point, then the problem is likely too large for the cache
  size.
- **Inconsistent timings**: Inconsistent timings are likely due to
  other users on the machine, thrashing (using more virtual memory than
  available physical memory), or paging in of the initial executable.
  {any}`sec_profaccuracy` provides information on
  overcoming paging overhead when profiling a code. We have found on
  all systems that if you follow all the advice above your timings will
  be consistent to within a variation of less than five percent.