(ch_performance)=

# Hints for Performance Tuning

This chapter provides hints on how to achieve the best performance with PETSc, particularly on distributed-memory machines with multiple CPU sockets per node. We focus on machine-related performance optimization here; algorithmic aspects like preconditioner selection are not the focus of this section.

## Maximizing Memory Bandwidth

Most operations in PETSc deal with large datasets (typically vectors and sparse matrices) and perform relatively few arithmetic operations for each byte loaded or stored from global memory. Therefore, the *arithmetic intensity*, expressed as the ratio of floating point operations to the number of bytes loaded and stored, is usually well below unity for typical PETSc operations. On the other hand, modern CPUs are able to execute on the order of 10 floating point operations for each byte loaded or stored. As a consequence, almost all PETSc operations are limited by the rate at which data can be loaded or stored (*memory bandwidth limited*) rather than by the rate of floating point operations.

This section discusses ways to maximize the memory bandwidth achieved by applications based on PETSc. Where appropriate, we include benchmark results in order to provide quantitative results on typical performance gains one can achieve through parallelization, both on a single compute node and across nodes. In particular, we start with the answer to the common question of why performance generally does not increase 20-fold with a 20-core CPU.

(subsec_bandwidth_vs_processes)=

### Memory Bandwidth vs. Processes

Consider the addition of two large vectors, with the result written to a third vector. Because there are no dependencies across the different entries of each vector, the operation is embarrassingly parallel.
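
A back-of-the-envelope sketch shows why even this embarrassingly parallel operation is bandwidth limited (plain Python, not PETSc code; the 50 GB/sec bandwidth figure is an assumed value, comparable to the dual-socket machine used later in this section):

```python
# Bandwidth-limited performance ceiling for the vector addition z[i] = x[i] + y[i]
# with 8-byte double-precision entries.
bytes_per_entry = 3 * 8        # load x[i], load y[i], store z[i]
flops_per_entry = 1            # one addition
intensity = flops_per_entry / bytes_per_entry   # flops per byte, well below unity

bandwidth = 50e9               # bytes/sec; assumed value for the node
ceiling_gflops = intensity * bandwidth / 1e9
print(f"arithmetic intensity: {intensity:.4f} flops/byte")
print(f"bandwidth-limited ceiling: {ceiling_gflops:.2f} GFLOP/s")
```

Even on a node whose cores can together execute hundreds of GFLOP/s, this kernel cannot exceed roughly 2 GFLOP/s: once the memory channels are saturated, additional cores contribute nothing.
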

:::{figure} /images/manual/stream-results-intel.*
:alt: Memory bandwidth obtained on Intel hardware (dual socket except KNL) over the number of processes used. One can get close to peak memory bandwidth with only a few processes.
:name: fig_stream_intel
:width: 80.0%

Memory bandwidth obtained on Intel hardware (dual socket except KNL)
over the number of processes used. One can get close to peak memory
bandwidth with only a few processes.
:::

As {numref}`fig_stream_intel` shows, the performance gains due to parallelization on different multi- and many-core CPUs quickly saturate. The reason is that only a fraction of the total number of CPU cores is required to saturate the memory channels. For example, a dual-socket system equipped with Haswell 12-core Xeon CPUs achieves more than 80 percent of achievable peak memory bandwidth with only four processes per socket (eight total), cf. {numref}`fig_stream_intel`. Consequently, running with more than eight MPI ranks on such a system will not increase performance substantially. For the same reason, PETSc-based applications usually do not benefit from hyper-threading.

PETSc provides a simple way to measure memory bandwidth for different numbers of processes via the target `make streams` executed from `$PETSC_DIR`. The output provides an overview of the possible speedup one can obtain on the given machine (not necessarily a shared memory system). For example, the following is the most relevant output obtained on a dual-socket system equipped with two six-core CPUs with hyperthreading:

```none
np  speedup
1   1.0
2   1.58
3   2.19
4   2.42
5   2.63
6   2.69
...
21  3.82
22  3.49
23  3.79
24  3.71
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
```

On this machine, one should expect a speedup of typical memory bandwidth-bound PETSc applications of at most 4x when running multiple MPI ranks on the node. Most of the gains are already obtained when running with only 4-6 ranks. Because a smaller number of MPI ranks usually implies better preconditioners and better performance for smaller problems, the best performance for PETSc applications may be obtained with fewer ranks than there are physical CPU cores available.

Following the results from the above run of `make streams`, we recommend using additional nodes instead of placing additional MPI ranks on the nodes. In particular, weak scaling (i.e. constant load per process, increasing the number of processes) and strong scaling (i.e. constant total work, increasing the number of processes) studies should keep the number of processes per node constant.

### Non-Uniform Memory Access (NUMA) and Process Placement

CPUs in nodes with more than one CPU socket are internally connected via a high-speed fabric, cf. {numref}`fig_numa`, to enable data exchange as well as cache coherency. Because main memory on modern systems is connected via the integrated memory controllers on each CPU, memory is accessed in a non-uniform way: A process running on one socket has direct access to the memory channels of the respective CPU, whereas requests for memory attached to a different CPU socket need to go through the high-speed fabric. Consequently, the best aggregate memory bandwidth on the node is obtained when the memory controllers on each CPU are fully saturated. However, full saturation of memory channels is only possible if the data is distributed across the different memory channels.

:::{figure} /images/manual/numa.*
:alt: Schematic of a two-socket NUMA system. Processes should be spread across both CPUs to obtain full bandwidth.
:name: fig_numa
:width: 90.0%

Schematic of a two-socket NUMA system. Processes should be spread
across both CPUs to obtain full bandwidth.
:::

Data in memory on modern machines is allocated by the operating system based on a first-touch policy. That is, memory is not allocated at the point of issuing `malloc()`, but at the point when the respective memory segment is actually touched (read or written). Upon first touch, memory is allocated on the memory channel associated with the CPU the process is running on. Only if all memory attached to the respective CPU is already in use (either allocated or used as I/O cache) is memory available through other sockets considered.

Maximum memory bandwidth can be achieved by ensuring that processes are spread over all sockets in the respective node. For example, the recommended placement of an 8-way parallel run on a four-socket machine is to assign two processes to each CPU socket. To do so, one needs to know the enumeration of cores and pass the requested information to `mpiexec`.
Consider the hardware topology information returned by `lstopo` (part of the hwloc package) for the following two-socket machine, in which each CPU consists of six cores and supports hyperthreading:

```none
Machine (126GB total)
  NUMANode L#0 (P#0 63GB)
    Package L#0 + L3 L#0 (15MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#12)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#13)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#14)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#15)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#16)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#17)
  NUMANode L#1 (P#1 63GB)
    Package L#1 + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#18)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#19)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#20)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#21)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#22)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#23)
```

The relevant physical processor IDs are shown in parentheses prefixed by `P#`. Here, IDs 0 and 12 share the same physical core and have a common L2 cache. IDs 0, 12, 1, 13, 2, 14, 3, 15, 4, 16, 5, 17 share the same socket and have a common L3 cache.

A good placement for a run with six processes is to locate three processes on the first socket and three processes on the second socket.
Unfortunately, mechanisms for process placement vary across MPI implementations, so make sure to consult the manual of your MPI implementation. The following discussion is based on how processor placement is done with MPICH and Open MPI, where one needs to pass `--bind-to core --map-by socket` to `mpiexec`:

```console
$ mpiexec -n 6 --bind-to core --map-by socket ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 000000100000000000100000
process 2 binding: 010000000000010000000000
process 3 binding: 000000010000000000010000
process 4 binding: 001000000000001000000000
process 5 binding: 000000001000000000001000
Triad: 45403.1949 Rate (MB/s)
```

In this configuration, process 0 is bound to the first physical core on the first socket (with IDs 0 and 12), process 1 is bound to the first core on the second socket (IDs 6 and 18), and similarly for the remaining processes. The achieved bandwidth of 45 GB/sec is close to the practical peak of about 50 GB/sec available on the machine. If, however, all MPI processes are located on the same socket, memory bandwidth drops significantly:

```console
$ mpiexec -n 6 --bind-to core --map-by core ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 010000000000010000000000
process 2 binding: 001000000000001000000000
process 3 binding: 000100000000000100000000
process 4 binding: 000010000000000010000000
process 5 binding: 000001000000000001000000
Triad: 25510.7507 Rate (MB/s)
```

All processes are now mapped to cores on the same socket. As a result, only the first memory channel is fully saturated, at 25.5 GB/sec.

One must not assume that `mpiexec` uses good defaults.
To demonstrate, compare the full output of `make streams` from {any}`subsec_bandwidth_vs_processes` first, followed by the results obtained by passing `--bind-to core --map-by socket`:

```console
$ make streams
np  speedup
1   1.0
2   1.58
3   2.19
4   2.42
5   2.63
6   2.69
7   2.31
8   2.42
9   2.37
10  2.65
11  2.3
12  2.53
13  2.43
14  2.63
15  2.74
16  2.7
17  3.28
18  3.66
19  3.95
20  3.07
21  3.82
22  3.49
23  3.79
24  3.71
```

```console
$ make streams MPI_BINDING="--bind-to core --map-by socket"
np  speedup
1   1.0
2   1.59
3   2.66
4   3.5
5   3.56
6   4.23
7   3.95
8   4.39
9   4.09
10  4.46
11  4.15
12  4.42
13  3.71
14  3.83
15  4.08
16  4.22
17  4.18
18  4.31
19  4.22
20  4.28
21  4.25
22  4.23
23  4.28
24  4.22
```

For the non-optimized version shown first, the speedup obtained when using any number of processes between 3 and 13 is essentially constant up to fluctuations, indicating that all processes were by default executed on the same socket. Only with 14 or more processes does the speedup increase again. In contrast, the second run of `make streams`, with proper processor placement, results in a slightly higher overall parallel speedup (with identical baselines), smaller performance fluctuations, and more than 90 percent of peak bandwidth with only six processes.

Machines with job submission systems such as SLURM usually provide similar mechanisms for processor placement through options specified in job submission scripts. Please consult the respective manuals.

#### Additional Process Placement Considerations and Details

For a typical, memory bandwidth-limited PETSc application, the primary consideration in placing MPI processes is ensuring that processes are evenly distributed among sockets, and hence using all available memory channels.
Increasingly complex processor designs and cache hierarchies, however, mean that performance may also be sensitive to how processes are bound to the resources within each socket. Performance on the two-processor machine in the preceding example may be relatively insensitive to such placement decisions, because one L3 cache is shared by all cores within a NUMA domain, and each core has its own L2 and L1 caches. However, processors that are less “flat”, with more complex hierarchies, may be more sensitive. In many AMD Opterons or the second-generation “Knights Landing” Intel Xeon Phi, for instance, L2 caches are shared between two cores. On these processors, placing consecutive MPI ranks on cores that share the same L2 cache may benefit performance if the two ranks communicate frequently with each other, because the latency between cores sharing an L2 cache may be roughly half that of two cores not sharing one. There may be benefit, however, in placing consecutive ranks on cores that do not share an L2 cache, because (if there are fewer MPI ranks than cores) this increases the total L2 cache capacity and bandwidth available to the application. There is a trade-off to be considered between placing processes close together (in terms of shared resources) to optimize for efficient communication and synchronization vs. farther apart to maximize available resources (memory channels, caches, I/O channels, etc.), and the best strategy will depend on the application and the software and hardware stack.

Different process placement strategies can affect performance at least as much as some commonly explored settings, such as compiler optimization levels.
Unfortunately, exploration of this space is complicated by two factors: First, processor and core numberings may be completely arbitrary, changing with BIOS version, etc., and second—as already noted—there is no standard mechanism used by MPI implementations (or job schedulers) to specify process affinity. To overcome the first issue, we recommend using the `lstopo` utility of the Portable Hardware Locality (`hwloc`) software package (which can be installed by configuring PETSc with `--download-hwloc`) to understand the processor topology of your machine. We cannot fully address the second issue—consult the documentation for your MPI implementation and/or job scheduler—but we offer some general observations on understanding placement options:

- An MPI implementation may support a notion of *domains* in which a process may be pinned. A domain may simply correspond to a single core; however, the MPI implementation may allow a great deal of flexibility in specifying domains that encompass multiple cores, span sockets, etc. Some implementations, such as Intel MPI, provide means to specify whether domains should be “compact”—composed of cores sharing resources such as caches—or “scatter”-ed, with little resource sharing (possibly even spanning sockets).
- Separate from the specification of domains, MPI implementations often support different *orderings* in which MPI ranks should be bound to these domains. Intel MPI, for instance, supports “compact” ordering to place consecutive ranks close in terms of shared resources, “scatter” to place them far apart, and “bunch” to map proportionally to sockets while placing ranks as close together as possible within the sockets.
- An MPI implementation that supports process pinning should offer some way to view the rank assignments.
  Use this output in conjunction with the topology obtained via `lstopo` or a similar tool to determine whether the placements correspond to something you believe is reasonable for your application. Do not assume that the MPI implementation is doing something sensible by default!

## Performance Pitfalls and Advice

This section looks into a potpourri of performance pitfalls encountered by users in the past. Many of these pitfalls require a deeper understanding of the system and experience to detect. The purpose of this section is to summarize and share our experience so that these pitfalls can be avoided in the future.

### Debug vs. Optimized Builds

PETSc’s `configure` defaults to building PETSc with debug mode enabled. Any code development should be done in this mode, because it provides handy debugging facilities such as accurate stack traces, memory leak checks, and memory corruption checks. Note that PETSc has no reliable way of knowing whether a particular run is a production or debug run. If a user requests profiling information via `-log_view`, a debug build of PETSc issues the following warning:

```none
##########################################################
#                                                        #
#                       WARNING!!!                       #
#                                                        #
#   This code was compiled with a debugging option.      #
#   To get timing results run configure                  #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################
```

Conversely, one way of checking whether a particular build of PETSc has debugging enabled is to inspect the output of `-log_view`.

Debug mode will generally be most useful for code development if appropriate compiler options are set to facilitate debugging.
The compiler should be instructed to generate binaries with debug symbols (command line option `-g` for most compilers), and the optimization level chosen should either completely disable optimizations (`-O0` for most compilers) or enable only optimizations that do not interfere with debugging (GCC, for instance, supports a `-Og` optimization level that does this).

Only once the new code is thoroughly tested and ready for production should one disable the debugging facilities by passing `--with-debugging=no` to `configure`. One should also ensure that an appropriate compiler optimization level is set. Note that some compilers (e.g., Intel) default to fairly comprehensive optimization levels, while others (e.g., GCC) default to no optimization at all. The best optimization flags will depend on your code, the compiler, and the target architecture, but we offer a few guidelines for finding those that will offer the best performance:

- Most compilers have a number of optimization levels (with level n usually specified via `-On`) that provide a quick way to enable sets of several optimization flags. We suggest trying the higher optimization levels (the highest level is not guaranteed to produce the fastest executable, so some experimentation may be merited). With most recent processors now supporting some form of SIMD or vector instructions, it is important to choose a level that enables the compiler’s auto-vectorizer; many compilers do not enable auto-vectorization at lower optimization levels (e.g., GCC does not enable it below `-O3` and the Intel compiler does not enable it below `-O2`).
- For processors supporting newer vector instruction sets, such as Intel AVX2 and AVX-512, it is also important to direct the compiler to generate code that targets these processors (e.g., `-march=native`); otherwise, the executables built will not utilize the newer instruction sets and will not take advantage of the vector processing units.
- Beyond choosing the optimization levels, enabling some value-unsafe optimizations for floating point calculations (such as using reciprocals of values instead of dividing by those values, or allowing re-association of operands in a series of calculations) may yield significant performance gains. Compilers often provide flags (e.g., `-ffast-math` in GCC) to enable a set of these optimizations, and they may be turned on when using options for very aggressive optimization (`-fast` or `-Ofast` in many compilers). These are worth exploring to maximize performance, but, if employed, it is important to verify that they do not cause erroneous results with your code, since the calculations may violate the IEEE standard for floating-point arithmetic.

### Profiling

Users should not spend time optimizing a code until after having determined where it spends the bulk of its time on realistically sized problems. As discussed in detail in {any}`ch_profiling`, the PETSc routines automatically log performance data if certain runtime options are specified.

To obtain a summary of where and how much time is spent in different sections of the code, use one of the following options:

- Run the code with the option `-log_view` to print a performance summary for various phases of the code.
- Run the code with the option `-log_mpe [logfilename]`, which creates a logfile of events suitable for viewing with Jumpshot (part of MPICH).

Then, focus on the sections where most of the time is spent.
If you provided your own callback routines, e.g. for residual evaluations, search the profiling output for routines such as `SNESFunctionEval` or `SNESJacobianEval`. If their relative time is significant (say, more than 30 percent), consider optimizing these routines first. Generic instructions on how to optimize your callback functions are difficult to give; you may start by reading performance optimization guides for your system’s hardware.

### Aggregation

Performing operations on chunks of data rather than a single element at a time can significantly enhance performance because of cache reuse or lower data motion. Typical examples are:

- Insert several (many) elements of a matrix or vector at once, rather than looping and inserting a single value at a time. In order to access elements of a vector repeatedly, employ `VecGetArray()` to allow direct manipulation of the vector elements.
- When possible, use `VecMDot()` rather than a series of calls to `VecDot()`.
- If you require a sequence of matrix-vector products with the same matrix, consider packing your vectors into a single matrix and using matrix-matrix multiplications.
- Users should employ a reasonable number of `PetscMalloc()` calls in their codes. Hundreds or thousands of memory allocations may be appropriate; however, if tens of thousands are being used, then reducing the number of `PetscMalloc()` calls may be warranted. For example, reusing space or allocating large chunks and dividing them into pieces can produce significant savings in allocation overhead. {any}`sec_dsreuse` gives details.

Aggressive aggregation of data may result in inflexible data structures and code that is hard to maintain. We advise users to keep these competing goals in mind and not blindly optimize for performance only.
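
The cache-reuse argument behind the `VecMDot()` recommendation above can be illustrated with a small sketch (plain Python standing in for the underlying C loops; illustrative only, not PETSc code): computing several dot products against the same vector in one fused pass loads each entry of `x` once, instead of once per result.

```python
def dots_separate(x, ys):
    # One full pass over x per dot product: x is streamed len(ys) times.
    return [sum(xi * yi for xi, yi in zip(x, y)) for y in ys]

def dots_fused(x, ys):
    # VecMDot-style aggregation: a single pass over x serves all dot
    # products, so each entry of x is loaded from memory only once.
    results = [0.0] * len(ys)
    for i, xi in enumerate(x):
        for j, y in enumerate(ys):
            results[j] += xi * y[i]
    return results

x = [1.0, 2.0, 3.0]
ys = [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
print(dots_separate(x, ys))  # [1.0, 6.0]
print(dots_fused(x, ys))     # [1.0, 6.0]
```

Both versions perform the same arithmetic; the fused one amortizes the memory traffic for `x` across all results, which is exactly what matters for bandwidth-limited kernels.
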

(sec_symbolfactor)=

### Memory Allocation for Sparse Matrix Factorization

When symbolically factoring an AIJ matrix, PETSc has to guess how much fill there will be. Careful use of the fill parameter in the `MatFactorInfo` structure when calling `MatLUFactorSymbolic()` or `MatILUFactorSymbolic()` can greatly reduce the number of mallocs and copies required, and thus greatly improve the performance of the factorization. One way to determine a good value for the fill parameter is to run a program with the option `-info`. The symbolic factorization phase will then print information such as

```none
Info:MatILUFactorSymbolic_SeqAIJ:Reallocs 12 Fill ratio:given 1 needed 2.16423
```

This indicates that the user should have used a fill estimate factor of about 2.17 (instead of 1) to prevent the 12 required mallocs and copies. The command line option

```none
-pc_factor_fill 2.17
```

will cause PETSc to preallocate the correct amount of space for the factorization.

(detecting_memory_problems)=

### Detecting Memory Allocation Problems and Memory Usage

PETSc provides tools to aid in understanding PETSc memory usage and detecting problems with memory allocation, including leaks and use of uninitialized space. Internally, PETSc uses the routines `PetscMalloc()` and `PetscFree()` for memory allocation instead of directly calling `malloc()` and `free()`. This allows PETSc to track its memory usage and perform error checking. Users are urged to use these routines as well when appropriate.

- The option `-malloc_debug` turns on PETSc's extensive runtime error checking of memory for corruption. This checking can be expensive, so it should not be used for production runs. The option `-malloc_test` is equivalent to `-malloc_debug` but only takes effect when PETSc is configured with `--with-debugging` (the default configuration).
  We suggest setting the environment variable `PETSC_OPTIONS=-malloc_test` in your shell startup file to automatically enable memory checking when developing code but not when running optimized code. Using `-malloc_debug` or `-malloc_test` for large runs can slow them down significantly, so we recommend turning them off if your code is painfully slow and you do not need the testing. In addition, you can use `-check_pointer_intensity 0` for long debug runs that do not need extensive memory corruption testing. This option is occasionally added to the `PETSC_OPTIONS` environment variable by some users.
- The option `-malloc_dump` will print a list of memory locations that have not been freed at the conclusion of a program. If all memory has been freed, no message is printed. Note that the option `-malloc_dump` activates a call to `PetscMallocDump()` during `PetscFinalize()`. The user can also call `PetscMallocDump()` elsewhere in a program.
- Another useful option is `-malloc_view`, which reports memory usage in all routines at the conclusion of the program. Note that this option activates logging by calling `PetscMallocViewSet()` in `PetscInitialize()` and then prints the log by calling `PetscMallocView()` in `PetscFinalize()`. The user can also call these routines elsewhere in a program.
- When finer granularity is desired, the user can call `PetscMallocGetCurrentUsage()` and `PetscMallocGetMaximumUsage()` for memory allocated by PETSc, or `PetscMemoryGetCurrentUsage()` and `PetscMemoryGetMaximumUsage()` for the total memory used by the program. Note that `PetscMemorySetGetMaximumUsage()` must be called before `PetscMemoryGetMaximumUsage()` (typically at the beginning of the program).
- The option `-memory_view` provides a high-level view of all memory usage, not just the memory used by `PetscMalloc()`, at the conclusion of the program.
- When running with `-log_view`, the additional option `-log_view_memory` causes the display of additional columns of information about how much memory was allocated and freed during each logged event. This is useful for understanding which phases of a computation require the most memory.

One can also use [Valgrind](http://valgrind.org) to track memory usage and find bugs; see {any}`FAQ: Valgrind usage<valgrind>`.

(sec_dsreuse)=

### Data Structure Reuse

Data structures should be reused whenever possible. For example, if a code often creates new matrices or vectors, there often may be a way to reuse some of them. Very significant performance improvements can be achieved by reusing matrix data structures with the same nonzero pattern. If a code creates thousands of matrix or vector objects, performance will be degraded. For example, when solving a nonlinear problem or timestepping, reusing the matrices and their nonzero structure for many steps when appropriate can make the code run significantly faster.

A simple technique for saving work vectors, matrices, etc. is employing a user-defined context. In C and C++ such a context is merely a structure in which various objects can be stashed; in Fortran a user context can be an integer array that contains both parameters and pointers to PETSc objects. See
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5.c.html">SNES Tutorial ex5</a>
and
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5f90.F90.html">SNES Tutorial ex5f90</a>
for examples of user-defined application contexts in C and Fortran, respectively.

### Numerical Experiments

PETSc users should run a variety of tests. For example, there are a large number of options for the linear and nonlinear equation solvers in PETSc, and different choices can make a *very* big difference in convergence rates and execution times.
PETSc employs defaults that are generally reasonable for a wide range of problems, but clearly these defaults cannot be best for all cases. Users should experiment with many combinations to determine what is best for a given problem and customize the solvers accordingly.

- Use the options `-snes_view`, `-ksp_view`, etc. (or the routines `KSPView()`, `SNESView()`, etc.) to view the options that have been used for a particular solver.
- Run the code with the option `-help` for a list of the available runtime commands.
- Use the option `-info` to print details about the solvers’ operation.
- Use the PETSc monitoring discussed in {any}`ch_profiling` to evaluate the performance of various numerical methods.

(sec_slestips)=

### Tips for Efficient Use of Linear Solvers

As discussed in {any}`ch_ksp`, the default linear solvers are

- uniprocess: GMRES(30) with ILU(0) preconditioning
- multiprocess: GMRES(30) with block Jacobi preconditioning, where there is 1 block per process, and each block is solved with ILU(0)

One should experiment to determine alternatives that may be better for various applications. Recall that one can specify the `KSP` methods and preconditioners at runtime via the options:

```none
-ksp_type <ksp_name> -pc_type <pc_name>
```

One can also specify a variety of runtime customizations for the solvers, as discussed throughout the manual.

In particular, note that the default restart parameter for GMRES is 30, which may be too small for some large-scale problems. One can alter this parameter with the option `-ksp_gmres_restart <restart>` or by calling `KSPGMRESSetRestart()`. {any}`sec_ksp` gives information on setting alternative GMRES orthogonalization routines, which may provide much better parallel performance.
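
When raising the restart parameter, keep the memory cost in mind: GMRES(m) stores on the order of m Krylov basis vectors. A rough estimate (a sketch assuming 8-byte scalars and ignoring the handful of implementation-specific work vectors; the problem size is chosen purely for illustration):

```python
def gmres_basis_memory_mb(n_local, restart, bytes_per_scalar=8):
    # Approximate storage for the Krylov basis alone: ~restart vectors,
    # each of local length n_local. Exact counts vary by implementation.
    return restart * n_local * bytes_per_scalar / 1e6

n = 1_000_000  # local problem size per process, assumed for illustration
for m in (30, 200):
    print(f"restart {m:>3}: ~{gmres_basis_memory_mb(n, m):.0f} MB per process")
```

For one million local unknowns, moving from a restart of 30 to 200 grows the basis storage from roughly 240 MB to 1.6 GB per process, so larger restarts trade memory (and orthogonalization work) for potentially fewer restarts.
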

For elliptic problems one often obtains good performance and scalability with multigrid solvers. Consult {any}`sec_amg` for available options. Our experience is that GAMG works particularly well for elasticity problems, whereas hypre does well for scalar problems.

### System-Related Problems

The performance of a code can be affected by a variety of factors, including the cache behavior, other users on the machine, etc. Below we briefly describe some common problems and possibilities for overcoming them.

- **Problem too large for physical memory size**: When timing a program, one should always leave at least a ten percent margin between the total memory a process is using and the physical size of the machine’s memory. One way to estimate the amount of memory used by a given process is with the Unix `getrusage` system routine. The PETSc option `-malloc_view` reports all memory usage, including any Fortran arrays in an application code.
- **Effects of other users**: If other users are running jobs on the same physical processor nodes on which a program is being profiled, the timing results are essentially meaningless.
- **Overhead of timing routines on certain machines**: On certain machines, even calling the system clock in order to time routines is slow; this skews all of the flop rates and timing results. The file `$PETSC_DIR/src/benchmarks/PetscTime.c` (<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/benchmarks/PetscTime.c.html">source</a>) contains a simple test problem that will approximate the amount of time required to get the current time in a running program. On good systems it will be on the order of $10^{-6}$ seconds or less.
- **Problem too large for good cache performance**: Certain machines with lower memory bandwidths (slow memory access) attempt to compensate by having a very large cache.
  Thus, if a significant portion of an application fits within the cache, the program will achieve very good performance; if the code is too large, the performance can degrade markedly. To analyze whether this situation affects a particular code, one can try plotting the total flop rate as a function of problem size. If the flop rate decreases rapidly at some point, then the problem is likely too large for the cache size.
- **Inconsistent timings**: Inconsistent timings are likely due to other users on the machine, thrashing (using more virtual memory than available physical memory), or paging in of the initial executable. {any}`sec_profaccuracy` provides information on overcoming paging overhead when profiling a code. We have found on all systems that if you follow all the advice above, your timings will be consistent to within a variation of less than five percent.
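
The flop-rate-versus-problem-size experiment suggested in the cache-performance item above can be sketched as follows. Note that Python is interpreter-bound, so the absolute rates reported are far below hardware bandwidth and only the trend across problem sizes is meaningful; a compiled kernel would show the cache effect much more sharply:

```python
import time

def triad_rate_mbs(n, repeats=5):
    # STREAM-triad-like kernel a[i] = b[i] + s*c[i]; returns the best
    # observed rate in MB/s over several repetitions.
    b = [1.0] * n
    c = [2.0] * n
    s = 3.0
    best = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        a = [bi + s * ci for bi, ci in zip(b, c)]
        t1 = time.perf_counter()
        bytes_moved = 3 * 8 * n  # nominal traffic: two loads, one store per entry
        best = max(best, bytes_moved / (t1 - t0) / 1e6)
    assert len(a) == n
    return best

for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9}: {triad_rate_mbs(n):10.1f} MB/s")
```

If the measured rate drops sharply between two problem sizes, the working set has likely fallen out of some level of cache at that point.
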