xref: /petsc/doc/manual/performance.md (revision 6c5693054f5123506dab0f5da2d352ed973d0e50)
(ch_performance)=

# Hints for Performance Tuning

This chapter provides hints on how to achieve the best performance
with PETSc, particularly on distributed-memory machines with multiple
CPU sockets per node. We focus on machine-related performance
optimization here; algorithmic aspects like preconditioner selection are
not the focus of this chapter.

## Maximizing Memory Bandwidth

Most operations in PETSc deal with large datasets (typically vectors and
sparse matrices) and perform relatively few arithmetic operations for
each byte loaded or stored from global memory. Therefore, the
*arithmetic intensity*, expressed as the ratio of floating point
operations to the number of bytes loaded and stored, is usually well
below unity for typical PETSc operations. On the other hand, modern CPUs
are able to execute on the order of 10 floating point operations for
each byte loaded or stored. As a consequence, almost all PETSc
operations are limited by the rate at which data can be loaded or stored
(*memory bandwidth limited*) rather than by the rate of floating point
operations.

This section discusses ways to maximize the memory bandwidth achieved by
applications based on PETSc. Where appropriate, we include benchmark
results in order to quantify the typical performance
gains one can achieve through parallelization, both on a single compute
node and across nodes. In particular, we start with the answer to the
common question of why performance generally does not increase 20-fold
with a 20-core CPU.

(subsec_bandwidth_vs_processes)=

### Memory Bandwidth vs. Processes

Consider the addition of two large vectors, with the result written to a
third vector. Because there are no dependencies across the different
entries of each vector, the operation is embarrassingly parallel.

:::{figure} /images/manual/stream-results-intel.*
:alt: Memory bandwidth obtained on Intel hardware (dual socket except KNL) over the
:  number of processes used. One can get close to peak memory bandwidth with only a
:  few processes.
:name: fig_stream_intel
:width: 80.0%

Memory bandwidth obtained on Intel hardware (dual socket except KNL)
over the number of processes used. One can get close to peak memory
bandwidth with only a few processes.
:::

As {numref}`fig_stream_intel` shows, the performance gains due to
parallelization on different multi- and many-core CPUs quickly
saturate. The reason is that only a fraction of the total number of CPU
cores is required to saturate the memory channels. For example, a
dual-socket system equipped with Haswell 12-core Xeon CPUs achieves more
than 80 percent of the achievable peak memory bandwidth with only four
processes per socket (8 total), cf. {numref}`fig_stream_intel`.
Consequently, running with more than 8 MPI ranks on such a system will
not increase performance substantially. For the same reason, PETSc-based
applications usually do not benefit from hyper-threading.

PETSc provides a simple way to measure memory bandwidth for different
numbers of processes via the target `make streams` executed from
`$PETSC_DIR`. The output provides an overview of the possible speedup
one can obtain on the given machine (not necessarily a shared memory
system). For example, the following is the most relevant output obtained
on a dual-socket system equipped with two six-core CPUs with
hyperthreading:

```none
np  speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
...
21 3.82
22 3.49
23 3.79
24 3.71
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 1 node(s)
```

On this machine, one should expect a speedup of typical memory
bandwidth-bound PETSc applications of at most 4x when running multiple
MPI ranks on the node. Most of the gains are already obtained when
running with only 4-6 ranks. Because a smaller number of MPI ranks
usually implies better preconditioners and better performance for
smaller problems, the best performance for PETSc applications may be
obtained with fewer ranks than there are physical CPU cores available.

Following the results from the above run of `make streams`, we
recommend using additional nodes instead of placing additional MPI
ranks on the nodes. In particular, weak scaling (i.e. constant load per
process, increasing the number of processes) and strong scaling
(i.e. constant total work, increasing the number of processes) studies
should keep the number of processes per node constant.

### Non-Uniform Memory Access (NUMA) and Process Placement

CPUs in nodes with more than one CPU socket are internally connected via
a high-speed fabric, cf. {numref}`fig_numa`, to enable data
exchange as well as cache coherency. Because main memory on modern
systems is connected via the integrated memory controllers on each CPU,
memory is accessed in a non-uniform way: A process running on one socket
has direct access to the memory channels of the respective CPU, whereas
requests for memory attached to a different CPU socket need to go
through the high-speed fabric. Consequently, the best aggregate memory
bandwidth on the node is obtained when the memory controllers on each
CPU are fully saturated. However, full saturation of memory channels is
only possible if the data is distributed across the different memory
channels.

:::{figure} /images/manual/numa.*
:alt: Schematic of a two-socket NUMA system. Processes should be spread across both
:  CPUs to obtain full bandwidth.
:name: fig_numa
:width: 90.0%

Schematic of a two-socket NUMA system. Processes should be spread
across both CPUs to obtain full bandwidth.
:::

Data in memory on modern machines is allocated by the operating system
based on a first-touch policy. That is, physical memory is not assigned
at the point of issuing `malloc()`, but at the point when the respective
memory segment is first touched (read or written). Upon first touch,
memory is allocated on the memory channel associated with the
CPU the process is running on. Only if all memory on the respective CPU
is already in use (either allocated or as IO cache) is memory available
through other sockets considered.

Maximum memory bandwidth can be achieved by ensuring that processes are
spread over all sockets in the respective node. For example, the
recommended placement of an 8-way parallel run on a four-socket machine
is to assign two processes to each CPU socket. To do so, one needs to
know the enumeration of cores and pass the requested information to
`mpiexec`. Consider the hardware topology information returned by
`lstopo` (part of the hwloc package) for the following two-socket
machine, in which each CPU consists of six cores and supports
hyperthreading:

```none
Machine (126GB total)
  NUMANode L#0 (P#0 63GB)
    Package L#0 + L3 L#0 (15MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#12)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#13)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#14)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#15)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#16)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#17)
  NUMANode L#1 (P#1 63GB)
    Package L#1 + L3 L#1 (15MB)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#18)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#19)
      L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#20)
      L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#21)
      L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#22)
      L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#23)
```

The relevant physical processor IDs are shown in parentheses prefixed by
`P#`. Here, IDs 0 and 12 share the same physical core and have a
common L2 cache. IDs 0, 12, 1, 13, 2, 14, 3, 15, 4, 16, 5, 17 share the
same socket and have a common L3 cache.

A good placement for a run with six processes is to locate three
processes on the first socket and three processes on the second socket.
Unfortunately, mechanisms for process placement vary across MPI
implementations, so make sure to consult the manual of your MPI
implementation. The following discussion is based on how processor
placement is done with MPICH and Open MPI, where one needs to pass
`--bind-to core --map-by socket` to `mpiexec`:

```console
$ mpiexec -n 6 --bind-to core --map-by socket ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 000000100000000000100000
process 2 binding: 010000000000010000000000
process 3 binding: 000000010000000000010000
process 4 binding: 001000000000001000000000
process 5 binding: 000000001000000000001000
Triad:        45403.1949   Rate (MB/s)
```

In this configuration, process 0 is bound to the first physical core on
the first socket (with IDs 0 and 12), process 1 is bound to the first
core on the second socket (IDs 6 and 18), and similarly for the
remaining processes. The achieved bandwidth of 45 GB/sec is close to the
practical peak of about 50 GB/sec available on the machine. If, however,
all MPI processes are located on the same socket, memory bandwidth drops
significantly:

```console
$ mpiexec -n 6 --bind-to core --map-by core ./stream
process 0 binding: 100000000000100000000000
process 1 binding: 010000000000010000000000
process 2 binding: 001000000000001000000000
process 3 binding: 000100000000000100000000
process 4 binding: 000010000000000010000000
process 5 binding: 000001000000000001000000
Triad:        25510.7507   Rate (MB/s)
```

All processes are now mapped to cores on the same socket. As a result,
only the first memory channel is fully saturated at 25.5 GB/sec.

One must not assume that `mpiexec` uses good defaults. To
demonstrate, compare the full output of `make streams` from {any}`subsec_bandwidth_vs_processes` first, followed by
the results obtained by passing `--bind-to core --map-by socket`:

```console
$ make streams
np  speedup
1 1.0
2 1.58
3 2.19
4 2.42
5 2.63
6 2.69
7 2.31
8 2.42
9 2.37
10 2.65
11 2.3
12 2.53
13 2.43
14 2.63
15 2.74
16 2.7
17 3.28
18 3.66
19 3.95
20 3.07
21 3.82
22 3.49
23 3.79
24 3.71
```

```console
$ make streams MPI_BINDING="--bind-to core --map-by socket"
np  speedup
1 1.0
2 1.59
3 2.66
4 3.5
5 3.56
6 4.23
7 3.95
8 4.39
9 4.09
10 4.46
11 4.15
12 4.42
13 3.71
14 3.83
15 4.08
16 4.22
17 4.18
18 4.31
19 4.22
20 4.28
21 4.25
22 4.23
23 4.28
24 4.22
```

For the non-optimized first run, the speedup obtained when
using any number of processes between 3 and 13 is essentially constant
up to fluctuations, indicating that all processes were by default
executed on the same socket. Only with 14 or more processes does the
speedup increase again. In contrast, the second run of `make streams`,
with proper processor placement, resulted in a slightly higher overall
parallel speedup (identical baselines), in smaller performance
fluctuations, and in more than 90 percent of peak bandwidth with only
six processes.

Machines with job submission systems such as SLURM usually provide
similar mechanisms for processor placement through options specified in
job submission scripts. Please consult the respective manuals.

#### Additional Process Placement Considerations and Details

For a typical, memory bandwidth-limited PETSc application, the primary
consideration in placing MPI processes is ensuring that processes are
evenly distributed among sockets, and hence using all available memory
channels. Increasingly complex processor designs and cache hierarchies,
however, mean that performance may also be sensitive to how processes
are bound to the resources within each socket. Performance on the
two-processor machine in the preceding example may be relatively insensitive
to such placement decisions, because one L3 cache is shared by all cores
within a NUMA domain, and each core has its own L2 and L1 caches.
However, processors that are less “flat”, with more complex hierarchies,
may be more sensitive. In many AMD Opterons or the second-generation
“Knights Landing” Intel Xeon Phi, for instance, L2 caches are shared
between two cores. On these processors, placing consecutive MPI ranks on
cores that share the same L2 cache may benefit performance if the two
ranks communicate frequently with each other, because the latency
between cores sharing an L2 cache may be roughly half that of two cores
not sharing one. There may be a benefit, however, in placing consecutive
ranks on cores that do not share an L2 cache, because (if there are
fewer MPI ranks than cores) this increases the total L2 cache capacity
and bandwidth available to the application. There is a trade-off to be
considered between placing processes close together (in terms of shared
resources) to optimize for efficient communication and synchronization
vs. farther apart to maximize available resources (memory channels,
caches, I/O channels, etc.), and the best strategy will depend on the
application and the software and hardware stack.

Different process placement strategies can affect performance at least
as much as some commonly explored settings, such as compiler
optimization levels. Unfortunately, exploration of this space is
complicated by two factors: First, processor and core numberings may be
completely arbitrary, changing with BIOS version, etc., and second—as
already noted—there is no standard mechanism used by MPI implementations
(or job schedulers) to specify process affinity. To overcome the first
issue, we recommend using the `lstopo` utility of the Portable
Hardware Locality (`hwloc`) software package (which can be installed
by configuring PETSc with `--download-hwloc`) to understand the
processor topology of your machine. We cannot fully address the second
issue—consult the documentation for your MPI implementation and/or job
scheduler—but we offer some general observations on understanding
placement options:

- An MPI implementation may support a notion of *domains* in which a
  process may be pinned. A domain may simply correspond to a single
  core; however, the MPI implementation may allow a great deal of
  flexibility in specifying domains that encompass multiple cores, span
  sockets, etc. Some implementations, such as Intel MPI, provide means to
  specify whether domains should be “compact”—composed of cores sharing
  resources such as caches—or “scatter”-ed, with little resource
  sharing (possibly even spanning sockets).
- Separate from the specification of domains, MPI implementations often
  support different *orderings* in which MPI ranks should be bound to
  these domains. Intel MPI, for instance, supports “compact” ordering
  to place consecutive ranks close in terms of shared resources,
  “scatter” to place them far apart, and “bunch” to map proportionally
  to sockets while placing ranks as close together as possible within
  the sockets.
- An MPI implementation that supports process pinning should offer some
  way to view the rank assignments. Use this output in conjunction with
  the topology obtained via `lstopo` or a similar tool to determine
  if the placements correspond to something you believe is reasonable
  for your application. Do not assume that the MPI implementation is
  doing something sensible by default!

## Performance Pitfalls and Advice

This section looks into a potpourri of performance pitfalls encountered
by users in the past. Many of these pitfalls require a deeper
understanding of the system and experience to detect. The purpose of
this section is to summarize and share our experience so that these
pitfalls can be avoided in the future.

### Debug vs. Optimized Builds

PETSc’s `configure` defaults to building PETSc with debug mode
enabled. Any code development should be done in this mode, because it
provides handy debugging facilities such as accurate stack traces,
memory leak checks, and memory corruption checks. Note that PETSc has no
reliable way of knowing whether a particular run is a production or
debug run. In the case that a user requests profiling information via
`-log_view`, a debug build of PETSc issues the following warning:

```none
##########################################################
#                                                        #
#                          WARNING!!!                    #
#                                                        #
#   This code was compiled with a debugging option,      #
#   To get timing results run configure                  #
#   using --with-debugging=no, the performance will      #
#   be generally two or three times faster.              #
#                                                        #
##########################################################
```

Conversely, one way of checking whether a particular build of PETSc has
debugging enabled is to inspect the output of `-log_view`.

Debug mode will generally be most useful for code development if
appropriate compiler options are set to facilitate debugging. The
compiler should be instructed to generate binaries with debug symbols
(command line option `-g` for most compilers), and the optimization
level chosen should either completely disable optimizations (`-O0` for
most compilers) or enable only optimizations that do not interfere with
debugging (GCC, for instance, supports a `-Og` optimization level that
does this).

Only once the new code is thoroughly tested and ready for production
should one disable the debugging facilities by passing
`--with-debugging=no` to `configure`. One should also ensure that an
appropriate compiler
optimization level is set. Note that some compilers (e.g., Intel)
default to fairly comprehensive optimization levels, while others (e.g.,
GCC) default to no optimization at all. The best optimization flags will
depend on your code, the compiler, and the target architecture, but we
offer a few guidelines for finding those that will offer the best
performance:

- Most compilers have a number of optimization levels (with level n
  usually specified via `-On`) that provide a quick way to enable
  sets of several optimization flags. We suggest trying the higher
  optimization levels (the highest level is not guaranteed to produce
  the fastest executable, so some experimentation may be merited). With
  most recent processors now supporting some form of SIMD or vector
  instructions, it is important to choose a level that enables the
  compiler’s auto-vectorizer; many compilers do not enable
  auto-vectorization at lower optimization levels (e.g., GCC does not
  enable it below `-O3` and the Intel compiler does not enable it
  below `-O2`).
- For processors supporting newer vector instruction sets, such as
  Intel AVX2 and AVX-512, it is also important to direct the compiler
  to generate code that targets these processors (e.g., `-march=native`);
  otherwise, the executables built will not
  utilize the newer instruction sets and will not take advantage of
  the vector processing units.
- Beyond choosing the optimization levels, some value-unsafe
  optimizations (such as using reciprocals of values instead of
  dividing by those values, or allowing re-association of operands in a
  series of calculations) for floating point calculations may yield
  significant performance gains. Compilers often provide flags (e.g.,
  `-ffast-math` in GCC) to enable a set of these optimizations, and
  they may be turned on when using options for very aggressive
  optimization (`-fast` or `-Ofast` in many compilers). These are
  worth exploring to maximize performance, but, if employed, it is
  important to verify that these do not cause erroneous results with
  your code, since calculations may violate the IEEE standard for
  floating-point arithmetic.

### Profiling

Users should not spend time optimizing a code until after having
determined where it spends the bulk of its time on realistically sized
problems. As discussed in detail in {any}`ch_profiling`, the
PETSc routines automatically log performance data if certain runtime
options are specified.

To obtain a summary of where and how much time is spent in different
sections of the code, use one of the following options:

- Run the code with the option `-log_view` to print a performance
  summary for various phases of the code.
- Run the code with the option `-log_mpe` `[logfilename]`, which
  creates a logfile of events suitable for viewing with Jumpshot (part
  of MPICH).

Then, focus on the sections where most of the time is spent. If you
provided your own callback routines, e.g. for residual evaluations,
search the profiling output for routines such as `SNESFunctionEval` or
`SNESJacobianEval`. If their relative time is significant (say, more
than 30 percent), consider optimizing these routines first. Generic
instructions on how to optimize your callback functions are difficult;
you may start by reading performance optimization guides for your
system’s hardware.

### Aggregation

Performing operations on chunks of data rather than a single element at
a time can significantly enhance performance because of cache reuse or
lower data motion. Typical examples are:

- Insert several (many) elements of a matrix or vector at once, rather
  than looping and inserting a single value at a time. In order to
  access elements of a vector repeatedly, employ `VecGetArray()` to
  allow direct manipulation of the vector elements.
- When possible, use `VecMDot()` rather than a series of calls to
  `VecDot()`.
- If you require a sequence of matrix-vector products with the same
  matrix, consider packing your vectors into a single matrix and using
  matrix-matrix multiplications.
- Users should employ a reasonable number of `PetscMalloc()` calls in
  their codes. Hundreds or thousands of memory allocations may be
  appropriate; however, if tens of thousands are being used, then
  reducing the number of `PetscMalloc()` calls may be warranted. For
  example, reusing space or allocating large chunks and dividing them
  into pieces can produce significant savings in allocation overhead.
  {any}`sec_dsreuse` gives details.

Aggressive aggregation of data may result in inflexible data structures
and code that is hard to maintain. We advise users to keep these
competing goals in mind and not blindly optimize for performance only.

(sec_symbolfactor)=

### Memory Allocation for Sparse Matrix Factorization

When symbolically factoring an AIJ matrix, PETSc has to guess how much
fill there will be. Careful use of the fill parameter in the
`MatFactorInfo` structure when calling `MatLUFactorSymbolic()` or
`MatILUFactorSymbolic()` can greatly reduce the number of mallocs and
copies required, and thus greatly improve the performance of the
factorization. One way to determine a good value for the fill parameter
is to run a program with the option `-info`. The symbolic
factorization phase will then print information such as

```none
Info:MatILUFactorSymbolic_SeqAIJ:Reallocs 12 Fill ratio:given 1 needed 2.16423
```

This indicates that the user should have used a fill estimate factor of
about 2.17 (instead of 1) to prevent the 12 required mallocs and copies.
The command line option

```none
-pc_factor_fill 2.17
```

will cause PETSc to preallocate the correct amount of space for
the factorization.

(detecting_memory_problems)=

### Detecting Memory Allocation Problems and Memory Usage

PETSc provides tools to aid in understanding PETSc memory usage and detecting problems with
memory allocation, including leaks and use of uninitialized space. Internally, PETSc uses
the routines `PetscMalloc()` and `PetscFree()` for memory allocation instead of directly calling `malloc()` and `free()`.
This allows PETSc to track its memory usage and perform error checking. Users are urged to use these routines as well when
appropriate.

- The option `-malloc_debug` turns on PETSc's extensive runtime checking for memory corruption.
  This checking can be expensive, so it should not be used for
  production runs. The option `-malloc_test` is equivalent to `-malloc_debug`,
  but it only works when PETSc is configured with `--with-debugging` (the default configuration).
  We suggest setting the environment variable `PETSC_OPTIONS=-malloc_test`
  in your shell startup file to automatically enable runtime memory checking when developing code but not
  when running optimized code. Using `-malloc_debug` or `-malloc_test` for large runs can slow them significantly, thus we
  recommend turning them off if your code is painfully slow and you don't need the testing. In addition, you can use
  `-check_pointer_intensity 0` for long debug runs that do not need extensive memory corruption testing. This option
  is occasionally added to the `PETSC_OPTIONS` environment variable by some users.
- The option `-malloc_dump` will print a list of memory locations that have not been freed at the
  conclusion of a program. If all memory has been freed, no message
  is printed. Note that
  the option `-malloc_dump` activates a call to
  `PetscMallocDump()` during `PetscFinalize()`. The user can also
  call `PetscMallocDump()` elsewhere in a program.
- Another useful option is `-malloc_view`, which reports memory usage in all routines at the conclusion of the program.
  Note that this option
  activates logging by calling `PetscMallocViewSet()` in
  `PetscInitialize()` and then prints the log by calling
  `PetscMallocView()` in `PetscFinalize()`. The user can also call
  these routines elsewhere in a program.
- When finer granularity is desired, the user can call `PetscMallocGetCurrentUsage()` and
  `PetscMallocGetMaximumUsage()` for memory allocated by PETSc, or
  `PetscMemoryGetCurrentUsage()` and `PetscMemoryGetMaximumUsage()`
  for the total memory used by the program. Note that
  `PetscMemorySetGetMaximumUsage()` must be called before
  `PetscMemoryGetMaximumUsage()` (typically at the beginning of the
  program).
- The option `-memory_view` provides a high-level view of all memory usage,
  not just the memory used by `PetscMalloc()`, at the conclusion of the program.
- When running with `-log_view`, the additional option `-log_view_memory`
  causes the display of additional columns of information about how much
  memory was allocated and freed during each logged event. This is useful
  to understand what phases of a computation require the most memory.

One can also use [Valgrind](http://valgrind.org) to track memory usage and find bugs, see {any}`FAQ: Valgrind usage<valgrind>`.

(sec_dsreuse)=

### Data Structure Reuse

Data structures should be reused whenever possible. For example, if a
code often creates new matrices or vectors, there often may be a way to
reuse some of them. Very significant performance improvements can be
achieved by reusing matrix data structures with the same nonzero
pattern. If a code creates thousands of matrix or vector objects,
performance will be degraded. For example, when solving a nonlinear
problem or timestepping, reusing the matrices and their nonzero
structure for many steps when appropriate can make the code run
significantly faster.

A simple technique for saving work vectors, matrices, etc. is employing
a user-defined context. In C and C++ such a context is merely a
structure in which various objects can be stashed; in Fortran a user
context can be an integer array that contains both parameters and
pointers to PETSc objects. See
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5.c.html">SNES Tutorial ex5</a>
and
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/snes/tutorials/ex5f90.F90.html">SNES Tutorial ex5f90</a>
for examples of user-defined application contexts in C and Fortran,
respectively.

### Numerical Experiments

PETSc users should run a variety of tests. For example, there are a
large number of options for the linear and nonlinear equation solvers in
PETSc, and different choices can make a *very* big difference in
convergence rates and execution times. PETSc employs defaults that are
generally reasonable for a wide range of problems, but clearly these
defaults cannot be best for all cases. Users should experiment with many
combinations to determine what is best for a given problem and customize
the solvers accordingly.

- Use the options `-snes_view`, `-ksp_view`, etc. (or the routines
  `KSPView()`, `SNESView()`, etc.) to view the options that have
  been used for a particular solver.
- Run the code with the option `-help` for a list of the available
  runtime commands.
- Use the option `-info` to print details about the solvers’
  operation.
- Use the PETSc monitoring discussed in {any}`ch_profiling`
  to evaluate the performance of various numerical methods.

(sec_slestips)=

### Tips for Efficient Use of Linear Solvers

As discussed in {any}`ch_ksp`, the default linear
solvers are

- uniprocess: GMRES(30) with ILU(0) preconditioning
- multiprocess: GMRES(30) with block Jacobi preconditioning, where
  there is 1 block per process, and each block is solved with ILU(0)

One should experiment to determine alternatives that may be better for
various applications. Recall that one can specify the `KSP` methods
and preconditioners at runtime via the options:

```none
-ksp_type <ksp_name> -pc_type <pc_name>
```

One can also specify a variety of runtime customizations for the
solvers, as discussed throughout the manual.

In particular, note that the default restart parameter for GMRES is 30,
which may be too small for some large-scale problems. One can alter this
parameter with the option `-ksp_gmres_restart <restart>` or by calling
`KSPGMRESSetRestart()`. {any}`sec_ksp` gives
information on setting alternative GMRES orthogonalization routines,
which may provide much better parallel performance.

For elliptic problems one often obtains good performance and scalability
with multigrid solvers. Consult {any}`sec_amg` for
available options. Our experience is that GAMG works particularly well
for elasticity problems, whereas hypre does well for scalar problems.

### System-Related Problems

The performance of a code can be affected by a variety of factors,
including the cache behavior, other users on the machine, etc. Below we
briefly describe some common problems and possibilities for overcoming
them.

- **Problem too large for physical memory size**: When timing a
  program, one should always leave at least a ten percent margin
  between the total memory a process is using and the physical size of
  the machine’s memory. One way to estimate the amount of memory used
  by a given process is with the Unix `getrusage` system routine.
  The PETSc option `-malloc_view` reports all
  memory usage, including any Fortran arrays in an application code.
- **Effects of other users**: If other users are running jobs on the
  same physical processor nodes on which a program is being profiled,
  the timing results are essentially meaningless.
- **Overhead of timing routines on certain machines**: On certain
  machines, even calling the system clock in order to time routines is
  slow; this skews all of the flop rates and timing results. The file
  `$PETSC_DIR/src/benchmarks/PetscTime.c` (<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/benchmarks/PetscTime.c.html">source</a>)
  contains a simple test problem that will approximate the amount of
  time required to get the current time in a running program. On good
  systems it will take on the order of $10^{-6}$ seconds or less.
- **Problem too large for good cache performance**: Certain machines
  with lower memory bandwidths (slow memory access) attempt to
  compensate by having a very large cache. Thus, if a significant
  portion of an application fits within the cache, the program will
  achieve very good performance; if the code is too large, the
  performance can degrade markedly. To analyze whether this situation
  affects a particular code, one can try plotting the total flop rate
  as a function of problem size. If the flop rate decreases rapidly at
  some point, then the problem is likely too large for the cache
  size.
- **Inconsistent timings**: Inconsistent timings are likely due to
  other users on the machine, thrashing (using more virtual memory than
  available physical memory), or paging in of the initial executable.
  {any}`sec_profaccuracy` provides information on
  overcoming paging overhead when profiling a code. We have found on
  all systems that if you follow all the advice above, your timings will
  be consistent to within a variation of less than five percent.
721