(ch_profiling)=

# Profiling

PETSc includes a consistent, lightweight scheme for profiling
application programs. The PETSc routines automatically log
performance data if certain options are specified at runtime. The user
can also log information about application codes for a complete picture
of performance.

In addition, as described in
{any}`sec_ploginfo`, PETSc provides a mechanism for
printing informative messages about computations.
{any}`sec_profbasic` introduces the various profiling
options in PETSc, while the remainder of the chapter focuses on details
such as monitoring application codes and tips for accurate profiling.

(sec_profbasic)=

## Basic Profiling Information

The profiling options include the following:

- `-log_view [:filename]` - Prints an ASCII version of performance data at the
  program’s conclusion. These statistics are comprehensive and concise
  and require little overhead; thus, `-log_view` is intended as the
  primary means of monitoring the performance of PETSc codes. See {any}`sec_ploginfo`.
- `-info [infofile]` - Prints verbose information about code to
  stdout or an optional file. This option provides details about
  algorithms, data structures, etc. Since the overhead of printing such
  output slows a code, this option should not be used when evaluating a
  program’s performance. See {any}`sec_petscinfo`.
- `-log_trace [logfile]` - Traces the beginning and ending of all
  PETSc events. This option, which can be used in conjunction with
  `-info`, is useful for seeing where a program is hanging without
  running it in the debugger. See `PetscLogTraceBegin()`.

As discussed in {any}`sec_mpelogs`, additional profiling
can be done with MPE.

(sec_ploginfo)=

### Interpreting `-log_view` Output: The Basics

As shown in {any}`listing <listing_exprof>` in {any}`sec_profiling_programs`, the
option `-log_view [:filename]` activates printing of profile data to standard
output or an ASCII file at the conclusion of a program. See `PetscLogView()` for all the possible
output options.

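For example, one of the KSP tutorials can be profiled with the summary written
to a file (the executable and solver options here are illustrative):

```console
cd $PETSC_DIR/src/ksp/ksp/tutorials
make ex2
mpiexec -n 2 ./ex2 -pc_type bjacobi -sub_pc_type ilu -log_view :profile.txt
```
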
We print performance data for each routine, organized by PETSc
libraries, followed by any user-defined events (discussed in
{any}`sec_profileuser`). For each routine, the output data
include the maximum time and floating point operation (flop) rate over
all processes. Information about parallel performance is also included,
as discussed in the following section.

For the purpose of PETSc floating point operation counting, we define
one *flop* as one operation of any of the following types:
multiplication, division, addition, or subtraction. For example, one
`VecAXPY()` operation, which computes $y = \alpha x + y$ for
vectors of length $N$, requires $2N$ flop (consisting of
$N$ additions and $N$ multiplications). Bear in mind that
flop rates present only a limited view of performance, since memory
loads and stores are the real performance barrier.

For simplicity, the remainder of this discussion focuses on interpreting
profile data for the `KSP` library, which provides the linear solvers
at the heart of the PETSc package. Recall the hierarchical organization
of the PETSc library, as shown in
{any}`fig_library`. Each `KSP` solver is composed
of a `PC` (preconditioner) and a `KSP` (Krylov subspace) part, which
are in turn built on top of the `Mat` (matrix) and `Vec` (vector)
modules. Thus, operations in the `KSP` module are composed of
lower-level operations in these packages. Note also that the nonlinear
solvers library, `SNES`, is built on top of the `KSP` module, and
the timestepping library, `TS`, is in turn built on top of `SNES`.

We briefly discuss interpretation of the sample output in
{any}`listing <listing_exprof>`, which was generated by solving a
linear system on one process using restarted GMRES and ILU
preconditioning. The linear solvers in `KSP` consist of two basic
phases, `KSPSetUp()` and `KSPSolve()`, each of which consists of a
variety of actions, depending on the particular solution technique. For
the case of using the `PCILU` preconditioner and `KSPGMRES` Krylov
subspace method, the breakdown of PETSc routines is listed below. As
indicated by the levels of indentation, the operations in `KSPSetUp()`
include all of the operations within `PCSetUp()`, which in turn
include `MatILUFactor()`, and so on.

- `KSPSetUp` - Set up linear solver

  - `PCSetUp` - Set up preconditioner

    - `MatILUFactor` - Factor matrix

      - `MatILUFactorSymbolic` - Symbolic factorization phase
      - `MatLUFactorNumeric` - Numeric factorization phase

- `KSPSolve` - Solve linear system

  - `PCApply` - Apply preconditioner

    - `MatSolve` - Forward/backward triangular solves

  - `KSPGMRESOrthog` - Orthogonalization in GMRES

    - `VecDot` or `VecMDot` - Inner products
    - `VecAXPY` or `VecMAXPY` - Vector updates

  - `MatMult` - Matrix-vector product

  - `MatMultAdd` - Matrix-vector product + vector addition

    - `VecScale`, `VecNorm`, `VecAXPY`, `VecCopy`, ...

The summaries printed via `-log_view` reflect this routine hierarchy.
For example, the performance summaries for a particular high-level
routine such as `KSPSolve()` include all of the operations accumulated
in the lower-level components that make up the routine.

The output produced with `-log_view` is flat, meaning that the hierarchy
of PETSc operations is not completely clear. For a
particular problem, the user should generally have an idea of the basic
operations that are required for its implementation (e.g., which
operations are performed when using GMRES and ILU, as described above),
so that interpreting the `-log_view` data should be relatively
straightforward. If this is problematic, it is also possible to examine
the profiling information in a nested format; for more information see
{any}`sec_nestedevents`.

(sec_parperformance)=

### Interpreting `-log_view` Output: Parallel Performance

We next discuss performance summaries for parallel programs, as shown
within {any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`,
which present the
output generated by the `-log_view` option. The program that generated
this data is
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>.
The code loads a matrix and right-hand-side vector from a binary file
and then solves the resulting linear system; the program then repeats
this process for a second linear system. This particular case was run on
four processors of an Intel x86_64 Linux cluster, using restarted GMRES
and the block Jacobi preconditioner, where each block was solved with
ILU. The two input files `medium` and `arco6` can be obtained from
[datafiles](https://gitlab.com/petsc/datafiles); see {any}`petsc_repositories`.

The first {any}`listing <listing_exparprof>` presents an overall
performance summary, including times, floating-point operations,
computational rates, and message-passing activity (such as the number
and size of messages sent and collective operations). Summaries for
various user-defined stages of monitoring (as discussed in
{any}`sec_profstages`) are also given. Information about the
various phases of computation then follows (as shown separately here in
the second {any}`listing <listing_exparprof2>`). Finally, a summary of
object creation and destruction is presented.

(listing_exparprof)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
            -matload_block_size 3 -pc_type bjacobi -options_left

Number of iterations = 19
Residual norm 1.088292e-05
Number of iterations = 59
Residual norm 3.871022e-02
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex10 on a intel-bdw-opt named beboplogin4 with 4 processors, by jczhang Mon Apr 23 13:36:54 2018
Using PETSc Development Git Revision: v3.9-163-gbe3efd42 Git Date: 2018-04-16 10:45:40 -0500

                         Max       Max/Min        Avg      Total
Time (sec):           1.849e-01      1.00002   1.849e-01
Objects:              1.060e+02      1.00000   1.060e+02
Flops:                2.361e+08      1.00684   2.353e+08  9.413e+08
Flops/sec:            1.277e+09      1.00685   1.273e+09  5.091e+09
MPI Msg Count:        2.360e+02      1.34857   2.061e+02  8.245e+02
MPI Msg Len (bytes):  1.256e+07      2.24620   4.071e+04  3.357e+07
MPI Reductions:       2.160e+02      1.00000

Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 5.9897e-04   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
 1:   Load System 0: 2.9113e-03   1.6%  0.0000e+00   0.0%  3.550e+01   4.3%  5.984e+02        0.1%  2.200e+01  10.2%
 2:      KSPSetUp 0: 7.7349e-04   0.4%  9.9360e+03   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
 3:      KSPSolve 0: 1.7690e-03   1.0%  2.9673e+05   0.0%  1.520e+02  18.4%  1.800e+02        0.1%  3.900e+01  18.1%
 4:   Load System 1: 1.0056e-01  54.4%  0.0000e+00   0.0%  3.700e+01   4.5%  5.657e+05       62.4%  2.200e+01  10.2%
 5:      KSPSetUp 1: 5.6883e-03   3.1%  2.1205e+07   2.3%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
 6:      KSPSolve 1: 7.2578e-02  39.3%  9.1979e+08  97.7%  6.000e+02  72.8%  2.098e+04       37.5%  1.200e+02  55.6%

------------------------------------------------------------------------------------------------------------------------

.... [Summary of various phases, see part II below] ...

------------------------------------------------------------------------------------------------------------------------

Object Type          Creations   Destructions    (Reports information only for process 0.)
...
--- Event Stage 3: KSPSolve 0

              Matrix     0              4
              Vector    20             30
           Index Set     0              3
         Vec Scatter     0              1
       Krylov Solver     0              2
      Preconditioner     0              2
```

We next focus on the summaries for the various phases of the
computation, as given in the table within
the following {any}`listing <listing_exparprof2>`. The summary for each
phase presents the maximum times and flop rates over all processes, as
well as the ratio of maximum to minimum times and flop rates for all
processes. A ratio of approximately 1 indicates that computations within
a given phase are well balanced among the processes; as the ratio
increases, the balance becomes increasingly poor. Also, the total
computational rate (in units of MFlop/sec) is given for each phase in
the final column of the phase summary table.

$$
\text{Total Mflop/sec} = 10^{-6} \cdot \frac{\text{sum of flop over all processors}}{\text{max time over all processors}}
$$

*Note*: Total computational rates below 1 MFlop/sec are listed as 0 in
this column of the phase summary table. Additional statistics for each
phase include the total number of messages sent, the average message
length, and the number of global reductions.

(listing_exparprof2)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
            -matload_block_size 3 -pc_type bjacobi -options_left

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
.... [Overall summary, see part I] ...

Phase summary info:
   Count: number of times phase was executed
   Time and Flop/sec: Max - maximum over all processors
                       Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: optional user-defined stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10^6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Phase              Count      Time (sec)       Flop/sec                          --- Global ---  --- Stage ----  Total
                            Max    Ratio      Max    Ratio  Mess AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
...

--- Event Stage 5: KSPSetUp 1

MatLUFactorNum         1 1.0 3.6440e-03 1.1 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0  62100  0  0  0  5819
MatILUFactorSym        1 1.0 1.7111e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  26  0  0  0  0     0
MatGetRowIJ            1 1.0 1.1921e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 3.0041e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
KSPSetUp               2 1.0 6.6495e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  1   9  0  0  0100     0
PCSetUp                2 1.0 5.4271e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3907
PCSetUpOnBlocks        1 1.0 5.3999e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3927

--- Event Stage 6: KSPSolve 1

MatMult               60 1.0 2.4068e-02 1.1 6.54e+07 1.0 6.0e+02 2.1e+04 0.0e+00 12 27 73 37  0  32 28100100  0 10731
MatSolve              61 1.0 1.9177e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 10 25  0  0  0  26 26  0  0  0 12491
VecMDot               59 1.0 1.4741e-02 1.3 4.86e+07 1.0 0.0e+00 0.0e+00 5.9e+01  7 21  0  0 27  18 21  0  0 49 13189
VecNorm               61 1.0 3.0417e-03 1.4 3.29e+06 1.0 0.0e+00 0.0e+00 6.1e+01  1  1  0  0 28   4  1  0  0 51  4332
VecScale              61 1.0 9.9802e-04 1.0 1.65e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  6602
VecCopy                2 1.0 5.9128e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                64 1.0 8.0323e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
VecAXPY                3 1.0 7.4387e-05 1.1 1.62e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  8712
VecMAXPY              61 1.0 8.8558e-03 1.1 5.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  5 22  0  0  0  12 23  0  0  0 23393
VecScatterBegin       60 1.0 9.6416e-04 1.8 0.00e+00 0.0 6.0e+02 2.1e+04 0.0e+00  0  0 73 37  0   1  0100100  0     0
VecScatterEnd         60 1.0 6.1543e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   8  0  0  0  0     0
VecNormalize          61 1.0 4.2675e-03 1.3 4.94e+06 1.0 0.0e+00 0.0e+00 6.1e+01  2  2  0  0 28   5  2  0  0 51  4632
KSPGMRESOrthog        59 1.0 2.2627e-02 1.1 9.72e+07 1.0 0.0e+00 0.0e+00 5.9e+01 11 41  0  0 27  29 42  0  0 49 17185
KSPSolve               1 1.0 7.2577e-02 1.0 2.31e+08 1.0 6.0e+02 2.1e+04 1.2e+02 39 98 73 37 56  99100100100100 12673
PCSetUpOnBlocks        1 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply               61 1.0 2.0427e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 25  0  0  0  28 26  0  0  0 11726
------------------------------------------------------------------------------------------------------------------------
.... [Conclusion of overall summary, see part I] ...
```

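For example, the `KSPSolve` row in stage 6 above reports a maximum time of
7.2577e-02 seconds, and the stage summary in the first listing gives
9.1979e+08 total flop for that stage; together these reproduce the 12673 in
the final `Mflop/s` column of that row:

$$
10^{-6} \cdot \frac{9.1979 \times 10^{8}}{7.2577 \times 10^{-2}} \approx 12673
$$
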
As discussed in the preceding section, the performance summaries for
higher-level PETSc routines include the statistics for the lower levels
of which they are made up. For example, the communication within
matrix-vector products `MatMult()` consists of vector scatter
operations, as given by the routines `VecScatterBegin()` and
`VecScatterEnd()`.

The final data presented are the percentages of the various statistics
(time (`%T`), flop (`%F`), messages (`%M`), average message
length (`%L`), and reductions (`%R`)) for each event relative to the
total computation and to any user-defined stages (discussed in
{any}`sec_profstages`). These statistics can aid in
optimizing performance, since they indicate the sections of code that
could benefit from various kinds of tuning.
{any}`ch_performance` gives suggestions about achieving good
performance with PETSc codes.

The option `-log_view_memory` adds columns reporting how much
memory was allocated and freed during each logged event. This is useful
for understanding which phases of a computation require the most memory.

(sec_mpelogs)=

### Using `-log_mpe` with Jumpshot

It is also possible to use the *Jumpshot* package
{cite}`upshot` to visualize PETSc events. This package comes
with the MPE software, which is part of the MPICH
{cite}`mpich-web-page` implementation of MPI. The option

```none
-log_mpe [logfile]
```

creates a logfile of events appropriate for viewing with *Jumpshot*. The
user can either use the default logging file or specify a name via
`logfile`. Events can be deactivated as described in
{any}`sec_deactivate`.

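A minimal invocation might look like the following, assuming PETSc was
configured with MPE support (the executable name is illustrative); the
resulting logfile is then opened in *Jumpshot*:

```console
mpiexec -n 2 ./ex2 -log_mpe ex2mpe.log
```
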
The user can also log MPI events. To do this, simply consider the PETSc
application as any MPI application, and follow the MPI implementation’s
instructions for logging MPI calls. For example, when using MPICH, this
merely requires adding `-llmpich` to the library list *before*
`-lmpich`.

(sec_nestedevents)=

### Profiling Nested Events

It is possible to output the PETSc logging information in a nested format
where the hierarchy of events is explicit. This output can be generated
either as an XML file or as a text file in a format suitable for viewing as
a flame graph.

One can generate the XML output by passing the option `-log_view :[logfilename]:ascii_xml`.
It can be viewed by copying `${PETSC_DIR}/share/petsc/xml/performance_xml2html.xsl`
into the current directory, then opening the logfile in your browser.

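For example (with an illustrative executable name):

```console
./ex2 -log_view :performance.xml:ascii_xml
cp $PETSC_DIR/share/petsc/xml/performance_xml2html.xsl .
```
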
The flame graph output can be generated with the option `-log_view :[logfile]:ascii_flamegraph`.
It can then be visualized with either [FlameGraph](https://github.com/brendangregg/FlameGraph)
or [speedscope](https://www.speedscope.app). A flame graph can be visualized directly from
stdout using, for example,
[ImageMagick's display utility](https://imagemagick.org/script/display.php):

```
cd $PETSC_DIR/src/sys/tests
make ex30
mpiexec -n 2 ./ex30 -log_view ::ascii_flamegraph | flamegraph | display
```

Note that user-defined stages (see {any}`sec_profstages`) will be ignored when
using this nested format.

(sec_profileuser)=

## Profiling Application Codes

PETSc automatically logs object creation, times, and floating-point
counts for the library routines. Users can easily supplement this
information by profiling their application codes as well. The basic
steps involved in logging a user-defined portion of code, called an
*event*, are shown in the code fragment below:

```
PetscLogEvent  USER_EVENT;
PetscClassId   classid;
PetscLogDouble user_event_flops;

PetscClassIdRegister("class name",&classid);
PetscLogEventRegister("User event name",classid,&USER_EVENT);
PetscLogEventBegin(USER_EVENT,0,0,0,0);
/* code segment to monitor */
PetscLogFlops(user_event_flops);
PetscLogEventEnd(USER_EVENT,0,0,0,0);
```

One must register the event by calling `PetscLogEventRegister()`,
which assigns a unique integer to identify the event for profiling
purposes:

```
PetscLogEventRegister(const char string[],PetscClassId classid,PetscLogEvent *e);
```

Here `string` is a user-defined event name, and `classid` is the
class identifier with which the event is associated, obtained with
`PetscClassIdRegister()` as in the fragment above; one should see the
manual page for details. The argument returned in `e` should then be passed to the
`PetscLogEventBegin()` and `PetscLogEventEnd()` routines.

Events are logged by using the pair

```
PetscLogEventBegin(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
PetscLogEventEnd(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
```

The four objects are the PETSc objects that are most closely associated
with the event. For instance, in a matrix-vector product they would be
the matrix and the two vectors. These objects can be omitted by
specifying 0 for `o1` - `o4`. The code between these two routine
calls will be automatically timed and logged as part of the specified
event.

Events are collective by default on the communicator of `o1` (if present).
They can be made not collective by using `PetscLogEventSetCollective()`.
No synchronization is performed on collective events in optimized builds unless
the command line option `-log_sync` is used; however, we do check for collective
semantics in debug mode.

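For example, to synchronize collective events so that load imbalance is
attributed to the events that cause it (illustrative run):

```console
mpiexec -n 4 ./ex2 -log_view -log_sync
```
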
The user can log the number of floating-point operations for this
segment of code by calling

```
PetscLogFlops(number of flop for this code segment);
```

between the calls to `PetscLogEventBegin()` and
`PetscLogEventEnd()`. This value will automatically be added to the
global flop counter for the entire program.

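As a concrete sketch (the event and function names here are illustrative), a
user kernel computing $y = \alpha x + y$ on arrays of length `n` could be
logged and credited with $2n$ flop as follows:

```
static PetscLogEvent USER_DAXPY; /* hypothetical user-defined event */

static PetscErrorCode UserDaxpy(PetscInt n, PetscScalar a, const PetscScalar *x, PetscScalar *y)
{
  PetscFunctionBeginUser;
  PetscCall(PetscLogEventBegin(USER_DAXPY, 0, 0, 0, 0));
  for (PetscInt i = 0; i < n; i++) y[i] += a * x[i]; /* n multiplications and n additions */
  PetscCall(PetscLogFlops(2.0 * n));
  PetscCall(PetscLogEventEnd(USER_DAXPY, 0, 0, 0, 0));
  PetscFunctionReturn(PETSC_SUCCESS);
}

/* in main(), after PetscInitialize():
   PetscClassId classid;
   PetscCall(PetscClassIdRegister("User kernels", &classid));
   PetscCall(PetscLogEventRegister("UserDaxpy", classid, &USER_DAXPY));   */
```
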
(sec_profstages)=

## Profiling Multiple Sections of Code

By default, the profiling produces a single set of statistics for all
code between the `PetscInitialize()` and `PetscFinalize()` calls
within a program. One can independently monitor several "stages" of code
by switching among the various stages with the commands

```
PetscLogStagePush(PetscLogStage stage);
PetscLogStagePop();
```

see the manual pages for details.
The command

```
PetscLogStageRegister(const char *name,PetscLogStage *stage)
```

allows one to associate a name with a stage; these names are printed
whenever summaries are generated with `-log_view`. The following code fragment uses three profiling
stages within a program.

```
PetscLogStage stagenum0, stagenum1, stagenum2;

PetscInitialize(&argc, &args, NULL, NULL);
/* stage 0 of code here */
PetscLogStageRegister("Stage 0 of Code", &stagenum0);
/* register each stage once, outside the loop */
PetscLogStageRegister("Stage 1 of Code", &stagenum1);
PetscLogStageRegister("Stage 2 of Code", &stagenum2);
for (i = 0; i < ntimes; i++) {
  PetscLogStagePush(stagenum1);
  /* stage 1 of code here */
  PetscLogStagePop();
  PetscLogStagePush(stagenum2);
  /* stage 2 of code here */
  PetscLogStagePop();
}
PetscFinalize();
```


The listings above
show output generated by
`-log_view` for a program that employs several profiling stages. In
particular, this program is subdivided into six stages: loading a matrix and right-hand-side vector from a binary file,
setting up the preconditioner, and solving the linear system; this
sequence is then repeated for a second linear system. For simplicity,
the second listing contains output only for
stages 5 and 6 (linear solve of the second system), which comprise the
part of this computation of most interest to us in terms of performance
monitoring. This code organization (solving a small linear system
followed by a larger system) enables generation of more accurate
profiling statistics for the second system by overcoming the often
considerable overhead of paging, as discussed in
{any}`sec_profaccuracy`.

(sec_deactivate)=

## Restricting Event Logging

By default, all PETSc operations are logged. To enable or disable the
PETSc logging of individual events, one uses the commands

```
PetscLogEventActivate(PetscLogEvent event);
PetscLogEventDeactivate(PetscLogEvent event);
```

The `event` may be either a predefined PETSc event (as listed in the
file `$PETSC_DIR/include/petsclog.h`) or one obtained with
`PetscLogEventRegister()` (as described in
{any}`sec_profileuser`).

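For example, a minimal sketch that suppresses logging of the predefined
vector norm event (`VEC_Norm` is one of the predefined events declared in
`$PETSC_DIR/include/petsclog.h`):

```
PetscLogEventDeactivate(VEC_Norm);
```
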
PETSc also provides routines that deactivate (or activate) logging for
entire components of the library. Currently, the components that support
such logging (de)activation are `Mat` (matrices), `Vec` (vectors),
`KSP` (linear solvers, including `KSP` and `PC`), and `SNES`
(nonlinear solvers):

```
PetscLogEventDeactivateClass(MAT_CLASSID);
PetscLogEventDeactivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventDeactivateClass(VEC_CLASSID);
PetscLogEventDeactivateClass(SNES_CLASSID);
```

and

```
PetscLogEventActivateClass(MAT_CLASSID);
PetscLogEventActivateClass(KSP_CLASSID);   /* includes PC and KSP */
PetscLogEventActivateClass(VEC_CLASSID);
PetscLogEventActivateClass(SNES_CLASSID);
```

(sec_petscinfo)=

## Interpreting `-info` Output: Informative Messages

Users can activate the printing of verbose information about algorithms,
data structures, etc. to the screen by using the option `-info` or by
calling `PetscInfoAllow(PETSC_TRUE)`. Such logging, which is used
throughout the PETSc libraries, can aid the user in understanding
algorithms and tuning program performance. For example, as discussed in
{any}`sec_matsparse`, `-info` activates the printing of
information about memory allocation during matrix assembly.

One can selectively turn off informative messages about any of the basic
PETSc objects (e.g., `Mat`, `SNES`) with the command

```
PetscInfoDeactivateClass(int object_classid)
```

where `object_classid` is one of `MAT_CLASSID`, `SNES_CLASSID`,
etc. Messages can be reactivated with the command

```
PetscInfoActivateClass(int object_classid)
```

Such deactivation can be useful when one wishes to view information
about higher-level PETSc libraries (e.g., `TS` and `SNES`) without
seeing all lower level data as well (e.g., `Mat`).

One can turn on or off logging for particular classes at runtime:

```console
-info [filename][:[~]<list,of,classnames>[:[~]self]]
```

The `list,of,classnames` is a comma-separated list (with no spaces) of the classes for which
information should be displayed, for example `vec,ksp`; information on all other classes is
suppressed. Prefixing the list with `~` inverts the selection: the listed classes are
suppressed and all others are displayed.

`self` restricts the display to information about objects associated with `PETSC_COMM_SELF`,
while `~self` displays information only for parallel objects.

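For example, a run such as the following (with an illustrative executable
name) writes information about parallel `Vec` and `KSP` objects, and nothing
else, to the file `myrun.log`:

```console
./ex2 -info myrun.log:vec,ksp:~self
```
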
See `PetscInfo()` for links to all the info operations that are available.

Application programmers can log their own messages, as well, by using the
routine

```
PetscInfo(void* obj,char *message,...)
```

where `obj` is the PETSc object associated most closely with the
logging statement, `message`. For example, in the line search Newton
methods, we use a statement such as

```
PetscInfo(snes,"Cubic step, lambda %g\n",lambda);
```

## Time

PETSc application programmers can access the wall clock time directly
with the command

```
PetscLogDouble time;
PetscCall(PetscTime(&time));
```

which returns the current time in seconds since the epoch, and is
commonly implemented with `MPI_Wtime`. A floating point number is
returned in order to express fractions of a second. In addition, as
discussed in {any}`sec_profileuser`, PETSc can automatically
profile user-defined segments of code.

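A minimal sketch of timing a code segment by differencing two readings:

```
PetscLogDouble t0, t1;

PetscCall(PetscTime(&t0));
/* code segment to time */
PetscCall(PetscTime(&t1));
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Elapsed time: %g seconds\n", (double)(t1 - t0)));
```
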
## Saving Output to a File

All output from PETSc programs (including informative messages,
profiling information, and convergence data) can be saved to a file by
using the command line option `-history [filename]`. If no file name
is specified, the output is stored in the file
`${HOME}/.petschistory`. Note that this option only saves output
printed with the `PetscPrintf()` and `PetscFPrintf()` commands, not
the standard `printf()` and `fprintf()` statements.

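For example (with an illustrative executable name):

```console
./ex2 -history run.history
```
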
(sec_profaccuracy)=

## Accurate Profiling and Paging Overheads

One factor that often plays a significant role in profiling a code is
paging by the operating system. Generally, when running a program, only
a few pages required to start it are loaded into memory rather than the
entire executable. When the execution proceeds to code segments that are
not in memory, a page fault occurs, prompting the required pages to be
loaded from the disk (a very slow process). This activity distorts the
results significantly. (The paging effects are noticeable in the log
files generated by `-log_mpe`, which is described in
{any}`sec_mpelogs`.)

To eliminate the effects of paging when profiling the performance of a
program, we have found an effective procedure is to run the *exact same
code* on a small dummy problem before running it on the actual problem
of interest. We thus ensure that all code required by a solver is loaded
into memory during solution of the small problem. When the code proceeds
to the actual (larger) problem of interest, all required pages have
already been loaded into main memory, so that the performance numbers
are not distorted.

When this procedure is used in conjunction with the user-defined stages
of profiling described in {any}`sec_profstages`, we can
focus easily on the problem of interest. For example, we used this
technique in the program
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>
to generate the timings within
{any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`.
In this case, the profiled code
of interest (solving the linear system for the larger problem) occurs
within event stages 5 and 6. {any}`sec_parperformance`
provides details about interpreting such profiling data.

In particular, the macros

```
PetscPreLoadBegin(PetscBool flag,char* stagename)
PetscPreLoadStage(char *stagename)
```

and

```
PetscPreLoadEnd()
```

can be used to easily convert a regular PETSc program to one that uses
preloading. The command line options `-preload true` and
`-preload false` may be used to turn on and off preloading at run
time for PETSc programs that use these macros.

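A schematic of the preloading pattern follows; the comments are placeholders
for the actual code:

```
PetscPreLoadBegin(PETSC_TRUE, "Load system");
/* create or load the matrix and right-hand side for this pass */
PetscPreLoadStage("KSPSolve");
/* solve the linear system; the first pass preloads the executable pages,
   so the timings from the second pass are the reliable ones */
PetscPreLoadEnd();
```
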
## NVIDIA Nsight Systems profiling

Nsight Systems generates profiling data for a CUDA executable
via the `nsys` command.
For example, in serial,

```bash
nsys profile -t nvtx,cuda -o file --stats=true --force-overwrite true ./a.out
```

will generate a file `file.qdstrm` with performance data that is
annotated with PETSc events (methods) and Kokkos device kernel names.
The Nsight Systems GUI, `nsys-ui`, can be used to navigate this file
(<https://developer.nvidia.com/nsight-systems>). The Nsight Systems GUI
lets you see a timeline of code performance information like kernels,
memory mallocs and frees, CPU-GPU communication, and high-level data like time, sizes
of memory copies, and more, in a popup window when the mouse
hovers over the section.
To view the data, start `nsys-ui` without any arguments and then `Import` the
`.qdstrm` file in the GUI.
A side effect of this viewing process is the generation of a file `file.nsys-rep`, which can be viewed directly
with `nsys-ui` in the future.

For an MPI parallel job, only one process should be run under `nsys`:
for example, have rank zero collect the `nsys` data while all other
ranks invoke the executable directly. With MPICH
or Open MPI, a parallel job on 4 MPI tasks can be run as:

```console
mpiexec -n 1 nsys profile -t nvtx,cuda -o file_name --stats=true --force-overwrite true ./a.out : -n 3 ./a.out
```

Note: The Nsight GUI can open profiling reports from elsewhere. For
example, a report from a compute node can be analyzed on your local
machine, but care should be taken to use the exact same version of
Nsight Systems that generated the report.
To check the version of Nsight on the compute node, run `nsys-ui` and
note the version number at the top of the window.

## ROCProfiler profiling

For AMD GPUs, log events registered with PETSc can be displayed as ranges in trace files generated by `rocprof` by running with the `-log_roctx` flag.
See the `rocprof` [documentation](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest) for details of how to run the profiler.
At a minimum, you will need:

```console
mpiexec -n 1 rocprofv3 --marker-trace -o file_name -- ./path/to/application -log_roctx
```

(sec_using_tau)=

## Using TAU

TAU profiles can be generated without the need for instrumentation through the
use of the perfstubs package. PETSc by default is configured with `--with-tau-perfstubs`.
To generate profiles with TAU, first set up TAU:

```bash
wget http://tau.uoregon.edu/tau.tgz
tar xzf tau.tgz && cd tau-*   # unpack the tarball and enter the TAU source directory
./configure -cc=mpicc -c++=mpicxx -mpi -bfd=download -unwind=download && make install
export PATH=<tau dir>/x86_64/bin:$PATH
```

For more information on configuring TAU, see [http://tau.uoregon.edu](http://tau.uoregon.edu).
Next, run your program with TAU. For instance, to profile `ex56`,

```bash
cd $PETSC_DIR/src/snes/tutorials
make ex56
mpiexec -n 4 tau_exec -T mpi ./ex56 -log_perfstubs <args>
```

This should produce four `profile.*` files with profile data that can be
viewed with `paraprof`/`pprof`:

```shell
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           26        1,838           1       41322    1838424 .TAU application
 73.2            1        1,345           2         168     672950 SNESSolve
 62.2            3        1,142           2        1282     571442 SNESJacobianEval
 62.0        1,136        1,138           2          76     569494 DMPlexJacobianFE
 60.1        0.046        1,105           1          32    1105001 Solve 1
 15.2           87          279           5       11102      55943 Mesh Setup
 13.2        0.315          241           1          32     241765 Solve 0
  7.8           80          144       38785       38785          4 MPI_Allreduce()
  7.0           69          128           6       43386      21491 DualSpaceSetUp
  6.2            1          114           4          54      28536 PCSetUp
  6.0           12          110           2         892      55407 PCSetUp_GAMG+
  3.9           70           70           1           0      70888 MPI_Init_thread()
  3.7           68           68       41747           0          2 MPI Collective Sync
  3.6            8           66           4        3536      16548 SNESFunctionEval
  2.6           45           48         171         171        281 MPI_Bcast()
  1.9           34           34        7836           0          4 MPI_Barrier()
  1.8        0.567           33           2          68      16912  GAMG Coarsen
```

```{raw} html
<hr>
```

```{eval-rst}
.. bibliography:: /petsc.bib
   :filter: docname in docnames
```