(ch_profiling)=

# Profiling

PETSc includes a consistent, lightweight scheme for profiling
application programs. The PETSc routines automatically log
performance data if certain options are specified at runtime. The user
can also log information about application codes for a complete picture
of performance.

In addition, as described in
{any}`sec_PetscInfo`, PETSc provides a mechanism for
printing informative messages about computations.
{any}`sec_profbasic` introduces the various profiling
options in PETSc, while the remainder of the chapter focuses on details
such as monitoring application codes and tips for accurate profiling.

(sec_profbasic)=

## Basic Profiling Information

The profiling options include the following:

- `-log_view [:filename]` - Prints an ASCII version of performance data at the
  program’s conclusion. These statistics are comprehensive and concise
  and require little overhead; thus, `-log_view` is intended as the
  primary means of monitoring the performance of PETSc codes. See {any}`sec_ploginfo`.
- `-info [infofile]` - Prints verbose information about code to
  stdout or an optional file. This option provides details about
  algorithms, data structures, etc. Since the overhead of printing such
  output slows a code, this option should not be used when evaluating a
  program’s performance. See {any}`sec_PetscInfo`.
- `-log_trace [logfile]` - Traces the beginning and ending of all
  PETSc events. This option, which can be used in conjunction with
  `-info`, is useful for seeing where a program is hanging without
  running it in the debugger. See `PetscLogTraceBegin()`.

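For example, the performance summary can be sent to a file instead of standard output; here `./app` and `profile.txt` are placeholder names for an application and its output file:

```none
mpiexec -n 2 ./app -log_view :profile.txt
```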
As discussed in {any}`sec_mpelogs`, additional profiling
can be done with MPE.

(sec_ploginfo)=

### Interpreting `-log_view` Output: The Basics

As shown in {any}`listing <listing_exprof>` in {any}`sec_profiling_programs`, the
option `-log_view [:filename]` activates printing of profile data to standard
output or an ASCII file at the conclusion of a program. See `PetscLogView()` for all the possible
output options.

We print performance data for each routine, organized by PETSc
libraries, followed by any user-defined events (discussed in
{any}`sec_profileuser`). For each routine, the output data
include the maximum time and floating point operation (flop) rate over
all processes. Information about parallel performance is also included,
as discussed in the following section.

For the purpose of PETSc floating point operation counting, we define
one *flop* as one operation of any of the following types:
multiplication, division, addition, or subtraction. For example, one
`VecAXPY()` operation, which computes $y = \alpha x + y$ for
vectors of length $N$, requires $2N$ flop (consisting of
$N$ additions and $N$ multiplications). Bear in mind that
flop rates present only a limited view of performance, since memory
loads and stores are the real performance barrier.

For simplicity, the remainder of this discussion focuses on interpreting
profile data for the `KSP` library, which provides the linear solvers
at the heart of the PETSc package. Recall the hierarchical organization
of the PETSc library, as shown in
{any}`fig_library`. Each `KSP` solver is composed
of a `PC` (preconditioner) and a `KSP` (Krylov subspace) part, which
are in turn built on top of the `Mat` (matrix) and `Vec` (vector)
modules. Thus, operations in the `KSP` module are composed of
lower-level operations in these packages. Note also that the nonlinear
solvers library, `SNES`, is built on top of the `KSP` module, and
the timestepping library, `TS`, is in turn built on top of `SNES`.

We briefly discuss interpretation of the sample output in
{any}`listing <listing_exprof>`, which was generated by solving a
linear system on one process using restarted GMRES and ILU
preconditioning. The linear solvers in `KSP` consist of two basic
phases, `KSPSetUp()` and `KSPSolve()`, each of which consists of a
variety of actions, depending on the particular solution technique. For
the case of using the `PCILU` preconditioner and `KSPGMRES` Krylov
subspace method, the breakdown of PETSc routines is listed below. As
indicated by the levels of indentation, the operations in `KSPSetUp()`
include all of the operations within `PCSetUp()`, which in turn
include `MatILUFactor()`, and so on.

- `KSPSetUp` - Set up linear solver

  - `PCSetUp` - Set up preconditioner

    - `MatILUFactor` - Factor preconditioning matrix

      - `MatILUFactorSymbolic` - Symbolic factorization phase
      - `MatLUFactorNumeric` - Numeric factorization phase

- `KSPSolve` - Solve linear system

  - `PCApply` - Apply preconditioner

    - `MatSolve` - Forward/backward triangular solves

  - `KSPGMRESOrthog` - Orthogonalization in GMRES

    - `VecDot` or `VecMDot` - Inner products
    - `VecAXPY` or `VecMAXPY` - Vector updates

  - `MatMult` - Matrix-vector product

  - `MatMultAdd` - Matrix-vector product + vector addition

    - `VecScale`, `VecNorm`, `VecAXPY`, `VecCopy`, ...

The summaries printed via `-log_view` reflect this routine hierarchy.
For example, the performance summaries for a particular high-level
routine such as `KSPSolve()` include all of the operations accumulated
in the lower-level components that make up the routine.

The output produced with `-log_view` is flat, meaning that the hierarchy
of PETSc operations is not completely clear. For a
particular problem, the user should generally have an idea of the basic
operations that are required for its implementation (e.g., which
operations are performed when using GMRES and ILU, as described above),
so that interpreting the `-log_view` data should be relatively
straightforward.
If this is problematic, it is also possible to examine
the profiling information in a nested format; see
{any}`sec_nestedevents`.

(sec_parperformance)=

### Interpreting `-log_view` Output: Parallel Performance

We next discuss performance summaries for parallel programs, as shown
within {any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`,
which present the
output generated by the `-log_view` option. The program that generated
this data is
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>.
The code loads a matrix and right-hand-side vector from a binary file
and then solves the resulting linear system; the program then repeats
this process for a second linear system. This particular case was run on
four processors of an Intel x86_64 Linux cluster, using restarted GMRES
and the block Jacobi preconditioner, where each block was solved with
ILU. The two input files `medium` and `arco6` can be obtained from
[datafiles](https://gitlab.com/petsc/datafiles); see {any}`petsc_repositories`.

The first {any}`listing <listing_exparprof>` presents an overall
performance summary, including times, floating-point operations,
computational rates, and message-passing activity (such as the number
and size of messages sent and collective operations). Summaries for
various user-defined stages of monitoring (as discussed in
{any}`sec_profstages`) are also given. Information about the
various phases of computation then follows (as shown separately here in
the second {any}`listing <listing_exparprof2>`). Finally, a summary of
object creation and destruction is presented.

(listing_exparprof)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
            -matload_block_size 3 -pc_type bjacobi -options_left

Number of iterations = 19
Residual norm 1.088292e-05
Number of iterations = 59
Residual norm 3.871022e-02
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex10 on a intel-bdw-opt named beboplogin4 with 4 processors, by jczhang Mon Apr 23 13:36:54 2018
Using PETSc Development Git Revision: v3.9-163-gbe3efd42 Git Date: 2018-04-16 10:45:40 -0500

                         Max       Max/Min        Avg      Total
Time (sec):           1.849e-01      1.00002   1.849e-01
Objects:              1.060e+02      1.00000   1.060e+02
Flops:                2.361e+08      1.00684   2.353e+08  9.413e+08
Flops/sec:            1.277e+09      1.00685   1.273e+09  5.091e+09
MPI Msg Count:        2.360e+02      1.34857   2.061e+02  8.245e+02
MPI Msg Len (bytes):  1.256e+07      2.24620   4.071e+04  3.357e+07
MPI Reductions:       2.160e+02      1.00000

Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 5.9897e-04   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
 1:   Load System 0: 2.9113e-03   1.6%  0.0000e+00   0.0%  3.550e+01   4.3%  5.984e+02        0.1%  2.200e+01  10.2%
 2:      KSPSetUp 0: 7.7349e-04   0.4%  9.9360e+03   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
 3:      KSPSolve 0: 1.7690e-03   1.0%  2.9673e+05   0.0%  1.520e+02  18.4%  1.800e+02        0.1%  3.900e+01  18.1%
 4:   Load System 1: 1.0056e-01  54.4%  0.0000e+00   0.0%  3.700e+01   4.5%  5.657e+05       62.4%  2.200e+01  10.2%
 5:      KSPSetUp 1: 5.6883e-03   3.1%  2.1205e+07   2.3%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
 6:      KSPSolve 1: 7.2578e-02  39.3%  9.1979e+08  97.7%  6.000e+02  72.8%  2.098e+04       37.5%  1.200e+02  55.6%

------------------------------------------------------------------------------------------------------------------------

.... [Summary of various phases, see part II below] ...

------------------------------------------------------------------------------------------------------------------------

Object Type          Creations   Destructions    (Reports information only for process 0.)
...
--- Event Stage 3: KSPSolve 0

              Matrix     0              4
              Vector    20             30
           Index Set     0              3
         Vec Scatter     0              1
       Krylov Solver     0              2
      Preconditioner     0              2
```

We next focus on the summaries for the various phases of the
computation, as given in the table within
the following {any}`listing <listing_exparprof2>`. The summary for each
phase presents the maximum times and flop rates over all processes, as
well as the ratio of maximum to minimum times and flop rates over all
processes. A ratio of approximately 1 indicates that computations within
a given phase are well balanced among the processes; as the ratio
increases, the balance becomes increasingly poor. Also, the total
computational rate (in units of Mflop/sec) is given for each phase in
the final column of the phase summary table.

$$
\text{Total Mflop/sec} = 10^{-6} \cdot \frac{\text{sum of flop over all processes}}{\text{max time over all processes}}
$$

*Note*: Total computational rates below 1 Mflop/sec are listed as 0 in
this column of the phase summary table. Additional statistics for each
phase include the total number of messages sent, the average message
length, and the number of global reductions.

(listing_exparprof2)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
            -matload_block_size 3 -pc_type bjacobi -options_left

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
.... [Overall summary, see part I] ...

Phase summary info:
   Count: number of times phase was executed
   Time and Flop/sec: Max - maximum over all processors
                       Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: optional user-defined stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10^6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Phase              Count      Time (sec)       Flop/sec                          --- Global ---  --- Stage ----  Total
                            Max    Ratio      Max    Ratio  Mess AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
...

--- Event Stage 5: KSPSetUp 1

MatLUFactorNum         1 1.0 3.6440e-03 1.1 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0  62100  0  0  0  5819
MatILUFactorSym        1 1.0 1.7111e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  26  0  0  0  0     0
MatGetRowIJ            1 1.0 1.1921e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 3.0041e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
KSPSetUp               2 1.0 6.6495e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  1   9  0  0  0100     0
PCSetUp                2 1.0 5.4271e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3907
PCSetUpOnBlocks        1 1.0 5.3999e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3927

--- Event Stage 6: KSPSolve 1

MatMult               60 1.0 2.4068e-02 1.1 6.54e+07 1.0 6.0e+02 2.1e+04 0.0e+00 12 27 73 37  0  32 28100100  0 10731
MatSolve              61 1.0 1.9177e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 10 25  0  0  0  26 26  0  0  0 12491
VecMDot               59 1.0 1.4741e-02 1.3 4.86e+07 1.0 0.0e+00 0.0e+00 5.9e+01  7 21  0  0 27  18 21  0  0 49 13189
VecNorm               61 1.0 3.0417e-03 1.4 3.29e+06 1.0 0.0e+00 0.0e+00 6.1e+01  1  1  0  0 28   4  1  0  0 51  4332
VecScale              61 1.0 9.9802e-04 1.0 1.65e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  6602
VecCopy                2 1.0 5.9128e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                64 1.0 8.0323e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
VecAXPY                3 1.0 7.4387e-05 1.1 1.62e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  8712
VecMAXPY              61 1.0 8.8558e-03 1.1 5.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  5 22  0  0  0  12 23  0  0  0 23393
VecScatterBegin       60 1.0 9.6416e-04 1.8 0.00e+00 0.0 6.0e+02 2.1e+04 0.0e+00  0  0 73 37  0   1  0100100  0     0
VecScatterEnd         60 1.0 6.1543e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   8  0  0  0  0     0
VecNormalize          61 1.0 4.2675e-03 1.3 4.94e+06 1.0 0.0e+00 0.0e+00 6.1e+01  2  2  0  0 28   5  2  0  0 51  4632
KSPGMRESOrthog        59 1.0 2.2627e-02 1.1 9.72e+07 1.0 0.0e+00 0.0e+00 5.9e+01 11 41  0  0 27  29 42  0  0 49 17185
KSPSolve               1 1.0 7.2577e-02 1.0 2.31e+08 1.0 6.0e+02 2.1e+04 1.2e+02 39 98 73 37 56  99100100100100 12673
PCSetUpOnBlocks        1 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply               61 1.0 2.0427e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 25  0  0  0  28 26  0  0  0 11726
------------------------------------------------------------------------------------------------------------------------
.... [Conclusion of overall summary, see part I] ...
```

As discussed in the preceding section, the performance summaries for
higher-level PETSc routines include the statistics for the lower levels
of which they are made up. For example, the communication within
matrix-vector products `MatMult()` consists of vector scatter
operations, as given by the routines `VecScatterBegin()` and
`VecScatterEnd()`.

The final data presented are the percentages of the various statistics
(time (`%T`), flop (`%F`), messages (`%M`), average message
length (`%L`), and reductions (`%R`)) for each event relative to the
total computation and to any user-defined stages (discussed in
{any}`sec_profstages`). These statistics can aid in
optimizing performance, since they indicate the sections of code that
could benefit from various kinds of tuning.
{any}`ch_performance` gives suggestions about achieving good
performance with PETSc codes.

The additional option `-log_view_memory` causes the display of additional columns of information about how much
memory was allocated and freed during each logged event. This is useful
for understanding which phases of a computation require the most memory.

(sec_mpelogs)=

### Using `-log_mpe` with Jumpshot

It is also possible to use the *Jumpshot* package
{cite}`upshot` to visualize PETSc events. This package comes
with the MPE software, which is part of the MPICH
{cite}`mpich-web-page` implementation of MPI. The option

```none
-log_mpe [logfile]
```

creates a logfile of events appropriate for viewing with *Jumpshot*. The
user can either use the default logging file or specify a name via
`logfile`. Events can be deactivated as described in
{any}`sec_deactivate`.

The user can also log MPI events. To do this, simply consider the PETSc
application as any MPI application, and follow the MPI implementation’s
instructions for logging MPI calls. For example, when using MPICH, this
merely requires adding `-llmpich` to the library list *before*
`-lmpich`.

(sec_nestedevents)=

### Profiling Nested Events

It is possible to output the PETSc logging information in a nested format
where the hierarchy of events is explicit. This output can be generated
either as an XML file or as a text file in a format suitable for viewing as
a flame graph.

One can generate the XML output by passing the option `-log_view :[logfilename]:ascii_xml`.
It can be viewed by copying `${PETSC_DIR}/share/petsc/xml/performance_xml2html.xsl`
into the current directory, then opening the logfile in your browser.

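For example, with `./app` standing in for any PETSc application and `perf.xml` an arbitrary output name, the steps are:

```none
./app -log_view :perf.xml:ascii_xml
cp ${PETSC_DIR}/share/petsc/xml/performance_xml2html.xsl .
```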
The flame graph output can be generated with the option `-log_view :[logfile]:ascii_flamegraph`.
It can then be visualized with either [FlameGraph](https://github.com/brendangregg/FlameGraph)
or [speedscope](https://www.speedscope.app). A flame graph can be visualized directly from
stdout using, for example,
[ImageMagick's display utility](https://imagemagick.org/script/display.php):

```
cd $PETSC_DIR/src/sys/tests
make ex30
mpiexec -n 2 ./ex30 -log_view ::ascii_flamegraph | flamegraph | display
```

Note that user-defined stages (see {any}`sec_profstages`) will be ignored when
using this nested format.

(sec_profileuser)=

## Profiling Application Codes

PETSc automatically logs object creation, times, and floating-point
counts for the library routines. Users can easily supplement this
information by profiling their application codes as well. The basic
steps involved in logging a user-defined portion of code, called an
*event*, are shown in the code fragment below:

```
PetscLogEvent  USER_EVENT;
PetscClassId   classid;
PetscLogDouble user_event_flops;

PetscClassIdRegister("class name",&classid);
PetscLogEventRegister("User event name",classid,&USER_EVENT);
PetscLogEventBegin(USER_EVENT,0,0,0,0);
/* code segment to monitor */
PetscLogFlops(user_event_flops);
PetscLogEventEnd(USER_EVENT,0,0,0,0);
```

One must register the event by calling `PetscLogEventRegister()`,
which assigns a unique integer to identify the event for profiling
purposes:

```
PetscLogEventRegister(const char string[],PetscClassId classid,PetscLogEvent *e);
```

Here `string` is a user-defined event name and `classid` is the class
identifier with which the event is associated, such as one obtained from
`PetscClassIdRegister()`; see the manual page for
details. The argument returned in `e` should then be passed to the
`PetscLogEventBegin()` and `PetscLogEventEnd()` routines.

Events are logged by using the pair

```
PetscLogEventBegin(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
PetscLogEventEnd(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
```

The four objects are the PETSc objects that are most closely associated
with the event. For instance, in a matrix-vector product they would be
the matrix and the two vectors. These objects can be omitted by
specifying 0 for `o1` - `o4`. The code between these two routine
calls will be automatically timed and logged as part of the specified
event.

Events are collective by default on the communicator of `o1` (if present).
They can be made not collective by using `PetscLogEventSetCollective()`.
No synchronization is performed on collective events in optimized builds unless
the command line option `-log_sync` is used; however, we do check for collective
semantics in debug mode.

The user can log the number of floating-point operations for this
segment of code by calling

```
PetscLogFlops(number of flop for this code segment);
```

between the calls to `PetscLogEventBegin()` and
`PetscLogEventEnd()`. This value will automatically be added to the
global flop counter for the entire program.

(sec_profstages)=

435*7f296bb3SBarry Smith## Profiling Multiple Sections of Code
436*7f296bb3SBarry Smith
437*7f296bb3SBarry SmithBy default, the profiling produces a single set of statistics for all
438*7f296bb3SBarry Smithcode between the `PetscInitialize()` and `PetscFinalize()` calls
439*7f296bb3SBarry Smithwithin a program. One can independently monitor several "stages" of code
440*7f296bb3SBarry Smithby switching among the various stages with the commands
441*7f296bb3SBarry Smith
442*7f296bb3SBarry Smith```
443*7f296bb3SBarry SmithPetscLogStagePush(PetscLogStage stage);
444*7f296bb3SBarry SmithPetscLogStagePop();
445*7f296bb3SBarry Smith```

See the manual pages for details.
The command

```
PetscLogStageRegister(const char *name,PetscLogStage *stage)
```

allows one to associate a name with a stage; these names are printed
whenever summaries are generated with `-log_view`. The following code fragment uses three profiling
stages within a program.

```
PetscLogStage stagenum1, stagenum2;

PetscInitialize(&argc, &args, NULL, NULL);
/* stage 0 of code here */
PetscLogStageRegister("Stage 1 of Code", &stagenum1);
PetscLogStageRegister("Stage 2 of Code", &stagenum2);
for (i = 0; i < ntimes; i++) {
    PetscLogStagePush(stagenum1);
    /* stage 1 of code here */
    PetscLogStagePop();
    PetscLogStagePush(stagenum2);
    /* stage 2 of code here */
    PetscLogStagePop();
}
PetscFinalize();
```

The listings above
show output generated by
`-log_view` for a program that employs several profiling stages. In
particular, this program is subdivided into six stages: loading a matrix and right-hand-side vector from a binary file,
setting up the preconditioner, and solving the linear system; this
sequence is then repeated for a second linear system. For simplicity,
the second listing contains output only for
stages 5 and 6 (linear solve of the second system), which comprise the
part of this computation of most interest to us in terms of performance
monitoring. This code organization (solving a small linear system
followed by a larger system) enables generation of more accurate
profiling statistics for the second system by overcoming the often
considerable overhead of paging, as discussed in
{any}`sec_profaccuracy`.

(sec_deactivate)=

## Restricting Event Logging

By default, all PETSc operations are logged. To enable or disable the
PETSc logging of individual events, one uses the commands

```
PetscLogEventActivate(PetscLogEvent event);
PetscLogEventDeactivate(PetscLogEvent event);
```

The `event` may be either a predefined PETSc event (as listed in the
file `$PETSC_DIR/include/petsclog.h`) or one obtained with
`PetscLogEventRegister()` (as described in
{any}`sec_profileuser`).

PETSc also provides routines that deactivate (or activate) logging for
entire components of the library. Currently, the components that support
such logging (de)activation are `Mat` (matrices), `Vec` (vectors),
`KSP` (linear solvers, including `KSP` and `PC`), and `SNES`
(nonlinear solvers):

```
PetscLogEventDeactivateClass(MAT_CLASSID);
PetscLogEventDeactivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventDeactivateClass(VEC_CLASSID);
PetscLogEventDeactivateClass(SNES_CLASSID);
```

and

```
PetscLogEventActivateClass(MAT_CLASSID);
PetscLogEventActivateClass(KSP_CLASSID);   /* includes PC and KSP */
PetscLogEventActivateClass(VEC_CLASSID);
PetscLogEventActivateClass(SNES_CLASSID);
```

(sec_petscinfo)=

## Interpreting `-info` Output: Informative Messages

Users can activate the printing of verbose information about algorithms,
data structures, etc. to the screen by using the option `-info` or by
calling `PetscInfoAllow(PETSC_TRUE)`. Such logging, which is used
throughout the PETSc libraries, can aid the user in understanding
algorithms and tuning program performance. For example, as discussed in
{any}`sec_matsparse`, `-info` activates the printing of
information about memory allocation during matrix assembly.

One can selectively turn off informative messages about any of the basic
PETSc objects (e.g., `Mat`, `SNES`) with the command

```
PetscInfoDeactivateClass(PetscClassId classid)
```

where `classid` is one of `MAT_CLASSID`, `SNES_CLASSID`,
etc. Messages can be reactivated with the command

```
PetscInfoActivateClass(PetscClassId classid)
```

Such deactivation can be useful when one wishes to view information
about higher-level PETSc libraries (e.g., `TS` and `SNES`) without
seeing all lower-level data as well (e.g., `Mat`).

One can turn logging on or off for particular classes at runtime with

```console
-info [filename][:[~]<list,of,classnames>[:[~]self]]
```

The `list,of,classnames` is a comma-separated list (no spaces) of the classes for which information should be displayed, for
example `vec,ksp`. Information on all other classes will not be displayed. A preceding `~` inverts the selection: the listed classes are suppressed and all other classes are displayed.

`self` indicates to display information only on objects that are associated with `PETSC_COMM_SELF`, while `~self` indicates to display information only for parallel objects.

See `PetscInfo()` for links to all the info operations that are available.

Application programmers can log their own messages, as well, by using the
routine

```
PetscInfo(PetscObject obj,const char message[],...)
```

where `obj` is the PETSc object associated most closely with the
logging statement, `message`. For example, in the line search Newton
methods, we use a statement such as

```
PetscInfo(snes,"Cubic step, lambda %g\n",lambda);
```

## Time

PETSc application programmers can access the wall clock time directly
with the command

```
PetscLogDouble time;
PetscCall(PetscTime(&time));
```

which returns the current time in seconds since the epoch, and is
commonly implemented with `MPI_Wtime`. A floating point number is
returned in order to express fractions of a second. In addition, as
discussed in {any}`sec_profileuser`, PETSc can automatically
profile user-defined segments of code.

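For example, a specific segment of code can be timed by reading the clock before and after it; the following is a hedged sketch (the message text is illustrative):

```
PetscLogDouble t0, t1;

PetscCall(PetscTime(&t0));
/* ... code segment to be timed ... */
PetscCall(PetscTime(&t1));
PetscCall(PetscPrintf(PETSC_COMM_WORLD, "Elapsed time: %g seconds\n", (double)(t1 - t0)));
```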
## Saving Output to a File

All output from PETSc programs (including informative messages,
profiling information, and convergence data) can be saved to a file by
using the command line option `-history [filename]`. If no file name
is specified, the output is stored in the file
`${HOME}/.petschistory`. Note that this option only saves output
printed with the `PetscPrintf()` and `PetscFPrintf()` commands, not
the standard `printf()` and `fprintf()` statements.

(sec_profaccuracy)=

## Accurate Profiling and Paging Overheads

One factor that often plays a significant role in profiling a code is
paging by the operating system. Generally, when running a program, only
a few pages required to start it are loaded into memory rather than the
entire executable. When the execution proceeds to code segments that are
not in memory, a page fault occurs, prompting the required pages to be
loaded from disk (a very slow process). This activity distorts the
results significantly. (The paging effects are noticeable in the log
files generated by `-log_mpe`, which is described in
{any}`sec_mpelogs`.)

To eliminate the effects of paging when profiling the performance of a
program, we have found that an effective procedure is to run the *exact same
code* on a small dummy problem before running it on the actual problem
of interest. We thus ensure that all code required by a solver is loaded
into memory during solution of the small problem. When the code proceeds
to the actual (larger) problem of interest, all required pages have
already been loaded into main memory, so that the performance numbers
are not distorted.

When this procedure is used in conjunction with the user-defined stages
of profiling described in {any}`sec_profstages`, we can
focus easily on the problem of interest. For example, we used this
technique in the program
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>
to generate the timings within
{any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`.
In this case, the profiled code
of interest (solving the linear system for the larger problem) occurs
within event stages 5 and 6. {any}`sec_parperformance`
provides details about interpreting such profiling data.

In particular, the macros

```
PetscPreLoadBegin(PetscBool flag,char* stagename)
PetscPreLoadStage(char *stagename)
```

and

```
PetscPreLoadEnd()
```

can be used to easily convert a regular PETSc program to one that uses
preloading. The command line options `-preload true` and
`-preload false` may be used to turn preloading on and off at run
time for PETSc programs that use these macros.

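In outline, a program converted to use preloading follows this pattern; the stage names here are illustrative (see KSP Tutorial ex10 for a complete example):

```
PetscPreLoadBegin(PETSC_TRUE, "Load system");
  /* create or load the matrix and right-hand side; when preloading is
     enabled, everything between PetscPreLoadBegin() and PetscPreLoadEnd()
     executes twice, and the first pass serves as the warm-up run */
PetscPreLoadStage("Solve");
  /* solve the linear system; the second pass yields clean timings */
PetscPreLoadEnd();
```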
## NVIDIA Nsight Systems profiling

Nsight Systems can generate profiling data for a CUDA executable
via the command `nsys`.
For example, in serial,

```bash
nsys profile -t nvtx,cuda -o file --stats=true --force-overwrite true ./a.out
```

will generate a file `file.qdstrm` with performance data that is
annotated with PETSc events (methods) and Kokkos device kernel names.
The Nsight Systems GUI, `nsys-ui`, can be used to navigate this file
(<https://developer.nvidia.com/nsight-systems>). The Nsight Systems GUI
lets you see a timeline of code performance information like kernels,
memory mallocs and frees, CPU-GPU communication, and high-level data like time, sizes
of memory copies, and more, in a popup window when the mouse
hovers over the section.
To view the data, start `nsys-ui` without any arguments and then `Import` the
`.qdstrm` file in the GUI.
A side effect of this viewing process is the generation of a file `file.nsys-rep`, which can be viewed directly
with `nsys-ui` in the future.

For an MPI parallel job, only one process should call `nsys`:
for example, have rank zero produce the `nsys` data and have all other
ranks run the executable directly. With MPICH
or Open MPI, a parallel job on 4 MPI processes can be run as:

```console
mpiexec -n 1 nsys profile -t nvtx,cuda -o file_name --stats=true --force-overwrite true ./a.out : -n 3 ./a.out
```


Note: The Nsight GUI can open profiling reports from elsewhere. For
example, a report from a compute node can be analyzed on your local
machine, but care should be taken to use the exact same version of
Nsight Systems that generated the report.
To check the version of Nsight on the compute node, run `nsys-ui` and
note the version number at the top of the window.

(sec_using_tau)=

## Using TAU

TAU profiles can be generated without the need for instrumentation through the
use of the perfstubs package. PETSc is configured with `--with-tau-perfstubs` by default.
To generate profiles with TAU, first set up TAU:

```bash
wget http://tau.uoregon.edu/tau.tgz
tar xf tau.tgz && cd tau-*   # unpack the TAU source
./configure -cc=mpicc -c++=mpicxx -mpi -bfd=download -unwind=download && make install
export PATH=<tau dir>/x86_64/bin:$PATH
```

For more information on configuring TAU, see [http://tau.uoregon.edu](http://tau.uoregon.edu).
Next, run your program with TAU. For instance, to profile `ex56`,

```bash
cd $PETSC_DIR/src/snes/tutorials
make ex56
mpirun -n 4 tau_exec -T mpi ./ex56 -log_perfstubs <args>
```

This should produce four `profile.*` files with profile data that can be
viewed with `paraprof` or `pprof`:

```shell
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           26        1,838           1       41322    1838424 .TAU application
 73.2            1        1,345           2         168     672950 SNESSolve
 62.2            3        1,142           2        1282     571442 SNESJacobianEval
 62.0        1,136        1,138           2          76     569494 DMPlexJacobianFE
 60.1        0.046        1,105           1          32    1105001 Solve 1
 15.2           87          279           5       11102      55943 Mesh Setup
 13.2        0.315          241           1          32     241765 Solve 0
  7.8           80          144       38785       38785          4 MPI_Allreduce()
  7.0           69          128           6       43386      21491 DualSpaceSetUp
  6.2            1          114           4          54      28536 PCSetUp
  6.0           12          110           2         892      55407 PCSetUp_GAMG+
  3.9           70           70           1           0      70888 MPI_Init_thread()
  3.7           68           68       41747           0          2 MPI Collective Sync
  3.6            8           66           4        3536      16548 SNESFunctionEval
  2.6           45           48         171         171        281 MPI_Bcast()
  1.9           34           34        7836           0          4 MPI_Barrier()
  1.8        0.567           33           2          68      16912  GAMG Coarsen
```

```{raw} html
<hr>
```

```{eval-rst}
.. bibliography:: /petsc.bib
   :filter: docname in docnames
```