xref: /petsc/doc/manual/profiling.md (revision 2ff79c18c26c94ed8cb599682f680f231dca6444)
(ch_profiling)=

# Profiling

PETSc includes a consistent, lightweight scheme for profiling
application programs. The PETSc routines automatically log
performance data if certain options are specified at runtime. The user
can also log information about application codes for a complete picture
of performance.

In addition, as described in
{any}`sec_petscinfo`, PETSc provides a mechanism for
printing informative messages about computations.
{any}`sec_profbasic` introduces the various profiling
options in PETSc, while the remainder of the chapter focuses on details
such as monitoring application codes and tips for accurate profiling.

(sec_profbasic)=

## Basic Profiling Information

The profiling options include the following:

- `-log_view [:filename]` - Prints an ASCII version of performance data at the
  program’s conclusion. These statistics are comprehensive and concise
  and require little overhead; thus, `-log_view` is intended as the
  primary means of monitoring the performance of PETSc codes. See {any}`sec_ploginfo`.
- `-info [infofile]` - Prints verbose information about code to
  stdout or an optional file. This option provides details about
  algorithms, data structures, etc. Since the overhead of printing such
  output slows a code, this option should not be used when evaluating a
  program’s performance. See {any}`sec_petscinfo`.
- `-log_trace [logfile]` - Traces the beginning and ending of all
  PETSc events. This option, which can be used in conjunction with
  `-info`, is useful to see where a program is hanging without
  running in the debugger. See `PetscLogTraceBegin()`.

As discussed in {any}`sec_mpelogs`, additional profiling
can be done with MPE.

(sec_ploginfo)=

### Interpreting `-log_view` Output: The Basics

As shown in {any}`listing <listing_exprof>` in {any}`sec_profiling_programs`, the
option `-log_view [:filename]` activates printing of profile data to standard
output or an ASCII file at the conclusion of a program. See `PetscLogView()` for all the possible
output options.

We print performance data for each routine, organized by PETSc
libraries, followed by any user-defined events (discussed in
{any}`sec_profileuser`). For each routine, the output data
include the maximum time and floating point operation (flop) rate over
all processes. Information about parallel performance is also included,
as discussed in the following section.

For the purpose of PETSc floating point operation counting, we define
one *flop* as one operation of any of the following types:
multiplication, division, addition, or subtraction. For example, one
`VecAXPY()` operation, which computes $y = \alpha x + y$ for
vectors of length $N$, requires $2N$ flop (consisting of
$N$ additions and $N$ multiplications). Bear in mind that
flop rates present only a limited view of performance, since memory
loads and stores are the real performance barrier.

For simplicity, the remainder of this discussion focuses on interpreting
profile data for the `KSP` library, which provides the linear solvers
at the heart of the PETSc package. Recall the hierarchical organization
of the PETSc library, as shown in
{any}`fig_library`. Each `KSP` solver is composed
of a `PC` (preconditioner) and a `KSP` (Krylov subspace) part, which
are in turn built on top of the `Mat` (matrix) and `Vec` (vector)
modules. Thus, operations in the `KSP` module are composed of
lower-level operations in these packages. Note also that the nonlinear
solvers library, `SNES`, is built on top of the `KSP` module, and
the timestepping library, `TS`, is in turn built on top of `SNES`.

We briefly discuss interpretation of the sample output in
{any}`listing <listing_exprof>`, which was generated by solving a
linear system on one process using restarted GMRES and ILU
preconditioning. The linear solvers in `KSP` consist of two basic
phases, `KSPSetUp()` and `KSPSolve()`, each of which consists of a
variety of actions, depending on the particular solution technique. For
the case of using the `PCILU` preconditioner and `KSPGMRES` Krylov
subspace method, the breakdown of PETSc routines is listed below. As
indicated by the levels of indentation, the operations in `KSPSetUp()`
include all of the operations within `PCSetUp()`, which in turn
include `MatILUFactor()`, and so on.

- `KSPSetUp` - Set up linear solver

  - `PCSetUp` - Set up preconditioner

    - `MatILUFactor` - Factor matrix

      - `MatILUFactorSymbolic` - Symbolic factorization phase
      - `MatLUFactorNumeric` - Numeric factorization phase

- `KSPSolve` - Solve linear system

  - `PCApply` - Apply preconditioner

    - `MatSolve` - Forward/backward triangular solves

  - `KSPGMRESOrthog` - Orthogonalization in GMRES

    - `VecDot` or `VecMDot` - Inner products
    - `VecAXPY` or `VecMAXPY` - Vector updates

  - `MatMult` - Matrix-vector product

  - `MatMultAdd` - Matrix-vector product + vector addition

    - `VecScale`, `VecNorm`, `VecAXPY`, `VecCopy`, ...

The summaries printed via `-log_view` reflect this routine hierarchy.
For example, the performance summaries for a particular high-level
routine such as `KSPSolve()` include all of the operations accumulated
in the lower-level components that make up the routine.

The output produced with `-log_view` is flat, meaning that the hierarchy
of PETSc operations is not completely clear. For a
particular problem, the user should generally have an idea of the basic
operations that are required for its implementation (e.g., which
operations are performed when using GMRES and ILU, as described above),
so that interpreting the `-log_view` data should be relatively
straightforward.
If this is problematic then it is also possible to examine
the profiling information in a nested format. For more information see
{any}`sec_nestedevents`.

(sec_parperformance)=

### Interpreting `-log_view` Output: Parallel Performance

We next discuss performance summaries for parallel programs, as shown
within {any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`,
which present the
output generated by the `-log_view` option. The program that generated
this data is
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>.
The code loads a matrix and right-hand-side vector from a binary file
and then solves the resulting linear system; the program then repeats
this process for a second linear system. This particular case was run on
four processors of an Intel x86_64 Linux cluster, using restarted GMRES
and the block Jacobi preconditioner, where each block was solved with
ILU. The two input files `medium` and `arco6` can be obtained from
[datafiles](https://gitlab.com/petsc/datafiles); see {any}`petsc_repositories`.

The first {any}`listing <listing_exparprof>` presents an overall
performance summary, including times, floating-point operations,
computational rates, and message-passing activity (such as the number
and size of messages sent and collective operations). Summaries for
various user-defined stages of monitoring (as discussed in
{any}`sec_profstages`) are also given. Information about the
various phases of computation then follows (as shown separately here in
the second {any}`listing <listing_exparprof2>`). Finally, a summary of
object creation and destruction is presented.

(listing_exparprof)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
            -matload_block_size 3 -pc_type bjacobi -options_left

Number of iterations = 19
Residual norm 1.088292e-05
Number of iterations = 59
Residual norm 3.871022e-02
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex10 on a intel-bdw-opt named beboplogin4 with 4 processors, by jczhang Mon Apr 23 13:36:54 2018
Using PETSc Development Git Revision: v3.9-163-gbe3efd42 Git Date: 2018-04-16 10:45:40 -0500

                         Max       Max/Min        Avg      Total
Time (sec):           1.849e-01      1.00002   1.849e-01
Objects:              1.060e+02      1.00000   1.060e+02
Flops:                2.361e+08      1.00684   2.353e+08  9.413e+08
Flops/sec:            1.277e+09      1.00685   1.273e+09  5.091e+09
MPI Msg Count:        2.360e+02      1.34857   2.061e+02  8.245e+02
MPI Msg Len (bytes):  1.256e+07      2.24620   4.071e+04  3.357e+07
MPI Reductions:       2.160e+02      1.00000

Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 5.9897e-04   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
 1:   Load System 0: 2.9113e-03   1.6%  0.0000e+00   0.0%  3.550e+01   4.3%  5.984e+02        0.1%  2.200e+01  10.2%
 2:      KSPSetUp 0: 7.7349e-04   0.4%  9.9360e+03   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
 3:      KSPSolve 0: 1.7690e-03   1.0%  2.9673e+05   0.0%  1.520e+02  18.4%  1.800e+02        0.1%  3.900e+01  18.1%
 4:   Load System 1: 1.0056e-01  54.4%  0.0000e+00   0.0%  3.700e+01   4.5%  5.657e+05       62.4%  2.200e+01  10.2%
 5:      KSPSetUp 1: 5.6883e-03   3.1%  2.1205e+07   2.3%  0.000e+00   0.0%  0.000e+00        0.0%  2.000e+00   0.9%
 6:      KSPSolve 1: 7.2578e-02  39.3%  9.1979e+08  97.7%  6.000e+02  72.8%  2.098e+04       37.5%  1.200e+02  55.6%

------------------------------------------------------------------------------------------------------------------------

.... [Summary of various phases, see part II below] ...

------------------------------------------------------------------------------------------------------------------------

Object Type          Creations   Destructions    (Reports information only for process 0.)
...
--- Event Stage 3: KSPSolve 0

              Matrix     0              4
              Vector    20             30
           Index Set     0              3
         Vec Scatter     0              1
       Krylov Solver     0              2
      Preconditioner     0              2
```

We next focus on the summaries for the various phases of the
computation, as given in the table within
the following {any}`listing <listing_exparprof2>`. The summary for each
phase presents the maximum times and flop rates over all processes, as
well as the ratio of maximum to minimum times and flop rates for all
processes. A ratio of approximately 1 indicates that computations within
a given phase are well balanced among the processes; as the ratio
increases, the balance becomes increasingly poor. Also, the total
computational rate (in units of MFlop/sec) is given for each phase in
the final column of the phase summary table.

$$
{\rm Total\: Mflop/sec} \:=\: 10^{-6} * ({\rm sum\; of\; flop\; over\; all\; processors})/({\rm max\; time\; over\; all\; processors})
$$

*Note*: Total computational rates less than 1 MFlop/sec are listed as 0 in
this column of the phase summary table. Additional statistics for each
phase include the total number of messages sent, the average message
length, and the number of global reductions.

(listing_exparprof2)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
            -matload_block_size 3 -pc_type bjacobi -options_left

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
.... [Overall summary, see part I] ...

Phase summary info:
   Count: number of times phase was executed
   Time and Flop/sec: Max - maximum over all processors
                       Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: optional user-defined stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10^6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Phase              Count      Time (sec)       Flop/sec                          --- Global ---  --- Stage ----  Total
                            Max    Ratio      Max    Ratio  Mess AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
...

--- Event Stage 5: KSPSetUp 1

MatLUFactorNum         1 1.0 3.6440e-03 1.1 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0  62100  0  0  0  5819
MatILUFactorSym        1 1.0 1.7111e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  26  0  0  0  0     0
MatGetRowIJ            1 1.0 1.1921e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 3.0041e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
KSPSetUp               2 1.0 6.6495e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  1   9  0  0  0100     0
PCSetUp                2 1.0 5.4271e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3907
PCSetUpOnBlocks        1 1.0 5.3999e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3927

--- Event Stage 6: KSPSolve 1

MatMult               60 1.0 2.4068e-02 1.1 6.54e+07 1.0 6.0e+02 2.1e+04 0.0e+00 12 27 73 37  0  32 28100100  0 10731
MatSolve              61 1.0 1.9177e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 10 25  0  0  0  26 26  0  0  0 12491
VecMDot               59 1.0 1.4741e-02 1.3 4.86e+07 1.0 0.0e+00 0.0e+00 5.9e+01  7 21  0  0 27  18 21  0  0 49 13189
VecNorm               61 1.0 3.0417e-03 1.4 3.29e+06 1.0 0.0e+00 0.0e+00 6.1e+01  1  1  0  0 28   4  1  0  0 51  4332
VecScale              61 1.0 9.9802e-04 1.0 1.65e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  6602
VecCopy                2 1.0 5.9128e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                64 1.0 8.0323e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
VecAXPY                3 1.0 7.4387e-05 1.1 1.62e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  8712
VecMAXPY              61 1.0 8.8558e-03 1.1 5.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  5 22  0  0  0  12 23  0  0  0 23393
VecScatterBegin       60 1.0 9.6416e-04 1.8 0.00e+00 0.0 6.0e+02 2.1e+04 0.0e+00  0  0 73 37  0   1  0100100  0     0
VecScatterEnd         60 1.0 6.1543e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   8  0  0  0  0     0
VecNormalize          61 1.0 4.2675e-03 1.3 4.94e+06 1.0 0.0e+00 0.0e+00 6.1e+01  2  2  0  0 28   5  2  0  0 51  4632
KSPGMRESOrthog        59 1.0 2.2627e-02 1.1 9.72e+07 1.0 0.0e+00 0.0e+00 5.9e+01 11 41  0  0 27  29 42  0  0 49 17185
KSPSolve               1 1.0 7.2577e-02 1.0 2.31e+08 1.0 6.0e+02 2.1e+04 1.2e+02 39 98 73 37 56  99100100100100 12673
PCSetUpOnBlocks        1 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply               61 1.0 2.0427e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 25  0  0  0  28 26  0  0  0 11726
------------------------------------------------------------------------------------------------------------------------
.... [Conclusion of overall summary, see part I] ...
```

As discussed in the preceding section, the performance summaries for
higher-level PETSc routines include the statistics for the lower levels
of which they are made up. For example, the communication within
matrix-vector products `MatMult()` consists of vector scatter
operations, as given by the routines `VecScatterBegin()` and
`VecScatterEnd()`.

The final data presented are the percentages of the various statistics
(time (`%T`), flop (`%F`), messages (`%M`), average message
length (`%L`), and reductions (`%R`)) for each event relative to the
total computation and to any user-defined stages (discussed in
{any}`sec_profstages`). These statistics can aid in
optimizing performance, since they indicate the sections of code that
could benefit from various kinds of tuning.
{any}`ch_performance` gives suggestions about achieving good
performance with PETSc codes.

The additional option `-log_view_memory` causes the display of additional columns of information about how much
memory was allocated and freed during each logged event. This is useful
for understanding which phases of a computation require the most memory.

(sec_mpelogs)=

### Using `-log_mpe` with Jumpshot

It is also possible to use the *Jumpshot* package
{cite}`upshot` to visualize PETSc events. This package comes
with the MPE software, which is part of the MPICH
{cite}`mpich-web-page` implementation of MPI. The option

```none
-log_mpe [logfile]
```

creates a logfile of events appropriate for viewing with *Jumpshot*. The
user can either use the default logging file or specify a name via
`logfile`. Events can be deactivated as described in
{any}`sec_deactivate`.

The user can also log MPI events. To do this, simply consider the PETSc
application as any MPI application, and follow the MPI implementation’s
instructions for logging MPI calls. For example, when using MPICH, this
merely requires adding `-llmpich` to the library list *before*
`-lmpich`.

(sec_nestedevents)=

### Profiling Nested Events

It is possible to output the PETSc logging information in a nested format
where the hierarchy of events is explicit. This output can be generated
either as an XML file or as a text file in a format suitable for viewing as
a flame graph.

One can generate the XML output by passing the option `-log_view :[logfilename]:ascii_xml`.
It can be viewed by copying `${PETSC_DIR}/share/petsc/xml/performance_xml2html.xsl`
into the current directory, then opening the logfile in your browser.

The flame graph output can be generated with the option `-log_view :[logfile]:ascii_flamegraph`.
It can then be visualized with either [FlameGraph](https://github.com/brendangregg/FlameGraph)
or [speedscope](https://www.speedscope.app). A flame graph can be visualized directly from
stdout using, for example,
[ImageMagick's display utility](https://imagemagick.org/script/display.php):

```console
cd $PETSC_DIR/src/sys/tests
make ex30
mpiexec -n 2 ./ex30 -log_view ::ascii_flamegraph | flamegraph | display
```

Note that user-defined stages (see {any}`sec_profstages`) will be ignored when
using this nested format.

(sec_profileuser)=

## Profiling Application Codes

PETSc automatically logs object creation, times, and floating-point
counts for the library routines. Users can easily supplement this
information by profiling their application codes as well. The basic
steps involved in logging a user-defined portion of code, called an
*event*, are shown in the code fragment below:

```
PetscLogEvent  USER_EVENT;
PetscClassId   classid;
PetscLogDouble user_event_flops;

PetscClassIdRegister("class name",&classid);
PetscLogEventRegister("User event name",classid,&USER_EVENT);
PetscLogEventBegin(USER_EVENT,0,0,0,0);
/* code segment to monitor */
PetscLogFlops(user_event_flops);
PetscLogEventEnd(USER_EVENT,0,0,0,0);
```

One must register the event by calling `PetscLogEventRegister()`,
which assigns a unique integer to identify the event for profiling
purposes:

```
PetscLogEventRegister(const char string[],PetscClassId classid,PetscLogEvent *e);
```

Here `string` is a user-defined event name, and `classid` is the class
with which the event is associated (a new class can be registered with
`PetscClassIdRegister()`); one should see the manual page for
details. The argument returned in `e` should then be passed to the
`PetscLogEventBegin()` and `PetscLogEventEnd()` routines.

Events are logged by using the pair

```
PetscLogEventBegin(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
PetscLogEventEnd(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
```

The four objects are the PETSc objects that are most closely associated
with the event. For instance, in a matrix-vector product they would be
the matrix and the two vectors. These objects can be omitted by
specifying 0 for `o1` - `o4`. The code between these two routine
calls will be automatically timed and logged as part of the specified
event.

Events are collective by default on the communicator of `o1` (if present).
They can be made not collective by using `PetscLogEventSetCollective()`.
No synchronization is performed on collective events in optimized builds unless
the command line option `-log_sync` is used; however, we do check for collective
semantics in debug mode.

The user can log the number of floating-point operations for this
segment of code by calling

```
PetscLogFlops(number of flop for this code segment);
```

between the calls to `PetscLogEventBegin()` and
`PetscLogEventEnd()`. This value will automatically be added to the
global flop counter for the entire program.

(sec_profstages)=

## Profiling Multiple Sections of Code

By default, the profiling produces a single set of statistics for all
code between the `PetscInitialize()` and `PetscFinalize()` calls
within a program. One can independently monitor several "stages" of code
by switching among the various stages with the commands

```
PetscLogStagePush(PetscLogStage stage);
PetscLogStagePop();
```

see the manual pages for details.
The command

```
PetscLogStageRegister(const char *name,PetscLogStage *stage)
```

allows one to associate a name with a stage; these names are printed
whenever summaries are generated with `-log_view`. The following code fragment uses three profiling
stages within a program.

```
PetscLogStage stagenum0, stagenum1, stagenum2;

PetscInitialize(&argc,&args,NULL,NULL);
/* register each stage once, before use */
PetscLogStageRegister("Stage 0 of Code", &stagenum0);
PetscLogStageRegister("Stage 1 of Code", &stagenum1);
PetscLogStageRegister("Stage 2 of Code", &stagenum2);
PetscLogStagePush(stagenum0);
/* stage 0 of code here */
PetscLogStagePop();
for (i=0; i<ntimes; i++) {
    PetscLogStagePush(stagenum1);
    /* stage 1 of code here */
    PetscLogStagePop();
    PetscLogStagePush(stagenum2);
    /* stage 2 of code here */
    PetscLogStagePop();
}
PetscFinalize();
```

The listings above
show output generated by
`-log_view` for a program that employs several profiling stages. In
particular, this program is subdivided into six stages: loading a matrix and right-hand-side vector from a binary file,
setting up the linear solver, and solving the linear system; this
sequence is then repeated for a second linear system. For simplicity,
the second listing contains output only for
stages 5 and 6 (setup and solve of the second linear system), which comprise the
part of this computation of most interest to us in terms of performance
monitoring. This code organization (solving a small linear system
followed by a larger system) enables generation of more accurate
profiling statistics for the second system by overcoming the often
considerable overhead of paging, as discussed in
{any}`sec_profaccuracy`.

(sec_deactivate)=

## Restricting Event Logging

By default, all PETSc operations are logged. To enable or disable the
PETSc logging of individual events, one uses the commands

```
PetscLogEventActivate(PetscLogEvent event);
PetscLogEventDeactivate(PetscLogEvent event);
```

The `event` may be either a predefined PETSc event (as listed in the
file `$PETSC_DIR/include/petsclog.h`) or one obtained with
`PetscLogEventRegister()` (as described in
{any}`sec_profileuser`).

PETSc also provides routines that deactivate (or activate) logging for
entire components of the library. Currently, the components that support
such logging (de)activation are `Mat` (matrices), `Vec` (vectors),
`KSP` (linear solvers, including `KSP` and `PC`), and `SNES`
(nonlinear solvers):

```
PetscLogEventDeactivateClass(MAT_CLASSID);
PetscLogEventDeactivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventDeactivateClass(VEC_CLASSID);
PetscLogEventDeactivateClass(SNES_CLASSID);
```

and

```
PetscLogEventActivateClass(MAT_CLASSID);
PetscLogEventActivateClass(KSP_CLASSID);   /* includes PC and KSP */
PetscLogEventActivateClass(VEC_CLASSID);
PetscLogEventActivateClass(SNES_CLASSID);
```

(sec_petscinfo)=

## Interpreting `-info` Output: Informative Messages

Users can activate the printing of verbose information about algorithms,
data structures, etc. to the screen by using the option `-info` or by
calling `PetscInfoAllow(PETSC_TRUE)`. Such logging, which is used
throughout the PETSc libraries, can aid the user in understanding
algorithms and tuning program performance. For example, as discussed in
{any}`sec_matsparse`, `-info` activates the printing of
information about memory allocation during matrix assembly.

One can selectively turn off informative messages about any of the basic
PETSc objects (e.g., `Mat`, `SNES`) with the command

```
PetscInfoDeactivateClass(int object_classid)
```

where `object_classid` is one of `MAT_CLASSID`, `SNES_CLASSID`,
etc. Messages can be reactivated with the command

```
PetscInfoActivateClass(int object_classid)
```

Such deactivation can be useful when one wishes to view information
about higher-level PETSc libraries (e.g., `TS` and `SNES`) without
seeing all lower-level data as well (e.g., `Mat`).

One can turn on or off logging for particular classes at runtime

```console
-info [filename][:[~]<list,of,classnames>[:[~]self]]
```

The `list,of,classnames` is a list, separated by commas with no spaces, of the classes one wishes to view information on, for
example `vec,ksp`. Information on all other classes will not be displayed. The `~` indicates to not display the listed classes but rather to display all other classes.

`self` indicates to display information on objects that are associated with `PETSC_COMM_SELF`, while `~self` indicates to display information only for parallel objects.

See `PetscInfo()` for links to all the info operations that are available.

Application programmers can log their own messages, as well, by using the
routine

```
PetscInfo(PetscObject obj,const char message[],...)
```

where `obj` is the PETSc object associated most closely with the
logging statement, `message`. For example, in the line search Newton
methods, we use a statement such as

```
PetscInfo(snes,"Cubic step, lambda %g\n",lambda);
```

## Time

PETSc application programmers can access the wall clock time directly
with the command

```
PetscLogDouble time;
PetscCall(PetscTime(&time));
```

which returns the current time in seconds since the epoch, and is
commonly implemented with `MPI_Wtime`. A floating point number is
returned in order to express fractions of a second. In addition, as
discussed in {any}`sec_profileuser`, PETSc can automatically
profile user-defined segments of code.

## Saving Output to a File

All output from PETSc programs (including informative messages,
profiling information, and convergence data) can be saved to a file by
using the command line option `-history [filename]`. If no file name
is specified, the output is stored in the file
`${HOME}/.petschistory`. Note that this option only saves output
printed with the `PetscPrintf()` and `PetscFPrintf()` commands, not
the standard `printf()` and `fprintf()` statements.

(sec_profaccuracy)=

## Accurate Profiling and Paging Overheads

One factor that often plays a significant role in profiling a code is
paging by the operating system. Generally, when running a program, only
a few pages required to start it are loaded into memory rather than the
entire executable. When the execution proceeds to code segments that are
not in memory, a page fault occurs, prompting the required pages to be
loaded from the disk (a very slow process). This activity distorts the
results significantly. (The paging effects are noticeable in the log
files generated by `-log_mpe`, which is described in
{any}`sec_mpelogs`.)

To eliminate the effects of paging when profiling the performance of a
program, we have found an effective procedure is to run the *exact same
code* on a small dummy problem before running it on the actual problem
of interest. We thus ensure that all code required by a solver is loaded
into memory during solution of the small problem. When the code proceeds
to the actual (larger) problem of interest, all required pages have
already been loaded into main memory, so that the performance numbers
are not distorted.

When this procedure is used in conjunction with the user-defined stages
of profiling described in {any}`sec_profstages`, we can
focus easily on the problem of interest. For example, we used this
technique in the program
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>
to generate the timings within
{any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`.
In this case, the profiled code
of interest (solving the linear system for the larger problem) occurs
within event stages 5 and 6. {any}`sec_parperformance`
provides details about interpreting such profiling data.

In particular, the macros

```
PetscPreLoadBegin(PetscBool flag,char* stagename)
PetscPreLoadStage(char *stagename)
```

and

```
PetscPreLoadEnd()
```

can be used to easily convert a regular PETSc program to one that uses
preloading. The command line options `-preload true` and
`-preload false` may be used to turn on and off preloading at run
time for PETSc programs that use these macros.
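
For example, the skeleton of a preloading run might look like the following sketch (the stage names and elided bodies here are illustrative, not taken from an actual example):

```
PetscPreLoadBegin(PETSC_TRUE,"Load system");
  /* ... load the matrix and right-hand side ... */
PetscPreLoadStage("KSPSetUp");
  /* ... KSPSetUp() and related setup ... */
PetscPreLoadStage("KSPSolve");
  /* ... KSPSolve() ... */
PetscPreLoadEnd();
```

When preloading is enabled, the code between `PetscPreLoadBegin()` and `PetscPreLoadEnd()` is executed twice, so that the second pass runs with all required pages already in memory.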

## NVIDIA Nsight Systems profiling

Nsight Systems generates profiling data for a CUDA executable
via the command `nsys`.
For example, in serial,

```bash
nsys profile -t nvtx,cuda -o file --stats=true --force-overwrite true ./a.out
```

generates a file `file.qdstrm` with performance data that is
annotated with PETSc events (methods) and Kokkos device kernel names.
The Nsight Systems GUI, `nsys-ui`, can be used to navigate this file
(<https://developer.nvidia.com/nsight-systems>). The GUI
shows a timeline of code performance information such as kernels,
memory allocations and frees, and CPU-GPU communication, with high-level data such as the times and sizes
of memory copies shown in a popup window when the mouse
hovers over a section.
To view the data, start `nsys-ui` without any arguments and then `Import` the
`.qdstrm` file in the GUI.
A side effect of this viewing process is the generation of a file `file.nsys-rep`, which can be opened directly
with `nsys-ui` in the future.
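When no GUI is available, a text summary can also be extracted from the report on the command line (a sketch, assuming the `file.nsys-rep` report generated above):

```shell
# print summary statistics (CUDA kernel times, memory operations,
# NVTX ranges, etc.) from an existing report as text
nsys stats file.nsys-rep
```
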

For an MPI parallel job, only one process should call `nsys`;
for example, have rank zero collect the `nsys` data and have all other
ranks call the executable directly. With MPICH
or Open MPI, a parallel job on 4 MPI tasks can be run as:

```console
mpiexec -n 1 nsys profile -t nvtx,cuda -o file_name --stats=true --force-overwrite true ./a.out : -n 3 ./a.out
```

Note: The Nsight Systems GUI can open profiling reports generated elsewhere. For
example, a report from a compute node can be analyzed on your local
machine, but take care to use the exact same version of
Nsight Systems that generated the report.
To check the version of Nsight Systems on the compute node, run `nsys-ui` and
note the version number at the top of the window.

## ROCProfiler profiling

For AMD GPUs, log events registered to PETSc can be displayed as ranges in trace files generated by `rocprof` by running with the `-log_roctx` flag.
See the `rocprof` [documentation](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest) for details of how to run the profiler.
At the least, you will need:

```console
mpiexec -n 1 rocprofv3 --marker-trace -o file_name -- ./path/to/application -log_roctx
```

(sec_using_tau)=

## Using TAU

TAU profiles can be generated, without the need for instrumentation, through the
use of the perfstubs package. PETSc is configured with `--with-tau-perfstubs` by default.
To generate profiles with TAU, first set up TAU:

```bash
wget http://tau.uoregon.edu/tau.tgz
tar xzf tau.tgz && cd tau-*   # unpack the release and enter its directory
./configure -cc=mpicc -c++=mpicxx -mpi -bfd=download -unwind=download && make install
export PATH=<tau dir>/x86_64/bin:$PATH
```

For more information on configuring TAU, see [http://tau.uoregon.edu](http://tau.uoregon.edu).
Next, run your program with TAU. For instance, to profile `ex56`,

```bash
cd $PETSC_DIR/src/snes/tutorials
make ex56
mpirun -n 4 tau_exec -T mpi ./ex56 -log_perfstubs <args>
```

This should produce four `profile.*` files with profile data that can be
viewed with `paraprof` or `pprof`:

```shell
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           26        1,838           1       41322    1838424 .TAU application
 73.2            1        1,345           2         168     672950 SNESSolve
 62.2            3        1,142           2        1282     571442 SNESJacobianEval
 62.0        1,136        1,138           2          76     569494 DMPlexJacobianFE
 60.1        0.046        1,105           1          32    1105001 Solve 1
 15.2           87          279           5       11102      55943 Mesh Setup
 13.2        0.315          241           1          32     241765 Solve 0
  7.8           80          144       38785       38785          4 MPI_Allreduce()
  7.0           69          128           6       43386      21491 DualSpaceSetUp
  6.2            1          114           4          54      28536 PCSetUp
  6.0           12          110           2         892      55407 PCSetUp_GAMG+
  3.9           70           70           1           0      70888 MPI_Init_thread()
  3.7           68           68       41747           0          2 MPI Collective Sync
  3.6            8           66           4        3536      16548 SNESFunctionEval
  2.6           45           48         171         171        281 MPI_Bcast()
  1.9           34           34        7836           0          4 MPI_Barrier()
  1.8        0.567           33           2          68      16912  GAMG Coarsen
```

```{raw} html
<hr>
```

```{eval-rst}
.. bibliography:: /petsc.bib
   :filter: docname in docnames
```