(ch_profiling)=

# Profiling

PETSc includes a consistent, lightweight scheme for profiling
application programs. The PETSc routines automatically log performance
data if certain options are specified at runtime. The user can also log
information about application codes for a complete picture of
performance.

In addition, as described in {any}`sec_ploginfo`, PETSc provides a
mechanism for printing informative messages about computations.
{any}`sec_profbasic` introduces the various profiling options in PETSc,
while the remainder of the chapter focuses on details such as monitoring
application codes and tips for accurate profiling.

(sec_profbasic)=

## Basic Profiling Information

The profiling options include the following:

- `-log_view [:filename]` - Prints an ASCII version of performance data at the
  program's conclusion. These statistics are comprehensive and concise
  and require little overhead; thus, `-log_view` is intended as the
  primary means of monitoring the performance of PETSc codes. See
  {any}`sec_ploginfo`.
- `-info [infofile]` - Prints verbose information about code to
  stdout or an optional file. This option provides details about
  algorithms, data structures, etc. Since the overhead of printing such
  output slows a code, this option should not be used when evaluating a
  program's performance. See {any}`sec_petscinfo`.
- `-log_trace [logfile]` - Traces the beginning and ending of all
  PETSc events. This option, which can be used in conjunction with
  `-info`, is useful for seeing where a program is hanging without
  running in the debugger. See `PetscLogTraceBegin()`.

As discussed in {any}`sec_mpelogs`, additional profiling
can be done with MPE.
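For example (the executable name `./app` and the solver option shown here are illustrative, not from a specific tutorial), a performance summary can be written to a file at program exit with:

```console
mpiexec -n 2 ./app -ksp_type gmres -log_view :perf.txt
```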
(sec_ploginfo)=

### Interpreting `-log_view` Output: The Basics

As shown in {any}`listing <listing_exprof>` in {any}`sec_profiling_programs`, the
option `-log_view [:filename]` activates printing of profile data to standard
output or an ASCII file at the conclusion of a program. See
`PetscLogView()` for all the possible output options.

We print performance data for each routine, organized by PETSc
libraries, followed by any user-defined events (discussed in
{any}`sec_profileuser`). For each routine, the output data include the
maximum time and floating point operation (flop) rate over all
processes. Information about parallel performance is also included, as
discussed in the following section.

For the purpose of PETSc floating point operation counting, we define
one *flop* as one operation of any of the following types:
multiplication, division, addition, or subtraction. For example, one
`VecAXPY()` operation, which computes $y = \alpha x + y$ for vectors of
length $N$, requires $2N$ flop (consisting of $N$ additions and $N$
multiplications). Bear in mind that flop rates present only a limited
view of performance, since memory loads and stores are the real
performance barrier.

For simplicity, the remainder of this discussion focuses on interpreting
profile data for the `KSP` library, which provides the linear solvers at
the heart of the PETSc package. Recall the hierarchical organization of
the PETSc library, as shown in {any}`fig_library`. Each `KSP` solver is
composed of a `PC` (preconditioner) and a `KSP` (Krylov subspace) part,
which are in turn built on top of the `Mat` (matrix) and `Vec` (vector)
modules. Thus, operations in the `KSP` module are composed of
lower-level operations in these packages. Note also that the nonlinear
solvers library, `SNES`, is built on top of the `KSP` module, and the
timestepping library, `TS`, is in turn built on top of `SNES`.
We briefly discuss interpretation of the sample output in
{any}`listing <listing_exprof>`, which was generated by solving a
linear system on one process using restarted GMRES and ILU
preconditioning. The linear solvers in `KSP` consist of two basic
phases, `KSPSetUp()` and `KSPSolve()`, each of which consists of a
variety of actions, depending on the particular solution technique. For
the case of using the `PCILU` preconditioner and `KSPGMRES` Krylov
subspace method, the breakdown of PETSc routines is listed below. As
indicated by the levels of indentation, the operations in `KSPSetUp()`
include all of the operations within `PCSetUp()`, which in turn
include `MatILUFactor()`, and so on.

- `KSPSetUp` - Set up linear solver

  - `PCSetUp` - Set up preconditioner

    - `MatILUFactor` - Factor matrix

      - `MatILUFactorSymbolic` - Symbolic factorization phase
      - `MatLUFactorNumeric` - Numeric factorization phase

- `KSPSolve` - Solve linear system

  - `PCApply` - Apply preconditioner

    - `MatSolve` - Forward/backward triangular solves

  - `KSPGMRESOrthog` - Orthogonalization in GMRES

    - `VecDot` or `VecMDot` - Inner products
    - `VecAXPY` or `VecMAXPY` - Vector updates

  - `MatMult` - Matrix-vector product

    - `MatMultAdd` - Matrix-vector product + vector addition

  - `VecScale`, `VecNorm`, `VecAXPY`, `VecCopy`, ...

The summaries printed via `-log_view` reflect this routine hierarchy.
For example, the performance summaries for a particular high-level
routine such as `KSPSolve()` include all of the operations accumulated
in the lower-level components that make up the routine.

The output produced with `-log_view` is flat, meaning that the hierarchy
of PETSc operations is not completely clear.
For a
particular problem, the user should generally have an idea of the basic
operations that are required for its implementation (e.g., which
operations are performed when using GMRES and ILU, as described above),
so interpreting the `-log_view` data should be relatively
straightforward.
If this is problematic, it is also possible to examine
the profiling information in a nested format. For more information see
{any}`sec_nestedevents`.

(sec_parperformance)=

### Interpreting `-log_view` Output: Parallel Performance

We next discuss performance summaries for parallel programs, as shown
within {any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`,
which present the
output generated by the `-log_view` option. The program that generated
this data is
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>.
The code loads a matrix and right-hand-side vector from a binary file
and then solves the resulting linear system; the program then repeats
this process for a second linear system. This particular case was run on
four processors of an Intel x86_64 Linux cluster, using restarted GMRES
and the block Jacobi preconditioner, where each block was solved with
ILU. The two input files `medium` and `arco6` can be obtained from
[datafiles](https://gitlab.com/petsc/datafiles); see {any}`petsc_repositories`.

The first {any}`listing <listing_exparprof>` presents an overall
performance summary, including times, floating-point operations,
computational rates, and message-passing activity (such as the number
and size of messages sent and collective operations). Summaries for
various user-defined stages of monitoring (as discussed in
{any}`sec_profstages`) are also given.
Information about the
various phases of computation then follows (as shown separately here in
the second {any}`listing <listing_exparprof2>`). Finally, a summary of
object creation and destruction is presented.

(listing_exparprof)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
   -matload_block_size 3 -pc_type bjacobi -options_left

Number of iterations = 19
Residual norm 1.088292e-05
Number of iterations = 59
Residual norm 3.871022e-02
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex10 on a intel-bdw-opt named beboplogin4 with 4 processors, by jczhang Mon Apr 23 13:36:54 2018
Using PETSc Development Git Revision: v3.9-163-gbe3efd42  Git Date: 2018-04-16 10:45:40 -0500

                         Max       Max/Min     Avg        Total
Time (sec):           1.849e-01    1.00002     1.849e-01
Objects:              1.060e+02    1.00000     1.060e+02
Flops:                2.361e+08    1.00684     2.353e+08  9.413e+08
Flops/sec:            1.277e+09    1.00685     1.273e+09  5.091e+09
MPI Msg Count:        2.360e+02    1.34857     2.061e+02  8.245e+02
MPI Msg Len (bytes):  1.256e+07    2.24620     4.071e+04  3.357e+07
MPI Reductions:       2.160e+02    1.00000

Summary of Stages:  ----- Time ------   ----- Flop -----   --- Messages ---   -- Message Lengths --   -- Reductions --
                      Avg     %Total      Avg     %Total     counts  %Total     Avg        %Total      counts  %Total
 0:     Main Stage: 5.9897e-04   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%   2.000e+00   0.9%
 1:  Load System 0: 2.9113e-03   1.6%  0.0000e+00   0.0%  3.550e+01   4.3%  5.984e+02       0.1%   2.200e+01  10.2%
 2:     KSPSetUp 0: 7.7349e-04   0.4%  9.9360e+03   0.0%  0.000e+00   0.0%  0.000e+00       0.0%   2.000e+00   0.9%
 3:     KSPSolve 0: 1.7690e-03   1.0%  2.9673e+05   0.0%  1.520e+02  18.4%  1.800e+02       0.1%   3.900e+01  18.1%
 4:  Load System 1: 1.0056e-01  54.4%  0.0000e+00   0.0%  3.700e+01   4.5%  5.657e+05      62.4%   2.200e+01  10.2%
 5:     KSPSetUp 1: 5.6883e-03   3.1%  2.1205e+07   2.3%  0.000e+00   0.0%  0.000e+00       0.0%   2.000e+00   0.9%
 6:     KSPSolve 1: 7.2578e-02  39.3%  9.1979e+08  97.7%  6.000e+02  72.8%  2.098e+04      37.5%   1.200e+02  55.6%

------------------------------------------------------------------------------------------------------------------------

.... [Summary of various phases, see part II below] ...

------------------------------------------------------------------------------------------------------------------------

Object Type      Creations   Destructions   (Reports information only for process 0.)
...
--- Event Stage 3: KSPSolve 0

          Matrix     0              4
          Vector    20             30
       Index Set     0              3
     Vec Scatter     0              1
   Krylov Solver     0              2
  Preconditioner     0              2
```

We next focus on the summaries for the various phases of the
computation, as given in the table within
the following {any}`listing <listing_exparprof2>`. The summary for each
phase presents the maximum times and flop rates over all processes, as
well as the ratio of maximum to minimum times and flop rates for all
processes. A ratio of approximately 1 indicates that computations within
a given phase are well balanced among the processes; as the ratio
increases, the balance becomes increasingly poor. Also, the total
computational rate (in units of MFlop/sec) is given for each phase in
the final column of the phase summary table.

$$
{\rm Total\: Mflop/sec} \:=\: 10^{-6} * ({\rm sum\; of\; flop\; over\; all\; processors})/({\rm max\; time\; over\; all\; processors})
$$

*Note*: Total computational rates $<$ 1 MFlop/sec are listed as 0 in
this column of the phase summary table. Additional statistics for each
phase include the total number of messages sent, the average message
length, and the number of global reductions.
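As a worked instance of this formula, take the `KSPSolve` phase of event stage 6: the stage summary reports a total of $9.1979\times 10^{8}$ flop over all processes, and the phase's maximum time is $7.2577\times 10^{-2}$ seconds, so

$$
{\rm Total\: Mflop/sec} \:=\: 10^{-6} \times \frac{9.1979\times 10^{8}}{7.2577\times 10^{-2}} \:\approx\: 12673,
$$

which matches the final column of the `KSPSolve` row in the phase summary.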
(listing_exparprof2)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
   -matload_block_size 3 -pc_type bjacobi -options_left

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
.... [Overall summary, see part I] ...

Phase summary info:
   Count: number of times phase was executed
   Time and Flop/sec: Max - maximum over all processors
                      Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: optional user-defined stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10^6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Phase             Count  Time (sec)     Flop/sec     --- Global ---   --- Stage ----   Total
                    Max Ratio   Max     Ratio  Mess  AvgLen  Reduct   %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------
...

--- Event Stage 5: KSPSetUp 1

MatLUFactorNum       1 1.0 3.6440e-03 1.1 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0  62100  0  0  0  5819
MatILUFactorSym      1 1.0 1.7111e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  26  0  0  0  0     0
MatGetRowIJ          1 1.0 1.1921e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering       1 1.0 3.0041e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
KSPSetUp             2 1.0 6.6495e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  1   9  0  0  0100     0
PCSetUp              2 1.0 5.4271e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3907
PCSetUpOnBlocks      1 1.0 5.3999e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3927

--- Event Stage 6: KSPSolve 1

MatMult             60 1.0 2.4068e-02 1.1 6.54e+07 1.0 6.0e+02 2.1e+04 0.0e+00 12 27 73 37  0  32 28100100  0 10731
MatSolve            61 1.0 1.9177e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 10 25  0  0  0  26 26  0  0  0 12491
VecMDot             59 1.0 1.4741e-02 1.3 4.86e+07 1.0 0.0e+00 0.0e+00 5.9e+01  7 21  0  0 27  18 21  0  0 49 13189
VecNorm             61 1.0 3.0417e-03 1.4 3.29e+06 1.0 0.0e+00 0.0e+00 6.1e+01  1  1  0  0 28   4  1  0  0 51  4332
VecScale            61 1.0 9.9802e-04 1.0 1.65e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  6602
VecCopy              2 1.0 5.9128e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet              64 1.0 8.0323e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
VecAXPY              3 1.0 7.4387e-05 1.1 1.62e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  8712
VecMAXPY            61 1.0 8.8558e-03 1.1 5.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  5 22  0  0  0  12 23  0  0  0 23393
VecScatterBegin     60 1.0 9.6416e-04 1.8 0.00e+00 0.0 6.0e+02 2.1e+04 0.0e+00  0  0 73 37  0   1  0100100  0     0
VecScatterEnd       60 1.0 6.1543e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   8  0  0  0  0     0
VecNormalize        61 1.0 4.2675e-03 1.3 4.94e+06 1.0 0.0e+00 0.0e+00 6.1e+01  2  2  0  0 28   5  2  0  0 51  4632
KSPGMRESOrthog      59 1.0 2.2627e-02 1.1 9.72e+07 1.0 0.0e+00 0.0e+00 5.9e+01 11 41  0  0 27  29 42  0  0 49 17185
KSPSolve             1 1.0 7.2577e-02 1.0 2.31e+08 1.0 6.0e+02 2.1e+04 1.2e+02 39 98 73 37 56  99100100100100 12673
PCSetUpOnBlocks      1 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             61 1.0 2.0427e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 25  0  0  0  28 26  0  0  0 11726
------------------------------------------------------------------------------------------------------------------------
.... [Conclusion of overall summary, see part I] ...
```

As discussed in the preceding section, the performance summaries for
higher-level PETSc routines include the statistics for the lower levels
of which they are made up. For example, the communication within
matrix-vector products `MatMult()` consists of vector scatter
operations, as given by the routines `VecScatterBegin()` and
`VecScatterEnd()`.

The final data presented are the percentages of the various statistics
(time (`%T`), flop (`%F`), messages (`%M`), average message
length (`%L`), and reductions (`%R`)) for each event relative to the
total computation and to any user-defined stages (discussed in
{any}`sec_profstages`). These statistics can aid in
optimizing performance, since they indicate the sections of code that
could benefit from various kinds of tuning.
{any}`ch_performance` gives suggestions about achieving good
performance with PETSc codes.

The additional option `-log_view_memory` causes the display of extra columns of information about how much
memory was allocated and freed during each logged event. This is useful
for understanding which phases of a computation require the most memory.

(sec_mpelogs)=

### Using `-log_mpe` with Jumpshot

It is also possible to use the *Jumpshot* package
{cite}`upshot` to visualize PETSc events.
This package comes
with the MPE software, which is part of the MPICH
{cite}`mpich-web-page` implementation of MPI. The option

```none
-log_mpe [logfile]
```

creates a logfile of events appropriate for viewing with *Jumpshot*. The
user can either use the default logging file or specify a name via
`logfile`. Events can be deactivated as described in
{any}`sec_deactivate`.

The user can also log MPI events. To do this, simply consider the PETSc
application as any MPI application, and follow the MPI implementation's
instructions for logging MPI calls. For example, when using MPICH, this
merely requires adding `-llmpich` to the library list *before*
`-lmpich`.

(sec_nestedevents)=

### Profiling Nested Events

It is possible to output the PETSc logging information in a nested format
where the hierarchy of events is explicit. This output can be generated
either as an XML file or as a text file in a format suitable for viewing as
a flame graph.

One can generate the XML output by passing the option `-log_view :[logfilename]:ascii_xml`.
It can be viewed by copying `${PETSC_DIR}/share/petsc/xml/performance_xml2html.xsl`
into the current directory, then opening the logfile in your browser.

The flame graph output can be generated with the option `-log_view :[logfile]:ascii_flamegraph`.
It can then be visualized with either [FlameGraph](https://github.com/brendangregg/FlameGraph)
or [speedscope](https://www.speedscope.app). A flame graph can be visualized directly from
stdout using, for example,
[ImageMagick's display utility](https://imagemagick.org/script/display.php):

```
cd $PETSC_DIR/src/sys/tests
make ex30
mpiexec -n 2 ./ex30 -log_view ::ascii_flamegraph | flamegraph | display
```

Note that user-defined stages (see {any}`sec_profstages`) will be ignored when
using this nested format.
(sec_profileuser)=

## Profiling Application Codes

PETSc automatically logs object creation, times, and floating-point
counts for the library routines. Users can easily supplement this
information by profiling their application codes as well. The basic
steps involved in logging a user-defined portion of code, called an
*event*, are shown in the code fragment below:

```
PetscLogEvent USER_EVENT;
PetscClassId classid;
PetscLogDouble user_event_flops;

PetscClassIdRegister("class name",&classid);
PetscLogEventRegister("User event name",classid,&USER_EVENT);
PetscLogEventBegin(USER_EVENT,0,0,0,0);
/* code segment to monitor */
PetscLogFlops(user_event_flops);
PetscLogEventEnd(USER_EVENT,0,0,0,0);
```

One must register the event by calling `PetscLogEventRegister()`,
which assigns a unique integer to identify the event for profiling
purposes:

```
PetscLogEventRegister(const char string[],PetscClassId classid,PetscLogEvent *e);
```

Here `string` is a user-defined event name, and `classid` is a class
identifier, either a predefined one (such as `MAT_CLASSID`) or one
obtained with `PetscClassIdRegister()` as above; one should see the
manual page for details. The argument returned in `e` should then be
passed to the `PetscLogEventBegin()` and `PetscLogEventEnd()` routines.

Events are logged by using the pair

```
PetscLogEventBegin(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
PetscLogEventEnd(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
```

The four objects are the PETSc objects that are most closely associated
with the event. For instance, in a matrix-vector product they would be
the matrix and the two vectors. These objects can be omitted by
specifying 0 for `o1` - `o4`. The code between these two routine
calls will be automatically timed and logged as part of the specified
event.
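For instance, a sketch (the event name and the vectors `x` and `y` here are illustrative, not taken from the fragment above) of associating the objects touched by a user kernel with its event:

```
PetscLogEvent USER_KERNEL;

PetscLogEventRegister("UserKernel",VEC_CLASSID,&USER_KERNEL);
PetscLogEventBegin(USER_KERNEL,(PetscObject)x,(PetscObject)y,0,0);
/* kernel acting on vectors x and y */
PetscLogEventEnd(USER_KERNEL,(PetscObject)x,(PetscObject)y,0,0);
```

When the program is run with `-log_view`, this event then appears in the summaries alongside the built-in PETSc events.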
Events are collective by default on the communicator of `o1` (if present).
They can be made not collective by using `PetscLogEventSetCollective()`.
No synchronization is performed on collective events in optimized builds unless
the command line option `-log_sync` is used; however, collective
semantics are checked in debug mode.

The user can log the number of floating-point operations for this
segment of code by calling

```
PetscLogFlops(number of flop for this code segment);
```

between the calls to `PetscLogEventBegin()` and
`PetscLogEventEnd()`. This value will automatically be added to the
global flop counter for the entire program.

(sec_profstages)=

## Profiling Multiple Sections of Code

By default, the profiling produces a single set of statistics for all
code between the `PetscInitialize()` and `PetscFinalize()` calls
within a program. One can independently monitor several "stages" of code
by switching among the various stages with the commands

```
PetscLogStagePush(PetscLogStage stage);
PetscLogStagePop();
```

See the manual pages for details.
The command

```
PetscLogStageRegister(const char *name,PetscLogStage *stage)
```

allows one to associate a name with a stage; these names are printed
whenever summaries are generated with `-log_view`. The following code fragment uses three profiling
stages within a program.
```
PetscLogStage stagenum0, stagenum1, stagenum2;

PetscInitialize(&argc, &args, NULL, NULL);
/* register each stage once, outside the loop */
PetscLogStageRegister("Stage 0 of Code", &stagenum0);
PetscLogStageRegister("Stage 1 of Code", &stagenum1);
PetscLogStageRegister("Stage 2 of Code", &stagenum2);
PetscLogStagePush(stagenum0);
/* stage 0 of code here */
PetscLogStagePop();
for (i = 0; i < ntimes; i++) {
  PetscLogStagePush(stagenum1);
  /* stage 1 of code here */
  PetscLogStagePop();
  PetscLogStagePush(stagenum2);
  /* stage 2 of code here */
  PetscLogStagePop();
}
PetscFinalize();
```

The listings above
show output generated by
`-log_view` for a program that employs several profiling stages. In
particular, this program is subdivided into six stages (in addition to
the default main stage): loading a matrix and right-hand-side vector from a binary file,
setting up the preconditioner, and solving the linear system; this
sequence is then repeated for a second linear system. For simplicity,
the second listing contains output only for
stages 5 and 6 (setting up and solving the second linear system), which comprise the
part of this computation of most interest to us in terms of performance
monitoring. This code organization (solving a small linear system
followed by a larger system) enables generation of more accurate
profiling statistics for the second system by overcoming the often
considerable overhead of paging, as discussed in
{any}`sec_profaccuracy`.

(sec_deactivate)=

## Restricting Event Logging

By default, all PETSc operations are logged. To enable or disable the
PETSc logging of individual events, one uses the commands

```
PetscLogEventActivate(PetscLogEvent event);
PetscLogEventDeactivate(PetscLogEvent event);
```

The `event` may be either a predefined PETSc event (as listed in the
file `$PETSC_DIR/include/petsclog.h`) or one obtained with
`PetscLogEventRegister()` (as described in
{any}`sec_profileuser`).
PETSc also provides routines that deactivate (or activate) logging for
entire components of the library. Currently, the components that support
such logging (de)activation are `Mat` (matrices), `Vec` (vectors),
`KSP` (linear solvers, including `KSP` and `PC`), and `SNES`
(nonlinear solvers):

```
PetscLogEventDeactivateClass(MAT_CLASSID);
PetscLogEventDeactivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventDeactivateClass(VEC_CLASSID);
PetscLogEventDeactivateClass(SNES_CLASSID);
```

and

```
PetscLogEventActivateClass(MAT_CLASSID);
PetscLogEventActivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventActivateClass(VEC_CLASSID);
PetscLogEventActivateClass(SNES_CLASSID);
```

(sec_petscinfo)=

## Interpreting `-info` Output: Informative Messages

Users can activate the printing of verbose information about algorithms,
data structures, etc. to the screen by using the option `-info` or by
calling `PetscInfoAllow(PETSC_TRUE)`. Such logging, which is used
throughout the PETSc libraries, can aid the user in understanding
algorithms and tuning program performance. For example, as discussed in
{any}`sec_matsparse`, `-info` activates the printing of
information about memory allocation during matrix assembly.

One can selectively turn off informative messages about any of the basic
PETSc objects (e.g., `Mat`, `SNES`) with the command

```
PetscInfoDeactivateClass(int object_classid)
```

where `object_classid` is one of `MAT_CLASSID`, `SNES_CLASSID`,
etc. Messages can be reactivated with the command

```
PetscInfoActivateClass(int object_classid)
```

Such deactivation can be useful when one wishes to view information
about higher-level PETSc libraries (e.g., `TS` and `SNES`) without
seeing all the lower-level data as well (e.g., `Mat`).
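As a small sketch of this pattern (assuming `-info` or `PetscInfoAllow(PETSC_TRUE)` is active), one might silence matrix messages around a region of code:

```
PetscInfoDeactivateClass(MAT_CLASSID);
/* code whose Mat-related info messages are suppressed */
PetscInfoActivateClass(MAT_CLASSID);
```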
One can turn logging for particular classes on or off at runtime:

```console
-info [filename][:[~]<list,of,classnames>[:[~]self]]
```

The `list,of,classnames` is a comma-separated list (with no spaces) of the
classes for which one wishes to view information, for
example `vec,ksp`; information on all other classes will not be displayed. A `~` in front of the list inverts it: information is displayed for all classes *except* those listed.

`self` indicates to display information on objects that are associated with `PETSC_COMM_SELF`, while `~self` indicates to display information only for parallel objects.

See `PetscInfo()` for links to all the info operations that are available.

Application programmers can log their own messages, as well, by using the
routine

```
PetscInfo(void* obj,char *message,...)
```

where `obj` is the PETSc object most closely associated with the
logging statement, `message`. For example, in the line search Newton
methods, we use a statement such as

```
PetscInfo(snes,"Cubic step, lambda %g\n",lambda);
```

## Time

PETSc application programmers can access the wall clock time directly
with the command

```
PetscLogDouble time;
PetscCall(PetscTime(&time));
```

which returns the current time in seconds since the epoch, and is
commonly implemented with `MPI_Wtime`. A floating point number is
returned in order to express fractions of a second. In addition, as
discussed in {any}`sec_profileuser`, PETSc can automatically
profile user-defined segments of code.

## Saving Output to a File

All output from PETSc programs (including informative messages,
profiling information, and convergence data) can be saved to a file by
using the command line option `-history [filename]`. If no file name
is specified, the output is stored in the file
`${HOME}/.petschistory`.
Note that this option only saves output
printed with the `PetscPrintf()` and `PetscFPrintf()` commands, not
the standard `printf()` and `fprintf()` statements.

(sec_profaccuracy)=

## Accurate Profiling and Paging Overheads

One factor that often plays a significant role in profiling a code is
paging by the operating system. Generally, when running a program, only
a few pages required to start it are loaded into memory rather than the
entire executable. When the execution proceeds to code segments that are
not in memory, a page fault occurs, prompting the required pages to be
loaded from the disk (a very slow process). This activity distorts the
results significantly. (The paging effects are noticeable in the log
files generated by `-log_mpe`, which is described in
{any}`sec_mpelogs`.)

To eliminate the effects of paging when profiling the performance of a
program, we have found that an effective procedure is to run the *exact same
code* on a small dummy problem before running it on the actual problem
of interest. We thus ensure that all code required by a solver is loaded
into memory during solution of the small problem. When the code proceeds
to the actual (larger) problem of interest, all required pages have
already been loaded into main memory, so that the performance numbers
are not distorted.

When this procedure is used in conjunction with the user-defined stages
of profiling described in {any}`sec_profstages`, we can
focus easily on the problem of interest. For example, we used this
technique in the program
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>
to generate the timings within
{any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`.
In this case, the profiled code
of interest (solving the linear system for the larger problem) occurs
within event stages 5 and 6.
{any}`sec_parperformance`
provides details about interpreting such profiling data.

In particular, the macros

```
PetscPreLoadBegin(PetscBool flag,char* stagename)
PetscPreLoadStage(char *stagename)
```

and

```
PetscPreLoadEnd()
```

can be used to easily convert a regular PETSc program to one that uses
preloading. The command line options `-preload true` and
`-preload false` may be used to turn preloading on and off at run
time for PETSc programs that use these macros.

## NVIDIA Nsight Systems profiling

Nsight Systems can generate profiling data for a CUDA executable
via the command `nsys`.
For example, in serial,

```bash
nsys profile -t nvtx,cuda -o file --stats=true --force-overwrite true ./a.out
```

will generate a file `file.qdstrm` with performance data that is
annotated with PETSc events (methods) and Kokkos device kernel names.
The Nsight Systems GUI, `nsys-ui`, can be used to navigate this file
(<https://developer.nvidia.com/nsight-systems>). The Nsight Systems GUI
lets you see a timeline of code performance information like kernels,
memory mallocs and frees, CPU-GPU communication, and high-level data like time, sizes
of memory copies, and more, in a popup window when the mouse
hovers over the section.
To view the data, start `nsys-ui` without any arguments and then `Import` the
`.qdstrm` file in the GUI.
A side effect of this viewing process is the generation of a file `file.nsys-rep`, which can be viewed directly
with `nsys-ui` in the future.

For an MPI parallel job, only one process should run under `nsys`;
for example, have rank zero generate the `nsys` data and have all other
ranks call the executable directly.
For example, with MPICH
or Open MPI, we can run a parallel job on 4 MPI tasks as:

```console
mpiexec -n 1 nsys profile -t nvtx,cuda -o file_name --stats=true --force-overwrite true ./a.out : -n 3 ./a.out
```

Note: The Nsight GUI can open profiling reports from elsewhere. For
example, a report from a compute node can be analyzed on your local
machine, but care should be taken to use the exact same version of
Nsight Systems that generated the report.
To check the version of Nsight on the compute node, run `nsys-ui` and
note the version number at the top of the window.

## ROCProfiler profiling

For AMD GPUs, log events registered to PETSc can be displayed as ranges in trace files generated by `rocprof` by running with the `-log_roctx` flag.
See the `rocprof` [documentation](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest) for details on how to run the profiler.
At the least, you will need:

```console
mpiexec -n 1 rocprofv3 --marker-trace -o file_name -- ./path/to/application -log_roctx
```

(sec_using_tau)=

## Using TAU

TAU profiles can be generated without the need for instrumentation through the
use of the perfstubs package. PETSc by default is configured with `--with-tau-perfstubs`.
To generate profiles with TAU, first set up TAU:

```bash
wget http://tau.uoregon.edu/tau.tgz
./configure -cc=mpicc -c++=mpicxx -mpi -bfd=download -unwind=download && make install
export PATH=<tau dir>/x86_64/bin:$PATH
```

For more information on configuring TAU, see [http://tau.uoregon.edu](http://tau.uoregon.edu).
Next, run your program with TAU.
For instance, to profile `ex56`,

```bash
cd $PETSC_DIR/src/snes/tutorials
make ex56
mpirun -n 4 tau_exec -T mpi ./ex56 -log_perfstubs <args>
```

This should produce four `profile.*` files with profile data that can be
viewed with `paraprof`/`pprof`:

```shell
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           26        1,838           1       41322    1838424 .TAU application
 73.2            1        1,345           2         168     672950 SNESSolve
 62.2            3        1,142           2        1282     571442 SNESJacobianEval
 62.0        1,136        1,138           2          76     569494 DMPlexJacobianFE
 60.1        0.046        1,105           1          32    1105001 Solve 1
 15.2           87          279           5       11102      55943 Mesh Setup
 13.2        0.315          241           1          32     241765 Solve 0
  7.8           80          144       38785       38785          4 MPI_Allreduce()
  7.0           69          128           6       43386      21491 DualSpaceSetUp
  6.2            1          114           4          54      28536 PCSetUp
  6.0           12          110           2         892      55407 PCSetUp_GAMG+
  3.9           70           70           1           0      70888 MPI_Init_thread()
  3.7           68           68       41747           0          2 MPI Collective Sync
  3.6            8           66           4        3536      16548 SNESFunctionEval
  2.6           45           48         171         171        281 MPI_Bcast()
  1.9           34           34        7836           0          4 MPI_Barrier()
  1.8        0.567           33           2          68      16912 GAMG Coarsen
```

```{raw} html
<hr>
```

```{eval-rst}
.. bibliography:: /petsc.bib
   :filter: docname in docnames
```