(ch_profiling)=

# Profiling

PETSc includes a consistent, lightweight scheme for profiling
application programs. The PETSc routines automatically log
performance data if certain options are specified at runtime. The user
can also log information about application codes for a complete picture
of performance.

In addition, as described in
{any}`sec_ploginfo`, PETSc provides a mechanism for
printing informative messages about computations.
{any}`sec_profbasic` introduces the various profiling
options in PETSc, while the remainder of the chapter focuses on details
such as monitoring application codes and tips for accurate profiling.

(sec_profbasic)=

## Basic Profiling Information

The profiling options include the following:

- `-log_view [:filename]` - Prints an ASCII version of performance data at the
  program’s conclusion. These statistics are comprehensive and concise
  and require little overhead; thus, `-log_view` is intended as the
  primary means of monitoring the performance of PETSc codes. See {any}`sec_ploginfo`.
- `-info [infofile]` - Prints verbose information about code to
  stdout or an optional file. This option provides details about
  algorithms, data structures, etc.
  Since the overhead of printing such
  output slows a code, this option should not be used when evaluating a
  program’s performance. See {any}`sec_petscinfo`.
- `-log_trace [logfile]` - Traces the beginning and ending of all
  PETSc events. This option, which can be used in conjunction with
  `-info`, is useful to see where a program is hanging without
  running in the debugger. See `PetscLogTraceBegin()`.

As discussed in {any}`sec_mpelogs`, additional profiling
can be done with MPE.

(sec_ploginfo)=

### Interpreting `-log_view` Output: The Basics

As shown in {any}`listing <listing_exprof>` in {any}`sec_profiling_programs`, the
option `-log_view [:filename]` activates printing of profile data to standard
output or an ASCII file at the conclusion of a program. See `PetscLogView()` for all the possible
output options.

We print performance data for each routine, organized by PETSc
libraries, followed by any user-defined events (discussed in
{any}`sec_profileuser`). For each routine, the output data
include the maximum time and floating-point operation (flop) rate over
all processes. Information about parallel performance is also included,
as discussed in the following section.
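
For example (a sketch, where `./app` stands in for any PETSc executable and
`profile.txt` is an arbitrary file name), the two common forms of the option are:

```none
./app -log_view                # print the report to stdout at PetscFinalize()
./app -log_view :profile.txt   # write the same report to the file profile.txt
```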

For the purpose of PETSc floating point operation counting, we define
one *flop* as one operation of any of the following types:
multiplication, division, addition, or subtraction. For example, one
`VecAXPY()` operation, which computes $y = \alpha x + y$ for
vectors of length $N$, requires $2N$ flop (consisting of
$N$ additions and $N$ multiplications). Bear in mind that
flop rates present only a limited view of performance, since memory
loads and stores are the real performance barrier.

For simplicity, the remainder of this discussion focuses on interpreting
profile data for the `KSP` library, which provides the linear solvers
at the heart of the PETSc package. Recall the hierarchical organization
of the PETSc library, as shown in
{any}`fig_library`. Each `KSP` solver is composed
of a `PC` (preconditioner) and a `KSP` (Krylov subspace) part, which
are in turn built on top of the `Mat` (matrix) and `Vec` (vector)
modules. Thus, operations in the `KSP` module are composed of
lower-level operations in these packages. Note also that the nonlinear
solvers library, `SNES`, is built on top of the `KSP` module, and
the timestepping library, `TS`, is in turn built on top of `SNES`.

We briefly discuss interpretation of the sample output in
{any}`listing <listing_exprof>`, which was generated by solving a
linear system on one process using restarted GMRES and ILU
preconditioning.
The linear solvers in `KSP` consist of two basic
phases, `KSPSetUp()` and `KSPSolve()`, each of which consists of a
variety of actions, depending on the particular solution technique. For
the case of using the `PCILU` preconditioner and `KSPGMRES` Krylov
subspace method, the breakdown of PETSc routines is listed below. As
indicated by the levels of indentation, the operations in `KSPSetUp()`
include all of the operations within `PCSetUp()`, which in turn
include `MatILUFactor()`, and so on.

- `KSPSetUp` - Set up linear solver

  - `PCSetUp` - Set up preconditioner

    - `MatILUFactor` - Factor preconditioning matrix

      - `MatILUFactorSymbolic` - Symbolic factorization phase
      - `MatLUFactorNumeric` - Numeric factorization phase

- `KSPSolve` - Solve linear system

  - `PCApply` - Apply preconditioner

    - `MatSolve` - Forward/backward triangular solves

  - `KSPGMRESOrthog` - Orthogonalization in GMRES

    - `VecDot` or `VecMDot` - Inner products
    - `VecAXPY` or `VecMAXPY` - Vector updates

  - `MatMult` - Matrix-vector product

    - `MatMultAdd` - Matrix-vector product + vector addition

  - `VecScale`, `VecNorm`, `VecAXPY`, `VecCopy`, ...
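
In application code, these two phases correspond to the usual solver calls; a
minimal sketch (error checking and the assembly of `A`, `b`, and `x` omitted) is:

```
KSPCreate(PETSC_COMM_WORLD, &ksp);
KSPSetOperators(ksp, A, A);   /* A assumed already assembled */
KSPSetFromOptions(ksp);       /* e.g. -ksp_type gmres -pc_type ilu */
KSPSetUp(ksp);                /* logged under KSPSetUp: PCSetUp, MatILUFactor, ... */
KSPSolve(ksp, b, x);          /* logged under KSPSolve: MatMult, PCApply, KSPGMRESOrthog, ... */
KSPDestroy(&ksp);
```

Note that `KSPSolve()` calls `KSPSetUp()` automatically if it has not been
called explicitly; calling it separately, as here, keeps the two phases distinct
in the profile.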

The summaries printed via `-log_view` reflect this routine hierarchy.
For example, the performance summaries for a particular high-level
routine such as `KSPSolve()` include all of the operations accumulated
in the lower-level components that make up the routine.

The output produced with `-log_view` is flat, meaning that the hierarchy
of PETSc operations is not completely clear. For a
particular problem, the user should generally have an idea of the basic
operations that are required for its implementation (e.g., which
operations are performed when using GMRES and ILU, as described above),
so that interpreting the `-log_view` data should be relatively
straightforward.
If this is problematic, it is also possible to examine
the profiling information in a nested format. For more information, see
{any}`sec_nestedevents`.

(sec_parperformance)=

### Interpreting `-log_view` Output: Parallel Performance

We next discuss performance summaries for parallel programs, as shown
within {any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`,
which present the
output generated by the `-log_view` option. The program that generated
this data is
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>.
The code loads a matrix and right-hand-side vector from a binary file
and then solves the resulting linear system; the program then repeats
this process for a second linear system. This particular case was run on
four processes of an Intel x86_64 Linux cluster, using restarted GMRES
and the block Jacobi preconditioner, where each block was solved with
ILU. The two input files `medium` and `arco6` can be obtained from
[datafiles](https://gitlab.com/petsc/datafiles); see {any}`petsc_repositories`.

The first {any}`listing <listing_exparprof>` presents an overall
performance summary, including times, floating-point operations,
computational rates, and message-passing activity (such as the number
and size of messages sent and collective operations). Summaries for
various user-defined stages of monitoring (as discussed in
{any}`sec_profstages`) are also given. Information about the
various phases of computation then follows (as shown separately here in
the second {any}`listing <listing_exparprof2>`). Finally, a summary of
object creation and destruction is presented.

(listing_exparprof)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
   -matload_block_size 3 -pc_type bjacobi -options_left

Number of iterations = 19
Residual norm 1.088292e-05
Number of iterations = 59
Residual norm 3.871022e-02
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex10 on a intel-bdw-opt named beboplogin4 with 4 processors, by jczhang Mon Apr 23 13:36:54 2018
Using PETSc Development Git Revision: v3.9-163-gbe3efd42  Git Date: 2018-04-16 10:45:40 -0500

                         Max       Max/Min     Avg       Total
Time (sec):           1.849e-01     1.00002   1.849e-01
Objects:              1.060e+02     1.00000   1.060e+02
Flops:                2.361e+08     1.00684   2.353e+08  9.413e+08
Flops/sec:            1.277e+09     1.00685   1.273e+09  5.091e+09
MPI Msg Count:        2.360e+02     1.34857   2.061e+02  8.245e+02
MPI Msg Len (bytes):  1.256e+07     2.24620   4.071e+04  3.357e+07
MPI Reductions:       2.160e+02     1.00000

Summary of Stages:   ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    counts   %Total     Avg        %Total    counts   %Total
 0:      Main Stage: 5.9897e-04   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%  2.000e+00   0.9%
 1:  Load System 0: 2.9113e-03   1.6%  0.0000e+00   0.0%  3.550e+01   4.3%  5.984e+02       0.1%  2.200e+01  10.2%
 2:     KSPSetUp 0: 7.7349e-04   0.4%  9.9360e+03   0.0%  0.000e+00   0.0%  0.000e+00       0.0%  2.000e+00   0.9%
 3:     KSPSolve 0: 1.7690e-03   1.0%  2.9673e+05   0.0%  1.520e+02  18.4%  1.800e+02       0.1%  3.900e+01  18.1%
 4:  Load System 1: 1.0056e-01  54.4%  0.0000e+00   0.0%  3.700e+01   4.5%  5.657e+05      62.4%  2.200e+01  10.2%
 5:     KSPSetUp 1: 5.6883e-03   3.1%  2.1205e+07   2.3%  0.000e+00   0.0%  0.000e+00       0.0%  2.000e+00   0.9%
 6:     KSPSolve 1: 7.2578e-02  39.3%  9.1979e+08  97.7%  6.000e+02  72.8%  2.098e+04      37.5%  1.200e+02  55.6%

------------------------------------------------------------------------------------------------------------------------

.... [Summary of various phases, see part II below] ...

------------------------------------------------------------------------------------------------------------------------

Object Type      Creations   Destructions   (Reports information only for process 0.)
...
--- Event Stage 3: KSPSolve 0

        Matrix        0             4
        Vector       20            30
     Index Set        0             3
   Vec Scatter        0             1
 Krylov Solver        0             2
Preconditioner        0             2
```

We next focus on the summaries for the various phases of the
computation, as given in the table within
the following {any}`listing <listing_exparprof2>`.
The summary for each
phase presents the maximum times and flop rates over all processes, as
well as the ratio of maximum to minimum times and flop rates for all
processes. A ratio of approximately 1 indicates that computations within
a given phase are well balanced among the processes; as the ratio
increases, the balance becomes increasingly poor. Also, the total
computational rate (in units of MFlop/sec) is given for each phase in
the final column of the phase summary table.

$$
{\rm Total\: Mflop/sec} \:=\: 10^{-6} * ({\rm sum\; of\; flop\; over\; all\; processors})/({\rm max\; time\; over\; all\; processors})
$$

*Note*: Total computational rates $<$ 1 MFlop/sec are listed as 0 in
this column of the phase summary table. Additional statistics for each
phase include the total number of messages sent, the average message
length, and the number of global reductions.

(listing_exparprof2)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
   -matload_block_size 3 -pc_type bjacobi -options_left

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
.... [Overall summary, see part I] ...

Phase summary info:
   Count: number of times phase was executed
   Time and Flop/sec: Max - maximum over all processors
                      Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: optional user-defined stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10^6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Phase                Count   Time (sec)    Flop/sec                           --- Global ---   --- Stage ----   Total
                       Max  Ratio  Max     Ratio    Mess    AvgLen   Reduct   %T %F %M %L %R   %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------
...

--- Event Stage 5: KSPSetUp 1

MatLUFactorNum       1 1.0 3.6440e-03 1.1 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0  62 100  0  0   0  5819
MatILUFactorSym      1 1.0 1.7111e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  26   0  0  0   0     0
MatGetRowIJ          1 1.0 1.1921e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0   0  0  0   0     0
MatGetOrdering       1 1.0 3.0041e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1   0  0  0   0     0
KSPSetUp             2 1.0 6.6495e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  1   9   0  0  0 100     0
PCSetUp              2 1.0 5.4271e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90 100  0  0   0  3907
PCSetUpOnBlocks      1 1.0 5.3999e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90 100  0  0   0  3927

--- Event Stage 6: KSPSolve 1

MatMult             60 1.0 2.4068e-02 1.1 6.54e+07 1.0 6.0e+02 2.1e+04 0.0e+00 12 27 73 37  0  32  28 100 100   0 10731
MatSolve            61 1.0 1.9177e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 10 25  0  0  0  26  26   0   0   0 12491
VecMDot             59 1.0 1.4741e-02 1.3 4.86e+07 1.0 0.0e+00 0.0e+00 5.9e+01  7 21  0  0 27  18  21   0   0  49 13189
VecNorm             61 1.0 3.0417e-03 1.4 3.29e+06 1.0 0.0e+00 0.0e+00 6.1e+01  1  1  0  0 28   4   1   0   0  51  4332
VecScale            61 1.0 9.9802e-04 1.0 1.65e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1   1   0   0   0  6602
VecCopy              2 1.0 5.9128e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0   0   0   0   0     0
VecSet              64 1.0 8.0323e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1   0   0   0   0     0
VecAXPY              3 1.0 7.4387e-05 1.1 1.62e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0   0   0   0   0  8712
VecMAXPY            61 1.0 8.8558e-03 1.1 5.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  5 22  0  0  0  12  23   0   0   0 23393
VecScatterBegin     60 1.0 9.6416e-04 1.8 0.00e+00 0.0 6.0e+02 2.1e+04 0.0e+00  0  0 73 37  0   1   0 100 100   0     0
VecScatterEnd       60 1.0 6.1543e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   8   0   0   0   0     0
VecNormalize        61 1.0 4.2675e-03 1.3 4.94e+06 1.0 0.0e+00 0.0e+00 6.1e+01  2  2  0  0 28   5   2   0   0  51  4632
KSPGMRESOrthog      59 1.0 2.2627e-02 1.1 9.72e+07 1.0 0.0e+00 0.0e+00 5.9e+01 11 41  0  0 27  29  42   0   0  49 17185
KSPSolve             1 1.0 7.2577e-02 1.0 2.31e+08 1.0 6.0e+02 2.1e+04 1.2e+02 39 98 73 37 56  99 100 100 100 100 12673
PCSetUpOnBlocks      1 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0   0   0   0   0     0
PCApply             61 1.0 2.0427e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 25  0  0  0  28  26   0   0   0 11726
------------------------------------------------------------------------------------------------------------------------
.... [Conclusion of overall summary, see part I] ...
```

As discussed in the preceding section, the performance summaries for
higher-level PETSc routines include the statistics for the lower levels
of which they are made up. For example, the communication within
matrix-vector products `MatMult()` consists of vector scatter
operations, as given by the routines `VecScatterBegin()` and
`VecScatterEnd()`.
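
As an illustration of the Total Mflop/s formula above, consider the `MatMult`
row of stage 6. Approximating the sum of flop over all processes by four times
the tabulated maximum (the max/min flop ratio is 1.0, so the sum is only
slightly smaller) gives

$$
10^{-6} \times \frac{4 \times 6.54 \times 10^{7}}{2.4068 \times 10^{-2}} \approx 1.1 \times 10^{4} \; {\rm Mflop/s},
$$

consistent with the 10731 Mflop/s reported in the final column.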

The final data presented are the percentages of the various statistics
(time (`%T`), flop (`%F`), messages (`%M`), average message
length (`%L`), and reductions (`%R`)) for each event relative to the
total computation and to any user-defined stages (discussed in
{any}`sec_profstages`). These statistics can aid in
optimizing performance, since they indicate the sections of code that
could benefit from various kinds of tuning.
{any}`ch_performance` gives suggestions about achieving good
performance with PETSc codes.

The additional option `-log_view_memory` causes the display of additional columns of information about how much
memory was allocated and freed during each logged event. This is useful
for understanding which phases of a computation require the most memory.

(sec_mpelogs)=

### Using `-log_mpe` with Jumpshot

It is also possible to use the *Jumpshot* package
{cite}`upshot` to visualize PETSc events. This package comes
with the MPE software, which is part of the MPICH
{cite}`mpich-web-page` implementation of MPI. The option

```none
-log_mpe [logfile]
```

creates a logfile of events appropriate for viewing with *Jumpshot*. The
user can either use the default logging file or specify a name via
`logfile`.
Events can be deactivated as described in
{any}`sec_deactivate`.

The user can also log MPI events. To do this, simply consider the PETSc
application as any MPI application, and follow the MPI implementation’s
instructions for logging MPI calls. For example, when using MPICH, this
merely requires adding `-llmpich` to the library list *before*
`-lmpich`.

(sec_nestedevents)=

### Profiling Nested Events

It is possible to output the PETSc logging information in a nested format
where the hierarchy of events is explicit. This output can be generated
either as an XML file or as a text file in a format suitable for viewing as
a flame graph.

One can generate the XML output by passing the option `-log_view :[logfilename]:ascii_xml`.
It can be viewed by copying `${PETSC_DIR}/share/petsc/xml/performance_xml2html.xsl`
into the current directory, then opening the logfile in your browser.

The flame graph output can be generated with the option `-log_view :[logfile]:ascii_flamegraph`.
It can then be visualized with either [FlameGraph](https://github.com/brendangregg/FlameGraph)
or [speedscope](https://www.speedscope.app).
A flame graph can be visualized directly from
stdout using, for example,
[ImageMagick's display utility](https://imagemagick.org/script/display.php):

```
cd $PETSC_DIR/src/sys/tests
make ex30
mpiexec -n 2 ./ex30 -log_view ::ascii_flamegraph | flamegraph | display
```

Note that user-defined stages (see {any}`sec_profstages`) will be ignored when
using this nested format.

(sec_profileuser)=

## Profiling Application Codes

PETSc automatically logs object creation, times, and floating-point
counts for the library routines. Users can easily supplement this
information by profiling their application codes as well.
The basic
steps involved in logging a user-defined portion of code, called an
*event*, are shown in the code fragment below:

```
PetscLogEvent USER_EVENT;
PetscClassId classid;
PetscLogDouble user_event_flops;

PetscClassIdRegister("class name",&classid);
PetscLogEventRegister("User event name",classid,&USER_EVENT);
PetscLogEventBegin(USER_EVENT,0,0,0,0);
/* code segment to monitor */
PetscLogFlops(user_event_flops);
PetscLogEventEnd(USER_EVENT,0,0,0,0);
```

One must register the event by calling `PetscLogEventRegister()`,
which assigns a unique integer to identify the event for profiling
purposes:

```
PetscLogEventRegister(const char string[],PetscClassId classid,PetscLogEvent *e);
```

Here `string` is a user-defined event name, and `classid` is the class
identifier to associate with the event, obtained with
`PetscClassIdRegister()` (or the class ID of an existing PETSc class); one
should see the manual page for
details. The argument returned in `e` should then be passed to the
`PetscLogEventBegin()` and `PetscLogEventEnd()` routines.

Events are logged by using the pair

```
PetscLogEventBegin(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
PetscLogEventEnd(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
```

The four objects are the PETSc objects that are most closely associated
with the event. For instance, in a matrix-vector product they would be
the matrix and the two vectors. These objects can be omitted by
specifying 0 for `o1` - `o4`. The code between these two routine
calls will be automatically timed and logged as part of the specified
event.

Events are collective by default on the communicator of `o1` (if present).
They can be made not collective by using `PetscLogEventSetCollective()`.
No synchronization is performed on collective events in optimized builds unless
the command line option `-log_sync` is used; however, we do check for collective
semantics in debug mode.

The user can log the number of floating-point operations for this
segment of code by calling

```
PetscLogFlops(number of flop for this code segment);
```

between the calls to `PetscLogEventBegin()` and
`PetscLogEventEnd()`. This value will automatically be added to the
global flop counter for the entire program.
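
As a concrete sketch (the arrays `x` and `y`, the length `n`, and the loop
itself are hypothetical; `USER_EVENT` is assumed registered as shown above),
flops for a simple averaging sweep over the interior points could be logged as:

```
PetscLogEventBegin(USER_EVENT, 0, 0, 0, 0);
for (i = 1; i < n - 1; i++) y[i] = 0.5 * (x[i - 1] + x[i + 1]); /* segment being monitored */
PetscLogFlops(2.0 * (n - 2)); /* one addition and one multiplication per interior point */
PetscLogEventEnd(USER_EVENT, 0, 0, 0, 0);
```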

(sec_profstages)=

## Profiling Multiple Sections of Code

By default, the profiling produces a single set of statistics for all
code between the `PetscInitialize()` and `PetscFinalize()` calls
within a program. One can independently monitor several "stages" of code
by switching among the various stages with the commands

```
PetscLogStagePush(PetscLogStage stage);
PetscLogStagePop();
```

See the manual pages for details.
The command

```
PetscLogStageRegister(const char *name,PetscLogStage *stage)
```

allows one to associate a name with a stage; these names are printed
whenever summaries are generated with `-log_view`. The following code fragment uses three profiling
stages within a program.

```
PetscLogStage stagenum0, stagenum1, stagenum2;

PetscInitialize(&argc, &args, NULL, NULL);
/* register each stage once; PetscLogStageRegister() creates a new stage on every call */
PetscLogStageRegister("Stage 0 of Code", &stagenum0);
PetscLogStageRegister("Stage 1 of Code", &stagenum1);
PetscLogStageRegister("Stage 2 of Code", &stagenum2);
PetscLogStagePush(stagenum0);
/* stage 0 of code here */
PetscLogStagePop();
for (i = 0; i < ntimes; i++) {
  PetscLogStagePush(stagenum1);
  /* stage 1 of code here */
  PetscLogStagePop();
  PetscLogStagePush(stagenum2);
  /* stage 2 of code here */
  PetscLogStagePop();
}
PetscFinalize();
```

The listings above
show output generated by
`-log_view` for a program that employs several profiling stages. In
particular, this program is subdivided into six stages: loading a matrix and right-hand-side vector from a binary file,
setting up the preconditioner, and solving the linear system; this
sequence is then repeated for a second linear system. For simplicity,
the second listing contains output only for
stages 5 and 6 (the setup and solution of the second linear system), which comprise the
part of this computation of most interest to us in terms of performance
monitoring.
This code organization (solving a small linear system
followed by a larger system) enables generation of more accurate
profiling statistics for the second system by overcoming the often
considerable overhead of paging, as discussed in
{any}`sec_profaccuracy`.

(sec_deactivate)=

## Restricting Event Logging

By default, all PETSc operations are logged. To enable or disable the
PETSc logging of individual events, one uses the commands

```
PetscLogEventActivate(PetscLogEvent event);
PetscLogEventDeactivate(PetscLogEvent event);
```

The `event` may be either a predefined PETSc event (as listed in the
file `$PETSC_DIR/include/petsclog.h`) or one obtained with
`PetscLogEventRegister()` (as described in
{any}`sec_profileuser`).

PETSc also provides routines that deactivate (or activate) logging for
entire components of the library.
Currently, the components that support
such logging (de)activation are `Mat` (matrices), `Vec` (vectors),
`KSP` (linear solvers, including `KSP` and `PC`), and `SNES`
(nonlinear solvers):

```
PetscLogEventDeactivateClass(MAT_CLASSID);
PetscLogEventDeactivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventDeactivateClass(VEC_CLASSID);
PetscLogEventDeactivateClass(SNES_CLASSID);
```

and

```
PetscLogEventActivateClass(MAT_CLASSID);
PetscLogEventActivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventActivateClass(VEC_CLASSID);
PetscLogEventActivateClass(SNES_CLASSID);
```

(sec_petscinfo)=

## Interpreting `-info` Output: Informative Messages

Users can activate the printing of verbose information about algorithms,
data structures, etc. to the screen by using the option `-info` or by
calling `PetscInfoAllow(PETSC_TRUE)`. Such logging, which is used
throughout the PETSc libraries, can aid the user in understanding
algorithms and tuning program performance. For example, as discussed in
{any}`sec_matsparse`, `-info` activates the printing of
information about memory allocation during matrix assembly.

One can selectively turn off informative messages about any of the basic
PETSc objects (e.g., `Mat`, `SNES`) with the command

```
PetscInfoDeactivateClass(int object_classid)
```

where `object_classid` is one of `MAT_CLASSID`, `SNES_CLASSID`,
etc. Messages can be reactivated with the command

```
PetscInfoActivateClass(int object_classid)
```

Such deactivation can be useful when one wishes to view information
about higher-level PETSc libraries (e.g., `TS` and `SNES`) without
seeing all lower-level data as well (e.g., `Mat`).

One can turn on or off logging for particular classes at runtime:

```console
-info [filename][:[~]<list,of,classnames>[:[~]self]]
```

The `list,of,classnames` is a comma-separated list (with no spaces) of the classes for which one wishes to view information, for
example `vec,ksp`; information on all other classes is not displayed. A preceding `~` inverts the selection: the listed classes are hidden and all other classes are displayed.

`self` indicates to display information on objects associated with `PETSC_COMM_SELF`, while `~self` indicates to display information only for parallel objects.

See `PetscInfo()` for links to all the available info operations.
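
As a concrete illustration of this filtering syntax (the executable name `./app` is a placeholder), the following displays information only about `Vec` and `KSP` objects, and only for parallel ones:

```console
./app -info :vec,ksp:~self
```

while the next invocation logs everything except `Mat` information to the file `petsc.log`:

```console
./app -info petsc.log:~mat
```
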

Application programmers can log their own messages, as well, by using the
routine

```
PetscInfo(void* obj,char *message,...)
```

where `obj` is the PETSc object associated most closely with the
logging statement, `message`. For example, in the line search Newton
methods, we use a statement such as

```
PetscInfo(snes,"Cubic step, lambda %g\n",lambda);
```

## Time

PETSc application programmers can access the wall clock time directly
with the command

```
PetscLogDouble time;
PetscCall(PetscTime(&time));
```

which returns the current time in seconds since the epoch, and is
commonly implemented with `MPI_Wtime`. A floating point number is
returned in order to express fractions of a second. In addition, as
discussed in {any}`sec_profileuser`, PETSc can automatically
profile user-defined segments of code.

## Saving Output to a File

All output from PETSc programs (including informative messages,
profiling information, and convergence data) can be saved to a file by
using the command line option `-history [filename]`.
If no file name
is specified, the output is stored in the file
`${HOME}/.petschistory`. Note that this option only saves output
printed with the `PetscPrintf()` and `PetscFPrintf()` commands, not
the standard `printf()` and `fprintf()` statements.

(sec_profaccuracy)=

## Accurate Profiling and Paging Overheads

One factor that often plays a significant role in profiling a code is
paging by the operating system. Generally, when running a program, only
the few pages required to start it are loaded into memory rather than the
entire executable. When the execution proceeds to code segments that are
not in memory, a page fault occurs, prompting the required pages to be
loaded from disk (a very slow process). This activity distorts the
results significantly. (The paging effects are noticeable in the log
files generated by `-log_mpe`, which is described in
{any}`sec_mpelogs`.)

To eliminate the effects of paging when profiling the performance of a
program, we have found an effective procedure is to run the *exact same
code* on a small dummy problem before running it on the actual problem
of interest. We thus ensure that all code required by a solver is loaded
into memory during solution of the small problem.
When the code proceeds
to the actual (larger) problem of interest, all required pages have
already been loaded into main memory, so that the performance numbers
are not distorted.

When this procedure is used in conjunction with the user-defined stages
of profiling described in {any}`sec_profstages`, we can
focus easily on the problem of interest. For example, we used this
technique in the program
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>
to generate the timings within
{any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`.
In this case, the profiled code
of interest (solving the linear system for the larger problem) occurs
within event stages 5 and 6. {any}`sec_parperformance`
provides details about interpreting such profiling data.

In particular, the macros

```
PetscPreLoadBegin(PetscBool flag,char* stagename)
PetscPreLoadStage(char *stagename)
```

and

```
PetscPreLoadEnd()
```

can be used to easily convert a regular PETSc program to one that uses
preloading.
The command line options `-preload true` and
`-preload false` may be used to turn preloading on and off at run
time for PETSc programs that use these macros.

## NVIDIA Nsight Systems profiling

Nsight Systems generates profiling data for a CUDA executable
via the command `nsys`.
For example, in serial,

```bash
nsys profile -t nvtx,cuda -o file --stats=true --force-overwrite true ./a.out
```

will generate a file `file.qdstrm` with performance data that is
annotated with PETSc events (methods) and Kokkos device kernel names.
The Nsight Systems GUI, `nsys-ui`, can be used to navigate this file
(<https://developer.nvidia.com/nsight-systems>). The GUI
shows a timeline of code performance information (kernels,
memory allocations and frees, CPU-GPU communication) along with high-level data, such as the times and sizes
of memory copies, in a popup window when the mouse
hovers over a section.
To view the data, start `nsys-ui` without any arguments and then `Import` the
`.qdstrm` file in the GUI.
A side effect of this viewing process is the generation of a file `file.nsys-rep`, which can be viewed directly
with `nsys-ui` in the future.

For an MPI parallel job, only one process should call `nsys`: for
example, have rank zero produce the `nsys` data and have all other
ranks call the executable directly. With MPICH
or Open MPI, a parallel job on 4 MPI tasks can be run as:

```console
mpiexec -n 1 nsys profile -t nvtx,cuda -o file_name --stats=true --force-overwrite true ./a.out : -n 3 ./a.out
```

Note: The Nsight GUI can open profiling reports from elsewhere. For
example, a report from a compute node can be analyzed on your local
machine, but care should be taken to use the exact same version of
Nsight Systems that generated the report.
To check the version of Nsight on the compute node, run `nsys-ui` and
note the version number at the top of the window.

(sec_using_tau)=

## Using TAU

TAU profiles can be generated without the need for instrumentation through the
use of the perfstubs package. PETSc by default is configured with `--with-tau-perfstubs`.
To generate profiles with TAU, first set up TAU:

```bash
wget http://tau.uoregon.edu/tau.tgz
tar xzf tau.tgz && cd tau-*
./configure -cc=mpicc -c++=mpicxx -mpi -bfd=download -unwind=download && make install
export PATH=<tau dir>/x86_64/bin:$PATH
```

For more information on configuring TAU, see [http://tau.uoregon.edu](http://tau.uoregon.edu).
Next, run your program with TAU. For instance, to profile `ex56`,

```bash
cd $PETSC_DIR/src/snes/tutorials
make ex56
mpirun -n 4 tau_exec -T mpi ./ex56 -log_perfstubs <args>
```

This should produce four `profile.*` files with profile data that can be
viewed with `paraprof`/`pprof`:

```shell
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           26        1,838           1       41322    1838424 .TAU application
 73.2            1        1,345           2         168     672950 SNESSolve
 62.2            3        1,142           2        1282     571442 SNESJacobianEval
 62.0        1,136        1,138           2          76     569494 DMPlexJacobianFE
 60.1        0.046        1,105           1          32    1105001 Solve 1
 15.2           87          279           5       11102      55943 Mesh Setup
 13.2        0.315          241           1          32     241765 Solve 0
  7.8           80          144       38785       38785          4 MPI_Allreduce()
  7.0           69          128           6       43386      21491 DualSpaceSetUp
  6.2            1          114           4          54      28536 PCSetUp
  6.0           12          110           2         892      55407 PCSetUp_GAMG+
  3.9           70           70           1           0      70888 MPI_Init_thread()
  3.7           68           68       41747           0          2 MPI Collective Sync
  3.6            8           66           4        3536      16548 SNESFunctionEval
  2.6           45           48         171         171        281 MPI_Bcast()
  1.9           34           34        7836           0          4 MPI_Barrier()
  1.8        0.567           33           2          68      16912 GAMG Coarsen
```

```{raw} html
<hr>
```

```{eval-rst}
.. bibliography:: /petsc.bib
   :filter: docname in docnames
```