(ch_profiling)=

# Profiling

PETSc includes a consistent, lightweight scheme for profiling application programs. The PETSc routines automatically log performance data if certain options are specified at runtime. The user can also log information about application codes for a complete picture of performance.

In addition, as described in {any}`sec_PetscInfo`, PETSc provides a mechanism for printing informative messages about computations. {any}`sec_profbasic` introduces the various profiling options in PETSc, while the remainder of the chapter focuses on details such as monitoring application codes and tips for accurate profiling.

(sec_profbasic)=

## Basic Profiling Information

The profiling options include the following:

- `-log_view [:filename]` - Prints an ASCII version of performance data at the
  program’s conclusion. These statistics are comprehensive and concise
  and require little overhead; thus, `-log_view` is intended as the
  primary means of monitoring the performance of PETSc codes. See {any}`sec_ploginfo`.
- `-info [infofile]` - Prints verbose information about code to
  stdout or an optional file. This option provides details about
  algorithms, data structures, etc.
  Since the overhead of printing such
  output slows a code, this option should not be used when evaluating a
  program’s performance. See {any}`sec_PetscInfo`.
- `-log_trace [logfile]` - Traces the beginning and ending of all
  PETSc events. This option, which can be used in conjunction with
  `-info`, is useful for seeing where a program is hanging without
  running it in the debugger. See `PetscLogTraceBegin()`.

As discussed in {any}`sec_mpelogs`, additional profiling can be done with MPE.

(sec_ploginfo)=

### Interpreting `-log_view` Output: The Basics

As shown in the {any}`listing <listing_exprof>` in {any}`sec_profiling_programs`, the option `-log_view [:filename]` activates printing of profile data to standard output or an ASCII file at the conclusion of a program. See `PetscLogView()` for all the possible output options.

We print performance data for each routine, organized by PETSc libraries, followed by any user-defined events (discussed in {any}`sec_profileuser`). For each routine, the output data include the maximum time and floating point operation (flop) rate over all processes. Information about parallel performance is also included, as discussed in the following section.
For the purpose of PETSc floating point operation counting, we define one *flop* as one operation of any of the following types: multiplication, division, addition, or subtraction. For example, one `VecAXPY()` operation, which computes $y = \alpha x + y$ for vectors of length $N$, requires $2N$ flop (consisting of $N$ additions and $N$ multiplications). Bear in mind that flop rates present only a limited view of performance, since memory loads and stores are the real performance barrier.

For simplicity, the remainder of this discussion focuses on interpreting profile data for the `KSP` library, which provides the linear solvers at the heart of the PETSc package. Recall the hierarchical organization of the PETSc library, as shown in {any}`fig_library`. Each `KSP` solver is composed of a `PC` (preconditioner) and a `KSP` (Krylov subspace) part, which are in turn built on top of the `Mat` (matrix) and `Vec` (vector) modules. Thus, operations in the `KSP` module are composed of lower-level operations in these packages. Note also that the nonlinear solvers library, `SNES`, is built on top of the `KSP` module, and the timestepping library, `TS`, is in turn built on top of `SNES`.

We briefly discuss interpretation of the sample output in the {any}`listing <listing_exprof>`, which was generated by solving a linear system on one process using restarted GMRES and ILU preconditioning.
The linear solvers in `KSP` consist of two basic phases, `KSPSetUp()` and `KSPSolve()`, each of which consists of a variety of actions, depending on the particular solution technique. For the case of using the `PCILU` preconditioner and `KSPGMRES` Krylov subspace method, the breakdown of PETSc routines is listed below. As indicated by the levels of indentation, the operations in `KSPSetUp()` include all of the operations within `PCSetUp()`, which in turn include `MatILUFactor()`, and so on.

- `KSPSetUp` - Set up linear solver

  - `PCSetUp` - Set up preconditioner

    - `MatILUFactor` - Factor matrix

      - `MatILUFactorSymbolic` - Symbolic factorization phase
      - `MatLUFactorNumeric` - Numeric factorization phase

- `KSPSolve` - Solve linear system

  - `PCApply` - Apply preconditioner

    - `MatSolve` - Forward/backward triangular solves

  - `KSPGMRESOrthog` - Orthogonalization in GMRES

    - `VecDot` or `VecMDot` - Inner products
    - `VecAXPY` or `VecMAXPY` - Vector updates

  - `MatMult` - Matrix-vector product

    - `MatMultAdd` - Matrix-vector product + vector addition

  - `VecScale`, `VecNorm`, `VecAXPY`, `VecCopy`, ...

The summaries printed via `-log_view` reflect this routine hierarchy.
1177f296bb3SBarry SmithFor example, the performance summaries for a particular high-level 1187f296bb3SBarry Smithroutine such as `KSPSolve()` include all of the operations accumulated 1197f296bb3SBarry Smithin the lower-level components that make up the routine. 1207f296bb3SBarry Smith 1217f296bb3SBarry SmithThe output produced with `-log_view` is flat, meaning that the hierarchy 1227f296bb3SBarry Smithof PETSc operations is not completely clear. For a 1237f296bb3SBarry Smithparticular problem, the user should generally have an idea of the basic 1247f296bb3SBarry Smithoperations that are required for its implementation (e.g., which 1257f296bb3SBarry Smithoperations are performed when using GMRES and ILU, as described above), 1267f296bb3SBarry Smithso that interpreting the `-log_view` data should be relatively 1277f296bb3SBarry Smithstraightforward. 1287f296bb3SBarry SmithIf this is problematic then it is also possible to examine 1297f296bb3SBarry Smiththe profiling information in a nested format. For more information see 1307f296bb3SBarry Smith{any}`sec_nestedevents`. 1317f296bb3SBarry Smith 1327f296bb3SBarry Smith(sec_parperformance)= 1337f296bb3SBarry Smith 1347f296bb3SBarry Smith### Interpreting `-log_view` Output: Parallel Performance 1357f296bb3SBarry Smith 1367f296bb3SBarry SmithWe next discuss performance summaries for parallel programs, as shown 1377f296bb3SBarry Smithwithin {any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`, 1387f296bb3SBarry Smithwhich presents the 1397f296bb3SBarry Smithoutput generated by the `-log_view` option. The program that generated 1407f296bb3SBarry Smiththis data is 1417f296bb3SBarry Smith<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>. 
The code loads a matrix and right-hand-side vector from a binary file and then solves the resulting linear system; the program then repeats this process for a second linear system. This particular case was run on four processors of an Intel x86_64 Linux cluster, using restarted GMRES and the block Jacobi preconditioner, where each block was solved with ILU. The two input files `medium` and `arco6` can be obtained from [datafiles](https://gitlab.com/petsc/datafiles); see {any}`petsc_repositories`.

The first {any}`listing <listing_exparprof>` presents an overall performance summary, including times, floating-point operations, computational rates, and message-passing activity (such as the number and size of messages sent and collective operations). Summaries for various user-defined stages of monitoring (as discussed in {any}`sec_profstages`) are also given. Information about the various phases of computation then follows (as shown separately here in the second {any}`listing <listing_exparprof2>`). Finally, a summary of object creation and destruction is presented.
(listing_exparprof)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
   -matload_block_size 3 -pc_type bjacobi -options_left

Number of iterations = 19
Residual norm 1.088292e-05
Number of iterations = 59
Residual norm 3.871022e-02
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex10 on a intel-bdw-opt named beboplogin4 with 4 processors, by jczhang Mon Apr 23 13:36:54 2018
Using PETSc Development Git Revision: v3.9-163-gbe3efd42  Git Date: 2018-04-16 10:45:40 -0500

                         Max       Max/Min     Avg       Total
Time (sec):           1.849e-01    1.00002   1.849e-01
Objects:              1.060e+02    1.00000   1.060e+02
Flops:                2.361e+08    1.00684   2.353e+08  9.413e+08
Flops/sec:            1.277e+09    1.00685   1.273e+09  5.091e+09
MPI Msg Count:        2.360e+02    1.34857   2.061e+02  8.245e+02
MPI Msg Len (bytes):  1.256e+07    2.24620   4.071e+04  3.357e+07
MPI Reductions:       2.160e+02    1.00000

Summary of Stages:  ----- Time ------  ----- Flop -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                       Avg    %Total     Avg    %Total    counts  %Total     Avg       %Total     counts  %Total
 0:      Main Stage: 5.9897e-04   0.3%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00      0.0%  2.000e+00   0.9%
 1:   Load System 0: 2.9113e-03   1.6%  0.0000e+00   0.0%  3.550e+01   4.3%  5.984e+02      0.1%  2.200e+01  10.2%
 2:      KSPSetUp 0: 7.7349e-04   0.4%  9.9360e+03   0.0%  0.000e+00   0.0%  0.000e+00      0.0%  2.000e+00   0.9%
 3:      KSPSolve 0: 1.7690e-03   1.0%  2.9673e+05   0.0%  1.520e+02  18.4%  1.800e+02      0.1%  3.900e+01  18.1%
 4:   Load System 1: 1.0056e-01  54.4%  0.0000e+00   0.0%  3.700e+01   4.5%  5.657e+05     62.4%  2.200e+01  10.2%
 5:      KSPSetUp 1: 5.6883e-03   3.1%  2.1205e+07   2.3%  0.000e+00   0.0%  0.000e+00      0.0%  2.000e+00   0.9%
 6:      KSPSolve 1: 7.2578e-02  39.3%  9.1979e+08  97.7%  6.000e+02  72.8%  2.098e+04     37.5%  1.200e+02  55.6%

------------------------------------------------------------------------------------------------------------------------

.... [Summary of various phases, see part II below] ...

------------------------------------------------------------------------------------------------------------------------

Object Type      Creations   Destructions   (Reports information only for process 0.)
...
--- Event Stage 3: KSPSolve 0

              Matrix     0              4
              Vector    20             30
           Index Set     0              3
         Vec Scatter     0              1
       Krylov Solver     0              2
      Preconditioner     0              2
```

We next focus on the summaries for the various phases of the computation, as given in the table within the following {any}`listing <listing_exparprof2>`. The summary for each phase presents the maximum times and flop rates over all processes, as well as the ratio of maximum to minimum times and flop rates for all processes.
A ratio of approximately 1 indicates that computations within a given phase are well balanced among the processes; as the ratio increases, the balance becomes increasingly poor. Also, the total computational rate (in units of MFlop/sec) is given for each phase in the final column of the phase summary table.

$$
{\rm Total\: Mflop/sec} \:=\: 10^{-6} * ({\rm sum\; of\; flop\; over\; all\; processors})/({\rm max\; time\; over\; all\; processors})
$$

*Note*: Total computational rates $<$ 1 MFlop/sec are listed as 0 in this column of the phase summary table. Additional statistics for each phase include the total number of messages sent, the average message length, and the number of global reductions.

(listing_exparprof2)=

```none
mpiexec -n 4 ./ex10 -f0 medium -f1 arco6 -ksp_gmres_classicalgramschmidt -log_view -mat_type baij \
   -matload_block_size 3 -pc_type bjacobi -options_left

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
.... [Overall summary, see part I] ...

Phase summary info:
   Count: number of times phase was executed
   Time and Flop/sec: Max - maximum over all processors
                      Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length
   Reduct: number of global reductions
   Global: entire computation
   Stage: optional user-defined stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10^6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Phase               Count  Time (sec)     Flop/sec                 --- Global ---  --- Stage ----  Total
                      Max  Ratio  Max     Ratio  Mess  AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
------------------------------------------------------------------------------------------------------------------------
...

--- Event Stage 5: KSPSetUp 1

MatLUFactorNum       1 1.0 3.6440e-03 1.1 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0  62100  0  0  0  5819
MatILUFactorSym      1 1.0 1.7111e-03 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  26  0  0  0  0     0
MatGetRowIJ          1 1.0 1.1921e-06 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering       1 1.0 3.0041e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
KSPSetUp             2 1.0 6.6495e-04 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  1   9  0  0  0100     0
PCSetUp              2 1.0 5.4271e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3907
PCSetUpOnBlocks      1 1.0 5.3999e-03 1.2 5.30e+06 1.0 0.0e+00 0.0e+00 0.0e+00  3  2  0  0  0  90100  0  0  0  3927

--- Event Stage 6: KSPSolve 1

MatMult             60 1.0 2.4068e-02 1.1 6.54e+07 1.0 6.0e+02 2.1e+04 0.0e+00 12 27 73 37  0  32 28100100  0 10731
MatSolve            61 1.0 1.9177e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 10 25  0  0  0  26 26  0  0  0 12491
VecMDot             59 1.0 1.4741e-02 1.3 4.86e+07 1.0 0.0e+00 0.0e+00 5.9e+01  7 21  0  0 27  18 21  0  0 49 13189
VecNorm             61 1.0 3.0417e-03 1.4 3.29e+06 1.0 0.0e+00 0.0e+00 6.1e+01  1  1  0  0 28   4  1  0  0 51  4332
VecScale            61 1.0 9.9802e-04 1.0 1.65e+06 1.0 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0   1  1  0  0  0  6602
VecCopy              2 1.0 5.9128e-05 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet              64 1.0 8.0323e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
VecAXPY              3 1.0 7.4387e-05 1.1 1.62e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  8712
VecMAXPY            61 1.0 8.8558e-03 1.1 5.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  5 22  0  0  0  12 23  0  0  0 23393
VecScatterBegin     60 1.0 9.6416e-04 1.8 0.00e+00 0.0 6.0e+02 2.1e+04 0.0e+00  0  0 73 37  0   1  0100100  0     0
VecScatterEnd       60 1.0 6.1543e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   8  0  0  0  0     0
VecNormalize        61 1.0 4.2675e-03 1.3 4.94e+06 1.0 0.0e+00 0.0e+00 6.1e+01  2  2  0  0 28   5  2  0  0 51  4632
KSPGMRESOrthog      59 1.0 2.2627e-02 1.1 9.72e+07 1.0 0.0e+00 0.0e+00 5.9e+01 11 41  0  0 27  29 42  0  0 49 17185
KSPSolve             1 1.0 7.2577e-02 1.0 2.31e+08 1.0 6.0e+02 2.1e+04 1.2e+02 39 98 73 37 56  99100100100100 12673
PCSetUpOnBlocks      1 1.0 9.5367e-07 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
PCApply             61 1.0 2.0427e-02 1.0 5.99e+07 1.0 0.0e+00 0.0e+00 0.0e+00 11 25  0  0  0  28 26  0  0  0 11726
------------------------------------------------------------------------------------------------------------------------
.... [Conclusion of overall summary, see part I] ...
```

As discussed in the preceding section, the performance summaries for higher-level PETSc routines include the statistics for the lower levels of which they are made up. For example, the communication within matrix-vector products `MatMult()` consists of vector scatter operations, as given by the routines `VecScatterBegin()` and `VecScatterEnd()`.
The final data presented are the percentages of the various statistics (time (`%T`), flop/sec (`%F`), messages (`%M`), average message length (`%L`), and reductions (`%R`)) for each event relative to the total computation and to any user-defined stages (discussed in {any}`sec_profstages`). These statistics can aid in optimizing performance, since they indicate the sections of code that could benefit from various kinds of tuning. {any}`ch_performance` gives suggestions about achieving good performance with PETSc codes.

The additional option `-log_view_memory` causes the display of additional columns of information about how much memory was allocated and freed during each logged event. This is useful for understanding which phases of a computation require the most memory.

(sec_mpelogs)=

### Using `-log_mpe` with Jumpshot

It is also possible to use the *Jumpshot* package {cite}`upshot` to visualize PETSc events. This package comes with the MPE software, which is part of the MPICH {cite}`mpich-web-page` implementation of MPI. The option

```none
-log_mpe [logfile]
```

creates a logfile of events appropriate for viewing with *Jumpshot*. The user can either use the default logging file or specify a name via `logfile`.
Events can be deactivated as described in {any}`sec_deactivate`.

The user can also log MPI events. To do this, simply treat the PETSc application as any other MPI application, and follow the MPI implementation’s instructions for logging MPI calls. For example, when using MPICH, this merely requires adding `-llmpich` to the library list *before* `-lmpich`.

(sec_nestedevents)=

### Profiling Nested Events

It is possible to output the PETSc logging information in a nested format where the hierarchy of events is explicit. This output can be generated either as an XML file or as a text file in a format suitable for viewing as a flame graph.

One can generate the XML output by passing the option `-log_view :[logfilename]:ascii_xml`. It can be viewed by copying `${PETSC_DIR}/share/petsc/xml/performance_xml2html.xsl` into the current directory, then opening the logfile in your browser.

The flame graph output can be generated with the option `-log_view :[logfile]:ascii_flamegraph`. It can then be visualized with either [FlameGraph](https://github.com/brendangregg/FlameGraph) or [speedscope](https://www.speedscope.app).
A flame graph can also be visualized directly from stdout using, for example, [ImageMagick's display utility](https://imagemagick.org/script/display.php):

```
cd $PETSC_DIR/src/sys/tests
make ex30
mpiexec -n 2 ./ex30 -log_view ::ascii_flamegraph | flamegraph | display
```

Note that user-defined stages (see {any}`sec_profstages`) will be ignored when using this nested format.

(sec_profileuser)=

## Profiling Application Codes

PETSc automatically logs object creation, times, and floating-point counts for the library routines. Users can easily supplement this information by profiling their application codes as well.
The basic steps involved in logging a user-defined portion of code, called an *event*, are shown in the code fragment below:

```
PetscLogEvent  USER_EVENT;
PetscClassId   classid;
PetscLogDouble user_event_flops;

PetscClassIdRegister("class name",&classid);
PetscLogEventRegister("User event name",classid,&USER_EVENT);
PetscLogEventBegin(USER_EVENT,0,0,0,0);
/* code segment to monitor */
PetscLogFlops(user_event_flops);
PetscLogEventEnd(USER_EVENT,0,0,0,0);
```

One must register the event by calling `PetscLogEventRegister()`, which assigns a unique integer to identify the event for profiling purposes:

```
PetscLogEventRegister(const char string[],PetscClassId classid,PetscLogEvent *e);
```

Here `string` is a user-defined event name, and `classid` is the class identifier obtained from `PetscClassIdRegister()`, which associates the event with a class of objects (and, for *Jumpshot* logging, determines its display color; see {any}`sec_mpelogs`); one should see the manual page for details. The argument returned in `e` should then be passed to the `PetscLogEventBegin()` and `PetscLogEventEnd()` routines.
Events are logged by using the pair

```
PetscLogEventBegin(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
PetscLogEventEnd(int event,PetscObject o1,PetscObject o2,PetscObject o3,PetscObject o4);
```

The four objects are the PETSc objects that are most closely associated with the event. For instance, in a matrix-vector product they would be the matrix and the two vectors. These objects can be omitted by specifying 0 for `o1` - `o4`. The code between these two routine calls will be automatically timed and logged as part of the specified event.

Events are collective by default on the communicator of `o1` (if present). They can be made not collective by using `PetscLogEventSetCollective()`. No synchronization is performed on collective events in optimized builds unless the command line option `-log_sync` is used; however, we do check for collective semantics in debug mode.

The user can log the number of floating-point operations for this segment of code by calling

```
PetscLogFlops(number of flop for this code segment);
```

between the calls to `PetscLogEventBegin()` and `PetscLogEventEnd()`. This value will automatically be added to the global flop counter for the entire program.
(sec_profstages)=

## Profiling Multiple Sections of Code

By default, the profiling produces a single set of statistics for all code between the `PetscInitialize()` and `PetscFinalize()` calls within a program. One can independently monitor several "stages" of code by switching among the various stages with the commands

```
PetscLogStagePush(PetscLogStage stage);
PetscLogStagePop();
```

see the manual pages for details. The command

```
PetscLogStageRegister(const char *name,PetscLogStage *stage)
```

allows one to associate a name with a stage; these names are printed whenever summaries are generated with `-log_view`. The following code fragment uses three profiling stages within a program.

```
PetscLogStage stagenum0,stagenum1,stagenum2;

PetscInitialize(int *argc,char ***args,0,0);
/* stage 0 of code here */
PetscLogStageRegister("Stage 0 of Code",&stagenum0);
PetscLogStageRegister("Stage 1 of Code",&stagenum1);
PetscLogStageRegister("Stage 2 of Code",&stagenum2);
for (i=0; i<ntimes; i++) {
  PetscLogStagePush(stagenum1);
  /* stage 1 of code here */
  PetscLogStagePop();
  PetscLogStagePush(stagenum2);
  /* stage 2 of code here */
  PetscLogStagePop();
}
PetscFinalize();
```

Note that each stage is registered only once, before the loop; registering a stage on every iteration would create a new stage each time through the loop.

The listings above show output generated by `-log_view` for a program that employs several profiling stages. In particular, this program is subdivided into six stages: loading a matrix and right-hand-side vector from a binary file, setting up the preconditioner, and solving the linear system; this sequence is then repeated for a second linear system. For simplicity, the second listing contains output only for stages 5 and 6 (linear solve of the second system), which comprise the part of this computation of most interest to us in terms of performance monitoring.
This code organization (solving a small linear system
followed by a larger system) enables generation of more accurate
profiling statistics for the second system by overcoming the often
considerable overhead of paging, as discussed in
{any}`sec_profaccuracy`.

(sec_deactivate)=

## Restricting Event Logging

By default, all PETSc operations are logged. To enable or disable the
PETSc logging of individual events, one uses the commands

```
PetscLogEventActivate(PetscLogEvent event);
PetscLogEventDeactivate(PetscLogEvent event);
```

The `event` may be either a predefined PETSc event (as listed in the
file `$PETSC_DIR/include/petsclog.h`) or one obtained with
`PetscLogEventRegister()` (as described in
{any}`sec_profileuser`).

PETSc also provides routines that deactivate (or activate) logging for
entire components of the library.
Currently, the components that support
such logging (de)activation are `Mat` (matrices), `Vec` (vectors),
`KSP` (linear solvers, including `KSP` and `PC`), and `SNES`
(nonlinear solvers):

```
PetscLogEventDeactivateClass(MAT_CLASSID);
PetscLogEventDeactivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventDeactivateClass(VEC_CLASSID);
PetscLogEventDeactivateClass(SNES_CLASSID);
```

and

```
PetscLogEventActivateClass(MAT_CLASSID);
PetscLogEventActivateClass(KSP_CLASSID); /* includes PC and KSP */
PetscLogEventActivateClass(VEC_CLASSID);
PetscLogEventActivateClass(SNES_CLASSID);
```

(sec_petscinfo)=

## Interpreting `-info` Output: Informative Messages

Users can activate the printing of verbose information about algorithms,
data structures, etc. to the screen by using the option `-info` or by
calling `PetscInfoAllow(PETSC_TRUE)`. Such logging, which is used
throughout the PETSc libraries, can aid the user in understanding
algorithms and tuning program performance. For example, as discussed in
{any}`sec_matsparse`, `-info` activates the printing of
information about memory allocation during matrix assembly.

One can selectively turn off informative messages about any of the basic
PETSc objects (e.g., `Mat`, `SNES`) with the command

```
PetscInfoDeactivateClass(int object_classid)
```

where `object_classid` is one of `MAT_CLASSID`, `SNES_CLASSID`,
etc. Messages can be reactivated with the command

```
PetscInfoActivateClass(int object_classid)
```

Such deactivation can be useful when one wishes to view information
about higher-level PETSc libraries (e.g., `TS` and `SNES`) without
seeing all lower-level data as well (e.g., `Mat`).

One can turn logging for particular classes on or off at runtime with

```console
-info [filename][:[~]<list,of,classnames>[:[~]self]]
```

The `list,of,classnames` is a comma-separated list (with no spaces) of the
classes for which one wishes to view information, for
example `vec,ksp`; information on all other classes is then suppressed. A leading `~` inverts the selection: the listed classes are suppressed and all other classes are displayed.

`self` indicates to display information on objects that are associated with `PETSC_COMM_SELF`, while `~self` indicates to display information only for parallel objects.

See `PetscInfo()` for links to all the info operations that are available.
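For example, assuming an application binary named `./app` (the name here is purely illustrative), the filtering syntax above can be combined as follows:

```console
# print info only for Vec and KSP objects to stdout
./app -info :vec,ksp

# write info for all classes except Mat to the file info.log,
# restricted to parallel (non-PETSC_COMM_SELF) objects
./app -info info.log:~mat:~self
```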

Application programmers can log their own messages, as well, by using the
routine

```
PetscInfo(void *obj,const char message[],...)
```

where `obj` is the PETSc object associated most closely with the
logging statement, `message`. For example, in the line search Newton
methods, we use a statement such as

```
PetscInfo(snes,"Cubic step, lambda %g\n",lambda);
```

## Time

PETSc application programmers can access the wall clock time directly
with the command

```
PetscLogDouble time;
PetscCall(PetscTime(&time));
```

which returns the current time in seconds since the epoch, and is
commonly implemented with `MPI_Wtime()`. A floating-point number is
returned in order to express fractions of a second. In addition, as
discussed in {any}`sec_profileuser`, PETSc can automatically
profile user-defined segments of code.

## Saving Output to a File

All output from PETSc programs (including informative messages,
profiling information, and convergence data) can be saved to a file by
using the command line option `-history [filename]`.
If no file name
is specified, the output is stored in the file
`${HOME}/.petschistory`. Note that this option only saves output
printed with the `PetscPrintf()` and `PetscFPrintf()` commands, not
the standard `printf()` and `fprintf()` statements.

(sec_profaccuracy)=

## Accurate Profiling and Paging Overheads

One factor that often plays a significant role in profiling a code is
paging by the operating system. Generally, when running a program, only
a few pages required to start it are loaded into memory rather than the
entire executable. When the execution proceeds to code segments that are
not in memory, a page fault occurs, prompting the required pages to be
loaded from disk (a very slow process). This activity distorts the
results significantly. (The paging effects are noticeable in the log
files generated by `-log_mpe`, which is described in
{any}`sec_mpelogs`.)

To eliminate the effects of paging when profiling the performance of a
program, we have found an effective procedure is to run the *exact same
code* on a small dummy problem before running it on the actual problem
of interest. We thus ensure that all code required by a solver is loaded
into memory during solution of the small problem.
When the code proceeds
to the actual (larger) problem of interest, all required pages have
already been loaded into main memory, so the performance numbers
are not distorted.

When this procedure is used in conjunction with the user-defined stages
of profiling described in {any}`sec_profstages`, we can
easily focus on the problem of interest. For example, we used this
technique in the program
<a href="PETSC_DOC_OUT_ROOT_PLACEHOLDER/src/ksp/ksp/tutorials/ex10.c.html">KSP Tutorial ex10</a>
to generate the timings within
{any}`listing <listing_exparprof>` and {any}`listing <listing_exparprof2>`.
In this case, the profiled code
of interest (solving the linear system for the larger problem) occurs
within event stages 5 and 6. {any}`sec_parperformance`
provides details about interpreting such profiling data.

In particular, the macros

```
PetscPreLoadBegin(PetscBool flag,char *stagename)
PetscPreLoadStage(char *stagename)
```

and

```
PetscPreLoadEnd()
```

can be used to easily convert a regular PETSc program to one that uses
preloading. The command line options `-preload true` and
`-preload false` may be used to turn preloading on and off at run
time for PETSc programs that use these macros.
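As a sketch of how these macros are typically arranged (following the pattern of KSP Tutorial ex10; the stage names and the solver calls shown in comments are placeholders):

```c
/* When flag is PETSC_TRUE, the code between PetscPreLoadBegin() and
   PetscPreLoadEnd() is executed twice: the first pass loads the
   required pages into memory, so the second pass yields undistorted
   timings. Each PetscPreLoadStage() starts a named profiling stage. */
PetscPreLoadBegin(PETSC_TRUE, "Load system");
  /* load or assemble the matrix and right-hand side here */
PetscPreLoadStage("KSPSetUp");
  /* e.g., KSPSetUp(ksp); */
PetscPreLoadStage("KSPSolve");
  /* e.g., KSPSolve(ksp, b, x); */
PetscPreLoadEnd();
```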

## NVIDIA Nsight Systems profiling

Nsight Systems will generate profiling data for a CUDA executable
with the command `nsys`.
For example, in serial,

```bash
nsys profile -t nvtx,cuda -o file --stats=true --force-overwrite true ./a.out
```

will generate a file `file.qdstrm` with performance data that is
annotated with PETSc events (methods) and Kokkos device kernel names.
The Nsight Systems GUI, `nsys-ui`, can be used to navigate this file
(<https://developer.nvidia.com/nsight-systems>). The GUI
displays a timeline of code performance information such as kernels,
memory allocations and frees, and CPU-GPU communication, with high-level
data such as the times and sizes of memory copies shown in a popup
window when the mouse hovers over a section.
To view the data, start `nsys-ui` without any arguments and then `Import` the
`.qdstrm` file in the GUI.
A side effect of this viewing process is the generation of a file `file.nsys-rep`, which can be viewed directly
with `nsys-ui` in the future.

For an MPI parallel job, only one process should call `nsys`:
for example, have rank zero output `nsys` data and have all other
ranks call the executable directly.
For example, with MPICH
or Open MPI, we can run a parallel job on 4 MPI tasks as:

```console
mpiexec -n 1 nsys profile -t nvtx,cuda -o file_name --stats=true --force-overwrite true ./a.out : -n 3 ./a.out
```

Note: The Nsight GUI can open profiling reports from elsewhere. For
example, a report from a compute node can be analyzed on your local
machine, but care should be taken to use the exact same version of
Nsight Systems that generated the report.
To check the version of Nsight on the compute node, run `nsys-ui` and
note the version number at the top of the window.

## ROCProfiler profiling

For AMD GPUs, log events registered to PETSc can be displayed as ranges in trace files generated by `rocprof` by running with the `-log_roctx` flag.
See the `rocprof` [documentation](https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest) for details of how to run the profiler.
At a minimum, you will need:

```console
mpiexec -n 1 rocprofv3 --marker-trace -o file_name -- ./path/to/application -log_roctx
```

(sec_using_tau)=

## Using TAU

TAU profiles can be generated without the need for instrumentation through the
use of the perfstubs package. PETSc by default is configured with `--with-tau-perfstubs`.

To generate profiles with TAU, first set up TAU:

```bash
wget http://tau.uoregon.edu/tau.tgz
tar xzf tau.tgz && cd tau-*        # extract the distribution before configuring
./configure -cc=mpicc -c++=mpicxx -mpi -bfd=download -unwind=download && make install
export PATH=<tau dir>/x86_64/bin:$PATH
```

For more information on configuring TAU, see [http://tau.uoregon.edu](http://tau.uoregon.edu).
Next, run your program with TAU. For instance, to profile `ex56`,

```bash
cd $PETSC_DIR/src/snes/tutorials
make ex56
mpiexec -n 4 tau_exec -T mpi ./ex56 -log_perfstubs <args>
```

This should produce four `profile.*` files with profile data that can be
viewed with `paraprof` or `pprof`:

```shell
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
100.0           26        1,838           1       41322    1838424 .TAU application
 73.2            1        1,345           2         168     672950 SNESSolve
 62.2            3        1,142           2        1282     571442 SNESJacobianEval
 62.0        1,136        1,138           2          76     569494 DMPlexJacobianFE
 60.1        0.046        1,105           1          32    1105001 Solve 1
 15.2           87          279           5       11102      55943 Mesh Setup
 13.2        0.315          241           1          32     241765 Solve 0
  7.8           80          144       38785       38785          4 MPI_Allreduce()
  7.0           69          128           6       43386      21491 DualSpaceSetUp
  6.2            1          114           4          54      28536 PCSetUp
  6.0           12          110           2         892      55407 PCSetUp_GAMG+
  3.9           70           70           1           0      70888 MPI_Init_thread()
  3.7           68           68       41747           0          2 MPI Collective Sync
  3.6            8           66           4        3536      16548 SNESFunctionEval
  2.6           45           48         171         171        281 MPI_Bcast()
  1.9           34           34        7836           0          4 MPI_Barrier()
  1.8        0.567           33           2          68      16912 GAMG Coarsen
```

```{raw} html
<hr>
```

```{eval-rst}
.. bibliography:: /petsc.bib
   :filter: docname in docnames
```