
bandwidth limited. The speed of a simulation depends more on the total achievable [^achievable-foot…
22 …andwidth achievable when running `n` independent threads or processes on non-overlapping memory re…
…he timing is done with `MPI_Wtime()`. A call to the timer takes less than 3e-08 seconds, significa…
independent non-overlapping memory STREAMS model still provides useful information.
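The timer-overhead claim above is easy to sanity-check. The sketch below uses Python's `time.perf_counter()` as a stand-in for `MPI_Wtime()` (an assumption for illustration; the measured cost will differ from the 3e-08 seconds quoted above), amortizing many timer calls over a single measured interval:

```python
import time

def timer_overhead(calls: int = 1_000_000) -> float:
    """Estimate the per-call cost of the wall-clock timer by timing
    a tight loop of timer calls and dividing by the call count."""
    start = time.perf_counter()
    for _ in range(calls):
        time.perf_counter()
    return (time.perf_counter() - start) / calls

overhead = timer_overhead()
print(f"per-call timer overhead: {overhead:.2e} seconds")
```

The same amortization technique applies to `MPI_Wtime()` or any other timer whose single-call cost is below the clock's resolution.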
…tained on a given system indicates the likely performance of memory bandwidth-limited computations.
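To make concrete how the bandwidth and speedup figures discussed below are derived from raw timings, here is a small Python sketch; the timings in it are made-up placeholders, not measurements from the system studied here. The triad kernel `a[i] = b[i] + s*c[i]` touches three double-precision arrays per pass (two reads, one write), i.e. 24 bytes per element:

```python
def triad_bandwidth_gbs(n: int, seconds: float) -> float:
    """Bandwidth of one STREAMS triad pass: 24 bytes moved per element
    (two 8-byte reads and one 8-byte write)."""
    return 24.0 * n / seconds / 1.0e9

def speedup(t1: float, tn: float) -> float:
    """Speedup of an n-process run relative to the single-process run."""
    return t1 / tn

# Hypothetical timings (seconds) for one triad pass over n elements.
n = 80_000_000
t1, t8 = 0.150, 0.030
print(f"1 process:   {triad_bandwidth_gbs(n, t1):.1f} GB/s")
print(f"8 processes: {triad_bandwidth_gbs(n, t8):.1f} GB/s  "
      f"(speedup {speedup(t1, t8):.1f}x)")
```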
… plots the total memory bandwidth achieved and the speedup for runs on an Intel system whose detai…
…cores are utilized. Also, note that the improvement may, unintuitively, be non-monotone when adding
There are three important concepts needed to understand memory bandwidth-limited computing.
- Thread or process **binding** to hardware subsets of the shared memory node. The Unix operating s…
…during a computation. This migration is managed by the operating system (OS). [^memorymigration-fo…
- Thread or process **mapping** (assignment) to hardware subsets when more threads or processes are…
…non-uniform memory access (**NUMA**), meaning the memory latency or bandwidth for any particular c…
- In addition to mapping, one must ensure that each thread or process **uses data on the closest me…
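On Linux, binding can also be requested programmatically. The Python sketch below uses the Linux-only `os.sched_setaffinity()` interface to illustrate the binding concept only; it is not how `mpiexec` or an OpenMP runtime implement their bindings, and the core numbers `{0, 1}` are arbitrary:

```python
import os

def bind_to_cores(cores):
    """Pin the calling process to the given hardware threads. Returns the
    resulting affinity set, or None where the Linux-only interface is
    unavailable (e.g. macOS) or none of the requested cores are usable."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    usable = set(cores) & os.sched_getaffinity(0)
    if not usable:
        return None
    os.sched_setaffinity(0, usable)   # the OS will no longer migrate us off these cores
    return os.sched_getaffinity(0)

print("bound to:", bind_to_cores({0, 1}))
```

Once bound, the OS scheduler keeps the process on the listed hardware threads, which is exactly what the launcher-level binding options below arrange for every rank or thread.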
- MPI, options to `mpiexec`

  - --bind-to hwthread | core | l1cache | l2cache | l3cache | socket | numa | board
  - --map-by hwthread | core | socket | numa | board | node
  - --report-bindings
  - --cpu-list list of cores
  - --cpu-set list of sets of cores
- OpenMP, environment variables

  - OMP_NUM_THREADS=n
  - OMP_PROC_BIND=close | spread
  - OMP_PLACES="list of sets of cores", for example \{0:2},\{2:2},\{32:2},\{34:2}
  - OMP_DISPLAY_ENV=false | true
  - OMP_DISPLAY_AFFINITY=false | true
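An OpenMP runtime reads these variables from the environment of the process it runs in, so a driver script can set them per launch. A minimal Python sketch of such a driver follows; the child process here is just a stand-in that echoes what an OpenMP runtime would see, and the chosen values are arbitrary:

```python
import os
import subprocess
import sys

# Build the environment for one run of the application.
env = dict(os.environ)
env.update({"OMP_NUM_THREADS": "4",
            "OMP_PROC_BIND": "spread",
            "OMP_PLACES": "cores"})

# Stand-in for the real application: report the settings it inherited.
child = ("import os; "
         "print(os.environ['OMP_NUM_THREADS'], os.environ['OMP_PROC_BIND'])")
out = subprocess.run([sys.executable, "-c", child],
                     env=env, capture_output=True, text=True,
                     check=True).stdout.strip()
print(out)
```

Setting the variables on the launcher's command line, as in the `OMP_NUM_THREADS=4 mpiexec …` examples below, achieves the same effect.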
…often the best bindings for large PETSc applications. The Linux commands `lscpu` and `numactl -H` …
CPU time reported by cpu_time()              6.1660000000000006E-002
Wall clock time reported by system_clock()   1.8335562000000000E-002
Wall clock time reported by omp_get_wtime()  1.8330062011955306E-002

$ OMP_NUM_THREADS=4 mpiexec -n 1 ./ex69f
CPU time reported by cpu_time()              7.2290999999999994E-002
Wall clock time reported by system_clock()   7.2356641999999999E-002
Wall clock time reported by omp_get_wtime()  7.2353694995399565E-002
$ OMP_NUM_THREADS=4 mpiexec --bind-to numa -n 1 --map-by core ./ex69f
CPU time reported by cpu_time()              7.0021000000000000E-002
Wall clock time reported by system_clock()   1.8489282999999999E-002
Wall clock time reported by omp_get_wtime()  1.8486462999135256E-002
Consider also the `mpiexec` option `--map-by socket:pe=$OMP_NUM_THREADS` to ensure each thread gets…
$ OMP_PROC_BIND=spread OMP_NUM_THREADS=4 mpiexec -n 1 ./ex69f
CPU time reported by cpu_time()              7.2841999999999990E-002
Wall clock time reported by system_clock()   7.2946015000000003E-002
Wall clock time reported by omp_get_wtime()  7.2942997998325154E-002
on an Intel system with the Intel ifort compiler and observed the recorded CPU time for the second loop…
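The distinction between the two timers matters in both directions: a multithreaded loop accumulates CPU time across all of its threads, so `cpu_time()` can exceed the wall-clock time, while a process that merely waits accrues wall-clock time with almost no CPU time. The second case is easy to demonstrate; the sketch below uses Python only because it is self-contained here — the Fortran intrinsics behave analogously:

```python
import time

# time.process_time() counts CPU time charged to this process;
# time.perf_counter() measures elapsed wall-clock time.
cpu0, wall0 = time.process_time(), time.perf_counter()
time.sleep(0.2)   # idle wait: consumes wall-clock time, almost no CPU time
cpu = time.process_time() - cpu0
wall = time.perf_counter() - wall0
print(f"CPU time  : {cpu:.4f} s")
print(f"Wall clock: {wall:.4f} s")
```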
We now present a detailed study of a particular Intel Ice Lake system, the Intel(R) Xeon(R) Platinum…
It is running the Rocky Linux 8.8 (Green Obsidian) distribution. The compilers
used are GNU 12.2, Intel(R) oneAPI Compiler 2023.0.0 with both icc and icx, and NVIDIA nvhpc/23.1. …
- gcc -O3 -march=native
- icc -O3 -march=native
- icx -O3 -ffinite-math-only (the -xHost option, which replaces -march=native, crashed the compiler …
- nvc -O3 -march=native
…I and OpenMP with their default bindings and with the MPI binding of `--bind-to core --map-by numa`
Comprehensive STREAMS performance on Intel system
…st optimization level that produced the same results as gcc and icx: `-O1` without `-march=native`.
:alt: STREAMS benchmark icc -O1
STREAMS benchmark icc -O1
- For MPI, the default binding and mapping on this system produce results that are as good as prov…
- For OpenMP with gcc, the default binding is better than using `spread`, because `spread` has a bug. Fo…
- We have no explanation for why the improvement in speedup for gcc, icx, and nvc slows down be…
and -O3 optimization flags with a smaller N of 80,000,000. macOS provides no public API for setting…
…rc/ksp/ksp/tutorials/ex45.c.html">PETSc application</a> which solves a three-dimensional Poisson p…
the time to solve the linear system with the preconditioner, and the time for the matrix-vector pro…
`-da_refine 6 -pc_type gamg -log_view`. This study did not attempt to tune the default `PCGAMG` par…
…required from the communication pattern that results from the different three-dimensional parallel
We now run the same PETSc application using the MPI linear solver server mode, enabled with `-mpi_line…
…in the pure MPI version, the vectors are partitioned directly from the three-dimensional grid; the…
cubes, this minimizes the inter-process communication, especially in the matrix-vector product. I…
…tor mapping to a sub-cube of the domain. This would require, of course, a far more complicated Ope…
… (parallel efficiency) of a memory bandwidth-limited application** on a shared memory Linux system.
For the Apple M2, we present the results using Unix shared-memory communication of the matrix and v…
[^achievable-footnote]: Achievable memory bandwidth is the actual bandwidth one can obtain
[^memorymigration-footnote]: Data can also be migrated among different memory sockets during a comp…
```{eval-rst}