xref: /libCEED/doc/papers/joss/paper.md (revision 9e9158cb3c288939d04d8533311581e1441f379c)
1*b4cdc43dSValeria Barra---
2*b4cdc43dSValeria Barratitle: 'libCEED: Fast algebra for high-order element-based discretizations'
3*b4cdc43dSValeria Barratags:
4*b4cdc43dSValeria Barra  - high-performance computing
5*b4cdc43dSValeria Barra  - high-order methods
6*b4cdc43dSValeria Barra  - finite elements
7*b4cdc43dSValeria Barra  - spectral elements
8*b4cdc43dSValeria Barra  - matrix-free
9*b4cdc43dSValeria Barraauthors:
10*b4cdc43dSValeria Barra  - name: Jed Brown
11*b4cdc43dSValeria Barra    orcid: 0000-0002-9945-0639
12*b4cdc43dSValeria Barra    affiliation: 1
13*b4cdc43dSValeria Barra  - name: Ahmad Abdelfattah
14*b4cdc43dSValeria Barra    orcid: 0000-0001-5054-4784
15*b4cdc43dSValeria Barra    affiliation: 3
16*b4cdc43dSValeria Barra  - name: Valeria Barra
17*b4cdc43dSValeria Barra    orcid: 0000-0003-1129-2056
18*b4cdc43dSValeria Barra    affiliation: 1
19*b4cdc43dSValeria Barra  - name: Natalie Beams
20*b4cdc43dSValeria Barra    orcid: 0000-0001-6060-4082
21*b4cdc43dSValeria Barra    affiliation: 3
22*b4cdc43dSValeria Barra  - name: Jean-Sylvain Camier
23*b4cdc43dSValeria Barra    orcid: 0000-0003-2421-1999
24*b4cdc43dSValeria Barra    affiliation: 2
25*b4cdc43dSValeria Barra  - name: Veselin Dobrev
26*b4cdc43dSValeria Barra    orcid: 0000-0003-1793-5622
27*b4cdc43dSValeria Barra    affiliation: 2
28*b4cdc43dSValeria Barra  - name: Yohann Dudouit
29*b4cdc43dSValeria Barra    orcid: 0000-0001-5831-561X
30*b4cdc43dSValeria Barra    affiliation: 2
31*b4cdc43dSValeria Barra  - name:  Leila Ghaffari
32*b4cdc43dSValeria Barra    orcid: 0000-0002-0965-214X
33*b4cdc43dSValeria Barra    affiliation: 1
34*b4cdc43dSValeria Barra  - name: Tzanio Kolev
35*b4cdc43dSValeria Barra    orcid: 0000-0002-2810-3090
36*b4cdc43dSValeria Barra    affiliation: 2
37*b4cdc43dSValeria Barra  - name: David Medina
38*b4cdc43dSValeria Barra    affiliation: 4
39*b4cdc43dSValeria Barra  - name: Will Pazner
40*b4cdc43dSValeria Barra    orcid: 0000-0003-4885-2934
41*b4cdc43dSValeria Barra    affiliation: 2
42*b4cdc43dSValeria Barra  - name: Thilina Ratnayaka
43*b4cdc43dSValeria Barra    orcid: 0000-0001-6102-6560
44*b4cdc43dSValeria Barra    affiliation: 5
45*b4cdc43dSValeria Barra  - name: Jeremy Thompson
46*b4cdc43dSValeria Barra    orcid: 0000-0003-2980-0899
47*b4cdc43dSValeria Barra    affiliation: 1
48*b4cdc43dSValeria Barra  - name: Stan Tomov
49*b4cdc43dSValeria Barra    orcid: 0000-0002-5937-7959
50*b4cdc43dSValeria Barra    affiliation: 3
51*b4cdc43dSValeria Barraaffiliations:
52*b4cdc43dSValeria Barra - name: University of Colorado at Boulder
53*b4cdc43dSValeria Barra   index: 1
54*b4cdc43dSValeria Barra - name: Lawrence Livermore National Laboratory
55*b4cdc43dSValeria Barra   index: 2
56*b4cdc43dSValeria Barra - name: University of Tennessee
57*b4cdc43dSValeria Barra   index: 3
58*b4cdc43dSValeria Barra - name: Occalytics LLC
59*b4cdc43dSValeria Barra   index: 4
60*b4cdc43dSValeria Barra - name: University of Illinois at Urbana-Champaign
61*b4cdc43dSValeria Barra   index: 5
62*b4cdc43dSValeria Barradate: 9 July 2021
63*b4cdc43dSValeria Barrabibliography: paper.bib
64*b4cdc43dSValeria Barra---
65*b4cdc43dSValeria Barra
66*b4cdc43dSValeria Barra# Summary and statement of need
67*b4cdc43dSValeria Barra
68*b4cdc43dSValeria BarraFinite element methods are widely used to solve partial differential equations (PDE) in science and engineering, but their standard implementation [@dealII92;@libMeshPaper;@LoggMardalWells2012] relies on assembling sparse matrices.
69*b4cdc43dSValeria BarraSparse matrix multiplication and triangular operations perform a scalar multiply and add for each nonzero entry, just 2 floating point operations (flops) per scalar that must be loaded from memory [@williams2009roofline].
70*b4cdc43dSValeria BarraModern hardware is capable of nearly 100 flops per scalar streamed from memory [@kruppcomparison] so sparse matrix operations cannot achieve more than about 2% utilization of arithmetic units.
71*b4cdc43dSValeria BarraMatrix assembly becomes even more problematic when the polynomial degree $p$ of the basis functions is increased, resulting in $O(p^d)$ storage and $O(p^{2d})$ compute per degree of freedom (DoF) in $d$ dimensions.
72*b4cdc43dSValeria BarraMethods pioneered by the spectral element community [@Orszag:1980; @deville2002highorder] exploit problem structure to reduce costs to $O(1)$ storage and $O(p)$ compute per DoF, with very high utilization of modern CPUs and GPUs.
73*b4cdc43dSValeria BarraUnfortunately, high-quality implementations have been relegated to applications and intrusive frameworks that are often difficult to extend to new problems or incorporate into legacy applications, especially when strong preconditioners are required.
74*b4cdc43dSValeria Barra
75*b4cdc43dSValeria Barra`libCEED`, the Code for Efficient Extensible Discretization [@libceed-user-manual], is a lightweight library that provides a purely algebraic interface for linear and nonlinear operators and preconditioners with element-based discretizations.
76*b4cdc43dSValeria Barra`libCEED` provides portable performance via run-time selection of implementations optimized for CPUs and GPUs, including support for just-in-time (JIT) compilation.
77*b4cdc43dSValeria BarraIt is designed for convenient use in new and legacy software, and offers interfaces in C99 [@C99-lang], Fortran77 [@Fortran77-lang], Python [@Python-lang], Julia [@Julia-lang], and Rust [@Rust-lang].
78*b4cdc43dSValeria BarraUsers and library developers can integrate `libCEED` at a low level into existing applications in place of existing matrix-vector products without significant refactoring of their own discretization infrastructure.
79*b4cdc43dSValeria BarraAlternatively, users can utilize integrated `libCEED` support in MFEM [@MFEMlibrary; @mfem-paper].
80*b4cdc43dSValeria Barra
81*b4cdc43dSValeria BarraIn addition to supporting applications and discretization libraries, `libCEED` provides a platform for performance engineering and co-design, as well as an algebraic interface for solvers research like adaptive $p$-multigrid, much like how sparse matrix libraries enable development and deployment of algebraic multigrid solvers.
82*b4cdc43dSValeria Barra
83*b4cdc43dSValeria Barra# Concepts and interface
84*b4cdc43dSValeria Barra
85*b4cdc43dSValeria BarraConsider finite element discretization of a problem based on a weak form with one weak derivative: find $u$ such that
86*b4cdc43dSValeria Barra
87*b4cdc43dSValeria Barra$$ v^T F(u) := \int_\Omega v \cdot f_0(u, \nabla u) + \nabla v \!:\! f_1(u, \nabla u) = 0 \quad \forall v, $$
88*b4cdc43dSValeria Barra
89*b4cdc43dSValeria Barrawhere the functions $f_0$ and $f_1$ define the physics and possible stabilization of the problem [@Brown:2010] and the functions $u$ and $v$ live in a suitable space.
90*b4cdc43dSValeria BarraIntegrals in the weak form are evaluated by summing over elements $e$,
91*b4cdc43dSValeria Barra
92*b4cdc43dSValeria Barra$$ F(u) = \sum_e \mathcal E_e^T B_e^T W_e f(B_e \mathcal E_e u), $$
93*b4cdc43dSValeria Barra
94*b4cdc43dSValeria Barrawhere $\mathcal E_e$ restricts to element $e$, $B_e$ evaluates solution values and derivatives to quadrature points, $f$ acts independently at quadrature points, and $W_e$ is a (diagonal) weighting at quadrature points.
95*b4cdc43dSValeria BarraBy grouping the operations $W_e$ and $f$ into a point-block diagonal $D$ and stacking the restrictions $\mathcal E_e$ and basis actions $B_e$ for each element, we can express the global residual in operator notation (\autoref{fig:decomposition}), where $\mathcal P$ is an optional external operator, such as the parallel restriction in MPI-based [@gropp2014using] solvers.
96*b4cdc43dSValeria BarraInhomogeneous Neumann, Robin, and nonlinear boundary conditions can be added in a similar fashion by adding terms integrated over boundary faces while Dirichlet boundary conditions can be added by setting the target values prior to applying the operator representing the weak form.
97*b4cdc43dSValeria BarraSimilar face integral terms can also be used to represent discontinuous Galerkin formulations.
98*b4cdc43dSValeria Barra
99*b4cdc43dSValeria Barra![`libCEED` uses a logical decomposition to define element-based discretizations, with optimized implementations of the action and preconditioning ingredients. \label{fig:decomposition}](img/libCEED-2-trim.pdf)
100*b4cdc43dSValeria Barra
101*b4cdc43dSValeria Barra`libCEED`'s native C interface is object-oriented, providing data types for each logical object in the decomposition.
102*b4cdc43dSValeria Barra
103*b4cdc43dSValeria BarraSymbol        `libCEED` type             Description
104*b4cdc43dSValeria Barra------        ------------             -----------
105*b4cdc43dSValeria Barra$D$           `CeedQFunction`          User-defined action at quadrature points
106*b4cdc43dSValeria Barra$B$           `CeedBasis`              Basis evaluation to quadrature (dense/structured)
107*b4cdc43dSValeria Barra$\mathcal E$  `CeedElemRestriction`    Restriction to each element (sparse/boolean)
108*b4cdc43dSValeria Barra$A$           `CeedOperator`           Linear or nonlinear operator acting on L-vectors
109*b4cdc43dSValeria Barra
110*b4cdc43dSValeria Barra`libCEED` implementations ("backends") are free to reorder and fuse computational steps (including eliding memory to store intermediate representations) so long as the mathematical properties of the operator $A$ are preserved.
111*b4cdc43dSValeria BarraA `CeedOperator` is composed of one or more operators defined as in \autoref{fig:decomposition}, and acts on a `CeedVector`, which typically encapsulates zero-copy access to host or device memory provided by the caller.
112*b4cdc43dSValeria BarraThe element restriction $\mathcal E$ requires mesh topology and a numbering of DoFs, and may be a no-op when data is already composed by element (such as with discontinuous Galerkin methods).
113*b4cdc43dSValeria BarraThe discrete basis $B$ is the purely algebraic expression of a finite element basis (shape functions) and quadrature; it often possesses structure that is exploited to speed up its action.
114*b4cdc43dSValeria BarraSome constructors are provided for arbitrary polynomial degree $H^1$ Lagrange bases with a tensor-product representation due to the computational efficiency of computing solution values and derivatives at quadrature points via tensor contractions.
115*b4cdc43dSValeria BarraHowever, the user can define a `CeedBasis` for arbitrary element topology including tetrahedra, prisms, and other realizations of abstract polytopes, by providing quadrature weights and the matrices used to compute solution values and derivatives at quadrature points from the DoFs on the element.
116*b4cdc43dSValeria Barra
117*b4cdc43dSValeria BarraThe physics (weak form) is expressed through `CeedQFunction`, which can either be defined by the user or selected from a gallery distributed with `libCEED`.
118*b4cdc43dSValeria BarraThese pointwise functions do not depend on element resolution, topology, or basis degree (see \autoref{fig:schematic}), in contrast to systems like FEniCS where UFL forms specify basis degree at compile time.
119*b4cdc43dSValeria BarraThis isolation is valuable for $hp$-refinement and adaptivity (where $h$ commonly denotes the average element size and $p$ the polynomial degree of the basis functions; see @babuska1994hpfem) and $p$-multigrid solvers; mixed-degree, mixed-topology, and $h$-nonconforming finite element methods are readily expressed by composition.
120*b4cdc43dSValeria BarraAdditionally, a single source implementation (in vanilla C or C++) for the `CeedQFunction`s can be used on CPUs or GPUs (transparently using the @NVRTCwebsite, HIPRTC, or OCCA [@OCCAwebsite] run-time compilation features).
121*b4cdc43dSValeria Barra
122*b4cdc43dSValeria Barra`libCEED` provides computation of the true operator diagonal for preconditioning with Jacobi and Chebyshev as well as direct assembly of sparse matrices (e.g., for coarse operators in multigrid) and construction of $p$-multigrid prolongation and restriction operators.
123*b4cdc43dSValeria BarraPreconditioning matrix-free operators is an active area of research; support for domain decomposition methods and inexact subdomain solvers based on the fast diagonalization method [@lottes2005hms] are in active development.
124*b4cdc43dSValeria Barra
125*b4cdc43dSValeria Barra![A schematic of element restriction and basis applicator operators for
126*b4cdc43dSValeria Barraelements with different topology. This sketch shows the independence of Q-functions
127*b4cdc43dSValeria Barra(in this case representing a Laplacian) on element resolution, topology, and basis degree.\label{fig:schematic}](img/QFunctionSketch.pdf)
128*b4cdc43dSValeria Barra
129*b4cdc43dSValeria Barra# High-level languages
130*b4cdc43dSValeria Barra
131*b4cdc43dSValeria Barra`libCEED` provides high-level interfaces in Python, Julia, and Rust, each of which is maintained and tested as part of the main repository, but distributed through each language's respective package manager.
132*b4cdc43dSValeria Barra
133*b4cdc43dSValeria BarraThe Python interface uses CFFI, the C Foreign Function Interface [@python-cffi]. CFFI allows reuse of most C declarations and requires only a minimal adaptation of some of them. The C and Python APIs are mapped in a nearly 1:1 correspondence. For instance, a `CeedVector` object is exposed as `libceed.Vector` in Python, and supports no-copy host and GPU device interperability with Python arrays from the NumPy [@NumPy] or Numba [@Numba] packages. The interested reader can find more details on `libCEED`'s Python interface in @libceed-paper-proc-scipy-2020.
134*b4cdc43dSValeria Barra
135*b4cdc43dSValeria BarraThe Julia interface, referred to as `LibCEED.jl`, provides both a low-level interface, which is generated automatically from `libCEED`'s C header files, and a high-level interface. The high-level interface takes advantage of Julia's metaprogramming and just-in-time compilation capabilities to enable concise definition of Q-functions that work on both CPUs and GPUs, along with their composition into operators as in \autoref{fig:decomposition}.
136*b4cdc43dSValeria Barra
137*b4cdc43dSValeria BarraThe Rust interface also wraps automatically-generated bindings from the `libCEED` C header files, offering increased safety due to Rust ownership and borrow checking, and more convenient definition of Q-functions (e.g., via closures).
138*b4cdc43dSValeria Barra
139*b4cdc43dSValeria Barra# Backends
140*b4cdc43dSValeria Barra
141*b4cdc43dSValeria Barra\autoref{fig:libCEEDBackends} shows a subset of the backend implementations (backends) available in `libCEED`.
142*b4cdc43dSValeria BarraGPU implementations are available via pure @CUDAwebsite and pure @HIPwebsite, as well as the OCCA [@OCCAwebsite] and MAGMA [@MAGMAwebsite] libraries. CPU implementations are available via pure C and AVX intrinsics as well as the LIBXSMM library [@LIBXSMM]. `libCEED` provides a dynamic interface such that users only need to write a single source (no need for templates/generics) and can select the desired specialized implementation at run time. Moreover, each process or thread can instantiate an arbitrary number of backends on an arbitrary number of devices.
143*b4cdc43dSValeria Barra
144*b4cdc43dSValeria Barra![`libCEED` provides the algebraic core for element-based discretizations, with specialized implementations
145*b4cdc43dSValeria Barra(backends) for heterogeneous architectures.\label{fig:libCEEDBackends}](img/libCEEDBackends.png)
146*b4cdc43dSValeria Barra
147*b4cdc43dSValeria Barra# Performance benchmarks
148*b4cdc43dSValeria Barra
149*b4cdc43dSValeria BarraThe Exascale Computing Project (ECP) co-design Center for Efficient Exascale Discretization [@CEEDwebsite] has defined a suite of Benchmark Problems (BPs) to test and compare the performance of high-order finite element implementations [@Fischer2020scalability; @CEED-ECP-paper]. \autoref{fig:bp3} compares the performance of `libCEED` solving BP3 (CG iteration on a 3D Poisson problem) or CPU and GPU systems of similar (purchase/operating and energy) cost. These tests use PETSc [@PETScUserManual] for unstructured mesh management and parallel solvers with GPU-aware communication [@zhang2021petscsf]; a similar implementation with comparable performance is available through MFEM.
150*b4cdc43dSValeria Barra
151*b4cdc43dSValeria Barra![Performance for BP3 using the \texttt{xsmm/blocked} backend on a 2-socket AMD EPYC 7452 (32-core, 2.35GHz) and the \texttt{cuda/gen} backend on LLNL's Lassen system with NVIDIA V100 GPUs. Each curve represents fixing the basis degree $p$ and varying the number of elements. The CPU enables faster solution of smaller problem sizes (as in strong scaling) while the GPU is more efficient for applications that can afford to wait for larger sizes. Note that the CPU exhibits a performance drop when the working set becomes too large for L3 cache (128 MB/socket) while no such drop exists for the GPU. (This experiment was run with release candidates of PETSc 3.14 and libCEED 0.7 using gcc-10 on EPYC and clang-10/CUDA-10 on Lassen.) \label{fig:bp3}](img/bp3-2020.pdf)
152*b4cdc43dSValeria Barra
153*b4cdc43dSValeria Barra# Demo applications and integration
154*b4cdc43dSValeria Barra
155*b4cdc43dSValeria BarraTo highlight the ease of library reuse for solver composition and leverage `libCEED`'s full capability for real-world applications, `libCEED` comes with a suite of application examples, including problems of interest to the fluid dynamics and solid mechanics communities.
156*b4cdc43dSValeria BarraThe fluid dynamics example solves the 2D and 3D compressible Navier-Stokes equations using SU/SUPG stabilization and implicit, explicit, or IMEX time integration; \autoref{fig:NSvortices} shows vortices arising in the "density current" [@straka1993numerical] when a cold bubble of air reaches the ground.
157*b4cdc43dSValeria BarraThe solid mechanics example solves static linear elasticity and hyperelasticity with load continuation and Newton-Krylov solvers with $p$-multigrid preconditioners; \autoref{fig:Solids} shows a twisted Neo-Hookean beam. Both of these examples have been developed using PETSc, where `libCEED` provides the matrix-free operator and preconditioner ingredient evaluation and PETSc provides the unstructured mesh management and parallel solvers.
158*b4cdc43dSValeria Barra
159*b4cdc43dSValeria Barra![Vortices develop as a cold air bubble drops to the ground.\label{fig:NSvortices}](img/Vortices.png)
160*b4cdc43dSValeria Barra
161*b4cdc43dSValeria Barra![Strain energy density in a twisted Neo-Hookean beam.\label{fig:Solids}](img/SolidTwistExample.jpeg)
162*b4cdc43dSValeria Barra
163*b4cdc43dSValeria Barra`libCEED` also includes additional examples with PETSc, MFEM, and Nek5000 [@Nekwebsite].
164*b4cdc43dSValeria Barra
165*b4cdc43dSValeria BarraIf MFEM is built with `libCEED` support, existing MFEM users can pass `-d ceed-cuda:/gpu/cuda/gen` to use a `libCEED` CUDA backend, and similarly for other backends.
166*b4cdc43dSValeria BarraThe `libCEED` implementations, accessed in this way, currently provide MFEM users with the fastest operator action on CPUs and GPUs (CUDA and HIP/ROCm) without writing any `libCEED` Q-functions.
167*b4cdc43dSValeria Barra
168*b4cdc43dSValeria Barra# Acknowledgements
169*b4cdc43dSValeria Barra
170*b4cdc43dSValeria BarraThis research is supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nations exascale computing imperative. We thank Lawrence Livermore National Laboratory for access to the Lassen and Corona machines.
171*b4cdc43dSValeria Barra
172*b4cdc43dSValeria Barra# References
173