1--- 2title: 'libCEED: Fast algebra for high-order element-based discretizations' 3tags: 4 - high-performance computing 5 - high-order methods 6 - finite elements 7 - spectral elements 8 - matrix-free 9authors: 10 - name: Jed Brown 11 orcid: 0000-0002-9945-0639 12 affiliation: 1 13 - name: Ahmad Abdelfattah 14 orcid: 0000-0001-5054-4784 15 affiliation: 3 16 - name: Valeria Barra 17 orcid: 0000-0003-1129-2056 18 affiliation: 1 19 - name: Natalie Beams 20 orcid: 0000-0001-6060-4082 21 affiliation: 3 22 - name: Jean-Sylvain Camier 23 orcid: 0000-0003-2421-1999 24 affiliation: 2 25 - name: Veselin Dobrev 26 orcid: 0000-0003-1793-5622 27 affiliation: 2 28 - name: Yohann Dudouit 29 orcid: 0000-0001-5831-561X 30 affiliation: 2 31 - name: Leila Ghaffari 32 orcid: 0000-0002-0965-214X 33 affiliation: 1 34 - name: Tzanio Kolev 35 orcid: 0000-0002-2810-3090 36 affiliation: 2 37 - name: David Medina 38 affiliation: 4 39 - name: Will Pazner 40 orcid: 0000-0003-4885-2934 41 affiliation: 2 42 - name: Thilina Ratnayaka 43 orcid: 0000-0001-6102-6560 44 affiliation: 5 45 - name: Jeremy Thompson 46 orcid: 0000-0003-2980-0899 47 affiliation: 1 48 - name: Stan Tomov 49 orcid: 0000-0002-5937-7959 50 affiliation: 3 51affiliations: 52 - name: University of Colorado at Boulder 53 index: 1 54 - name: Lawrence Livermore National Laboratory 55 index: 2 56 - name: University of Tennessee 57 index: 3 58 - name: Occalytics LLC 59 index: 4 60 - name: University of Illinois at Urbana-Champaign 61 index: 5 62date: 9 July 2021 63bibliography: paper.bib 64--- 65 66# Summary and statement of need 67 68Finite element methods are widely used to solve partial differential equations (PDE) in science and engineering, but their standard implementation [@dealII92;@libMeshPaper;@LoggMardalWells2012] relies on assembling sparse matrices. 69Sparse matrix multiplication and triangular operations perform a scalar multiply and add for each nonzero entry, just 2 floating point operations (flops) per scalar that must be loaded from memory [@williams2009roofline]. 70Modern hardware is capable of nearly 100 flops per scalar streamed from memory [@kruppcomparison] so sparse matrix operations cannot achieve more than about 2% utilization of arithmetic units. 71Matrix assembly becomes even more problematic when the polynomial degree $p$ of the basis functions is increased, resulting in $O(p^d)$ storage and $O(p^{2d})$ compute per degree of freedom (DoF) in $d$ dimensions. 72Methods pioneered by the spectral element community [@Orszag:1980; @deville2002highorder] exploit problem structure to reduce costs to $O(1)$ storage and $O(p)$ compute per DoF, with very high utilization of modern CPUs and GPUs. 73Unfortunately, high-quality implementations have been relegated to applications and intrusive frameworks that are often difficult to extend to new problems or incorporate into legacy applications, especially when strong preconditioners are required. 74 75`libCEED`, the Code for Efficient Extensible Discretization [@libceed-user-manual], is a lightweight library that provides a purely algebraic interface for linear and nonlinear operators and preconditioners with element-based discretizations. 76`libCEED` provides portable performance via run-time selection of implementations optimized for CPUs and GPUs, including support for just-in-time (JIT) compilation. 77It is designed for convenient use in new and legacy software, and offers interfaces in C99 [@C99-lang], Fortran77 [@Fortran77-lang], Python [@Python-lang], Julia [@Julia-lang], and Rust [@Rust-lang]. 78Users and library developers can integrate `libCEED` at a low level into existing applications in place of existing matrix-vector products without significant refactoring of their own discretization infrastructure. 79Alternatively, users can utilize integrated `libCEED` support in MFEM [@MFEMlibrary; @mfem-paper]. 80 81In addition to supporting applications and discretization libraries, `libCEED` provides a platform for performance engineering and co-design, as well as an algebraic interface for solvers research like adaptive $p$-multigrid, much like how sparse matrix libraries enable development and deployment of algebraic multigrid solvers. 82 83# Concepts and interface 84 85Consider finite element discretization of a problem based on a weak form with one weak derivative: find $u$ such that 86 87$$ v^T F(u) := \int_\Omega v \cdot f_0(u, \nabla u) + \nabla v \!:\! f_1(u, \nabla u) = 0 \quad \forall v, $$ 88 89where the functions $f_0$ and $f_1$ define the physics and possible stabilization of the problem [@Brown:2010] and the functions $u$ and $v$ live in a suitable space. 90Integrals in the weak form are evaluated by summing over elements $e$, 91 92$$ F(u) = \sum_e \mathcal E_e^T B_e^T W_e f(B_e \mathcal E_e u), $$ 93 94where $\mathcal E_e$ restricts to element $e$, $B_e$ evaluates solution values and derivatives to quadrature points, $f$ acts independently at quadrature points, and $W_e$ is a (diagonal) weighting at quadrature points. 95By grouping the operations $W_e$ and $f$ into a point-block diagonal $D$ and stacking the restrictions $\mathcal E_e$ and basis actions $B_e$ for each element, we can express the global residual in operator notation (\autoref{fig:decomposition}), where $\mathcal P$ is an optional external operator, such as the parallel restriction in MPI-based [@gropp2014using] solvers. 96Inhomogeneous Neumann, Robin, and nonlinear boundary conditions can be added in a similar fashion by adding terms integrated over boundary faces while Dirichlet boundary conditions can be added by setting the target values prior to applying the operator representing the weak form. 97Similar face integral terms can also be used to represent discontinuous Galerkin formulations. 98 99 100 101`libCEED`'s native C interface is object-oriented, providing data types for each logical object in the decomposition. 102 103Symbol `libCEED` type Description 104------ ------------ ----------- 105$D$ `CeedQFunction` User-defined action at quadrature points 106$B$ `CeedBasis` Basis evaluation to quadrature (dense/structured) 107$\mathcal E$ `CeedElemRestriction` Restriction to each element (sparse/boolean) 108$A$ `CeedOperator` Linear or nonlinear operator acting on L-vectors 109 110`libCEED` implementations ("backends") are free to reorder and fuse computational steps (including eliding memory to store intermediate representations) so long as the mathematical properties of the operator $A$ are preserved. 111A `CeedOperator` is composed of one or more operators defined as in \autoref{fig:decomposition}, and acts on a `CeedVector`, which typically encapsulates zero-copy access to host or device memory provided by the caller. 112The element restriction $\mathcal E$ requires mesh topology and a numbering of DoFs, and may be a no-op when data is already composed by element (such as with discontinuous Galerkin methods). 113The discrete basis $B$ is the purely algebraic expression of a finite element basis (shape functions) and quadrature; it often possesses structure that is exploited to speed up its action. 114Some constructors are provided for arbitrary polynomial degree $H^1$ Lagrange bases with a tensor-product representation due to the computational efficiency of computing solution values and derivatives at quadrature points via tensor contractions. 115However, the user can define a `CeedBasis` for arbitrary element topology including tetrahedra, prisms, and other realizations of abstract polytopes, by providing quadrature weights and the matrices used to compute solution values and derivatives at quadrature points from the DoFs on the element. 116 117The physics (weak form) is expressed through `CeedQFunction`, which can either be defined by the user or selected from a gallery distributed with `libCEED`. 118These pointwise functions do not depend on element resolution, topology, or basis degree (see \autoref{fig:schematic}), in contrast to systems like FEniCS where UFL forms specify basis degree at compile time. 119This isolation is valuable for $hp$-refinement and adaptivity (where $h$ commonly denotes the average element size and $p$ the polynomial degree of the basis functions; see @babuska1994hpfem) and $p$-multigrid solvers; mixed-degree, mixed-topology, and $h$-nonconforming finite element methods are readily expressed by composition. 120Additionally, a single source implementation (in vanilla C or C++) for the `CeedQFunction`s can be used on CPUs or GPUs (transparently using the @NVRTCwebsite, HIPRTC, or OCCA [@OCCAwebsite] run-time compilation features). 121 122`libCEED` provides computation of the true operator diagonal for preconditioning with Jacobi and Chebyshev as well as direct assembly of sparse matrices (e.g., for coarse operators in multigrid) and construction of $p$-multigrid prolongation and restriction operators. 123Preconditioning matrix-free operators is an active area of research; support for domain decomposition methods and inexact subdomain solvers based on the fast diagonalization method [@lottes2005hms] are in active development. 124 125 128 129# High-level languages 130 131`libCEED` provides high-level interfaces in Python, Julia, and Rust, each of which is maintained and tested as part of the main repository, but distributed through each language's respective package manager. 132 133The Python interface uses CFFI, the C Foreign Function Interface [@python-cffi]. CFFI allows reuse of most C declarations and requires only a minimal adaptation of some of them. The C and Python APIs are mapped in a nearly 1:1 correspondence. For instance, a `CeedVector` object is exposed as `libceed.Vector` in Python, and supports no-copy host and GPU device interperability with Python arrays from the NumPy [@NumPy] or Numba [@Numba] packages. The interested reader can find more details on `libCEED`'s Python interface in @libceed-paper-proc-scipy-2020. 134 135The Julia interface, referred to as `LibCEED.jl`, provides both a low-level interface, which is generated automatically from `libCEED`'s C header files, and a high-level interface. The high-level interface takes advantage of Julia's metaprogramming and just-in-time compilation capabilities to enable concise definition of Q-functions that work on both CPUs and GPUs, along with their composition into operators as in \autoref{fig:decomposition}. 136 137The Rust interface also wraps automatically-generated bindings from the `libCEED` C header files, offering increased safety due to Rust ownership and borrow checking, and more convenient definition of Q-functions (e.g., via closures). 138 139# Backends 140 141\autoref{fig:libCEEDBackends} shows a subset of the backend implementations (backends) available in `libCEED`. 142GPU implementations are available via pure @CUDAwebsite and pure @HIPwebsite, as well as the OCCA [@OCCAwebsite] and MAGMA [@MAGMAwebsite] libraries. CPU implementations are available via pure C and AVX intrinsics as well as the LIBXSMM library [@LIBXSMM]. `libCEED` provides a dynamic interface such that users only need to write a single source (no need for templates/generics) and can select the desired specialized implementation at run time. Moreover, each process or thread can instantiate an arbitrary number of backends on an arbitrary number of devices. 143 144 146 147# Performance benchmarks 148 149The Exascale Computing Project (ECP) co-design Center for Efficient Exascale Discretization [@CEEDwebsite] has defined a suite of Benchmark Problems (BPs) to test and compare the performance of high-order finite element implementations [@Fischer2020scalability; @CEED-ECP-paper]. \autoref{fig:bp3} compares the performance of `libCEED` solving BP3 (CG iteration on a 3D Poisson problem) or CPU and GPU systems of similar (purchase/operating and energy) cost. These tests use PETSc [@PETScUserManual] for unstructured mesh management and parallel solvers with GPU-aware communication [@zhang2021petscsf]; a similar implementation with comparable performance is available through MFEM. 150 151 152 153# Demo applications and integration 154 155To highlight the ease of library reuse for solver composition and leverage `libCEED`'s full capability for real-world applications, `libCEED` comes with a suite of application examples, including problems of interest to the fluid dynamics and solid mechanics communities. 156The fluid dynamics example solves the 2D and 3D compressible Navier-Stokes equations using SU/SUPG stabilization and implicit, explicit, or IMEX time integration; \autoref{fig:NSvortices} shows vortices arising in the "density current" [@straka1993numerical] when a cold bubble of air reaches the ground. 157The solid mechanics example solves static linear elasticity and hyperelasticity with load continuation and Newton-Krylov solvers with $p$-multigrid preconditioners; \autoref{fig:Solids} shows a twisted Neo-Hookean beam. Both of these examples have been developed using PETSc, where `libCEED` provides the matrix-free operator and preconditioner ingredient evaluation and PETSc provides the unstructured mesh management and parallel solvers. 158 159 160 161 162 163`libCEED` also includes additional examples with PETSc, MFEM, and Nek5000 [@Nekwebsite]. 164 165If MFEM is built with `libCEED` support, existing MFEM users can pass `-d ceed-cuda:/gpu/cuda/gen` to use a `libCEED` CUDA backend, and similarly for other backends. 166The `libCEED` implementations, accessed in this way, currently provide MFEM users with the fastest operator action on CPUs and GPUs (CUDA and HIP/ROCm) without writing any `libCEED` Q-functions. 167 168# Acknowledgements 169 170This research is supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nations exascale computing imperative. We thank Lawrence Livermore National Laboratory for access to the Lassen and Corona machines. 171 172# References 173