# Introduction

Historically, conventional high-order finite element methods were rarely used for
industrial problems because the Jacobian rapidly loses sparsity as the order is
increased, leading to unaffordable solve times and memory requirements
{cite}`brown2010`. This effect typically limited the order of accuracy to at most
quadratic, especially because quadratic finite element formulations are
computationally advantageous in terms of floating point operations (FLOPs) per
degree of freedom (DOF)---see {numref}`fig-assembledVsmatrix-free`---despite the
fast convergence and favorable stability properties offered by higher-order
discretizations. Nowadays, high-order numerical methods, such as the spectral
element method (SEM)---a special case of nodal p-Finite Element Method (FEM)
which can reuse the interpolation nodes for quadrature---are employed, especially
with (nearly) affine elements, because linear constant-coefficient problems can be
solved very efficiently using the fast diagonalization method combined with a
multilevel coarse solve. In {numref}`fig-assembledVsmatrix-free` we analyze and
compare the theoretical costs of different configurations: assembling the sparse
matrix representing the action of the operator (labeled as *assembled*), not
assembling the matrix and storing only the metric terms needed in an operator
setup phase (labeled as *tensor-qstore*), and not assembling the matrix,
computing the metric terms on the fly, and storing a compact representation of
the linearization at quadrature points (labeled as *tensor*). In the right panel,
we show the cost in terms of FLOPs/DOF. This metric for computational efficiency
made sense historically, when performance was mostly limited by processors' clock
speed.
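The storage trends behind these configurations can be made concrete with a
back-of-the-envelope model. The sketch below is an illustrative estimate only,
not libCEED code: it assumes a scalar 3D operator on $Q_p$ hexahedral elements,
8-byte values and 4-byte column indices for CSR, roughly $(2p+1)^3$ nonzeros per
row for the assembled matrix, and, for the matrix-free case, one stored quantity
per quadrature point with $(p+1)^3$ quadrature points and about $p^3$ unique DOFs
per element.

```python
def bytes_per_dof_assembled(p, value_bytes=8, index_bytes=4):
    """Assembled (block) CSR: each row of a scalar 3D Q_p operator couples
    with the DOFs of all neighboring elements, ~ (2p+1)^3 nonzeros/row."""
    nnz_per_row = (2 * p + 1) ** 3
    return nnz_per_row * (value_bytes + index_bytes)

def bytes_per_dof_tensor(p, qdata_per_qpt=1, value_bytes=8):
    """Matrix-free tensor form: store only the linearization data at the
    (p+1)^3 quadrature points; an interior element owns ~ p^3 unique DOFs."""
    qpts = (p + 1) ** 3
    dofs = p ** 3
    return qpts / dofs * qdata_per_qpt * value_bytes

# Assembled storage grows like p^3 per DOF; the tensor form approaches a
# constant number of bytes per DOF as the order increases.
for p in (1, 2, 4, 8):
    print(p, bytes_per_dof_assembled(p), round(bytes_per_dof_tensor(p), 1))
```

Under these assumptions the assembled representation already needs about 1.5 kB
per DOF at $p = 2$, while the matrix-free form shrinks toward a small constant,
which is the qualitative behavior shown in {numref}`fig-assembledVsmatrix-free`.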
A more relevant performance plot for current state-of-the-art high-performance
machines (for which the performance bottleneck is mostly the memory bandwidth)
is shown in the left panel of {numref}`fig-assembledVsmatrix-free`, where the
memory traffic is measured in terms of bytes/DOF. We can see that high-order
methods, implemented properly with only partial assembly, require an optimal
amount of memory transfers (with respect to the polynomial order) and
near-optimal FLOPs for operator evaluation. Thus, high-order methods in
matrix-free representation not only possess favorable properties, such as higher
accuracy and faster convergence to the solution, but also manifest an efficiency
gain compared to their corresponding assembled representations.

(fig-assembledvsmatrix-free)=

:::{figure} ../../img/TensorVsAssembly.png
Comparison of memory transfer and floating point operations per degree of
freedom for different representations of a linear operator for a PDE in 3D with
$b$ components and variable coefficients arising due to Newton linearization of
a material nonlinearity. The representation labeled as *tensor* computes metric
terms on the fly and stores a compact representation of the linearization at
quadrature points. The representation labeled as *tensor-qstore* pulls the
metric terms into the stored representation. The *assembled* representation uses
a (block) CSR format.
:::

Furthermore, software packages that provide high-performance implementations
have often been special-purpose and intrusive. libCEED
{cite}`libceed-joss-paper` is a new library that offers a purely algebraic
interface for matrix-free operator representation and supports run-time
selection of implementations tuned for a variety of computational device types,
including CPUs and GPUs. libCEED's purely algebraic interface can be integrated
unobtrusively into new and legacy software to provide performance-portable
interfaces.
While libCEED's focus is on high-order finite elements, the approach is
algebraic and thus applicable to other discretizations in factored form.
libCEED's role, as a lightweight portable library that allows a wide variety of
applications to share highly optimized discretization kernels, is illustrated in
{numref}`fig-libCEED-backends`, where a non-exhaustive list of specialized
implementations (backends) is provided. libCEED provides a low-level Application
Programming Interface (API) for user codes so that applications with their own
discretization infrastructure (e.g., those in
[PETSc](https://www.mcs.anl.gov/petsc/), [MFEM](https://mfem.org/), and
[Nek5000](https://nek5000.mcs.anl.gov/)) can evaluate and use the core
operations provided by libCEED. GPU implementations are available via pure
[CUDA](https://developer.nvidia.com/about-cuda) and pure
[HIP](https://rocmdocs.amd.com) as well as the
[OCCA](http://github.com/libocca/occa) and
[MAGMA](https://bitbucket.org/icl/magma) libraries. CPU implementations are
available via pure C and AVX intrinsics as well as the
[LIBXSMM](http://github.com/hfp/libxsmm) library. libCEED provides a unified
interface, so that users only need to write a single source code and can select
the desired specialized implementation at run time. Moreover, each process or
thread can instantiate an arbitrary number of backends.

(fig-libceed-backends)=

:::{figure} ../../img/libCEEDBackends.png
The role of libCEED as a lightweight, portable library which provides a
low-level API for efficient, specialized implementations. libCEED allows
different applications to share highly optimized discretization kernels.
:::
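The run-time selection pattern described above---one user source, with the
specialized implementation chosen by a backend resource string---can be
illustrated with a schematic sketch. The following is a toy Python illustration
of the dispatch idea only, not libCEED's actual interface (which is in C, e.g.
`CeedInit()` taking resource strings such as `/cpu/self` or `/gpu/cuda`); the
registry, `register`, and `init` names here are invented for the sketch.

```python
# Toy illustration of runtime backend selection keyed on a resource
# string, mirroring the dispatch pattern (not libCEED's real API).
backends = {}

def register(prefix):
    """Decorator that registers an implementation under a resource prefix."""
    def wrap(fn):
        backends[prefix] = fn
        return fn
    return wrap

@register("/cpu/self")
def cpu_apply(x):
    return [2.0 * v for v in x]  # reference CPU implementation

@register("/gpu/cuda")
def gpu_apply(x):
    return [2.0 * v for v in x]  # stand-in for a CUDA kernel launch

def init(resource):
    # Longest-prefix match, so a more specific resource string such as
    # "/cpu/self/avx/blocked" still resolves to the "/cpu/self" family.
    match = max((p for p in backends if resource.startswith(p)),
                key=len, default=None)
    if match is None:
        raise ValueError(f"no backend for {resource}")
    return backends[match]

apply_op = init("/cpu/self/avx/blocked")
print(apply_op([1.0, 2.0]))  # identical user code runs on any backend
```

The point of the pattern is that the application is written once against the
unified interface, and swapping `/cpu/self/...` for `/gpu/cuda/...` changes only
the resource string, not the user code.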