xref: /libCEED/doc/sphinx/source/intro.md (revision b086c2df6e3de7820b46afe6b13ac7175f68375f)
1# Introduction
2
3Historically, conventional high-order finite element methods were rarely used for
4industrial problems because the Jacobian rapidly loses sparsity as the order is
5increased, leading to unaffordable solve times and memory requirements
6{cite}`brown2010`. This effect typically limited the order of accuracy to at most
7quadratic, especially because quadratic finite element formulations are computationally advantageous in terms of
8floating point operations (FLOPS) per degree of freedom (DOF)---see
9{numref}`fig-assembledVsmatrix-free`---, despite the fast convergence and favorable
10stability properties offered by higher order discretizations. Nowadays, high-order
11numerical methods, such as the spectral element method (SEM)---a special case of
12nodal p-Finite Element Method (FEM) which can reuse the interpolation nodes for
13quadrature---are employed, especially with (nearly) affine elements, because
14linear constant coefficient problems can be very efficiently solved using the
15fast diagonalization method combined with a multilevel coarse solve. In
16{numref}`fig-assembledVsmatrix-free` we analyze and compare the theoretical costs,
17of different configurations: assembling the sparse matrix representing the action
18of the operator (labeled as *assembled*), non assembling the matrix and storing
19only the metric terms needed as an operator setup-phase (labeled as *tensor-qstore*)
20and non assembling  the matrix and computing the metric terms on the fly and storing
21a compact representation of the linearization at quadrature points (labeled as
22*tensor*). In the right panel, we show the cost in terms of FLOPS/DOF. This metric for
23computational efficiency made sense historically, when the performance was mostly
24limited by processors' clockspeed. A more relevant performance plot for current
25state-of-the-art high-performance machines (for which the bottleneck of performance is
26mostly in the memory bandwith) is shown in the left panel of
27{numref}`fig-assembledVsmatrix-free`, where the memory bandwith is measured in terms of
28bytes/DOF. We can see that high-order methods, implemented properly with only partial
29assembly, require optimal amount of memory transfers (with respect to the
30polynomial order) and near-optimal FLOPs for operator evaluation. Thus, high-order
31methods in matrix-free representation not only possess favorable properties, such as
32higher accuracy and faster convergence to solution, but also manifest an efficiency gain
33compared to their corresponding assembled representations.
34
35(fig-assembledvsmatrix-free)=
36
37:::{figure} ../../img/TensorVsAssembly.png
38Comparison of memory transfer and floating point operations per
39degree of freedom for different representations of a linear operator for a PDE in
403D with $b$ components and variable coefficients arising due to Newton
41linearization of a material nonlinearity. The representation labeled as *tensor*
42computes metric terms on the fly and stores a compact representation of the
43linearization at quadrature points. The representation labeled as *tensor-qstore*
44pulls the metric terms into the stored representation. The *assembled* representation
45uses a (block) CSR format.
46:::
47
48Furthermore, software packages that provide high-performance implementations have often
49been special-purpose and intrusive. libCEED {cite}`libceed-joss-paper` is a new library that offers a purely
50algebraic interface for matrix-free operator representation and supports run-time
51selection of implementations tuned for a variety of computational device types,
52including CPUs and GPUs. libCEED's purely algebraic interface can unobtrusively be
53integrated in new and legacy software to provide performance portable interfaces.
54While libCEED's focus is on high-order finite elements, the approach is algebraic
55and thus applicable to other discretizations in factored form. libCEED's role, as
56a lightweight portable library that allows a wide variety of applications to share
57highly optimized discretization kernels, is illustrated in
58{numref}`fig-libCEED-backends`, where a non-exhaustive list of specialized
59implementations (backends) is provided. libCEED provides a low-level Application
60Programming Interface (API) for user codes so that applications with their own
61discretization infrastructure (e.g., those in [PETSc](https://www.mcs.anl.gov/petsc/),
62[MFEM](https://mfem.org/) and [Nek5000](https://nek5000.mcs.anl.gov/)) can evaluate
63and use the core operations provided by libCEED. GPU implementations are available via
64pure [CUDA](https://developer.nvidia.com/about-cuda) as well as the
65[OCCA](http://github.com/libocca/occa) and [MAGMA](https://bitbucket.org/icl/magma)
66libraries. CPU implementations are available via pure C and AVX intrinsics as well as
67the [LIBXSMM](http://github.com/hfp/libxsmm) library. libCEED provides a unified
68interface, so that users only need to write a single source code and can select the
69desired specialized implementation at run time. Moreover, each process or thread can
70instantiate an arbitrary number of backends.
71
72(fig-libceed-backends)=
73
74:::{figure} ../../img/libCEEDBackends.png
75The role of libCEED as a lightweight, portable library which provides a low-level
76API for efficient, specialized implementations. libCEED allows different applications
77to share highly optimized discretization kernels.
78:::
79