# libCEED: the CEED API Library

[![Build Status](https://travis-ci.org/CEED/libCEED.svg?branch=master)](https://travis-ci.org/CEED/libCEED)
[![Code Coverage](https://codecov.io/gh/CEED/libCEED/branch/master/graphs/badge.svg)](https://codecov.io/gh/CEED/libCEED/)
[![License](https://img.shields.io/badge/License-BSD%202--Clause-orange.svg)](https://opensource.org/licenses/BSD-2-Clause)
[![Documentation Status](https://readthedocs.org/projects/libceed/badge/?version=latest)](https://libceed.readthedocs.io/en/latest/?badge=latest)
[![Doxygen](https://codedocs.xyz/CEED/libCEED.svg)](https://codedocs.xyz/CEED/libCEED/)

## Code for Efficient Extensible Discretization

This repository contains an initial low-level API library for the efficient
high-order discretization methods developed by the ECP co-design [Center for
Efficient Exascale Discretizations (CEED)](http://ceed.exascaleproject.org).
While our focus is on high-order finite elements, the approach is mostly
algebraic and thus applicable to other discretizations in factored form, as
explained in the [User manual](https://libceed.readthedocs.io/en/latest/) and
the API documentation portion of the [Doxygen documentation](https://codedocs.xyz/CEED/libCEED/md_doc_libCEEDapi.html).

One of the challenges with high-order methods is that a global sparse matrix is
no longer a good representation of a high-order linear operator, both in terms
of the FLOPs needed for its evaluation and the memory transfer needed for a
matvec.  Thus, high-order methods require a new "format" that still represents
a linear (or more generally non-linear) operator, but not through a sparse
matrix.

The goal of libCEED is to propose such a format, as well as supporting
implementations and data structures, that enable efficient operator evaluation
on a variety of computational device types (CPUs, GPUs, etc.). This new operator
description is based on an algebraically [factored form](https://libceed.readthedocs.io/en/latest/libCEEDapi.html),
which is easy to incorporate in a wide variety of applications without
significant refactoring of their own discretization infrastructure.
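
Schematically, the factored form expresses a global operator as a composition
of sparse restriction maps and small dense element-level operations. The
equation below is a summary of the decomposition detailed in the documentation
linked above, using common CEED notation:

    A = P^T G^T B^T D B G P

Here `P` restricts global degrees of freedom to those owned by a process, `G`
gathers the degrees of freedom of each element, `B` evaluates the basis (values
and derivatives) at quadrature points, and `D` applies the pointwise physics
(the "QFunction") at each quadrature point. Only `D` depends on the physical
problem being solved; `P`, `G`, and `B` depend only on the mesh and the choice
of discretization.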

The repository is part of the [CEED software suite][ceed-soft], a collection of
software benchmarks, miniapps, libraries and APIs for efficient exascale
discretizations based on high-order finite element and spectral element methods.
See http://github.com/ceed for more information and source code availability.

The CEED research is supported by the [Exascale Computing Project][ecp]
(17-SC-20-SC), a collaborative effort of two U.S. Department of Energy
organizations (Office of Science and the National Nuclear Security
Administration) responsible for the planning and preparation of a [capable
exascale ecosystem](https://exascaleproject.org/what-is-exascale), including
software, applications, hardware, advanced system engineering and early testbed
platforms, in support of the nation’s exascale computing imperative.

For more details on the CEED API see http://ceed.exascaleproject.org/ceed-code/.

For detailed instructions on how to build libCEED and run benchmarks and
examples, please see the dedicated [Getting Started](https://libceed.readthedocs.io/en/latest/gettingstarted.html)
page in the [User manual](https://libceed.readthedocs.io/en/latest/). A short
summary is provided here.

## Building

The CEED library, `libceed`, is a C99 library with no external dependencies.
The library has Fortran and Python interfaces; see `interface/ceed-fortran.c`
and `interface/ceed-python/`. It can be built using

    make

or, with optimization flags:

    make OPT='-O3 -march=skylake-avx512 -ffp-contract=fast'

These optimization flags are used by all languages (C, C++, Fortran) and this
makefile variable can also be set for testing and examples (below).

The library attempts to automatically detect support for the AVX instruction
set using gcc-style compiler options for the host. Support may need to be
manually specified via

    make AVX=1

or

    make AVX=0

if your compiler does not support gcc-style options, if you are
cross-compiling, etc.

## Testing

The test suite produces [TAP](https://testanything.org) output and is run by:

    make test

or, using the `prove` tool distributed with Perl (recommended):

    make prove

## Backends

There are multiple supported backends, which can be selected at runtime in the examples:

| CEED resource            | Backend                                           |
| :----------------------- | :------------------------------------------------ |
| `/cpu/self/ref/serial`   | Serial reference implementation                   |
| `/cpu/self/ref/blocked`  | Blocked reference implementation                  |
| `/cpu/self/memcheck`     | Memcheck backend, undefined value checks          |
| `/cpu/self/opt/serial`   | Serial optimized C implementation                 |
| `/cpu/self/opt/blocked`  | Blocked optimized C implementation                |
| `/cpu/self/avx/serial`   | Serial AVX implementation                         |
| `/cpu/self/avx/blocked`  | Blocked AVX implementation                        |
| `/cpu/self/xsmm/serial`  | Serial LIBXSMM implementation                     |
| `/cpu/self/xsmm/blocked` | Blocked LIBXSMM implementation                    |
| `/cpu/occa`              | Serial OCCA kernels                               |
| `/gpu/occa`              | CUDA OCCA kernels                                 |
| `/omp/occa`              | OpenMP OCCA kernels                               |
| `/ocl/occa`              | OpenCL OCCA kernels                               |
| `/gpu/cuda/ref`          | Reference pure CUDA kernels                       |
| `/gpu/cuda/reg`          | Pure CUDA kernels using one thread per element    |
| `/gpu/cuda/shared`       | Optimized pure CUDA kernels using shared memory   |
| `/gpu/cuda/gen`          | Optimized pure CUDA kernels using code generation |
| `/gpu/magma`             | CUDA MAGMA kernels                                |

The `/cpu/self/*/serial` backends process one element at a time and are
intended for meshes with a smaller number of high-order elements. The
`/cpu/self/*/blocked` backends process blocked batches of eight interlaced
elements and are intended for meshes with a larger number of elements.

The `/cpu/self/ref/*` backends are written in pure C and provide basic functionality.

The `/cpu/self/opt/*` backends are written in pure C and use partial e-vectors to improve performance.

The `/cpu/self/avx/*` backends rely upon AVX instructions to provide vectorized CPU performance.

The `/cpu/self/xsmm/*` backends rely upon the [LIBXSMM](http://github.com/hfp/libxsmm) package
to provide vectorized CPU performance. If linking MKL and LIBXSMM is desired but
the Makefile is not detecting `MKLROOT`, linking libCEED against MKL can be
forced by setting the environment variable `MKL=1`.

The `/cpu/self/memcheck/*` backends rely upon the [Valgrind](http://valgrind.org/) Memcheck tool
to help verify that user QFunctions have no undefined values. To use, run your code with
Valgrind and the Memcheck backends, e.g. `valgrind ./build/ex1 -ceed /cpu/self/memcheck`. A
'development' or 'debugging' version of Valgrind with headers is required to use this backend.
This backend can be run in serial or blocked mode and defaults to the serial mode
if `/cpu/self/memcheck` is selected at runtime.

The `/*/occa` backends rely upon the [OCCA](http://github.com/libocca/occa) package to provide
cross-platform performance.

The `/gpu/cuda/*` backends provide GPU performance strictly using CUDA.

The `/gpu/magma` backend relies upon the [MAGMA](https://bitbucket.org/icl/magma) package.
141