# libCEED: the CEED API Library

[![Build Status](https://travis-ci.org/CEED/libCEED.svg?branch=master)](https://travis-ci.org/CEED/libCEED)
[![Code Coverage](https://codecov.io/gh/CEED/libCEED/branch/master/graphs/badge.svg)](https://codecov.io/gh/CEED/libCEED/)
[![License](https://img.shields.io/badge/License-BSD%202--Clause-orange.svg)](https://opensource.org/licenses/BSD-2-Clause)
[![Documentation Status](https://readthedocs.org/projects/libceed/badge/?version=latest)](https://libceed.readthedocs.io/en/latest/?badge=latest)
[![Doxygen](https://codedocs.xyz/CEED/libCEED.svg)](https://codedocs.xyz/CEED/libCEED/)

## Code for Efficient Extensible Discretization

This repository contains an initial low-level API library for the efficient
high-order discretization methods developed by the ECP co-design [Center for
Efficient Exascale Discretizations (CEED)](http://ceed.exascaleproject.org).
While our focus is on high-order finite elements, the approach is mostly
algebraic and thus applicable to other discretizations in factored form, as
explained in the [User manual](https://libceed.readthedocs.io/en/latest/) and the API implementation portion of the [documentation](https://libceed.readthedocs.io/en/latest/libCEEDapi.html).

One of the challenges with high-order methods is that a global sparse matrix is
no longer a good representation of a high-order linear operator, both with
respect to the FLOPs needed for its evaluation and the memory transfer
needed for a matvec.  Thus, high-order methods require a new "format" that still
represents a linear (or more generally non-linear) operator, but not through a
sparse matrix.

The goal of libCEED is to propose such a format, as well as supporting
implementations and data structures, that enable efficient operator evaluation
on a variety of computational device types (CPUs, GPUs, etc.). This new operator
description is based on an algebraically [factored form](https://libceed.readthedocs.io/en/latest/libCEEDapi.html),
which is easy to incorporate into a wide variety of applications without significant
refactoring of their own discretization infrastructure.

The repository is part of the [CEED software suite][ceed-soft], a collection of
software benchmarks, miniapps, libraries and APIs for efficient exascale
discretizations based on high-order finite element and spectral element methods.
See http://github.com/ceed for more information and source code availability.

The CEED research is supported by the [Exascale Computing Project][ecp]
(17-SC-20-SC), a collaborative effort of two U.S. Department of Energy
organizations (Office of Science and the National Nuclear Security
Administration) responsible for the planning and preparation of a [capable
exascale ecosystem](https://exascaleproject.org/what-is-exascale), including
software, applications, hardware, advanced system engineering and early testbed
platforms, in support of the nation’s exascale computing imperative.

For more details on the CEED API see http://ceed.exascaleproject.org/ceed-code/.

For detailed instructions on how to build libCEED and run benchmarks and examples, please see the dedicated [Getting Started](https://libceed.readthedocs.io/en/latest/gettingstarted.html) page in the [User manual](https://libceed.readthedocs.io/en/latest/). A short summary is provided here.

## Building

The CEED library, `libceed`, is a C99 library with no required dependencies and
with Fortran and Python interfaces.  It can be built using

    make

or, with optimization flags,

    make OPT='-O3 -march=skylake-avx512 -ffp-contract=fast'

These optimization flags are used by all languages (C, C++, Fortran), and this
makefile variable can also be set for testing and examples (below).
Python users can install using

    pip install libceed

or in a clone of the repository via `pip install .`.

The library attempts to automatically detect support for the AVX
instruction set using gcc-style compiler options for the host.
Support may need to be specified manually via

    make AVX=1

or

    make AVX=0

if your compiler does not support gcc-style options, if you are
cross-compiling, etc.

## Testing

The test suite produces [TAP](https://testanything.org) output and is run by:

    make test

or, using the `prove` tool distributed with Perl (recommended):

    make prove

## Backends

There are multiple supported backends, which can be selected at runtime in the examples:

| CEED resource            | Backend                                           |
| :----------------------- | :------------------------------------------------ |
| `/cpu/self/ref/serial`   | Serial reference implementation                   |
| `/cpu/self/ref/blocked`  | Blocked reference implementation                  |
| `/cpu/self/memcheck`     | Memcheck backend, undefined value checks          |
| `/cpu/self/opt/serial`   | Serial optimized C implementation                 |
| `/cpu/self/opt/blocked`  | Blocked optimized C implementation                |
| `/cpu/self/avx/serial`   | Serial AVX implementation                         |
| `/cpu/self/avx/blocked`  | Blocked AVX implementation                        |
| `/cpu/self/xsmm/serial`  | Serial LIBXSMM implementation                     |
| `/cpu/self/xsmm/blocked` | Blocked LIBXSMM implementation                    |
| `/cpu/occa`              | Serial OCCA kernels                               |
| `/gpu/occa`              | CUDA OCCA kernels                                 |
| `/omp/occa`              | OpenMP OCCA kernels                               |
| `/ocl/occa`              | OpenCL OCCA kernels                               |
| `/gpu/cuda/ref`          | Reference pure CUDA kernels                       |
| `/gpu/cuda/reg`          | Pure CUDA kernels using one thread per element    |
| `/gpu/cuda/shared`       | Optimized pure CUDA kernels using shared memory   |
| `/gpu/cuda/gen`          | Optimized pure CUDA kernels using code generation |
| `/gpu/magma`             | CUDA MAGMA kernels                                |

The `/cpu/self/*/serial` backends process one element at a time and are intended for meshes
with a smaller number of high-order elements. The `/cpu/self/*/blocked` backends process
blocked batches of eight interlaced elements and are intended for meshes with larger numbers
of elements.

The `/cpu/self/ref/*` backends are written in pure C and provide basic functionality.

The `/cpu/self/opt/*` backends are written in pure C and use partial e-vectors to improve performance.

The `/cpu/self/avx/*` backends rely upon AVX instructions to provide vectorized CPU performance.

The `/cpu/self/xsmm/*` backends rely upon the [LIBXSMM](http://github.com/hfp/libxsmm) package
to provide vectorized CPU performance. If linking MKL and LIBXSMM is desired but
the Makefile is not detecting `MKLROOT`, linking libCEED against MKL can be
forced by setting the environment variable `MKL=1`.

The `/cpu/self/memcheck/*` backends rely upon the [Valgrind](http://valgrind.org/) Memcheck tool
to help verify that user QFunctions have no undefined values. To use them, run your code with
Valgrind and a Memcheck backend, e.g. `valgrind ./build/ex1 -ceed /cpu/self/memcheck`. A
'development' or 'debugging' version of Valgrind with headers is required to use this backend.
This backend can be run in serial or blocked mode and defaults to serial mode
if `/cpu/self/memcheck` is selected at runtime.

The `/*/occa` backends rely upon the [OCCA](http://github.com/libocca/occa) package to provide
cross-platform performance.

The `/gpu/cuda/*` backends provide GPU performance strictly using CUDA.

The `/gpu/magma` backend relies upon the [MAGMA](https://bitbucket.org/icl/magma) package.
144