===================================================================

This document is a README file for coding parallel AMG for Pressure
Poisson Equation in PHASTA. If you want to start to modify or add
functionality or simply want to understand the code, please use this
document as a help.

This document is intended to serve as extra comments for the source, not
to be published. You made your changes, please also update this file.

Please have basic knowledge of sparse matrices, and read paper/thesis as
you continue. row-wise sparse storage is used everywhere in this
document.

Chun Sun, March, 2008

==================================================================

*** Files related to AMG in the ../common/ and ../incompressible/
directories:

** solfar.f
    Two things involving AMG are done here between #ifdef AMG
statements: 1) setup AMG by calling ramg_control. ( don't worry you'll
do setup everytime, ramg_control handles that. ) 2) Build a loop for
smartAMG, which is never used.

** lestools.c
    Provides C function: LesPrecPPE. LesLIB calls this function for
preconditioning; and this function calls fortran function ramg_interface
to do the actual preconditioning. This is the interface between our AMG
and ACUSIM solver.

** input_fforms.cc

    Handling I/O from input.config and solver.inp, nothing special.

**** in AMG directory ******

* ramg_data.f

  We have to start from this file. This is a module file containing
global control variables, matrices, and vectors. To have a global module
file and use it whenever necessary, saved us passing long parameters
every function call (though it's still tedious in some places). 

  First we have defined data type: pointer to arrays with different
dimentions. This is used, as shown in this file later, in submatrices
communication and build PPE multilevels. 
  
  Then we have our multilevel AMG stored in amg_A_* arrays as sparse
matrices. amg_para* arrays are for communication. amg_ilwork, amg_nlwork
are same as ilwork and nlwork, but in higher levels with smaller size.
amg_ppeDiag is important diagonal scaling matrices for different levels.
CF_map, CF_rev(erse)map are arrays recording coarsening information.

* ramg_coarse.f

The purpose of this file is: 1) CF-spliting, 2) setup interpolation
operator. 

CF-spliting is done in a seperate function ramg_CFsplit. Other than the
thesis, one thing to note is that we use heap sort for lambda array. The
good thing is we keep changing this array and everytime we want to get
the highest value out from the array. So heap sort is a natural choice. 

After CF-split, we do a communication: from ramg_update_cfmap to
ramg_readin_cfmap. This is to ensure that shared boundary nodes get same
coarsening on both processors. Note commout does not pass property for
integers(it only work with real*8 numbers), so we have commout_i to do
this for integer, check that commout_i.

CF_map is an array to set the order for all the nodes according to
"coarse first, fine second". This order will be used in Gauss-Seidel
smoothing, when we smooth nodes according to this order. aka,
"red-black" Gauss-Seidel, see paper/thesis for description.

amg_paramap is an array to record "grouping" information for coarsening
by group in parallel. For each node, there is a "group id" assigned to
it. For all interior node, the "group id" is the rank of the local
processor. For a boundary node, if it is a "master", the rank of it is
the neighbour rank; if it is a "slave", the rank of it is the negative
of the neighbour rank. (see ramg_initbcflag in ramg_paratools.f). This
information is computed at very beginning for the finest level. Then
after each coarsening, the information is carried on to higher level in
ramg_coarse.f, and will be used for communication at higher levels.
Also, CF-spliting is done within each group, if you pay attention to
every "if (afmap....)" in ramg_CFsplit you will see that.

Then we build temp variable amg_I for interpolation operator.
Interpolation operator is a sparse matrix, but it's hard to build it at
once. amg_I is used then. It's an array of allocatable pointers. So we
can bulid row by row. Eventually construct it to form I_cf_* which is
interpolation operator matrix. And transpose it to form I_fc_*. Pay
attention to transpose algorithm, it's a bit complicated but as usual,
standard for row-wise sparse storage. 

* ramg_control.f

In ramg_prep, please note that variable maxstopsign is the control
variable for coarsening. Coarsening is stopped if this value is true.
How do we decide it? When every processor has true value. Otherwise,
coarsening is going on. For those already got coarsened and have true
value, the result of coarsening will be trivial identity matrix as
interpolator. A tiny bit of waste compare to communicate criterias (see
thesis).

Also in ramg_control, we do lu decomposition for the coarsest level, if
direct solve for coarsest level is selected. Thus, each iteration will
only need to do a back substitution, which saves time greatly,
especially when you have a bit larger coarest level.

For "smartamg" prototype, we check consecutive 7 norms from CG (stored
in ramg_window/ramg_winct), if it does not go down, we start restart
AMG-cg. Before restart we just cheat ACUSIM by giving it zero norms for
the first sweep of solves (last iteration of CG and first iteration of
GMRES). This whole segment might be abandoned provided ACUSIM give us
better interface and norm control.

* ramg_driver.f

The main wrap, nothing fancy. ramg_interface is the only gate to the
actual ACUSIM solve. In the actual ramg_driver, we have left
possibilities (mode) to do a stand-alone Preconditioned CG by us, which
is never used.

Pay attention to diagonal scaling in V-cycle, at the finest level, the
PPE has already been scaled to diagonal unity. We do scaling only at
second level and above.

(Now when I was writing these: There seems to be two places that we can
improve efficiency: For second level and above, scaling can be
incorporated with interpolation operator, i.e. interpolation is modified
to directly give a scaled matrix. Another place is the allocation of
coarse level vector. We don't need to do this every iteration. This can
be done in extract and use the vector everytime.) 

* ramg_extract.f

sparse matrix-matrix production: You need to do these three steps
seperatedly: calculate nonzeros; form colm; form rowp and A. Yes there
are possibilities that you combine these three loops into one, but that
will dramatically decrease efficiency. Actually, a rule of thumb: break
the complicated loop into several less complicated loops will
increaseing efficiency (better optimized for vector operations), even if
the math hasn't changed. But here if you combine matrix-matrix product
into one loop, the math will change to a more complicated one. Oh,
another rule of thumb: use less "if" in a loop, that will increase
efficiency.

Now to parallel. lhsGP* is a sparse-communicated duplicate of lhsP
matrix.

* ramg_ggb.f

Must read the section of GGB in Chun Sun's thesis. ARPACK/PARPACK is
used through a user-specified AP-product. here in ggb_av.

generate_gmap creates a mapping and reverse mapping (gmap,grevmap) for a
vector mapping from GGB (without overlap boundary) to phasta format
(with overlap boundary). ggb_G2A and ggb_A2G are actual function calls
to do that.

In ggb_setup, there is a Allreduce with MPI_MAX, this is to search for a
maximum vector length for vector allocation. ARPACK does not dynamic
memory at different processor, so we allocate maximum nshg for every
proc.

There are 3 things you may want to modify for this GGB implementation. 
1) You may want to do lu-decomposition first in the setup, then in GGB
apply, do a simple back-substitution. (like we did in coarsest solver
if using direct). This saves time.
2) You may want to figure out how the tolerance are related to the
actual performance (tol = 1.0E-3)? 
3) Eventually you want your own eigen calculator, can't use ARPACK
forever.

* ramg_ITAI.f

This does A_(l+1)=Q^T*A_l*Q, creation of coarser matrices.

Note in the code we use "I" instead of "Q".

When you do sparse matrix-matrix product, don't attempt to do it in one
loop. Use three loops, calculate nnz, create colm, then create rowp and
mtx. This looks like you are using 3 loops over 1, but it saves you
time. and easier for compiler to vectorify.

As to diagonal scaling to unity, maybe you want to test out how to
disable that with modification to the smoother (polynomial smoother)

* ramg_mls.f

Read thesis, and appendix too, everything's straight-forward if you do
so. For the setup, similar structure as you do GGB. There are segments
of the code that do the so called "post-smoothing", but they are never
activated. Don't worry about that though.

I'm writing things in ramg_cheby.f here too, coz they are all polynomial
smoothers. ramg_chebyratio is an input coefficient to determine the
smallest eigenvalue you want to smooth using chebyshev polynomial, it is
set to 10 in input.config. You can change it to the ratio of coarse/fine
nodes, though I don't think there will be substantial difference.

* ramg_paratools.f

Very important tools for parallelization.

global_lhs is a function that reads in lhsP (in sparse format so there
will be three arrays), and output lhsGP (summation of boundary-boundary
interactions then put back to each processor, read thesis and slides).
There are tricky steps, first you need to extract the G_bb matrices,
then the summation will alter the sparse structure, and you will have to
alter the structure of lhsP too. This is done task by task.  There are
mappings, reverse mappings all used inside this routine. I suggest you
leave this subroutine as is. If you really want to do something, read
the source very carefully. One thing about this, is that we are not
considering triple boundaries, if a boundary node has 3 or more
representations (lives in 2 or more tasks), only one will be considered.
This will result the G_bb value for two such nodes slightly smaller,
which won't be a big issue for the overall performance. It is a direct
result from the task structure of phasta. I'm mentioning it here because
if you look really really carefully you will find this problem, but
don't worry about it. (slightly different value of G_bb only).

PPE_Ap routines perform matrix-vector product for PPE, just pay
attention to scaling and corresponding function calls in ramg_tools.f

init_ilwork initialize all the ilwork structure for higher level AMG
matrices. Everything is similar to plain CG, but smaller in size because
of coarsening.

init_bcflag initialize global array amg_paramap. Ignore the first bunch
of the code (if (numpe.lt.2) then) because that's for reduced serial,
which we no longer use.

ramg_commout, ramg_commin, same commout, commin for higher level AMG
systems.

Okay that's it, again, be very carefull in this file. 

* ramg_smoother.f

It's a wrapper for G-S, Jacobi, Cheby, and MLS. Nothing fancy here.

Only thing to note is in Gauss-Seidel, we follow red-black smoothing,
coarse first fine second (or other way around). In direct solve we can't
do parallel direct solve (or CG! Farzin said so because that destroy
preconditioner. i.e. preconditioner is not the same solve by solve,
while it should be the same and symmetric). Let's suggest a smoother to
make the direct solve. 

* ramg_tools.f

nothing fancy here. but there are  a lot of output tools you may want to
use ramg_dump* functions. Search for them to find examples, very
convenient when you go to parallel debugging, and do matlab
verification.

calcAv_g is the global Ap-product for all AMG-matrices, the kernel is
here.


Good luck.