===================================================================

This document is a README for the parallel AMG (algebraic multigrid)
code for the Pressure Poisson Equation (PPE) in PHASTA. If you want to
modify the code, add functionality, or simply understand it, please use
this document as a guide.

It is intended to serve as extra commentary on the source, not to be
published. If you make changes, please also update this file.

Basic knowledge of sparse matrices is assumed; read the paper/thesis as
you continue. Row-wise (CSR) sparse storage is used everywhere in this
document.

Chun Sun, March, 2008

==================================================================

*** Files related to AMG in the ../common/ and ../incompressible/
directories:

** solfar.f

   Two things involving AMG are done here between #ifdef AMG
statements: 1) set up AMG by calling ramg_control (don't worry, you
won't redo the full setup every time; ramg_control handles that), and
2) build a loop for smartAMG, which is never used.

** lestools.c

   Provides the C function LesPrecPPE. LesLIB calls this function for
preconditioning, and it in turn calls the Fortran routine ramg_interface
to do the actual preconditioning. This is the interface between our AMG
and the ACUSIM solver.

** input_fforms.cc

   Handles I/O from input.config and solver.inp; nothing special.

**** in AMG directory ******

* ramg_data.f

   We have to start from this file. It is a module containing global
control variables, matrices, and vectors. Having one global module and
using it wherever necessary saves us from passing long parameter lists
on every function call (though it is still tedious in some places).

   First we define derived data types: pointers to arrays of different
dimensions. These are used, as shown later in this file, for submatrix
communication and for building the PPE multilevel hierarchy.

   Then we have the multilevel AMG hierarchy stored in the amg_A_*
arrays as sparse matrices. The amg_para* arrays are for communication.
amg_ilwork and amg_nlwork are the same as ilwork and nlwork, but for
the higher (coarser) levels, and therefore smaller. amg_ppeDiag holds
the important diagonal scaling matrices for the different levels.
CF_map and CF_revmap (reverse map) are arrays recording coarsening
information.

* ramg_coarse.f

The purpose of this file is 1) CF-splitting and 2) setting up the
interpolation operator.

CF-splitting is done in a separate routine, ramg_CFsplit. Beyond what
the thesis covers, one thing to note is that we use a heap for the
lambda array: we keep modifying this array, and each time we want to
pull the highest value out of it, so a heap is a natural choice.

After the CF-split, we do a communication, from ramg_update_cfmap to
ramg_readin_cfmap. This ensures that shared boundary nodes get the same
coarsening on both processors. Note that commout does not pass data for
integers (it only works with real*8 numbers), so we have commout_i to
do this for integers; check commout_i.

CF_map is an array that orders all the nodes "coarse first, fine
second". This order is used in Gauss-Seidel smoothing, where we smooth
the nodes in exactly this order, i.e. "red-black" Gauss-Seidel; see the
paper/thesis for a description.

amg_paramap is an array recording "grouping" information for coarsening
by group in parallel. Each node is assigned a "group id". For interior
nodes, the group id is the rank of the local processor. For a boundary
node, if it is a "master", the group id is the neighbour's rank; if it
is a "slave", the group id is the negative
of the neighbour's rank (see ramg_initbcflag in ramg_paratools.f). This
information is computed at the very beginning for the finest level.
After each coarsening it is carried up to the next level in
ramg_coarse.f and used for communication at the higher levels. Also,
CF-splitting is done within each group; if you look at every
"if (afmap....)" test in ramg_CFsplit you will see that.

Then we build the temporary variable amg_I for the interpolation
operator. The interpolation operator is a sparse matrix, but it is hard
to build in one shot, so amg_I, an array of allocatable pointers, lets
us build it row by row. Eventually it is assembled into I_cf_*, the
interpolation operator matrix, which is then transposed to form I_fc_*.
Pay attention to the transpose algorithm; it is a bit involved but
standard for row-wise sparse storage.

* ramg_control.f

In ramg_prep, note that the variable maxstopsign controls coarsening:
coarsening stops when this value is true. How is it decided? It becomes
true only when every processor reports true; otherwise coarsening
continues. Processors that have already finished coarsening and hold a
true value produce a trivial identity matrix as their interpolator, a
tiny bit of waste compared to communicating the stopping criteria (see
thesis).
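The stopping logic above can be sketched as follows (Python,
illustrative only; coarsening_stops and identity_interp are
hypothetical names, and the AND across processors is an MPI allreduce
in the real code):

```python
def coarsening_stops(local_done):
    """maxstopsign logic: coarsening stops only when EVERY processor
    reports it is done -- a logical AND across ranks (an
    MPI_Allreduce with MPI_LAND in the actual implementation)."""
    return all(local_done)

def identity_interp(n):
    """Trivial identity interpolator (CSR, 0-based; colm = row
    pointers, rowp = column indices, PHASTA naming) produced by a
    rank that has finished coarsening while others continue."""
    colm = list(range(n + 1))   # each row has exactly one entry
    rowp = list(range(n))       # ... on the diagonal
    vals = [1.0] * n
    return colm, rowp, vals
```

Ranks that keep emitting the identity do a little redundant work, but
that is the "tiny bit of waste" accepted in exchange for not
communicating the stopping criteria.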
Also in ramg_control, we do an LU decomposition for the coarsest level
if a direct solve at the coarsest level is selected. Each iteration
then only needs a back-substitution, which saves a great deal of time,
especially when the coarsest level is somewhat larger.

For the "smartamg" prototype, we check 7 consecutive norms from CG
(stored in ramg_window/ramg_winct); if they do not go down, we restart
AMG-CG. Before the restart we cheat ACUSIM by giving it zero norms for
the first sweep of solves (last iteration of CG and first iteration of
GMRES). This whole segment might be abandoned if ACUSIM gives us a
better interface and norm control.

* ramg_driver.f

The main wrapper, nothing fancy. ramg_interface is the only gate to the
actual ACUSIM solve. In ramg_driver itself we have left open the
possibility (mode) of running our own stand-alone preconditioned CG,
which is never used.

Pay attention to the diagonal scaling in the V-cycle: at the finest
level the PPE has already been scaled to unit diagonal, so we scale
only at the second level and above.

(A note while writing this: there seem to be two places where we could
improve efficiency. For the second level and above, the scaling can be
incorporated into the interpolation operator, i.e. interpolation is
modified to directly produce a scaled matrix. The other is the
allocation of the coarse-level vectors: we don't need to do this every
iteration. It could be done once in the extract phase and the vectors
reused every time.)

* ramg_extract.f

Sparse matrix-matrix product: you need to do three steps separately:
count the nonzeros, form colm, then form rowp and A. Yes, you could
combine these three loops into one, but that would dramatically
decrease efficiency. A rule of thumb: breaking a complicated loop into
several simpler loops improves efficiency (they optimize better for
vector operations), even though the math hasn't changed. Here, though,
combining the matrix-matrix product into one loop would also change the
algorithm into a more complicated one. Another rule of thumb: use fewer
"if"s inside a loop; that also improves efficiency.

Now to the parallel part. lhsGP* is a sparse-communicated duplicate of
the lhsP matrix.

* ramg_ggb.f

You must read the GGB section of Chun Sun's thesis. ARPACK/PARPACK is
used through a user-specified Ap-product, here in ggb_av.
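ARPACK only ever sees the operator through such a product callback. The
contract is easiest to see on a toy eigensolver; here is a plain power
iteration in Python (a sketch of the callback idea only, not ggb_av or
ARPACK's reverse-communication loop itself):

```python
import math

def power_iteration(matvec, n, iters=200):
    """Estimate the dominant eigenpair of an operator known only
    through its matrix-vector product -- the same contract ARPACK
    imposes on the user-supplied Ap-product (ggb_av)."""
    v = [1.0 / math.sqrt(n)] * n      # unit-norm starting vector
    lam = 0.0
    for _ in range(iters):
        w = matvec(v)                 # the only access to the matrix
        lam = sum(wi * vi for wi, vi in zip(w, v))  # Rayleigh quotient
        nrm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / nrm for wi in w]
    return lam, v
```

The point is that the eigensolver never needs the matrix itself, which
is why the parallel Ap-product machinery in ramg_tools.f can be reused
directly.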
generate_gmap creates a mapping and reverse mapping (gmap, grevmap)
between the GGB vector layout (without overlapped boundary nodes) and
the PHASTA layout (with overlapped boundary nodes). ggb_G2A and ggb_A2G
are the actual function calls that apply them.

In ggb_setup there is an Allreduce with MPI_MAX; this searches for the
maximum vector length for allocation. ARPACK cannot size its memory
differently on each processor, so we allocate the maximum nshg on every
processor.

There are three things you may want to modify in this GGB
implementation:
1) Do the LU decomposition once in the setup, and in the GGB apply do
only a back-substitution (as we do in the coarsest-level direct
solver). This saves time.
2) Figure out how the tolerance (tol = 1.0E-3) relates to the actual
performance.
3) Eventually you will want your own eigensolver; we can't use ARPACK
forever.

* ramg_ITAI.f

This computes A_(l+1) = Q^T * A_l * Q, the creation of the coarser
matrices. Note that in the code we use "I" instead of "Q".

When you do the sparse matrix-matrix product, don't attempt it in one
loop. Use three loops: count the nnz, create colm, then create rowp
and mtx.
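The three-pass structure can be sketched in Python (0-based, with the
document's colm/rowp/mtx naming; csr_matmul is a hypothetical helper,
not the Fortran in ramg_ITAI.f):

```python
def csr_matmul(n_rows, n_cols, a_colm, a_rowp, a_mtx,
               b_colm, b_rowp, b_mtx):
    """C = A*B in row-wise (CSR) storage, built in three passes:
    1) count the nonzeros per row of C, 2) prefix-sum the counts
    into colm (row pointers), 3) fill rowp (column indices) and mtx
    (values). n_cols is the number of columns of B."""
    # pass 1: symbolic count of nonzeros in each row of C
    marker = [-1] * n_cols
    c_colm = [0] * (n_rows + 1)
    for i in range(n_rows):
        for ka in range(a_colm[i], a_colm[i + 1]):
            k = a_rowp[ka]
            for kb in range(b_colm[k], b_colm[k + 1]):
                j = b_rowp[kb]
                if marker[j] != i:
                    marker[j] = i
                    c_colm[i + 1] += 1
    # pass 2: running sum turns the counts into row pointers
    for i in range(n_rows):
        c_colm[i + 1] += c_colm[i]
    # pass 3: numeric fill of rowp and mtx
    nnz = c_colm[n_rows]
    c_rowp = [0] * nnz
    c_mtx = [0.0] * nnz
    marker = [-1] * n_cols
    pos = [0] * n_cols          # slot of column j in the current row
    for i in range(n_rows):
        nxt = c_colm[i]
        for ka in range(a_colm[i], a_colm[i + 1]):
            k = a_rowp[ka]
            av = a_mtx[ka]
            for kb in range(b_colm[k], b_colm[k + 1]):
                j = b_rowp[kb]
                if marker[j] != i:      # first hit on column j
                    marker[j] = i
                    pos[j] = nxt
                    c_rowp[nxt] = j
                    c_mtx[nxt] = av * b_mtx[kb]
                    nxt += 1
                else:                   # accumulate into existing slot
                    c_mtx[pos[j]] += av * b_mtx[kb]
    return c_colm, c_rowp, c_mtx
```

Merging the passes into one loop would force dynamic reallocation and
extra branching inside the innermost loop, which is exactly what the
rules of thumb above warn against.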
This looks like three loops where one would do, but it saves time and
is easier for the compiler to vectorize.

As for the diagonal scaling to unity, you may want to test how to
disable it by modifying the (polynomial) smoother.

* ramg_mls.f

Read the thesis, and the appendix too; everything is straightforward if
you do. The setup has a structure similar to the GGB setup. There are
segments of code for so-called "post-smoothing", but they are never
activated; don't worry about them.

I'm covering ramg_cheby.f here too, because both are polynomial
smoothers. ramg_chebyratio is an input coefficient that determines the
smallest eigenvalue you want to smooth with the Chebyshev polynomial;
it is set to 10 in input.config. You can change it to the ratio of
coarse/fine nodes, though I don't think it will make a substantial
difference.

* ramg_paratools.f

Very important tools for parallelization.

global_lhs is a function that reads in lhsP (in sparse format, so there
are three arrays) and outputs lhsGP (the boundary-boundary interactions
summed and then put back on each processor; read the thesis and
slides).
There are tricky steps: first you need to extract the G_bb matrices;
then the summation alters the sparse structure, and you have to alter
the structure of lhsP too. This is done task by task, with mappings and
reverse mappings used throughout the routine. I suggest you leave this
subroutine as is; if you really need to change something, read the
source very carefully. One caveat: we do not handle triple boundaries.
If a boundary node has 3 or more representations (lives on 2 or more
tasks), only one pairing is considered. The result is that the G_bb
values for such nodes come out slightly smaller, which is not a big
issue for overall performance; it is a direct consequence of PHASTA's
task structure. I mention it here because if you look really carefully
you will find this problem, but don't worry about it (it only changes
G_bb slightly).

The PPE_Ap routines perform the matrix-vector product for the PPE; just
pay attention to the scaling and the corresponding function calls in
ramg_tools.f.

init_ilwork initializes all the ilwork structures for the higher-level
AMG matrices. Everything is similar to plain CG, but smaller in size
because of coarsening.

init_bcflag initializes the global array amg_paramap. Ignore the first
bunch
of the code (the "if (numpe.lt.2) then" branch); that is for the
reduced serial mode, which we no longer use.

ramg_commout and ramg_commin are the same as commout and commin, but
for the higher-level AMG systems.

Okay, that's it. Again: be very careful in this file.

* ramg_smoother.f

A wrapper for G-S, Jacobi, Chebyshev, and MLS. Nothing fancy here.

The only thing to note is that in Gauss-Seidel we follow red-black
smoothing, coarse first and fine second (or the other way around). We
cannot use a parallel direct solve (or CG! Farzin said so, because that
would destroy the preconditioner: the preconditioner would not be the
same from solve to solve, while it must stay the same and symmetric).
Where a direct solve is ruled out, use a smoother instead.

* ramg_tools.f

Nothing fancy here either, but there are a lot of output tools you may
want to use: the ramg_dump* functions. Search for them to find
examples; they are very convenient for parallel debugging and MATLAB
verification.

calcAv_g is the global Ap-product for all AMG matrices; the kernel is
here.


Good luck.
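Appendix: the CF-ordered ("red-black") Gauss-Seidel sweep mentioned
under ramg_smoother.f, reduced to a minimal serial sketch (dense matrix
and the hypothetical name cf_gauss_seidel for clarity; the real
smoother is sparse and parallel):

```python
def cf_gauss_seidel(A, b, x, cf_map, sweeps=1):
    """Gauss-Seidel sweeps that visit the unknowns in the order given
    by cf_map -- coarse points first, fine points second -- instead
    of the natural ordering. Dense A for clarity; illustrative
    only."""
    n = len(b)
    for _ in range(sweeps):
        for i in cf_map:                 # coarse-first ordering
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]  # relax unknown i in place
    return x
```

Because updates happen in place, the fine points are relaxed with the
already-updated coarse values, which is the whole point of the CF_map
ordering built in ramg_coarse.f.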