===================================================================

This document is a README for the parallel AMG (algebraic multigrid)
code for the Pressure Poisson Equation (PPE) in PHASTA. If you want to
modify the code, add functionality, or simply understand it, please use
this document as a guide.

It is intended to serve as extra commentary on the source, not to be
published. If you make changes, please also update this file.

Basic knowledge of sparse matrices is assumed; read the paper/thesis as
you continue. Row-wise (CSR) sparse storage is used everywhere in this
document.

Chun Sun, March, 2008

==================================================================

*** Files related to AMG in the ../common/ and ../incompressible/
directories:

** solfar.f

   Two things involving AMG are done here between #ifdef AMG
statements: 1) set up AMG by calling ramg_control (don't worry, you
won't redo the full setup every time; ramg_control handles that), and
2) build a loop for smartAMG, which is never used.

** lestools.c

   Provides the C function LesPrecPPE. LesLIB calls this function for
preconditioning, and it in turn calls the Fortran routine ramg_interface
to do the actual preconditioning. This is the interface between our AMG
and the ACUSIM solver.

** input_fforms.cc

   Handles I/O from input.config and solver.inp; nothing special.

**** in AMG directory ******

* ramg_data.f

   We have to start from this file. It is a module containing global
control variables, matrices, and vectors. Having one global module and
using it wherever necessary saves us from passing long parameter lists
on every function call (though it is still tedious in some places).

   First we define derived data types: pointers to arrays of different
dimensions. These are used, as shown later in this file, for submatrix
communication and for building the PPE multilevel hierarchy.

   Then we have the multilevel AMG hierarchy stored in the amg_A_*
arrays as sparse matrices. The amg_para* arrays are for communication.
amg_ilwork and amg_nlwork are the same as ilwork and nlwork, but for
the higher (coarser) levels, and therefore smaller. amg_ppeDiag holds
the important diagonal scaling matrices for the different levels.
CF_map and CF_revmap (reverse map) are arrays recording coarsening
information.

* ramg_coarse.f

The purpose of this file is 1) CF-splitting and 2) setting up the
interpolation operator.

CF-splitting is done in a separate routine, ramg_CFsplit. Beyond what
the thesis covers, one thing to note is that we use a heap for the
lambda array: we keep modifying this array, and each time we want to
pull the highest value out of it, so a heap is a natural choice.

After the CF-split, we do a communication, from ramg_update_cfmap to
ramg_readin_cfmap. This ensures that shared boundary nodes get the same
coarsening on both processors. Note that commout does not pass data for
integers (it only works with real*8 numbers), so we have commout_i to
do this for integers; check commout_i.

CF_map is an array that orders all the nodes "coarse first, fine
second". This order is used in Gauss-Seidel smoothing, where we smooth
the nodes in exactly this order, i.e. "red-black" Gauss-Seidel; see the
paper/thesis for a description.

amg_paramap is an array recording "grouping" information for coarsening
by group in parallel. Each node is assigned a "group id". For interior
nodes, the group id is the rank of the local processor. For a boundary
node, if it is a "master", the group id is the neighbour's rank; if it
is a "slave", the group id is the negative
of the neighbour's rank (see ramg_initbcflag in ramg_paratools.f). This
information is computed at the very beginning for the finest level.
After each coarsening it is carried up to the next level in
ramg_coarse.f and used for communication at the higher levels. Also,
CF-splitting is done within each group; if you look at every
"if (afmap....)" test in ramg_CFsplit you will see that.

Then we build the temporary variable amg_I for the interpolation
operator. The interpolation operator is a sparse matrix, but it is hard
to build in one shot, so amg_I, an array of allocatable pointers, lets
us build it row by row. Eventually it is assembled into I_cf_*, the
interpolation operator matrix, which is then transposed to form I_fc_*.
Pay attention to the transpose algorithm; it is a bit involved but
standard for row-wise sparse storage.

* ramg_control.f

In ramg_prep, note that the variable maxstopsign controls coarsening:
coarsening stops when this value is true. How is it decided? It becomes
true only when every processor reports true; otherwise coarsening
continues. Processors that have already finished coarsening and hold a
true value produce a trivial identity matrix as their interpolator, a
tiny bit of waste compared to communicating the stopping criteria (see
thesis).
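The stopping logic above can be sketched as follows (Python,
illustrative only; coarsening_stops and identity_interp are
hypothetical names, and the AND across processors is an MPI allreduce
in the real code):

```python
def coarsening_stops(local_done):
    """maxstopsign logic: coarsening stops only when EVERY processor
    reports it is done -- a logical AND across ranks (an
    MPI_Allreduce with MPI_LAND in the actual implementation)."""
    return all(local_done)

def identity_interp(n):
    """Trivial identity interpolator (CSR, 0-based; colm = row
    pointers, rowp = column indices, PHASTA naming) produced by a
    rank that has finished coarsening while others continue."""
    colm = list(range(n + 1))   # each row has exactly one entry
    rowp = list(range(n))       # ... on the diagonal
    vals = [1.0] * n
    return colm, rowp, vals
```

Ranks that keep emitting the identity do a little redundant work, but
that is the "tiny bit of waste" accepted in exchange for not
communicating the stopping criteria.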
Also in ramg_control, we do an LU decomposition for the coarsest level
if a direct solve at the coarsest level is selected. Each iteration
then only needs a back-substitution, which saves a great deal of time,
especially when the coarsest level is somewhat larger.

For the "smartamg" prototype, we check 7 consecutive norms from CG
(stored in ramg_window/ramg_winct); if they do not go down, we restart
AMG-CG. Before the restart we cheat ACUSIM by giving it zero norms for
the first sweep of solves (last iteration of CG and first iteration of
GMRES). This whole segment might be abandoned if ACUSIM gives us a
better interface and norm control.

* ramg_driver.f

The main wrapper, nothing fancy. ramg_interface is the only gate to the
actual ACUSIM solve. In ramg_driver itself we have left open the
possibility (mode) of running our own stand-alone preconditioned CG,
which is never used.

Pay attention to the diagonal scaling in the V-cycle: at the finest
level the PPE has already been scaled to unit diagonal, so we scale
only at the second level and above.

(A note while writing this: there seem to be two places where we could
improve efficiency. For the second level and above, the scaling can be
incorporated into the interpolation operator, i.e. interpolation is
modified to directly produce a scaled matrix. The other is the
allocation of the coarse-level vectors: we don't need to do this every
iteration. It could be done once in the extract phase and the vectors
reused every time.)

* ramg_extract.f

Sparse matrix-matrix product: you need to do three steps separately:
count the nonzeros, form colm, then form rowp and A. Yes, you could
combine these three loops into one, but that would dramatically
decrease efficiency. A rule of thumb: breaking a complicated loop into
several simpler loops improves efficiency (they optimize better for
vector operations), even though the math hasn't changed. Here, though,
combining the matrix-matrix product into one loop would also change the
algorithm into a more complicated one. Another rule of thumb: use fewer
"if"s inside a loop; that also improves efficiency.

Now to the parallel part. lhsGP* is a sparse-communicated duplicate of
the lhsP matrix.

* ramg_ggb.f

You must read the GGB section of Chun Sun's thesis. ARPACK/PARPACK is
used through a user-specified Ap-product, here in ggb_av.
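ARPACK only ever sees the operator through such a product callback. The
contract is easiest to see on a toy eigensolver; here is a plain power
iteration in Python (a sketch of the callback idea only, not ggb_av or
ARPACK's reverse-communication loop itself):

```python
import math

def power_iteration(matvec, n, iters=200):
    """Estimate the dominant eigenpair of an operator known only
    through its matrix-vector product -- the same contract ARPACK
    imposes on the user-supplied Ap-product (ggb_av)."""
    v = [1.0 / math.sqrt(n)] * n      # unit-norm starting vector
    lam = 0.0
    for _ in range(iters):
        w = matvec(v)                 # the only access to the matrix
        lam = sum(wi * vi for wi, vi in zip(w, v))  # Rayleigh quotient
        nrm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / nrm for wi in w]
    return lam, v
```

The point is that the eigensolver never needs the matrix itself, which
is why the parallel Ap-product machinery in ramg_tools.f can be reused
directly.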
generate_gmap creates a mapping and reverse mapping (gmap, grevmap)
between the GGB vector layout (without overlapped boundary nodes) and
the PHASTA layout (with overlapped boundary nodes). ggb_G2A and ggb_A2G
are the actual function calls that apply them.

In ggb_setup there is an Allreduce with MPI_MAX; this searches for the
maximum vector length for allocation. ARPACK cannot size its memory
differently on each processor, so we allocate the maximum nshg on every
processor.

There are three things you may want to modify in this GGB
implementation:
1) Do the LU decomposition once in the setup, and in the GGB apply do
only a back-substitution (as we do in the coarsest-level direct
solver). This saves time.
2) Figure out how the tolerance (tol = 1.0E-3) relates to the actual
performance.
3) Eventually you will want your own eigensolver; we can't use ARPACK
forever.

* ramg_ITAI.f

This computes A_(l+1) = Q^T * A_l * Q, the creation of the coarser
matrices. Note that in the code we use "I" instead of "Q".

When you do the sparse matrix-matrix product, don't attempt it in one
loop. Use three loops: count the nnz, create colm, then create rowp
and mtx.
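The three-pass structure can be sketched in Python (0-based, with the
document's colm/rowp/mtx naming; csr_matmul is a hypothetical helper,
not the Fortran in ramg_ITAI.f):

```python
def csr_matmul(n_rows, n_cols, a_colm, a_rowp, a_mtx,
               b_colm, b_rowp, b_mtx):
    """C = A*B in row-wise (CSR) storage, built in three passes:
    1) count the nonzeros per row of C, 2) prefix-sum the counts
    into colm (row pointers), 3) fill rowp (column indices) and mtx
    (values). n_cols is the number of columns of B."""
    # pass 1: symbolic count of nonzeros in each row of C
    marker = [-1] * n_cols
    c_colm = [0] * (n_rows + 1)
    for i in range(n_rows):
        for ka in range(a_colm[i], a_colm[i + 1]):
            k = a_rowp[ka]
            for kb in range(b_colm[k], b_colm[k + 1]):
                j = b_rowp[kb]
                if marker[j] != i:
                    marker[j] = i
                    c_colm[i + 1] += 1
    # pass 2: running sum turns the counts into row pointers
    for i in range(n_rows):
        c_colm[i + 1] += c_colm[i]
    # pass 3: numeric fill of rowp and mtx
    nnz = c_colm[n_rows]
    c_rowp = [0] * nnz
    c_mtx = [0.0] * nnz
    marker = [-1] * n_cols
    pos = [0] * n_cols          # slot of column j in the current row
    for i in range(n_rows):
        nxt = c_colm[i]
        for ka in range(a_colm[i], a_colm[i + 1]):
            k = a_rowp[ka]
            av = a_mtx[ka]
            for kb in range(b_colm[k], b_colm[k + 1]):
                j = b_rowp[kb]
                if marker[j] != i:      # first hit on column j
                    marker[j] = i
                    pos[j] = nxt
                    c_rowp[nxt] = j
                    c_mtx[nxt] = av * b_mtx[kb]
                    nxt += 1
                else:                   # accumulate into existing slot
                    c_mtx[pos[j]] += av * b_mtx[kb]
    return c_colm, c_rowp, c_mtx
```

Merging the passes into one loop would force dynamic reallocation and
extra branching inside the innermost loop, which is exactly what the
rules of thumb above warn against.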
This looks like three loops where one would do, but it saves time and
is easier for the compiler to vectorize.

As for the diagonal scaling to unity, you may want to test how to
disable it by modifying the (polynomial) smoother.

* ramg_mls.f

Read the thesis, and the appendix too; everything is straightforward if
you do. The setup has a structure similar to the GGB setup. There are
segments of code for so-called "post-smoothing", but they are never
activated; don't worry about them.

I'm covering ramg_cheby.f here too, because both are polynomial
smoothers. ramg_chebyratio is an input coefficient that determines the
smallest eigenvalue you want to smooth with the Chebyshev polynomial;
it is set to 10 in input.config. You can change it to the ratio of
coarse/fine nodes, though I don't think it will make a substantial
difference.

* ramg_paratools.f

Very important tools for parallelization.

global_lhs is a function that reads in lhsP (in sparse format, so there
are three arrays) and outputs lhsGP (the boundary-boundary interactions
summed and then put back on each processor; read the thesis and
slides).
There are tricky steps: first you need to extract the G_bb matrices;
then the summation alters the sparse structure, and you have to alter
the structure of lhsP too. This is done task by task, with mappings and
reverse mappings used throughout the routine. I suggest you leave this
subroutine as is; if you really need to change something, read the
source very carefully. One caveat: we do not handle triple boundaries.
If a boundary node has 3 or more representations (lives on 2 or more
tasks), only one pairing is considered. The result is that the G_bb
values for such nodes come out slightly smaller, which is not a big
issue for overall performance; it is a direct consequence of PHASTA's
task structure. I mention it here because if you look really carefully
you will find this problem, but don't worry about it (it only changes
G_bb slightly).

The PPE_Ap routines perform the matrix-vector product for the PPE; just
pay attention to the scaling and the corresponding function calls in
ramg_tools.f.

init_ilwork initializes all the ilwork structures for the higher-level
AMG matrices. Everything is similar to plain CG, but smaller in size
because of coarsening.

init_bcflag initializes the global array amg_paramap. Ignore the first
bunch
of the code (the "if (numpe.lt.2) then" branch); that is for the
reduced serial mode, which we no longer use.

ramg_commout and ramg_commin are the same as commout and commin, but
for the higher-level AMG systems.

Okay, that's it. Again: be very careful in this file.

* ramg_smoother.f

A wrapper for G-S, Jacobi, Chebyshev, and MLS. Nothing fancy here.

The only thing to note is that in Gauss-Seidel we follow red-black
smoothing, coarse first and fine second (or the other way around). We
cannot use a parallel direct solve (or CG! Farzin said so, because that
would destroy the preconditioner: the preconditioner would not be the
same from solve to solve, while it must stay the same and symmetric).
Where a direct solve is ruled out, use a smoother instead.

* ramg_tools.f

Nothing fancy here either, but there are a lot of output tools you may
want to use: the ramg_dump* functions. Search for them to find
examples; they are very convenient for parallel debugging and MATLAB
verification.

calcAv_g is the global Ap-product for all AMG matrices; the kernel is
here.


Good luck.
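Appendix: the CF-ordered ("red-black") Gauss-Seidel sweep mentioned
under ramg_smoother.f, reduced to a minimal serial sketch (dense matrix
and the hypothetical name cf_gauss_seidel for clarity; the real
smoother is sparse and parallel):

```python
def cf_gauss_seidel(A, b, x, cf_map, sweeps=1):
    """Gauss-Seidel sweeps that visit the unknowns in the order given
    by cf_map -- coarse points first, fine points second -- instead
    of the natural ordering. Dense A for clarity; illustrative
    only."""
    n = len(b)
    for _ in range(sweeps):
        for i in cf_map:                 # coarse-first ordering
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]  # relax unknown i in place
    return x
```

Because updates happen in place, the fine points are relaxed with the
already-updated coarse values, which is the whole point of the CF_map
ordering built in ramg_coarse.f.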