===================================================================

This document is a README file for coding parallel AMG for the Pressure
Poisson Equation (PPE) in PHASTA. If you want to start modifying or
adding functionality, or simply want to understand the code, please use
this document as a help.

This document is intended to serve as extra comments for the source, not
to be published. If you make changes, please also update this file.

Please have basic knowledge of sparse matrices, and read the
paper/thesis as you continue. Row-wise sparse storage is used everywhere
in this document.

Chun Sun, March, 2008

==================================================================

*** Files related to AMG in the ../common/ and ../incompressible/
directories:

** solfar.f

Two things involving AMG are done here between #ifdef AMG statements:
1) set up AMG by calling ramg_control (don't worry that it is called
every time; ramg_control handles that); 2) build a loop for smartAMG,
which is never used.

** lestools.c

Provides the C function LesPrecPPE. LesLIB calls this function for
preconditioning, and this function calls the Fortran routine
ramg_interface to do the actual preconditioning. This is the interface
between our AMG and the ACUSIM solver.

** input_fforms.cc

Handles I/O from input.config and solver.inp, nothing special.

**** in AMG directory ******

* ramg_data.f

We have to start from this file. This is a module file containing
global control variables, matrices, and vectors. Having a global module
file and using it whenever necessary saves us passing long parameter
lists on every function call (though it is still tedious in some
places).

First we define a derived data type: a pointer to arrays of different
dimensions. This is used, as shown later in this file, in submatrix
communication and in building the PPE multilevels.
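As a reminder of the row-wise storage convention used throughout (the
colm/rowp/A triple), here is a minimal Python illustration. The real
code is Fortran and 1-based; the function names here are invented for
the sketch only.

```python
# Minimal sketch of row-wise ("CSR") sparse storage as used throughout
# the AMG code:
#   colm : row pointers; row i's entries sit in colm[i]..colm[i+1]-1
#   rowp : column indices of the nonzeros
#   a    : the nonzero values
# 0-based here for Python; the Fortran source is 1-based.

def dense_to_csr(dense):
    """Convert a dense 2-D list to the (colm, rowp, a) triple."""
    colm, rowp, a = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                rowp.append(j)
                a.append(v)
        colm.append(len(rowp))
    return colm, rowp, a

def csr_row(colm, rowp, a, i):
    """Row i as (column, value) pairs -- the access pattern every
    loop in the AMG code follows."""
    return [(rowp[k], a[k]) for k in range(colm[i], colm[i + 1])]
```

For the tridiagonal stencil [[4,-1,0],[-1,4,-1],[0,-1,4]] this gives
colm = [0, 2, 5, 7], i.e. rows with 2, 3, and 2 nonzeros.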
Then we have our multilevel AMG stored in the amg_A_* arrays as sparse
matrices. The amg_para* arrays are for communication. amg_ilwork and
amg_nlwork are the same as ilwork and nlwork, but at higher levels with
smaller size. amg_ppeDiag holds the important diagonal scaling matrices
for the different levels. CF_map and CF_rev(erse)map are arrays
recording coarsening information.

* ramg_coarse.f

The purpose of this file is: 1) CF-splitting, 2) setting up the
interpolation operator.

CF-splitting is done in a separate function, ramg_CFsplit. Beyond what
is in the thesis, one thing to note is that we use a heap for the lambda
array. The point is that we keep changing this array, and every time we
want to get the highest value out of it, so a heap is the natural
choice.

After the CF-split, we do a communication: from ramg_update_cfmap to
ramg_readin_cfmap. This ensures that shared boundary nodes get the same
coarsening on both processors. Note that commout does not pass
properties for integers (it only works with real*8 numbers), so we have
commout_i to do this for integers; check that commout_i.

CF_map is an array that sets the order of all the nodes according to
"coarse first, fine second". This order will be used in Gauss-Seidel
smoothing, when we smooth nodes in this order, aka "red-black"
Gauss-Seidel; see the paper/thesis for a description.

amg_paramap is an array recording "grouping" information for coarsening
by group in parallel. Each node is assigned a "group id". For every
interior node, the group id is the rank of the local processor. For a
boundary node, if it is a "master", its group id is the neighbour's
rank; if it is a "slave", its group id is the negative of the
neighbour's rank (see ramg_initbcflag in ramg_paratools.f). This
information is computed at the very beginning for the finest level.
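The heap argument above can be made concrete. The following is a
generic, greatly simplified Ruge-Stueben-style selection loop, NOT
ramg_CFsplit itself: there are no groups and no communication, and
lambda here is just an undecided-neighbour count. It shows why a heap
fits when the measure keeps changing under you.

```python
# Simplified sketch of heap-driven CF-splitting (coarse/fine selection).
# Python's heapq is a min-heap, so we push -lambda to pop the maximum.
# Stale heap entries are skipped lazily -- the standard trick when
# priorities change while items are in the heap.
import heapq

def cf_split(adj):
    """adj: list of neighbour lists.  Returns a 'C'/'F' flag per node."""
    n = len(adj)
    lam = [len(nbrs) for nbrs in adj]      # initial measure: degree
    state = ['U'] * n                      # U = undecided
    heap = [(-lam[i], i) for i in range(n)]
    heapq.heapify(heap)
    while heap:
        neg, i = heapq.heappop(heap)
        if state[i] != 'U' or -neg != lam[i]:
            continue                       # stale entry, skip it
        state[i] = 'C'                     # highest lambda becomes coarse
        for j in adj[i]:
            if state[j] == 'U':
                state[j] = 'F'             # its neighbours become fine
                for k in adj[j]:           # nodes next to a new F point
                    if state[k] == 'U':    # get more attractive as C
                        lam[k] += 1
                        heapq.heappush(heap, (-lam[k], k))
    return state
```

On a 1-D chain 0-1-2-3-4 this produces the expected alternating
F, C, F, C, F splitting.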
Then, after each coarsening, the group information is carried on to the
higher level in ramg_coarse.f, and will be used for communication at the
higher levels. Also, CF-splitting is done within each group; if you pay
attention to every "if (afmap....)" in ramg_CFsplit you will see that.

Then we build the temporary variable amg_I for the interpolation
operator. The interpolation operator is a sparse matrix, but it is hard
to build it all at once, so amg_I is used. It is an array of allocatable
pointers, so we can build the operator row by row. Eventually we
assemble it to form I_cf_*, which is the interpolation operator matrix,
and transpose it to form I_fc_*. Pay attention to the transpose
algorithm; it is a bit complicated, but, as usual, standard for row-wise
sparse storage.

* ramg_control.f

In ramg_prep, please note that the variable maxstopsign is the control
variable for coarsening. Coarsening stops when this value is true, and
it becomes true only when every processor reports true; otherwise
coarsening continues. For processors that are already done coarsening
and have a true value, the result of further coarsening is a trivial
identity matrix as the interpolator. This wastes a tiny bit of work
compared to communicating the stopping criteria (see thesis).

Also in ramg_control, we do an LU decomposition of the coarsest level if
a direct solve for the coarsest level is selected. Each iteration then
only needs a back substitution, which saves time greatly, especially
when you have a somewhat larger coarsest level.

For the "smartamg" prototype, we check 7 consecutive norms from CG
(stored in ramg_window/ramg_winct); if the norm does not go down, we
restart AMG-CG. Before the restart we just cheat ACUSIM by giving it
zero norms for the first sweep of solves (the last iteration of CG and
the first iteration of GMRES). This whole segment might be abandoned if
ACUSIM gives us a better interface and norm control.

* ramg_driver.f

The main wrap, nothing fancy.
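For orientation before reading the driver: a generic two-level V-cycle
skeleton in Python. This is NOT the PHASTA driver -- there is no
amg_A_* hierarchy, no diagonal scaling, and no communication; the
damped-Jacobi smoother, the interpolation P, and the 1x1 coarsest-level
"direct solve" are all stand-ins for the routines this README
describes.

```python
# Generic V-cycle skeleton (illustrative only).  A level is (A, P):
# the operator on that level and the interpolation from the next
# coarser level; P is None on the coarsest level.

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def jacobi(A, b, x, sweeps=2, w=0.6):
    """Damped Jacobi smoothing, a stand-in for the real smoothers."""
    for _ in range(sweeps):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        x = [x[i] + w * r[i] / A[i][i] for i in range(len(x))]
    return x

def v_cycle(levels, b, x, lvl=0):
    A, P = levels[lvl]
    if P is None:                          # coarsest level: direct solve
        return [b[0] / A[0][0]]            # (1x1 here for simplicity)
    x = jacobi(A, b, x)                    # pre-smooth
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    Pt = [list(col) for col in zip(*P)]    # P^T
    rc = matvec(Pt, r)                     # restrict the residual
    ec = v_cycle(levels, rc, [0.0] * len(rc), lvl + 1)
    x = [xi + ei for xi, ei in zip(x, matvec(P, ec))]  # coarse correction
    return jacobi(A, b, x)                 # post-smooth
```

With a 3x3 tridiagonal (-1, 2, -1) fine matrix and a single coarse
point (so the Galerkin coarse operator P^T A P is 1x1), one cycle
already reduces the residual substantially.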
ramg_interface is the only gate to the actual ACUSIM solve. In
ramg_driver itself we have left a mode for doing a stand-alone
preconditioned CG ourselves, which is never used.

Pay attention to diagonal scaling in the V-cycle: at the finest level,
the PPE has already been scaled to diagonal unity. We do scaling only at
the second level and above.

(Now that I am writing this: there seem to be two places where we can
improve efficiency. For the second level and above, the scaling can be
incorporated into the interpolation operator, i.e. the interpolation is
modified to directly give a scaled matrix. The other place is the
allocation of the coarse-level vector. We don't need to do this every
iteration; it could be done in extract, and the vector reused every
time.)

* ramg_extract.f

Sparse matrix-matrix product: you need to do these three steps
separately: count the nonzeros; form colm; form rowp and A. Yes, it is
possible to combine these three loops into one, but that will
dramatically decrease efficiency. Actually, a rule of thumb: breaking a
complicated loop into several less complicated loops increases
efficiency (better optimized for vector operations), even if the math
hasn't changed. But here, if you combine the matrix-matrix product into
one loop, the math changes to a more complicated form. Oh, another rule
of thumb: use fewer "if"s inside a loop; that also increases efficiency.

Now to parallel. lhsGP* is a sparse-communicated duplicate of the lhsP
matrix.

* ramg_ggb.f

Be sure to read the GGB section of Chun Sun's thesis. ARPACK/PARPACK is
used through a user-specified Ap-product, here in ggb_av.

generate_gmap creates a mapping and reverse mapping (gmap, grevmap) for
vectors, mapping from the GGB format (without overlap boundary) to the
phasta format (with overlap boundary). ggb_G2A and ggb_A2G are the
actual function calls that do that.
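The three-step sparse matrix-matrix product described under
ramg_extract.f above (count nonzeros; form colm; form rowp and A) can be
sketched as follows. This is a generic illustration with invented
names, not the PHASTA routine: a symbolic pass over each row counts the
distinct columns, a prefix sum builds the row pointers, and a numeric
pass fills in the entries.

```python
# Three-pass sparse matrix-matrix product C = A*B in row-wise storage,
# following the README's recipe: (1) count nonzeros per row, (2) form
# the row-pointer array ccolm, (3) fill column indices and values.

def spgemm(acolm, arowp, aval, bcolm, browp, bval, ncols_b):
    nrows = len(acolm) - 1
    marker = [-1] * ncols_b                # which row last touched a column

    def row_nnz(i):
        """Symbolic pass: distinct columns in row i of C."""
        count = 0
        for ka in range(acolm[i], acolm[i + 1]):
            j = arowp[ka]
            for kb in range(bcolm[j], bcolm[j + 1]):
                c = browp[kb]
                if marker[c] != i:
                    marker[c] = i
                    count += 1
        return count

    # steps 1+2: nonzeros per row, prefix-summed into the row pointers
    ccolm = [0] * (nrows + 1)
    for i in range(nrows):
        ccolm[i + 1] = ccolm[i] + row_nnz(i)

    # step 3: numeric pass, accumulating each row's values
    crowp = [0] * ccolm[nrows]
    cval = [0.0] * ccolm[nrows]
    for i in range(nrows):
        acc = {}
        for ka in range(acolm[i], acolm[i + 1]):
            j, av = arowp[ka], aval[ka]
            for kb in range(bcolm[j], bcolm[j + 1]):
                c = browp[kb]
                acc[c] = acc.get(c, 0.0) + av * bval[kb]
        for k, c in enumerate(sorted(acc)):
            crowp[ccolm[i] + k] = c
            cval[ccolm[i] + k] = acc[c]
    return ccolm, crowp, cval
```

Keeping the symbolic and numeric passes separate is exactly the
"three loops over one" trade discussed above: each loop stays simple
enough for the compiler to optimize.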
In ggb_setup, there is an Allreduce with MPI_MAX; this searches for the
maximum vector length for vector allocation. ARPACK does not allocate
memory dynamically per processor, so we allocate the maximum nshg for
every proc.

There are 3 things you may want to modify in this GGB implementation:
1) You may want to do the LU decomposition once in the setup, then do a
simple back substitution in the GGB apply (like we did in the coarsest
solver when using a direct solve). This saves time.
2) You may want to figure out how the tolerance (tol = 1.0E-3) relates
to the actual performance.
3) Eventually you will want your own eigensolver; we can't use ARPACK
forever.

* ramg_ITAI.f

This does A_(l+1) = Q^T * A_l * Q, the creation of the coarser
matrices.

Note that in the code we use "I" instead of "Q".

When you do a sparse matrix-matrix product, don't attempt to do it in
one loop. Use three loops: count nnz, create colm, then create rowp and
mtx. This looks like 3 loops instead of 1, but it saves you time and is
easier for the compiler to vectorize.

As for the diagonal scaling to unity, maybe you want to test how to
disable that with a modification to the smoother (polynomial smoother).

* ramg_mls.f

Read the thesis, and the appendix too; everything is straightforward if
you do so. The setup has a similar structure to the GGB one. There are
segments of the code that do the so-called "post-smoothing", but they
are never activated. Don't worry about that, though.

I'm writing about ramg_cheby.f here too, because they are all polynomial
smoothers. ramg_chebyratio is an input coefficient that determines the
smallest eigenvalue you want to smooth with the Chebyshev polynomial; it
is set to 10 in input.config. You can change it to the ratio of
coarse/fine nodes, though I don't think there will be a substantial
difference.

* ramg_paratools.f

Very important tools for parallelization.
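Before the individual routines: the basic operation most of these tools
implement is "sum the copies of a shared boundary node across tasks and
hand the total back to every copy". A toy serial sketch of that idea
follows; the real routines work through ilwork task lists and MPI, and
all names here are invented.

```python
# Toy picture of boundary assembly (the commout/commin idea): each
# shared physical node has one copy per task; the copies are summed
# and the total written back to every copy.

def assemble_boundary(local_vecs, shared):
    """local_vecs: {task: list of local values};
    shared: one {task: local_index} dict per shared physical node."""
    for group in shared:
        total = sum(local_vecs[t][i] for t, i in group.items())
        for t, i in group.items():
            local_vecs[t][i] = total       # every copy gets the sum
    return local_vecs
```

For example, a node living as entry 1 on task 0 (value 2.0) and entry 0
on task 1 (value 3.0) ends up holding 5.0 on both tasks.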
global_lhs is a function that reads in lhsP (in sparse format, so there
are three arrays) and outputs lhsGP (the summation of
boundary-boundary interactions, put back on each processor; read the
thesis and slides). There are tricky steps: first you need to extract
the G_bb matrices; then the summation will alter the sparse structure,
and you will have to alter the structure of lhsP too. This is done task
by task. Mappings and reverse mappings are all used inside this routine.
I suggest you leave this subroutine as is; if you really want to change
something, read the source very carefully. One thing to know: we are not
considering triple boundaries. If a boundary node has 3 or more
representations (lives in 2 or more tasks), only one will be considered.
This makes the G_bb values for two such nodes slightly smaller, which
won't be a big issue for the overall performance. It is a direct result
of the task structure of phasta. I'm mentioning it here because if you
look really, really carefully you will find this problem, but don't
worry about it (slightly different values of G_bb only).

The PPE_Ap routines perform the matrix-vector product for the PPE; just
pay attention to the scaling and the corresponding function calls in
ramg_tools.f.

init_ilwork initializes all the ilwork structures for the higher-level
AMG matrices. Everything is similar to plain CG, but smaller in size
because of coarsening.

init_bcflag initializes the global array amg_paramap. Ignore the first
chunk of the code (if (numpe.lt.2) then) because that is for the reduced
serial case, which we no longer use.

ramg_commout, ramg_commin: the same as commout and commin, but for the
higher-level AMG systems.

Okay, that's it. Again, be very careful in this file.

* ramg_smoother.f

It's a wrapper for G-S, Jacobi, Cheby, and MLS. Nothing fancy here.
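As a toy picture of the Gauss-Seidel variant in this wrapper, here is a
sweep that relaxes in a coarse-first/fine-second order. It is a serial
caricature with a dense matrix; the real smoother works on the sparse
levels and takes its ordering from the CF_map built in ramg_coarse.f.

```python
# One Gauss-Seidel sweep in "coarse first, fine second" order.
# cf is a 'C'/'F' flag per node; each update uses the latest values,
# so fine nodes see the freshly relaxed coarse nodes.

def gs_sweep_cf(A, b, x, cf):
    order = [i for i, s in enumerate(cf) if s == 'C'] + \
            [i for i, s in enumerate(cf) if s == 'F']
    for i in order:
        s = sum(A[i][j] * x[j] for j in range(len(b)) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x
```

On [[2,1],[1,2]] x = [3,3] with cf = ['F','C'], the coarse node is
relaxed first to 1.5 and the fine node then uses that value.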
The only thing to note is that in Gauss-Seidel we follow red-black
smoothing, coarse first, fine second (or the other way around). For the
direct solve: we can't do a parallel direct solve (or CG! Farzin said
so, because that destroys the preconditioner, i.e. the preconditioner
would not be the same from solve to solve, while it should be the same
and symmetric). So a smoother is the suggested substitute for the
direct solve.

* ramg_tools.f

Nothing fancy here, but there are a lot of output tools you may want to
use: the ramg_dump* functions. Search for them to find examples; they
are very convenient when you go to parallel debugging and do MATLAB
verification.

calcAv_g is the global Ap-product for all AMG matrices; the kernel is
here.


Good luck.