xref: /phasta/phSolver/AMG/amgread.txt (revision 595995161822a203c8467e0e4a253d7bd7d6df32)

===================================================================

This document is a README for the parallel AMG code for the Pressure
Poisson Equation (PPE) in PHASTA. If you want to modify or add
functionality, or simply want to understand the code, please use this
document as a help.

This document is intended to serve as extra comments on the source, not
to be published. If you make changes, please also update this file.

Please have a basic knowledge of sparse matrices, and read the
paper/thesis as you go. Row-wise sparse storage is used everywhere in
this document.

Chun Sun, March, 2008

==================================================================

*** Files related to AMG in the ../common/ and ../incompressible/
directories:

** solfar.f
    Two things involving AMG are done here between #ifdef AMG
statements: 1) set up AMG by calling ramg_control (don't worry, you
won't redo the setup every time; ramg_control handles that), and
2) build a loop for smartAMG, which is never used.

** lestools.c
    Provides the C function LesPrecPPE. LesLIB calls this function for
preconditioning, and it in turn calls the Fortran routine ramg_interface
to do the actual preconditioning. This is the interface between our AMG
and the ACUSIM solver.

** input_fforms.cc

    Handles I/O from input.config and solver.inp; nothing special.

**** in AMG directory ******

* ramg_data.f

  We have to start from this file. It is a module file containing
global control variables, matrices, and vectors. Having a global module
that we use wherever necessary saves us from passing long parameter
lists in every function call (though it is still tedious in places).

  First we define derived data types: pointers to arrays of different
dimensions. These are used, as shown later in this file, for submatrix
communication and for building the PPE multilevels.

  Then we have our multilevel AMG hierarchy stored in the amg_A_* arrays
as sparse matrices. The amg_para* arrays are for communication.
amg_ilwork and amg_nlwork are the same as ilwork and nlwork, but for the
higher (smaller) levels. amg_ppeDiag holds the important diagonal
scaling matrices for the different levels. CF_map and CF_rev(erse)map
are arrays recording coarsening information.
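If the row-wise storage convention is unfamiliar, here is a minimal
sketch (Python used for illustration; the real code is Fortran and
1-based). The colm/rowp/value naming follows the three arrays the AMG
files use, under the assumption that colm holds row start offsets and
rowp the column indices:

```python
# Minimal sketch of row-wise sparse storage plus a mat-vec.
# colm[i]:colm[i+1] indexes the nonzeros of row i, rowp holds their
# column numbers, Aval their values (0-based here).
def spmv(colm, rowp, Aval, x):
    n = len(colm) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(colm[i], colm[i + 1]):
            y[i] += Aval[k] * x[rowp[k]]
    return y

# The 2x2 matrix [[2, 1], [0, 3]]:
colm = [0, 2, 3]       # row i occupies entries colm[i]..colm[i+1]-1
rowp = [0, 1, 1]       # column index of each stored entry
Aval = [2.0, 1.0, 3.0]
print(spmv(colm, rowp, Aval, [1.0, 1.0]))  # [3.0, 3.0]
```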

* ramg_coarse.f

The purpose of this file is: 1) CF-splitting, and 2) setting up the
interpolation operator.

CF-splitting is done in a separate function, ramg_CFsplit. Beyond what
the thesis covers, one thing to note is that we use a heap for the
lambda array: we keep changing this array, and at every step we want to
extract its highest value, so a heap is the natural choice.
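The access pattern can be sketched like this (illustrative Python, not
the Fortran heap in the source; Python's heapq is a min-heap, so values
are negated and stale entries are lazily discarded):

```python
import heapq

# Sketch of why a heap suits the lambda array: values keep changing and
# each step needs the current maximum.
def pick_next(heap, lam, in_heap):
    """Pop entries until one matches the current lambda value."""
    while heap:
        neg, node = heapq.heappop(heap)
        if in_heap[node] and -neg == lam[node]:
            in_heap[node] = False
            return node
    return None

lam = {0: 3, 1: 5, 2: 4}
in_heap = {0: True, 1: True, 2: True}
heap = [(-v, n) for n, v in lam.items()]
heapq.heapify(heap)

first = pick_next(heap, lam, in_heap)   # node 1 (lambda = 5)
lam[2] = 6                              # a neighbour's lambda changed
heapq.heappush(heap, (-6, 2))           # re-push with the new value
second = pick_next(heap, lam, in_heap)  # node 2 now wins
print(first, second)  # 1 2
```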

After the CF-split, we do a communication, from ramg_update_cfmap to
ramg_readin_cfmap. This ensures that shared boundary nodes get the same
coarsening on both processors. Note that commout does not pass integer
data (it only works with real*8 numbers), so we have commout_i to do
this for integers; check commout_i.

CF_map is an array that orders all the nodes "coarse first, fine
second". This order is used in Gauss-Seidel smoothing, where we smooth
nodes in exactly this order, i.e. "red-black" Gauss-Seidel; see the
paper/thesis for a description.
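A sketch of a Gauss-Seidel sweep driven by a prescribed node order
(dense numpy standing in for the sparse Fortran; the `order` list plays
the role of CF_map):

```python
import numpy as np

# Gauss-Seidel sweeping nodes in a prescribed order (coarse points
# first, then fine points), as the CF_map ordering does.
def gauss_seidel_ordered(A, b, x, order, sweeps=1):
    for _ in range(sweeps):
        for i in order:
            r = b[i] - A[i] @ x + A[i, i] * x[i]
            x[i] = r / A[i, i]
    return x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
order = [2, 0, 1]   # e.g. coarse nodes {2, 0} first, fine node {1} last
x = gauss_seidel_ordered(A, b, np.zeros(3), order, sweeps=50)
print(np.allclose(A @ x, b))  # True
```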

amg_paramap is an array that records the "grouping" information for
coarsening by group in parallel. Each node is assigned a "group id".
For an interior node, the group id is the rank of the local processor.
For a boundary node, if it is a "master", its group id is the
neighbour's rank; if it is a "slave", its group id is the negative of
the neighbour's rank (see ramg_initbcflag in ramg_paratools.f). This
information is computed at the very beginning for the finest level.
After each coarsening it is carried up to the next level in
ramg_coarse.f, and it is used for communication at the higher levels.
Also, CF-splitting is done within each group; if you pay attention to
every "if (afmap....)" in ramg_CFsplit you will see that.

Then we build the temporary variable amg_I for the interpolation
operator. The interpolation operator is a sparse matrix, but it is hard
to build all at once, so amg_I, an array of allocatable pointers, lets
us build it row by row. Eventually we assemble it into I_cf_*, the
interpolation operator matrix, and transpose it to form I_fc_*. Pay
attention to the transpose algorithm; it is a bit involved, but it is
the standard one for row-wise sparse storage.
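That standard transpose is: one pass counts entries per column, a prefix
sum builds the new row pointers, and a scatter pass places the entries.
A sketch (illustrative Python, 0-based):

```python
# Row-wise sparse transpose, as used to form I_fc from I_cf.
def csr_transpose(n_rows, n_cols, colm, rowp, A):
    cnt = [0] * n_cols
    for j in rowp:                 # pass 1: count nonzeros per column
        cnt[j] += 1
    t_colm = [0] * (n_cols + 1)
    for j in range(n_cols):        # prefix sum -> row pointers of A^T
        t_colm[j + 1] = t_colm[j] + cnt[j]
    t_rowp = [0] * len(rowp)
    t_A = [0.0] * len(A)
    nxt = list(t_colm[:n_cols])    # next free slot in each new row
    for i in range(n_rows):        # pass 2: scatter entries into place
        for k in range(colm[i], colm[i + 1]):
            p = nxt[rowp[k]]
            t_rowp[p] = i
            t_A[p] = A[k]
            nxt[rowp[k]] += 1
    return t_colm, t_rowp, t_A

# Transpose of [[2, 1], [0, 3]] is [[2, 0], [1, 3]]:
print(csr_transpose(2, 2, [0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0]))
```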

* ramg_control.f

In ramg_prep, note that the variable maxstopsign is the control variable
for coarsening. Coarsening stops when this value is true, and it becomes
true only when every processor's local stop flag is true; otherwise
coarsening continues. Processors that are already fully coarsened and
have a true local flag produce a trivial identity matrix as their
interpolator, a tiny bit of waste compared to communicating the stopping
criteria (see thesis).
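In effect this is a logical-AND reduction across ranks. Sketched here
without MPI (the real code reduces maxstopsign across processors;
finished ranks just keep emitting identity interpolators until everyone
is done):

```python
# Simulated global coarsening loop: each "rank" needs a different number
# of levels; the loop runs until every rank's stop flag is true.
def coarsen_all(levels_needed):
    done = [False] * len(levels_needed)
    level = 0
    while not all(done):           # the AND-reduction on stop flags
        level += 1
        for r, need in enumerate(levels_needed):
            if level >= need:
                done[r] = True     # from now on this rank emits identity
    return level

print(coarsen_all([2, 4, 3]))  # 4: coarsening runs to the deepest rank
```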

Also in ramg_control, we do an LU decomposition for the coarsest level
if a direct solve for the coarsest level is selected. Each iteration
then only needs a back substitution, which saves a great deal of time,
especially when the coarsest level is somewhat large.
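The pattern, sketched with a hand-rolled Doolittle LU (no pivoting,
which is only safe for the small SPD-like coarsest matrix assumed here;
the Fortran code's factorization routine may differ):

```python
import numpy as np

# Factor once at setup; each V-cycle then costs only forward/back
# substitution on the stored factors.
def lu_factor(A):
    n = len(A)
    LU = A.copy()
    for k in range(n):
        for i in range(k + 1, n):
            LU[i, k] /= LU[k, k]
            LU[i, k + 1:] -= LU[i, k] * LU[k, k + 1:]
    return LU

def lu_solve(LU, b):
    n = len(b)
    y = b.copy()
    for i in range(n):                  # forward substitution (unit L)
        y[i] -= LU[i, :i] @ y[:i]
    x = y
    for i in range(n - 1, -1, -1):      # back substitution
        x[i] = (x[i] - LU[i, i + 1:] @ x[i + 1:]) / LU[i, i]
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
LU = lu_factor(A)                  # done once, at setup
for b in (np.array([1.0, 2.0]), np.array([5.0, 6.0])):
    x = lu_solve(LU, b)            # cheap, done every iteration
    print(np.allclose(A @ x, b))   # True
```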

For the "smartamg" prototype, we check 7 consecutive norms from CG
(stored in ramg_window/ramg_winct); if the norm does not go down, we
restart AMG-CG. Before the restart we cheat ACUSIM by giving it zero
norms for the first sweep of solves (the last iteration of CG and the
first iteration of GMRES). This whole segment could be abandoned if
ACUSIM gave us a better interface and norm control.
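The trigger logic amounts to a sliding window check; a sketch (the
window size 7 matches the source, the restart rule itself is my reading
of "does not go down"):

```python
from collections import deque

# Flag a restart when no norm in the window beats the oldest one,
# i.e. the residual has stalled over the last `size` CG iterations.
def should_restart(window, size=7):
    if len(window) < size:
        return False
    return min(window) >= window[0]

norms = deque(maxlen=7)
restart_at = None
history = [1.0, 0.9, 0.8, 0.79, 0.79, 0.8, 0.79, 0.8, 0.81, 0.8]
for step, r in enumerate(history):
    norms.append(r)
    if should_restart(list(norms)):
        restart_at = step
        break
print(restart_at)  # 9: the first step at which the window has stalled
```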

* ramg_driver.f

The main wrapper; nothing fancy. ramg_interface is the only gateway to
the actual ACUSIM solve. In ramg_driver itself we have left in the
possibility (a mode) of running our own stand-alone preconditioned CG,
which is never used.

Pay attention to the diagonal scaling in the V-cycle: at the finest
level, the PPE has already been scaled to unit diagonal, so we do the
scaling only at the second level and above.
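Scaling to unit diagonal means the symmetric scaling
D^(-1/2) A D^(-1/2), with D the diagonal of the level matrix (this is
the quantity amg_ppeDiag stores per level, as I read it):

```python
import numpy as np

# Symmetric diagonal scaling of a level matrix to unit diagonal.
A = np.array([[4.0, 1.0],
              [1.0, 9.0]])
d = 1.0 / np.sqrt(np.diag(A))         # per-level scaling vector
A_scaled = d[:, None] * A * d[None, :]
print(np.diag(A_scaled))              # [1. 1.]
```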

(Noted while writing this: there seem to be two places where we could
improve efficiency. For the second level and above, the scaling could be
folded into the interpolation operator, i.e. interpolation could be
modified to directly produce a scaled matrix. The other place is the
allocation of the coarse-level vectors: we don't need to do this every
iteration; it could be done once in extract and the vectors reused every
time.)

* ramg_extract.f

Sparse matrix-matrix product: you need to do these three steps
separately: count the nonzeros; form colm; form rowp and A. Yes, it is
possible to combine these three loops into one, but that would
dramatically decrease efficiency. A rule of thumb: breaking a
complicated loop into several simpler loops increases efficiency (they
optimize better for vector operations), even if the math hasn't changed.
But here, if you combine the matrix-matrix product into one loop, the
math itself changes into something more complicated. Another rule of
thumb: use fewer "if" statements inside a loop; that also increases
efficiency.
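The three-pass structure looks like this (illustrative Python for square
row-wise matrices; the Fortran version differs in detail but follows the
same count / row-pointer / fill sequence):

```python
# Three-pass sparse matrix-matrix product C = A*B:
# pass 1 counts nonzeros per result row, pass 2 prefix-sums them into
# the row pointers (colm), pass 3 fills column indices (rowp) and values.
def csr_matmul(n, colm_a, rowp_a, a, colm_b, rowp_b, b):
    mark = [-1] * n
    nnz_row = [0] * n
    for i in range(n):                          # pass 1: count
        for ka in range(colm_a[i], colm_a[i + 1]):
            k = rowp_a[ka]
            for kb in range(colm_b[k], colm_b[k + 1]):
                j = rowp_b[kb]
                if mark[j] != i:
                    mark[j] = i
                    nnz_row[i] += 1
    colm_c = [0] * (n + 1)
    for i in range(n):                          # pass 2: row pointers
        colm_c[i + 1] = colm_c[i] + nnz_row[i]
    rowp_c = [0] * colm_c[n]
    c = [0.0] * colm_c[n]
    pos = [-1] * n                              # column j's slot in row i
    mark = [-1] * n
    for i in range(n):                          # pass 3: fill values
        nxt = colm_c[i]
        for ka in range(colm_a[i], colm_a[i + 1]):
            k = rowp_a[ka]
            for kb in range(colm_b[k], colm_b[k + 1]):
                j = rowp_b[kb]
                if mark[j] != i:
                    mark[j] = i
                    pos[j] = nxt
                    rowp_c[nxt] = j
                    nxt += 1
                c[pos[j]] += a[ka] * b[kb]
    return colm_c, rowp_c, c

# ([[2, 1], [0, 3]])^2 = [[4, 5], [0, 9]]:
print(csr_matmul(2, [0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0],
                    [0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0]))
```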

Now to parallel: lhsGP* is a sparse-communicated duplicate of the lhsP
matrix.

* ramg_ggb.f

You must read the GGB section of Chun Sun's thesis. ARPACK/PARPACK is
used through a user-specified A*v product, here in ggb_av.

generate_gmap creates a mapping and a reverse mapping (gmap, grevmap)
between the GGB vector format (without overlapping boundary nodes) and
the phasta format (with overlapping boundary nodes). ggb_G2A and ggb_A2G
are the actual function calls that do this.
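The gather/scatter pair can be sketched like this (illustrative Python;
the direction names and the exact contents of gmap/grevmap are my
reading of generate_gmap, not the source's definitions):

```python
# Move a vector between the GGB layout (each shared node stored once)
# and the phasta layout (shared nodes duplicated on every processor
# that touches them).
def a2g(v_phasta, gmap):
    """Gather: pick one representative copy of every global node."""
    return [v_phasta[i] for i in gmap]

def g2a(v_ggb, grevmap):
    """Scatter: put the global value on every (possibly duplicated) slot."""
    return [v_ggb[g] for g in grevmap]

# 4 local slots; slots 1 and 3 are two copies of the same shared node:
gmap = [0, 1, 2]          # global node -> representative local slot
grevmap = [0, 1, 2, 1]    # local slot  -> global node
v = [10.0, 20.0, 30.0, 20.0]
g = a2g(v, gmap)          # [10.0, 20.0, 30.0]
print(g2a(g, grevmap))    # [10.0, 20.0, 30.0, 20.0]
```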

In ggb_setup there is an Allreduce with MPI_MAX; this searches for the
maximum vector length for vector allocation. ARPACK does not size its
memory dynamically per processor, so we allocate the maximum nshg on
every proc.

There are 3 things you may want to modify in this GGB implementation.
1) You may want to do the LU decomposition once in the setup, and then
in the GGB apply do a simple back substitution (like we do in the
coarsest-level solver when using the direct option). This saves time.
2) You may want to figure out how the tolerance (tol = 1.0E-3) relates
to the actual performance.
3) Eventually you will want your own eigensolver; we can't use ARPACK
forever.

* ramg_ITAI.f

This does A_(l+1) = Q^T * A_l * Q, the creation of the coarser matrices.

Note that in the code we use "I" instead of "Q".
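As a dense sketch of what the sparse triple product computes (numpy
standing in for the sparse Fortran; the example Q interpolates one fine
node from its two coarse neighbours):

```python
import numpy as np

# Galerkin coarse operator: A_{l+1} = Q^T A_l Q, with Q the
# interpolation operator (called I in the code).
A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
Q = np.array([[1.0, 0.0],
              [0.5, 0.5],   # fine node averaged from its coarse neighbours
              [0.0, 1.0]])
A_coarse = Q.T @ A @ Q
print(A_coarse)   # a 2x2 matrix, symmetric like A
```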

When you do a sparse matrix-matrix product, don't attempt to do it in
one loop. Use three loops: calculate the nnz, create colm, then create
rowp and the matrix values. This looks like 3 loops instead of 1, but it
saves time and is easier for the compiler to vectorize.

As to the diagonal scaling to unity, you may want to test how to disable
it by modifying the smoother (polynomial smoother).

* ramg_mls.f

Read the thesis, and the appendix too; everything is straightforward if
you do. The setup has a structure similar to GGB's. There are segments
of the code that do so-called "post-smoothing", but they are never
activated. Don't worry about them.

I'm covering ramg_cheby.f here too, because both are polynomial
smoothers. ramg_chebyratio is an input coefficient that determines the
smallest eigenvalue you want to smooth with the Chebyshev polynomial; it
is set to 10 in input.config. You can change it to the ratio of
coarse/fine nodes, though I don't think there will be a substantial
difference.
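The role of the ratio can be sketched with the standard three-term
Chebyshev recurrence (a textbook Chebyshev smoother, not the source's
exact coefficients): the smoother damps error components with
eigenvalues in [lambda_max/ratio, lambda_max], with ratio playing the
role of ramg_chebyratio.

```python
import numpy as np

# Textbook Chebyshev smoother targeting eigenvalues in
# [lmax/ratio, lmax]; three-term recurrence on the correction d.
def cheby_smooth(A, b, x, lmax, ratio=10.0, steps=10):
    lmin = lmax / ratio
    theta = 0.5 * (lmax + lmin)
    delta = 0.5 * (lmax - lmin)
    sigma = theta / delta
    rho = 1.0 / sigma
    r = b - A @ x
    d = r / theta
    for _ in range(steps):
        x = x + d
        r = r - A @ d
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * r
        rho = rho_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 1.0])
lmax = 5.0   # an upper bound on A's spectrum
x = cheby_smooth(A, b, np.zeros(2), lmax, ratio=10.0, steps=60)
print(np.allclose(A @ x, b, atol=1e-6))  # True
```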

* ramg_paratools.f

Very important tools for parallelization.

global_lhs is a function that reads in lhsP (in sparse format, so there
are three arrays) and outputs lhsGP (the boundary-boundary interactions
summed and put back on each processor; read the thesis and slides).
There are tricky steps: first you need to extract the G_bb matrices;
then the summation alters the sparse structure, so you have to alter the
structure of lhsP too. This is done task by task. Mappings and reverse
mappings are used throughout this routine. I suggest you leave this
subroutine as is; if you really want to change something, read the
source very carefully. One caveat: we do not handle triple boundaries.
If a boundary node has 3 or more representations (lives on 2 or more
tasks), only one is considered. This makes the G_bb value for two such
nodes slightly smaller, which is not a big issue for overall
performance; it is a direct result of the task structure of phasta. I
mention it here because if you look really carefully you will find this
problem, but don't worry about it (it only slightly changes the value of
G_bb).

The PPE_Ap routines perform the matrix-vector product for the PPE; just
pay attention to the scaling and the corresponding function calls in
ramg_tools.f.

init_ilwork initializes all the ilwork structures for the higher-level
AMG matrices. Everything is similar to plain CG, but smaller in size
because of coarsening.

init_bcflag initializes the global array amg_paramap. Ignore the first
chunk of the code (if (numpe.lt.2) then) because that is for the reduced
serial case, which we no longer use.

ramg_commout and ramg_commin are the same as commout and commin, but for
the higher-level AMG systems.

Okay, that's it. Again, be very careful in this file.

* ramg_smoother.f

It's a wrapper for G-S, Jacobi, Chebyshev, and MLS. Nothing fancy here.

The only thing to note is that in Gauss-Seidel we follow red-black
smoothing: coarse first, fine second (or the other way around). We
can't do a parallel direct solve (or parallel CG; Farzin said so,
because that destroys the preconditioner, i.e. the preconditioner would
not be the same from solve to solve, while it should be the same and
symmetric). So a smoother is suggested in place of the direct solve.

* ramg_tools.f

Nothing fancy here, but there are a lot of output tools you may want to
use: the ramg_dump* functions. Search for them to find examples; they
are very convenient when you get to parallel debugging and matlab
verification.

calcAv_g is the global A*p product for all AMG matrices; the kernel is
here.


Good luck.