xref: /phasta/phSolver/AMG/amgread.txt (revision 595995161822a203c8467e0e4a253d7bd7d6df32)

===================================================================

This document is a README for the parallel AMG solver for the Pressure
Poisson Equation in PHASTA. If you want to modify the code, add
functionality, or simply want to understand it, please use this
document as a guide.

This document is intended to serve as extra commentary on the source,
not to be published. If you make changes, please also update this
file.

A basic knowledge of sparse matrices is assumed; read the paper/thesis
as you continue. Row-wise sparse storage is used throughout this
document.
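To fix the convention, here is a minimal Python sketch of row-wise
storage (illustrative only; it assumes, following my reading of the
source, that colm holds the row-start offsets and rowp the column
indices):

```python
# Minimal sketch of row-wise (CSR) sparse storage. Assumed naming:
# colm = row-start offsets, rowp = column indices. Illustrative only;
# not taken from the PHASTA source.

def csr_matvec(n, colm, rowp, a, x):
    """y = A*x for an n-row matrix stored row-wise."""
    y = [0.0] * n
    for i in range(n):
        for k in range(colm[i], colm[i + 1]):
            y[i] += a[k] * x[rowp[k]]
    return y

# The 2x2 matrix [[2, 1], [0, 3]]:
colm = [0, 2, 3]        # row i occupies entries colm[i] .. colm[i+1]-1
rowp = [0, 1, 1]        # column index of each stored entry
a = [2.0, 1.0, 3.0]     # the stored nonzero values

print(csr_matvec(2, colm, rowp, a, [1.0, 1.0]))  # -> [3.0, 3.0]
```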

Chun Sun, March, 2008

==================================================================

*** Files related to AMG in the ../common/ and ../incompressible/
directories:

** solfar.f
    Two things involving AMG are done here between #ifdef AMG
statements: 1) set up AMG by calling ramg_control (don't worry, the
setup is not redone every time; ramg_control handles that), and
2) build a loop for smartAMG, which is never used.

** lestools.c
    Provides the C function LesPrecPPE. LesLIB calls this function for
preconditioning, and this function calls the Fortran routine
ramg_interface to do the actual preconditioning. This is the interface
between our AMG and the ACUSIM solver.

** input_fforms.cc

    Handles I/O from input.config and solver.inp; nothing special.

**** in AMG directory ******

* ramg_data.f

  We have to start from this file. It is a module file containing
global control variables, matrices, and vectors. Having one global
module and using it wherever necessary saves us from passing long
parameter lists on every function call (though it is still tedious in
some places).

  First we define a data type: a pointer to arrays of different
dimensions. As shown later in this file, it is used in submatrix
communication and in building the PPE multilevels.

  Then we have our multilevel AMG stored in the amg_A_* arrays as
sparse matrices. The amg_para* arrays are for communication.
amg_ilwork and amg_nlwork are the same as ilwork and nlwork, but at
higher levels with smaller sizes. amg_ppeDiag holds the important
diagonal scaling matrices for the different levels. CF_map and
CF_rev(erse)map are arrays recording coarsening information.

* ramg_coarse.f

The purpose of this file is: 1) CF-splitting, 2) setting up the
interpolation operator.

CF-splitting is done in a separate function, ramg_CFsplit. Beyond the
thesis, one thing to note is that we use a heap sort for the lambda
array: we keep changing this array, and every time we want to extract
the highest value from it, so a heap is a natural choice.
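The access pattern that makes a heap natural here, repeatedly
extracting the maximum from an array whose entries keep changing, can
be sketched like this (a toy Python stand-in, not the Fortran heap
code; heapq is a min-heap, so values are negated and stale entries are
skipped lazily):

```python
# Toy sketch of "keep extracting the max of a changing lambda array",
# the access pattern that makes a heap natural. Python's heapq is a
# min-heap, so values are negated; stale entries are skipped lazily.
# Not the Fortran implementation; all names are illustrative.
import heapq

lam = [3, 1, 4, 1, 5]                        # lambda value per node
heap = [(-v, i) for i, v in enumerate(lam)]  # max-heap via negation
heapq.heapify(heap)

def pop_max():
    """Return the node with the currently highest lambda, skipping
    heap entries invalidated by later updates."""
    while heap:
        v, i = heapq.heappop(heap)
        if -v == lam[i]:                     # entry still up to date?
            return i
    return None

def update(i, new_val):
    """Change a node's lambda; its old heap entry becomes stale."""
    lam[i] = new_val
    heapq.heappush(heap, (-new_val, i))

print(pop_max())     # -> 4 (lambda = 5 is the highest)
update(2, 9)         # bump node 2's lambda
print(pop_max())     # -> 2 (lambda = 9 is now the highest)
```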

After the CF-split, we do a communication, from ramg_update_cfmap to
ramg_readin_cfmap. This ensures that shared boundary nodes get the
same coarsening on both processors. Note that commout does not pass
data for integers (it only works with real*8 numbers), so we have
commout_i to do this for integers; check commout_i.

CF_map is an array that orders all the nodes "coarse first, fine
second". This order is used in Gauss-Seidel smoothing, where we smooth
the nodes in this order, aka "red-black" Gauss-Seidel; see the
paper/thesis for a description.
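A Gauss-Seidel sweep that follows a prescribed node ordering, coarse
points first and fine points second, can be sketched as follows (a
dense toy example; the matrix, the ordering, and the names are made
up, not taken from the smoother source):

```python
# Sketch of a Gauss-Seidel sweep that visits nodes in a prescribed
# order (coarse points first, then fine), as in CF-ordered /
# "red-black" smoothing. Dense matrix for clarity; illustrative only.

def gauss_seidel_ordered(A, b, x, order, sweeps=1):
    n = len(b)
    for _ in range(sweeps):
        for i in order:                     # CF_map-style node order
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
cf_order = [0, 2, 1]    # pretend nodes 0, 2 are coarse and 1 is fine
x = gauss_seidel_ordered(A, b, [0.0] * 3, cf_order, sweeps=50)
print(x)                # converges to the exact solution [1, 1, 1]
```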

amg_paramap is an array that records "grouping" information for
coarsening by group in parallel. Each node is assigned a "group id".
For an interior node, the group id is the rank of the local processor.
For a boundary node, if it is a "master", its group id is the
neighbour's rank; if it is a "slave", its group id is the negative of
the neighbour's rank (see ramg_initbcflag in ramg_paratools.f). This
information is computed at the very beginning for the finest level.
After each coarsening, the information is carried on to the next level
in ramg_coarse.f and is used for communication at the higher levels.
Also, CF-splitting is done within each group; if you pay attention to
every "if (afmap....)" in ramg_CFsplit you will see that.

Then we build the temporary variable amg_I for the interpolation
operator. The interpolation operator is a sparse matrix, but it's hard
to build it all at once, so amg_I, an array of allocatable pointers,
is used: we can build it row by row and eventually assemble it into
I_cf_*, the interpolation operator matrix. We then transpose it to
form I_fc_*. Pay attention to the transpose algorithm; it's a bit
involved, but standard for row-wise sparse storage.
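That standard row-wise transpose (count entries per column, prefix-sum
the counts into offsets, then scatter each entry) can be sketched like
this; generic CSR code, not the actual Fortran, and the colm = row
offsets / rowp = column indices naming is an assumption:

```python
# Sketch of the standard row-wise (CSR) transpose: count nonzeros per
# column, prefix-sum the counts into the row offsets of A^T, then
# scatter every entry. Generic code, not the Fortran source; naming
# assumes colm = row offsets, rowp = column indices.

def csr_transpose(nrow, ncol, colm, rowp, a):
    # Pass 1: count the entries destined for each row of A^T.
    cnt = [0] * ncol
    for k in range(colm[nrow]):
        cnt[rowp[k]] += 1
    # Prefix sum -> row offsets of the transpose.
    t_colm = [0] * (ncol + 1)
    for j in range(ncol):
        t_colm[j + 1] = t_colm[j] + cnt[j]
    # Pass 2: scatter, with a moving insertion cursor per row of A^T.
    cur = t_colm[:ncol]
    t_rowp = [0] * colm[nrow]
    t_a = [0.0] * colm[nrow]
    for i in range(nrow):
        for k in range(colm[i], colm[i + 1]):
            j = rowp[k]
            t_rowp[cur[j]] = i
            t_a[cur[j]] = a[k]
            cur[j] += 1
    return t_colm, t_rowp, t_a

# Transpose of [[2, 1], [0, 3]] gives [[2, 0], [1, 3]]:
print(csr_transpose(2, 2, [0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0]))
```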

* ramg_control.f

In ramg_prep, please note that the variable maxstopsign is the control
variable for coarsening. Coarsening stops when this value is true on
every processor; otherwise coarsening continues. For processors that
have already been fully coarsened and hold a true value, the result of
further coarsening is a trivial identity matrix as the interpolator, a
tiny bit of waste compared to communicating the stopping criteria (see
the thesis).

Also in ramg_control, we do an LU decomposition for the coarsest
level, if a direct solve for the coarsest level is selected. Each
iteration then only needs a back substitution, which saves time
greatly, especially when the coarsest level is a bit larger.
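The factor-once, back-substitute-every-iteration idea looks like this
in miniature (a toy dense LU without pivoting; the real coarsest-level
matrix and routines are different):

```python
# Sketch of "factor once in the setup, then only back-substitute each
# iteration": a toy dense LU without pivoting (so it assumes the
# matrix doesn't need it). Not the actual coarsest-level routines.

def lu_factor(A):
    """In-place Doolittle LU: A becomes L (unit lower) plus U."""
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A

def lu_solve(LU, b):
    """Forward then back substitution; this is the cheap
    per-iteration part."""
    n = len(b)
    y = b[:]
    for i in range(n):                       # solve L y = b
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    x = y
    for i in reversed(range(n)):             # solve U x = y
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x

LU = lu_factor([[4.0, 1.0], [1.0, 3.0]])     # setup: done once
print(lu_solve(LU, [9.0, 5.0]))              # per iteration -> [2.0, 1.0]
```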

For the "smartamg" prototype, we check 7 consecutive norms from CG
(stored in ramg_window/ramg_winct); if the norm does not go down, we
restart AMG-CG. Before restarting we cheat ACUSIM by giving it zero
norms for the first sweep of solves (the last iteration of CG and the
first iteration of GMRES). This whole segment might be abandoned if
ACUSIM gives us a better interface and norm control.

* ramg_driver.f

The main wrapper; nothing fancy. ramg_interface is the only gate to
the actual ACUSIM solve. In the actual ramg_driver, we have left open
the possibility (mode) of doing a stand-alone preconditioned CG
ourselves, which is never used.

Pay attention to the diagonal scaling in the V-cycle: at the finest
level the PPE has already been scaled to unit diagonal, so we do the
scaling only at the second level and above.

(As I write this, there seem to be two places where efficiency could
be improved. For the second level and above, the scaling could be
incorporated into the interpolation operator, i.e. interpolation
modified to directly give a scaled matrix. The other place is the
allocation of the coarse-level vectors: we don't need to do this every
iteration; it could be done in extract, with the vectors reused every
time.)

* ramg_extract.f

Sparse matrix-matrix product: you need to do these three steps
separately: count the nonzeros; form colm; then form rowp and A. Yes,
you could combine the three loops into one, but that would
dramatically decrease efficiency. Actually, a rule of thumb: breaking
a complicated loop into several simpler loops increases efficiency
(they are better optimized for vector operations), even though the
math hasn't changed. Here, though, combining the matrix-matrix product
into one loop would also change the math into something more
complicated. Another rule of thumb: use fewer "if"s inside a loop;
that too increases efficiency.
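The three-pass sparse matrix-matrix product described above can be
sketched as follows (a generic CSR version, illustrative only; the
colm/rowp naming is assumed to match the source):

```python
# Sketch of the three-pass sparse matrix-matrix product: pass 1 counts
# the nonzeros of each row of C, pass 2 prefix-sums them into the row
# offsets (colm), pass 3 fills the column indices (rowp) and values.
# Generic CSR code, illustrative only; naming assumes colm = row
# offsets, rowp = column indices.

def spgemm(n, A_colm, A_rowp, A_a, B_colm, B_rowp, B_a):
    """C = A*B for an n-row CSR matrix A and a CSR matrix B."""
    # Pass 1: symbolic product, counting nonzeros per row of C.
    nnz_row = [0] * n
    for i in range(n):
        seen = set()
        for ka in range(A_colm[i], A_colm[i + 1]):
            j = A_rowp[ka]
            for kb in range(B_colm[j], B_colm[j + 1]):
                seen.add(B_rowp[kb])
        nnz_row[i] = len(seen)
    # Pass 2: prefix sum into the row offsets of C.
    C_colm = [0] * (n + 1)
    for i in range(n):
        C_colm[i + 1] = C_colm[i] + nnz_row[i]
    # Pass 3: numeric product into rowp and the values.
    C_rowp = [0] * C_colm[n]
    C_a = [0.0] * C_colm[n]
    for i in range(n):
        acc = {}
        for ka in range(A_colm[i], A_colm[i + 1]):
            j = A_rowp[ka]
            for kb in range(B_colm[j], B_colm[j + 1]):
                col = B_rowp[kb]
                acc[col] = acc.get(col, 0.0) + A_a[ka] * B_a[kb]
        for k, col in enumerate(sorted(acc)):
            C_rowp[C_colm[i] + k] = col
            C_a[C_colm[i] + k] = acc[col]
    return C_colm, C_rowp, C_a

# [[2, 1], [0, 3]] * [[1, 0], [1, 1]] = [[3, 1], [3, 3]]
C = spgemm(2, [0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0],
           [0, 1, 3], [0, 0, 1], [1.0, 1.0, 1.0])
print(C)   # -> ([0, 2, 4], [0, 1, 0, 1], [3.0, 1.0, 3.0, 3.0])
```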

Now to the parallel part. lhsGP* is a sparse-communicated duplicate of
the lhsP matrix.

* ramg_ggb.f

You must read the GGB section of Chun Sun's thesis. ARPACK/PARPACK is
used through a user-specified A*p product, here in ggb_av.

generate_gmap creates a mapping and a reverse mapping (gmap, grevmap)
between the GGB vector format (without overlap boundary) and the
phasta format (with overlap boundary). ggb_G2A and ggb_A2G are the
actual function calls that do this.

In ggb_setup there is an Allreduce with MPI_MAX; this searches for the
maximum vector length for vector allocation. ARPACK does not do
dynamic memory allocation per processor, so we allocate the maximum
nshg on every proc.

There are 3 things you may want to modify in this GGB implementation.
1) You may want to do the LU decomposition first, in the setup, and
then do a simple back substitution in the GGB apply (like we do in the
coarsest-level solver when using a direct solve). This saves time.
2) You may want to figure out how the tolerance (tol = 1.0E-3) relates
to the actual performance.
3) Eventually you will want your own eigenvalue calculator; we can't
use ARPACK forever.

* ramg_ITAI.f

This does A_(l+1) = Q^T * A_l * Q, the creation of the coarser
matrices.

Note that in the code we use "I" instead of "Q".

When you do the sparse matrix-matrix product, don't attempt to do it
in one loop. Use three loops: calculate nnz, create colm, then create
rowp and mtx. This looks like you are using 3 loops instead of 1, but
it saves you time and is easier for the compiler to vectorize.

As for the diagonal scaling to unity, you may want to test how to
disable it, with a corresponding modification to the smoother (the
polynomial smoother).

* ramg_mls.f

Read the thesis, and the appendix too; everything is straightforward
if you do so. The setup has a structure similar to GGB's. There are
segments of the code that do the so-called "post-smoothing", but they
are never activated; don't worry about them.

I'm writing about ramg_cheby.f here too, because they are both
polynomial smoothers. ramg_chebyratio is an input coefficient that
determines the smallest eigenvalue you want to smooth using the
Chebyshev polynomial; it is set to 10 in input.config. You can change
it to the ratio of coarse/fine nodes, though I don't think there would
be a substantial difference.
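The role ramg_chebyratio plays, picking the smallest eigenvalue the
polynomial must damp, can be sketched with the textbook Chebyshev
iteration (Saad's formulation). This is NOT the actual ramg_cheby.f
code, and the matrix and the eigenvalue estimate below are made up:

```python
# Sketch of a Chebyshev smoother damping eigenmodes between
# lmax/ratio and lmax (the role played by ramg_chebyratio). Textbook
# Chebyshev iteration, not the actual ramg_cheby.f code; the matrix
# and the lmax estimate are made up.

def matvec(A, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def cheby_smooth(A, b, x, lmax, ratio=10.0, iters=5):
    lmin = lmax / ratio                # smallest eigenvalue smoothed
    theta = 0.5 * (lmax + lmin)        # center of the target interval
    delta = 0.5 * (lmax - lmin)        # half-width of the interval
    sigma = theta / delta
    rho = 1.0 / sigma
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    d = [ri / theta for ri in r]
    for _ in range(iters):
        x = [xi + di for xi, di in zip(x, d)]
        r = [ri - adi for ri, adi in zip(r, matvec(A, d))]
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = [rho_new * (rho * di + 2.0 * ri / delta)
             for di, ri in zip(d, r)]
        rho = rho_new
    return x

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]                    # exact solution is [1, 1, 1]
x = cheby_smooth(A, b, [0.0] * 3, lmax=5.5)
print(x)                               # close to [1.0, 1.0, 1.0]
```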

* ramg_paratools.f

Very important tools for parallelization.

global_lhs is a function that reads in lhsP (in sparse format, so
there are three arrays) and outputs lhsGP (the boundary-boundary
interactions are summed and then put back to each processor; read the
thesis and slides). There are tricky steps: first you need to extract
the G_bb matrices; then the summation alters the sparse structure, so
you have to alter the structure of lhsP too. This is done task by
task. Mappings and reverse mappings are used all over this routine. I
suggest you leave this subroutine as is; if you really want to change
something, read the source very carefully. One thing to note is that
we are not considering triple boundaries: if a boundary node has 3 or
more representations (lives in 2 or more tasks), only one is
considered. This makes the G_bb value for two such nodes slightly
smaller, which is not a big issue for the overall performance. It is a
direct result of the task structure of phasta. I mention it here
because if you look really, really carefully you will find this
problem, but don't worry about it (it only slightly changes the value
of G_bb).

The PPE_Ap routines perform the matrix-vector product for the PPE;
just pay attention to the scaling and the corresponding function calls
in ramg_tools.f.

init_ilwork initializes all the ilwork structures for the higher-level
AMG matrices. Everything is similar to plain CG, but smaller in size
because of coarsening.

init_bcflag initializes the global array amg_paramap. Ignore the first
chunk of the code (if (numpe.lt.2) then), because that is for the
reduced serial case, which we no longer use.

ramg_commout and ramg_commin are the same as commout and commin, but
for the higher-level AMG systems.

Okay, that's it. Again, be very careful in this file.

* ramg_smoother.f

It's a wrapper for G-S, Jacobi, Cheby, and MLS. Nothing fancy here.

The only thing to note is that in Gauss-Seidel we follow red-black
smoothing: coarse first, fine second (or the other way around). In
parallel we can't do a direct solve (or CG! Farzin said so, because
that destroys the preconditioner, i.e. the preconditioner would not be
the same from solve to solve, while it should be the same and
symmetric). So the suggestion is to use a smoother in place of the
direct solve.

* ramg_tools.f

Nothing fancy here, but there are a lot of output tools you may want
to use: the ramg_dump* functions. Search for them to find examples;
they are very convenient when you get to parallel debugging and matlab
verification.

calcAv_g is the global A*p product for all AMG matrices; the kernel is
here.


Good luck.