Chef/Mesh Partitioning

From PHASTA Wiki
Revision as of 23:22, 28 March 2015 by Skinnerr (talk | contribs) (Chef)

This page is based on a tutorial given to Igor and his team at NCSU on setting up two-phase flow test cases on Firebird, a local cluster at NCSU, and on Cetus/Mira at ALCF. For now, expect little more than a series of copy-pastes from emails. Please update this page for our viz nodes when you get a chance.

Thanks,

- Michel



This tutorial explains how to partition the initial mesh and generate the phasta files on firebird (and other platforms, including Cetus/Mira) using Chef. It is rather long but should include everything you need. The test case used to demonstrate the workflow is the familiar 3-way subchannel flow. The root path of this test case is /sgidata2/mrasquin/Models/subchannel. The parasolid model is located in /sgidata2/mrasquin/Models/subchannel/convertParasolid2ParasolidNative/geomFromSimmodeler_nat.xmt_txt. The following sections describe the Chef workflow.

Env variables

All the subsequent tools need

  • The fresh version of openmpi I built on firebird
  • The latest Simmetrix library I installed in /Install on firebird.

To update your paths, source the following file: /Install/SCOREC.develop/envLinux2014.sh.

The env variables defined or updated in this env script include PATH and LD_LIBRARY_PATH. What is defined in this script should take precedence over your own settings, but I strongly suggest removing any redundancy that you may have, for instance, in your .bashrc. Note that I actually source this env file directly in my .bashrc so that I do not have to do it manually every time I log in to firebird. When you source it, it also prints the versions of gcc, openmpi and the simmodsuite lib that are set up.

BLMesherParallel

Note that Simmetrix supports matched faces only for single-part meshes, so the mesh must be built with one core. However, the initial mesh must already include some information related to the partitioning, even if the mesh only includes a single part for format reasons. This additional information about the partitioning is required to convert the mesh file from the Simmetrix format to the SCOREC MDS format that Chef can read.

The initial mesh for the 3-way subchannel was built in /sgidata2/mrasquin/Models/subchannel/subchannel_3way/Mixed-Parallel1-parasolid-9.0-140927/2-A0. Check the script named runBLMesherParallel.sh in this directory.

Running ./runBLMesherParallel.sh with no arguments will tell you the usage, that is:

Usage: ./runBLMesherParallel.sh <X> <Y> <Z>

The arguments are as follows.

  • <X> (geometric model) should be the parasolid model geomFromSimmodeler_nat.xmt_txt.
  • <Y> (attribute file) should be BLattr.inp.
  • <Z> (number of processors) should be 1 here since we need to generate a single part mesh using a single core.

The BLattr.inp input file is the same as the one read by the old serial version of BLMesher, and BLMesherParallel can do whatever the old version of BLMesher can do. In addition, if your test case does not include any matched faces, you may try to mesh in parallel by setting <Z> to a value larger than 1. However, some meshing features are available only when BLMesherParallel is used with a single core, so it is always important to check the resulting mesh.

BLMesherParallel outputs the following files.

  • mesh.sms --- The resulting mesh is stored in a directory named mesh.sms, which is a parameter hardcoded in the runBLMesherParallel.sh script.
  • BLMesher.log --- The log from BLMesherParallel is saved in BLMesher.log, whereas the Simmetrix log is saved in mesh.log. Both filenames are also hardcoded in the script.

I also mentioned in previous discussions that Simmetrix has developed its own model format called geomsim. However, the boundary layer collapses near matched faces with this model format, which is not the case when we use the parasolid format. This issue has been reported to Simmetrix but until they can provide a fix, we are forced to start with the parasolid format when our test cases include matched faces.

Mesh conversion

Chef can read only the MDS format developed at SCOREC. Therefore, the Simmetrix mesh must first be converted to this format.

This operation was carried out for the 3-way channel in /sgidata2/mrasquin/Models/subchannel/subchannel_3way/Mixed-Parallel1-parasolid-9.0-140927/2-A0/simMeshToMdsMesh. Simply run the script ./simMeshToMdsMesh.sh, which executes the "convert" executable. In the script, you can see that the convert executable reads 3 arguments:

  1. The input parasolid model named geom.xmt_txt, which points to geomFromSimmodeler_nat.x_t. Note that convert expects an .xmt_txt extension (or an .smd extension for the complete geomsim format).
  2. The input Simmetrix mesh, named parts.sms here (for historical reasons, but it can be renamed).
  3. The name of the output mds mesh directory, which is mdsMesh_bz2 here. Note that this name is prefixed with "bz2:", which means that the output mds mesh file is compressed using bzip2. "bz2:" will not be part of the name of the output directory. If you do not specify "bz2:", the mds mesh file will be saved in ascii format, which is a waste of space, so I suggest always prefixing your directory name with "bz2:". This also applies later to the output mesh directory generated by Chef (see below).

Note that convert needs to run with a number of processes (-np ##) equal to the number of input parts in the Simmetrix mesh. For cases that include matched faces, the Simmetrix mesh must include only one part, which is why convert runs here with -np 1. In other circumstances, convert can run in parallel if the Simmetrix mesh has already been partitioned into n parts with n>1 (for instance, a mesh generated in parallel with BLMesherParallel and/or partitioned with phParAdapt-Simmetrix).
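The three arguments and the -np rule above can be summarized in a short Python sketch that assembles the command line. This is only an illustration: the launcher name mpirun, the bare executable name convert and the helper build_convert_command are assumptions, and the authoritative invocation is the one in simMeshToMdsMesh.sh.

```python
import shlex


def build_convert_command(model, sim_mesh, out_dir, num_parts=1,
                          compress=True, launcher="mpirun"):
    """Assemble the argument list for one run of the convert executable.

    'launcher' and the executable name 'convert' are assumptions made for
    this sketch; check simMeshToMdsMesh.sh for the real call.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    # "bz2:" requests a compressed MDS mesh; it is not part of the directory name.
    target = ("bz2:" if compress else "") + out_dir
    # One MPI process per part of the input Simmetrix mesh.
    return [launcher, "-np", str(num_parts), "convert", model, sim_mesh, target]


# Single-part mesh with matched faces, as in the 3-way subchannel case.
cmd = build_convert_command("geom.xmt_txt", "parts.sms", "mdsMesh_bz2")
print(shlex.join(cmd))
```

Passing the result as a list to subprocess.run avoids any shell quoting issues if the paths contain spaces.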

Boundary and initial conditions (spj file)

Before running Chef for mesh operations such as uniform refinement, tetrahedronization and partitioning, we need to define the BCs and ICs for the generation of the phasta files. These BCs and ICs are defined in an spj file, which is in ASCII to facilitate scripting of BCs/ICs. Most of the attributes you are familiar with from the Simmodeler GUI can be specified in the spj file.

For the 3-way channel flow, see the spj file located in /sgidata2/mrasquin/Models/subchannel/subchannel_3way/Simplified_SPJ_file/geom.spj. Each line corresponds to one attribute that applies to one face.

The structure of the spj file is:

# Optional comments anywhere preceded by the pound symbol (#).
# For each boundary or initial condition a line as follows:
<attribute_name>: <face_id> <dimension> <attribute list>

Note the following.

  • <dimension>: 2 for a face attribute (2D), 3 for the initial conditions that apply to the 3D domain. 1D and 0D attributes are also allowed for lines and vertices if needed.
  • <attribute list>: typically magnitude and direction, if this applies.

Syntax is strict.

  • No empty line. Each line should be either a comment which starts with the # character, or an attribute.
  • There must be one single space after the colon character.
  • There must be one single space between any numbers.
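Because the syntax is strict, it can help to check an spj file before handing it to Chef. The following Python sketch is a hypothetical checker, not part of Chef: it flags empty lines and lines that do not follow the pattern shown above with single spaces. The attribute names in the sample are placeholders, and allowing spaces inside the attribute name is an assumption.

```python
import re

# <attribute_name>: <face_id> <dimension> <attribute list>
# Exactly one space after the colon and between the following fields.
# Allowing spaces inside the attribute name is an assumption of this sketch.
ATTRIBUTE_LINE = re.compile(r"^[^:#]+: \S+(?: \S+)+$")


def check_spj_lines(lines):
    """Return (line_number, message) pairs for lines that break the rules."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if text.strip() == "":
            problems.append((number, "empty line"))
        elif text.startswith("#"):
            continue  # comment line, always allowed
        elif not ATTRIBUTE_LINE.match(text):
            problems.append((number, "expected '<name>: <fields>' with single spaces"))
    return problems


# Placeholder content; real attribute names come from your own spj file.
sample = [
    "# hypothetical attribute lines",
    "example attribute: 42 2 0 0 0",
    "",
    "example attribute:  42 2 0",
    "example attribute: 42  2 0",
]
for number, message in check_spj_lines(sample):
    print(number, message)
```

To check a real file, pass the lines read from it, for example check_spj_lines(open("geom.spj").readlines()).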

In this example, a zero "traction vector" attribute is specified on the periodic faces parallel to the length of the channel. It is wrong to specify such an attribute on these periodic faces for a 3-way channel, but this was inherited from the 1-way periodic channel where these faces were slip walls instead of periodic faces. I will try to update my test cases in the future. However, because we now have continuous integration tools that run every night to verify the Chef code, I would need to update all the cases if I modified the spj file now. So double check the attributes that you need for this model and consider the existing spj file as a source of inspiration rather than the correct spj file for production runs.

Chef

A few rules must be followed to run Chef.

First, the number of mpi processes must be equal to the number of input parts (this has changed in the newest version of Chef, as described below).

Second, Chef is threaded with openmp, and the total number of output parts after partitioning should be at most equal to the total number of hardware threads available on your machine/allocation. On BGQ, there are 4 hardware threads per core. On Linux platforms such as firebird, the number of hardware threads corresponds to the number of available cores. That said, we have observed that Chef can hang if the number of output parts is equal to the total number of available hardware threads. It is therefore safer to keep the number of output parts below the number of available hardware threads. On firebird, this means we should not try to partition a mesh into more than 16 parts.
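For the threaded version described here, the second rule amounts to a small piece of arithmetic, sketched below in Python. The helper names and the core counts in the example are hypothetical, and reading "lower" as "strictly lower" is an interpretation of the advice above.

```python
def hardware_threads(cores, threads_per_core=1):
    """Hardware threads of an allocation: 4 per core on BGQ, 1 per core on Linux."""
    return cores * threads_per_core


def is_safe_part_count(output_parts, cores, threads_per_core=1):
    """True if output_parts stays strictly below the hardware-thread count.

    The strict inequality reflects the observation that Chef can hang when
    the two numbers are equal.
    """
    return 1 <= output_parts < hardware_threads(cores, threads_per_core)


# Hypothetical allocation of 16 BGQ cores, i.e. 64 hardware threads.
print(is_safe_part_count(32, cores=16, threads_per_core=4))
print(is_safe_part_count(64, cores=16, threads_per_core=4))
```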

The next mesh operations will have to take place on Tukey and Cetus/Mira.

The first example of a partitioning with Chef can be found in /sgidata2/mrasquin/Models/subchannel/subchannel_3way/Mixed-Parallel1-parasolid-9.0-140927/2-A0/1PFPP-phPA/4-1-Chef-PartLocal-Scratch. With my naming convention, 4-1-Chef-PartLocal-Scratch can be decomposed as follows:

  • The first number (4) corresponds to the number of output parts
  • The second number (1) corresponds to the number of input parts
  • "Chef" means this mesh was treated with this program (as opposed to phParAdapt, phTest, etc., which are previous executables that we used for similar purposes).
  • "PartLocal" means the mesh is partitioned locally.
  • "Scratch" means that the initial solution in the resulting phasta files is generated entirely from the spj file defined in a previous section of this tutorial. That is, we are starting a simulation "from scratch," using the spj file's initial conditions as opposed to a solution migrated from a previous run.

In summary, Chef was used in this directory to partition a single part mesh into 4 parts and the solution in the phasta files was generated directly from scratch using the spj file.
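The naming convention is regular enough to be decoded by a script, which can be handy when post-processing many case directories. The Python sketch below is only an illustration of the convention as described above; parse_case_name and its dictionary keys are hypothetical names.

```python
def parse_case_name(name):
    """Split a directory name such as '4-1-Chef-PartLocal-Scratch'.

    Fields: <output parts>-<input parts>-<tool>-<remaining tags...>
    """
    fields = name.split("-")
    if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
        raise ValueError(f"unexpected case directory name: {name!r}")
    return {
        "output_parts": int(fields[0]),
        "input_parts": int(fields[1]),
        "tool": fields[2],
        "tags": fields[3:],
    }


case = parse_case_name("4-1-Chef-PartLocal-Scratch")
print(case["output_parts"], case["input_parts"], case["tool"], case["tags"])
```

The same function also handles longer names such as 8-4-Chef-UR2-Tet-PartLocal-SolMgr, where the extra tags simply end up in the "tags" list.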

Chef's input files

The script to run Chef is named runChef.sh in this directory and simply calls the executable. Chef reads all it needs from two input files called numstart.dat and adapt.inp.

numstart.dat

Instead of building the initial solution from scratch using the initial conditions defined in the spj file, the user can migrate an existing solution stored in a set of restart files that were saved from a previous phasta simulation. numstart.dat contains the time step stamp of the input restart files to read in order to migrate a solution.

adapt.inp

This input file contains all the other parameters Chef expects. Note that many of these parameters have been inherited from the old phParAdapt, and are currently obsolete or unused. In what follows, all the parameters available in adapt.inp are listed and the critical parameters are in bold. Any line that starts with # is ignored.

  • globalP: obsolete/unused.
  • timeStepNumber: this is the time step of the output phasta files that will be generated by Chef. This stamp can be different from the number specified in numstart.dat which can be practical in some situations. But most of the time, this number is set equal to what is specified in numstart.dat
  • ensa_dof: this corresponds to the number of degrees of freedom in the solution field of the output restart file. Note that it should correspond to the number of initial conditions specified in the spj file if the solution is built from scratch. When the solution is migrated from existing restart files, it should also correspond to the number of dof in the existing solution field. Here, this number is set to 5 for single phase flow with no turbulence model.
  • attributeFileName: path to the spj file for the boundary and potentially initial conditions
  • modelFileName: path to the geometric model (can be a parasolid or geomsim model on Linux but only geomsim is available on BGQ).
  • meshFileName: path to the directory that contains the input mesh files in the SCOREC MDS format. Note that the path must end with a /. This path can also be prefixed with "bz2:" to tell the mesh file reader that the files have been compressed. This follows the same convention as the third argument of convert in the Mesh conversion section.
  • outMeshFileName: obviously the name of the directory that will include the resulting output mesh files. Note again the trailing / character. The same convention with "bz2:" keyword applies.
  • restartFileName: this gives the path to the restart files that needs to be read in when solution migration is activated. In this case, the path should look for instance like "../4-procs_case/restart". The phasta reader will then add the time step stamp to the name of this restartFileName variable, as well as the file #. When there is no solution migration like in this example, this parameter can be commented out for the sake of clarity.
  • adaptFlag: if 0, no mesh adaptation will take place. If set to 1 with AdaptStrategy set to 7, the mesh will be uniformly refined. Note that adaptation only works with a mixed mesh (with wedges in the BL) and not with an all-tet mesh. Tetrahedronization should therefore take place after uniform refinement. Right now, the mixed mesh gets uniformly refined everywhere including the BL, but it is possible to refine uniformly outside the BL only with some light modifications of the code. In the future, we hope to have other adaptation strategies in place in Chef based on local error indicators. If you are interested in these strategies now, phParAdapt-Simmetrix must be used. Note also that if adaptFlag is set to 1, SolutionMigration must also be set to 1 (see below for this parameter) and the path to the restart files specified.
  • rRead: obsolete/unused.
  • rStart: obsolete/unused.
  • AdaptStrategy: This parameter is read if adaptFlag is 1. When set to 7, uniform refinement of a mixed mesh can take place. This is currently the only strategy tested in Chef. If interested in other more sophisticated adaptation strategies, phParAdapt-Simmetrix must be used for now.
  • RecursiveUR: if AdaptStrategy is set to 7, Chef offers the possibility to do recursive uniform refinement within the same job. Beware of the memory consumption if you set this value to more than 1, since the mesh can grow quickly.
  • Periodic: obsolete. Periodicity in the mesh and in the solution is now treated automatically as long as i) the mesh built with BLMesher is periodic (i.e. the location of the mesh vertices on periodic faces is the same) and ii) the spj file contains the correct "periodic slave" attributes.
  • prCD: obsolete/unused.
  • timing: obsolete/unused.
  • outputFormat: obsolete. Phasta files are saved by default in binary format.
  • internalBCNodes: obsolete/unused.
  • WRITEASC: obsolete/unused.
  • phastaIO: obsolete/unused.
  • numTotParts: Final number of parts. If numTotParts is larger than the number of Chef processes which is equal to the number of input parts, the mesh will be partitioned.
  • elementsPerMigration: In order to reduce the memory footprint of Chef, the user can reduce the default number of elements that can be migrated at a time during partitioning or partition improvement.
  • SolutionMigration: Activates the migration of the solution from an existing set of restart files. In this case, the path to the phasta files that contain the solution to migrate must be specified through the restartFileName parameter (see above). If the mesh is refined, the solution that is migrated will be interpolated to the new vertices of the mesh. Note also that if the solution is migrated, then the spj file should contain NO information about the initial condition. Indeed any information mentioned in the spj file will prevail. Therefore, if the spj file contains information about the initial conditions, the solution migrated from existing restart files will be overwritten and the resulting phasta files will include again the scratch solution specified in the spj file.
  • DisplacementMigration: Also migrates the displacement field along with the solution field for other adaptation strategies. Not used for AdaptStrategy 7, so it can be ignored for now.
  • isReorder: obsolete/unused. Reordering for better cache performance is now applied by default to both the phasta files and mesh files.
  • Tetrahedronize: tetrahedronize a mixed mesh if set to 1. Note that if both adaptFlag and Tetrahedronize are set to 1, adaptation of the input mixed mesh will take place before tetrahedronization. In all cases, partitioning is always the last mesh operation. But again, an all-tet mesh cannot be further refined, so tetrahedronization should not take place too early in the partitioning workflow in order to get enough aggregated memory for potential future adaptation.
  • numSplit: obsolete/unused.
  • LocalPtn: local partitioning if set to 1, global partitioning if set to 0. Currently, only local partitioning is implemented in Chef and has been shown to be sufficient so far.
  • RecursivePtn: should always be set to 1. In the past, this parameter allowed recursive partitioning steps in phParAdapt. The code will stop or crash if this parameter is not 1.
  • RecursivePtnStep: obsolete/unused.
  • partitionMethod: Currently, the GRAPH method for local partitioning is hard coded in one of the Chef routines.
  • ParmaPtn: If set to 1, the load balance in terms of both elements and vertices per part is improved further after the partitioning with Parma. It is strongly suggested to keep ParmaPtn set to 1.
  • dwalMigration: This parameter is useful in case the distance to the wall for a turbulence model such as RANS or DDES has already been computed by phasta. In this case, it is possible to migrate also this field along with the solution field. SolutionMigration must therefore be set to 1 for that purpose, since the dwal field cannot be migrated alone without the solution field.
  • buildMapping: This computes the vertex mapping between the input and output mesh. It is strongly suggested to keep this parameter always set to 1. Otherwise, you will not be able to reduce your solution from your final partitioning down to the initial or any intermediate mesh (we have developed a tool for that purpose), which can be catastrophic if you are interested in local adaptation based on an error indicator. Note that building the mapping does not make sense if the mesh is uniformly refined so it should be set to 0 in this case.
  • initBubbles: If this flag is activated, Chef will use the external bubble information file 'bubbles.inp' to initialize the level set distance field.
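Several of the parameters above constrain each other. The Python sketch below collects the dependencies stated in this list into a small consistency check. It works on a plain dictionary of parameter names and values, so it makes no assumption about the on-disk syntax of adapt.inp; check_adapt_settings is a hypothetical helper and not part of Chef.

```python
def check_adapt_settings(params):
    """Return warnings for a dict mapping adapt.inp parameter names to values."""

    def flag(name):
        # Missing flags are treated as 0 in this sketch.
        return int(params.get(name, 0))

    warnings = []
    if flag("adaptFlag") == 1 and flag("SolutionMigration") != 1:
        warnings.append("adaptFlag 1 needs SolutionMigration 1")
    if flag("SolutionMigration") == 1 and not params.get("restartFileName"):
        warnings.append("SolutionMigration 1 needs restartFileName")
    if flag("dwalMigration") == 1 and flag("SolutionMigration") != 1:
        warnings.append("dwalMigration 1 needs SolutionMigration 1")
    if "RecursivePtn" in params and flag("RecursivePtn") != 1:
        warnings.append("RecursivePtn must be 1")
    # AdaptStrategy 7 is uniform refinement, for which the mapping makes no sense.
    if flag("adaptFlag") == 1 and flag("AdaptStrategy") == 7 and flag("buildMapping") == 1:
        warnings.append("buildMapping should be 0 with uniform refinement")
    for name in ("meshFileName", "outMeshFileName"):
        if name in params and not str(params[name]).endswith("/"):
            warnings.append(name + " must end with '/'")
    return warnings


settings = {
    "adaptFlag": 1,
    "AdaptStrategy": 7,
    "SolutionMigration": 0,
    "buildMapping": 1,
    "RecursivePtn": 1,
    "meshFileName": "bz2:mdsMesh_bz2",
}
for message in check_adapt_settings(settings):
    print(message)
```

An empty list means that none of these particular rules is violated; it does not guarantee that the whole input file is valid.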

The second example of a partitioning with Chef can be found in /sgidata2/mrasquin/Models/TwoPhase/subchannel/subchannel_3way/Mixed-Parallel1-parasolid-9.0-140906/2-A0/1PFPP-phPA/4-1-Chef-PartLocal-Scratch/8-4-Chef-Tet-PartLocal-SolMgr. For this case, based on the naming convention of 8-4-Chef-Tet-PartLocal-SolMgr (and the parameters specified in adapt.inp and numstart.dat),

  • the number of output parts requested is 8,
  • the number of input parts is 4 (note "-np 4" in the runChef.sh script),
  • the input mixed mesh is first tetrahedronized before being partitioned,
  • the solution in the resulting phasta files is migrated from the previous Chef run.

Note that the spj file is different for this second example and the initial conditions have been commented out in order not to overwrite the solution that is migrated from the previous Chef run.

The third and final example can be found in /sgidata2/mrasquin/Models/TwoPhase/subchannel/subchannel_3way/Mixed-Parallel1-parasolid-9.0-140906/2-A0/1PFPP-phPA/4-1-Chef-PartLocal-Scratch/8-4-Chef-UR2-Tet-PartLocal-SolMgr. In this directory 8-4-Chef-UR2-Tet-PartLocal-SolMgr, Chef

  • reads a four part mesh,
  • applies a double recursive uniform refinement,
  • tetrahedronizes the resulting mixed mesh that has been uniformly refined twice,
  • partitions the resulting 4 part all-tet uniformly refined mesh into 8 parts,
  • migrates and interpolates the solution read from existing restart files coming from the first example.

As a final comment, note that the restart files are always read directly from a procs_case directory. However, when the number of output restart files exceeds 2048, the restart files are saved in subdirectories of the root procs_case directory in order to reduce file contention, in a way similar (but not identical) to what you implemented at some point in your version of phasta. The best strategy would be to write phasta files using mpi_io, for instance, so that we can store more than one part in a single file and avoid a large number of phasta files.

For further partitioning on BG/Q machines a conversion to the native Parasolid model is required. The tool is located in: /Install/SCOREC.develop/scorec/test/cadToSim/cadToSim and should be run from [Case directory]/convertParasolid2ParasolidNative/ on firebird.

Updated Chef version (2015/03/26)

1) MPI implementation

A new version of chef has been implemented and does not rely on threads any more. Instead, it is now based on a pure MPI implementation. That means that there is an important change in how chef is called at runtime.

With the previous threaded version, the number of MPI processes had to be equal to the number of input parts. Chef was then in charge of starting a number of threads equal to the number of output parts, which was automatic.

Since the pure MPI version of chef does not start threads any more, it now requires a number of MPI processes equal to the final number of output parts, and not input parts.


2) adapt.inp

In the new version of chef, "numTotParts" in adapt.inp (which was used to specify the final number of output parts) has been replaced by "splitFactor", which is the ratio of the number of output parts to the number of input parts. If you set this parameter to 1, the mesh will not be split and the number of output parts will be equal to the number of input parts. If you set this parameter to 2, each part of your input mesh will be split into 2 new sub-parts, etc. Keep in mind that the number of MPI processes requested for chef must therefore be equal to (number of input parts) * (splitFactor).
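The relation between the old numTotParts, the new splitFactor and the number of MPI processes is easy to get wrong when updating old run scripts, so here is a minimal Python sketch of the arithmetic. The function names are hypothetical, and requiring splitFactor to be a whole number is an assumption based on the description of each part being split into sub-parts.

```python
def split_factor_for(input_parts, output_parts):
    """splitFactor equivalent to an old numTotParts value (= output_parts)."""
    if input_parts < 1 or output_parts < input_parts:
        raise ValueError("need 1 <= input_parts <= output_parts")
    if output_parts % input_parts != 0:
        # Assumption: each input part is split into the same whole number of parts.
        raise ValueError("output_parts must be a multiple of input_parts")
    return output_parts // input_parts


def chef_mpi_processes(input_parts, split_factor):
    """MPI processes to request for the pure MPI version of chef."""
    if input_parts < 1 or split_factor < 1:
        raise ValueError("input_parts and split_factor must be at least 1")
    return input_parts * split_factor


# The 8-4 cases above: 4 input parts, 8 output parts.
factor = split_factor_for(4, 8)
print(factor, chef_mpi_processes(4, factor))
```

For the 8-4 cases this gives a splitFactor of 2 and 8 MPI processes, whereas the threaded version ran the same cases with -np 4.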

I have also removed the obsolete parameters from adapt.inp and saved a representative version of this file in /projects/tools/SCOREC.develop/runscripts/adapt.inp


3) Paths

I have updated chef on the viz nodes, Mira and Tukey so that it only relies on the more robust pure MPI implementation.

On the viz nodes, use /projects/tools/SCOREC.develop/build-chefMPI-GNU-*/test/chef. For simplicity, this is the default version of the master branch coming directly from our github repository.

On Tukey, use /home/mrasquin/SCOREC.develop/build-tukey-GNU-OptG-c2c360bc-mpi-*

  • build-tukey-GNU-OptG-c2c360bc-mpi-tol35-noblsnap means that the target imbalance for the vtx and elem is 3% and 5% respectively, and BL snapping is off during uniform refinement (UR).
  • build-tukey-GNU-OptG-c2c360bc-mpi-tol35 means that the target imbalance for the vtx and elem is 3% and 5% respectively, and BL snapping is on during UR.
  • build-tukey-GNU-OptG-c2c360bc-mpi-tol33 means that the target imbalance for both the vtx and elem is 3%, and BL snapping is on during UR.

Note that these versions have been slightly modified w.r.t. the master branch. In particular, the imbalance target is not a parameter yet. Also, in Parma, HPS (Heavy Part Splitting) and FixDisconnectedPart are not called here because of the latest version of the diffusion algorithm, which has improved selection of (i) target parts for element exchange and (ii) elements.

On Mira, use /home/mrasquin/SCOREC.develop/build-XL-OptG-c2c360bc-mpi-*. Similar comments apply to build-XL-OptG-c2c360bc-mpi-tol33, build-XL-OptG-c2c360bc-mpi-tol35 and build-XL-OptG-c2c360bc-mpi-tol35-noblsnap.

Note that BL snapping is not called for a repartitioning of the mesh. It can only play a role during uniform refinement. Consequently, if you do not request a UR in adapt.inp, then build-*-tol35 and build-*-tol35-noblsnap will behave the same way.
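The build directory names listed above encode the imbalance targets and the BL snapping setting, so a script can pick the right build automatically. The Python sketch below decodes the names quoted on this page. It is an illustration only: parse_build_name is a hypothetical helper, and the pattern assumes an 8-character hexadecimal field and single-digit percentages, as in the examples.

```python
import re

# Pattern inferred from the directory names quoted above (an assumption):
# build-<label>-<8 hex chars>-mpi-tol<vtx %><elem %>[-noblsnap]
BUILD_NAME = re.compile(
    r"^build-(?P<label>.+)-(?P<commit>[0-9a-f]{8})-mpi-"
    r"tol(?P<vtx>\d)(?P<elem>\d)(?P<noblsnap>-noblsnap)?$"
)


def parse_build_name(name):
    """Decode the tolerance and BL snapping settings from a build directory name."""
    match = BUILD_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized build directory name: {name!r}")
    return {
        "label": match["label"],
        "commit": match["commit"],
        "vtx_imbalance_pct": int(match["vtx"]),
        "elem_imbalance_pct": int(match["elem"]),
        "bl_snapping": match["noblsnap"] is None,
    }


build = parse_build_name("build-tukey-GNU-OptG-c2c360bc-mpi-tol35-noblsnap")
print(build)
```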

In case you are wondering what the weird numbers in the name of the build directory are, they come from the git commit hash, which is a unique identifier associated with a git commit (this makes it easier to tie an executable to a version of the code).