Difference between revisions of "NAS"

From PHASTA Wiki

Revision as of 15:35, 12 October 2022

Wiki for information related to the NASA Advanced Supercomputing (NAS) facility.

Overview

Key                      Value                   Notes
Machines                 Pleiades                Compute
                         Lou                     Storage and Analysis
                         Electra                 Compute
                         Endeavour               Compute
                         Merope                  Compute
                         Aitken                  Compute
Job Submission System    PBS
Facility Documentation   Support Knowledgebase

How-To's

How-To's in Separate Wikis

Backup Data from Scratch Directories

This is done simply by copying data from the /nobackup/$USER directories to your home directory on Lou (lfe). The /nobackup/$USER directories are mounted onto lfe, so transfers should be done on lfe.

It is recommended to mirror the directory structure of your /nobackup/$USER directory on lfe so that the data can easily be restored to its original state. This is especially important if you use symlinks, as they are path dependent and will break if either the source file or the symlink itself is not in the correct location.

This can be done with scp, but NASA's in-house utility shiftc is recommended. shiftc automatically performs parallel file transfers and data integrity checks and repairs, and it offers syncing features similar to rsync.

Commands:

jrwrigh7@lfe7: shiftc -r -d --sync /nobackup/jrwrigh7/models/STGFlatPlate/STFM_Tet_dz4-10_dx15 .

This will copy the directory STFM_Tet_dz4-10_dx15 to the current location (.). The flags work as follows:

  • -r: Recursively copy files from source
  • -d: Create required directories that don't already exist. Equivalent of the -p flag for mkdir
  • --sync: Only copy over "new" files, where "new" means the modification time or file size differs between source and destination.
    • If a file exists on destination (.), but not source (STFM_Tet_dz4-10_dx15), it will not be copied back to source nor will it be deleted to match the state of source.

Once this command is submitted, the transfer process will be backgrounded. Progress can be viewed by running shiftc --monitor. Additionally, you will receive an email when the transfer job is completed.

jrwrigh7@lfe7: shiftc --stop --id [shiftc job ID]

This will stop the given shiftc job. The [shiftc job ID] is the same number that appears beside each transfer in the output of shiftc --monitor.

More documentation for shiftc can be found in its man page (man shiftc) and on NAS's documentation website.
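
The mirroring recommendation above can be sketched as a small helper that works out where a scratch directory should land on lfe. This is a minimal sketch, not part of shiftc: mirror_parent is a hypothetical function name, and it assumes the backup root is your home directory on lfe. The final line only prints the commands so they can be reviewed before being run.

```shell
# Hypothetical helper: print the directory under the backup root that
# mirrors the parent of a path under the scratch root.
# $1: path under the scratch root, $2: scratch root, $3: backup root on lfe
mirror_parent() {
  rel=${1#"$2"/}                      # path relative to the scratch root
  printf '%s/%s\n' "$3" "$(dirname "$rel")"
}

src=/nobackup/$USER/models/STGFlatPlate/STFM_Tet_dz4-10_dx15
dest=$(mirror_parent "$src" "/nobackup/$USER" "$HOME")

# Print (rather than run) the resulting commands for review.
echo "mkdir -p $dest && shiftc -r -d --sync $src $dest"
```

Because the destination keeps the same relative path as the source, the data can later be copied back by swapping the two arguments.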

Control MPI Rank Placement

Rank 1 Solo Node

To make the rank 1 MPI process take a node on its own, put this in the PBS directives:

#PBS -l select=1:mpiprocs=1:model=sky_ele+1:mpiprocs=40:model=sky_ele

This will request 2 nodes: one will have the rank 1 process all by itself, and the other will have 40 MPI processes (one for each of the 40 CPU cores available on sky_ele nodes).
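
As a sanity check on a request like this, the total number of MPI processes can be summed from the select string. The following is a minimal sketch: total_ranks is a hypothetical helper (not a PBS or NAS tool), and it assumes every chunk has the form N:mpiprocs=M:... and that a chunk without mpiprocs contributes one process per node.

```shell
# Hypothetical helper: sum MPI processes over the '+'-separated chunks
# of a PBS select string. Runs in a subshell so the IFS change stays local.
total_ranks() (
  total=0
  IFS='+'
  for chunk in $1; do
    count=${chunk%%:*}   # number of nodes in this chunk
    procs=$(printf '%s\n' "$chunk" | sed -n 's/.*mpiprocs=\([0-9][0-9]*\).*/\1/p')
    total=$(( total + count * ${procs:-1} ))   # assume 1 if mpiprocs is absent
  done
  echo "$total"
)

total_ranks "1:mpiprocs=1:model=sky_ele+1:mpiprocs=40:model=sky_ele"
```

For the directive above this gives 41 processes in total: one on the solo node plus 40 on the second node.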

Distribute Non-First Rank MPI Processes

For controlling the placement of non-first rank MPI processes, use the mbind.x utility.

For example, if we have requested 4 nodes and want 10 MPI processes per node, the mpiexec command needs to be modified to the following:

mpiexec -np 40 /u/scicon/tools/bin/mbind.x -n10 [executable]

Note that mbind.x is also socket aware, so it will distribute processes evenly between nodes and between the CPUs in each node (NAS nodes have 2 CPUs per node). For more information on mbind.x, see its help flag (mbind.x -help) or NAS's documentation website.
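
The relationship between the node count, the per-node process count, and the -np value in the example above can be sketched as follows. mbind_cmd is a hypothetical helper and ./my_solver is a placeholder executable; only the mpiexec -np and mbind.x -n options shown earlier are used. The function prints the command line rather than running it.

```shell
# Hypothetical helper: build the mpiexec line for an even distribution.
# $1: number of nodes, $2: MPI processes per node, $3: executable
mbind_cmd() {
  np=$(( $1 * $2 ))    # total MPI processes across all nodes
  echo "mpiexec -np $np /u/scicon/tools/bin/mbind.x -n$2 $3"
}

mbind_cmd 4 10 ./my_solver
```

With 4 nodes and 10 processes per node this reproduces the -np 40 and -n10 values used above.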

Common commands

  • node_stats.sh: Displays how many nodes are available or actively running jobs
  • tracejobssh: Helps to answer "Why isn't my job running?". Part of the git repo.

See Priority "Score" in Queue

To see your priority "score" in PBS, run qstat -W o=+pri, which adds the "Priority" column to the output of qstat.

Priority Scoring (as of 2021-01-22)

  • Job priority score grows by 1 every 12 hours
  • We are capped at a max score of 20 per job
    • Note that other users/groups using NAS may start with higher priority and grow higher than 20
    • The result is that it's quite difficult to get large jobs running
  • If you don't have any jobs running, you get an additional +10 to the score
    • This score bump is removed as soon as you have a running job
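
The scoring rules above can be turned into a rough estimate. This is only a sketch under several assumptions that the rules do not state: the score starts at 0, it grows in whole steps of 1 per full 12 hours, and the +10 bonus is added on top of the cap of 20. priority_estimate is a hypothetical helper; the value reported by qstat is the authoritative one.

```shell
# Hypothetical estimate of a job's priority score from the rules above.
# $1: whole hours the job has been queued
# $2: number of your jobs that are currently running
priority_estimate() {
  score=$(( $1 / 12 ))            # +1 per full 12 hours (assumed start at 0)
  if [ "$score" -gt 20 ]; then
    score=20                      # per-job cap
  fi
  if [ "$2" -eq 0 ]; then
    score=$(( score + 10 ))       # bonus while nothing is running (assumed to stack on the cap)
  fi
  echo "$score"
}

priority_estimate 36 0
```

Under these assumptions, a job queued for 36 hours with no other jobs running scores 3 + 10 = 13.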