GridFTP

From PHASTA Wiki
Revision as of 15:38, 4 August 2011 by Matthb2

Introduction

GridFTP is a high-performance file transfer tool supported by many of the large supercomputing sites. It uses parallel streams to hide the effects of latency and of TCP's congestion control.

GridFTP is part of the larger Globus Toolkit, which also includes (among other things) a custom PKI-based authentication mechanism. This can be used to achieve password-less authentication with supported sites (most of TeraGrid, for example).

Setup

GridFTP should already be in your path and configured by default on jumpgate-phasta.colorado.edu. If it is not (due to unusual shell configuration or other issues) you can usually get it by running

 source /etc/profile

To test whether GridFTP is set up correctly in your environment, you can run these commands (each should give you some output):

 which globus-url-copy
 which myproxy-logon
 echo $GLOBUS_LOCATION
 echo $LD_LIBRARY_PATH | grep gt5
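The four checks above can be bundled into one function that reports what is missing. This is only a minimal sketch: the name `check_gridftp_env` is made up for illustration, and the `gt5` match simply mirrors the grep shown above.

```shell
# Hypothetical helper (not part of the Globus Toolkit): runs the same four
# checks as the commands above and returns non-zero if any of them fails.
check_gridftp_env() {
    status=0
    for tool in globus-url-copy myproxy-logon; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "MISSING: $tool"
            status=1
        fi
    done
    if [ -n "${GLOBUS_LOCATION:-}" ]; then
        echo "ok: GLOBUS_LOCATION"
    else
        echo "MISSING: GLOBUS_LOCATION"
        status=1
    fi
    # Same test as `echo $LD_LIBRARY_PATH | grep gt5`.
    case "${LD_LIBRARY_PATH:-}" in
        *gt5*) echo "ok: LD_LIBRARY_PATH" ;;
        *)     echo "MISSING: gt5 in LD_LIBRARY_PATH"; status=1 ;;
    esac
    return "$status"
}

check_gridftp_env || echo "Setup looks incomplete; try: source /etc/profile"
```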

To use sshftp mode, you'll also want SSH key-based authentication configured for your account (either with a key that has no passphrase, or with ssh-agent).

Authentication

Most sites support the use of "GSI" password-less authentication. If the site you want to connect to offers this option, you probably want to use it.

GSI authentication is usually implemented with a tool called "myproxy", which generates a short-lived key (a proxy credential) that you can use to authenticate without a password until it expires.

The first step to using GSI authentication is to install the necessary certificates in your home directory. Most sites will allow you to do this automatically using the myproxy tool's "-T" option. You only need to use this option the first time you use myproxy.

For example, to configure my account to connect to the GridFTP/myproxy server at ANL, I could use this command:

 myproxy-logon -T -l matthb -t 300 -s gs1.intrepid.alcf.anl.gov

This command can be dissected as follows:

-T tells myproxy to fetch the trust roots (CA certificates, etc.). You should only need to do this the first time you connect to a particular server and when the server's CA is updated.

-l matthb specifies my username (replace "matthb" with your username)

-t 300 specifies that I want my session key to expire in 300 hours (the default is 24)

-s gs1.intrepid.alcf.anl.gov specifies the myproxy server (*not* necessarily the gridftp server)
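As a rough sketch of the "only use -T the first time" rule, the function below prints a myproxy-logon command and adds -T only when no trust roots appear to be installed yet. The function name `myproxy_logon_cmd` is hypothetical, and the check assumes that trust roots are stored in `$X509_CERT_DIR` or, failing that, `~/.globus/certificates`. It does not notice a CA update on the server or a switch to a different server, so you may still need to add -T by hand.

```shell
# Hypothetical helper: print (do not run) a myproxy-logon command line.
# Usage: myproxy_logon_cmd USERNAME MYPROXY_SERVER LIFETIME_HOURS
myproxy_logon_cmd() {
    user=$1
    server=$2
    hours=$3
    # Assumed trust-root location; adjust if your site differs.
    certdir=${X509_CERT_DIR:-$HOME/.globus/certificates}
    trust=""
    # Ask for the trust roots only if the directory is missing or empty.
    if [ ! -d "$certdir" ] || [ -z "$(ls -A "$certdir" 2>/dev/null)" ]; then
        trust="-T "
    fi
    echo "myproxy-logon ${trust}-l $user -t $hours -s $server"
}

myproxy_logon_cmd matthb gs1.intrepid.alcf.anl.gov 300
```

Printing the command instead of running it lets you check the username and server before you are prompted for a password.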


Common myproxy Servers

Intrepid and Eureka (ALCF/ANL)

 gs1.intrepid.alcf.anl.gov

(authenticate with your CRYPTOCard and PIN)

TeraGrid (Kraken, Ranger, etc.)

 myproxy.teragrid.org

(authenticate with your TeraGrid portal username/password)

*-phasta.colorado.edu

 myproxy-phasta.colorado.edu

(authenticate with your normal UNIX/LDAP username/password) (coming soon)
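If you switch between sites often, the list above can be kept as a small lookup. This is just a convenience sketch: the function name and the short site keys (`intrepid`, `kraken`, and so on) are invented here, and only the server names come from the list above.

```shell
# Hypothetical lookup: short site name -> myproxy server from the list above.
myproxy_server_for() {
    case "$1" in
        intrepid|eureka) echo "gs1.intrepid.alcf.anl.gov" ;;
        kraken|ranger)   echo "myproxy.teragrid.org" ;;
        phasta)          echo "myproxy-phasta.colorado.edu" ;;  # listed as "coming soon"
        *) echo "unknown site: $1" >&2; return 1 ;;
    esac
}

myproxy_server_for kraken   # prints myproxy.teragrid.org
```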

GSISSH

Once you have a GSI key/ticket, many sites allow you to use it to gain shell access as well as to do file transfers. For example, even if I don't have my SecurID token for Kraken handy, I can still log in using GSI and my TeraGrid credentials as follows:

 myproxy-logon -T -l myteragriduser -s myproxy.teragrid.org
 gsissh -v mykrakenuser@kraken-gsi.nics.utk.edu

GridFTP

The primary tool for doing GridFTP transfers is called "globus-url-copy". To see its complete usage you can run

 globus-url-copy -help

In general, you should start with the following set of options:

 globus-url-copy -r -cd -rst -rst-retries 0 -fast -vb -p 64 -g2 -stripe -tcp-bs 4M

followed by your source and destination URLs.

To transfer to/from jumpgate-phasta.colorado.edu, your URL should look something like this:

 sshftp://username@jumpgate-phasta.colorado.edu/users/username/file

Please see the site-specific documentation for other sites' URLs. These two pages may be helpful to get you started:

 https://wiki.alcf.anl.gov/index.php/Using_GridFTP
 https://www.teragrid.org/web/user-support/transfer_location

Recent versions of globus-url-copy also support a mode which behaves somewhat like rsync, enabled by the

 -sync -sync-level 3

flags.

A complete example to copy a directory from ALCF to jumpgate would be:

 globus-url-copy -r -cd -rst -rst-retries 0 -fast -vb -p 64 -stripe -tcp-bs 4M -g2 -sync -sync-level 3 gsiftp://gridftp.intrepid.alcf.anl.gov/intrepid-fs0/users/matthb/scratch/ sshftp://matthb2@jumpgate-phasta.colorado.edu/scratch/matthb2/scratch_from_anl/
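Since the option list is long and easy to mistype, it can help to keep it in one place. The sketch below (the function name `gridftp_copy_cmd` is made up) only prints the command from the example above for a given source and destination, with the stream count as an optional third argument.

```shell
# Hypothetical helper: print the recommended globus-url-copy command line.
# Usage: gridftp_copy_cmd SOURCE_URL DEST_URL [STREAMS]
gridftp_copy_cmd() {
    src=$1
    dst=$2
    streams=${3:-64}   # same default as the example above
    printf '%s %s %s\n' \
        "globus-url-copy -r -cd -rst -rst-retries 0 -fast -vb" \
        "-p $streams -stripe -tcp-bs 4M -g2 -sync -sync-level 3" \
        "$src $dst"
}

gridftp_copy_cmd \
    gsiftp://gridftp.intrepid.alcf.anl.gov/intrepid-fs0/users/matthb/scratch/ \
    sshftp://matthb2@jumpgate-phasta.colorado.edu/scratch/matthb2/scratch_from_anl/
```

Review the printed line and then paste it into your shell. Remember that a trailing slash on both URLs is what the example above uses for directory copies.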

GridFTP Tuning

Depending on the network conditions, GridFTP can perform very well or very poorly. You can often improve things by changing a few parameters:

 -p 64

Specifies that you want to use 64 parallel streams. Because TCP can only keep a limited amount of unacknowledged data in flight and must receive acknowledgments before sending more, latency can drastically reduce performance over long distances. By adding more streams you increase the amount of in-transit data and partly hide the effects of latency. If performance is poor, try increasing this number until you stop seeing improvement. Excessive numbers of streams won't help and will increase the load on the servers involved; you shouldn't need more than one or two doublings of the value recommended here.
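To get a feel for the numbers, here is a back-of-the-envelope estimate based on the bandwidth-delay product: the link speed times the round-trip time is the amount of data that must be in flight to keep the link busy. The sketch assumes each stream keeps at most one TCP buffer in flight and that "4M" means 4 MiB; the link speed and round-trip time are made-up example values, and the function name is hypothetical.

```shell
# Rough estimate only; it ignores packet loss and congestion control.
# Usage: streams_needed LINK_MBIT_PER_S RTT_MS BUFFER_BYTES
streams_needed() {
    link_mbit=$1
    rtt_ms=$2
    buf_bytes=$3
    # Bytes that must be in flight to fill the link for one round trip.
    bdp_bytes=$(( link_mbit * 1000000 / 8 * rtt_ms / 1000 ))
    # Number of buffers needed to cover that, rounded up.
    echo $(( (bdp_bytes + buf_bytes - 1) / buf_bytes ))
}

# Example values: 10 Gbit/s link, 60 ms round trip, 4 MiB buffer per stream.
streams_needed 10000 60 $((4 * 1024 * 1024))   # prints 18
```

In practice the window a stream actually achieves is often well below the requested buffer size (system limits, packet loss), which is one reason a larger stream count can still help.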

 -pp

Enables pipelining. This can help with large numbers of files (particularly relatively small ones), but can also be buggy. If you're transferring lots of files, try enabling it; if your transfers hang, try disabling it.

 -cc 4

Split the transfer over several (4 in this case) concurrent GridFTP sessions. This can also help with large numbers of small files. Use sparingly, and disable if you experience stalled/crashed transfers.

 -sync-level 3

If you're using "sync" mode (similar to rsync), this determines the algorithm used to decide which files need to be (re)transferred. Level 3 uses checksums and will be the most reliable, but also the slowest. Level 2 is probably fine for most uses. See the globus-url-copy documentation for all the options.
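To illustrate what the levels trade off, the sketch below imitates the decision for two local files. It is not how globus-url-copy is implemented. It assumes the level meanings as I understand them from the globus-url-copy help text (0: destination missing, 1: size differs, 2: destination older than source, 3: checksum differs) and assumes each level also applies the cheaper checks below it. Check `globus-url-copy -help` for the authoritative wording.

```shell
# Hypothetical local emulation of the sync decision.
# Usage: needs_transfer LEVEL SRC DST   (exit status 0 means "transfer")
needs_transfer() {
    level=$1
    src=$2
    dst=$3
    [ -e "$dst" ] || return 0                 # all levels: destination missing
    [ "$level" -ge 1 ] || return 1
    [ "$(wc -c < "$src")" -eq "$(wc -c < "$dst")" ] || return 0   # size differs
    [ "$level" -ge 2 ] || return 1
    if [ "$level" -eq 2 ]; then
        [ "$src" -nt "$dst" ]                 # cheap: compare timestamps
        return
    fi
    # Level 3: reads both files completely, which is reliable but slow.
    [ "$(cksum < "$src")" != "$(cksum < "$dst")" ]
}

tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/a"
if needs_transfer 3 "$tmp/a" "$tmp/b"; then echo "transfer"; else echo "skip"; fi   # prints transfer
rm -rf "$tmp"
```

The cost difference is visible in the code: levels 0 to 2 only look at metadata, while level 3 has to read every byte on both ends.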