Ganga is no longer the suggested tool for Pheno job submission.
There are multiple grid middleware distributions available. Ganga is a package, implemented in Python, that provides an abstraction of the middleware layer: no matter which processing system (backend) is used, job submission proceeds in the same way. For most users, the real benefit is that Ganga makes job management easy.
Like Python itself, Ganga can be used either interactively via a shell or with input script files. Ganga's interactive IPython shell opens when you launch Ganga without an input file.
/cvmfs/ganga.cern.ch/Ganga/install/6.1.1-hotfix1/bin/ganga
When run for the first time, Ganga creates a file containing the default settings, stored in "~/.gangarc". Exit Ganga (type Ctrl-D), open this file with a text editor and change the value of "VirtualOrganisation" to "pheno".
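As a rough guide, the relevant entry in "~/.gangarc" then looks something like the following (the exact section name can vary between Ganga versions; in Ganga 6 the option typically sits in the [LCG] block):
[LCG]
# Virtual organisation used for grid job submission
VirtualOrganisation = pheno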
Start Ganga again by typing
/cvmfs/ganga.cern.ch/Ganga/install/6.1.1-hotfix1/bin/ganga
From now on, commands launched in the Ganga shell are differentiated from the grid UI shell commands with a blue background.
To run a "Hello World!" job, execute the following commands in the Ganga shell
job=Job()
job.application = Executable(exe=File("/bin/echo"))
job.application.args = "Hello World!"
job.submit()
You can get a list of all your jobs with
jobs
Without defining a backend, the job runs locally without using the grid. Once the job finishes, all the output is stored in a directory called "gangadir", whose location is defined in "~/.gangarc". The output files are pulled back automatically as long as the Ganga shell is running. You can take a look into the directory by using the 'peek' function. Here we assume that the number of this job is 0.
jobs(0).peek()
There are always two files, stdout and stderr (or stdout.gz and stderr.gz), containing the standard output and standard error streams from the execution of the job. The 'peek' command can also be used to view an individual file.
jobs(0).peek("stdout")
If everything went smoothly, this file should contain "Hello World!" and the error file should be empty.
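Besides 'peek', a few other standard job methods are useful for day-to-day management in the Ganga shell. The snippet below is a minimal illustration, again assuming the job has number 0:
jobs(0).status      # current state, e.g. 'new', 'submitted', 'running', 'completed' or 'failed'
jobs(0).kill()      # cancel a submitted or running job
jobs(0).resubmit()  # resubmit a failed job
jobs(0).remove()    # delete the job and its output from the Ganga repository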
The only change we need to make in order to run on the grid is to define a backend. For the ARC backend, the CE also needs to be defined here, unless it is set in "~/.gangarc" or in "~/.arc/client.conf". If you define multiple CEs in the latter, the ARC broker decides where to submit according to its default brokering algorithm.
job=Job()
job.application = Executable(exe=File("/bin/echo"))
job.application.args = "Hello World!"
job.backend = "ARC"
job.backend.CE = "ce1.dur.scotgrid.ac.uk"
job.submit()
Mind the time delay.
As mentioned in the ARC section, it can take time before the job enters the information system.
On rare occasions, the job information transfer to Ganga does not work. If you get several warnings like "cannot find a job for job id None" or something similar, you need to update the Ganga ARC job database manually:
arcsync -j ~/.arc/gangajobs.xml -c nameOfCE
What makes Ganga very useful is its ability to split a job into several subjobs. Here we use the default ArgSplitter, which splits a job according to a given list of arguments.
job=Job()
job.application = Executable(exe=File("/bin/echo"))
args = [["Hello Durham"],["Hello UK"],["Hello Europe"],["Hello World"]]
splitter=ArgSplitter(args=args)
job.splitter=splitter
job.backend = "ARC"
job.backend.CE = "ce1.dur.scotgrid.ac.uk"
job.submit()
You can get the list of all your subjobs with
jobs(1).subjobs
and see the output files of the first subjob with
jobs(1).subjobs(0).peek()
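Since the subjobs behave like ordinary jobs, you can also loop over them in the Ganga shell, for example to check their status. A minimal sketch, assuming as above that the split job has number 1 (Ganga 6 uses Python 2 syntax):
for sj in jobs(1).subjobs:
    print sj.id, sj.status   # print each subjob's index and current state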
Using the submission script below (ganga-submit.py), we can submit the simple C++ program in the same way as before in the ARC section. Here the submission script, the C++ program and the input file are expected to be in a directory called "/mt/home/username/GridTutorial/".
#!/usr/bin/env ganga
import sys
import os

# Create the job
j0 = Job()
j0.name = 'TestJob'
j0.application = Executable(exe=File("simple"))
j0.application.args = "input.txt"
j0.backend = "ARC"
j0.backend.CE = 'ce1.dur.scotgrid.ac.uk'

## Add input files.
j0.inputfiles = [LocalFile("input.txt","/mt/home/username/GridTutorial/"),]

# Declare the output files
j0.outputfiles = [LocalFile('output.txt')]

# Finally submit...
print "Submitting... be patient"
j0.submit()
print 'Job', j0.fqid, 'submitted'
The second argument of "LocalFile" is optional if the file is in the same directory where you execute the submission script. To process the file, execute
/cvmfs/ganga.cern.ch/Ganga/install/6.1.1-hotfix1/bin/ganga ganga-submit.py
Then you can launch the Ganga shell and the job monitoring proceeds as before.
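Once the job has completed and its output has been pulled back, the declared output file can be inspected with 'peek' as before. For illustration, assuming this job was given number 2:
jobs(2).peek("output.txt")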
A list of computing sites that will accept your ARC jobs can be obtained with the programme lcg-infosites, asking it to list all "computing elements" for pheno (this requires you to have already made a proxy certificate with arcproxy or voms-proxy-init):
lcg-infosites --vo pheno CE
This will return a long list of UK sites. Submitting to RAL is good (they have very many computers), and so is e.g. Glasgow. To submit to RAL, simply replace the line in the Python job submission script that sets the CE with
j0.backend.CE='arc-ce01.gridpp.rl.ac.uk'
or, for submission to Glasgow,
j0.backend.CE='svr009.gla.scotgrid.ac.uk'
It is possible to have Ganga pick a compute element at random from a list (to spread the load over many clusters). There is also an even smarter way of using a central service that submits jobs to the site where they will start the quickest (see the section on Dirac).
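As a minimal sketch of the random-choice approach, the CE line in the submission script could be replaced by something like the following (the list of hostnames is just an illustrative selection; build your own from the lcg-infosites output):
import random

# Candidate CEs that accept pheno jobs (illustrative selection)
ce_list = ['ce1.dur.scotgrid.ac.uk',
           'arc-ce01.gridpp.rl.ac.uk',
           'svr009.gla.scotgrid.ac.uk']

j0.backend = "ARC"
j0.backend.CE = random.choice(ce_list)   # spread the load across the listed clusters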