GRID UK Work Program

Nick West
Last modified: Wed Aug 15 11:02:02 BST 2007

This is a management document whose sole function is to support the migration of the UK MINOS groups from the RAL csf batch farm to the LHC GRID.
It is not a source of GRID documentation; for that see MINOS on the GRID.

On March 19th we had a meeting and made some decisions.


Objectives

The primary objective of this work program is the migration of the UK MINOS groups from the RAL csf batch farm to the LHC GRID using resources at the RAL T1 and T2. A secondary objective is to broaden the context to include other sites in the UK and US and also groups from the US, at least to the extent of not doing anything gratuitously that would make this development any harder than it has to be.


Project Status

The overall status of the project is shown by this colour coded table.

Task                   Phase              Action Items
Overview               Analysis           Overview Scope
                                          Overview Use Cases
Software Installation  Analysis           Scope
                       Technology Choice  Software
                                          Database
                                          Flux files
                       Customisation      Customisation
                       Implementation     minossoft
                                          labyrinth
                                          Flux files
Job Submission         Analysis           Scope
                       Technology Choice  Introduction
                                          Understanding Scheduling Policies
                                          Studying Existing Practices
                       Customisation      Customisation
                       Implementation     Implementation
Data Access            Analysis           Scope
                                          Logical API
                       Technology Choice  Front-End
                                          Back-End
                       Customisation      Front-End
                                          Back-End
                       Implementation     Front-End
                                          Back-End
Globalisation          Analysis           Scope
                       Technology Choice  Choice
                       Customisation      n/a
                       Implementation     n/a
Documentation          Analysis           Scope
                       Technology Choice  Choice
                       Customisation      n/a
                       Implementation     Implementation


Definitions

Tasks

The work program is divided into the following tasks.

1 Overview

This is the top level task that sets the overall context and records Use Cases. Other subordinate tasks are identified here.

2 Software Installation

This deals with the installation of our software, both Fortran and C++, together with anything it requires, e.g. flux files, databases etc.

3 Job Submission

This section deals with all aspects of job submission and subsequent data retrieval, from the humble one-off job to a full-scale multi-stage production with automated resubmission and a summary of status.

4 Data Access

This section deals with all aspects of data I/O from a batch worker node (WN) to its local disk and Storage Element (SE), to wide area transfers between SEs and in particular between RAL and FNAL.

5 Globalisation

This section deals with the broadening of context from computing on the LHC GRID using resources at RAL T1 and T2 to other sites and in particular FNAL and also to the US groups.

6 Documentation

.. because no project is complete until it is documented!


Phase

Each Task is divided into the following, possibly overlapping, phases.

1 Analysis

Determine the scope of the task and decide in a more or less technology neutral way, what we require. This needs to be set in the context of our primary objective: using in the region of:-
    * CPU:  50 kSpecInt2k
    * Disk: 10 TB 
    * Tape: 10 TB 
resources at RAL. Some retrenchment of our requirements may be inevitable given the technologies we have available and the resources we have to implement decisions, but that should be a secondary consideration.

2 Technology Choice

This is the early design phase. We are not starting at ground zero; there is a lot of software out there and a lot of experience using it and we should exploit both!

3 Customisation

This is the later design phase. Having decided on the technology we have to customise it and fill in the gaps so that we can use it.

4 Implementation

Having completed the design, solutions have to be implemented.

Note that in the case of Overview there will only be an Analysis phase. It should generate further tasks to be taken through the remaining phases.


Action Item

The work program is organised into a series of action items including such activities as:-

Once the item is complete there will be a record of it. For example in the cases above:-


Colour Coding

In order to get an instant overview of the project status, tasks and action items are colour coded as follows:-

Colour  Meaning
red     Behind schedule, needs further effort.
orange  Actively being worked on, more or less on schedule.
green   Ahead of schedule, no need for further effort now.


Action Items


Action Item: Overview Scope

The primary scope of this project is to leap the fast-approaching hurdle: GRID-only access to the RAL T1 resources by the end of August 2007. Even assuming there were an appetite to do this in a "GRID-pure" way, we have to be pragmatic and focus on this deadline to minimise the risk that we won't be ready in time. On the other hand we need to keep one eye on Globalisation and balance long-term gains against short-term costs.

In a nutshell what we require is to be able to continue to run the same mix of jobs as before with as little extra effort as possible.


Action Item: Overview Use Cases

Mike has supplied the following Use Cases.

Created with: latex2html -split 0 use_cases.tex


Action Item: Software Installation Scope

We have to be able to install the following:-

We do need to support Snapshot Releases but not development flip-flop.


Action Item: Software Installation Technology Choice: Software

For the installation of our software we will use RSD: Remote Software Deployment. It can be used for minossoft, labyrinth and the supporting libraries.


Action Item: Software Installation Technology Choice: Database

Within RAL T1 and T2 we can connect to the sql.gridpp.rl.ac.uk server and, as at least the software admins will be allowed to log into the cluster head node, we can run DBMauto to keep it up to date.

On the other hand there is Globalisation to consider. Within the LHC GRID I think the standard operating system is based on SLx, which includes MySQL, but I don't know if that's the case with the wider GRID or whether logging in to maintain it would be permitted, which is why this section is still orange.


Action Item: Software Installation Technology Choice: Flux files

I had hoped that RSD: Remote Software Deployment could be used but it won't work: in general there isn't sufficient space in the software area (which is shared with other groups) to have ~ 50GB of flux files.

Most likely we will store them like other data and then this just becomes another aspect of Data Access, albeit that the way the data is requested will be different.


Action Item: Software Installation Implementation: minossoft

The system RSD is up and running. Each new release has to be packaged up but that should be an almost automatic process.

The support libraries for NeugenInterface, i.e. Neugen3, Pythia and Lund, have an RSD wrapper.

RSD supports farms with heterogeneous platforms and prevents binaries being incorrectly shared by encoding the platform name and building it into the directory structure. Currently this encoding is only set up for SLx but should be easy to extend if needed.

Although the GRID is not the right place for program development, there is still a case for running software that isn't ready, or may never be destined, to be part of a Frozen Release. This means that we need to be able to install and run a Test Release.
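
To illustrate the idea of building the platform name into the directory structure, here is a minimal Python sketch. It is not the RSD code: the tag format, the function names, the software root and the release name are all assumptions made up for the example.

```python
import platform
from pathlib import PurePosixPath


def platform_tag(system: str, machine: str, os_flavour: str) -> str:
    """Join the three components into one tag, e.g. 'Linux-i686-SL4'."""
    parts = [system, machine, os_flavour]
    # Replace characters that would be awkward inside a directory name.
    return "-".join(p.strip().replace(" ", "_").replace("/", "_") for p in parts)


def install_dir(software_root: str, package: str, release: str, tag: str) -> PurePosixPath:
    """Return <root>/<package>/<release>/<tag>, one directory per platform."""
    return PurePosixPath(software_root) / package / release / tag


def local_platform_tag(os_flavour: str) -> str:
    """Tag for the machine this runs on; the caller supplies the OS flavour."""
    return platform_tag(platform.system(), platform.machine(), os_flavour)


# Hypothetical root and release name, used only to show the layout.
sl4 = install_dir("/grid/sw", "minossoft", "R1.0", platform_tag("Linux", "i686", "SL4"))
sl5 = install_dir("/grid/sw", "minossoft", "R1.0", platform_tag("Linux", "x86_64", "SL5"))
print(sl4)
print(sl5)
```

Because the tag is part of the path, two worker nodes with different platforms resolve to different directories and cannot pick up each other's binaries by accident.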


Action Item: Software Installation Implementation: labyrinth

RSD supports DAIKON_VERSIONS and DAIKON_SCRIPTS which are Robert's solution to packaging the labyrinth.


Action Item: Software Installation Implementation: Flux files

Flux files are too big to store on a software disk so have to be pulled, as required, from an SE using DCM, which has been extended so that it can preserve the correct directory structure. The script copy_flux_files_lcg.sh uses DCM to load the correct flux files for a given beam and has a resource locking mechanism to avoid overloading the LAN.
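
One common way to build a resource lock of this kind is to let only a fixed number of jobs copy at once, with each copier claiming a "slot" by creating a lock directory. The Python sketch below shows that general technique only. It is an assumption about how such a lock could work, not a description of what copy_flux_files_lcg.sh actually does, and the names copy_slot and slotN.lock are invented.

```python
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def copy_slot(lock_root, max_slots, poll_seconds=5.0, timeout_seconds=3600.0):
    """Hold one of max_slots lock directories while the body runs.

    A job that dies without cleaning up leaves a stale lock behind;
    a real script would also need to detect and clear those.
    """
    os.makedirs(lock_root, exist_ok=True)
    deadline = time.monotonic() + timeout_seconds
    while True:
        for slot in range(max_slots):
            path = os.path.join(lock_root, f"slot{slot}.lock")
            try:
                os.mkdir(path)  # fails if another job already holds this slot
            except FileExistsError:
                continue
            try:
                yield path  # the caller does its copying here
            finally:
                os.rmdir(path)  # release the slot even if the copy failed
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("no free copy slot")
        time.sleep(poll_seconds)


# Demonstration in a scratch directory: two slots, both taken, then both released.
lock_root = tempfile.mkdtemp()
with copy_slot(lock_root, max_slots=2):
    with copy_slot(lock_root, max_slots=2):
        held = sorted(os.listdir(lock_root))
released = os.listdir(lock_root)
print(held, released)
```

The lock directory would have to live on a file system visible to every job that shares the same network path, which is a site-specific detail not covered here.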


Action Item: Job Submission Scope

We need to cover:-

Mike would like to do something even more ambitious:

I want to run some kind of mega-job which generates, say, 10 detector and 10 rock singles files in parallel (e.g., 20 cpus) and then overlays the output once all are done. Each job takes typically 6-15 hours and requires large input files (total 12G) which are copied to the execution host to minimise LAN I/O as they are read several times. Have to be careful that multiple jobs starting synchronously don't overload the LAN. The output files are 300-500 Mb/job (compressed). Ideally the mega-job would be able to react if one of the singles files crashed, probably by ordering another singles file run. This kind of operation is not straightforward in pbs, though maybe not impossible, at least in RAL's version. The tricky bit is coping with crashes.
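
The control flow Mike describes (fan out, re-order any run that crashes, overlay only when everything is done) can be sketched in a few lines of Python. This is only an illustration of the logic: local threads stand in for GRID jobs, and run_with_resubmission, fake_singles_job and the file names are all invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_resubmission(tasks, run_one, max_attempts=3, workers=20):
    """Run every task, resubmitting failures, and return {task: result}.

    Raises RuntimeError if a task still fails after max_attempts tries.
    """
    results = {}
    attempts = {task: 0 for task in tasks}
    pending = list(tasks)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            futures = {}
            for task in pending:
                attempts[task] += 1
                futures[pool.submit(run_one, task)] = task
            pending = []
            for future in as_completed(futures):
                task = futures[future]
                try:
                    results[task] = future.result()
                except Exception:
                    if attempts[task] >= max_attempts:
                        raise RuntimeError(f"{task} failed {max_attempts} times")
                    pending.append(task)  # order a replacement run
    return results


crashed_once = set()


def fake_singles_job(name):
    # Stand-in for a real job: 'rock3' crashes on its first attempt only.
    if name == "rock3" and name not in crashed_once:
        crashed_once.add(name)
        raise RuntimeError("simulated crash")
    return f"{name}.root"


names = [f"detector{i}" for i in range(10)] + [f"rock{i}" for i in range(10)]
outputs = run_with_resubmission(names, fake_singles_job)
overlay_inputs = sorted(outputs.values())  # the overlay step would start here
print(len(overlay_inputs))
```

On the GRID the hard parts are the ones this sketch hides: detecting that a remote job has really crashed, and retrieving its output.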


Action Item: Job Submission Technology Choice

So far I only know of the very basic job submission tools edg-job-* and globus-job-run, which I have supplemented with RSD job * commands (see Tutorial: Submitting jobs). I have separated this item into a further two: Understanding Scheduling Policies and Studying Existing Practices.


Action Item: Job Submission Technology Choice: Understanding Scheduling Policies

JDL (Job Definition Language) allows the user to target a CE (Computing Element) satisfying certain conditions including load levels and estimated time from entering the queue to execution. However, the GRID isn't a free-for-all and ultimately the resources used have to fall into line with those that are allocated. We need to understand how this balance is achieved. It may be that, while we stay on RAL T1, a fair-shares scheme operates as now, but if we do expand to other CEs we need to understand how this works in case there is a danger that either we fail to get our share or that we use it too rapidly and run short towards the end of a quota period.
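
The worry about using a share too quickly or too slowly comes down to simple arithmetic, shown here as a small Python sketch. It assumes a straight-line projection and a made-up tolerance of 10%. How the sites actually account for usage is exactly what this action item has yet to establish, so the numbers and units are purely illustrative.

```python
def projected_usage(used, days_elapsed, period_days):
    """Straight-line projection of total usage at the end of the quota period."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return used * period_days / days_elapsed


def quota_status(allocation, used, days_elapsed, period_days, tolerance=0.1):
    """Classify the current burn rate as 'under', 'on-track' or 'over'."""
    ratio = projected_usage(used, days_elapsed, period_days) / allocation
    if ratio > 1.0 + tolerance:
        return "over"      # we would run short before the period ends
    if ratio < 1.0 - tolerance:
        return "under"     # we are failing to take up our share
    return "on-track"


# Hypothetical allocation of 9000 CPU-hours over a 90 day period, checked on day 30.
for used in (1500, 3000, 4500):
    print(used, quota_status(9000, used, days_elapsed=30, period_days=90))
```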


Action Item: Job Submission Technology Choice: Studying Existing Practices

Several groups, e.g. BaBar and some LHC experiments, already do production work on the GRID so we need to pick their brains. I have contacted CMS (Dave Newbold), ATLAS (Roger Jones) and LHCb (Raja Nandakumar):-

CMS BOSS

BOSS might be a possibility, if our workload is sufficiently large to warrant the overhead. The latest User Guide is over a year out of date (there have been a number of releases since then) but perhaps at least gives a feel for the system. My first impressions are:-

Pros

Cons

According to Gutsche Oliver of CMS, BOSS, although good, is "very complicated" and "still not rock-stable". David Colling agrees that if what we want is basically a replacement batch system we should layer scripts on the basic job submission tools.

For the medium to long term BOSS, at the very least, is an interesting model to study.

CMS Crab

LCG Interoperability

Uses BOSS and adapts it to the CMS environment and forms a layer beneath which jobs can run on both OSG and LCG/EGEE GRIDs. David Colling: CRAB is CMS specific, would not be suitable for MINOS.

LHCb DIRAC

Andrei Tsaregorodtsev (LHCb) suggested we look at DIRAC as it is a generic system. I have found:-
 DIRAC Distributed Infrastructure with Remote Agent Control
Abstract of that paper:-

"This paper describes DIRAC, the LHCb Monte Carlo production system. DIRAC has a client/server architecture based on: Compute elements distributed among the collaborating institutes; Databases for production management, bookkeeping (the metadata catalogue) and software configuration; Monitoring and cataloguing services for updating and accessing the databases. Locally installed software agents implemented in Python monitor the local batch queue, interrogate the production database for any outstanding production requests using the XML-RPC protocol and initiate the job submission. The agent checks and, if necessary, installs any required software automatically. After the job has processed the events, the agent transfers the output data and updates the metadata catalogue. DIRAC has been successfully installed at 18 collaborating institutes, including the DataGRID, and has been used in recent Physics Data Challenges. In the near to medium term future we must use a mixed environment with different types of grid middleware or no middleware. We describe how this flexibility has been achieved and how ubiquitously available grid middleware would improve DIRAC."

First impression is that this is too heavyweight for us. It incorporates software deployment of application software based on the GAUDI framework and managed with CMT. We already have RSD. Also its bookkeeping replicates SAM.

Atlas/LHCb GANGA

Early on I found little information on this and was inclined to dismiss it. However, I have now taken a closer look and it appears almost perfectly suited to our needs, so I have chosen it as our Job Submission system.

FNAL CONDOR_G

The FNAL batch farm also uses GRID tools so I have spoken with Howie. I have looked for examples of CONDOR_G running on LCG and have found:-

A Grid of Grids using Condor-G

which is the "mainstay of ATLAS production on LCG". However, I have also found, in a talk in October 2006:

EGEE GRID: Also CondorG submission is possible Requires some expertise and has no support from the service provider

Instead the recommended way to submit to EGEE is via GANGA.

CMS appears to take the view that the two GRIDs, OSG and LCG/EGEE, would not harmonise rapidly enough for a single interface to emerge in time for LHC, so instead it provided a more generic layer that could talk to both GRIDs underneath and only uses CONDOR_G for OSG.

I have also spoken with Gutsche Oliver (CMS). OSG uses direct submission with CONDOR_G and uses its input/output sandbox handling, while on EGEE jobs are submitted to the RB, which handles the sandbox, while CONDOR_G deals with job submission with I/O directed to the RB's SE. To me this sounds like we have to understand the workings of CONDOR_G and RBs quite well and I know neither! CONDOR_G isn't going to be a quick fix solution.

Conclusion

We will use Ganga.


Action Item: Job Submission Customisation

Ganga can be used out-of-the-box but to make it as easy and reliable as possible for production work we need to develop a layer on top to provide a way to automatically monitor a set of jobs and respond to changes, for example:-

So the customisation itself is not an issue; it's now down to Implementation.


Action Item: Job Submission Implementation

In order to provide a batch system on top of Ganga, Ganga-based Batch Submission (GBS) is being developed.


Action Item: Data Access Scope

Data Access covers:-

We need to decouple the job production system from details of data access as far as possible, so I have separated out Data Access Logical API first as a way to analyse what we require and then, hopefully, develop a system to satisfy it that job production can use.


Action Item: Data Access Logical API

A logical interface has to specify the following, either explicitly or implicitly:-

A good interface should be able to supply defaults for all but the first two. For example, right now DCM can be passed a file name, which it uses to determine the Remote Location, copy the file over if required and return a soft link as the Local Representation. However, we do need to control each of these independently. For example


Action Item: Data Access Technology Choice Front-End

Having a system that implements our Data Access Logical API offers advantages both as we migrate to the GRID and subsequently if the technology changes. Currently DCM is the system we use and it has been agreed that we develop it to satisfy the API.
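
The behaviour described under the Logical API (take a file name, locate the remote copy, copy it over only if required, hand back a soft link) can be sketched as below. This is not DCM and makes no claim about how DCM is written: the function name fetch, its arguments and the use of an ordinary local directory as a stand-in for the remote store are all assumptions for the sake of the example.

```python
import os
import shutil
import tempfile


def fetch(file_name, remote_root, cache_dir, work_dir):
    """Return a soft link in work_dir that points at a cached copy of file_name.

    file_name is assumed to be a plain base name with no directory part.
    """
    os.makedirs(cache_dir, exist_ok=True)
    os.makedirs(work_dir, exist_ok=True)
    cached = os.path.join(cache_dir, file_name)
    if not os.path.exists(cached):
        # Copy only when the file is not already held locally.
        shutil.copy(os.path.join(remote_root, file_name), cached)
    link = os.path.join(work_dir, file_name)
    if not os.path.islink(link):
        os.symlink(cached, link)
    return link


# Demonstration in a scratch area; 'se' plays the part of the remote store.
base = tempfile.mkdtemp()
remote = os.path.join(base, "se")
os.makedirs(remote)
with open(os.path.join(remote, "example.root"), "w") as handle:
    handle.write("payload")

link = fetch("example.root", remote, os.path.join(base, "cache"), os.path.join(base, "job"))
with open(link) as handle:
    print(os.path.islink(link), handle.read())
```

The point of the design is that the job only ever sees the returned link, so the copy step behind it can change (different protocol, different site) without the job noticing.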


Action Item: Data Access Technology Choice Back-End

There is plenty of choice of Storage Element Protocols. We still have to learn which the best ones are. In the meantime, the following should be considered.


Action Item: Data Access Customisation Front-End

We will use DCM so there isn't anything that needs to be customised.


Action Item: Data Access Customisation Back-End

For the short to medium term we will use rfio for CASTOR and dccp (to copy) and posix (to list) for dCache.
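
One way to keep this choice contained is a small dispatch table that maps the SE type to the command to run, so that only the table changes if the protocols change. The Python sketch below only builds the command lines and does not run them. The table layout, the function names and the example paths are assumptions; rfcp and rfdir are the RFIO copy and listing clients as I understand them, and the posix listing assumes the dCache namespace is mounted on the node.

```python
COPY_TOOLS = {
    "castor": "rfcp",  # RFIO copy client
    "dcache": "dccp",  # dCache copy client
}

LIST_TOOLS = {
    "castor": ["rfdir"],     # RFIO directory listing (assumed to be available)
    "dcache": ["ls", "-l"],  # plain posix listing of the mounted namespace
}


def copy_command(se_type, source, destination):
    """Return the argument list that copies one file for the given SE type."""
    try:
        tool = COPY_TOOLS[se_type]
    except KeyError:
        raise ValueError(f"no back-end configured for SE type {se_type!r}") from None
    return [tool, source, destination]


def list_command(se_type, directory):
    """Return the argument list that lists one directory for the given SE type."""
    try:
        tool = LIST_TOOLS[se_type]
    except KeyError:
        raise ValueError(f"no back-end configured for SE type {se_type!r}") from None
    return tool + [directory]


# Hypothetical paths, shown only to illustrate the shape of the commands.
print(copy_command("castor", "/castor/example/minos/f.root", "f.root"))
print(copy_command("dcache", "/pnfs/example/minos/f.root", "f.root"))
print(list_command("dcache", "/pnfs/example/minos"))
```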


Action Item: Data Access Implementation Front-End

On the assumption that we will use DCM there are a number of short term items that need addressing. Some are already complete.

In the longer term the DCM mapping from file name to a URL needs to work at sites away from RAL. Hopefully LFC will mature enough by then.


Action Item: Data Access Implementation Back-End

We have chosen low-level, technology-specific CASTOR and dCache protocols for the short term. This will need to be revised in the longer term when we need to work at sites away from RAL.


Action Item: Globalisation Scope

In a perfect world we could easily set up a system that runs anywhere in the world. In the real world we have a GRID of overlapping, incompatible technologies and a deadline by which we need to be running at RAL. So we have to make that our primary focus, but we shouldn't gratuitously make choices that make it harder to globalise, so we need to at least consider what would work at FNAL.

I will divide the move towards globalisation into 3 time periods:-

  1. Short term (next 6 months)
    In this time frame we want to be running on the RAL T1 with a single SE (CASTOR).

  2. Medium term (6 months - 18 months)
    It has been suggested that as LHC comes on stream there is going to be a great deal of pressure on RAL T1 and we may well do better by moving off onto local T2s at RAL, Oxford and the London Universities (who have an MOU that requires them ALL to provide resources to any experiment that ANY of them run). This doesn't impact much on our technology choice but will complicate our storage model as we have to handle shipping data to and from RAL.

  3. Long term (>18 months)
    Here the ultimate goal is to allow all of MINOS to have a common interface to both EGEE and OSG GRIDS. Well, we can dream can't we?


Action Item: Globalisation Technology Choice

From what I have learned so far (see Job Submission Technology Choice) the two GRIDs haven't harmonised to the point where this can affect our choice in the short term, although Ganga does have a Condor back-end so it might work on Fermigrid.


Action Item: Documentation Scope

As it explains in MINOS on the GRID: Intended Audience

"This web has been written not because there is insufficient information already on the Internet, but perversely because there is far too much. The problem is that the GRID isn't a single complete consistent whole but rather multiple overlapping, conflicting, evolving systems and we have to select from the myriad of available documents just those that are strictly relevant to get our job done."

So the scope of the document is global: to cover every aspect of the GRID that is necessary to get the job done. However, it can and should point to material on the web wherever appropriate.


Action Item: Documentation Technology Choice

Our standard documentation forms are LaTeX and HTML but it's no real contest for a constantly changing document in the computing domain: it has to be HTML.


Action Item: Documentation Implementation

This action item will be used to list known weaknesses in MINOS on the GRID that need to be addressed.

Also, although the documentation describes how to use the lcg-infosites, lcg-info and ldapsearch commands to collect information about the GRID, these commands are not particularly friendly, so I propose writing a utility GIQ (GRID Information Query) that will be able to run all the standard queries. The reason I have placed this development here is that it will include an option to display the commands it has executed to collect the information, and so acts as a teaching aid to help users when they need to go beyond the standard query set.
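
To make the "display the commands it has executed" idea concrete, here is a minimal Python sketch of how such an option could work. GIQ does not exist yet, so everything here is an assumption: the query names, the function names and the placeholder VO name "example.vo" are invented, and only two lcg-infosites queries are included.

```python
import shlex
import subprocess

STANDARD_QUERIES = {
    # query name -> command template; {vo} is filled in when the query is run
    "computing-elements": ["lcg-infosites", "--vo", "{vo}", "ce"],
    "storage-elements": ["lcg-infosites", "--vo", "{vo}", "se"],
}


def build_command(query, vo):
    """Return the argument list for one of the standard queries."""
    return [part.format(vo=vo) for part in STANDARD_QUERIES[query]]


def real_runner(argv):
    """Run the command for real and return its standard output."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


def run_query(query, vo, show_commands=False, runner=real_runner, echo=print):
    """Run a standard query, optionally echoing the underlying command first."""
    argv = build_command(query, vo)
    if show_commands:
        echo("+ " + shlex.join(argv))  # the teaching aid: show what was executed
    return runner(argv)


# Demonstration without a GRID: a fake runner replaces the real command.
shown = []
output = run_query("computing-elements", "example.vo", show_commands=True,
                   runner=lambda argv: "(no grid available here)", echo=shown.append)
print(shown[0])
```

Passing the runner in as an argument keeps the sketch usable on a machine with no GRID tools installed, which is also how it could be tested.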