GRID UK Work Program
Nick West
Last modified: Wed Aug 15 11:02:02 BST 2007
This is a management document whose sole function is to support
the migration of the UK MINOS groups from the RAL csf batch farm to
the LHC GRID.
It is not a source of GRID documentation, for that see
MINOS on the GRID
On March 19th we had a meeting
and made some decisions.
Contents
The primary objective of this work program is the migration of the UK
MINOS groups from the RAL csf batch farm to the LHC GRID using
resources at the RAL T1 and T2. A secondary objective is to broaden
the context to include other sites in the UK and US and also groups
from the US, at least to the extent of not doing anything gratuitously
that would make this development any harder than it has to be.
The overall status of the project is shown by this
colour coded
table.
The work program is divided into the follow tasks.
This is the top level task that sets the overall context and records
Use Cases. Other subordinate tasks are identified here.
This deals with the installation of our software, both Fortran and C++
together with and e.g. flux files, databases etc that they require.
This section deal with all aspects of job submission and subsequent
data retrieval, from the humble simple one-off job to a full scale
multi-stage production with automated resubmission and summary of
status.
This section deals with all aspects of data I/O from a batch worker
node (WN) to its local disk and Storage Element (SE), to wide area
transfers between SEs and in particular between RAL and FNAL.
This section deals with the broadening of context from computing on the LHC GRID
using resources at RAL T1 and T2 to other sites and in particular
FNAL and also to the US groups.
.. because no project is complete until it is documented!
Each
Task
is divided into the following, possibly overlapping, phases.
Determine the scope of the task and decide in a more or less
technology neutral way, what we require. This needs to be set in
the context of our primary objective: using in the region of:-
* CPU: 50 kSpecInt2k
* Disk: 10 TB
* Tape: 10 TB
resources at RAL. Some retrenchment of our requirements may be
inevitable given the technologies we have available and the resources
we have to implement decisions, but that should be a secondary
consideration.
This is the early design phase. We are not starting at ground zero;
there is a lot of software out there and a lot of experience using it
and we should exploit both!
This is the later design phase. Having decided on the technology we have
to customise it and fill in the gaps so that we can use it.
Having completed the design, solutions have to be implemented.
Note that in the case of
Overview
there will only be an Analysis phase. It should generate further
tasks to be taken through the remaining phases.
The work program is organised into a series of action items including
such activities as:-
- Gathering information
- Making a decision
- Writing software
Once the item is complete there will be a record of it. For example
in the cases above:-
- A summary of the information gathered
- The decision with supporting reasons
- A summary of the software.
In order to get instant overview of the project status,
tasks
and
action items
are colour coded as follows:-
| Colour | Meaning
|
|---|
| red
| Behind schedule, needs further effort.
|
| orange
| Actively being worked on, more or less on schedule.
|
| green
| Ahead of schedule, no need for further effort now.
|
The primary scope of this project is leap the fast approaching hurdle:
GRID-only access to the RAL T1 resources by the end of August 2007.
Even assuming there were an appetite to do this is a "GRID-pure" way we have to
be pragmatic and focus on this to minimise the risk that we won't be ready in time.
On the other hand we need to keep one eye on
Globalisation
and balance long terms gains against short term costs.
In a nutshell what we require is to be able to continue to run the
same mix of jobs as before with as little extra effort as possible.
Mike has supplied the following Use Cases.
Created with: latex2html -split 0 use_cases.tex
We have to be able to install the following
- Our C++ code i.e. minossoft
- Private Test Releases
- Our Fortran code i.e. labyrinth
- All support libraries e.g. Cernlib, Root, Genie
- The flux files
- The database
We do need to support Snapshot Releases but not development flip-flop.
For the installation of our software we will use
RSD: Remote Software Deployment.
It can be used for minossoft, labyrinth and the supporting libraries.
Within RAL T1 and T2 we can connect to the sql.gridpp.rl.ac.uk server
and as at least software admins will be allowed to log into cluster
head node we can run DBMauto to keep it up to date.
On the other hand there is
Globalisation
to consider. Whereas within the LHC GRID I think the standard
operating system is base on SLx which includes MySQL I don't know if
that's the case with wider GRID or whether logging to maintain it
would be permitted which is why this section is still orange.
I had hoped that
RSD: Remote Software Deployment
could be used but it won't work: in general there isn't sufficient
space in the software area (which is shared with other groups) to have
~ 50GB of flux files.
Most likely we will stored them like other data and then this just becomes another
aspect of
Data Access
albeit that the way the data is requested will be different.
The system
RSD
is up and running. Each new release has to be packaged up
but that should be an almost automatic process.
The support libraries for NeugenInterface i.e. Neugen3, Pythia and Lund
have a RSD wrapper.
RSD supports farms with heterogeneous platforms and prevent binaries
being incorrectly shared by encoding the platform name and building it
into the directory structure. Currently this encoding is only set up
for SLx but should be easy extend if needed.
Although the GRID is not the right place for program development,
there still a case to run software that isn't ready, or may never be
destined, to be part of a Frozen Release. This means that we need to
be able to install and run a Test Release.
RSD supports DAIKON_VERSIONS and DAIKON_SCRIPTS which are Robert's
solution to packaging the labyrinth.
Flux files are too big to store on a software disk so have to be pulled, as required
from a SE using
DCM
which has been extended so that it can preserve the correct directory structure.
The script
copy_flux_files_lcg.sh
uses DCM to load the correct flux files for a given beam and has a
resource locking mechanism to avoid overloading the LAN.
We need to cover:-
- One-off job submission and retrieval.
- Production operations including
- Queue loading: Submitting further jobs when there is capacity.
- Job dependency: Releasing job B after job A successfully completes.
- Job resubmission: Automated resubmission of failures in those
cases where the failure is transient.
- Production status: A overview of status showing status of all jobs.
Mike would like to do something even more ambitious:
I want to run some kind of mega-job which generates, say, 10
detector and 10 rock singles files in parallel (e.g., 20 cpus) and
then overlays the output once all are done. Each job takes typically
6-15 hours and require large input files (total 12G) which are copied
to the execution host to minimise LAN I/O as they are read several
times. Have to be carefull that multiple jobs starting synchronously
don't overload the LAN. The output files are 300-500 Mb/job
(compressed). Ideally the mega-job would be able to react if one of the
singles files crashed, probably by ordering another singles file run.
This kind of operation is not straightforward in pbs, though maybe not
impossible, at least in RAL's version. The tricky bit is coping with
crashes.
So far I only know of very basic job submission tools edg-job-*
and globus-job-run which I have supplemented with RSD
job * commands see
Tutorial: Submitting jobs
I have separated this item into a further two:
Understanding Scheduling Policies
and
Studying Existing Practices
JDL (Job Definition Language)
allows the user to target a
CE (Computing Element)
satisfying certain conditions including load levels and estimated time
from entering the queue to execution. However, the GRID isn't a free
for all and ultimately resources used have to fall into line with
those that are allocated. We need to understand how this balance is
achieved. It maybe, while we stay on RAL T1 that a fare-shares scheme
operates as now, but if we do expand to other CEs we need to
understand how this works in case there is a danger that either we
fail to get our share or that we use it too rapidly and run short
towards the end of a quota period.
Several group e.g. BaBar and some LHC experiments already do
production work on the GRID so we need to pick their brains. I have
contacted CMS (Dave Newbold), ATLAS (Roger Jones) and LHCb (Raja
Nandakumar):-
- CMS: Talk to Dave Colling
- LHCb: Talk to Andrei Tsaregorodtsev
CMS BOSS
BOSS
might be a possibility, if our workload is sufficiently large to
warrant the overhead. The latest
User Guide
is over a year out of date (there have been a number of releases since
then) but perhaps at least gives a feel for the system. My first
impressions are:-
Pros
- System is generic and extensible
- Offers real time monitoring of jobs
- Already supports multiple batch systems including gLite and Condor
- Has long term support by a LHC experiment
Cons
- Is complex (~ 30K lines of C/C++) + relational DB
- No under our control, might end up having to support our own version
- Still have to write
- Scripts to analyse job output
- Scripts to chain jobs together (but BOSS packages up all the info to make decisions)
Speaking Gutsche Oliver of CMS, BOSS although good, is "very
complicated" and "still not rock-stable". David Collings agrees that
if we want basically a replacment batch system we should layer scripts
on the basic job submission tools.
For the medium to long term BOSS, at the very least, is an interesting
model to study.
CMS Crab
LCG Interoperability
Uses BOSS and adapts it to the CMS environment and forms a layer
beneath which jobs can run on both OSG and LCG/EGEE GRIDs. David
Colling: CRAB is CMS specific, would not be suitable for MINOS.
LHCb DIRAC
Andrei Tsaregorodtsev (LHCb)suggested we look at DRAC as it is a
generic system. I have found:-
DIRAC Distributed Infrastructure with Remote Agent Control
Abstract of that paper:-
This paper describes DIRAC, the LHCb Monte Carlo production
system. DIRAC has a client/server architecture based on: Compute
elements distributed among the collaborating institutes; Databases for
production management, bookkeeping (the metadata catalogue) and
software configuration; Monitoring and cataloguing services for
updating and accessing the databases. Locally installed software
agents implemented in Python monitor the local batch queue,
interrogate the production database for any outstanding production
requests using the XML-RPC protocol and initiate the job
submission. The agent checks and, if necessary, installs any required
software automatically. After the job has processed the events, the
agent transfers the output data and updates the metadata
catalogue. DIRAC has been successfully installed at 18 collaborating
institutes, including the DataGRID, and has been used in recent
Physics Data Challenges. In the near to medium term future we must
use a mixed environment with different types of grid middleware or no
middleware. We describe how this flexibility has been achieved and how
ubiquitously available grid middleware would improve DIRAC."
First impression is that this is too heavyweight for us. It
incorporates software deployment of application software based on the
GAUDI framework and managed with CMT. We already have RSD. Also its
bookkeeping replicates SAM.
Atlas/LHCb GANGA
Early on I found little information on this and was inclined to
dismiss it. However I have now taken a closer look
and it appears almost perfectly suited to our needs and have now
chosen this as our Job Submission system.
FNAL CONDOR_G
The FNAL batch farm also uses GRID tools so I have spoken with Howie.
- The system they use is CONDOR_G (yes the _G stands for GRID!).
- CONDOR_G has a monitoring tool for overall status and to
determine when to reload the queue, supports job dependency and
resubmits if jobs fails abnormally.
- It's not "fully GRID":-
I have looked for examples of CONDOR_G running on LCG and have found:-
A Grid of Grids using Condor-G
which is the "mainstay of ATLAS production on LCG". However, I have
also found, in talk in October 2006
EGEE GRID: Also CondorG submission is possible Requires some expertise
and has no support from the service provider
Instead the recommended way to submit to EGEE is via GANGA.
CMS appears to take the view that the two GRIDs: OSG and LCG/EGEE would
not harmonise rapidly enough that a single interface would emerge in
time for LHC so instead provided a more generic layer than could
underneath talk to both GRIDs and only use CONDOR_G for OSG.
I have also spoken with Gutsche Oliver (CMS). OSG uses direct
submission CONDOR_G and uses it's input/output sandbox handling while
on EGEE jobs are submitted to the RB that handles the sandbox while
CONDOR_G deals with job submission with I/O directed to the RBs SE.
To me this sounds like we have to understand the workings CONDOR_G and
RBs quite well and I know neither! CONDOR_G isn't going to be a quick
fix solution.
Conclusion
We will use Ganga.
Ganga can be used out-of-the-box but to make it as easy and reliable
as possible for production work we need to develop layer on top to
provide a way to automatically monitor a set of jobs and respond to
changes for example:-
- Send email when a job ends.
- Resubmits obvious failures.
- Submit more jobs if the queue is running low.
This, of course, implies we know how many MINOS jobs are queued and running on a CE,
something I haven't worked out yet!
- Eventually provide information to user written scripts to make
more sophisticated choices.
So the custimisation itself is not an issue, its now down to
Implementation
In order to provide a batch system on top of Ganga
Ganga-based Batch Submission (GBS)
is being developed.
Data Access covers:-
- Job Access Reading and writing data files running on a farm node.
- Replication Making copies of input files available to a farm.
- Harvesting Collecting output files and returning them to
the master store.
- Cataloging Recording location and status of data files.
- Concatenation Tape is used as the back-end medium in a
number of our storage elements and don't handle small files
efficiently, so we have to concatenate them.
We need to decouple the job production system from details of data
access as far as possible so I have separated out
Data Access Logical API
first as a way to analyse what we require and then hopefully develop
a system to satisfy it that job production can use.
A logical interface has to specify the following, either explicitly or
implicitly:-
- Mode
The type of access: read, write (or both?).
- Data Selection
The data to be accessed. In its simplest form it is a file name.
Another possibility is a dataset, either as supported by SAM, or some
other means, for example a file containing a list of file names.
Datasets are a powerful concept when it comes to setting up production
systems but they are never going to be a complete answer: to process a
dataset quickly will require that the production system launch
multiple jobs each processing only part of the set.
- Local Representation
How the data appears locally. Possibilities are
- Local/NFS disk, e.g. for data coming from a local storage element
being written or read by code that cannot read directly from the storage element.
This will include all Fortran code.
- Soft link as a way to decouple production from the physical
organisation on local/NFS disk
- A storage URL in those cases where the file can be read or
written directly to a storage element.
- Remote Location
Where the data should ultimately reside. This could be a physical
directory directly accessible to the job or a logical directory for
example the
LHC File Catalog
or
CASTOR directory structure.
A good interface should be able to supply defaults for all but the
first two, for example right now DCM can be passed a file name and
uses that to determine the Remote Location, copy it over if
required and return a soft link as the Local Representation.
However, we do need to control each of these independently. For
example
- A private data set with its own Remote Location
- Forcing Local Representation to be a local disk copy in
the case of a flux file read multiple times.
Having a system that impliments our
Data Access Logical API
offers adavantages both as we migrate to the GRID and subsequently if the technology changes.
Currently
DCM
is the system we use and it has been agreed that we develop it to satisfy the API.
There is plenty of choice of
Storage Element Protocols
We still have learn which the best ones are. In the meantime, the
following should be considered.
- Clearly it makes no sense to use them all and, as at least in
the short term we plan only to run at RAL, there is a temptation to
keep things as simple as possible and use the lowest level protocols.
This strategy is given further weight as unattributable, but well
informed, sources have warned us that
LFC/LCG
is not yet a mature reliable service. See also
Atlas Web IssuesWithPosixIO
To further confuse the issue other reliable sources have warned that
middle-ware SRM protocols are changing from V1 to V2 and this isn't an
area for the non-expert and that the long term future is LFC/LCG!
- dCache has proved to be far
from satisfactory for us. This may in part have been down to the fact
that we share our server with Atlas who manage to put it into an odd
state on multiple occasions. However other experiments have also had
major problems which have encouraged them to migrate to CASTOR. The
final nail in this particular coffin is that there is a plan to
terminate the dCache service at RAL, at least for writing, at the end
of June 2007, but which time everyone should be using CASTOR.
- The above two points lead to the conclusion that we should use
rfio (Remote File IO) to CASTOR
and dccp to dCache in the short
term and review the situation when we eventually consider running and
T2 sites away from RAL.
- Do we want our own file catalogue or should we use
LFC?
For now the plan is to generate ASCII file "catalogues" by scanning
the dCache and CASTOR SEs and have DCM use these, in much the same way
it can already use a text file containing a list of file in FNAL's
Enstore. Again we need to review the situation when we eventually
consider running and T2 sites away from RAL. Hopefully by then LFC
will be fit for purpose.
- What about other, non-UK, groups working away from FNAL?
Do we eventually want a truly GRID-wide solution?
- What constraints are imposed accessing FNAL, will KERBEROS be a
problem?
We will use DCM so there isn't anything that needs to be customised.
For the short to medium term we will use rfio for CASTOR and dccp (to copy) and posix (to list) for dCache.
On the assumption that we will use DCM there are a number of short
term items that need addressing. Some are already complete.
- The purge facility will be withdrawn and start up speeded
up by not reading file stats by default. The local catalogue
service will be retained while we have NFS disks for it to catalogue.
DONE
- At present DCM can only retrieve data from Enstore at FNAL
using wget, it must be extended to support other protocols, and in
particular dCache CASTOR to local SEs. To do this it will maintain an
ASCII file of files in RAL's dCache and CASTOR to augment the file it
already has of dCache at FNAL. This will allow it to locate files that
satisfy wild-card queries.
DONE
- Currently DCM is meant to run interactively but to be any use
in a batch environment it has to be able to return results to a
calling program. In the short term an exit code indicating that it
did or didn't manage to locate all the requested files might suffice
but in the longer term, when we might want DCM simply to return a URL
that ROOT can open, we need to communicate more than a simple
pass/fail.
DONE
- DCM will have the ability to return requested files in a text
file and the option to control whether they have to be local files or
just URLs that can be passed to TFile::Open().
DONE
- DCM will be extended to support writing files as well as
reading them.
DONE
- DCM will have to run on Worker Nodes as well as the User
Interface machines.
This is partially complete, DCM will run on a WN and can read and
write to SEs but doesn't have access to catalogues for
searching. Currently the solution is to set up jobs on the UI, where
DCM can be used to locate files, and record them as URLs and then
submit jobs passing these URLs. It can write global logs back to the
software disk and there needs to be a system to collect up the logs
every few days.
- GRID job submission permits small data files (recommended
upper limit "a few KB") to be returned in the job's "sandbox". However
users will need to be able write much bigger files that they don't
want to publicly catalogue and there needs to be a "home delivery
service" that transparently delivers files back to the UI from which
the job was launched in this case.
In the longer term the DCM mapping from file name to a URL needs to be
work at sites away from RAL. Hopefully LFC will mature enough by then.
We have chosen low-level, technology-specific CASTOR and dCache
protocols for the short term. This will need to be revised in the
longer term when we need to work at sites away from RAL.
In a perfect world we could easily set up a system that runs anywhere
in the world. In the real world we have a GRID of overlapping
incompatible technologies and deadline by which we need to be running
at RAL. So we have to make that our primary focus but shouldn't
gratuitously make choices that make it harder to globalise, so need to
to at least consider what would work at FNAL.
I will divide the move towards globalisation into 3 time periods:-
- Short term (next 6 months)
In this time frame we want to be running on the RAL T1 with a singe SE
(CASTOR)
- Medium term (6 months - 18 months)
It has been suggested that as LHC comes on stream there is going to be
a great deal of pressure on RAL T1 and we may well do better by moving
off onto local T2 at RAL, Oxford and the London Universities (who have
an MOU that requires them ALL to provide resources to any experiment
that ANY of them run. This doesn't impact much on our technology
choice but will complicate our storage model as we have to handle
shipping data to and from RAL
- Long term (>18 months)
Here the ultimate goal is to allow all of MINOS to have a common
interface to both EGEE and OSG GRIDS. Well, we can dream can't we?
From what I have learned so far (see Job Submission
Technology Choice ) the two GRIDs haven't harmonised to the point
where this can affect our choice in the short term; although
Ganga does have a Condor back-end so it might work on Fermigrid.
As it explains in
MINOS on the GRID: Intended Audience
"This web has been written not because there is insufficient
information already on the Internet, but perversely because there is
far too much. The problem is that the GRID isn't a single complete
consistent whole but rather multiple overlapping, conflicting,
evolving systems and we have to select from the myriad of available
documents just those that are strictly relevant to get our job
done."
So the scope of the document is global: to cover every aspect of the
GRID that is necessary to get the job done, however it can and should
point to material on the web wherever appropriate.
Our standard documentation forms are LaTex and HTML but it's no real
contest for a constantly changing document in the computing domain: it
has to be HTML.
This action item will be used to list known weaknesses in
MINOS on the GRID
that need to be addressed.
Also, although the documentation describes how to how to use the
lcg-infosites, lcg-info and ldapsearch commands to collect information
about the GRID, these commands are not particularly friendly so I
propose writing a utility GIQ (GRID Information Query) that will be
able to run all the standard queries. The reason I have placed this
development here is that it will include an option to display the
commands it has executed to collect the information so acts as a
teaching aid to help users when they need to go beyond the standard
query set.