Computing Use-Cases for
the RAL Tier-1A Cluster

Mike Kordosky

Abstract:

I describe some use-cases for computing on the RAL Tier-1A cluster.

Introduction

Beginning in August 2007 the RAL cluster will transition from a PBS job-submission environment to GRID-only access. This major change will be accompanied by a change to the data-storage model (NFS is being phased out or disfavoured, CASTOR phased in) and the possible, if somewhat controversial, termination of the RAL user interface (i.e., remote login) nodes. Nick West and I are responsible for shepherding the experiment "into the GRID". This document attempts to summarise the way in which MINOS uses the RAL farm so as to guide our efforts.

In what follows we refer to computing power in units of "Kilo-SpecInt 2000" (KSI2K). The KSI2K unit is the output of a benchmark suite and characterises computing power. It is only partially correlated with computing speed and is used here because it is the common currency at RAL. In addition, it is not clear that the computing time of our MINOS jobs scales in proportion to KSI2K. Our requested quota on the RAL farm is 50 KSI2K. At the time of this writing, due to farm upgrades, we have 57.5 KSI2K.

Use Cases

MINOS computing usage may be divided into four general use cases: MC "singles" generation in gminos; MC overlaying in gminos and production of reroot files; offline detector simulation and reconstruction; and ntuple building and skimming.

Singles Generation

This relies only on GMINOS and GNuMI flux files. For L010185 (-10 cm, 185 kA) we submit 11 "detector" (contained vertex) jobs at 2.1e16 POT each and 11 "rock" (rock vertex) jobs at 2.1e15 POT each. The jobs run independently. Collectively these jobs are referred to as one "run". One run represents 2e17 POT and takes approximately 12 KSI2K-days (6 KSI2K-days/1e17 POT). Given our current 57.5 KSI2K allocation we can generate about 1e18 POT/day of standard L010185 MC.
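The arithmetic behind these figures can be checked with a short Python sketch. The constants are the numbers quoted in this section; the function and variable names are illustrative only, and the quoted POT per run is taken as given rather than derived from the per-job figures.

```python
# Figures quoted in the text for standard L010185 MC.
KSI2K_DAYS_PER_RUN = 12.0   # approximate cost of one run
POT_PER_RUN = 2e17          # POT represented by one run, as quoted
ALLOCATION_KSI2K = 57.5     # current allocation on the RAL farm


def cost_per_1e17_pot(ksi2k_days_per_run, pot_per_run):
    """Cost in KSI2K-days of generating 1e17 POT."""
    return ksi2k_days_per_run / (pot_per_run / 1e17)


def pot_per_day(allocation_ksi2k, cost_per_1e17):
    """POT generated per day if the whole allocation is used continuously."""
    return allocation_ksi2k / cost_per_1e17 * 1e17


cost = cost_per_1e17_pot(KSI2K_DAYS_PER_RUN, POT_PER_RUN)
daily = pot_per_day(ALLOCATION_KSI2K, cost)

print(f"cost:  {cost:.1f} KSI2K-days per 1e17 POT")
print(f"daily: {daily:.2e} POT/day")
```

The result is a little under 1e18 POT/day, consistent with the rounded figure above. It assumes the allocation is fully used around the clock and that job time scales with KSI2K, which the introduction notes is not certain.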

Each job requires 12 Tb of flux files to be copied to the execution host [1]. Because the flux files are looped over several times in a job, copying everything to the execution host is much more efficient than reading the flux files "as needed" over the network. The files are served from NFS mounts as dCACHE-disk was not reliable [2]. Copying flux files takes 20-30 min per job and can create a heavy load. To reduce the instantaneous load, job start times are staggered by 3-5 min. Each job copies its output to FNAL using scp and kerberos5 [3].
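The effect of staggering on the copy load can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that all 22 jobs of a run are launched as one staggered sequence with a fixed interval and that every copy takes the same time; the function names are hypothetical and this is not the actual submission script.

```python
import math


def start_offsets(n_jobs, stagger_min):
    """Start time of each job, in minutes after the first job starts."""
    return [i * stagger_min for i in range(n_jobs)]


def peak_concurrent_copies(n_jobs, stagger_min, copy_min):
    """Largest number of flux-file copies in progress at any one time.

    Job i copies during [i * stagger_min, i * stagger_min + copy_min),
    so at most ceil(copy_min / stagger_min) copies overlap.
    """
    return min(n_jobs, math.ceil(copy_min / stagger_min))


N_JOBS = 22  # 11 detector + 11 rock jobs (assumed to be staggered together)

# Extremes of the quoted ranges: 3-5 min stagger, 20-30 min copy.
worst = peak_concurrent_copies(N_JOBS, stagger_min=3, copy_min=30)
best = peak_concurrent_copies(N_JOBS, stagger_min=5, copy_min=20)
last_start = start_offsets(N_JOBS, 5)[-1]

print(f"peak simultaneous copies: {best} to {worst} (instead of {N_JOBS})")
print(f"last job starts {last_start} min after the first (5 min stagger)")
```

Under these assumptions staggering cuts the number of simultaneous copies from 22 to somewhere between 4 and 10, at the price of delaying the last job by up to roughly 105 min.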

Overlaying and Reroot

Overlay jobs read the output of singles generation and construct overlaid rock+contained interaction spills for the Near Detector. This is done using the GMINOS [4] program. First the eleven (or ten, we allow for one crash) rock files are merged into one larger file. Then the single rock file is combined with the detector files to make one output file per detector file, reusing rock events up to ten times. The neutrino interactions are time-arranged into spills with a given POT count (typically 2.4e13 POT/spill). The jobs take negligible computing resources in comparison to singles generation and reconstruction. The basic problem, from a batch computing standpoint, is that the overlaying job depends on the other jobs finishing and that overlaying may not proceed unless there are at least 10 rock files. The latter dependency is difficult to enforce and is currently done by hand, along with all overlay processing, by R. Hatcher and A. Kreymer at FNAL.
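The dependency described above could be expressed as a simple readiness check. The following Python sketch is only an illustration of the rule (all singles jobs finished, at least 10 good rock files); the job-state strings, job names and function names are hypothetical and do not correspond to any existing MINOS or RAL tool. The spill count assumes that the full 2.1e16 POT of a detector file goes into the overlay output.

```python
MIN_ROCK_FILES = 10            # overlaying may not proceed with fewer
POT_PER_DETECTOR_FILE = 2.1e16
POT_PER_SPILL = 2.4e13         # typical value quoted in the text


def overlay_ready(rock_states, detector_states, min_rock=MIN_ROCK_FILES):
    """Return True if overlaying can start for one run.

    Each argument maps a job name to "running", "done" or "failed"
    (hypothetical state labels).
    """
    all_states = list(rock_states.values()) + list(detector_states.values())
    if any(state == "running" for state in all_states):
        return False  # still waiting for singles jobs to finish
    good_rock = sum(1 for state in rock_states.values() if state == "done")
    return good_rock >= min_rock


def spills_per_detector_file():
    """Number of spills one detector file fills at the typical POT/spill."""
    return round(POT_PER_DETECTOR_FILE / POT_PER_SPILL)


# Example: one rock job crashed, everything else finished.
rock = {f"rock_{i}": "done" for i in range(11)}
rock["rock_3"] = "failed"
detector = {f"det_{i}": "done" for i in range(11)}

print(overlay_ready(rock, detector))   # one crash is tolerated
print(spills_per_detector_file())
```

A second crashed rock job would make `overlay_ready` return False, which is the case that currently has to be caught by hand.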

Rerooting refers to the process of converting the ZEBRA output file produced by GMINOS into a ROOT file for use by the offline framework. The computing resource usage is negligible but this stage does require that the MINOS offline be installed.

DetSim and Offline Reconstruction

This case refers to running reconstruction on the RAL farm and producing an output ntuple. Currently all reconstruction is done at FNAL. An old estimate, based on R1.18 ("birch") reconstruction, suggests 2.5 KSI2K-days/1e17 POT, but this estimate may be off by a factor of two or more. This stage requires full access to the offline framework, including an up-to-date database.

Ntuple Building and Skimming

This case refers to processing reconstructed output into an analysis-level ntuple. In the case of the "PANs" produced by the Mad package, processing the entire first-year dataset and 1.7e19 POT of the L010185 MC took approximately 7 KSI2K-days. Therefore, we may crudely estimate 6 KSI2K-days/1e20 POT. This is a small resource usage. The biggest difficulty in ntuple building and skimming is organising and referencing the sntp input files. This includes finding space, copying from FNAL and looping over the files in some number of jobs ($\gg 1$, as one doesn't want to wait days for the output). New (currently cedar) versions of data files are saved on NFS disk. Older versions are concatenated (all subruns into one file) and written to dCACHE-disk. This is not yet done for carrot-era MC as it lacks subruns, but will probably be done in the near future. Note: the processing times above refer to reading from NFS. A crude observation is that reading from dCACHE-disk took roughly twice as long for a similar operation. This is wall time, not CPU time. In any case, the resources required to build analysis ntuples are quite small.
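Looping over the sntp files in a number of jobs amounts to partitioning a file list. A minimal Python sketch of that step follows; the file names are invented for illustration and the function name is hypothetical.

```python
def split_into_jobs(sntp_files, files_per_job):
    """Partition a list of input files into consecutive per-job lists."""
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    ordered = sorted(sntp_files)
    return [
        ordered[i:i + files_per_job]
        for i in range(0, len(ordered), files_per_job)
    ]


# Hypothetical file names, for illustration only.
example_files = [f"sntp_{n:04d}.root" for n in range(7)]
jobs = split_into_jobs(example_files, 3)

for job_id, file_list in enumerate(jobs):
    print(job_id, file_list)
```

Each inner list would become the input list of one batch job. The choice of `files_per_job` trades turnaround time against the number of jobs to track.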

Conclusion

I would suggest we concentrate on Singles Generation as the primary use case, followed by Ntuple Building. In the case of the latter we should think about the mechanics of running many thousands of jobs: How do we generate a list of jobs? How do we ensure that the sntp files on the list are available? How do we find enough space for them, particularly if NFS is not available? How do we make the running efficient (one sntp file per job suffers from fixed startup costs)? Running the reconstruction would be nice but depends on making overlaying work at RAL (rather than at FNAL as is the case now). There is also an issue in that the collaborators doing the offline processing apparently want to keep it local to FNAL, nominally for QA/QC reasons. As a result, I would de-emphasise Overlaying+Reroot and DetSim+Reconstruction.
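The efficiency question can be made concrete with a toy cost model in Python. The startup and per-file times below are placeholders chosen only to show the shape of the trade-off; they are not measurements, and the function names are hypothetical.

```python
import math


def total_cpu_time(n_files, files_per_job, startup, per_file):
    """Total CPU time: one fixed startup cost per job plus per-file work."""
    n_jobs = math.ceil(n_files / files_per_job)
    return n_jobs * startup + n_files * per_file


def overhead_fraction(n_files, files_per_job, startup, per_file):
    """Fraction of the total CPU time spent on job startup."""
    n_jobs = math.ceil(n_files / files_per_job)
    return n_jobs * startup / total_cpu_time(
        n_files, files_per_job, startup, per_file
    )


# Placeholder numbers (seconds), NOT measured values.
N_FILES, STARTUP, PER_FILE = 1000, 60.0, 60.0

for files_per_job in (1, 10, 100):
    frac = overhead_fraction(N_FILES, files_per_job, STARTUP, PER_FILE)
    print(f"{files_per_job:4d} files/job: {frac:.1%} of CPU time is startup")
```

With these placeholder values, one file per job spends half of the CPU time on startup, while grouping ten files per job reduces that to under a tenth. Real numbers would have to be measured on the farm.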

About this document ...

This document was generated using the LaTeX2HTML translator Version 2002 (1.62)

The translation was initiated by n west (IT Staff) on 2007-03-02


Footnotes

[1] Other sites, such as CalTech, permanently install the flux files on the execution hosts. This is most efficient but not permitted at RAL. It also leads to some overhead when switching beam types.
[2] The future plan is to move to CASTOR data access. It is not very clear that CASTOR-disk is superior to dCACHE though.
[3] Access is via a "keytab" kept in a secure area on the RAL system.
[4] Actually "reco_minos" as opposed to "gminos_batch".
