Mike Kordosky
Beginning in August 2007 the RAL cluster will transition from a PBS job-submission environment to GRID-only access. This major change will be accompanied by a change to the data-storage model (NFS is being phased out or disfavoured, CASTOR phased in) and the possible, if somewhat controversial, termination of the RAL user-interface (i.e., remote login) nodes. Nick West and I are responsible for shepherding the experiment "into the GRID". This document attempts to summarise the way in which MINOS uses the RAL farm so as to guide our efforts.
In what follows we refer to computing power in units of "Kilo-SpecInt 2000" (KSI2K). The KSI2K rating is the output of a testing suite and characterises the computing power of a machine. It is only partially correlated with computing speed, and it is not clear that the computing time of our MINOS jobs scales in proportion to KSI2K; it is used here because it is the unit of common currency at RAL. Our requested quota on the RAL farm is 50 KSI2K. At the time of this writing, due to farm upgrades, we have 57.5 KSI2K.
Each job requires 12 Tb of flux files to be copied to the execution host. Because the flux files are looped over several times in a job, copying everything to the execution host is much more efficient than reading the flux files "as needed" over the network. The files are served from NFS mounts because dCACHE-disk was not reliable. Copying flux files takes 20-30 min per job and can create a heavy load. To reduce the instantaneous load, job starts are staggered by 3-5 min. Each job copies its output to FNAL using scp and kerberos5.
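The effect of staggering on the copy load can be estimated with a little arithmetic. The sketch below is only an illustration, assuming every job starts exactly one stagger interval after the previous one and copies for a fixed time; the function names are hypothetical and not part of any MINOS or RAL tool.

```python
import math


def stagger_offsets(n_jobs, stagger_min):
    """Start time, in minutes after submission, for each staggered job."""
    return [i * stagger_min for i in range(n_jobs)]


def peak_concurrent_copies(n_jobs, stagger_min, copy_min):
    """Largest number of jobs copying flux files at the same moment.

    Assumes job i copies during [i * stagger_min, i * stagger_min + copy_min).
    With no stagger, every job copies at once.
    """
    if stagger_min <= 0:
        return n_jobs
    return min(n_jobs, math.ceil(copy_min / stagger_min))


if __name__ == "__main__":
    # Extremes of the ranges quoted in the text: 20-30 min copies, 3-5 min stagger.
    for copy_min, stagger_min in [(20, 5), (30, 3)]:
        peak = peak_concurrent_copies(100, stagger_min, copy_min)
        print(f"copy {copy_min} min, stagger {stagger_min} min: {peak} simultaneous copies")
```

Under these assumptions, the quoted ranges correspond to somewhere between 4 and 10 jobs copying simultaneously, instead of the whole batch at once.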
Rerooting refers to the process of converting the ZEBRA output file produced by GMINOS into a ROOT file for use by the offline framework. The computing resource usage is negligible, but this stage does require that the MINOS offline software be installed.
This case refers to running reconstruction on the RAL farm and producing an output ntuple. Currently all reconstruction is done at FNAL. An old estimate, based on R1.18 ("birch") reconstruction, suggests 2.5 KSI2K-days per 1e17 POT, but this estimate may be off by a factor of two or more. This stage requires full access to the offline framework, including a database kept up to date.
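To put the estimate in context, it can be converted into a throughput for a given share of the farm. The sketch below assumes, purely for illustration, that the entire requested quota of 50 KSI2K is devoted to reconstruction and that computing time scales linearly with KSI2K, which the text above cautions may not hold. The function name is hypothetical.

```python
def pot_per_day(quota_ksi2k, cost_ksi2k_days_per_1e17_pot):
    """POT that can be reconstructed per day for a given quota and cost."""
    return quota_ksi2k / cost_ksi2k_days_per_1e17_pot * 1e17


if __name__ == "__main__":
    quota = 50.0    # requested RAL quota in KSI2K
    nominal = 2.5   # KSI2K-days per 1e17 POT (old R1.18 estimate)
    # The estimate may be off by a factor of two, so bracket it.
    for label, cost in [("optimistic", nominal / 2),
                        ("nominal", nominal),
                        ("pessimistic", nominal * 2)]:
        print(f"{label}: {pot_per_day(quota, cost):.1e} POT/day")
```

With the nominal figure this gives roughly 2e18 POT per day, and between 1e18 and 4e18 POT per day if the estimate is wrong by a factor of two in either direction.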
I would suggest we concentrate on Singles Generation as the primary use case, followed by Ntuple Building. For the latter we should think about the mechanics of running many thousands of jobs: How do we generate a list of jobs? How do we ensure that the sntp files on the list are available? How do we find enough space for them, particularly if NFS is not available? How do we make the running efficient (one sntp per job suffers from fixed startup costs)? Running the reconstruction would be nice but depends on making overlaying work at RAL (rather than at FNAL, as is the case now). There is also the issue that the collaborators doing the offline processing apparently want to keep it local to FNAL, nominally for QA/QC reasons. As a result, I would deemphasise Overlaying+Reroot and DetSim+Reconstruction.
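One possible answer to the job-list and efficiency questions is to group several sntp files into each job so that the fixed startup cost is shared. The sketch below is a minimal illustration only: the function names, the availability check, and the timing numbers are all hypothetical, and a real version would need to query whatever storage system (e.g. CASTOR) actually holds the files.

```python
def build_job_list(sntp_files, files_per_job, is_available=lambda name: True):
    """Group available sntp files into jobs of at most files_per_job files.

    is_available is a placeholder predicate; by default every file is
    assumed to be present.
    """
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    files = sorted(name for name in sntp_files if is_available(name))
    return [files[i:i + files_per_job]
            for i in range(0, len(files), files_per_job)]


def startup_fraction(files_per_job, startup_min, per_file_min):
    """Fraction of a job's wall time spent on the fixed startup cost."""
    return startup_min / (startup_min + files_per_job * per_file_min)


if __name__ == "__main__":
    # Hypothetical file names and timings, for illustration only.
    files = [f"sntp_{i:04d}.root" for i in range(10)]
    jobs = build_job_list(files, files_per_job=4)
    print(len(jobs), "jobs")
    for n in (1, 4, 9):
        frac = startup_fraction(n, startup_min=5, per_file_min=5)
        print(f"{n} file(s) per job: {frac:.0%} of time is startup")
```

The trade-off is that larger groups need more scratch space per job and lose more work when a single job fails, so the group size would have to be tuned against the space actually available.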
The translation was initiated by n west (IT Staff) on 2007-03-02