GRID UK Meeting

Nick West
Last modified: Mon Feb 11 15:34:57 GMT 2008

We are planning a GRID UK Meeting:-

We have had two GRID UK Meetings last year:-

See also our work program project status

GRID UK Meeting on Thursday 7 February 2008 at Oxford

Date:  Thursday 7 February 2008
Time:  11:00 - 15:00 
Where: Martin Wood building Mendelssohn Room

The Mendelssohn Room is situated on the ground floor on the right hand side of the Martin Wood Lecture Theatre.

The theme of this meeting will be batch job submission and the migration of all our NFS/dCache disk and legacy ADS tape data to CASTOR.

The following have said they will attend:-

Oxford:  Nick, Alex + 3 others
Sussex:  Marta
RAL:     Tobi
UCL:     Justin

Here are the topics I have thought of to date, please send suggestions ASAP.

  1. Migration of our ADS tape data (MINOS and Soudan2)

    We have to make a decision as to exactly what data we want and how it will be organised. What about Soudan2 data?

    See: Termination of RAL ADS/VTP Tape Service

  2. Migration of NFS/dCache disk to CASTOR

    It's unlikely that we will have any significant CASTOR allocation before March, but once we have one the clock starts: we then have no more than 3 months to move all our data there. The deadline is not so much that the NFS disk goes away but that RAL want to close general login accounts; we have argued that we need them, as managing NFS disk is very hard without them. Once we have an alternative to NFS disk, that argument collapses.

    We have to take a series of decisions:-

    Migration and catalogues are further discussed here

  3. Using GBS for batch submission

    Most people will have tried submitting jobs to the GRID using glite-wms-job-* and Ganga, but GBS (Ganga-based Batch Submission), a production system layered on Ganga, is being developed. Its primary target is to be the workhorse for MC production, and part of the meeting will be devoted specifically to this (there is a demo production). However, it is designed to make job submission simple and could even offer an alternative entry point into GRID job submission.

    There is a wish list of future additions. Are more required?

GRID UK Meeting on Thursday 15 November at UCL

Date:  Thursday 15 November 2007
Time:  10:00 - 16:00 (?)
Where: UCL physics building, on the ground floor, room E1

See UCL High Energy Physics - How to get here

People coming from Euston *Station* should come in on the Gower Place entrance, enter the building, go by the mail boxes, take a right and walk all the way down to the end of the hall to find E1. People coming from Warren Street or from Euston *Square* (Circle Line, et al) should come in through the quadrangle as our room E1 is next to the door out to the cloisters and the quad.

The theme of this meeting will be the migration of all computing at RAL to the GRID given that qsub submission to the RAL PBS farm terminates at the start of January 2008. It will mostly be a series of discussions of experiences to date and plans for the future. At this stage no presentations are planned although I (Nick) will try to give short talks "on demand" on any GRID topic covered by our web

Important: Anyone who attends and has ambitions to run GRID jobs in the near future is strongly encouraged to get a GRID certificate and attempt to work through the primer so that they have at least run one loon job via Ganga.

Here are the topics I have thought of to date, please send suggestions ASAP.

  1. Running Individual Jobs

    This is the subject of most general interest: "I used to log onto RAL and do qsub, now what am I supposed to do?"

  2. Developing a GRID Strategy

    We have to move away from a single farm + bunch of NFS disks mindset towards the brave new world of distributed computing. In the short term (~ June 2008) it's likely to remain RAL T1 and mostly NFS with data eventually moving to local Castor. However, as LHC ramps up we may come under pressure to move to other farms. RAL T2 is the first choice but what about Oxford and the London Universities? We should at least start thinking about what we could/should aim to do in the longer term.

  3. Migrating the MC Production

    This last topic has a very tight focus and may not interest most people directly. We provide a very valuable service to the collaboration and we have to ensure that this service doesn't break in January!

Actions: GRID UK Meeting on Thursday 15 November at UCL

Action List

Here are Mike's notes of the second meeting. An action list will be derived from his notes plus the additional items:-

Resolution of Issues Raised

In this section I (Nick) will attempt to resolve Mike's items into one or more of 3 groups:-

Job control, interrogation, exit codes
Question/Problem: Why isn't my job running? When will it run?
Response: Solved

Question/Problem: Can I peek at the job during execution (e.g., something like qcat)?
Response: Outstanding Problem

Question/Problem: Where are my jobs running?
Response: Solved

Question/Problem: Can I order a job resubmitted?
Response: Partly Solved

Question/Problem: Can I hold jobs that are not yet running? More generally, can I signal my jobs in any way (e.g., to kill them)?
Response: Partly Solved

Question/Problem: Is there "queued" status distinct from "submitted" and "running"?
Response: Yes, it's called SCHEDULED. Solved

Ganga GUI
Question/Problem: People reported their experiences with the Ganga GUI. It seemed that about half could not make it work. Phil R. stated that he had a fix. Phil should post his fix somewhere.
Response: This can be fixed during installation. Solved

Job resources
Question/Problem: How do we request CPU, disk, memory resources for jobs?
Response: By directing the job to a queue with adequate resources. Solved

Question/Problem: How do we figure out the limits of different queues/CEs? For example: lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-grid500M. Possible solution: j.backend. interface in ganga?
Response: Solved

Best Practices
Question/Problem: How should you pass files into your job? (a) list files on command line (b) ASCII file with list of file names (DCM urls?) (c) pass a DCM/SAM query
Response: Before we can develop best practices we need examples of the types of jobs to be submitted. Quote handling is even harder because of the extra layers (e.g. Ganga, JDL), although for submission via GBS the script runs within a wrapper, which at least offers a way to rebuild parameters immediately prior to shell invocation.

Question/Problem: How should you pass information into your root macro? Macro functions can take C++ datatypes, though perhaps they should not. Making sure the shell doesn't remove quotes can be a real pain. Some concrete examples might be useful here. Really, make attempts to spread knowledge like this around. Otherwise everyone has to spend frustrating time reinventing the wheel.

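
As one concrete example of the quoting problem raised above, here is a minimal Python sketch that builds a shell command line for calling a ROOT macro with arguments. The helper name root_macro_command and the macro name ana.C are hypothetical, and it only handles string and integer arguments. It relies on shlex.quote from the Python standard library to protect the inner double quotes from the shell; it does not address the extra Ganga and JDL layers mentioned in the response.

```python
import shlex


def root_macro_command(macro, *args):
    """Build a shell-safe 'root -b -q' command calling a macro with arguments.

    Hypothetical helper for illustration: only str and int arguments are handled.
    """
    rendered = []
    for arg in args:
        if isinstance(arg, str):
            # Render as a C++ string literal, escaping backslashes and quotes.
            escaped = arg.replace("\\", "\\\\").replace('"', '\\"')
            rendered.append('"' + escaped + '"')
        elif isinstance(arg, int):
            rendered.append(str(arg))
        else:
            raise TypeError("unsupported argument type: %r" % type(arg))
    call = macro + "(" + ", ".join(rendered) + ")"
    # shlex.quote wraps the whole call so the shell passes it through intact.
    return "root -b -q " + shlex.quote(call)


cmd = root_macro_command("ana.C", "in.root", 42)
print(cmd)
# Splitting the command the way a POSIX shell would recovers the original call.
print(shlex.split(cmd)[-1])
```

The second print shows what ROOT would actually receive after the shell has processed the line, which is a quick way to check that no quotes were lost.
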
External UIs
Question/Problem: How do we select a list of input files (e.g., from "meta-data" like date and file type/name-pattern) in order to drive jobs from outside RAL? Possible solution: DCM at RAL publishes catalogue in SE, DCM elsewhere reads that, allows selections. Also, maybe as time goes on we use LFC? This is a major unsolved problem.
Response: A solution similar to this has been implemented. This problem is Solved

Question/Problem: How do we get files back to the universities? Maybe universities should run SEs, then jobs just copy output directly with grid tools? Seems difficult. Jobs store files in SE (e.g., at RAL). Another process running at the university looks in logfiles and pulls files out of the SE? Seems clunky/annoying to Mike. This is a major unsolved problem. We should attempt to understand how other experiments tackle this issue.
Response: We are exploring tools that suggest a solution (Solution Identified), but it is also still an Outstanding Problem

Control and Ownership of Files on a SE
Question/Problem: Who owns files on a SE? Are file permissions like UNIX? Does the system know which VO user wrote a file?
Response: If we use GRID middleware we all own everything and have to be careful! Solved

Question/Problem: How do we protect ourselves from one user deleting everyone else's data?

Question/Problem: How do we intentionally delete files, clean up disk? How painful is it? (Doesn't really matter!!)
Response: Files can be removed using local or GRID protocols; using GRID middleware ought not to be any harder.
What may well be harder, if we use the LFC, is keeping it in sync with our SEs. I have been told:
I think the best strategy is a change of philosophy - don't care about how tidy the SRM is; the LFC is your view of your data, the SRMs are just a dumping ground for files, which, because we mount the namespace, you can have a poke around in if you want to.
The LFC getting out of sync with the SRMs is an occupational hazard unfortunately. I'd suggest that if the lcg-del fails, then retrying and seeing if the error message changes and tidying up the lfc with the lfc-* commands.
Given that the LFC cannot even see most of our data (NFS/FNAL), it is not encouraging, is it?
Solved

Support for Test Releases.
Question/Problem: How can I run jobs that include private, uncommitted code?
Response: We now have a tool to do this. Solved

Testing Data Access from External UIs

Not only is this critical, it is also urgent; we have to report to the UB on Tuesday 4th December on how we are getting on submitting jobs remotely, so we urgently need to test this from as many external UIs as possible. Nick will test at Oxford but needs volunteers to test at other sites. Please contact Nick if you can help.

Preparation

Now you are ready to access data. This is the best I have come up with so far, using globus-url-copy. There are two cases:-

  1. Access to dCache data
  2. Access to NFS disk data

The second of these in particular is very tedious and slow. However, it should be fairly straightforward to develop scripts to deal with the tedium. The far more important question is whether the underlying services are sufficiently reliable that scripts based on them will work, or whether we have to report failure at the UB.

Access to dCache Data

In the example that follows it is assumed that you have a file somewhere on an SE at RAL and that you know its name. In this example I am picking:-
  n14001003_0001_L010185.sntp.R1_18_2.root
You do not know its location, and you want to copy it locally. This takes 1 step:-
  dcm get  n14001003_0001_L010185.sntp.R1_18_2.root
This only works for data that is in dCache. If the file is on RAL NFS disk instead, you will get the error:-
  ?Request get to SE ral_t1_ui-nfs not supported from ...
in which case see the section below.
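
A script wrapping dcm get needs to tell the two cases apart. The sketch below is one hedged way to do that in Python: it scans captured dcm output for the error quoted above. The helper name dcm_get_needs_nfs_route is hypothetical, and matching on the words "?Request get to SE" and "not supported" is an assumption based on that single quoted message, not on any documented dcm behaviour.

```python
def dcm_get_needs_nfs_route(dcm_output):
    """Return True if captured 'dcm get' output shows the NFS-only error.

    Assumption: the error always starts with '?Request get to SE' and
    contains 'not supported', as in the one message quoted in the text.
    """
    for line in dcm_output.splitlines():
        text = line.strip()
        if text.startswith("?Request get to SE") and "not supported" in text:
            return True
    return False


sample = "?Request get to SE ral_t1_ui-nfs not supported from ..."
print(dcm_get_needs_nfs_route(sample))
print(dcm_get_needs_nfs_route(""))
```

When the function returns True, the file would have to be fetched using the NFS disk procedure instead.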

Access to NFS disk data

This involves three steps:-

  1. Copy NFS file temporarily into dCache

    I have created the following directory tree

      drwxrwxr-x 1 nwest  minos  512 Nov 24 10:40 /pnfs/gridpp.rl.ac.uk/data/minos/in_transit
      drwxrwxr-x 1 nwest  minos  512 Nov 24 10:40 /pnfs/gridpp.rl.ac.uk/data/minos/in_transit/oxford/
      drwxrwxr-x 1 nwest  minos  512 Nov 24 11:53 /pnfs/gridpp.rl.ac.uk/data/minos/in_transit/oxford/west/
    
    so that I have a private place into which I can copy files. I suggest that others build on this for their private files.

    Important: Give the directories group write access. You will be running a GRID job to copy files into your directory, and then, as far as dCache is concerned, you will be minosnnn, e.g. minos003.

    Now you have to submit a GRID job to copy a file into your area. Start by creating a JDL file along the lines of the example cp_nfs_to_se_in_transit.jdl:-

      Executable = "/opt/d-cache/dcap/bin/dccp";
      Arguments = "/stage/minos-data1/west/dcm_tests/LVJ_F00034638_0000.mdaq.root /pnfs/gridpp.rl.ac.uk/data/minos/in_transit/oxford/west/";
      StdOutput = "dccp.out";
      StdError = "dccp.err";
      OutputSandbox = {"dccp.out", "dccp.err"};
      VirtualOrganisation = "minos.vo.gridpp.ac.uk";
      Requirements = other.GlueCEUniqueID == "lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS";
    
    As you can see, it just runs dccp to copy the file from disk to the in-transit area.

    Run run_test_job.perl to submit and monitor this job:-

      perl $MOG_SCRIPTS/jobs/run_test_job.perl cp_nfs_to_se_in_transit.jdl
    
    The job typically takes 5 - 10 minutes to submit and run and it should end something like this:-
      Retrieving job output ...
      Job output returned to /tmp/run_test_job_20408/west_-K9zj1FdvuGvkBhm2Tbctw:-
      
        File: dccp.out  begins (first 20 lines max):-
      
        File: dccp.err  begins (first 20 lines max):-
          403597 bytes in 1 seconds (394.14 KB/sec)
      
      Cleaning up and removing /tmp/run_test_job_20408
    

  2. Copy the file to local site using globus-url-copy

    The URL prefix to use with globus-url-copy is

      gsiftp://gftp0446.gridpp.rl.ac.uk:2811//
    
    Form the copy command using the syntax
      globus-url-copy   <remote-url>  file:<absolute-file-name>
    
    For my example
      globus-url-copy \
        gsiftp://gftp0446.gridpp.rl.ac.uk:2811//pnfs/gridpp.rl.ac.uk/data/minos/in_transit/oxford/west/LVJ_F00034638_0000.mdaq.root \
        file:/tmp/LVJ_F00034638_0000.mdaq.root
    
    To get more output use
      -vb | -verbose
          During the transfer, display the number of bytes transferred
          and the transfer rate per second.
      -dbg | -debugftp
          Debug ftp connections. Prints control channel communication
          to stderr.
    

  3. Remove temporary file from dCache

    This requires the submission of a second GRID job with a JDL file, say rm_from_se_in_transit.jdl, along the lines:-

      Executable = "/bin/rm";
      Arguments = "/pnfs/gridpp.rl.ac.uk/data/minos/in_transit/oxford/west/LVJ_F00034638_0000.mdaq.root";
      StdOutput = "rm.out";
      StdError = "rm.err";
      OutputSandbox = {"rm.out", "rm.err"};
      VirtualOrganisation = "minos.vo.gridpp.ac.uk";
      Requirements = other.GlueCEUniqueID == "lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS";
    
    which can be submitted as before:-
      perl $MOG_SCRIPTS/jobs/run_test_job.perl rm_from_se_in_transit.jdl
    
    and should end without error:-
    Job output returned to /tmp/run_test_job_20771/west_QP4jEQ5jHliuBOSWuffOHA:-
    
      File: rm.out  begins (first 20 lines max):-
    
      File: rm.err  begins (first 20 lines max):-
    
    Cleaning up and removing /tmp/run_test_job_20771
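
The three steps above are the kind of tedium a script could take over. The following Python sketch only prepares the pieces: it generates JDL text in the shape of the two examples, builds the globus-url-copy argument list, and checks a dccp.err file for the transfer summary. Nothing is submitted or copied; the JDL text would still need to be written to files and passed to run_test_job.perl. The helper names (make_jdl, dccp_bytes_copied, globus_copy_command) are hypothetical, and treating the "bytes in ... seconds" line as the sign of a successful copy is an assumption drawn from the single sample output shown.

```python
import os
import re

VO = "minos.vo.gridpp.ac.uk"
CE_QUEUE = "lcgce02.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-gridS"
GSIFTP_PREFIX = "gsiftp://gftp0446.gridpp.rl.ac.uk:2811//"


def make_jdl(executable, arguments, stem):
    """Return JDL text shaped like the two example JDL files."""
    out, err = stem + ".out", stem + ".err"
    return "\n".join([
        'Executable = "%s";' % executable,
        'Arguments = "%s";' % arguments,
        'StdOutput = "%s";' % out,
        'StdError = "%s";' % err,
        'OutputSandbox = {"%s", "%s"};' % (out, err),
        'VirtualOrganisation = "%s";' % VO,
        'Requirements = other.GlueCEUniqueID == "%s";' % CE_QUEUE,
    ]) + "\n"


def dccp_bytes_copied(err_text):
    """Return the byte count from a dccp summary line, or None if absent."""
    match = re.search(r"(\d+) bytes in (\d+) seconds", err_text)
    return int(match.group(1)) if match else None


def globus_copy_command(pnfs_path, local_path, verbose=False):
    """Return the globus-url-copy argument list for step 2."""
    if not local_path.startswith("/"):
        raise ValueError("local_path must be absolute: %r" % local_path)
    cmd = ["globus-url-copy"]
    if verbose:
        cmd.append("-vb")
    cmd.append(GSIFTP_PREFIX + pnfs_path.lstrip("/"))
    cmd.append("file:" + local_path)
    return cmd


nfs_file = "/stage/minos-data1/west/dcm_tests/LVJ_F00034638_0000.mdaq.root"
transit_dir = "/pnfs/gridpp.rl.ac.uk/data/minos/in_transit/oxford/west/"
name = os.path.basename(nfs_file)
transit_file = transit_dir + name

# Step 1: JDL for the dccp copy from NFS disk into the in-transit area.
copy_jdl = make_jdl("/opt/d-cache/dcap/bin/dccp", nfs_file + " " + transit_dir, "dccp")
# Step 2: command to pull the file to the local site.
fetch_cmd = globus_copy_command(transit_file, "/tmp/" + name)
# Step 3: JDL for removing the temporary copy from dCache.
remove_jdl = make_jdl("/bin/rm", transit_file, "rm")

print(copy_jdl)
print(" ".join(fetch_cmd))
print(remove_jdl)
```

Keeping the command construction separate from execution makes it possible to inspect exactly what would be run before trusting the underlying services with real data.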
    


GRID UK Meeting on Monday 19 March at Oxford

We had a meeting to talk about short and medium term computing issues at RAL T1.

Date:  Monday 19 March
Time:  10:00 - 16:00
Where: Oxford. Fisher room (next to common room)

Agenda

Short term issues

Medium term issues

Actions: GRID UK Meeting on Monday 19 March at Oxford