Metadata links and notes

For Jack: Metadata browsing for event selection

Grouping events by metadata into meaningful divisions lets users build event collections with common characteristics that are considered uniform(*) across the collection. Such groupings arise at various stages: from online data taking through data analysis and potentially multiple stages of offline reanalysis. The aim is collections that are, on their own, sufficient for analysis.

(*) 'Uniformity' includes a common detector/readout configuration and trigger configuration. Potentially non-uniform factors such as beam conditions and calibrations can be leveled across different collections, creating even larger sets of events and forming more statistically significant samples from which the probability of processes present in the sample can be assessed more precisely.

COOL (see talk by A. Valassi, this conference): a database schema and API designed to be technology neutral (Oracle/MySQL/SQLite). The data structure is hierarchical: data are entered into a particular folder and channel, with a specific Interval Of Validity (IOV) and tag. => These form a unique set of coordinates for the datum or resource.
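
A minimal sketch of that coordinate model (illustrative only -- not the real COOL/PyCool API; the folder name, tag, and payload below are made up):

# Illustrative sketch only -- not the real COOL/PyCool API.
# A conditions value is addressed by (folder, channel, tag) plus an
# interval of validity [since, until); all names below are invented.
from collections import defaultdict

class ToyConditionsDB:
    def __init__(self):
        # (folder, channel, tag) -> list of (since, until, payload)
        self._store = defaultdict(list)

    def store(self, folder, channel, tag, since, until, payload):
        self._store[(folder, channel, tag)].append((since, until, payload))

    def retrieve(self, folder, channel, tag, point):
        """Return the payload whose IOV contains 'point', or None."""
        for since, until, payload in self._store[(folder, channel, tag)]:
            if since <= point < until:
                return payload
        return None

db = ToyConditionsDB()
db.store("/Example/RunCtrl/SOR_Params", channel=0, tag="EXAMPLE-TAG-01",
         since=90272, until=90300, payload={"RunType": "Physics"})
print(db.retrieve("/Example/RunCtrl/SOR_Params", 0, "EXAMPLE-TAG-01", point=90280))
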
My talk:


An Integrated Overview of Metadata in ATLAS 
   
Content: Metadata--data about data--arise in many contexts, from many diverse
sources, and at many levels in ATLAS. Familiar examples include run-level,
luminosity-block-level, and event-level metadata, and, related to processing and
organization, dataset-level and file-level metadata, but these categories are
neither exhaustive nor orthogonal. Some metadata are known a priori, in advance of
data taking or simulation; other metadata are known only after processing--and
occasionally, quite late (e.g., detector status or quality updates that may appear
after Tier 0 reconstruction is complete). Metadata that may seem relevant only
internally to the distributed computing infrastructure under ordinary conditions may
become relevant to physics analysis under error conditions ("What can I discover
about data I failed to process?"). This talk provides an overview of metadata and
metadata handling in ATLAS, and describes ongoing work to deliver integrated
metadata services in support of physics analysis. 

 
Dr. MALON, David (Argonne National Laboratory)
Dr. ALBRAND, Solveig (LPSC, Grenoble)
Dr. GALLAS, Elizabeth (University of Oxford)
Dr. TORRENCE, Eric (University of Oregon) 

Event Selection Services in ATLAS 
   
Content: ATLAS has developed and deployed event-level selection services based upon
event metadata records ("tags") and supporting file and database technology. These
services allow physicists to extract events that satisfy their selection predicates
from any stage of data processing and use them as input to later analyses. One
component of these services is a web-based Event-Level Selection Service Interface
(ELSSI). ELSSI supports event selection by integrating run-level metadata,
luminosity-block-level metadata (e.g., detector status and quality information), and
event-by-event information (e.g., triggers passed and physics content). The list of
events that pass the physicist's cuts is returned in a form that can be used
directly as input to local or distributed analysis; indeed, it is possible to submit
a skimming job directly from the ELSSI interface using grid proxy credential
delegation. Beyond this, ELSSI allows physicists who may or may not be interested in
event-level selections to explore ATLAS event metadata as a means to understand,
qualitatively and quantitatively, the distributional characteristics of ATLAS data:
to see the highest missing ET events or the events with the most leptons, to count
how many events passed a given set of triggers, or to find events that failed a
given trigger but nonetheless look relevant to an analysis based upon the results of
offline reconstruction, and more. This talk provides an overview of ATLAS
event-level selection services, with an emphasis upon the interactive Event-Level
Selection Service Interface. 


Dr. MALON, David (Argonne National Laboratory)
Dr. CRANSHAW, Jack (Argonne National Laboratory)
Dr. GALLAS, Elizabeth (University of Oxford)
Dr. HRIVNAC, Julius (LAL, Orsay)
Dr. KENYON, Michael (University of Glasgow)
Dr. MAMBELLI, Marco (University of Chicago)
Ms. MCGLONE, Helen (University of Glasgow)
TIQUE AIRES VIEGAS, Florbela (CERN)
Dr. ZHANG, Qizhi (Argonne National Laboratory) 
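
Rough sketch of the kind of event-level TAG query described in this abstract. The table layout, column names, trigger/DQ flags, and event reference are invented stand-ins, and SQLite here stands in for the relational TAG database that ELSSI actually queries:

# Illustrative only: hypothetical TAG schema, SQLite as a stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tags (
    run INTEGER, lumi_block INTEGER, event INTEGER,
    missing_et REAL, n_loose_electron INTEGER,
    ef_passed_em INTEGER,     -- trigger decision (0/1)
    good_lumi_block INTEGER,  -- detector status / data-quality flag (0/1)
    ref_aod TEXT              -- navigational reference back to the event data
)""")
conn.execute("INSERT INTO tags VALUES (90272, 17, 12345, 62.3, 2, 1, 1, 'AOD.pool.root#evt12345')")

# Cuts combine run/LB-level quality with event-level quantities; the
# result is a list of event references usable as input to a skim job.
rows = conn.execute("""
    SELECT run, event, ref_aod FROM tags
    WHERE good_lumi_block = 1
      AND ef_passed_em = 1
      AND missing_et > 50.0
    ORDER BY missing_et DESC
""").fetchall()
print(rows)
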
 



define Metadata
-- summary information (event counts at various stages)
-- references to other data
-- 

need to have data where/when you need it.

Usage
-- data processing
-- data analysis
-- integrity checks
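
Sketch of a file-level metadata record in the sense above -- summary information (event counts at various stages) plus references to other data. Field names and values are illustrative, not an ATLAS schema:

# Illustrative record only; all field names and values are invented.
from dataclasses import dataclass, field

@dataclass
class FileMetadata:
    guid: str                     # unique file identifier
    dataset: str                  # owning dataset name
    events_written: int           # events in this file
    events_input: int             # events read by the producing job
    parent_guids: list = field(default_factory=list)  # provenance: input files
    conditions_tag: str = ""      # reference into non-event metadata

rec = FileMetadata(guid="A1B2-C3D4", dataset="data.someStream.AOD",
                   events_written=4821, events_input=5000,
                   parent_guids=["E5F6-0708"], conditions_tag="EXAMPLE-TAG-01")
print(rec.events_input - rec.events_written, "events filtered out")
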

A Challenging User!
-- The ATLAS physicist? ...
-- Fast, efficient, accurate queries
-- Reliable navigation to event data
-- Seamless integration with analysis

Performance and Scalability Tests
for Relational TAG Database
-- Large scale realistic tests to uncover challenges brought with scale
-- Optimise and measure performance:
   Management
   Partitioning
   Indexing
   Optimizer hints
   Multiple clients
   Query patterns
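
Sketch of that kind of measurement (indexing and query patterns), with SQLite as a stand-in and an invented table layout; partitioning and optimizer hints from the list above are server-side features not modelled here:

# Illustrative measurement only; schema and numbers are invented.
import sqlite3, time, random

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (run INTEGER, event INTEGER, missing_et REAL)")
conn.executemany("INSERT INTO tags VALUES (?, ?, ?)",
                 [(90000 + i % 50, i, random.uniform(0, 200)) for i in range(200000)])

def timed(query):
    t0 = time.time()
    n = len(conn.execute(query).fetchall())
    return n, time.time() - t0

query = "SELECT event FROM tags WHERE run = 90017 AND missing_et > 150"
print("no index  :", timed(query))
conn.execute("CREATE INDEX idx_run_met ON tags (run, missing_et)")
print("with index:", timed(query))
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
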





file metadata
-- in file
-- processing history (provenance)

non-event metadata
-- generally in databases,
keyed by IOV or Run/LB
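
Sketch of Run/LB keying: a (run, lumi block) pair packed into one ordered key (a common run<<32 | LB packing convention), with the valid payload found by bisection. The folder contents below are invented:

# Illustrative only; payload contents are invented.
from bisect import bisect_right

def runlb_key(run, lb):
    return (run << 32) | lb

# Non-event payloads valid from a given (run, LB) onwards, sorted by key.
intervals = [
    (runlb_key(90272, 1),  {"lar_status": "GOOD"}),
    (runlb_key(90272, 57), {"lar_status": "YELLOW"}),
    (runlb_key(90273, 1),  {"lar_status": "GOOD"}),
]
keys = [k for k, _ in intervals]

def lookup(run, lb):
    i = bisect_right(keys, runlb_key(run, lb)) - 1
    return intervals[i][1] if i >= 0 else None

print(lookup(90272, 60))   # -> {'lar_status': 'YELLOW'}
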

MIF is evil

but we all do it...

dataset nomenclature



AMI

Conditions DB

POOL files

TDAQ

trigger configuration

Computing challenges
-- Improvements in processing speed have not been matched
by improvements in jobs' access to the data required for processing.
-- Metadata, properly formed and used, can supply jobs with
the information they need to facilitate faster processing.
-- Challenges are greater on a grid, where network latency/failure
are real obstacles, in real time comparable to or greater than
the execution time.
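
Sketch of that point: consult metadata to decide, before any remote reads, which inputs a job actually needs. The catalogue entries, file names, and trigger names are invented:

# Illustrative only; catalogue contents are invented.
file_catalogue = [
    {"lfn": "data.AOD._0001.pool.root", "events": 5000, "triggers": {"EF_e10": 212, "EF_mu10": 0}},
    {"lfn": "data.AOD._0002.pool.root", "events": 5000, "triggers": {"EF_e10": 0,   "EF_mu10": 340}},
    {"lfn": "data.AOD._0003.pool.root", "events": 4800, "triggers": {"EF_e10": 198, "EF_mu10": 12}},
]

def inputs_for(trigger):
    """Return only the files that can contribute events for this trigger."""
    return [f["lfn"] for f in file_catalogue if f["triggers"].get(trigger, 0) > 0]

# A muon-trigger skim never opens file _0001, saving a remote read.
print(inputs_for("EF_mu10"))
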










17/04/2007 08:29

Dear All,

next week will be the Data Quality workshop at CERN [1] 
[1] http://indico.cern.ch/conferenceDisplay.py?confId=13869

on Tuesday and Wednesday, 
and the Physics Analysis Tools workshop at Bergen [2]. 
[2] http://indico.cern.ch/conferenceDisplay.py?confId=11987

On both occasions, metadata issues will be on the agenda. 

We will take the opportunity at Bergen and have a metadata session, 
integrated with the PAT workshop on Wednesday afternoon and possibly 
extending into the evening. 
This offers a good chance to discuss database aspects and analysis 
aspects of metadata together. 

The aim is to answer metadata implementation questions certainly 
on the coarse level (which database/file to use for which data) 
and in as much detail as possible. We should also answer 
the open issues from the document of the Metadata Task Force [3] 
[3] https://edms.cern.ch/cedar/plsql/doc.info?cookie=6290895&document_id=833723

and list any additional metadata items that have come up meanwhile. 

A TWiki [4] 
[4] https://twiki.cern.ch/twiki/bin/view/Atlas/MetaDataImplementation

collects material on metadata implementation. 
Please also see Giovanna's recent talk from the ATLAS overview week [5].
[5] http://indico.cern.ch/conferenceDisplay.py?confId=11266#20

Here is a list of subjects to be covered, which can also serve as draft agenda:
 - introduction/review of metadata work
 - metadata provided by trigger, run control, and LHC
 - metadata provided/used by Data Quality and throughout reco/analysis
 - inventory of metadata stored in COOL
 - database/file for event, lumi block, and run level metadata
 - database for file and dataset level metadata, relation to DDM
 - inventory of metadata stored in event data files
The idea is to review the status of each subject, to show the implementation 
as far as it is in hand and the remaining work, and to always check against 
the analysis needs - during the session and throughout the PAT week! 
At the end of the week we should be in a position to make all important 
implementation decisions.

Several of the database people will be attending the PAT workshop, 
others are involved in the DQ workshop. 
A phone conference has been set up to join the Bergen session. 


   Hans




------
Hans von der Schmitt
ATLAS Experiment 
MPI for Physics, Munich    Tel: +49-89-32354-358, Fax: -305
CERN  Office: 40-3-D12     Tel: +41-22-76-71255,  Fax: -78350
mailto:Hans.von.der.Schmitt@cern.ch









*** Discussion title: Streaming Test 2006

Dear streaming test people.

Please can you give me some insight on the metadata parameters returned
by the streamtest merging tasks?

In principle, and in all the tasks I have monitored up until now,
"maxEvents" gives the maximum possible number of events in the file, and
"events" is the actual number. So for tasks where there is some
filtering, one does not expect these two numbers to be the same.

However it is of course perfectly valid for the two numbers to be the
same.

I have written code which records both values, and in particular I
record "events" in the file table.

(By the way - I hope you have all noticed that AMI is cataloging files
for the StreamTest?)

I notice that in these merge jobs, ONLY the AOD files have returned the "events" parameter, and that it has the same value as the maxEvents parameter.
 Is this done on purpose?

It means that although I can mark the number of events per file for the AOD files, I cannot for the TAG or the SAN, unless I put in some acrobatic code to analyse the datatype of each file as it goes by, and this would be needed specifically for "merge" tasks.

I have up to now also been summing the values of "events" over the task,
just by using the events parameter I find in the first file of the list
for each job. Here again, I cannot do it for these tasks unless I
specifically look for the AOD file.

 So - briefly - is there a reason that the "events" parameter is  only returned for AOD in these tasks? Could it just be due to a little error in the transformation?

 Solveig.

PS - below I have pasted the XML returned by a merge job. Note that only
the AOD file has the parameter "events" returned.

>
> --
> *************************
> Solveig Albrand-Paget
> LPSC
> 53, Avenue des Martyrs
> 38026, GRENOBLE cedex,
> FRANCE
> *******************
> solveig.albrand@lpsc.in2p3.fr
> Tel: (+) (33) (0)4 76 28 41 25
> ********************

>
> [Pasted XML from the merge job: the markup was stripped in this copy. Only one
> AOD file name fragment survives:
> 004967.Overlap.merge.AOD.v12000605_tid009514._00001.pool.root.1]
>

-------------------------------------------------------------
Visit this Atlas message (to reply or unsubscribe) at:
https://hypernews.cern.ch/HyperNews/Atlas/get/StreamingTest2006/36.html
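
Sketch of the bookkeeping Solveig describes: record "events" per output file and sum over the task, falling back to "maxEvents" only where no filtering was applied. The XML layout and attribute names here are invented, since the pasted XML did not survive in this copy:

# Illustrative only; the element and attribute names are invented.
import xml.etree.ElementTree as ET

xml_text = """
<task>
  <file lfn="...AOD...pool.root" type="AOD" maxEvents="250" events="250"/>
  <file lfn="...TAG...pool.root" type="TAG" maxEvents="250"/>
  <file lfn="...SAN...pool.root" type="SAN" maxEvents="250"/>
</task>
"""

total = 0
for f in ET.fromstring(xml_text).findall("file"):
    events = f.get("events")
    if events is None:                  # e.g. the TAG and SAN files here
        events = f.get("maxEvents")     # fallback; only safe if no filtering was applied
    total += int(events)
print("events recorded for task:", total)
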





The CMS Offline condition database software system 
   
Content: Non-event data describing detector conditions change with time and come
from different data sources. They are accessible by physicists within the offline
event-processing applications for precise calibration of reconstructed data as well
as for data-quality control purposes. Over the past three years CMS has developed
and deployed a software system managing such data. Object-relational mapping and the
relational abstraction layer of the LHC persistency framework are the foundation;
the offline condition framework updates and delivers C++ data objects according to
their validity. A high-level tag versioning system allows production managers to
organize data in a hierarchical view. A scripting API in Python, command-line tools
and a web service serve physicists in daily work. A mini-framework is available for
handling data coming from external sources. Efficient data distribution over the
worldwide network is guaranteed by a system of hierarchical web caches. The system
has been tested and used in all major productions, test-beams and cosmic runs. 
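
Rough sketch of the hierarchical tag idea in that abstract: a top-level tag resolving to per-record leaf tags, each owning IOV-ordered payloads. The names, structure, and values are invented, not the CMS condition database API:

# Invented illustration of hierarchical tag versioning; not the CMS API.
global_tags = {
    "IDEAL_V1": {"EcalPedestals": "EcalPedestals_ideal_v1",
                 "SiStripNoises": "SiStripNoises_ideal_v2"},
}
leaf_tags = {
    "EcalPedestals_ideal_v1": [(0, {"mean": 200.0})],                     # (first valid run, payload)
    "SiStripNoises_ideal_v2": [(0, {"noise": 4.5}), (1000, {"noise": 5.1})],
}

def get_payload(global_tag, record, run):
    leaf = global_tags[global_tag][record]
    valid = [p for since, p in leaf_tags[leaf] if since <= run]
    return valid[-1]                    # latest IOV starting at or before 'run'

print(get_payload("IDEAL_V1", "SiStripNoises", run=1200))   # -> {'noise': 5.1}
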

 




    
Elizabeth Gallas (ATLAS)
Last modified: Tue Mar 10 21:08:37 GMT 2009