ESCAPE QoS fortnightly meeting


Minutes for 2020-09-02 T2.2 meeting


Aleem Sarwar
Enrico Vianello
Frederic Gillardo
Marek Szuba
Mario Lassnig
Martin Barisits
Nadine Neyroud
Paul Millar
Rizart Dona
Riccardo Di Maria
Xavier Espinal



Workplan for the pilot datalake assessment (next milestone)

Round table
Streams info can be found at:

Stream A: prototyping and demonstration

Datalake performance dress rehearsal (Paul)
Met with the task leaders this morning; more detail will be given at the upcoming WP2 meeting on Wednesday (09.02.2020).
3 QoS classes out of 5, with at least 2 classes to be covered in the coming milestone.
Testing in the middle phase.
Data life cycle to demonstrate that QoS is working.
Data injection could be online or through some tests.
Yellow circles represent task numbers.
What would be achieved, and how?
Focus on data life cycle.
It would be nice to cover more than one experiment.
In total we have 5 QoS classes and can choose 3; the choice is up to us.
We can try Tape and Disk, plus either one of the remaining three or a totally new class.
FAIR QoS demo, round 2: batch mode (Marek)

Managed to run the demo (dummy analysis, using a multi-file input DID and three different QoS classes: AOD, results, logs) in batch mode. Rucio clients can indeed authenticate using a proxy rather than user certificates (auth_type = x509_proxy in rucio.cfg, plus running voms-proxy-init beforehand), which avoids storing unencrypted X.509 user keys on shared systems. This works fine, with virtually no overhead compared to auth_type = x509 (voms-proxy-init has to be run in advance anyway for file transfers to work);
Batch jobs use Singularity to work around difficulties with installing Rucio and its dependencies (esp. gfal2) on GSI batch farms without admin privileges. Worked fine on the local/desktop machine; on the batch farm there were originally problems. Reason: gfal and Rucio authentication currently work independently, i.e. X.509 credential locations set in rucio.cfg are not propagated to upload/download steps. Opened an issue on Rucio GitHub requesting that the Rucio auth configuration be propagated to protocols. Meanwhile, use the environment variable X509_USER_PROXY to point the gfal2 Python bindings to the proxy cert;
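A minimal sketch of the workaround described above, for illustration only. It assumes that auth_type belongs in the [client] section of rucio.cfg and uses a hypothetical proxy path; the helper names are not part of Rucio or gfal2. The idea is simply to select proxy authentication in the config and to export X509_USER_PROXY in the environment of the job, so that the gfal2 bindings can find the same proxy.

```python
import configparser
import io
import os


def client_config_fragment():
    """Render a minimal rucio.cfg fragment selecting proxy-based auth.

    Assumption: auth_type lives in the [client] section.
    """
    cfg = configparser.ConfigParser()
    cfg["client"] = {"auth_type": "x509_proxy"}
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


def env_with_proxy(proxy_path, base_env=None):
    """Return a copy of the environment with X509_USER_PROXY set.

    The returned dict can be passed as env= to subprocess.run() so that
    upload/download steps see the proxy created by voms-proxy-init.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["X509_USER_PROXY"] = str(proxy_path)
    return env


if __name__ == "__main__":
    # Hypothetical proxy location; voms-proxy-init reports the real one.
    proxy = "/tmp/x509up_u1000"
    print(client_config_fragment())
    print(env_with_proxy(proxy, base_env={})["X509_USER_PROXY"])
```

Inside a Singularity container the variable would also have to be visible to the containerised process, and the proxy file itself must be readable from within the container.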
Impressions from this round:
when the Rucio server works fine, everything is smooth - but it has in the past occasionally become extremely slow (an API call takes five minutes). When that happens, uploading a file + tagging it for QoS takes ages,
no feedback from the CLI on when a replication rule for QoS has been satisfied;
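One possible way to get such feedback on the client side is to poll the rule until it is satisfied. The sketch below is an illustration under assumptions, not something done in the demo: get_rule is a hypothetical callable returning a dict with a "state" key, and the rule is taken to be satisfied when that state is "OK". With the Rucio Python client, the get_replication_rule method could presumably play this role.

```python
import time


def wait_for_rule(get_rule, rule_id, timeout=600.0, interval=10.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll a replication rule until it is satisfied or the timeout expires.

    get_rule: callable taking a rule ID and returning a dict with a
              "state" key (assumption: "OK" means the rule is satisfied).
    Returns True if the rule reached "OK", False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if get_rule(rule_id)["state"] == "OK":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Fake rule lookup standing in for a real client call.
    states = iter(["REPLICATING", "REPLICATING", "OK"])
    done = wait_for_rule(lambda rid: {"state": next(states)},
                         "hypothetical-rule-id", sleep=lambda s: None)
    print("rule satisfied:", done)
```

Injecting sleep and clock keeps the function testable without waiting; in a batch job the defaults would be used as they are.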
Next step: try to do this with a real workflow, to be prepared together with CBM people involved in WP5.
Riccardo: in August there were operational issues with Rucio, caused by a software upgrade and by moving the service to different hardware.
Paul: Could this use case be run periodically somewhere as a performance metric for the Rucio server? Marek: will have to think where at GSI to run it but in principle it’s entirely doable.
Paul: Consider running data transfers/tagging outside batch jobs. Marek: Makes very much sense, but it would require a much more complex set-up, beyond the scope of this demo.

Stream B: engagement with experiments

CBM responses to our questions about their QoS draft (Marek)
Volker got back on the questions:
Q1. RAW_HOT data only stored at FAIR or replicated elsewhere?
Most likely / primarily this will stay on-site, there might be some cases where some of the data will be replicated elsewhere.
Q2. Will RAW_HOT data from the last two years be accessed uniformly?
Q3. Will AODs from the last five years be accessed uniformly?
Some topics become more popular. So, for AOD, the patterns are difficult to predict.  AOD_COLD might make sense.
Q4. Extending the percentage of produced SIM data to be kept by adding opportunistic storage?
The currently quoted percentage (10%) is a guesstimate, CBM must analyse this in much more detail (esp. in the context of rare probes). Therefore, cannot say at this point whether adding an additional 5% via opportunistic storage would be helpful.

PS. The PANDA people have begun working on their QoS document and will likely get the first draft out soon.

Stream C: software developments

API testing (Aleem)
Trying the Rucio API using the username + password authentication method to get a Rucio auth token to perform different operations. When querying Rucio, the response contains an exception message, i.e. "Cannot authenticate with given credentials".
Rizart is looking into this, but it is perhaps best to look into it through the dep-ops team.
Aleem: Will create a ticket on Jira and raise the issue on the ESCAPE chat.
CRIC meeting (Martin)
Happy to have a meeting, so the next step is to organise a specific meeting time.
Everyone is welcome, but the topic will likely be very technical, and not generally of interest.


The next meeting will be on 16th Sep 2020 at the same time (14:00-15:00 CEST).



    • 2:00 PM 2:10 PM
      News 10m
      Speaker: Paul Millar (DESY)
    • 2:10 PM 2:50 PM
      Round Table 40m

      Stream A: Prototyping and demonstration

      Stream B: Engagement with experiments

      Stream C: Software development

    • 2:50 PM 3:00 PM
      AOB 10m