Calcul ATLAS France (CAF)

Timezone: Europe/Paris
Location: CC-IN2P3 (in practice Vidyo only)
Chairs: Frederic DERUE (LPNHE Paris), Luc Poggioli (LAL Orsay)

Description: Calcul ATLAS France group meeting

 

Video-conference connection information:

Connection from a personal device (Windows, OS X, tablet, smartphone, ...): http://desktop.visio.renater.fr/scopia?ID=729898***9731&autojoin
Scopia Desktop installation manual: doc_scopia_desktop-fr.pdf
IP: 194.214.202.146
Telephone or ISDN: +33 (0)9 88 83 00 07
SIP: sip:729898@195.98.238.109
H.323: h323:729898@mgmt.visio.renater.fr
Conference number: 729898 (end with #)
Access code: 9731 (end with #)

 

MINUTES CAF MEETING 18/12/2019
https://indico.in2p3.fr/event/19846

Remote-only meeting (due to the public transport strikes)
Morning:    Aresh, Arnaud, Catherine, Eric, Fred, Jean-Pierre, Laurent, Luc, Sabine, Stéphane
Afternoon:  Arnaud, Eric, Fred, Laurent, Luc, Manoulis, Sabine, Stéphane
Apologies:  Mihai

Morning session:

1) Intro (Fred)
- next meetings:
     - ML IN2P3 CEA workshop, 22-23 January
     - S&C computing week (with "site flavour"), 10-14 February

- ATLAS resource usage since the previous CAF (3 months) is similar to usual: >400k running slots on the grid, dominated by MC Full Sim. No peaks of jobs run on HPC.

- The RRB approved the ATLAS provisional computing resource request for 2021:
increases of 10% for tape, 15% for CPU and 20% for disk.

- OTPs for the FR-cloud provided, see tables
  -> discussion on whether we need to take action to get a better "squad/management" ratio, currently 0.5/0.67 (FTE). Until ADC gives a clear statement on which contributions are relevant for each part, we will keep these numbers.
  -> it would be good to know whether other clouds have dedicated people, paid by specific projects, for the squad (in particular)

- CAF chair + deputy: Fred's first year as chair ends in March. He will ask the ATLAS France groups for another two-year term. During this period a deputy is not mandatory/needed, but one will be needed a year before a change of chair is announced.

2) FR-T2-cloud (Fred)
  - regular/monthly reports available on
    https://cernbox.cern.ch/index.php/s/vrq0bs2qJGY72NV

Stable and good period
       - FR-cloud = 18.4% of the T2s over this period - was 17.6% in the previous period
       - normal profile of jobs received, by activity
       - by country in the FR-cloud: Japan = 49%, France = 45%, Romania = 4%, China = 2%
         usual ranking among the French sites (1st IRFU, 2nd CPPM)

       - CPU efficiency shown for the whole FR T2 cloud, i.e. sum of CPU time over sum of wallclock time (see the sketch at the end of this list)
            - shown for the different types of jobs for all sites
            - shown for EvGen for each site
            - bug in ATLAS monitoring found by Stéphane (solved by Tadashi) for some types of jobs on ARC-CEs. It explains the low efficiencies seen in previous months at TOKYO and CPPM (in particular) for EvGen jobs. The problem is fixed and efficiencies are improving.
       - pledges: some numbers have been updated (increased).
         Numbers in the tables come from REBUS; need to cross-check with CRIC.
         For the disk pledge, LAL was missing ~250 TB in 2019; it will be installed asap.
         Need to check in particular the level of free space, which seems a bit high (a few hundred TB).
       - CPU vs pledge for the different sites
            - no problem seen in the numbers
            - monitoring plots have issues because the reported pledges are too high. A fix is ongoing.

       - transfer matrix: efficiency low as input, in particular for LPC
          -> very asymmetric (as input vs as destination)
          -> need to check perfSONAR more regularly

       - DOME: only LAPP and LPSC went all the way (version + gridftp redirector);
                       the others still need the gridftp redirector
       - GGUS tickets: normal traffic, mostly for transfer/deletion errors
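
For reference, a minimal sketch of the "sum of CPU over sum of wallclock" calculation quoted above. The job records and field names are illustrative, not the actual monitoring schema, and whether wallclock is weighted by the number of cores depends on the monitoring convention (core-weighted here):

    # Minimal sketch: CPU efficiency = sum(cpu time) / sum(wallclock time).
    # Job records and field names are illustrative, not the monitoring schema.

    def cpu_efficiency(jobs):
        """Return sum(cpu_seconds) / sum(wall_seconds * cores) for a list of job records."""
        total_cpu = sum(j["cpu_seconds"] for j in jobs)
        total_wall = sum(j["wall_seconds"] * j.get("cores", 1) for j in jobs)
        return total_cpu / total_wall if total_wall else 0.0

    # Example with made-up numbers for two jobs:
    jobs = [
        {"cpu_seconds": 7_200, "wall_seconds": 3_600, "cores": 4},  # 50% efficient 4-core job
        {"cpu_seconds": 3_400, "wall_seconds": 3_600, "cores": 1},  # ~94% efficient 1-core job
    ]
    print(f"CPU efficiency: {cpu_efficiency(jobs):.1%}")  # -> 58.9%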

Question by Eric for the Tier-2s
  -> LPC, LPSC and CPPM are using the squid of ATLAS@CC, which can represent a heavy load
  -> could these sites migrate to their own squid?
  -> need to check the squid versions of each T2 - including non-FR ones (see the sketch below)
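
A minimal sketch of one possible way to check which squid version a site is running: most squids advertise their version in the Via (or Server) HTTP response header of requests that pass through them. The host names, ports and probe URL below are placeholders, not the actual T2 endpoints, and the probe URL must be one the squid's ACLs allow:

    # Minimal sketch: probe a list of squid proxies and report the version string
    # they advertise in the "Via"/"Server" HTTP response headers.
    # host:port values and the probe URL are placeholders, not real endpoints.
    import urllib.request

    SQUIDS = {
        "site-A": "http://squid.example-siteA.fr:3128",
        "site-B": "http://squid.example-siteB.fr:3128",
    }
    PROBE_URL = "http://example.org/"  # any small URL the squid's ACLs allow

    for site, proxy in SQUIDS.items():
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy})
        )
        try:
            with opener.open(PROBE_URL, timeout=10) as resp:
                via = resp.headers.get("Via", "") or resp.headers.get("Server", "")
                print(f"{site:10s} {via or 'no squid version advertised'}")
        except Exception as exc:
            print(f"{site:10s} probe failed: {exc}")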

3a) Report WLCG (Laurent, no slides)
    -> recap of the LCG-FR highlights
    -> 10% reduction of the LCG-FR budget for 2020, from 2 M€ to 1.8 M€.
        Given the hardware prices obtained recently, the foreseen pledges
        could still be met despite this.

3b) Tour of sites and AOB

CPPM: nothing to report
IRFU: working on 100 Gbps
LAL: pledges to be installed
LAPP: ongoing developments for FR-Alpes, also including LPC and CPPM
LAPP: nothing to report
LPNHE: new hardware bought, in particular a 100 Gbps switch
LPSC: ALICE storage stopped, to be reused for ATLAS
L2IT: nothing to report

4) CAF-user meeting (scheduled in the afternoon, but done in the morning)
  - ML: need to check in each laboratory the need for specific ATLAS France training,
            and also who has followed which training until now (SOS, ...)
  - S&C involvement: tables provided for each lab + summary based on OTPs
       -> many numbers are available only for the 1st semester: multiplied by two when needed
       -> some contributions may not appear in the OTPs and hence in these tables, for example ACTS/tracking or DOMA-FR. Please check/add these contributions
         -> numbers to be checked by each lab
         -> globally 13 FTE for software, 12 FTE for computing

5) ATLAS data management policy document (Fred) (scheduled in the afternoon, but done in the morning)
    - required by CC for all experiments - for the end of this year
    - draft available for comments by CAF members (link in slides),
      and for additions of text (some parts not yet filled)
           -> most of the data management is covered by the existing WLCG MoU
           -> what we have to write is "our" management of data on sps, LOCALGROUPDISK
                and LOCALGROUPTAPE, describing what we have been doing in practice for a while
                   - data on sps not accessed for more than a year are copied to
                     LOCALGROUPTAPE
                     and then removed, with an automatic (already existing) procedure
                     (see the sketch after this section)
                   - after discussion: no automatic cleaning on LOCALGROUPDISK and LOCALGROUPTAPE
                   - add sentences mentioning the non-automatic cleaning, decided by the CAF and the storage czar when needed
            -> the responsible persons are the storage czars
    - comments by CAF members within two weeks, then the document goes to the group leaders
    - Eric will show the document to the "storage" people to check that the answers go in the right direction
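
For illustration only, a minimal sketch of the "not accessed for more than a year" selection behind the sps procedure described above. The path is a placeholder, and the actual copy to LOCALGROUPTAPE and the removal are done by the existing CC-IN2P3 procedure, not by this script:

    # Minimal sketch: list files under an sps area whose last access time is older
    # than one year, i.e. the candidates that the existing automatic procedure
    # copies to LOCALGROUPTAPE and then removes. The path is a placeholder and the
    # actual archiving/removal is handled by the existing CC-IN2P3 tooling.
    import os
    import time

    SPS_AREA = "/sps/atlas"                 # placeholder path
    CUTOFF = time.time() - 365 * 24 * 3600  # "not accessed for more than a year"

    candidates = []
    for root, _dirs, files in os.walk(SPS_AREA):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.stat(path).st_atime < CUTOFF:
                    candidates.append(path)
            except OSError:
                pass  # file vanished or unreadable; skip it

    print(f"{len(candidates)} files not accessed for more than a year")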

 

6) FR-T1-cloud (Fred) (scheduled in the afternoon, but done in the morning)
- Very good period with 99% availability & reliability
     -> part of the remaining "inefficiency" occurs during scheduled downtimes, so it should not decrease the numbers
- CC = 11.7% of the T1s (was 11.5% last period);
  CPU by activity shown for all T1s & CC - more GroupProduction
- ATLAS is using more than its pledges at CC (~120%), except in February
       -> the average job efficiency, from the decisionnel, is ~92%, higher than what is
            reported by grafana/atlas - not counting pilot inefficiencies etc.? (see the illustration after this list)
- the number of ATLAS running job slots is increasing: ~15k jobs/day
- ATLASDATADISK ~10 PB, ATLASDATATAPE ~10 PB, ATLASMCTAPE ~10 PB
- 1 GGUS ticket for the T1 in the last period
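
A short illustration, with made-up numbers, of the point raised above: if the ~92% figure counts only payload CPU over payload wallclock, adding the pilot's own overhead to the wallclock would give a lower number, closer to what the other monitoring may be reporting:

    # Made-up numbers, purely illustrative of the accounting difference.
    payload_cpu  = 92.0          # arbitrary units
    payload_wall = 100.0
    pilot_overhead_wall = 10.0   # assumed extra pilot wallclock (setup, waiting, stage-in/out)

    eff_payload = payload_cpu / payload_wall
    eff_pilot   = payload_cpu / (payload_wall + pilot_overhead_wall)
    print(f"payload-only efficiency: {eff_payload:.1%}")  # 92.0%
    print(f"pilot-level efficiency:  {eff_pilot:.1%}")    # ~83.6%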

 

======
Afternoon session
1) CC-IN2P3

1b) CC status (Manoulis, no slides)
   - grand unification of the queues is complicated on Grid Engine;
     easier on HTCondor, which already hosts 23% of the LHC pledges
   - an Oracle licensing problem can affect Frontier at CC;
     tests are ongoing to check using only the Frontier servers at CERN

1c) CC usage (Fred)
   - storage, based on the decisionnel:
         - dCache: 10197 TiB allocated / 9591 TiB used
         - HPSS tape: 17988 TiB used
         - sps: 357 TiB allocated / 157 TiB used
                  ==> migration to Isilon done successfully in November
                  ==> all files have been accessed (by the migration), so we need to wait before
                           restarting the deletion policy!
   - LOCALGROUPDISK: 525 TB allocated / 350 TB used
   - LOCALGROUPTAPE: 250 TB used
   - share of CPU per user: ATLAS users on batch represent 1.06% (2.09%) of all ATLAS jobs (grid included) in November (over the whole year).
       The previous top users are decreasing.
       One (rather) new user consumed 50% of the user CPU in November.
   - user batch usage: the drop in usage continues since the beginning of the year;
     ~180 jobs on average over the last period, with up to 16k jobs requested
         -> no priority/share between users. Sending too many jobs at once can be bad for other ATLAS users.
 

 
