Calcul ATLAS France (CAF)

Europe/Paris
CC-IN2P3

(Virtual)
Frederic DERUE (LPNHE Paris)
Description
Meeting of the Calcul ATLAS France group (Web site)
 
Connect to the conference via Zoom (default)
Connect from an individual device

https://cern.zoom.us/j/63873288488?pwd=aE5zK21tWWRsdmRsRWEyU3YrOTg5UT09

Access code

MINUTES CAF MEETING 24/06/2021
                                    https://indico.in2p3.fr/event/24152


Remote (morning)        Aresh, Arnaud, David B., David C., Fred,
                                      Jean-Pierre, Laurent, Pierre-Antoine,
Remote (afternoon)      Aresh, Arnaud, Catherine, David B., Fred, Laurent, Pierre-Antoine
Apologies :  Andrea, Stéphane, David C. (afternoon), Jean-Pierre (afternoon)

Morning session:

1) Intro (Fred)
   - ATLAS resource usage since previous CAF (3 months) similar to usual :
     500k running slots, dominated by MC (~75%), on grid (at ~110-120% of pledge).
   - High disk usage as usual, 270 TB pledged
     unpopular AOD will be put on tape (see afternoon T1 presentation)
   - ATLAS France web site : https://atlas-fr.pages.in2p3.fr
    based on GitLab/MkDocs
   - OTP Class3/4 computing : to be provided by 5th July

   - news from ICB :
        - renewal/update of ICB Funding Agency contacts
        - search committee for ICB chair election
        - campaign on SW&C human resources : discussions with FA contacts
                 -> discussion with Fred some weeks ago (mail was sent to CAF with content)
                 -> from conveners : lack of persons on "middle level" management,
                      e.g. CRC, DDMOps, Rucio, CRIC etc. where we had more impact in the past
                      (with Sabine, Luc, Stéphane)
                 -> from our side : we have ongoing efforts on ACTS, not always even
                     counted in our S&C involvement tables, which do not correspond to any OTPs
                          ==> OTPs are more for production/every day work than for R&D,
                                   but conveners are aware of this. They also expect/need more
                                   practical return of R&D in ACTS, AI/GPUs etc. in Athena.
                                   Missing area in ACTS is muon reconstruction
                                   (-> IRFU was in muon reco years ago but not anymore,
                                   now in muon alignment)
                           ==> can partly correspond to the two new WGs for ICB (on R&D projects,
                                    and on a sustainable pool of experts in computing & software)
                           ==> possible discussion with CAF/group leaders & SC conveners during
                                    the CAF-user meeting in December

     - campaign/prospective on IN2P3 engineers' "métiers" (job profiles) -> PECTIN
                -> feedback from master projects (for us "ATLAS/L. Serin" and "LCG-FR")
                         ==> needs for ASR engineers handled by LCG-FR
                         ==> checked with Pierre-Antoine that "métiers" for AMI
                                  are taken into account
                         ==> Fred's feedback to L. Serin is on lack of long term personpower
                                  on how to access HPC machines (hardware + software)
     - Oracle/AMI : end of campus license by April 2022
            -> meeting in April between S&C conveners, Eric, Andrea, Fred
                       - AMI primary instance can move to CERN instead of CC-IN2P3
                               (ongoing, should be done by September)
                             use of CC as secondary (not even sure it is absolutely necessary)
                                 with use of existing CC individual license (to be checked by Eric)
                        - can ease life of some conditions database replicas
                          (to be followed by Andrea)
                         - follow-up in September ?
      - pledges for 2022 : to be prepared for end of September
          2021->2022 : +10% for cpu/disk, +20% for tape
                       ==> large part of the step was done at CC from 2020->2021
                                 should not be a problem, but wait for final allocation key
                                 ("clef de répartition") between LHC experiments
                        ==> for CC usage (sps, batch, gpu): please anticipate
                                 any unusual increase in requests

   - next CAF-user annual workshop : 9 Dec  -> by doodle
         afternoon session : readiness for Run3  -> to be better defined
                                        1h discussion with S&C convener + ICB chair on personpower

2) FR-T2-cloud (Fred)
  - regular/monthly reports available on
    https://cernbox.cern.ch/index.php/s/vrq0bs2qJGY72NV
  - FR-cloud = 16% of T2s on this period
   - normal/usual profile of jobs received by activities
   - by country in FR-cloud : France=53%, Japan=37%, Romania=7%,
                                             China=2%, Hong-Kong=1%
         unusual ranking among French sites, mostly due to IRFU, which put its cooling in safe mode
       - CPU vs pledge for different sites
            - no problem seen on numbers on French sites
       - Storage vs pledge : ok - GRIF-LAL had deployed storage but not updated space token
       - ggus tickets: normal traffic, mostly for transfer/deletion errors

3) Reports
 3a) LCG-FR (Laurent):
      - HL-LHC network challenges, driven by DOMA-TPC, for this autumn
           -> check which monitoring, centralized or not at CERN
           -> check/update French network weathermap to extend the one of Renater
                to include sites not on LHCONE (e.g. LAPP), or sharing LHCONE/public network
           -> CC is adding monitoring on dCache to capture all internal traffic
      - LCG-FR protocol ongoing
      - IJCLab+LLR now at 100 Gbps
      - concerns about hardware prices in recent months - but at CERN
        with consumer-grade machines ?
      - FR-grid certificate: will no longer be handled by Renater but by the same provider
        as for the CNRS one -> main issue is for hardware certificates more than for users


  3b) DOMA-FR

 4) Tour des labos
  CPPM : some upgrades, discussion with DU for future LCG-FR convention +
               visit of DAS computing (Sabine) next week

   GRIF-IJCLab: token in place to really use the storage pledge

   GRIF-IRFU: purchases for 2022 ongoing,
                       one old server machine with old racks (>10 years) not used.
                       Need infrastructure update
                       need a replacement for Jean-Pierre in CAF
  GRIF-LPNHE: cooling system is old/working badly. Quote awaited,
                          but a medium-term upgrade is needed.
                          For the moment 40% of (old) machines stopped (~30% less cpu)
                          Ongoing purchase of storage to benefit from current storage market
   LAPP:
   LPC: problem with the space token value on LOCALGROUPDISK
   LPSC : will reduce personpower on OTPs
   L2IT : 

Afternoon:
  1) T1 cloud report (Aresh)
      - good availability/reliability, except in May (96%) due to the FR-grid acl issue
      - FR T1 represents 12% of all T1
      - significant increase of cpu (over pledge) since Feb, after fix of HTCondor in Jan.
      - users should not use GGUS tickets but the standard ATLAS email list
      - sps usage : see https://gitlab.in2p3.fr/ccin2p3-support/formations/storage/11-2019-slides_storage/-/blob/master/ccin2p3-storage-best-practices-sps.pdf
      - Frontier R&D
            - DBMAMI incident in May
      - short-lived data : unpopular AOD will be put on tape, ~200-300 TB per quarter


   2) CC (Fred):
       - sps has 360 TB, 100 TB free and 150 TB with data not accessed for
       a year -> will be copied to ATLASLOCALGROUPTAPE
       - local batch : runs smoothly, up to 3000 jobs pending, which then run
                ~4% of ATLAS cpu is done by local users -> mostly by a few users
          from Eric : better to check accounting on
          http://cctools.in2p3.fr/mrtguser/mrtguser/atlas/sge_project_atlas_atlas.html


       - gpu usage : 8 users = 4000 h in Jan-June = 50% of pledge
                             Hope it will stay like this/increase
                             May include part of L2IT usage counted as "lab" ?
       - hpc farm usage : one user at lpnhe, ~20000 HS06 hours for generation/Sherpa
                  -> farm is not heavily used
                  -> don't expect much increase, and hard to predict

 3) Software (all)
      -> in CAF some (Arnaud, Pierre-Antoine) are involved in software for
           performance (b-tag, jet).
      -> development of a configuration system based on ComponentAccumulator
           to simplify the use of job options
            see presentation at S&C week
            https://indico.cern.ch/event/1042632/#6-flavor-tagging-software-stat
      -> see presentations/developments by Arnaud
https://its.cern.ch/jira/browse/ATLPHYSVAL-766
on the b-tagging and software side : https://its.cern.ch/jira/browse/ATLPHYSVAL-767?focusedCommentId=3867717&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-3867717
and
https://its.cern.ch/jira/browse/ATLPHYSVAL-772


  4) ADAM

 3) HPC (Fred/Laurent/Eric)
     - pie-chart with HPC contribution by country -> no France !
                               HPC represents 19% of ATLAS cpu
     - discussion on access to Exascale machine : ongoing effort/collab between
       CC and IDRIS
 

5) Machine Learning (all)
 
