Calcul ATLAS France (CAF)



Frederic DERUE (LPNHE Paris), Luc Poggioli (LAL Orsay)

Calcul ATLAS France group meeting
web wiki

Videoconference connection information:

Telephone or ISDN: +33 (0)9 88 83 00 07
SIP: sip:724139@
Conference number: 724139 (finish with #)
Access code: 7790 (finish with #)



     MINUTES CAF MEETING 17/06/2019

At CC:           Eric, Manoulis, Luc, Fred, Emmanuel, Sabine, Stéphane
Remote:         Laurent, Jean-Pierre (morning), Catherine (morning)
Apologies:  Mihai

Morning session:

1) Intro (Fred)
- ATLAS resources usage since the previous CAF (2 months) similar to usual: >300k running slots on the grid, dominated by MC Full Simulation, with peaks of jobs done on HPC
- Network: from the WLCG overview board, the CERN-SURFsara link will be abandoned in 2022. It is used as an LHCOPN backup for Korea, Taiwan and Russia, and as an LHCONE link for Japan, Korea, Taiwan and Russia
- OTPs for the first semester of 2019
    -> discussion on where to put the contribution of all CAF members and/or of the scientists
        in charge of T2s, as part of Class 4 (OTP for T1&T2) or Class 3 (Cloud management)
             ==> not always clear where to put contributions. In practice will add A. Duperrin
                     (0.05 FTE) in Class 4 as scientist in charge of the T2 at CPPM,
                     + Stéphane, Sabine, Manu at 0.05 FTE in Class 3 (Cloud management),
                               --> there was discussion on taking care that the sum of FTEs for Cloud
                                     management stays at the same level as Cloud support
             ==> will increase some numbers in Class 4 (e.g. LPSC) due to the heavier load of the
                     CentOS7 migration + make the names of engineer members of
                     ATLAS appear (e.g. LPNHE)
       ==> the new numbers will be sent to CAF members by email
- Abstracts for conferences:
       - one submitted to CHEP on the FR-Alpes/DOMA activities
       - one talk at ICASC in Romania in September
   ==> need to collect other presentations, in particular on software/Machine Learning etc.

2) FR-T2-cloud (Fred)
Stable and good period
       - FR-cloud = 16.2% of T2s over this period (April to mid-May) - was 15.5% over the previous period
       - normal profile of jobs received, by activity (more MC FastSim) and by number of cores
       - by country in the FR-cloud: France = 46%, Japan = 43%, Romania = 8%, China = 3%
            LPC seems low (7% of the total in France) wrt available resources
       - no details given on CPU done vs pledge - TBD next time
       - storage (DATA+SCRATCH and LOCAL) on the grid is shown
       - SAM/ASAP tests shown, but averaged over 2.5 months - followed
         by the squad more regularly
       - transfer matrix efficiency: good in general, even though there were network issues in this
         period, but need to check why source=IT has low efficiency towards all our sites

       - CentOS7 migration: now done for all sites
             -> deployment of Singularity is complicated, lack of information/wiki
                  the default version for ATLAS works but could ...
       - IPv6 deployment: last one done was RO-07 at the end of May
       - DOME migration
              - pilot sites with v1.12 mid-April: LAL, IRFU
              - difficult ongoing migration for LPC, whereas it was easy for AUVERGRID
              - no site (also in other clouds) went to the very end of the process,
                i.e. SRM-less
       - site issues & GGUS tickets
            -> many tickets in this period with all these migrations (CentOS7 / DOME)
            -> case of RO-02, which received tickets to remove its SE from Rucio + deploy CentOS7,
                but gave no answer/feedback

3) Database
  Discussion about AMI at CC: CERN and CC are renewing their Oracle licenses;
  Oracle is used by AMI. CC (Eric) would like to ensure that if AMI stays at CC - which is the
  baseline - it will remain there for a while.
  A presentation is given by J. Odier (LPSC) on the usage of AMI at CC by different experiments.
  AMI can use different DBs, SQL-based or not (including Oracle).
  It is necessary to get "GoldenGate" to replicate the DBs for ATLAS-AMI from CC to CERN.
  There is a will to keep AMI at CC.

4) Reports
4a) WLCG (Laurent)
  - feedback from the WLCG overview board: difficult to make plans for Run-3 due to the different scenarios for the LS and Run-3 start/end dates - which could also affect HL-LHC plans. Proposal to follow the baseline scenario, except for tapes in case more storage would be needed.
  - funding for Run-3: 7 FAs gave feedback - OK for France, while the UK has troubles.
  - LCG-France :
        - CC-LHCOne link upgraded to 40G and validated
        - CoDir on the future of the sites: uncertainties for LPSC, TBD for LPC
        - France/IN2P3 prospective exercise: GT-09 "Calcul & données"
                 -> need to subscribe to the mailing lists + prepare a contribution for CAF/ATLAS
                     (possibly to be provided by 20th July ...)

4b) Tour of sites and AOB
IRFU : CREAM-CE stopped, only ARC left
       fairly long downtime, air-conditioning problem
LAPP : network outage two weeks ago
       busy with the migrations + test bench for CHEP
       DOME migration on FR-Alpes first
LPSC : CentOS7 farm not filled
       new network core
       DOME migration on FR-Alpes first
CPPM : migration to DPM 1.12 in progress
       no SRM-less!
       default Singularity configuration
LPNHE : air-conditioning issue
LAL   : DOME migration mid-April

5) next CAF-user meeting
28th November at IPN Lyon
1st draft of the agenda shown -> need to leave the morning session more open to global computing/software discussions (it was too ML-oriented), add an overview talk on GPU usage in the afternoon + keep a few presentations on ML

Stéphane : R&D computing for HL-LHC is not part of IN2P3 upgrade days
           -> need to speak about it during this meeting
           -> also in ATLAS-FR contribution to GT-09 (see at the end)

Afternoon session
1) CC-IN2P3
1a) FR-T1-cloud (Fred)
- Very good period (availability & reliability)
- CC = 11.6% of the T1s (was 9.9% last period)
  CPU by activity is shown for all T1s & CC - more EventGen and FastSim, less FullSim
- ATLAS was barely using its pledge in February but reached 126% in May
   (mostly due to lack of usage / problems with the batch farm)
- Perturbations in the NFS system affecting the batch system are visible in April + a downtime in May

1b) CC status (Manoulis)
   - Local storage: just one unreadable tape; only the raw precious files (~1.9k raw files) were
     recovered, from the CASTOR replica
   - HTCondor installation in progress; work ongoing on accounting and fair share
   - Rucio automatically recovers suspicious files
        -> this can trigger useless transfers and deletion requests if the data are actually there
           and the pool is just offline for some reason irrelevant to the data status,
           e.g. a data/machine migration
        -> proposal by Manoulis to switch off this feature by declaring unscheduled downtimes
           "at risk", to be discussed with the DPM people (Cédric)
News and updates from the GDB on DOMA TPC
  - LOCAL BATCH AND SPS SPACE: CC asked ATLAS for more tests of sps under Isilon; in practice,
    people who already sent tests will probably not redo the same ones. A priori all was fine.
    Need to get more tests by the end of summer?
  - New SPS auto-deletion policy
    - Highlights:
      A new policy is proposed for the auto-deletion of data that have not been accessed for more
      than 2 years ==> it will be in place by early next year
     => need to fill the DMP (data management plan) as provided by CC
     => need to provide the ATLAS policy on data usage (to be written asap)
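The criterion behind the proposed auto-deletion policy can be sketched as follows. This is a hypothetical illustration only, not CC's actual tooling: it flags files whose last-access time (atime) is older than two years, which could then be archived to tape.

```python
import os
import time

# ASSUMPTION: a simple atime-based scan; CC's real implementation is not described
# in the minutes. Two years approximated as 2 * 365 days.
TWO_YEARS = 2 * 365 * 24 * 3600  # seconds

def stale_files(root, now=None):
    """Yield paths under `root` not accessed for more than two years."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or unreadable: skip it
            if now - atime > TWO_YEARS:
                yield path

# Example (hypothetical sps path):
# for p in stale_files("/sps/atlas"):
#     print(p)   # candidate for migration to tape
```

Note that such a scan only makes sense on filesystems that actually update atime (no `noatime` mount option), which would need to be checked on the Isilon setup.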


  1c) CC usage (Fred)
   - storage, based on the "decisionnel" accounting:
         - dcache: 9998 TB allocated / 9902 TB used; NB: CC will add 100 TB so that all pledges are visible
         - hpss tape: 14553 TB used
         - sps: 320 TB allocated / 169 TB used
                  ==> ~19 TB of user data not accessed for >2 years; could be cleaned (i.e.
                  moved to tape) when switching to Isilon
                  ==> need to ask the groups about their sps usage for next year (and 2021)

   - LOCALGROUPDISK: 525 TB, among which 197 TB left
   - LOCALGROUPTAPE: 250 TB used
   - the share of CPU usage between the different ATLAS projects, as provided by the "decisionnel", is shown
   - share of CPU per user: the numbers for real users - not pilot jobs - show that only a few users
     (~20) have a notable contribution since May. The main user is a PhD student from LAPP with ~40% of
     the user CPU usage; the 2nd is at 14% ....
            ==> this explains why we easily see peaks of submission rather than a constant flux
            ==> this PhD student is finishing his doctorate, so he is sending fewer jobs, which can
                     also explain the drop in global ATLAS batch usage
   - usage of the batch system: on average 144 jobs sent during the last period (very peaky), vs 339 jobs in the previous period, and 612 in the one before

2) HPC -> preparation of the GPU request (Fred)
  - slides on the "ATLAS usage of GPU" shown at the GPU workshop @CC on 4th April
  - 3 analyses were able to give an estimate of the time requested
       -> only one was partially done on the CC farm
       -> each represents 1-1.5 GPU-years
  - need to prepare a request for September based on this. The request is needed
    to ensure ATLAS will have access for its different users, but should not be too large (need to ensure we will use what we request). Typically 4 GPU-years represent ~10% of the GPU farm
              - the time also depends on the type of GPU used
              - need to ensure that these three analyses will still be needed next year
              - check other analyses
                     ---> need to resend an email to ATLAS-FR for a last round of information
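The sizing above can be checked with a quick back-of-the-envelope calculation. The farm size used here is an assumption inferred from "4 GPU-years ~ 10% of the GPU farm" (implying roughly 40 GPUs); it is not a number stated by CC.

```python
# Back-of-the-envelope check of the GPU request sizing.
# ASSUMPTION: a farm of ~40 GPUs, inferred from "4 GPU-years ~ 10% of the farm";
# this number is not stated in the minutes.
farm_gpus = 40
gpu_years_per_analysis = 1.5   # upper estimate quoted per analysis
n_analyses = 3
request_gpu_years = n_analyses * gpu_years_per_analysis
share = request_gpu_years / farm_gpus  # fraction of the farm over one year
print(f"request = {request_gpu_years} GPU-years, i.e. {share:.1%} of the assumed farm")
```

With these assumptions the three analyses alone would ask for a bit more than 10% of the farm, consistent with the "not too much" concern above.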


Actions to be taken
  - mail to the group leaders (-> then to the groups)
       - sps usage by lab/group for 2020 and 2021, based on the previous survey (last autumn)
       - GPU usage: more numbers if possible
    NB: aside, target in particular D. Rousseau and his student to ensure they will send jobs
  - data management policy
       - the DMP document from CC to be filled
       - prepare a similar document with the ATLAS-FR policy
  - GT-09 prospective exercise
       - subscribe to the GT-09 mailing list
       - prepare a document for the ATLAS-FR contribution - to be sent by 20th July?
  - list of talks/seminars on software & computing




