Calcul ATLAS France (CAF)



Frederic DERUE (LPNHE Paris), Luc Poggioli (LAL Orsay)

Meeting of the Calcul ATLAS France group

Video-conference connection details:

Phone or ISDN +33 (0)9 88 83 00 07
SIP sip:725299@
Conference number  725299 (end with #)
Access code  8503 (end with #)



                                 MINUTES CAF MEETING 01/04/2019

At CC:           Eric, Manoulis, Luc, Fred, Emmanuel
Remote:         Laurent (morning), Mihai (morning), Jean-Pierre (morning),
                      Catherine (afternoon), Sabine (afternoon)
Apologies:     Stéphane

Morning session:

1) Intro (Fred)
LCG-FR funding is being transferred to the labs
Pledges : installed at all or most sites. Sites that are late have been contacted and/or are installing their pledges now
Network issues seen on LHCOne in March are now solved

2) FR-T2-cloud (Fred, Emmanuel)
Stable and good period
       - FR-cloud = 14.5% of T2s over this period (Feb-March), up from 11.8% in the previous period
       - normal profile of jobs received, 83% of them 8-core
       - Tokyo = 32% of FR-cloud T2, up from 22.7% in the previous period
       - SAM tests / ASAP metric : red for some time for RO-02 due to an air-conditioning problem (now solved)
 - list of CE and batch systems for sites, CREAM/ARC - pbs/arc;
   some sites with htcondor or slurm
 - CentOS7 migration : to be done by 4*RO, LPNHE, LPC, LPSC
 - DOME migration : some sites now, others during the summer
 - lightweight site : limit for 2019 is 520 TB; concerns Beijing, RO-02, HK,
   which are below the limit
 - analysis vs production : the low number of analysis jobs seen last time is now increasing
   (detailed list of fractions for each site). Differences between sites are hard to understand.
   To be followed, but with less urgency
 - site issues : mostly linked to data transfer errors and FTS, often caused by network problems.
   For most issues, a hardware restart solves the problem
 - perfmuons token : in contact with the muon WG; still waiting for a proper cleaning so that
   the two DDM space tokens at CC and Tokyo are not excluded
 - Files/directories out of token : some directories outside the standard tokens were seen at some sites.
   After a check with C. Serfon, they can be deleted.
 - Google cloud at Tokyo : 1 slide + link to the full presentation on the successful
   tests done in Tokyo

3) Reports
3a) Jamboree (Fred)
  - sites configuration & recommendations : taken from slides of A. Filipcics
         -> information also reported during the FR-T2-Cloud report and the tour of sites
  - Sites/storage evolution : extract from slides of S. Campana (WLCG 2019) on evolution
    in 5 years. A questionnaire will be/has been sent by LCG-FR to all sites
  - lightweight sites : reminder of the definition of lightweight sites (<520 TB in 2019), already
    RO-14, RO-16 + possibility to send jobs via BOINC
            -> as more sites will become diskless, different scenarios are presented +
                 link to DOMA-FR
  - monitoring and analytics : reminder of the most commonly used Grafana links
  - Shifts : CRC + ADC -> contact sites directly more often, not the squad
  - Database : one-slide summary of the parallel sessions

3b) WLCG / DOMA (Laurent)
  - summary of the QoS (Quality of Service) issue - studied at CC
  - DOMA in France
         ATLAS data federation CC-LAPP-LPSC
         Testbed FR-Alpes
         CEPH tests on GRIF
NB: see also the WLCG summary at ADC Tech coordination :

3c) Tour of sites
IRFU: migrated to DOME but sees instabilities; ongoing decommissioning of CREAM-CE
          (mail sent 3 days after the CAF meeting);
          upgrade of ARC-CE; change of pump in cooling system

LAL: pledged equipment installed (or soon to be); migration to DOME in April
LPNHE: migration to CentOS7 in May, to DOME this summer
LPSC: first migration to DOME then to CentOS7 and HTCondor
RO-07: - file storage/deletion problems on some machines due to the ongoing update
               of the DPM+DOME packages
             - today's configuration:
                  1. CREAM-CE for ATLAS single core production - shared with LHCb -
                      machine name is
                  2. CREAM-CE for ATLAS with 2 queues, analysis and multi-core simulations -  
                      machine name is
                  3. ARC-CE+SLURM - dedicated to ATLAS multi-core jobs -
                      machine name is
                  4. ARC-CE+HTCondor+Docker - has 2 "queues", for both analysis
                      and production - machine name is

              - first, install a new ARC-CE with the new arc6 software, with HTCondor
                as LRMS. Create a new CentOS7 Docker image.
                A new queue will soon be requested for this machine
              - then start migrating the WNs from the CREAM-CE to the new ARC-CE
              - then migrate the ARC-CE+SLURM to CentOS7.
              - the ARC-CE with HTCondor is already on CentOS7, but the Docker image
                running on the WNs is SL6 (which is why a new CentOS7 image was created)
             - migration to DOME still to be done

4) next CAF-user meeting
Two possible weeks: 18-22 Nov or 25-29 Nov 2019
pre-book rooms (amphitheatre) at CC
send a Doodle poll to group leaders to converge before the end of April

Afternoon session
1) CC-IN2P3
1a) FR-T1-cloud (Fred, Emmanuel)
- Very good period (availability & reliability)
- CC = 15.9% of T1 (was 14.8% last period) - TRIUMF still high
- CC delivered 110% of ATLAS request in January but only 90% in February
- a drop in the number of jobs is seen in the CC and Grafana tools
- some issues, mostly linked to LHCOne problems
- ATLASDISK : 7.5 PB, ~290 TB free

1b) CC status (Manoulis)
   - lower activity, as seen in the FR-T1-cloud report; a priori fewer jobs were sent
     no issues seen (good efficiency, running vs pending)
   - report on progress of the HTCondor farm
   - details on the ATLAS unified queue (atlasufd) on Harvester+Grid Engine
        -> fine, even if not all T1 production queues can be unified
             (common maxMEM and wallclock values are needed for each GE queue;
              e.g. SCORE IN2P3-CC_CL7_HIMEM and SCORE IN2P3-CC_CL7_VVL)
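
The unification constraint noted above (queues can only be merged when they share maxMEM and wallclock values) can be illustrated with a minimal Python sketch. The two IN2P3-CC queue names come from these minutes; the EXAMPLE_* queues and all numeric limits are hypothetical, chosen only for illustration, and the real values are not given here.

```python
from collections import defaultdict

# Hypothetical limits, for illustration only: the real maxMEM and
# wallclock values of these queues are not stated in the minutes.
QUEUES = {
    "IN2P3-CC_CL7_HIMEM": {"max_mem_mb": 8000, "wallclock_h": 48},
    "IN2P3-CC_CL7_VVL": {"max_mem_mb": 4000, "wallclock_h": 96},
    "EXAMPLE_QUEUE_A": {"max_mem_mb": 4000, "wallclock_h": 48},
    "EXAMPLE_QUEUE_B": {"max_mem_mb": 4000, "wallclock_h": 48},
}


def group_by_limits(queues):
    """Group queue names by their (maxMEM, wallclock) pair."""
    groups = defaultdict(list)
    for name, limits in queues.items():
        key = (limits["max_mem_mb"], limits["wallclock_h"])
        groups[key].append(name)
    return {key: sorted(names) for key, names in groups.items()}


groups = group_by_limits(QUEUES)
# Only groups with more than one member are candidates for unification.
unifiable = {key: names for key, names in groups.items() if len(names) > 1}

for (mem, wall), names in unifiable.items():
    print(f"maxMEM={mem} MB, wallclock={wall} h: {', '.join(names)}")
```

With these made-up limits, the HIMEM and VVL queues each stay alone because they differ in memory or wallclock, which matches the reason given in the minutes for not unifying every T1 production queue.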

1c) CC usage (Fred)
   - 291 TB allocated, 160 TB used - should have 350 TB soon
   - automatic cleaning of 31 TB to LOCALGROUPTAPE
   - Tests of Isilon : done by Marc and Emmanuel
           -> no differences seen between analysis jobs on GPFS vs Isilon
           -> ask them (+Konie) to stress the system a bit more with more jobs
                on the batch system
   - LOCALGROUPDISK : 525 TB, of which 200 TB are free
   - LOCALGROUPTAPE : 240 TB used -> starting to be non-negligible
                                         1 user asked for long-term storage of analysis ntuples
                                         (but the procedure needs to be tested)
  - LOCALGROUPDISK-MW for SM : 75 TB; reminder that this hardware is
    no longer under warranty
  - usage of batch system : slightly fewer jobs on average than in the previous CAF period
       usage statistics compiled by Manoulis for all users show no problem:
      Median wallclock and queue time of the jobs are not too high, ~O(1h)
      A high submission rate (one user in particular) saturates the resources
      (up to the slot limit)
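
As a quick cross-check of the storage figures quoted in 1a and 1c, the short Python sketch below converts them into occupancy fractions. It assumes 1 PB = 1000 TB (the minutes do not say whether units are decimal or binary) and takes the "~290 TB free" on ATLASDISK at face value; the label "allocated space (1c)" is just a placeholder for the unnamed 291 TB area.

```python
def used_fraction(total_tb, used_tb=None, free_tb=None):
    """Return the used fraction of a storage area, given used or free space."""
    if used_tb is None:
        if free_tb is None:
            raise ValueError("give either used_tb or free_tb")
        used_tb = total_tb - free_tb
    return used_tb / total_tb


# Figures from the minutes, in TB. 7.5 PB is taken as 7500 TB (assumption).
areas = {
    "ATLASDISK": used_fraction(7500, free_tb=290),
    "allocated space (1c)": used_fraction(291, used_tb=160),
    "LOCALGROUPDISK": used_fraction(525, free_tb=200),
}

for name, fraction in areas.items():
    print(f"{name}: {fraction:.1%} used")
```

Under these assumptions ATLASDISK is about 96% full, the 291 TB allocation about 55% used, and LOCALGROUPDISK about 62% used.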


2) HPC (Fred)
  - slides on "ATLAS usage of GPU" for the GPU workshop @CC on 4th April
  - review of "small" HPC and GPU located "near" FR laboratories, at universities,
    CC and CEA.
  - IDRIS :
        - reminder of the tests done on the old IDRIS machine by Manoulis in 2018;
          reports available on the agenda
        - new Jean Zay machine will become operational in summer 2019
          70k cores + ~1000 GPUs
          ==> need to see if/how/for what to access this new machine
  - EuroHPC: slides from WLCG on the EuroHPC project
  - slide with a link to the CERN/PRACE workshop in November 2018




