CAF Meeting



Luc Poggioli (LAL Orsay)
Phone or ISDN +33 (0)9 88 83 00 07
SIP sip:724048@
Conference number 724048 (end with #)
Access code 2472 (end with #)


                                 MINUTES CAF MEETING 8/02/2019

By phone:     LaurentD, Romania (morning), Jean-Pierre (morning), Manu
At CC:        Sabine, Manoulis, David Bouvet, Fred, Catherine (afternoon), Luc

1) INTRO T2s (Luc)
-Stable and good period
- Less WT for FR-cloud due to higher CERN HLT contribution (shutdown period)
Sites status:
- OK for all (SAM, prod & analysis availability) except RO-02 (long UDT due to broken chiller)
Analysis share:
- ATLAS requests 25% WT for analysis and 75% for production at T2s
- Not the case in FR-cloud, where the average is 8.5%. Same for all ATLAS T2s: a general ATLAS problem in getting more analysis jobs entering & running. In FR the problem is even more critical for GRIF sites. Not understood yet.
Sites issues:
- LPC many issues with Rucio access, missing files, IPV6 migration. In addition very low production yield this period.
- RO-07 issues with IPV6 (BREN issue). Switch back to IPV4 for the moment.
- Beijing No increase of DATADISK. Critical
- RO-02 Long UDT & no increase of DATADISK. Critical
- USTC-T2 transfer errors. Responsible party hard to identify
For sites:
- Migration to DPM DOME. Some issues (ATLAS aware). To be discussed in detail in LCG-TECH
- UCORE for all sites except Beijing, RO-02, RO-14. Some problems in deployment
- Harvester submission. Some problems in deployment (LAPP)
- CentOS7 migration. Deadline from ATLAS: June 1st. All sites but Romania x4, LPNHE, LPSC (not top priority due to manpower), LPC
- SE Dumps: Still to be provided on a regular basis, every 1-3 months
- IPV6 migration OK in FR
- Off-site Squids removals (Historical, used as failover). OK for all FR-cloud sites, except RO-02 (offline). 
  Manoulis will email sites for dedicated ports to be opened
- Lightweight sites: Recommendation (ICB-2018) to redirect funding from storage to CPUs. Hard limit 460 TB, 2019: 520 TB.
  Concerned: Beijing & RO-02
- SCRATCHDISK: New ATLAS recommendation: 100TB per 1000 slots for analysis (ie 25% total slots)
- GRAFANA monitoring. Encourage to use and give feedback
- CAF to foster new S&C activities/interest. From CAF/PAF day. CAF rep. to collect information
- Sites Jamboree 5-8 March, CERN
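
The SCRATCHDISK recommendation above can be turned into a quick sizing check. This is only a sketch, assuming the rule reads as "100 TB per 1000 analysis slots, with analysis slots being 25% of a site's total slots"; the function name and the 4000-slot example site are hypothetical and not taken from the minutes.

```python
def recommended_scratchdisk_tb(total_slots,
                               analysis_fraction=0.25,
                               tb_per_1000_analysis_slots=100.0):
    """Estimate the SCRATCHDISK size (in TB) for a site.

    Assumption: the recommendation means 100 TB per 1000 analysis slots,
    where analysis slots are 25% of the site's total slots.
    """
    analysis_slots = total_slots * analysis_fraction
    return analysis_slots * tb_per_1000_analysis_slots / 1000.0


if __name__ == "__main__":
    # Hypothetical site with 4000 slots in total -> 1000 analysis slots
    print(recommended_scratchdisk_tb(4000))
```

Under this reading, a site with 4000 slots would need about 100 TB of SCRATCHDISK.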

2) SQUAD REPORT T2s (Manu)
- Stable and good period

3) GPU AT CC (Nicolas Fournials)
- Difference & gain between CPU and GPU architecture
- Today 40 GPU cores (36 batch, 4 interactive). Soon 24 more GPU deployed (20 batch, 4 interactive)
- Software available: CUDA 9.2 (NVIDIA proprietary) and OPENCL 1.2
- For each GPU 1 interactive node for code optimization, + several workers
- Parallel/multi-nodes jobs using OpenMPI 
- NB: Queue access granted to local users on demand
- Request: Today none for ATLAS, 2M hours request (biology,...). Only 25% is available today!
- Accounting: more complicated than with CPU. 
- Doc:
- Each CAF rep. should poll their lab to collect the requests for 2019.
- NB: This farm is not intended to become a GPU facility/farm

4) LCG-FR FEEDBACK (Laurent)
- All French sites under IPV6
- LHCOPN at 100Gb/s (but only 40 allocated). IPV6 traffic already ~ IPV4
- LHCONE not yet at 100Gb/s
- More precise picture of the Data Lake model available: built around an archival center, a few Data & Compute Centers (DCC) with large disk storage and CPU resources, and Compute Centers (CC), mostly CPU, with cache (CCC)
- proxy-cache with failover functionality, located at CCC 

5) FEEDBACK FROM LABS
- RO-02 waiting for chiller repair (delays in funding)
- Problems with pilot handling (LAPP, LAL)
- LAPP/LPSC DOMA new spacetoken (FR_ALPES). Intend doing test with Hammercloud jobs (head server at lapp, rest at LPSC)
- LPSC CRIC network (between LPSC & Renater) at 40Gb/s
- IRFU new switch at 100Gb/s
- LPNHE Can use old EDF servers -> allows buying more disks & fewer servers. Also Cloud computing tests (400 slots from unpledged resources + OpenStack)
- CPPM decommissioning of CREAM CE soon (only ARC left)

6) INTRO T1 (Luc)
- Very good period (availability & reliability)
- 12.5k slots average. A bit lower than previous period to account for ATLAS share adjustment 
- CC delivered 115% of ATLAS request for 2018! Thanks!!

- Higher relative contribution of TRIUMF wrt previous periods
- Some issues using the new GRAFANA monitoring. All dashboards to be migrated by end 2019.

- +250TB deployed in 2018 -> Total 525TB. ~150TB left (Last 200TB). Mostly used by IRFU (180TB) & LAPP (170TB)
- Actions: Contact individually biggest owners (~7 all active) to ask for some cleaning. In parallel, Manoulis archiving to be run more frequently.
- LGD-MW: Contact with SM group in Saclay but no feedback. Action: Proposal to archive the 75TB to tape; by no means disk renewed from LGD standard.
- Recovery of disk from groups (not ATLAS) moving to new platform. All pledged +200TB recovered by March
- In parallel, action to set up a test with Isilon (replacement of GPFS): 5TB will be available under /sps/atlas_test. Also identify 2-3 heavy sps users to run their code on both GPFS & Isilon to compare the performance (eg many jobs in parallel, jobs with many I/O)
- Reorganisation of quota per group (users request): Not possible
- Some users' code uses pathresolver, which creates extra load on sps access. A trick exists from E. Sauvan at LAPP and M. Escalier at LAL.
- A priori no more SL6 in ATLAS from June 1st (realistic?)
- An SL6 machine available at CC + Manoulis trick to submit from SL7 + container
- It works well and resource deployed are adequate
- Discussion to try and improve the batch system, e.g. limiting the max number of jobs submitted by one user.
  Btw a fairshare per user is impossible to implement (CC statement)
- In parallel, with Manoulis' help, try to extract quantitative information, like waiting time for jobs, ...

    • 10:30 11:15
      Introduction T2 45m
      Speaker: Dr Luc Poggioli (LAL Orsay)
    • 11:15 11:30
      T2 Squad report 15m
      Speaker: Emmanuel Le Guirriec (CPPM)
    • 11:30 12:00
      GPU usage at CC 30m
      Speaker: Nicolas Fournials (CC-IN2P3)
    • 12:00 12:30
      Feedback from LCG-FR - R&D for DOMA progress 30m
      Speakers: Catherine Biscarat (LPSC/IN2P3/CNRS), Dr Laurent Duflot (LAL)
    • 12:30 12:45
      Feedback from labs 15m
    • 12:45 13:45
      Lunch 1h
    • 13:45 14:05
      Introduction T1 20m
      Speaker: Dr Luc Poggioli (LAL Orsay)
    • 14:05 14:20
      T1 Squad report 15m
      Speaker: Emmanuel Le Guirriec (CPPM)
    • 14:20 15:20
      Discussion: batch, sps, LGD, SL6/CentOS7 1h
      Speakers: Frederic DERUE (LPNHE Paris), Luc Poggioli (LAL Orsay France)
    • 15:20 15:40
      T1 status 20m