|Phone or ISDN||+33 (0)9 88 83 00 07|
|Conference number||724048 (end with #)|
|Access code||2472 (end with #)|
MINUTES CAF MEETING 8/02/2019
By phone: LaurentD, Romania (morning), Jean-Pierre (morning), Manu
At CC: Sabine, Manoulis, David Bouvet, Fred, Catherine (afternoon), Luc
1) INTRO T2s (Luc)
- Stable and good period
- Less walltime (WT) for the FR-cloud due to a higher CERN HLT contribution (shutdown period)
- OK for all (SAM, prod & analysis availability) except RO-02 (long unscheduled downtime due to a broken chiller)
- ATLAS requests 25% of WT for analysis and 75% for production at T2s
- Not the case in the FR-cloud, where the average is 8.5%. Same for all ATLAS T2s: a general ATLAS problem getting more analysis
jobs entering & running. In FR the problem is even more critical for the GRIF sites. Not understood yet.
- LPC: many issues with Rucio access, missing files, IPv6 migration. In addition, very low production yield this period.
- RO-07: issues with IPv6 (BREN issue). Switched back to IPv4 for the moment.
- Beijing: no increase of DATADISK. Critical.
- RO-02: long unscheduled downtime & no increase of DATADISK. Critical.
- USTC-T2: transfer errors. The responsible party is hard to identify.
- Migration to DPM DOME. Some issues (ATLAS is aware). To be discussed in detail in LCG-TECH.
- UCORE for all sites except Beijing, RO-02, RO-14. Some deployment problems.
- Harvester submission. Some deployment problems (LAPP).
- CentOS7 migration: deadline from ATLAS is June 1st. All done except Romania (x4 sites), LPNHE, LPSC (not top priority due to manpower), LPC.
- SE dumps: still to be provided on a regular basis, every 1-3 months.
- IPV6 migration OK in FR
- Off-site Squid removal (historical, used as failover). OK for all FR-cloud sites except RO-02 (offline).
Manoulis will email sites about the dedicated ports to be opened.
- Lightweight sites: recommendation (ICB-2018) to redirect funding from storage to CPUs. Hard limit 460 TB; for 2019: 520 TB.
Concerned: Beijing & RO-02.
- SCRATCHDISK: new ATLAS recommendation: 100 TB per 1000 analysis slots (i.e. 25% of total slots).
- GRAFANA monitoring. Sites are encouraged to use it and give feedback.
- CAF to foster new S&C activities/interest (from the CAF/PAF day). CAF reps to collect information.
- Sites Jamboree 5-8 March, CERN
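The SCRATCHDISK recommendation above (100 TB per 1000 analysis slots, analysis being ~25% of total slots) can be sketched numerically; the site size below is purely illustrative:

```python
# Minimal sketch of the ATLAS SCRATCHDISK sizing rule quoted in the minutes.
# The 2000-slot example site is hypothetical.

def scratchdisk_tb(total_slots: int, analysis_fraction: float = 0.25) -> float:
    """Recommended SCRATCHDISK (TB) for a site with `total_slots` slots."""
    analysis_slots = total_slots * analysis_fraction   # ~25% of slots run analysis
    return 100.0 * analysis_slots / 1000.0             # 100 TB per 1000 analysis slots

# Example: a 2000-slot T2 -> 500 analysis slots -> 50 TB of SCRATCHDISK
print(scratchdisk_tb(2000))  # 50.0
```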
2) SQUAD REPORT T2s (Manu)
- Stable and good period
3) GPU AT CC (Nicolas Fournials)
- Differences & gains between CPU and GPU architectures
- Today 40 GPU cores (36 batch, 4 interactive). Soon 24 more GPUs deployed (20 batch, 4 interactive)
- Software available: CUDA 9.2 (NVIDIA proprietary) and OpenCL 1.2
- For each GPU, 1 interactive node for code optimization, plus several workers
- Parallel/multi-nodes jobs using OpenMPI
- NB: Queue access granted to local users on demand https://cc-usersupport.in2p3.fr/
- Requests: today none from ATLAS; 2M-hour requests from others (biology, ...). Only 25% of that is available today!
- Accounting: more complicated than with CPU.
- Doc: https://doc.cc.in2p3.fr/jobs_gpu
- Each CAF rep should poll their lab to collect GPU requests for 2019.
- NB: This farm is not intended to become a GPU facility/farm
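As a reminder of how the farm is used, here is a hypothetical sketch of submitting a GPU batch job at CC; the queue name, resource option, and module name are illustrative assumptions only, so check the documentation link above for the actual syntax:

```shell
# Hypothetical GPU job submission sketch (Grid Engine style batch system).
# Queue name, GPU resource option and module name are illustrative, not
# the confirmed CC syntax -- see https://doc.cc.in2p3.fr/jobs_gpu
cat > gpu_job.sh <<'EOF'
#!/bin/bash
module load cuda/9.2      # CUDA 9.2 is the version quoted in the minutes
./my_cuda_app             # user binary built against CUDA 9.2 (placeholder)
EOF
qsub -q gpu_long -l GPU=1 gpu_job.sh   # request one GPU (illustrative options)
```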
4) LCG-FR FEEDBACK (Laurent)
- All French sites under IPV6
- LHCOPN at 100Gb/s (but only 40Gb/s allocated). IPv6 traffic already ~ IPv4
- LHCONE not yet to 100Gb/s
- A more precise picture of the Data Lake model is available: built around archival centers, a few Data & Compute Centers (DCC)
with large disk storage and CPU resources, and Compute Centers (CC) with mostly CPU plus a cache (CCC)
- Proxy-cache with failover functionality, located at the CCC
5) TOUR DES LABOS (All)
- RO-02 waiting for chiller repair (delays in funding)
- Problems with pilot handling (LAPP, LAL)
- LAPP/LPSC: DOMA new spacetoken (FR_ALPES). Intend to run tests with HammerCloud jobs (head server at LAPP, rest at LPSC)
- LPSC CRIC network (between LPSC & Renater) at 40Gb/s
- IRFU new switch at 100Gb/s
- LAL only 20TB SCRATCHDISK
- LPNHE: can use old EDF servers -> allows buying more disks & fewer servers. Also cloud-computing tests
(400 slots from unpledged resources + OpenStack)
- CPPM decommissioning of CREAM CE soon (only ARC left)
6) INTRO T1 (Luc)
- Very good period (availability & reliability)
- 12.5k slots on average. A bit lower than the previous period, to account for the ATLAS share adjustment
- CC delivered 115% of ATLAS request for 2018! Thanks!!
7) SQUAD REPORT T1 (Manu)
- Higher relative contribution of TRIUMF wrt previous periods
- Some issues using the new GRAFANA monitoring. All dashboards to be migrated by end 2019.
- +250TB deployed in 2018 -> total 525TB. ~150TB left (last 200TB). Mostly used by IRFU (180TB) & LAPP (170TB)
- Actions: contact the biggest owners individually (~7, all active) to ask for some cleaning. In parallel, Manoulis' archiving to be run more frequently.
- LGD-MW: contact with the SM group in Saclay but no feedback. Action: proposal to archive the 75TB to tape; the disk will by no means be renewed from the LGD standard.
- Recovery of disk from (non-ATLAS) groups moving to a new platform. All pledged; +200TB recovered by March.
- In parallel, action to set up a test with Isilon (replacement of GPFS): 5TB will be available under /sps/atlas_test. In parallel, identify 2-3 heavy sps users to run their
code on both GPFS & Isilon to compare performance (e.g. many jobs in parallel, jobs with heavy I/O)
- Reorganisation of quotas per group (user request): not possible.
- Some users' code uses PathResolver, which creates extra load on sps access. A workaround exists from E. Sauvan at LAPP and M. Escalier at LAL.
- A priori no more SL6 in ATLAS after June 1st (realistic?)
- An SL6 machine is available at CC, plus Manoulis' trick to submit from SL7 + container
- It works well and the deployed resources are adequate
- Discussion to try and improve the batch system, e.g. limiting the max number of jobs submitted by one user.
However, a per-user fairshare is impossible to implement (CC statement)
- In parallel, with Manoulis' help, try to extract quantitative information, such as job waiting times, ...
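The SL7 + container trick mentioned above can be sketched as follows; the CVMFS image path and script name are assumptions for illustration (ATLAS publishes OS images on CVMFS), not the actual recipe used at CC:

```shell
# Hypothetical sketch: run an SL6 payload from an SL7 node via Singularity.
# The image path and script name below are illustrative assumptions only;
# the actual recipe (Manoulis' trick) may differ.
singularity exec \
  /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-slc6 \
  ./run_sl6_job.sh
```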