Calcul ATLAS France (CAF)

Europe/Paris
CC-IN2P3 (in practice by visio only)
Frederic DERUE (LPNHE Paris)

Description
Meeting of the Calcul ATLAS France group
web wiki

Connection to the conference via Zoom (default)
Connection from an individual device: https://ijclab.zoom.us/j/96346840954?pwd=TXFNbi9mOWREbHc2amdrR0xRQmkxd
Access code: 160198

Connection to the conference via visio-renater (backup)
Connection from an individual device (Windows, OS X, tablet, smartphone ...): http://desktop.visio.renater.fr/scopia?ID=727792***5014&autojoin
Scopia Desktop installation manual: doc_scopia_desktop-fr.pdf
IP: 194.214.202.146
Telephone or ISDN: +33 (0)9 88 83 00 07
SIP: sip:727792@195.98.238.109
H.323: h323:727792@mgmt.visio.renater.fr
Conference number: 727792 (end with #)
Access code: 5014 (end with #)

 

MINUTES CAF MEETING 29/06/2020
https://indico.in2p3.fr/event/21423


Remote (morning): Arnaud, Aresh, Catherine, David B., David C., Eric, Fred,
                  Jean-Pierre, Laurent, Sabine, Stéphane
Remote (afternoon): Arnaud, Aresh, Catherine, David B., Eric, Fred,
                    Laurent, Sabine, Stéphane
Apologies: Andrea

Morning session:

1) Intro (Fred)
  - next meetings: most will be remote due to current restrictions
   - ATLAS resource usage since the previous CAF (3 months) is similar to usual: >400k
     running slots, dominated by MC production on the grid.
   - (Too) high disk usage due to the increase in DAOD production; the N-1 versions
     of DAODs need to be deleted.
   - Status of Covid studies using Folding@Home: 5-10% of ATLAS resources;
     French sites are contributing actively.
   - Web site: could use GitLab to migrate/maintain what is in the wiki,
     but there is little pressure/manpower to do it.
   - OTP as it will be reported to the ICB
        -> also discussed by email
        -> numbers are close to the previous report
   - mailing lists: need more lists to disentangle roles: CAF (~CoDir) with only
     formal CAF members; FRCLOUD for cloud-FR life, with CAF + scientists in charge
     + system admins; and FRSITE restricted to French sites only.

2) FR-T2-cloud (Fred)
  - regular/monthly reports available on
    https://cernbox.cern.ch/index.php/s/vrq0bs2qJGY72NV

Stable and good period
       - FR-cloud = 16-17% of T2s over this period
       - normal profile of jobs received per activity
       - by country in FR-cloud: France=48%, Japan=41%, Romania=7%, China=2%, Hong-Kong=2%
         unusual ranking among the French sites, mostly due to IRFU, which put its cooling
         system in safe mode during the lockdown + possibly an ARC-CE accounting issue

       - CPU efficiency shown for the whole FR T2 cloud, i.e. sum of CPU time over sum of wallclock time
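
         As a minimal sketch (illustrative numbers only, not the actual accounting code),
         the aggregate efficiency quoted above is the ratio of summed CPU time to summed
         wallclock time, which weights long jobs more heavily than a plain average of
         per-job ratios would:

             # Aggregate CPU efficiency = sum(cpu time) / sum(wallclock time).
             # Job records are made up for illustration; for multi-core jobs the
             # wallclock would be multiplied by the number of cores used.
             jobs = [
                 {"cpu_s": 3500.0, "wall_s": 3600.0},  # CPU-bound job
                 {"cpu_s": 1200.0, "wall_s": 3600.0},  # I/O-bound job
                 {"cpu_s": 7000.0, "wall_s": 7200.0},
             ]
             total_cpu = sum(j["cpu_s"] for j in jobs)
             total_wall = sum(j["wall_s"] for j in jobs)
             print(f"CPU efficiency = {total_cpu / total_wall:.1%}")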

       - pledges: all numbers are updated in CRIC
       - CPU vs pledge for the different sites
            - no problem seen in the numbers for French sites, except GRIF-IRFU, which put
              its cooling system in safe mode during the lockdown + possibly an ARC-CE
              accounting issue
       - Storage vs pledge: some sites still have to deploy their pledges
       - GGUS tickets: normal traffic, mostly for transfer/deletion errors

3) FR-T2-analysis (Fred)
  - already discussed at length in email exchanges
  - good/very good efficiency of analysis jobs (w.r.t. production and other sites)
        -> details are given for each site
        -> only GRIF-LAL has been low (~50%) since January, but was good (75%) last month
        -> need to scrutinize from time to time and understand the effect of a few users
             on global performance

4) Reports         

4a) Report WLCG (Laurent)
    - CERN will no longer use DPM; DPM corresponds to what we need now and for Run 3,
      but what about Run 4?
    - LCG-FR is also investigating other solutions (see email exchanges)
    - ongoing CE migration - end of CREAM by end of 2020
    - LCG-FR sites meeting on 25-27th November
    - LS2 and Run 3, which will start in 2022
          -> experiments will have pledges for 2022 and "priorities" for 2021
          -> not easy for sites to prepare the budget/pledges efficiently

4b) Report DOMA-FR (Stéphane, Sabine)
     - FR-ALPAMED: working testbed with DOME
           Hammer Cloud tests are in place
           Monitoring tools are deployed
     - tests of different SEs
           -> these are the base components of a French participation in a DataLake
           -> including CC
           -> need to show what a DataLake really brings w.r.t. the current situation
    - DOMA-QoS (storage) to explore new techniques - but not the priority

4c) Tour of sites and AOB

CPPM: nothing to report
IRFU: put its cooling system in safe mode during the lockdown + possibly an ARC-CE accounting issue
IJCLab: pledges to be installed
LAPP: network with two 10 Gbps lines; in September a new router (CC), after which it
            could go to 2*20 Gbps
            ongoing discussions about spending money on infrastructure instead of growth
            Ian Bird at LAPP for ESCAPE
LPC: nothing to report
LPNHE: recurrent cooling and UPS problems
               Aurelien is now 100% for the lab: 50% on grid (as before),
               50% on another project
LPSC: pledges in place, some issues with Singularity namespaces (solved)
L2IT: nothing to report

AOB: need to find contributions for the CAF afternoon session.
          ~1h is possible
          could rotate through different subjects, invite some people

4d) Next CAF-user workshop (Fred)
       - foreseen for 10th December
              -> afternoon session focused on the work of engineers
                      Computing: DOMA
                      Databases + AMI
                      Software (who?), ACTS, GEANT4?
              -> need to find speakers

======
Afternoon session
 

1) FR-T1-cloud (Fred)
- Very good period, with 99% availability & reliability
- CC = 9% of the T1s (was 10% last period)
- ATLAS is using less than its pledge at CC
      - 110% in March, 85% in April, 92% in May
      - seems to be better in June
    -> clearly visible in the Grafana plots
    -> cannot use the cctools graphs due to the change of batch system
- Storage: stable for ATLASDISK and TAPE, increase of +3 PB on MCTAPE

2) Status of T1 (Aresh)
    - coming back to the low CPU usage of the T1 in this period (see above)
            -> appeared with the introduction of HTCondor and the decommissioning of the CREAM-CE
            -> in particular, not enough single-core (score) jobs
            -> now all grid jobs are under HTCondor
            -> still some AGIS parameters that can be tuned, e.g. those limiting the number of pilot jobs
            -> also the number of VO subgroups sharing mcore/score slots and pilots
                ==> all this is needed as the T1 also has other VOs and projects
                         and cannot let ATLAS handle its jobs on its own
                ==> better tuning seems to have been achieved in June
   - details on Frontier tests
   - details on Third Party Copy tests
   - details on Data Carousel tests - some issues with FTS at different T1s
   - details on tickets, GGUS and OTRS


3) CC usage (Fred)
   - sps: 360 TiB, filled at 60%
   - LOCALGROUPDISK: 525 TB allocated / 401 TB used
   - LOCALGROUPTAPE: 250 TB used
   - share of CPU per user: ATLAS users on the batch farm represent 1.86% of all ATLAS
     jobs, but only 0.79% in May. Only 5 users consume 80% of this CPU.
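
     As a minimal sketch (made-up accounting values and hypothetical user names, not
     real CC-IN2P3 data), a "top 5 users consume X% of the CPU" figure can be derived
     from per-user CPU accounting like this:

         from collections import Counter

         # user -> CPU hours (illustrative values only)
         cpu_hours = Counter({
             "user_a": 25000, "user_b": 15000, "user_c": 10000,
             "user_d": 6000,  "user_e": 4000,  "user_f": 3000,
             "user_g": 2500,  "user_h": 2500,  "user_i": 2000,
             "user_j": 2000,  "user_k": 1500,  "user_l": 1500,
         })

         total = sum(cpu_hours.values())
         top5 = cpu_hours.most_common(5)  # five biggest consumers
         share = sum(hours for _, hours in top5) / total
         print(f"Top 5 users consume {share:.0%} of the CPU")  # 80% with these numbers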
 
