Bi-Weekly Datalake DepOps meeting (Paul chairing)

Europe/Paris
Description

Weekly meeting to discuss progress on EDLK JIRA issues: https://jira.skatelescope.org/issues/?filter=15115

Zoom room: https://skatelescope.zoom.us/j/97713259777?pwd=Q2EwSWZ3NkRaazFRSy9YT3Y5UmdJZz09

LSC upgrade

   The problem was caused by the VOMS server's certificate expiring.

   The CA that issues the certificate is having problems and the
   request to renew it became stuck.  As a result, the existing
   certificate expired before the new certificate was issued, creating
   a "down time" in the testbed, as the TLS handshakes (to the VOMS
   server) were failing.
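
   As an illustration of how an expiry like this might be caught
   earlier, here is a minimal sketch.  It assumes the certificate's
   notAfter string is already in hand (for example, from the
   'notAfter' field returned by Python's ssl getpeercert()).  The
   function names and the 30-day warning window are hypothetical and
   not something the testbed currently runs.

```python
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and the certificate's notAfter."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now) / 86400.0


def needs_renewal(not_after: str, now: float, warn_days: float = 30.0) -> bool:
    """True if the certificate expires within `warn_days` (hypothetical window)."""
    return days_until_expiry(not_after, now) <= warn_days
```

   A longer warning window leaves more slack if the CA is slow to
   respond to the renewal request.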

   However, once the certificate was issued, a further problem came to
   light: the subject DN had changed.  This required that all sites
   change their .lsc file for the ESCAPE VO.
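
   For reference, an .lsc file lists the subject DN of the VOMS
   server certificate followed by the DN of the issuing CA, one per
   line.  The sketch below renders such a file and checks whether an
   existing one still matches.  The helper names are hypothetical and
   the DNs in the usage note are placeholders, not the real ESCAPE
   values.

```python
def render_lsc(subject_dn: str, issuer_dn: str) -> str:
    """Return .lsc content: subject DN line, then issuer DN line."""
    return f"{subject_dn}\n{issuer_dn}\n"


def parse_lsc(text: str) -> tuple:
    """Return (subject_dn, issuer_dn) from .lsc content, ignoring blank lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("an .lsc file needs a subject DN and an issuer DN")
    return lines[0], lines[1]


def lsc_matches(text: str, subject_dn: str, issuer_dn: str) -> bool:
    """True if the .lsc content names exactly this subject and issuer."""
    return parse_lsc(text) == (subject_dn, issuer_dn)
```

   For example, lsc_matches(old_text, "/DC=org/CN=voms.example.org",
   "/DC=org/CN=Example CA") returns False once the subject DN has
   changed, which is the situation sites were in after the renewal.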

   We started an upgrade campaign within ESCAPE for all sites to
   deploy the new .lsc file.  A catch-all JIRA issue, EDLK-164, was
   created, but without sub-issues for each site; this ticket is
   still open.  People were also notified via the Rocket-chat
   datalake_status_and_support channel.
 
   The ESCAPE testbed monitoring provides the best feedback on the
   progress of this upgrade.


   Lessons learnt

   Paul Millar described how nobody likes having an unexpected
   intervention; however, this was an excellent opportunity to check
   how we are handling communication with storage providers within our
   testbed.

   After a few days, I contacted the dCache sites in the testbed
   (since I know the people running them, anyway) to request they
   update their .lsc file for ESCAPE.  It seemed this wasn't happening
   otherwise.
    
   Perhaps this is an opportunity to check on how we communicate to
   all sites?  This could be useful if we need to communicate urgently
   during DAC21.

   Paul Musset mentioned that, on Rocket-chat, if you miss a message
   then you are not notified.
   
   Rosie described how the INFN-ROMA1 admins mentioned that they don't
   see JIRA tickets.

   There was some discussion on how GGUS is the commonly adopted
   solution for notifying sites of problems; however, within ESCAPE,
   we have adopted JIRA as how we track problems.  We may wish to
   revisit that decision.

   A more general solution was proposed: we have a list of "Storage
   Points-of-Contact" (storage PoCs).  These are listed here:

   https://wiki.escape2020.de/index.php/WP2_-_DIOS#Datalake_Status

   If we have an issue with any site then create a JIRA ticket
   specifically for that site and assign it to that site's PoC.  We
   may also need to send out an email to that RSE's Storage PoC(s).
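
   The per-site ticket idea could be sketched as follows.  This only
   builds the JSON payloads in the shape used by the JIRA issue
   creation REST call; it does not contact JIRA.  The "EDLK" project
   key is inferred from the ticket numbering, and the issue type, the
   PoC usernames and the function name are all assumptions made for
   illustration.

```python
from typing import Dict, List


def build_site_tickets(
    summary: str,
    description: str,
    pocs: Dict[str, str],
    project_key: str = "EDLK",  # assumption: inferred from "EDLK-164"
) -> List[dict]:
    """Build one issue payload per site, assigned to that site's PoC."""
    tickets = []
    for site, poc_username in sorted(pocs.items()):
        tickets.append({
            "fields": {
                "project": {"key": project_key},
                "summary": f"[{site}] {summary}",
                "description": description,
                "issuetype": {"name": "Task"},  # assumed issue type
                "assignee": {"name": poc_username},
            }
        })
    return tickets
```

   Having one ticket per site means each can be closed independently,
   so the open tickets show directly which sites still need to act.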

   The Storage PoCs should (in general) take care that the RSEs
   offered by their site are "green" in the ESCAPE grafana monitoring
   pages.

   We should probably also monitor the attendance of Storage PoCs at
   the DepOps meetings, to catch any potential communication problems.
   
   AP/ Rosie to contact all Storage PoCs to mention the meeting schedule.

   Marek also mentioned how it would be nice if we had some kind of
   alerting mechanism alongside monitoring.


Monitoring:

   The testbed is currently being monitored, with Rucio and FTS
   events (from activity) shown; however, the active probing is
   limited to jobs that exercise Rucio.  The two lower-level services
   (FTS and gfal) are currently not being actively tested.
   
   Rizart described how there are plans to reactivate both the FTS and
   gfal probes; however, they are currently on hold due to
   WLCG-related activity.

   There were also issues with the cluster (last week and this week).

   Restarting the active testing involves more than simply
   re-enabling the probes: work is needed to clean up stale files; for
   example, if a gfal test fails to delete a file then (currently) it
   will remain on the storage indefinitely.
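
   The clean-up step might look something like the sketch below.  It
   assumes a directory listing (path and modification time) has
   already been obtained by whatever means, and that test files can be
   recognised by a name prefix.  The prefix, the one-day age limit and
   the names used here are hypothetical.

```python
import posixpath
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    path: str
    mtime: float  # modification time, seconds since the epoch


def stale_test_files(
    entries: Iterable[Entry],
    now: float,
    max_age_s: float = 24 * 3600,        # hypothetical age limit
    prefix: str = "functional-test-",    # hypothetical naming convention
) -> List[str]:
    """Return paths of test files older than `max_age_s`, sorted."""
    stale = []
    for entry in entries:
        is_test_file = posixpath.basename(entry.path).startswith(prefix)
        if is_test_file and now - entry.mtime > max_age_s:
            stale.append(entry.path)
    return sorted(stale)
```

   Matching on a prefix keeps the clean-up from touching files that
   were not written by the probes.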

   The active testing is recorded in the Rucio Events monitoring page
   under a separate activity: "Functional Test", which is available as
   a filter.  Selecting this filter allows for viewing only the results
   from the active testing.

   We could add new activities; for example, a "DAC21" activity.

   There was some discussion about whether we can know if the active
   monitoring is running correctly.  Potentially, we learn of problems
   with an RSE because the active testing shows failures; however, if
   the active testing stops (for whatever reason) then we might not be
   aware of problems with that RSE.

   Rizart thought it might be possible to trigger an alert if the
   monitoring stalls.  This is normally not possible, due to a
   limitation in Grafana (parameterised plots cannot have alerts), but
   it might be possible to have some "hard coded" Grafana panels for
   active monitoring, specifically to support alerts.
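
   Independently of how Grafana would implement it, the check behind
   such an alert is simple: flag any RSE whose most recent active-test
   result is too old, or which has never reported.  A minimal sketch,
   in which the one-hour silence limit and all names are hypothetical:

```python
from typing import Dict, Iterable, List


def stalled_rses(
    last_result: Dict[str, float],
    expected: Iterable[str],
    now: float,
    max_silence_s: float = 3600.0,  # hypothetical limit
) -> List[str]:
    """Return RSEs with no active-test result in the last `max_silence_s`.

    `last_result` maps an RSE name to the epoch time of its latest
    result (success or failure); an RSE missing from it never reported.
    """
    stalled = []
    for rse in expected:
        seen = last_result.get(rse)
        if seen is None or now - seen > max_silence_s:
            stalled.append(rse)
    return sorted(stalled)
```

   Note that a failing test still counts as a result here: the point
   is to detect silence, which failures in the dashboard cannot show.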

   Rosie kindly asked that the SKA VO be added to the dashboard: her
   thanks go to Rizart for doing this.

   Should this be extended to other VOs?  There is a first attempt at
   adding the CTA VO.  It looks like CTA-related traffic is currently
   not being shown, perhaps due to the configuration of the CTA Rucio
   instance.

   AP/ Rosie and Gareth to involve others in CTA in checking if the VO
   selection works correctly for them.


Token update:

   Andrea reported on Frederica's work.  She is working on supporting
   tokens and has a full skeleton version of the conformance tests.

   This fetches the list of endpoints from CRIC.

   Most sites currently fail; however, this is to be expected as they
   haven't been given clear instructions (yet) on what a site must do
   to support token-based authentication.

   Andrea described the various steps he anticipates ESCAPE adopting,
   in order to achieve the goal of token-based authorisation.

   The first step involves having two groups: ordinary people and
   data-managers, identified by group-membership.  Ordinary people
   have read-only access while data-managers have read-write access.

   Paul suggested a step-0: show that token-based authentication works
   without limiting authorisation.  This would be equivalent to what
   is available now, with X.509 + VOMS.

   Andrea agreed this would be a good first (or "zeroth"?) step.
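
   The difference between step 0 and the first step can be sketched
   as a mapping from token claims to an access level.  This assumes
   the token has already been validated and that group membership
   arrives in a "wlcg.groups" claim (the claim name used by the WLCG
   token profile; whether ESCAPE adopts it is an assumption).  The
   group names are hypothetical.

```python
USER_GROUP = "/escape"                       # hypothetical group name
DATA_MANAGER_GROUP = "/escape/data-manager"  # hypothetical group name


def access_level(claims: dict, step: int = 1) -> str:
    """Map validated token claims to 'read-write', 'read-only' or 'none'.

    step 0: any valid token gets full access (comparable to X.509 + VOMS today).
    step 1: data-managers read-write, ordinary members read-only.
    """
    if step == 0:
        return "read-write"
    groups = set(claims.get("wlcg.groups", []))
    if DATA_MANAGER_GROUP in groups:
        return "read-write"
    if USER_GROUP in groups:
        return "read-only"
    return "none"
```

   Keeping step 0 as a separate mode lets sites confirm that token
   authentication works before any authorisation rules can get in the
   way.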


Advice on file sizes:

   Gareth described a set of tests he would like to conduct, looking
   into how the datalake copes with files of different sizes; in
   particular, over long-haul connections.

   The idea would be to try files at various sizes: 2 GiB up to 100
   GiB.  He was asking whether there are any known limitations?

   Paul mentioned that timeouts might be an issue, particularly if the
   network connection is more bandwidth limited than usual.
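
   To give a feel for the timeout concern, the arithmetic can be
   worked through as below.  The sustained rate and the timeout are
   inputs to the calculation, not measured values or actual FTS
   settings; the function names are hypothetical.

```python
GIB = 2 ** 30
MIB = 2 ** 20


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Ideal transfer time at a constant sustained rate."""
    return size_bytes / rate_bytes_per_s


def sizes_exceeding_timeout(sizes_gib, rate_mib_s: float, timeout_s: float):
    """Return the file sizes (GiB) that cannot finish within `timeout_s`."""
    return [
        size for size in sizes_gib
        if transfer_seconds(size * GIB, rate_mib_s * MIB) > timeout_s
    ]
```

   At a sustained 100 MiB/s, a 2 GiB file takes about 20 seconds and
   a 100 GiB file 1024 seconds; at 10 MiB/s the 100 GiB file needs
   10240 seconds (nearly three hours), so any timeout shorter than
   that would abort the transfer.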
   
   Rohini described how SKA also have an interest in how file size
   affects the data lake's performance.  Also, FTS has a
   configuration option that allows for different numbers of
   concurrent transfers.

   Rizart confirmed this configuration option, adding that it's also
   possible to assign different priorities to different VOs.  He has
   access to FTS configuration on the FTS3-pilot instance, so can see
   the current values and modify them. 

    • 11:00 11:30
      Hot topics
    • 11:30 11:40
      Datalake health
    • 11:40 12:00
      AOB