Bi-Weekly Datalake DepOps meeting (Paul chairing)
Weekly meeting to discuss progress on EDLK JIRA issues: https://jira.skatelescope.org/issues/?filter=15115
Zoom room: https://skatelescope.zoom.us/j/97713259777?pwd=Q2EwSWZ3NkRaazFRSy9YT3Y5UmdJZz09
LSC upgrade
The problem was caused by the VOMS server's certificate expiring.
The CA that issues the certificate is having problems and the
request to renew it became stuck. As a result, the existing
certificate expired before the new certificate was issued, creating
a "down time" in the testbed, as the TLS handshakes (to the VOMS
server) were failing.
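A minimal sketch of an early-warning check for this kind of expiry,
using only the Python standard library. The 30-day threshold and the
function names are illustrative assumptions, not an agreed ESCAPE
procedure; the notAfter string is in the text format used by Python's
ssl module.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the number of days left before a certificate expires.

    not_after is a notAfter string in the format used by Python's ssl
    module, e.g. "Jan 15 09:34:43 2018 GMT".
    now is a Unix timestamp; it defaults to the current time.
    """
    if now is None:
        now = time.time()
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now) / 86400.0


def needs_renewal(not_after, warn_days=30, now=None):
    """True if the certificate expires within warn_days days.

    warn_days is an assumed threshold chosen for illustration.
    """
    return days_until_expiry(not_after, now) < warn_days
```

Running such a check regularly would flag a renewal request that has
stalled while there is still time to chase the CA.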
However, once the certificate was issued, a further problem came to
light: the subject DN had changed. This required that all sites
change their .lsc file for the ESCAPE VO.
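For reference, an .lsc file holds two lines: the subject DN of the
VOMS server's certificate, followed by the DN of the issuing CA. It
usually lives under /etc/grid-security/vomsdir/<VO>/. The sketch below
shows why a changed subject DN forces every site to redeploy the file.
The DNs are placeholders, not the real ESCAPE values, and the helper
names are hypothetical.

```python
def render_lsc(subject_dn, issuer_dn):
    """Return the text of an .lsc file: subject DN, then issuer (CA) DN."""
    for dn in (subject_dn, issuer_dn):
        if not dn.startswith("/"):
            raise ValueError("expected an OpenSSL-style DN, got %r" % dn)
    return subject_dn + "\n" + issuer_dn + "\n"


def lsc_is_current(deployed_text, subject_dn, issuer_dn):
    """True if a deployed .lsc file matches the expected pair of DNs."""
    lines = [ln.strip() for ln in deployed_text.splitlines() if ln.strip()]
    return lines == [subject_dn, issuer_dn]


# Placeholder DNs for illustration only.
OLD_SUBJECT = "/DC=org/DC=example/CN=voms-old.example.org"
NEW_SUBJECT = "/DC=org/DC=example/CN=voms.example.org"
ISSUER = "/DC=org/DC=example/CN=Example CA"

deployed = render_lsc(OLD_SUBJECT, ISSUER)
# A site still holding the old subject DN fails the check.
site_is_up_to_date = lsc_is_current(deployed, NEW_SUBJECT, ISSUER)
```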
We started an upgrade campaign within ESCAPE for all sites to
deploy the new .lsc file. A catch-all JIRA issue, EDLK-164, was
created. This ticket is still open, but no sub-issues were created
for individual sites. People were also notified via the Rocket-chat
datalake_status_and_support channel.
The ESCAPE testbed monitoring provides the best feedback on the
progress of this upgrade.
Lessons learnt
Paul Millar described how nobody likes having an unexpected
intervention; however, this was an excellent opportunity to check
how we are handling communication with storage providers within our
testbed.
After a few days, I contacted the dCache sites in the testbed
(since I know the people running them, anyway) to request they
update their .lsc file for ESCAPE. It seemed this wasn't happening
otherwise.
Perhaps this is an opportunity to check on how we communicate to
all sites? This could be useful if we need to communicate urgently
during DAC21.
Paul Musset mentioned that, on Rocket-chat, if you miss a message
then you are not notified.
Rosie described how the INFN-ROMA1 admins mentioned that they don't
see JIRA tickets.
There was some discussion on how GGUS is the commonly adopted
solution for notifying sites of problems; however, within ESCAPE,
we have adopted JIRA as the way we track problems. We may wish to
revisit that decision.
Propose a more general solution: we have a list of "Storage
Point-of-Contacts" (storage PoCs). These are listed here:
https://wiki.escape2020.de/index.php/WP2_-_DIOS#Datalake_Status
If we have an issue with any site then create a JIRA ticket
specifically for that site and assign it to that site's PoC. We
may also need to send out an email to that RSE's Storage PoC(s).
The Storage PoCs should (in general) take care that the RSEs
offered by their site are "green" in the ESCAPE grafana monitoring
pages.
We should probably also monitor the attendance of Storage PoCs at
the DepOps meetings, to catch any potential communication problems.
AP/ Rosie to contact all Storage PoCs to mention the meeting schedule.
Marek also mentioned how it would be nice if we had some kind of
alerting mechanism alongside monitoring.
Monitoring:
The testbed is currently being monitored: Rucio and FTS events
(from activity) are shown; however, the active probing is
limited to jobs that exercise Rucio. The two lower-level services
(FTS and gfal) are currently not being actively tested.
Rizart described how there are plans to reactivate both the FTS and
gfal probes; however, they are currently on hold due to
WLCG-related activity.
There were also issues with the cluster (last week and this week).
Restarting the active testing involves more than simply
re-enabling the probes: work is needed to clean up stale files; for
example, if a gfal test fails to delete a file then (currently) it
will remain on the storage indefinitely.
The active testing is recorded in the Rucio Events monitoring page
under a separate activity: "Functional Test", which is available as
a filter. Selecting this filter allows for viewing only the results
from the active testing.
We could add new activities; for example, a "DAC21" activity.
There was some discussion about whether we can know if the active
monitoring is running correctly. Potentially, we learn of problems
with an RSE because the active testing shows failures; however, if
the active testing stops (for whatever reason) then we might not be
aware of problems with that RSE.
Rizart thought it might be possible to trigger an alert if the
monitoring stalls. This is normally not possible, due to a
limitation in Grafana (parameterised plots cannot have alerts), but
it might be possible to have some "hard coded" Grafana panels for
active monitoring, specifically to support alerts.
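The stall check discussed above amounts to a "heartbeat" test: alert
when the newest active-test event for an RSE is older than some limit.
The sketch below shows that logic only; it is not a Grafana
configuration. The RSE names, the one-hour limit and the function name
are assumptions for illustration.

```python
def stalled_rses(last_seen, now, max_age_seconds=3600):
    """Return the RSEs whose newest active-test event is too old.

    last_seen maps an RSE name to the Unix timestamp of its most recent
    "Functional Test" event. max_age_seconds is an assumed limit.
    """
    return sorted(
        rse for rse, timestamp in last_seen.items()
        if now - timestamp > max_age_seconds
    )


# Hypothetical RSE names and timestamps.
last_seen = {"RSE-A": 1000, "RSE-B": 5000}
alerts = stalled_rses(last_seen, now=5200)
```

With these values only RSE-A is reported, since its last event is
4200 seconds old while RSE-B reported 200 seconds ago.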
Rosie kindly asked that the SKA VO be added to the dashboard: her
thanks go to Rizart for doing this.
Should this be extended to other VOs? This is a first attempt at
adding the CTA VO. It looks like CTA-related traffic is currently
not being shown, perhaps due to the configuration of the CTA Rucio
instance.
AP/ Rosie and Gareth to involve others in CTA in checking if the VO
selection works correctly for them.
Token update:
Andrea reported on Frederica's work. She is working on supporting
tokens and has a full skeleton version of the conformance tests.
This fetches the list of endpoints from CRIC.
Most sites currently fail; however, this is to be expected as they
haven't been given clear instructions (yet) on what a site must do
to support token-based authentication.
Andrea described the various steps he anticipates ESCAPE adopting,
in order to achieve the goal of token-based authorisation.
The first step involves having two groups: ordinary people and
data-managers, identified by group-membership. Ordinary people
have read-only access while data-managers have read-write access.
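A minimal sketch of this first step, mapping group membership to the
permitted operations. The group names are hypothetical; the actual
ESCAPE group names and the way groups appear in a token were not
discussed here.

```python
# Hypothetical group names, for illustration only.
MEMBER_GROUP = "/escape"
DATA_MANAGER_GROUP = "/escape/data-manager"


def allowed_operations(groups):
    """Return the set of operations granted by a user's group membership."""
    operations = set()
    if MEMBER_GROUP in groups:
        operations.add("read")
    if DATA_MANAGER_GROUP in groups:
        operations.update({"read", "write"})
    return operations


ordinary_user = allowed_operations(["/escape"])
data_manager = allowed_operations(["/escape", "/escape/data-manager"])
```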
Paul suggested a step-0: show that token-based authentication works
without limiting authorisation. This would be equivalent to what
is available now, with X.509 + VOMS.
Andrea agreed this would be a good first (or "zeroth"?) step.
Advice on file sizes:
Gareth described a set of tests he would like to conduct, looking
into how the datalake copes with files of different sizes; in
particular, over long-haul connections.
The idea would be to try files at various sizes: 2 GiB up to 100
GiB. He was asking whether there are any known limitations?
Paul mentioned that timeouts might be an issue, particularly if the
network connection is more bandwidth limited than usual.
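To illustrate the timeout concern, the sketch below estimates the
ideal transfer time for a file at a given sustained bandwidth and
compares it with a timeout. The bandwidth figures and the one-hour
timeout are assumptions chosen for illustration; they are not measured
values or actual FTS settings.

```python
GIB = 2 ** 30  # bytes in one GiB


def transfer_seconds(size_bytes, bandwidth_bits_per_s):
    """Ideal transfer time, ignoring protocol overhead and contention."""
    return size_bytes * 8 / bandwidth_bits_per_s


def exceeds_timeout(size_bytes, bandwidth_bits_per_s, timeout_s):
    """True if the ideal transfer time is longer than the timeout."""
    return transfer_seconds(size_bytes, bandwidth_bits_per_s) > timeout_s


# Assumed values: a 100 Mbit/s long-haul link and a one-hour timeout.
SLOW_LINK = 100e6
TIMEOUT = 3600

small_file_at_risk = exceeds_timeout(2 * GIB, SLOW_LINK, TIMEOUT)
large_file_at_risk = exceeds_timeout(100 * GIB, SLOW_LINK, TIMEOUT)
```

Under these assumptions a 2 GiB file takes under three minutes, while
a 100 GiB file needs about 2.4 hours and would exceed the timeout.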
Rohini described how SKA also have an interest in how file size
affects the data lake's performance. She also noted that FTS has a
configuration option that allows for different numbers of concurrent
transfers.
Rizart confirmed this configuration option, adding that it's also
possible to assign different priorities to different VOs. He has
access to FTS configuration on the FTS3-pilot instance, so can see
the current values and modify them.