Data Injection Demonstrators

Europe/Paris
    • 11:00 11:05
      ATLAS 5m
    • 11:05 11:10
      CMS 5m
    • 11:10 11:15
    • 11:15 11:20
      EGO/VIRGO 5m

      AUTHENTICATION: OK
      ==============
      [user@3ae7a83d158b ~]$ rucio whoami
      /usr/lib/python2.7/site-packages/paramiko/transport.py:33: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release.
        from cryptography.hazmat.backends import default_backend
      status     : ACTIVE
      account    : pchanial
      account_type : SERVICE
      created_at : 2020-10-13T09:59:50
      updated_at : 2020-10-13T09:59:50
      suspended_at : None
      deleted_at : None
      email      : pierre.chanial@ego-gw.it

      SCOPE CREATION: OK
      ==============
      rucio-admin scope add --account pchanial --scope VIRGO_EGO_CHANIAL

      UPLOAD: OK
      ======
      [user@9c903e7070d5 ~]$ rucio upload --rse EULAKE-1 --scope VIRGO_EGO_CHANIAL V-FakeV1_GWOSC_O2_4KHZ_R1-1185615872-4096.hdf5
      2020-10-15 09:28:22,987    INFO    Preparing upload for file V-FakeV1_GWOSC_O2_4KHZ_R1-100000004-4096.hdf5
      2020-10-15 09:28:23,261    INFO    Successfully added replica in Rucio catalogue at EULAKE-1
      2020-10-15 09:28:23,354    INFO    Successfully added replication rule at EULAKE-1
      2020-10-15 09:28:24,845    INFO    Trying upload with gsiftp to EULAKE-1
      2020-10-15 09:28:28,240    INFO    Successful upload of temporary file. gsiftp://eulakeftp.cern.ch:2811/eos/eulake/tests/rucio_test/eulake_1/VIRGO_EGO_CHANIAL/c2/f6/V-FakeV1_GWOSC_O2_4KHZ_R1-100000004-4096.hdf5.rucio.upload
      2020-10-15 09:28:28,346    INFO    Successfully uploaded file V-FakeV1_GWOSC_O2_4KHZ_R1-100000004-4096.hdf5
       

    • 11:20 11:25
      FAIR 5m

      1. Environment

      No changes.

      2. Test run

       * scheduled run time: 96 hours
       * actual run time: about 60 hours - since around 2020-10-05, 04:00 UTC
      rucio operations have been failing with "unable to get authentication
      token" in spite of voms-proxy-info claiming the proxy certificate to be
      valid
       * upload/download procedure as before, with the following changes:
          1. after each download, there is a 1 in 5 chance that the downloaded
      DID will be scheduled for deletion using 'rucio erase';
          2. new set of replication rules - after each upload, an equal chance
      of requesting:
              - 1 replica at DESY-DCACHE, or
              - 1 replica at QOS=SAFE, or
              - 2 replicas at QOS=SAFE, or
              - 1 replica at QOS=CHEAP-ANALYSIS, or
              - 2 replicas at QOS=CHEAP-ANALYSIS, or
              - no further replicas.
             For the record, GSI-ROOT advertises QOS=CHEAP-ANALYSIS so rules
      requesting this QoS class get one replica without any transfers.
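The randomized choices described above can be sketched as follows. This is a minimal illustration, not the actual test harness: the names `RULE_OPTIONS`, `pick_replication_rule` and `should_erase` are hypothetical, and the sketch only decides what to request. Issuing the rule or the deletion itself is left out.

```python
import random

# The six equally likely outcomes after each upload, as (copies, RSE expression).
# None stands for "no further replicas".
RULE_OPTIONS = [
    (1, "DESY-DCACHE"),
    (1, "QOS=SAFE"),
    (2, "QOS=SAFE"),
    (1, "QOS=CHEAP-ANALYSIS"),
    (2, "QOS=CHEAP-ANALYSIS"),
    None,
]


def pick_replication_rule(rng=random):
    """Return one of the six options with equal probability (hypothetical helper)."""
    return rng.choice(RULE_OPTIONS)


def should_erase(rng=random):
    """Return True with probability 1/5, as for downloaded DIDs (hypothetical helper)."""
    return rng.randrange(5) == 0
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which helps when comparing results between test campaigns.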

      3. Manual tests

      Uploads of full data sets (via appropriate invocation of 'rucio
      upload'), creation of data sets from previously uploaded files ('rucio
      add-dataset' + 'rucio attach'), removal of files from data sets ('rucio
      detach') and deletion of whole data sets ('rucio erase') have been
      tested manually.

      4. Results

      * 100% success rate for uploads (62 files), downloads (13) and deletions (1)
      * 100% success rate for replication to QOS=SAFE (16x one replica, 9x two
      replicas) and, unsurprisingly, "one replica at QOS=CHEAP-ANALYSIS" (9)
      * 4 out of 10 "two replicas at QOS=CHEAP-ANALYSIS" rules ended up stuck:

         - 2 due to an authentication error on our end. Server logs show
      several occurrences of the error " XrootdResponse: sending err 3006:
      Invalid request; user not authenticated" for connections from
      ccdcalitest11.in2p3.fr at the time the replication was to take place,
      and indeed IN2P3-CC-DCACHE is one of the two RSEs advertising
      QOS=CHEAP-ANALYSIS (the other being LAPP-WEBDAV) which do not presently
      hold any replicas of FAIR data;

         - 1 due to the target RSE having run out of storage space. Server
      logs only showed a WebDAV connection originating from
      fts-pilot-07.cern.ch but having extracted the relevant job ID
      (235923f4-05c4-11eb-8d79-02163e018830) from said logs, I was able to
      determine that the actual recipient of the transfer was LAPP-WEBDAV;

         - 1 due to an allegedly unknown TLS certificate presented by the
      destination. No idea what the destination was this time, none of the
      relevant job IDs have corresponding entries show up in the FTS Monitor.

      * no major issues with data set-related operations but the user
      experience pertaining to direct uploads of data sets leaves something to
      be desired (see below)

      5. Conclusions

      * We need a way of looking further into the past while analysing
      FTS-transfer errors. Unless I have missed something, FTS Monitor does
      not show more than the last 6 hours of activity and while our Grafana
      dashboard does allow selecting longer time frames, it doesn't seem to
      have any effect on the contents of the failure-log panel;

      * I have found both the rucio-clients help messages and the Rucio RTD
      documentation woefully inadequate as far as directly uploading a data
      set (in contrast to attaching previously uploaded files to a data set)
      is concerned. In the end I found the necessary syntax in some
      ATLAS tutorials on the Web, and even those I had to modify a bit - for
      some reason the only way this works for me is to have the scope declared
      twice, both via --scope and in the data-set name (e.g. 'rucio upload
      --scope FAIR_GSI_SZUBA --rse GSI-ROOT FAIR_GSI_SZUBA:testDS aaa bbb') -
      if I omit the former, Oracle complains.

       

      Further information on this, partly from my own experiments and partly from Martin's private response. I was going to add it to today's Indico entry, but it turns out I cannot edit the material there.

      1. The '--scope' argument must be used because it specifies the scope for *uploaded files*, same as when you upload them without creating a data set in the process. Without it, Rucio attempts to use the user's personal scope for the files, even if the data set itself has a scope prefix. Personal scopes do not exist on our testbed, hence the error.

      2. The scope prefix for the data set is, strictly speaking, not necessary (if absent, Rucio will use the value of --scope), but it is recommended because it explicitly marks the first argument as a data set. If you do not specify it, the first argument will only be treated as a data-set name UNLESS THERE IS A FILE OR DIRECTORY WITH THAT NAME IN THE CURRENT WORKING DIRECTORY. Consider the command

      rucio upload [...] testDS aaa bbb

      , where aaa and bbb are files, in the three following cases:

       - testDS does not exist - rucio creates a data set called testDS, creates the initial replication rule for it, uploads both files and attaches them to testDS;

       - testDS is a file - rucio uploads all three files, each one with its own initial replication rule - no data set is created;

       - testDS is a directory - similar to the above but rucio recursively uploads the files found inside testDS/.
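The three cases can be summarised in a small decision function. This is only a model of the behaviour described above, not Rucio's actual implementation. The function name `interpret_first_argument` and its return labels are hypothetical, and the effect of an explicit scope prefix is not modelled.

```python
import os


def interpret_first_argument(arg, cwd="."):
    """Model how the first positional argument of 'rucio upload' is treated,
    following the three cases described in the text (hypothetical helper)."""
    path = os.path.join(cwd, arg)
    if os.path.isdir(path):
        # Directory: its files are uploaded recursively, no data set is created.
        return "directory"
    if os.path.isfile(path):
        # File: uploaded like the other files, no data set is created.
        return "file"
    # Nothing with that name exists: the argument is taken as a data-set name.
    return "dataset"
```

Running such a check before an upload is one way to avoid silently uploading a stray file called testDS instead of creating the data set.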

    • 11:25 11:30
      LOFAR 5m
    • 11:30 11:35
      LSST 5m

      ### 1. Environment setup
      For this exercise, we're using a shared Python virtual environment installed on CC-IN2P3 interactive machines. All ESCAPE members with a CC-IN2P3 account can load this environment by sourcing the `/pbs/throng/escap/rucio/rucio_escape.sh` file.
      By default, authentication uses x509_proxy; users can override this, along with other Rucio configuration details, via environment variables if needed.
      After loading the environment, the user needs to obtain an ESCAPE certificate proxy via `voms-proxy-init -voms escape`.
      Once this is done, everything is set up and we can start interacting with Rucio.

      ### 2. Running the exercise
      To make testing quicker and reproducible, most actions are done within bash scripts.
      The first script:
      - Generates a random file of a random size between 100MB and 1000MB.
      - Uploads that file to the IN2P3-CC-DCACHE RSE via the davs protocol
      - Attaches that file to the LSST_CCIN2P3_GOUNON:demo_test01 dataset
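A minimal sketch of the file-generation step, written in Python rather than bash. The original script is not shown here, so this is an illustration under stated assumptions: sizes are counted in units of 10^6 bytes, the content is random bytes, and the function name `make_random_file` and its parameters are hypothetical. The upload and attach steps are omitted.

```python
import os
import random

MB = 1000 * 1000  # assumption: 1 MB = 10**6 bytes


def make_random_file(path, min_bytes=100 * MB, max_bytes=1000 * MB, chunk=MB):
    """Write random content of a random size in [min_bytes, max_bytes] to path.

    Returns the number of bytes written. Data is produced chunk by chunk
    so that memory use stays low even for the largest files.
    """
    size = random.randint(min_bytes, max_bytes)
    remaining = size
    with open(path, "wb") as out:
        while remaining > 0:
            n = min(chunk, remaining)
            out.write(os.urandom(n))
            remaining -= n
    return size
```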

      Then, we created the rule with ID **bf0ce8d640964c6cba5baed7f03ceaee** with the following command:
      ```
      rucio add-rule LSST_CCIN2P3_GOUNON:demo_test01 2 QOS=CHEAP-ANALYSIS
      ```
      This will ensure 2 copies of the files in LSST_CCIN2P3_GOUNON:demo_test01 are present at any time, on sites where the `QOS=CHEAP-ANALYSIS` flag is present.
      After a little while, all files in the rule went to OK state.

      The second script will:
      - Download a given DID to local storage
      - Get the md5sum of that file from Rucio with the `rucio get-metadata` command
      - Compare the remote md5sum with the local md5sum for that file and output the result
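The comparison step can be sketched as below, in Python rather than bash. It assumes the catalogue's MD5 value has already been obtained as a hexadecimal string (for instance from the output of `rucio get-metadata`, which is not parsed here). The function names `local_md5` and `checksums_match` are hypothetical.

```python
import hashlib


def local_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def checksums_match(path, remote_md5):
    """Compare the local digest with the value recorded in the catalogue.

    Surrounding whitespace and letter case in remote_md5 are ignored.
    """
    return local_md5(path) == remote_md5.strip().lower()
```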

      We were also able to mark files for deletion with `rucio erase`. We have yet to confirm that those files are actually deleted after the 24-hour delay.
      It should be noted that we are now running automatic uploads to that same dataset hourly via a crontab task.

      ### 3. Errors:
      The single warning we've encountered during this exercise is the following:

      ```
      2020-10-02 16:30:47,559    INFO    Using main thread to download 1 file(s)
      2020-10-02 16:30:47,559    DEBUG    Start processing queued downloads
      2020-10-02 16:30:47,560    INFO    Preparing download of LSST_CCIN2P3_GOUNON:6r1XIOlGCsPX2UwaVu0D8wWbQ4WYjp4x
      2020-10-02 16:30:47,610    INFO    Trying to download with davs from LAPP-DCACHE: LSST_CCIN2P3_GOUNON:6r1XIOlGCsPX2UwaVu0D8wWbQ4WYjp4x
      2020-10-02 16:31:47,756    DEBUG    The requested service is not available at the moment.
      Details: An unknown exception occurred.
      Details: Could not open source: Connection timed out
      2020-10-02 16:31:47,757    WARNING    Download attempt failed. Try 1/2
      2020-10-02 16:32:47,869    DEBUG    The requested service is not available at the moment.
      Details: An unknown exception occurred.
      Details: Could not open source: Connection timed out
      2020-10-02 16:32:47,869    WARNING    Download attempt failed. Try 2/2
      2020-10-02 16:32:47,954    INFO    Trying to download with root from IN2P3-CC-DCACHE: LSST_CCIN2P3_GOUNON:6r1XIOlGCsPX2UwaVu0D8wWbQ4WYjp4x
      [...]
      2020-10-02 16:32:54,540    INFO    File LSST_CCIN2P3_GOUNON:6r1XIOlGCsPX2UwaVu0D8wWbQ4WYjp4x successfully downloaded. 896.532 MB in 3.2 seconds = 280.17 MBps
      ```

      This means that, to download `LSST_CCIN2P3_GOUNON:6r1XIOlGCsPX2UwaVu0D8wWbQ4WYjp4x`, Rucio first tries the davs protocol on LAPP-DCACHE, fails twice, and then switches to the root protocol on IN2P3-CC-DCACHE, where it succeeds. This probably points to an issue with the davs protocol on LAPP-DCACHE, which I have not investigated further.
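To spot such fallbacks across many downloads, the client log could be summarised per protocol and RSE. The sketch below is only an illustration: the regular expressions are derived solely from the log excerpt above and may not match other Rucio client versions, and `summarise_download_log` is a hypothetical helper, not part of Rucio.

```python
import re

# Patterns inferred from the log excerpt only (assumption).
TRY_RE = re.compile(r"Trying to download with (\S+) from (\S+?):")
FAIL_RE = re.compile(r"Download attempt failed")
OK_RE = re.compile(r"successfully downloaded")


def summarise_download_log(lines):
    """Return [protocol, rse, failed_attempts, succeeded] entries,
    one per 'Trying to download' line, in order of appearance."""
    attempts = []
    for line in lines:
        match = TRY_RE.search(line)
        if match:
            attempts.append([match.group(1), match.group(2), 0, False])
        elif attempts and FAIL_RE.search(line):
            attempts[-1][2] += 1
        elif attempts and OK_RE.search(line):
            attempts[-1][3] = True
    return attempts
```

Applied to the excerpt above, this reports two failed attempts for davs on LAPP-DCACHE followed by a successful download via root from IN2P3-CC-DCACHE.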

      ### 4. Feedback:
      Everything works as expected at this stage and we didn't have any major issue.
      I think it would be interesting to investigate non-deterministic RSE configurations as well, and our ability to feed existing datasets into the Rucio catalog via file registration.
       

    • 11:35 11:40
    • 11:40 11:45
      SKA 5m
    • 11:45 12:00
      Report, Summary, and Next Steps 15m
      Speaker: Riccardo Di Maria (CERN)