[user@3ae7a83d158b ~]$ rucio whoami
/usr/lib/python2.7/site-packages/paramiko/transport.py:33: CryptographyDeprecationWarning: Python 2 is no longer supported by the Python core team. Support for it is now deprecated in cryptography, and will be removed in a future release.
from cryptography.hazmat.backends import default_backend
status : ACTIVE
account : pchanial
account_type : SERVICE
created_at : 2020-10-13T09:59:50
updated_at : 2020-10-13T09:59:50
suspended_at : None
deleted_at : None
email : email@example.com
SCOPE CREATION: OK
rucio-admin scope add --account pchanial --scope VIRGO_EGO_CHANIAL
[user@9c903e7070d5 ~]$ rucio upload --rse EULAKE-1 --scope VIRGO_EGO_CHANIAL V-FakeV1_GWOSC_O2_4KHZ_R1-1185615872-4096.hdf5
2020-10-15 09:28:22,987 INFO Preparing upload for file V-FakeV1_GWOSC_O2_4KHZ_R1-100000004-4096.hdf5
2020-10-15 09:28:23,261 INFO Successfully added replica in Rucio catalogue at EULAKE-1
2020-10-15 09:28:23,354 INFO Successfully added replication rule at EULAKE-1
2020-10-15 09:28:24,845 INFO Trying upload with gsiftp to EULAKE-1
2020-10-15 09:28:28,240 INFO Successful upload of temporary file. gsiftp://eulakeftp.cern.ch:2811/eos/eulake/tests/rucio_test/eulake_1/VIRGO_EGO_CHANIAL/c2/f6/V-FakeV1_GWOSC_O2_4KHZ_R1-100000004-4096.hdf5.rucio.upload
2020-10-15 09:28:28,346 INFO Successfully uploaded file V-FakeV1_GWOSC_O2_4KHZ_R1-100000004-4096.hdf5
2. Test run
* scheduled run time: 96 hours
* actual run time: about 60 hours - since around 2020-10-05, 04:00 UTC,
rucio operations have been failing with "unable to get authentication
token", in spite of voms-proxy-info claiming the proxy certificate to be
valid
* upload/download procedure as before, with the following changes:
1. for each DOWNLOAD, a 1-in-5 chance that the downloaded DID will be
scheduled for deletion using 'rucio erase'
2. a new set of replication rules - after each upload, an equal chance of:
- 1 replica at DESY-DCACHE, or
- 1 replica at QOS=SAFE, or
- 2 replicas at QOS=SAFE, or
- 1 replica at QOS=CHEAP-ANALYSIS, or
- 2 replicas at QOS=CHEAP-ANALYSIS, or
- no further replicas.
For the record, GSI-ROOT advertises QOS=CHEAP-ANALYSIS so rules
requesting this QoS class get one replica without any transfers.
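The equal-chance rule selection described above can be sketched as follows. This is a minimal Python illustration, not the actual test harness; the DID and the mapping of each option onto 'rucio add-rule' arguments (DID, copy count, RSE expression) are assumptions.

```python
import random

# The six post-upload options, each drawn with equal probability.
OPTIONS = [
    (1, "DESY-DCACHE"),          # 1 replica at DESY-DCACHE
    (1, "QOS=SAFE"),             # 1 replica at QOS=SAFE
    (2, "QOS=SAFE"),             # 2 replicas at QOS=SAFE
    (1, "QOS=CHEAP-ANALYSIS"),   # 1 replica at QOS=CHEAP-ANALYSIS
    (2, "QOS=CHEAP-ANALYSIS"),   # 2 replicas at QOS=CHEAP-ANALYSIS
    None,                        # no further replicas
]

def pick_rule_command(did):
    """Pick one option uniformly at random; return the 'rucio add-rule'
    command to run, or None for the 'no further replicas' case."""
    choice = random.choice(OPTIONS)
    if choice is None:
        return None
    copies, rse_expression = choice
    return "rucio add-rule {} {} {}".format(did, copies, rse_expression)
```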
3. Manual tests
Uploads of full data sets (via appropriate invocation of 'rucio
upload'), creation of data sets from previously uploaded files ('rucio
add-dataset' + 'rucio attach'), removal of files from data sets ('rucio
detach') and deletion of whole data sets ('rucio erase') have been tested.
* 100% success rate for uploads (62 files), downloads (13) and deletions (1)
* 100% success rate for replication to QOS=SAFE (16x one replica, 9x two
replicas) and, unsurprisingly, "one replica at QOS=CHEAP-ANALYSIS" (9)
* 4 out of 10 "two replicas at QOS=CHEAP-ANALYSIS" rules ended up stuck:
- 2 due to an authentication error on our end. Server logs show
several occurrences of the error " XrootdResponse: sending err 3006:
Invalid request; user not authenticated" for connections from
ccdcalitest11.in2p3.fr at the time the replication was to take place,
and indeed IN2P3-CC-DCACHE is one of the two RSEs advertising
QOS=CHEAP-ANALYSIS that do not presently hold any replicas of FAIR data
(the other being LAPP-WEBDAV);
- 1 due to the target RSE having run out of storage space. Server
logs only showed a WebDAV connection originating from
fts-pilot-07.cern.ch but having extracted the relevant job ID
(235923f4-05c4-11eb-8d79-02163e018830) from said logs, I was able to
determine that the actual recipient of the transfer was LAPP-WEBDAV;
- 1 due to an allegedly unknown TLS certificate presented by the
destination. No idea what the destination was this time; none of the
relevant job IDs have corresponding entries in the FTS Monitor.
* no major issues with data set-related operations but the user
experience pertaining to direct uploads of data sets leaves something to
be desired (see below)
* We need a way of looking further into the past while analysing
FTS-transfer errors. Unless I have missed something, FTS Monitor does
not show more than the last 6 hours of activity and while our Grafana
dashboard does allow selecting longer time frames, it doesn't seem to
have any effect on the contents of the failure-log panel;
* I have found both the rucio-clients help messages and the Rucio RTD
documentation woefully inadequate as far as directly uploading a data
set (in contrast to attaching previously uploaded files to a data set)
is concerned. In the end I found the necessary syntax in some
ATLAS tutorials on the Web, and even those I had to modify a bit - for
some reason the only way this works for me is to have the scope declared
twice, both via --scope and in the data-set name (e.g. 'rucio upload
--scope FAIR_GSI_SZUBA --rse GSI-ROOT FAIR_GSI_SZUBA:testDS aaa bbb') -
if I omit the former Oracle complains.
Further information on this, partly from my own experiments and partly from Martin's private response. I was going to add it to today's Indico entry, but it turns out I cannot edit the material there.
1. The '--scope' argument must be used because it specifies the scope for *uploaded files*, same as when you upload them without creating a data set in the process. Without it Rucio attempts to use the user's personal scope for the files even if the data set itself has got a scope in its prefix, which we haven't got on our testbed - hence the error.
2. The scope prefix for the data set is, strictly speaking, not necessary (if absent, Rucio will use the value of --scope) - but it is recommended because it explicitly marks the first argument as a data set. Without it, the first argument will only be treated as a data-set name UNLESS THERE IS A FILE OR DIRECTORY WITH THAT NAME IN THE CURRENT WORKING DIRECTORY. Consider the command
rucio upload [...] testDS aaa bbb
where aaa and bbb are files, in the following three cases:
- testDS does not exist - rucio creates a data set called testDS, creates the initial replication rule for it, uploads both files and attaches them to testDS;
- testDS is a file - rucio uploads all three files, each one with its own initial replication rule - no data set is created;
- testDS is a directory - similar to the above but rucio recursively uploads the files found inside testDS/.
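The three cases above boil down to a simple decision on the first argument. The sketch below is a simplified illustration of that documented behaviour, not Rucio's actual code.

```python
import os

def classify_first_arg(arg):
    """How 'rucio upload' interprets its first non-option argument,
    per the three cases described above (simplified illustration)."""
    if ":" in arg:
        return "dataset"      # explicit SCOPE:NAME prefix always means a data set
    if os.path.isdir(arg):
        return "directory"    # contents are uploaded recursively; no data set
    if os.path.isfile(arg):
        return "file"         # uploaded like any other file; no data set
    return "dataset"          # nothing by that name exists: treated as a data set
```

Using the explicit SCOPE:NAME form therefore avoids any surprise from a stray file or directory named testDS in the working directory.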
### 1. Environment setup
For this exercise, we're using a shared Python virtual environment installed on CC-IN2P3 interactive machines. All ESCAPE members with a CC-IN2P3 account can load this environment by sourcing the `/pbs/throng/escap/rucio/rucio_escape.sh` file.
By default, authentication is done with an X.509 proxy, but the user can change this, along with other Rucio configuration details, via environment variables if needed.
After loading the environment, the user needs to obtain an ESCAPE certificate proxy via `voms-proxy-init -voms escape`.
Once this is done, everything is set up and we can start interacting with Rucio.
### 2. Running the exercise
To make testing quicker and reproducible, most actions are done within bash scripts.
The first script:
- Generates a random file with a size chosen at random between 100 MB and 1000 MB
- Uploads that file to the IN2P3-CC-DCACHE RSE via the davs protocol
- Attaches that file to the LSST_CCIN2P3_GOUNON:demo_test01 dataset
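The file-generation step of that script can be sketched in Python as below. The function and its size bounds mirror the description above; the Rucio CLI invocations in the trailing comment are assumptions about how the upload and attach steps are driven, not the script's actual contents.

```python
import os
import random

def make_random_file(path, min_mb=100, max_mb=1000):
    """Write a file of random content whose size, in whole megabytes, is
    drawn uniformly from [min_mb, max_mb]; return the size in bytes."""
    size = random.randint(min_mb, max_mb) * 1024 * 1024
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            chunk = min(remaining, 1 << 20)   # write at most 1 MB at a time
            f.write(os.urandom(chunk))
            remaining -= chunk
    return size

# The upload and attach steps would then shell out to the Rucio CLI, e.g.
#   rucio upload --rse IN2P3-CC-DCACHE --scope LSST_CCIN2P3_GOUNON <path>
#   rucio attach LSST_CCIN2P3_GOUNON:demo_test01 LSST_CCIN2P3_GOUNON:<name>
```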
Then, we created the rule with ID **bf0ce8d640964c6cba5baed7f03ceaee** with the following command:
rucio add-rule LSST_CCIN2P3_GOUNON:demo_test01 2 QOS=CHEAP-ANALYSIS
This rule ensures that 2 copies of the files in LSST_CCIN2P3_GOUNON:demo_test01 are present at all times, on sites where the `QOS=CHEAP-ANALYSIS` flag is present.
After a little while, all files in the rule went to OK state.
The second script will:
- Download a given DID to local storage
- Get the md5sum of that file from Rucio with the `rucio get-metadata` command
- Compare the remote md5sum with the local md5sum for that file and output the result
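The checksum-comparison step of the second script can be sketched as follows. This is a minimal illustration, assuming the remote md5 has already been obtained separately (e.g. from `rucio get-metadata`); the function names are hypothetical.

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Stream the file through md5 so large downloads don't fill memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_download(path, remote_md5):
    """Compare the local checksum with the value reported by Rucio."""
    return md5_of(path) == remote_md5
```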
We were also able to mark files for deletion with `rucio erase`. We have yet to confirm that those files will actually be deleted after the 24-hour delay.
It should be noted that we are now running automatic uploads to that same dataset hourly via a crontab task.
### 3. Errors:
The single warning we've encountered during this exercise is the following:
2020-10-02 16:30:47,559 INFO Using main thread to download 1 file(s)
2020-10-02 16:30:47,559 DEBUG Start processing queued downloads
2020-10-02 16:30:47,560 INFO Preparing download of LSST_CCIN2P3_GOUNON:6r1XIOlGCsPX2UwaVu0D8wWbQ4WYjp4x
2020-10-02 16:30:47,610 INFO Trying to download with davs from LAPP-DCACHE: LSST_CCIN2P3_GOUNON:6r1XIOlGCsPX2UwaVu0D8wWbQ4WYjp4x
2020-10-02 16:31:47,756 DEBUG The requested service is not available at the moment.
Details: An unknown exception occurred.
Details: Could not open source: Connection timed out
2020-10-02 16:31:47,757 WARNING Download attempt failed. Try 1/2
2020-10-02 16:32:47,869 DEBUG The requested service is not available at the moment.
Details: An unknown exception occurred.
Details: Could not open source: Connection timed out
2020-10-02 16:32:47,869 WARNING Download attempt failed. Try 2/2
2020-10-02 16:32:47,954 INFO Trying to download with root from IN2P3-CC-DCACHE: LSST_CCIN2P3_GOUNON:6r1XIOlGCsPX2UwaVu0D8wWbQ4WYjp4x
2020-10-02 16:32:54,540 INFO File LSST_CCIN2P3_GOUNON:6r1XIOlGCsPX2UwaVu0D8wWbQ4WYjp4x successfully downloaded. 896.532 MB in 3.2 seconds = 280.17 MBps
This means that, to download `LSST_CCIN2P3_GOUNON:6r1XIOlGCsPX2UwaVu0D8wWbQ4WYjp4x`, Rucio first tried the davs protocol on LAPP-DCACHE, failed twice, then switched to the root protocol on IN2P3-CC-DCACHE and succeeded. This suggests an issue with the davs protocol on LAPP-DCACHE, which I have not investigated further.
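The retry-then-fallback pattern visible in that log can be sketched as below. This is an assumed simplification for illustration, not Rucio's actual implementation; the source list and attempt count mirror the two tries per RSE seen above.

```python
def download_with_fallback(sources, attempts_per_source=2):
    """Try each (rse, fetch) pair up to a fixed number of times, then move
    on to the next source; fail only if every source fails."""
    errors = []
    for rse, fetch in sources:
        for attempt in range(1, attempts_per_source + 1):
            try:
                return rse, fetch()
            except OSError as exc:
                # e.g. "Could not open source: Connection timed out"
                errors.append((rse, attempt, exc))
    raise RuntimeError("all sources failed: {}".format(errors))
```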
### 4. Feedback:
Everything works as expected at this stage and we did not encounter any major issues.
I think it would be interesting to also investigate non-deterministic RSE configurations, as well as our ability to feed existing datasets into the Rucio catalogue via file registration.