7–8 Sept. 2020
Time zone: Europe/Paris

Summary

Workshop objectives

Provenance information helps to explore the traceability of products, find contact information and acknowledge people. Providing provenance information allows a user to assess the quality and reliability of the products. If the stored information is sufficiently fine grained, it is possible to enable the reproducibility of an activity or sequence of activities. 

Recently the IVOA released a standard to structure provenance metadata and several implementations are in development in order to capture, store, access and visualize the provenance of astronomy data products. The main challenge is to organize and structure the provenance information in order to make it machine readable and easily exploitable. 

This meeting comprised presentations on technical solutions, demonstrations, hands-on sessions and discussions. A further objective was to collect the requirements of ESFRI projects in order to build the roadmap of future developments. 

The IVOA Provenance data model can be found here: 
http://www.ivoa.net/documents/ProvenanceDM


Summary of the meeting

The meeting gathered about 30 participants, including ESCAPE project representatives and IVOA Provenance experts. The first session presented the IVOA context and the IVOA Provenance data model. Current developments based on this recent standard were then presented, with proposed solutions to capture, store, access and visualize provenance information. The presentations can be found on the workshop webpage: 
https://indico.in2p3.fr/event/21913/sessions/13678/#20200907

The participants were then invited to present their use cases, with a focus on the requirements or the integration of provenance information management within their project. A questionnaire had been sent beforehand to the ESCAPE WP4 members to collect their feedback. More than 10 projects answered, with useful details that helped to prepare the workshop. The answers are attached to the meeting page: 
https://indico.in2p3.fr/event/21913/contributions/85151/#preview:81549

It is interesting to note that there is a generally shared requirement to keep track of the provenance (in order to enrich the metadata, ensure traceability and thus quality), followed by a more specific requirement to enable reproducibility and debugging of complex pipelines. Using the provenance to find contact information and acknowledge people was seen as secondary, perhaps because this goal is already fulfilled without structured provenance. 

About 10 use cases were presented during the meeting (plus the initial use cases that led to the definition of the IVOA standard), with the aim of understanding the common needs and the relevance of currently available tools and standards for the specific issues of each project: 
https://indico.in2p3.fr/event/21913/sessions/13679/#all.detailed

This led to a general discussion where several topics were covered, including: 

  • Provenance database versus infile provenance 
  • Interesting parts of provenance? science, management, minimum? different categories to be defined 
  • Simplified view of provenance with standard columns (ProvCore?) 
  • Serialization, YAML - VOTable - VOEvent 
  • Graph and relational databases 
  • Provenance and workflows, CWL and description of activities 

The meeting ended with the desire to continue the work, possibly within working groups on dedicated topics identified during the discussion (see proposed topics below). We would then plan specific meetings on those topics, and contribute to the next ESCAPE Tech Forum.  


Session at ADASS XXX

There will be a dedicated session at the next ADASS meeting on "Practical Provenance in Astronomy" (BoF session) to open our discussion to a wider audience. The current schedule indicates:

    Tuesday 10 November 2020, 19:00–20:30 (Times in UTC)
    https://schedule.adass2020.es/adass2020/talk/P3FK9U/

The ADASS conference will take place online on 8–12 November 2020:
    https://adass2020.es/


Provenance, why should we care?

Keeping provenance information may be seen as an additional constraint for a project. However, there are clear advantages to retaining this information as structured, machine-readable data, in particular in the context of Open Science. 

  • Generalization of the FAIR principles (Findable, Accessible, Interoperable, Reusable) 
  • Quality / Reliability / Trustworthiness of the products 
    • Simply being able to show the provenance of a product is enough to give it more value 
    • If the provenance information is detailed, the value will be higher 
  • Reproducibility requirement and Debugging aid 
    • Possibility to rerun each activity (maybe testing and improving each step) 
    • Not necessary to keep every intermediate file that is easily reproducible (possible gain on disk space and costs) 
    • Not necessary to restart from scratch: locate in the provenance tree the faulty parts of a process or the products to be discarded 
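
The last point can be illustrated with a minimal sketch, assuming the provenance is available as two simple mappings that record which entities each activity used and generated. The names USED, GENERATED and affected_entities, as well as the example pipeline and file names, are hypothetical and are not part of the IVOA data model or of any existing tool.

```python
from collections import deque

# Hypothetical toy provenance: each activity lists the entities it used and generated.
USED = {
    "calibrate": ["raw.fits"],
    "reconstruct": ["calibrated.fits"],
    "make_map": ["events.fits"],
    "make_catalog": ["other_input.fits"],
}
GENERATED = {
    "calibrate": ["calibrated.fits"],
    "reconstruct": ["events.fits"],
    "make_map": ["skymap.fits"],
    "make_catalog": ["catalog.fits"],
}


def affected_entities(faulty_activity, used, generated):
    """Return every entity derived, directly or indirectly, from a faulty activity."""
    affected = set()
    queue = deque(generated.get(faulty_activity, []))
    while queue:
        entity = queue.popleft()
        if entity in affected:
            continue
        affected.add(entity)
        # Follow the graph forward: any activity that used this entity
        # produced outputs that must also be discarded or regenerated.
        for activity, inputs in used.items():
            if entity in inputs:
                queue.extend(generated.get(activity, []))
    return affected


# Only the products downstream of the faulty step need attention;
# calibrated.fits and catalog.fits are untouched.
print(sorted(affected_entities("reconstruct", USED, GENERATED)))
```

With such a traversal, only the products downstream of the faulty step have to be regenerated, and the rest of the processing chain can be kept.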

We often realize too late that there are missing elements or links in the provenance. The capture of the provenance should thus be as detailed as possible and as naive as possible (simply record what happens). 
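
A naive capture of this kind could look like the sketch below, which simply records what happens each time a pipeline function runs. It is an illustration under stated assumptions, not one of the implementations presented at the workshop: the decorator record_provenance, the in-memory list PROVENANCE_LOG, the record keys and the calibrate function are all hypothetical.

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory store; a real project would write to a file or database.
PROVENANCE_LOG = []


def record_provenance(func):
    """Record the name, parameters, times and result of each call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "parameters": {"args": args, "kwargs": kwargs},
            "start_time": start,
            "stop_time": datetime.now(timezone.utc).isoformat(),
            "generated": result,
        })
        return result
    return wrapper


@record_provenance
def calibrate(raw_file, gain=1.0):
    # Placeholder for a real processing step: only the output name is derived.
    return raw_file.replace("raw", "calibrated")


calibrate("raw_001.fits", gain=1.2)
print(PROVENANCE_LOG[0]["activity"], "->", PROVENANCE_LOG[0]["generated"])
```

The record is deliberately unfiltered: deciding later which parts are relevant is easier than recovering information that was never captured.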


Terminology

Following the discussion, a clear need for some terminology and definitions emerged. The word "provenance" is used to refer to different aspects depending on the people or goals involved. Indeed, provenance information may be used for internal data management, or to improve the scientific exploitation of a data product. It may be stored inside a file or separately (external file or database). We propose here a base for the definition of provenance categories, to be further discussed. 

Provenance is related by definition to the origin of a product (where does it come from?), but also the path followed to generate this product (what has been done?). We discussed 3 main categories of provenance information: 

  • Full provenance: graph/tree/chain of activities and entities up to the raw data (following a standard data model). This information is not hosted by the entities themselves, it should be stored on an external server, or as separate files.
    Note: depending on the requirements of each project, the full provenance may or may not follow the execution/description distinction allowed by the standard data model.
     
  • Minimum provenance: attached to an entity, list of keywords that gives some context and info on last activity (general process/workflow, software versions, contacts...).
    Note: it would be interesting to include used entities, so that a full provenance may be reconstructed from each minimum provenance. However, such information on what was used may not be kept, or may not be complete.
     
  • End-user/specific “provenance”: attached to an entity, list of keywords or data that provides key information to use/analyse the entity (e.g. for CTA: event class, event type, telescope configuration, sky conditions, reconstruction method... or for observations: seeing, cloud coverage, telescope filter...).
    Note: may be extracted from full provenance (parameters used or entities generated at a given step), but it is considered data here. Conversely, this specific "provenance" information may be a source of information to be mapped in the standard in order to fill the full provenance graph with more details.
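
The note on the minimum provenance can be sketched as follows, assuming each entity carries a small keyword record that includes the identifiers of the entities used by its last activity. The keyword names (activity, software, contact, used), the function reconstruct_chain and the file names are illustrative assumptions, not an agreed list of keywords.

```python
# Hypothetical minimum provenance records, one per entity (e.g. read from file headers).
MINIMUM_PROVENANCE = {
    "skymap.fits": {"activity": "make_map", "software": "mapper 1.2",
                    "contact": "pipeline team", "used": ["events.fits"]},
    "events.fits": {"activity": "reconstruct", "software": "reco 0.9",
                    "contact": "pipeline team", "used": ["raw.fits"]},
    # raw.fits has no record: the chain stops at the raw data.
}


def reconstruct_chain(entity_id, records):
    """Walk back from an entity towards the raw data using only minimum provenance.

    Returns a list of (generated entity, activity, used entity) triples.
    Entities without a record, or whose record lacks "used", end the walk;
    this is where the reconstructed provenance would be incomplete.
    """
    triples = []
    seen = set()
    stack = [entity_id]
    while stack:
        current = stack.pop()
        if current in seen or current not in records:
            continue
        seen.add(current)
        record = records[current]
        for used_id in record.get("used", []):
            triples.append((current, record["activity"], used_id))
            stack.append(used_id)
    return triples


for generated, activity, used in reconstruct_chain("skymap.fits", MINIMUM_PROVENANCE):
    print(f"{generated} <- {activity} <- {used}")
```

If the "used" keyword is missing from any record, the walk stops early, which is the incompleteness mentioned in the note on minimum provenance.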


Proposed topics for working groups

  • Defining the content of a minimum provenance (list of keywords related to the last activity and context) 
  • Serializing provenance (both human and machine readable, explore YAML/VOTable/VOEvent formats) 
  • Provenance and workflows (links with CWL, mapping, or workflow information simply attached to provenance as entities) 
  • From provenance "on-top" to provenance "inside" (how to introduce provenance capture inside a pipeline) 
  • Provenance storage (DB and ingestion) 
  • Provenance exploration and visualization (access protocols, voprov or other tools)