This lesson is still being designed and assembled (Pre-Alpha version)

Access to the data

Overview

Teaching: 10 min
Exercises: 15 min
Questions
  • What is meant by ‘accessibility’?

  • What are relevant data repositories in climate science?

  • How to choose a data repository?

Objectives
  • Identify suitable repositories to make your research data accessible.

  • Plan access to the data in a data management plan.

What is access to data?

Accessibility addresses the A in the FAIR principles. Once users have found the data they need, they must know how the data can be accessed, possibly including authentication and authorisation. Accessible data objects can be obtained by humans and machines upon appropriate authorisation and through a well-defined and universally implementable protocol. In other words, anyone with a computer and an internet connection should be able to access at least the metadata.

This is what go-FAIR.org says about the Accessible principle:

  • A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
  • A1.1 The protocol is open, free, and universally implementable
  • A1.2 The protocol allows for an authentication and authorisation procedure, where necessary
  • A2. Metadata are accessible, even when the data are no longer available
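
As a minimal sketch of what A1 and A1.1 can look like in practice, the snippet below builds an HTTPS request that resolves a DOI through the doi.org resolver and asks for machine-readable metadata using content negotiation, which registration agencies such as DataCite and Crossref support. The function name and the DOI are made-up placeholders, and the request is only constructed, not sent.

```python
from urllib.request import Request

# Public DOI resolver, reached over HTTPS (an open, free protocol).
DOI_RESOLVER = "https://doi.org/"


def build_doi_metadata_request(doi: str) -> Request:
    """Build (but do not send) a request for the metadata behind a DOI."""
    # The Accept header asks for CSL JSON metadata instead of the
    # human-readable landing page.
    headers = {"Accept": "application/vnd.citationstyles.csl+json"}
    return Request(DOI_RESOLVER + doi, headers=headers)


# Placeholder DOI for illustration only; replace it with a real one.
request = build_doi_metadata_request("10.1234/example-dataset")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual lookup; whether metadata comes back depends on the DOI and its registration agency.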

Accessible does not mean open without constraint

Accessibility means that the human or machine is told, through metadata, the precise conditions under which the data are accessible, and that the mechanisms and technical protocols for data access are implemented such that the data and/or metadata can be accessed and used at scale, by machines, across the web.
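
To make "conditions provided through metadata" more concrete, here is a small sketch of an access section in a metadata record. The class, its field names, and the example values are hypothetical and not taken from any particular metadata standard; real repositories define their own vocabularies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessInfo:
    """Hypothetical access section of a dataset's metadata record."""
    access_right: str                  # e.g. "open", "embargoed", "restricted"
    protocol: str                      # e.g. "https"
    license: Optional[str] = None      # e.g. "CC-BY-4.0"
    requires_login: bool = False       # authentication needed before download?
    conditions: Optional[str] = None   # free-text access conditions


def describe_access(info: AccessInfo) -> str:
    """Summarise the access conditions in one human-readable line."""
    parts = [f"Access: {info.access_right} via {info.protocol}"]
    if info.requires_login:
        parts.append("login required")
    if info.license:
        parts.append(f"license: {info.license}")
    if info.conditions:
        parts.append(f"conditions: {info.conditions}")
    return "; ".join(parts)


# A restricted dataset is still "accessible" in the FAIR sense,
# because the metadata states exactly how access can be obtained.
restricted = AccessInfo(
    access_right="restricted",
    protocol="https",
    requires_login=True,
    conditions="request access from the data owner",
)
print(describe_access(restricted))
```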

FAIRness in climate science

Although FAIRness in climate science is advancing and some clearly accessible repositories exist, it is still common practice to make a dataset accessible only via the project website. Because projects do not last forever, some of these websites are no longer maintained, and the datasets they host become inaccessible.

In addition to large-scale general-use repositories such as Zenodo, there are also several domain-specific repositories. You may be familiar with ESGF (Earth System Grid Federation), CDS (the Copernicus Climate Data Store), or Climate4Impact.

Stages of accessibility

You submitted a paper, and the reviewer asks you to make your data accessible. You consider your options:

  1. You make a note in your paper that data can be requested by sending you an email
  2. You provide the data as supplementary information to your paper
  3. You upload the data to some cloud storage and put a shareable link in your paper
  4. You upload the data to a university drive and request a DOI from the library
  5. You upload the data to a generic repository such as Zenodo
  6. You upload the data to a domain-specific repository such as HydroShare

Describe which of these options adhere to the FAIR principles, focusing on the findability and accessibility criteria.

Solution

| Option | F1 | F2 | F3 | F4 | A1 | A1.1 | A1.2 | A2 | Verdict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 😢❌ | 😢❌ | 😢❌ | 😢❌ | 😢❌ | 😢❌ | 😢❌ | 😢❌ | Hard to find and difficult to access. |
| 2 | 🤔❓ | 🤔❓ | 😢❌ | 😢❌ | 😢❌ | 🤔❓ | 🤔❓ | 😢❌ | Data may be found indirectly through the paper's DOI, but there is no clear access protocol. |
| 3 | 😢❌ | 🤔❓ | 🤔❓ | 🤔❓ | 😢❌ | 🤔❓ | 🤔❓ | 😢❌ | Both finding and accessing the data will certainly take (human) effort. |
| 4 | 🥳✔ | 🤔❓ | 🤔❓ | 🤔❓ | 🥳✔ | 🥳✔ | 🥳✔ | 🤔❓ | This could probably work if the university offers enough support. |
| 5 | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | The best solution if there is no suitable domain-specific repository. |
| 6 | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | The additional focus makes it even easier for you to describe your data according to the relevant standards, and for others to find it. |

Data repositories

Let’s examine a few different data repositories. For each of Zenodo, ESGF, and CDS, answer the following questions:

  • Is this repository publicly accessible?
  • Is it free?
  • Is it proprietary?
  • Does it allow for authentication?
  • Is there any quality control for this repository?
  • Are metadata accessible, even if the data is not?
  • How long will the data be maintained?
  • Is there any backup system in place?
  • Is it easy to find data that is stored in this repository?
  • Is it easy to download data from this repository?
  • Is it easy to upload data to this repository?

Solution

| Question | Zenodo | ESGF | CDS |
| --- | --- | --- | --- |
| 1. Is this repository publicly accessible? | 🙂 Most (meta)data are publicly accessible, but there are also closed/restricted datasets. | 🥳 Yes, but account registration is required. | 🥳 Yes, but account registration is required. |
| 2. Is it free? | 🥳✔ | 🥳✔ | 🥳✔ |
| 3. Is it proprietary? | 🥳 No. A non-proprietary format is mandatory on Zenodo. | 🥳 No. Data are mostly available in NetCDF format. | 🥳 No. Most datasets are in NetCDF or GRIB format. |
| 4. Does it allow for authentication? | 🙂 A light authentication mechanism, such as an OAuth 2.0 access token, is acceptable in certain cases, e.g. high-traffic access. This authentication is acceptable as long as there is a totally open/anonymous route too. | 🙂 ESGF uses the OpenID and OAuth2 authentication system. Login credentials are required. An open source authentication client is also available. | 🙂 CDS uses its own API and authentication key. Login credentials are required. The API is open source. |
| 5. Is there any quality control for this repository? | 🙂 Although not mandatory, Zenodo highly recommends including qualified references to other (meta)data. This is quite common practice on Zenodo. | 😢 Citation links and references are missing for many datasets. | 🙂 Quite a few datasets are provided with citation/reference information. |
| 6. Are metadata accessible, even if the data is not? | 🙂 Yes, for datasets that apply this principle. | 🤔 Unknown. | 🤔 Unknown. |
| 7. How long will the data be maintained? | 🥳 For the lifetime of the host laboratory, CERN. | 🤔 Unknown. | 🤔 Unknown. |
| 8. Is there any backup system in place? | 🥳 12-hourly backup for metadata and persistent identifiers. | 🥳 Yes. Subsets of the data are replicated at Lawrence Livermore National Laboratory (LLNL) for backup. | 🥳 Yes, for at least all ECMWF data. |
| 9. Is it easy to find data that is stored in this repository? | 🥳 Yes. All data are findable via DOI and well documented. There are also plenty of filters available for searching. | 🤔 A PID is assigned per dataset, but the UI for data browsing is not friendly. Although metadata are provided, the datasets lack documentation. | 🥳 Yes. All data are findable via DOI. The documentation is rich. The search UI is quite friendly. |
| 10. Is it easy to download data from this repository? | 🥳 Yes, for data with open access. | 🥳 Yes. | 🥳 Yes. |
| 11. Is it easy to upload data to this repository? | 🥳 Yes, although one needs a Zenodo, GitHub, or ORCID account. | 🤔 In principle yes, but the procedures are complicated. To upload data to a certain project, one needs manual approval to join that project. | 🤔 In principle yes, but the procedures are complicated. One needs to contact the Copernicus Climate Change Service (C3S) and fill in a confirmation form first. |
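
The combination of an open route and an optional token (question 4) can be sketched as follows. This is an illustration under assumptions, not Zenodo's reference client: the function name, record number, and token are hypothetical placeholders, while the records endpoint and the bearer-token header follow Zenodo's public REST API as we understand it. The requests are only built, not sent.

```python
from typing import Optional
from urllib.request import Request

# Base URL of Zenodo's public records endpoint.
ZENODO_RECORDS_API = "https://zenodo.org/api/records/"


def build_record_request(record_id: int, token: Optional[str] = None) -> Request:
    """Build (but do not send) a request for one record's metadata."""
    request = Request(
        f"{ZENODO_RECORDS_API}{record_id}",
        headers={"Accept": "application/json"},
    )
    if token is not None:
        # Optional authentication; without a token the request is anonymous.
        request.add_header("Authorization", f"Bearer {token}")
    return request


# Placeholder record number and token, for illustration only.
anonymous = build_record_request(1234567)
authenticated = build_record_request(1234567, token="YOUR-ACCESS-TOKEN")
print(anonymous.full_url)
```

Both requests target the same URL; only the `Authorization` header differs, which is what lets open and restricted records share one protocol.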

What about your data?

When it comes to making data accessible, there are some easy steps that any researcher can apply to their own data. Other steps, however, have to be developed across a scientific domain, in particular the establishment of domain-specific data repositories.

  • What is the status in your specific (sub-)domain?
  • Are there suitable repositories?
  • Are they easy to use? What are the advantages/disadvantages?

Key Points

  • Accessible does not mean open without constraint.

  • Metadata can still be accessible, even if the data itself is not (anymore).

  • EU-funded projects are expected to make generated data accessible to the public.