Introduction
Overview
Teaching: 5 min
Exercises: 10 min
Questions
What are the FAIR principles?
Why should I care to be FAIR?
How do I get started?
Objectives
Identify the FAIR principles
Recognize the importance of moving towards FAIR in research
Relate the components of this lesson to the FAIR principles
What is FAIR?
The FAIR principles for research data, originally published in a 2016 Nature paper, are intended as “a guideline for those wishing to enhance the reusability of their data holdings.” This guideline has subsequently been endorsed by working groups, funding bodies and institutions.
FAIR is an acronym for Findable, Accessible, Interoperable, Reusable.
- Findable: others (both human and machines) can discover the data
- Accessible: others can access the data
- Interoperable: the data can easily be used by machines and in data analysis workflows
- Reusable: the data can easily be used by others for new research
The FAIR principles have a strong focus on “machine-actionability”. This means that the data should be easily readable by computers (and not only by humans). This is particularly relevant for working with and discovering new data.
What the FAIR principles are not
- A standard: the FAIR principles are guidelines to adopt and follow as much as possible, taking into account the research practices in your field.
- All or nothing: making a dataset (more) FAIR can be done in small, incremental steps.
- Open data: FAIR data does not necessarily mean openly available data. For example, some data cannot be shared openly because of privacy considerations. As a rule of thumb, data should be “as open as possible, as closed as necessary.”
- Tied to a particular technology or tool: different tools may enable FAIR data in different disciplines or research workflows.
This image was created by Scriberia for The Turing Way community and is used under a CC-BY licence. Source: https://doi.org/10.5281/zenodo.3695300
Discuss the different principles
Read the summary table of the 15 FAIR principles in Wilkinson et al. (See “Box 2: The FAIR Guiding Principles”). After reading this, answer the following questions:
- Are there any terms that you are unfamiliar with?
- Are there any principles that you could implement right away?
- What do you think would be the most challenging principle to implement?
Why FAIR?
The original authors of the FAIR principles had a strong focus on enhancing reusability of data. This ambition is embedded in a broader view on knowledge creation and scientific exchange. If research data are easily discoverable and re-usable, this lowers the barriers to repeat, verify, and build upon previous work. The authors also state that this vision applies not just to data, but to all aspects of the research process. This is visualized in the image below.
This image was created by Scriberia for The Turing Way community and is used under a CC-BY licence. Source: https://doi.org/10.5281/zenodo.3695300
What’s in it for you?
FAIR data sounds like a lot of work. Is it worth it? Here are some of the benefits:
- It helps you meet funder requirements
- It makes your work more visible
- It increases the reproducibility of your work
- If others can easily use your data, you will get cited more often
- You can create more impact if it is easier for others to use your data
- …
Getting started with FAIR (climate) data
As mentioned above, the FAIR principles are intended as guidelines to increase the reusability of research data. However, how they are applied in practice depends very much on the domain and the specific use case at hand.
For the domain of climate sciences, some standards have already been developed that you can use right away. In fact, you might already be using some of them without realizing it. NetCDF files, for example, already implement some of the FAIR principles around data modeling. But sometimes you need to find your own way.
This tutorial contains six thematic episodes that cover the basic steps to making data FAIR. We will explain how these topics relate to the FAIR principles, what you can already do today, and where appropriate, we will point out existing standards for climate data.
The episodes are:
- Documentation: helps to make data reusable, especially for humans
- Metadata: similar to documentation, although typically more structured and standardized
- Data formats: highly relevant for interoperability and long-term preservation of data
- Data access: data repositories as well as software can greatly increase data access and findability
- Identifiers: very important for reproducibility, but also for data access and findability
- Licences: well-chosen licences facilitate re-use and make sure you get the credit you deserve (if you so desire)
Evaluate one of your own datasets
Pick one dataset that you’ve created or worked with recently, and answer the following questions:
- If somebody gets this dataset from you, would they be able to understand the structure and content without asking you?
- Do you know who has access to this dataset? Could somebody else easily get access to it? How?
- Does this dataset need proprietary software to be used?
- Does this dataset have a persistent identifier or usage licence?
Further reading
- Wilkinson et al. (2016) The FAIR Guiding Principles for scientific data management and stewardship. doi:10.1038/sdata.2016.18
- Mons et al. (2017) Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. doi:10.3233/ISU-170824
- FORCE11
- GO FAIR initiative
- EU H2020 Guidelines on FAIR Data Management
Key Points
The FAIR principles state that data should be Findable, Accessible, Interoperable, and Reusable.
FAIR data enhance the impact, reuse, and transparency of research.
FAIRification is an ongoing effort across many different fields.
FAIR principles are a set of guiding principles, not rules or standards.
Documentation
Overview
Teaching: 5 min
Exercises: 15 min
Questions
What is documentation?
Where to document my data?
What is the difference between documentation and metadata?
Objectives
Evaluate existing data documentation with a critical view towards reusability
Ensure that others can re-use your data by writing clear documentation.
What is documentation?
Source: https://xkcd.com/833/
Data documentation is all relevant information needed to properly interpret a dataset. Documentation can be very specific about the data. For example, what does this dataset represent, what units is it in, is it a time average, what is the relation between two data points (subsequent in time, distinct in space, same or different variable), etc. The documentation can also answer some questions about the project: why were the data collected, by whom, and what questions do they address? Was there any kind of quality assurance?
Where to document my data?
Documentation can take many forms: a data management plan, published papers, (Jupyter) notebooks, lab journals, provenance information, etc. It is a good idea to bundle these with the data when you publish them, so that people who find your data will not have to search for this information elsewhere, and access to the documentation is guaranteed. Most data repositories have a ‘description’ field, which is a suitable place to provide any information that is not covered by the other metadata fields.
Grouping related files with Zenodo communities
Data documentation in your project may be scattered over many files of different types. To bundle this information, Zenodo allows you to upload a set of related files (hundreds if needed) as a single dataset.
In addition, Zenodo provides ‘communities’. You can upload all your datasets to Zenodo and add them to a community, thus keeping track of all relevant data for an entire project. You can also track versions. Each dataset gets its own DOI, so you could, for example, cite your data management plan in the description field of your dataset.
What is the difference between metadata and documentation?
Following https://howtofair.dk, we discuss data documentation and metadata separately. However, the distinction between the two is not always clear. In this episode, we will focus on the rich, descriptive kind of documentation that is mostly relevant for human interpretation. In the next episodes, we will focus on more formalized metadata that is also, or even mainly, intended for machine readability.
Exercises
Checking data documentation
Visit a data repository of your choosing (e.g. HydroShare, 4TU.ResearchData, Zenodo, …). Select one or more datasets that you find interesting, and answer the following questions:
- Is the title of the dataset entry in the repository clear and informative?
- Is there a general description of the data?
- Is there a reference to some external documentation of the data (e.g. a journal publication)?
- Would you be able to reproduce the research based on the provided information?
- Would you be able to use this dataset in your own (hypothetical) research project?
- Are contact details provided that you can use in case you have questions?
- Is there any description of the context in which this dataset was created?
- What could be improved about the documentation of this dataset?
Writing your own data documentation
For the real use case you selected in the previous episode, write a general description of the dataset(s) that will be produced. This description may include references to external documents. Also include the steps taken to obtain the final result (‘data flow’). Make sure it will be adequate for others to understand how the data have been created. Also try to formulate a suitable title for the dataset(s).
You may also write down some more technical aspects of the data, such as variable names, units, grids, creation date, software versions, etc.
Limit your description to a single page (max).
Key Points
Documentation provides rich contextual information mostly intended for human readers.
Documentation is essential for data reuse and reproducibility.
Metadata
Overview
Teaching: 15 min
Exercises: 15 min
Questions
What are metadata?
What is an ontology or controlled vocabulary?
What are metadata standards?
Objectives
Understand metadata and their role in FAIR data
Recognize different types of metadata
Choose an appropriate metadata standard for your data
Metadata
Metadata are data about data. In other words, metadata are the underlying definition or description of data. For example, author name, date created, date modified, and file size are very basic metadata for a file.
Metadata make finding and working with the data easier. Therefore, they are essential components in making your data FAIR. One could argue that metadata are even more important than your data: without metadata, the data would be just numbers, whereas without the original data, the metadata can still be useful to trace the people, institutions or publications associated with the original research. From a FAIR perspective, metadata should always be openly available.
Metadata can be created manually, or automatically generated by the software or equipment used and preferably according to a disciplinary standard. While data documentation is meant to be read and understood by humans, metadata are primarily meant to be processed by machines. There is no FAIR data without machine-actionable metadata.
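Much of this basic metadata can be generated automatically. As a small illustration (not part of any standard workflow), here is a Python sketch that reads file-system metadata; climate.nc is a hypothetical file name, so substitute one of your own:

```python
from datetime import datetime, timezone
from pathlib import Path

path = Path("climate.nc")  # hypothetical file; replace with your own

# Filesystem metadata recorded automatically by the operating system
stat = path.stat()
print("file name:    ", path.name)
print("file size:    ", stat.st_size, "bytes")
print("date modified:", datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc))
```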
FAIR principles about metadata
Let’s have a look at FAIR principles.
- Which principles focus on metadata?
Solution
Because metadata are data about data, all of the principles, i.e. Findable, Accessible, Interoperable and Reusable, apply to metadata as well.
Three types of metadata
We focus on three main types of metadata:
- Administrative metadata help manage a resource or a project and indicate when and how the data were created. Examples are the project/resource owner, principal investigator, project collaborators, funder, project period, permissions, etc. They are usually assigned to the data before you collect or create them.
- Descriptive or citation metadata help to discover and identify data. A very good example are keywords, which are often added to data or publications with the sole purpose of making them more findable (e.g. with a search engine). Other examples are the authors, title, abstract, persistent identifier, related publications, etc.
- Structural metadata describe how a dataset or resource came about, and also how it is internally structured. They address the ‘I’ and ‘R’ in FAIR. Examples are measurement units, the data collection method, sampling procedure, sample size, categories, variables, etc. Structural metadata have to be created according to best practices in a research community and are published together with the data.
Keep your metadata up-to-date!
- Descriptive and structural metadata should be added continuously throughout the project.
- Different types of metadata apply not only to a database, but also to individual sets of data, e.g. images/plots, tables, files, etc.
Types of metadata for geospatial data in your community/research team
Here are some questions about the use case you chose in the introduction of this tutorial.
- What types of metadata are used in your use case?
- What information is described by those metadata?
Ontology
An ontology is a formal definition of key concepts in your community/research team and of how those concepts relate to one another. A controlled vocabulary is simpler: a fixed set of terms that you have to pick from (see the sketch after this list). Using an ontology or controlled vocabulary:
- helps others to understand the structure and content of your data,
- makes your data findable, interoperable and reusable.
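As an illustration, the sketch below labels a variable with air_temperature, a term from the CF standard name table. It assumes the xarray and numpy Python libraries; any other netCDF tool would work equally well:

```python
import numpy as np
import xarray as xr

# Annotate a variable with terms from the CF standard name table,
# a controlled vocabulary for climate and forecast data.
temperature = xr.DataArray(
    np.random.rand(3, 4),
    dims=("time", "station"),
    attrs={
        "standard_name": "air_temperature",  # term from the CF vocabulary
        "units": "K",
    },
)
print(temperature.attrs)
```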
A controlled vocabulary for climate and forecast data
In climate-related domains, many variables depend on the type of surface. How can you specify the surface type in the metadata?
Solution
Climate and Forecast metadata (CF conventions) maintains a vocabulary specifically for specifying surface and area types. The vocabulary is available on the CF site as the Area Type Table.
Metadata standards
A metadata standard (or convention) is a subject-specific guide to your data. It includes rules on what content must be included, which syntax should be used, and which controlled vocabulary to draw terms from. The quality of your metadata has a huge impact on the reusability of your research data. It is best practice to use a metadata standard and/or an ontology commonly used in your community/research team.
Some of the recognized metadata standards for climate-related domains are:
- Climate and Forecast metadata (CF conventions)
- World Meteorological Organization Core Metadata Profile (WMO-CMP)
- Generic Earth Observation Metadata Standard (GEOMS)
- Cooperative Ocean-Atmosphere Research Data Service Conventions (COARDS)
- Water Markup Language (WaterML)
- Shoreline Metadata Profile of the Content Standards for Digital Geospatial Metadata (SMP-CSDGM)
A recognized metadata convention for climate data
Let’s do a search for the keyword climate in FAIR standards.
- Which metadata convention did you find?
- Which file formats allow you to include the metadata?
- What are other domains that use this convention?
Solution
- Climate and Forecast metadata (CF conventions)
- NetCDF
- Atmospheric science, earth science, natural science, and oceanography
No/incomplete metadata standards in your research team
Imagine you are working in a lab, and you want to use metadata describing processes that produce data. However, the available standards in your research team are not specifically suited for that purpose.
How can you define a relevant metadata scheme? What would you do if there was no standard in your research team?
Sensitive data
You cannot openly publish sensitive data. However, you can always publish rich metadata about your data. Publishing metadata helps you to make clear under which conditions the data can be accessed and how they may be reused.
Key Points
From a FAIR perspective, metadata are more important than your data.
Metadata are preferably created according to a disciplinary standard.
To be FAIR, metadata must have a findable persistent identifier.
File format
Overview
Teaching: 15 min
Exercises: 20 min
Questions
What is a file format?
What file formats should I use?
Objectives
Name common formats used in climate-related domains.
Understand different types of data.
Choose correct file formats for your work.
File format
A file format is the structure of a file. It determines how the data within the file are organized. Files are usually named as filename.format. For example, climate.jpeg is a file named climate with the jpeg format. You might know that jpeg is a commonly used format for digital images; therefore, the file climate is probably an image.
Each file type, such as images, video, and text, can be stored in several formats: for example, climate.jpeg, climate.png, and climate.gif. All three formats store image data, but they differ in compression, available colors, resulting file size, etc. (for more information see this page). When deciding which one to use, it is important to weigh the advantages and disadvantages of each.
Also, formats may be dependent on particular software. When data is stored from a software program, it is usually saved in that program’s standard file format. One example is creating tabular data using spreadsheet software. Tabular data have specific properties that are better supported by the spreadsheet software than a word processor.
No extension or multiple extensions!
In UNIX-like operating systems, a file can have no extension, or more than one. For example, in filename.tar.gz, the .tar indicates that the file is a tar archive of one or more files, and the .gz indicates that the archive is compressed with gzip. Also, programs reading files usually ignore the extension; it is mostly intended for the human user. In Windows, a file should have at least one extension.
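A quick way to inspect extensions programmatically is Python’s pathlib, shown here as a small illustration:

```python
from pathlib import Path

# A file name can carry zero, one, or several extensions.
print(Path("filename.tar.gz").suffixes)  # ['.tar', '.gz']
print(Path("filename").suffixes)         # []
```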
File formats for geospatial data in your community/research team
Here are some questions about the use case you chose in the introduction.
- What is the format of the geospatial data?
- Do you know why the data is in that format?
This is what go-FAIR.org has to say about Format
- I1: (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. For example, data are provided in commonly understood and preferably open formats.
- R1.3: (meta)data meet domain-relevant community standards e.g. those for data formats.
Recommended formats for different types
There are several data types, such as geospatial, tabular, storyline, documentation and paper, textual, video/audio, and image/figure. Some formats are more in line with the FAIR principles than others. Here are some recommended formats for different data types:
| Data type | Data format |
|---|---|
| Geospatial data | NetCDF |
| Tabular data | SQLite |
| Textual data | Markdown |
| Image data | TIFF 6.0 uncompressed |
| Audio data | Lossless Audio Codec (FLAC) |
| Video data | MPEG-4 |
| Documentation | Microsoft Word or PDF |
| Papers & Articles | LaTeX |
| Storyline Definitions | JSON |
Other data types
Have a look at the list of the recommended formats for different types (mentioned above) and pick a data type.
- What are the other formats for the type that you selected?
- Which of those formats are common in your community?
- Discuss some of the advantages and disadvantages of those formats.
Data Management Plan
A Data Management Plan (DMP) covers how data will be stored, described and reused. For example, see DMPonline or DSW.
Different formats for different purposes
In the previous sections, we learned that different file formats have different properties. The purpose of a file should help determine which file format to choose. In the example of creating tabular data using spreadsheet software, there is no guarantee that the tabular data can be used or displayed in the future, because the software can become obsolete or support only specific format versions.
It is good to plan at the beginning of your project, what file formats to use for each purpose:
- data collection / processing / analysis,
- reuse: the longer you want to use the data, the more you have to use open, standard and well-documented file formats to avoid obsolescence.
- preservation: many journals, archives and data repositories require that data are uploaded in certain file formats. Therefore, you may have to keep some data files in multiple formats.
Storing the data in an open and widely supported format gives it the greatest re-usability in the future.
Proprietary and non-proprietary formats
File formats may be either proprietary or non-proprietary (open or free):
A proprietary format is owned by a company, organization or individual. Its specifications are usually not publicly available, and the risk of obsolescence is high. If you need to store data in a proprietary format for a reasonable time, consider including a readme.txt file that documents the name and version of the software used to generate the file, as well as information about the company that made it.
An open format is a file format that is published and free for everybody to use.
Good practice for format selection
We want to create a checklist for choosing a format that improves the FAIRness of the data. What items should be included?
Solution
- non-proprietary
- open or widely supported format
- standard and well-documented
- supports relevant metadata
- commonly used by your research community
Common file formats for geospatial data
In climate-related domains, e.g. weather and climate science, earth observation science, or hydrology, data come in many types and serve different purposes. In this section, we introduce some common and widely accepted data formats.
NetCDF:
NetCDF was originally developed for the Earth science community, but it can be used for many kinds of data. It views the world of scientific data the way a geoscientist might: there are various quantities, such as temperature or elevation, located at points at particular coordinates in space and time. The quantities (here temperature or elevation) are stored as netCDF variables, whereas coordinate information is stored as netCDF dimensions. The metadata, such as the units, are stored as netCDF attributes.
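This data model maps directly onto code. Below is a minimal sketch using the netCDF4 Python library (one tool choice among several; the file name and values are made up) that writes a small file with a dimension, variables, and attributes:

```python
import numpy as np
from netCDF4 import Dataset

# Create a minimal, self-describing netCDF file: dimensions hold the
# coordinates, variables hold the quantities, attributes hold the metadata.
with Dataset("temperature.nc", "w") as nc:
    nc.createDimension("time", 3)

    time = nc.createVariable("time", "f8", ("time",))
    time[:] = [0, 6, 12]
    time.units = "hours since 2000-01-01 00:00:00"

    temp = nc.createVariable("temperature", "f4", ("time",))
    temp[:] = np.array([271.3, 272.1, 274.8])
    temp.units = "K"                        # metadata stored as attributes
    temp.standard_name = "air_temperature"  # CF-convention vocabulary term
```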
GRIB:
GRIB stands for General Regularly-distributed Information in Binary form. It is commonly used by the World Meteorological Organization (WMO) for weather model data, and it is used operationally worldwide by most meteorological centers for Numerical Weather Prediction output. Second-generation GRIB is used, for example, in EUMETCast dissemination of Meteosat Second Generation data; another example is the North American Mesoscale model.
GeoTIFF:
GeoTIFF is a standard image file format for describing and storing raster image data with embedded geographic information, so it can be used by Geographic Information System (GIS) applications. It is suitable for a wide range of applications worldwide, for example satellite imaging systems, scanned aerial photography, scanned maps, digital elevation models, or the results of geographic analyses. GeoTIFF 1.1, for instance, is an approved NASA Earth Science Data Systems standard (see NASA Standards and Practices).
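As a small, hedged illustration, here is how a GeoTIFF and its embedded geographic information might be read with the rasterio Python library (an assumed tool choice; example.tif is a placeholder):

```python
import rasterio

# The georeferencing travels with the raster data inside the file.
with rasterio.open("example.tif") as src:
    print(src.crs)        # coordinate reference system
    print(src.transform)  # mapping from pixel indices to map coordinates
    band = src.read(1)    # first raster band as a NumPy array
    print(band.shape)
```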
HDF5:
The Hierarchical Data Format version 5 (HDF5) implements a model for managing and storing data, developed by the National Center for Supercomputing Applications (NCSA). HDF5 is a general-purpose, machine-independent standard for storing scientific data in files. An HDF5 structure is self-describing, allowing an application to interpret the structure and contents of a file without any outside information. As an example, NASA’s Earth Observing System, the primary data repository for understanding global climate change, uses HDF5; for more information see this page.
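To see the self-describing nature of HDF5 in action, the sketch below uses the h5py Python library (an assumed tool choice; example.h5 is a placeholder) to list a file’s structure and attributes without any outside information:

```python
import h5py

with h5py.File("example.h5", "r") as f:
    f.visit(print)  # print the name of every group and dataset in the file
    for key, value in f.attrs.items():  # file-level attributes (metadata)
        print(key, "=", value)
```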
Other formats for geospatial data
There are many other formats for geospatial data, such as SHP (shapefiles) for vector data, DBF (database files), and Zarr data (similar in data model to NetCDF), etc.
Select a data format
Let’s have a look at the case study that you selected in the introduction of this tutorial. Assume that the authors want to publish their data. What suggestions would you give the authors for the data format?
Key Points
Choose formats that are common to your field/community to ensure the interoperability and reusability of your data.
Make sure that the file formats you choose can hold the necessary data elements and information.
Decide how long you intend to preserve your data.
Make sure to check the requirements of the repository where the data will be stored.
Access to the data
Overview
Teaching: 10 min
Exercises: 15 min
Questions
What is meant by ‘accessibility’?
What are relevant data repositories in climate science?
How to choose a data repository?
Objectives
Identify suitable repositories to make your research data accessible.
Plan access to the data in a data management plan.
What is access to data?
Accessibility addresses the A in the FAIR principles. Once users find the required data, they need to know how the data can be accessed, possibly including authentication and authorisation. Accessible data objects can be obtained by humans and machines upon appropriate authorisation and through a well-defined and universally implementable protocol. In other words, anyone with a computer and an internet connection should be able to access at least the metadata.
This is what go-FAIR.org has to say about Accessible
- A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
- A1.1 The protocol is open, free, and universally implementable
- A1.2 The protocol allows for an authentication and authorisation procedure, where necessary
- A2. Metadata are accessible, even when the data are no longer available
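As an illustration of A1, metadata can be retrieved for a DOI over plain HTTPS using DOI content negotiation. The sketch below assumes the requests Python library and network access, and reuses the Turing Way DOI cited earlier in this lesson:

```python
import requests

# Ask the DOI resolver for machine-readable metadata instead of a landing page.
doi = "10.5281/zenodo.3695300"
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
)
metadata = response.json()
print(metadata["title"])
```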
Accessible does not mean open without constraint
Accessibility means that the human or machine is provided - through metadata - with the precise conditions by which the data are accessible and that the mechanisms and technical protocols for data access are implemented such that the data and/or metadata can be accessed and used at scale, by machines, across the web.
FAIRness in climate science
Although FAIRness in climate science is advancing and some clearly accessible repositories exist, it is still common practice to make a dataset accessible only via the project website. Because projects do not last forever, some of these websites are no longer maintained, resulting in inaccessible datasets.
Next to large-scale general-use repositories such as Zenodo, there are also several domain-specific repositories. You may be familiar with ESGF, CDS, or Climate4Impact.
Stages of accessibility
You submitted a paper, and the reviewer asks you to make your data accessible. You consider your options:
- You make a note in your paper that data can be requested by sending you an email
- You provide the data as supplementary information to your paper
- You upload the data to some cloud storage and put a shareable link in your paper
- You upload the data to a university drive and request a DOI from the library
- You upload the data to a generic repository such as Zenodo
- You upload the data to a domain-specific repository such as HydroShare
Describe which of these are adhering to the FAIR principles, focusing on the findability and accessibility criteria.
Solution
| Option | F1 | F2 | F3 | F4 | A1 | A1.1 | A1.2 | A2 | Verdict |
|---|---|---|---|---|---|---|---|---|---|
| 1. | 😢❌ | 😢❌ | 😢❌ | 😢❌ | 😢❌ | 😢❌ | 😢❌ | 😢❌ | Hard to find and difficult to access. |
| 2. | 🤔❓ | 🤔❓ | 😢❌ | 😢❌ | 😢❌ | 🤔❓ | 🤔❓ | 😢❌ | Data may be found indirectly through the paper’s DOI, but there is no clear access protocol. |
| 3. | 😢❌ | 🤔❓ | 🤔❓ | 🤔❓ | 😢❌ | 🤔❓ | 🤔❓ | 😢❌ | Both data findability and accessibility will certainly take (human) effort. |
| 4. | 🥳✔ | 🤔❓ | 🤔❓ | 🤔❓ | 🥳✔ | 🥳✔ | 🥳✔ | 🤔❓ | This could probably work if the university offers enough support. |
| 5. | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | The best solution if there is no suitable domain-specific repository. |
| 6. | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | 🥳✔ | The additional focus makes it even easier for you to describe your data according to the relevant standards, and for others to find it. |
Data repositories
Let’s examine a few different data repositories. For each of the following repositories (Zenodo, ESGF, CDS), answer the following questions:
- Is this repository publicly accessible?
- Is it free?
- Is it proprietary?
- Does it allow for authentication?
- Is there any quality control for this repository?
- Are metadata accessible, even if the data is not?
- How long will the data be maintained?
- Is there any backup system in place?
- Is it easy to find data that is stored in this repository?
- Is it easy to download data from this repository?
- Is it easy to upload data to this repository?
Solution
| Question | Zenodo | ESGF | CDS |
|---|---|---|---|
| 1. Is this repository publicly accessible? | 🙂 Most (meta)data are publicly accessible, but there are also closed/restricted datasets. | 🥳 Yes, but account registration is required. | 🥳 Yes, but account registration is required. |
| 2. Is it free? | 🥳✔ | 🥳✔ | 🥳✔ |
| 3. Is it proprietary? | 🥳 No. A non-proprietary format is mandatory on Zenodo. | 🥳 No. Data are mostly available in NetCDF format. | 🥳 No. Most datasets are in NetCDF or GRIB format. |
| 4. Does it allow for authentication? | 🙂 A light authentication mechanism, such as a token (via OAuth 2.0 access token), is acceptable in certain cases, e.g. high-traffic access, as long as there is a totally open/anonymous route too. | 🙂 ESGF uses the OpenID and OAuth2 authentication system. Login credentials are required. An open source authentication client is also available. | 🙂 CDS uses its own API and authentication key. Login credentials are required. The API is open source. |
| 5. Is there any quality control for this repository? | 🙂 Although not mandatory, Zenodo highly recommends including qualified references to other (meta)data, and this is quite common practice. | 😢 Citation links and references are usually missing for many datasets. | 🙂 Quite some datasets are provided with citation/reference information. |
| 6. Are metadata accessible, even if the data is not? | 🙂 Yes, for some datasets which apply this principle. | 🤔 Unknown. | 🤔 Unknown. |
| 7. How long will the data be maintained? | 🥳 Lifetime of the host laboratory CERN. | 🤔 Unknown. | 🤔 Unknown. |
| 8. Is there any backup system in place? | 🥳 12-hourly backup of metadata and persistent identifiers. | 🥳 Yes. Subsets of the data are replicated at Lawrence Livermore National Laboratory (LLNL) for backup. | 🥳 Yes, for at least all ECMWF data. |
| 9. Is it easy to find data that is stored in this repository? | 🥳 Yes. All data are findable via DOI and well documented. There are also plenty of filters available for searching. | 🤔 A PID is assigned per dataset, but the UI for data browsing is not friendly. Although metadata are provided, the datasets lack documentation. | 🥳 Yes. All data are findable via DOI. The documentation is rich and the search UI is quite friendly. |
| 10. Is it easy to download data from this repository? | 🥳 Yes, for data with open access. | 🥳 Yes. | 🥳 Yes. |
| 11. Is it easy to upload data to this repository? | 🥳 Yes, although one needs a Zenodo, GitHub, or ORCID account. | 🤔 In principle yes, but the procedure is complicated: to upload data to a certain project, one needs manual approval to join that project. | 🤔 In principle yes, but the procedure is complicated: one needs to contact the Copernicus Climate Change Service (C3S) and fill in a confirmation form first. |
What about your data?
When it comes to making data accessible, there are some easy steps that can be applied by any researcher to their own data. Some steps, however, have to be developed across a scientific domain, in particular the establishment of domain-specific data repositories.
- What is the status in your specific (sub-)domain?
- Are there suitable repositories?
- Are they easy to use? What are the advantages/disadvantage?
Key Points
Accessible does not mean open without constraint.
Metadata can still be accessible, even if the data itself is not (anymore).
EU-funded projects are expected to make generated data accessible to the public.
Persistent identifiers
Overview
Teaching: 10 min
Exercises: 5 min
Questions
What is a persistent identifier (PID)?
Which data need a PID?
How do you get a PID?
Objectives
Obtain a PID for your data
What is a persistent identifier?
Persistent identifiers (PIDs) are required to address the F in FAIR. They are used to identify digital resources using a unique and standardized label. PIDs are persistent, meaning that we will be able to find the data long into the future, even if the actual location (e.g. URL) or name of the resource changes. Well known PIDs are the digital object identifiers (DOI) and the Handle System. In fact, a DOI is a type of Handle System handle.
Assigning a PID to your data makes sense. It will make the data easier to find, promotes reproducibility and reuse, makes it possible to track the impact, and enables long-term availability of the data.
Anatomy of a DOI
A DOI takes the form of a string with two parts, a prefix and a suffix, divided by a slash. Usually, a prefix corresponds to the publisher, and the suffix to the resource.
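A minimal illustration in Python, using the DOI of the FAIR principles paper listed under Further reading:

```python
# A DOI is "<prefix>/<suffix>": the prefix identifies the registrant,
# the suffix identifies the resource.
doi = "10.1038/sdata.2016.18"
prefix, suffix = doi.split("/", 1)
print("prefix:", prefix)  # 10.1038 -> registered to the publisher
print("suffix:", suffix)  # sdata.2016.18 -> the article itself
```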
Exercise
- Where does the above linked DOI point to?
- Do you think this DOI will still be valid in 50 years?
- Who is responsible to update the DOI if you get a dead link?
- What do you think should happen if the journal ceases to exist?
Solution
- The DOI above points to an article: The ERA‐Interim reanalysis: configuration and performance of the data assimilation system, by D. P. Dee et al.
- I expect the DOI to remain valid, but not necessarily the URL it points to.
- The journal that registered the DOI, in this case the Royal Meteorological Society, should update the URL the DOI resolves to. The DOI itself is persistent.
- If the journal ceases to exist, one can still access the metadata assigned to this DOI. With this metadata, e.g. the author list or linked publications, it may still be possible to retrieve this paper from other resources.
Which data need a PID?
Ideally, all research objects (publications, software, observations, model output, figures, etc.) have PIDs, so that the entire workflow can be traced through the links between them. Additionally, it often makes sense to make separate entries for raw and processed data.
Making raw data available enables other researchers to re-use your data, and to develop their own models and ideas. Making the processed data available serves to promote reproducibility, and enables others to build upon your results. In climate science the large intercomparison projects are a prime example of the latter.
One of the challenges is how to deal with ‘dynamic data’, which is also highly relevant for climatological data. How do you make PIDs for data that is continuously evolving? For the CMIP6 project, a pragmatic approach was taken, consisting of a global PID for the entire, evolving data superset, and specific PIDs for citing specific subsets.
Discussion
Consider your own use case (that you selected in episode 1):
- Is there any input data? Does it have a DOI?
- Is the output data part of a (large) project that has some central policy on DOIs (such as CMIP6?)
- Did you (or do you plan to) make a DOI available for the (anticipated) output data?
- Why/Why not?
How do you get a PID?
Getting a PID (usually a DOI) for your data is as easy as publishing your data to a data archive: after the upload, a PID is automatically assigned. Note that just getting a PID is not enough to make the data FAIR. Without metadata, it will still be difficult to find the data. That is why many data archives make it easy to add metadata, such as a list of authors, related projects, linked publications or other related items.
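This can even be scripted: Zenodo, for example, offers a REST API for creating depositions. The sketch below is an assumption-laden illustration (you need a personal access token from your Zenodo account, and the exact response fields may differ), not a definitive recipe:

```python
import requests

ACCESS_TOKEN = "your-token-here"  # placeholder; create one in your Zenodo settings

# Create an empty deposition; Zenodo responds with its metadata,
# which (at the time of writing) includes a pre-reserved DOI.
response = requests.post(
    "https://zenodo.org/api/deposit/depositions",
    params={"access_token": ACCESS_TOKEN},
    json={},
)
deposition = response.json()
print(deposition.get("id"))
print(deposition.get("metadata", {}).get("prereserve_doi"))
```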
Below you will find a list of resources to find a trusted or recommended repository.
- OpenAIRE: Find the appropriate repository to deposit your research products of any type (publication, data, software, other) or to include in your data management plan. Search and browse OpenAIRE-compliant repositories registered in OpenDOAR and re3data. OpenAIRE is an EC-funded network of Open Access repositories, archives and journals that support Open Access policies.
- Zenodo: a general-purpose open-access repository developed under the European OpenAIRE program and operated by CERN. It allows researchers to deposit datasets, research software, reports, and any other research-related digital artifacts.
- FAIRsharing catalogue of databases: FAIRsharing is a web-based, searchable portal of three interlinked registries, containing both in-house and crowdsourced manually curated descriptions of standards, databases and data policies, combined with an integrated view across all three types of resource.
- Registry of Research Data Repositories (re3data.org): an Open Science tool that offers researchers, funding organizations, libraries and publishers an overview of existing international repositories for research data.
- European Research Council (ERC) Open Access: curated ERC webpage on policies on open access and data storage, including a list of recommended repositories.
Linked data
Remember that we said data should be FAIR for machines as well as for humans? By assigning PIDs to our datasets (and publications), large networks of linked data and information can be created automatically.
Further Reading
- https://www.go-fair.org/fair-principles/f1-meta-data-assigned-globally-unique-persistent-identifiers/
- https://librarycarpentry.org/lc-fair-research/02-findable/index.html
- https://datascience.codata.org/articles/10.5334/dsj-2017-030/
Key Points
Promote reproducibility by assigning persistent identifiers to your processed data.
Promote reuse by adding a persistent identifier to your raw data.
If your data do not have a PID, they will not be FAIR!
Data licenses
Overview
Teaching: 20 min
Exercises: 10 min
Questions
What is a data license?
Which data license should I use for climate data?
What are the consequences of (not) using a data license?
Who decides which license I can use?
Objectives
Find existing license policies that may apply to your data
Choose a suitable license given these constraints
Apply a license to your data
General info on licenses
Let’s get started with a small exercise.
Choosing a license
Go to https://chooser-beta.creativecommons.org/ and follow the steps.
- What license did you arrive at?
- Does the license fit with the FAIR principles?
- Discuss.
One of the key aspects of FAIR data is that they are reusable not only practically, but also legally. By attaching a license to your work, you specify who can reuse your data, for which purposes, and what should be done with derived work. Creative Commons (CC) licenses are the most widely used licenses in research. They are widely recognized, easy to apply, and legally sound. The CC licenses are built on four cornerstones:
- Attribution (BY): Credit must be given to you, the creator.
- Share-Alike (SA): Adaptations must be shared under the same terms.
- Non-Commercial (NC): Only noncommercial use of your work is permitted.
- Non-Derivative (ND): No derivatives or adaptations of your work are permitted.
The six CC licenses are combinations of these cornerstones.
Note
Without a license, you keep the copyright to yourself, which effectively prohibits anyone else from using the data in their own work.
These are some points to consider when choosing a license:
- A license cannot be revoked once the data has been shared.
- As the original author, you can change the license for future publications.
- Sensitive data / embargo
- The fewer restrictions you add to your data, the more it can be re-used
Further reading:
- https://www.dcc.ac.uk/guidance/how-guides/license-research-data
- https://www.openaire.eu/research-data-how-to-license/
Discussion
- When was the last time you published some data?
- What license did you distribute it under?
- Where did you share the data?
Licenses in geoscience
In general, data sources in geoscience such as CMIP5 and CMIP6 are available under licenses in line with the FAIR principles. That is, you can freely use the data, but there are still some requirements. For example, if you base your research on CMIP6 data, you must give proper credit to the project when you deposit your research data somewhere. This is specified in the Terms of Use.
You can find the Terms of Use for some widely used projects here:
- CMIP6: https://pcmdi.llnl.gov/CMIP6/TermsOfUse/TermsOfUse6-1.html
- CMIP5: https://pcmdi.llnl.gov/mips/cmip5/terms-of-use.html
- CORDEX: http://is-enes-data.github.io/cordex_terms_of_use.pdf
Discussion
- How do the terms of use for CMIP6 data differ from the Creative Commons licenses?
- Can you reshare the data?
- Should you attribute the authors of the data? If so, how?
- Can derived data be used commercially?
- Can you find additional terms of licenses that are applicable to your situation?
Data ownership
One of the major challenges with data licenses is data ownership. Who actually owns the data? Can other parties use the data and what control do you have over the data as the owner? These points are addressed by the license file.
For any academic institute, there are three possible owners of the data:
- You
- Institute you work for
- Funding agency
Data ownership is defined in the data management plan. You may be restricted by your institutions’ policies on licenses.
Exercise
Take a few minutes to find out whether your institute has a license policy.
- What does it say?
- Who owns your data?
- What do you do if your institution does not have a policy?
Exercise
Imagine some researchers have SOME CRAZY EXAMPLE
They have deposited the data on their institute’s website under CC-BY 4.0.
- Do you see any problems with the license they chose?
- Which data license should you put on it when you distribute the data?
- Which license can you not put on it when you distribute the data?
What does the EU require?
The European Commission has fully embraced the FAIR principles. Horizon 2020 already mandates open access to all scientific publications and, since 2017, has been running a pilot in which research data are open by default, with possibilities to opt out.
How the data will be licensed is part of the data management plan. Specifically:
- How will the data be licensed to permit the widest re-use possible?
- When will the data be made available for re-use (i.e. are the data under embargo)?
- Are the data usable by third parties (i.e. are there restrictions)?
Exercise
Which license does the European Commission recommend for data acquired under the Horizon 2020 Programme?
Solution
The European Commission encourages researchers to provide access to their data in the broadest sense. They encourage authors to retain their copyright, but grant adequate licenses to publishers. They recommend the Creative Common licenses as a useful legal tool for providing open access.
Further reading:
- https://ec.europa.eu/research/participants/docs/h2020-funding-guide/cross-cutting-issues/open-access-data-management/open-access_en.htm
- https://ec.europa.eu/research/participants/data/ref/h2020/other/hi/oa-pilot/h2020-infograph-open-research-data_en.pdf
- In case you are really interested, have a look at p251-253 of the Horizon2020 Programme
How are licenses applied to research data?
In principle, a license is valid when it is mentioned or referred to in the same place where you upload your data. Some repositories like Zenodo allow you to select a license, and they make sure it is clear under which license the data are available.
You can apply a license by one of these ways:
- Choosing a license when you upload your data to a repository
- Referring to the license on the webpage where the data are hosted
- Attaching the license to the metadata
- Adding a readme file / license file to your data
The license chooser can help you generate a plain-text or HTML snippet to add to your readme or website.
Note
Specific to climate-related research, CMIP6 explicitly requires you to add the license to the metadata.
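For example, attaching the license to the metadata (the third option above) can be as simple as adding a license attribute to the dataset itself. A minimal sketch with the xarray Python library (an assumed tool choice; the file name is hypothetical):

```python
import xarray as xr

# Store the license inside the file's own metadata so it travels with the data.
ds = xr.Dataset(attrs={"title": "Example dataset"})
ds.attrs["license"] = "https://creativecommons.org/licenses/by/4.0/"
ds.to_netcdf("example_licensed.nc")
```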
Key Points
A permissive license ensures the re-usability of your data.
Many big inter-comparison projects already have suitable licenses.
For derived work, existing licenses may restrict your choice of license.
Data ownership, as defined in your data management plan, may restrict your choice of license.
Conclusion
Overview
Teaching: 0 min
Exercises: 0 min
Questions
How can I assess the FAIRness of my data?
What do I do now to improve the FAIRness of my data?
Objectives
Relate each FAIRification practice to the FAIR principles.
Be aware of available tools/resources for assessing FAIRness of your data.
Understand the next steps you can take to improve FAIRness of your data.
FAIRification practices and FAIR principles
We discussed each FAIRification practice, i.e. documentation, metadata, file formats, access to data, persistent identifiers, and data licenses. We also explained what a FAIRification practice means in a FAIR context. In summary:
Make your research findable
- Publish your data and/or metadata in a searchable resource such as a repository that assigns a persistent identifier.
- Include rich accurate machine-readable descriptive metadata and keywords to your data, preferably according to your community standards.
Make your data accessible
- Attach a data license or clear data accessibility statement in your openly available administrative metadata.
- Ensure your data are archived in long-term storage and retrievable by their persistent identifier using a standard protocol.
- Give access to the metadata, even if the data are closed.
Make your data interoperable
- Include sufficient and structural metadata in accordance with your research community’s standards.
- Use common standards, terminologies, vocabularies, and ontologies for the data.
- Prefer open, long-term viable file formats for your data and metadata.
Make your data reusable
- Attach all the relevant contextual information in either the documentation or metadata.
- Include sufficient and structural metadata in accordance with your research community’s standards.
- Prefer open, long-term viable file formats for your data and metadata.
- Apply a machine-readable data license.
This image was created by Scriberia for The Turing Way community and is used under a CC-BY license. Source: https://doi.org/10.5281/zenodo.3695300
Assessing FAIRness
FAIR is a journey. The FAIR principles are guidelines, and a FAIR assessment now is a snapshot in time. Nevertheless, individuals and communities assess how FAIR their data are, for various reasons, including gaining a better understanding and making improvements. Ultimately, an assessment can be a helpful guide on the path to becoming more FAIR.
The following are some resources and tools to help you get started.
- FAIR-Aware assessment tool.
- Zenodo overview of how the service responds to the FAIR principles.
- A list of resources for the assessment of digital objects provided by FAIRassist.
- Mirror, mirror on the wall, who is the FAIRest one of all?
- The Australian Research Data Commons’ FAIR data self-assessment tool.
Improving FAIRness
We provided recommendations and examples of steps you can take before, during and after your research project to make your research data more FAIR. FAIR sounds like a lot of work, but here is how you can get started:
- Understand how FAIR translates to your field/community.
- Use standards that are common to your field/community.
- Collaborate with other communities to define a relevant metadata scheme/standards.
- Check the requirements of the repository where data is stored.
Make a checklist for your data
Here are some questions about the use case you chose in the introduction of this tutorial.
We have explored FAIR principles. Now, try to make a checklist to discuss:
- how FAIR your research data are.
- what measures could be taken to improve FAIRness of your data.
Finding more guidance to go FAIR
- Follow a framework guiding FAIRification e.g. Three-point FAIRification Framework.
- Have a look at Research Data Management starter kit.
- Ask librarians/expertise in information management for help with the process.
Key Points
FAIRification practices are about documentation, metadata, file formats, access to data, persistent identifiers, and data licenses.
The FAIR principles provide clear guidance for data management in the movement towards open and sustainable science.
The FAIR principles promote maximum use of research data, involving all the stakeholders from data producers to funding agencies.