This lesson is still being designed and assembled (Pre-Alpha version)

File format

Overview

Teaching: 15 min
Exercises: 20 min
Questions
  • What is a file format?

  • What file formats should I use?

Objectives
  • Name common formats used in climate-related domains.

  • Understand different types of data.

  • Choose correct file formats for your work.

File format

A file format is the structure of a file. It determines how the data within the file is organized. Files are usually named filename.format, where the part after the dot is called the file extension. For example, climate.jpeg is a file named climate in the jpeg format. You might know that jpeg is a commonly used format for digital images, so the file climate is probably an image.

Each type of data, such as images, video, or text, can be stored in several formats, for example climate.jpeg, climate.png, and climate.gif. All three formats store image data, but they differ in compression, available colors, resulting file size, etc. (for more information see this page). When deciding which one to use, weigh the advantages and disadvantages of each.

Formats may also depend on particular software. When data is saved from a program, it is usually written in that program’s standard file format. One example is tabular data created with spreadsheet software: tabular data have specific properties that are better supported by spreadsheet software than by a word processor.

No extension or multiple extensions!

In UNIX-like operating systems, a file can have no extension or more than one extension. For example, in filename.tar.gz, the .tar indicates that the file is a tar archive of one or more files, and the .gz indicates that the tar archive is compressed with gzip. Also, programs reading files usually ignore the extension; it is mostly intended for the human user. Windows, in contrast, uses the extension to decide which program opens a file, so files there should have at least one extension.
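The difference between what a name claims and what a file contains can be sketched in Python. This is a minimal illustration, not a general-purpose detector: it assumes that checking the two-byte gzip signature at the start of the file is enough, and the helper names `looks_like_gzip` and `describe` are made up for this example.

```python
import io
import tarfile
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of a gzip stream


def looks_like_gzip(path):
    """Return True if the file content starts with the gzip signature."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def describe(path):
    """Report the extensions in the name and what the content looks like."""
    path = Path(path)
    return {"extensions": path.suffixes, "gzip_content": looks_like_gzip(path)}


with tempfile.TemporaryDirectory() as tmp:
    # A real gzip-compressed tar archive with the conventional double extension.
    named = Path(tmp) / "filename.tar.gz"
    payload = b"hello"
    with tarfile.open(named, "w:gz") as archive:
        member = tarfile.TarInfo(name="hello.txt")
        member.size = len(payload)
        archive.addfile(member, io.BytesIO(payload))

    # The same bytes stored under a name without any extension.
    unnamed = Path(tmp) / "archive"
    unnamed.write_bytes(named.read_bytes())

    named_report = describe(named)
    unnamed_report = describe(unnamed)

print(named_report)
print(unnamed_report)
```

Both files hold identical content, so both are recognized as gzip data; only the first one advertises this in its name.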

File formats for geospatial data in your community/research team

Here are some questions about the use case you chose in the introduction.

  1. What is the format of the geospatial data?
  2. Do you know why the data is in that format?

This is what go-FAIR.org has to say about Format

  • I1: (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. For example, data is provided in commonly understood and preferably open formats.
  • R1.3: (meta)data meet domain-relevant community standards e.g. those for data formats.

There are several data types, such as geospatial, tabular, storyline, documentation and paper, textual, video/audio, and image/figure. For each type, some formats are more in line with the FAIR principles than others. Here are some recommended formats for different types:

Data type | Data format
Geospatial data | NetCDF
Tabular data | SQLite
Textual data | Markdown
Image data | TIFF 6.0 uncompressed
Audio data | Free Lossless Audio Codec (FLAC)
Video data | MPEG-4
Documentation | Microsoft Word or PDF
Papers & Articles | LaTeX
Storyline Definitions | JSON

Other data types

Have a look at the list of the recommended formats for different types (mentioned above) and pick a data type.

  • What are the other formats for the type that you selected?
  • Which of those formats are common in your community?
  • Discuss some of the advantages and disadvantages of those formats.

Data Management Plan

A Data Management Plan (DMP) describes how data will be stored, described, and reused. For example, see DMPonline or DSW.

Different formats for different purposes

In the previous sections, we learned that different file formats have different properties. The purpose of a file should help determine which file format to choose. In the example of creating tabular data using spreadsheet software, there is no guarantee that the tabular data can be used or displayed in the future, because the software can become obsolete or may support only specific versions of a format.

It is good to plan, at the beginning of your project, which file formats to use for each purpose:

[Figure: format support matrix]

If we store the data in an open or widely supported format, it will be easier to reuse in the future.
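As a minimal sketch of keeping tabular data in open formats rather than in a spreadsheet program’s own format, the snippet below writes the same small table to CSV and to SQLite using only the Python standard library. The station names, values, file names, and column names are invented for illustration.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical example records; the stations and values are made up.
header = ["station", "date", "temperature_celsius"]
rows = [
    ("station_a", "2020-01-01", 2.5),
    ("station_b", "2020-01-01", -1.0),
]

with tempfile.TemporaryDirectory() as tmp:
    # Plain-text CSV: readable by almost any tool, but every value is text.
    csv_path = Path(tmp) / "temperature.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

    # SQLite: a single-file database that also records column types.
    db_path = Path(tmp) / "temperature.sqlite"
    connection = sqlite3.connect(db_path)
    connection.execute(
        "CREATE TABLE temperature "
        "(station TEXT, date TEXT, temperature_celsius REAL)"
    )
    connection.executemany("INSERT INTO temperature VALUES (?, ?, ?)", rows)
    connection.commit()

    # Read both files back to confirm the data survived the round trip.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        csv_rows = list(csv.reader(handle))
    db_rows = connection.execute(
        "SELECT station, date, temperature_celsius "
        "FROM temperature ORDER BY station"
    ).fetchall()
    connection.close()

print(csv_rows)
print(db_rows)
```

Reading the CSV back returns every value as a string, while SQLite returns the temperatures as numbers, which is one of the trade-offs to weigh when picking a format.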

Proprietary and non-proprietary formats

File formats may be either proprietary or non-proprietary (open or free):

  • A proprietary format is owned by a company, organization, or individual. Its specification is usually not publicly available, and the risk of obsolescence is high. If you want to store data in a proprietary format for a reasonable time, consider including a readme.txt file that documents the name and version of the software used to generate the file, as well as information about the company that made the software.

  • An open format is a file format whose specification is published and that is free to be used by everybody.
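The readme.txt suggested for proprietary formats can be generated alongside the data. The sketch below is one possible layout, not a required one: the function name `write_format_readme`, the field labels, and the example values (results.xyz, ExampleTool, Example Vendor Inc.) are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_format_readme(directory, data_file, software, version, vendor):
    """Write a readme.txt describing the software behind a data file."""
    lines = [
        f"Data file: {data_file}",
        f"Created with: {software}",
        f"Software version: {version}",
        f"Software vendor: {vendor}",
    ]
    readme_path = Path(directory) / "readme.txt"
    readme_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme_path


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical data file and software details, for illustration only.
    readme = write_format_readme(
        tmp, "results.xyz", "ExampleTool", "1.2.3", "Example Vendor Inc."
    )
    readme_text = readme.read_text(encoding="utf-8")

print(readme_text)
```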

Good practice for format selection

We want to create a checklist for choosing a format that improves the FAIRness of the data. What items should be included?

Solution

  • non-proprietary
  • open or widely supported format
  • standard and well-documented
  • supports relevant metadata
  • commonly used by your research community

Common file formats for geospatial data

In climate-related domains, e.g. weather and climate science, earth observation science, or hydrology, data come in many types and serve different purposes. In this section, we will introduce some common and acceptable data formats.

NetCDF:

NetCDF (Network Common Data Form) was originally developed for the Earth science community, but it can be used for many kinds of data. It views the world of scientific data in the same way that a geoscientist might: various quantities, such as temperature or elevation, are located at points with particular coordinates in space and time. The quantities (here temperature or elevation) are stored as netCDF variables, whereas coordinate information is stored as netCDF dimensions. Metadata, such as the units, is stored as netCDF attributes.
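To make the roles of dimensions, variables, and attributes concrete, here is a plain-Python sketch of that data model. It deliberately does not use any netCDF library and is not the netCDF file layout; the dictionaries, the helper functions `declared_shape` and `nested_shape`, and all values are invented for illustration. Coordinate values are held in variables named after their dimension, which is an assumption of this sketch.

```python
# Dimensions: named axes and their lengths.
dimensions = {"time": 2, "lat": 3}

# Variables: each names the dimensions it spans, carries attributes
# (metadata such as units), and holds the values themselves.
variables = {
    "time": {
        "dimensions": ("time",),
        "attributes": {"units": "days since 2000-01-01"},
        "values": [0, 1],
    },
    "lat": {
        "dimensions": ("lat",),
        "attributes": {"units": "degrees_north"},
        "values": [50.0, 51.0, 52.0],
    },
    "temperature": {
        "dimensions": ("time", "lat"),
        "attributes": {"units": "K", "long_name": "air temperature"},
        "values": [[280.0, 281.5, 279.0], [280.5, 282.0, 279.5]],
    },
}


def declared_shape(name):
    """Shape a variable should have according to its dimensions."""
    return tuple(dimensions[dim] for dim in variables[name]["dimensions"])


def nested_shape(values):
    """Shape of a non-empty, regular nested list."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0]
    return tuple(shape)


# Every variable's values must match the dimensions it declares.
for name, variable in variables.items():
    assert nested_shape(variable["values"]) == declared_shape(name), name

print(declared_shape("temperature"), variables["temperature"]["attributes"])
```

Because the units travel with the variable, a reader of the data does not need a separate document to know that the temperatures are in kelvin.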

GRIB:

GRIB stands for General Regularly-distributed Information in Binary form. It is a format standardized by the World Meteorological Organization (WMO) and commonly used for weather model data. Most meteorological centers worldwide use it operationally for Numerical Weather Prediction output. Some second-generation GRIB files are distributed through EUMETCast for Meteosat Second Generation. Another example is the output of the North American Mesoscale model.

GeoTIFF:

GeoTIFF is a standard image file format for describing and storing raster image data together with geographic information, so it can be used by Geographic Information System (GIS) applications. It is suitable for a wide range of applications worldwide, for example satellite imaging systems, scanned aerial photography, scanned maps, digital elevation models, or the results of geographic analyses. As an example, GeoTIFF 1.1 is an approved NASA Earth Science Data Systems standard (see NASA Standards and Practices).

HDF5:

The Hierarchical Data Format Version 5 (HDF5) implements a model for managing and storing data. It was originally developed at the National Center for Supercomputing Applications (NCSA). HDF5 is a general-purpose, machine-independent standard for storing scientific data in files. An HDF5 structure is self-describing, allowing an application to interpret the structure and contents of a file without any outside information. As an example, NASA’s Earth Observing System, the primary data repository for understanding global climate change, uses HDF5; for more information see this page.

Other formats for geospatial data

There are many other formats for storing geospatial data, such as SHP (shapefiles) for vector data, DBF (database files), and NetCDF Zarr data.
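Several of the formats in this section can be recognized from the first bytes of a file, regardless of its name. The sketch below assumes the signature sits at the very start of the file, which is the common case but not guaranteed (an HDF5 file may begin with a user block, and a GRIB file may carry leading bytes before the first message). The function name `identify_format` and the labels it returns are made up for this example.

```python
# Known leading byte signatures, checked in order.
SIGNATURES = [
    (b"\x89HDF\r\n\x1a\n", "HDF5 (also used by NetCDF-4 files)"),
    (b"CDF\x01", "NetCDF classic"),
    (b"CDF\x02", "NetCDF 64-bit offset"),
    (b"GRIB", "GRIB"),
    (b"II*\x00", "TIFF or GeoTIFF (little-endian)"),
    (b"MM\x00*", "TIFF or GeoTIFF (big-endian)"),
]


def identify_format(header):
    """Guess a format from the first bytes of a file."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return "unknown"


def identify_file(path):
    """Read the first 8 bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(8))


# Hand-written headers standing in for real files.
examples = {
    "netcdf": identify_format(b"CDF\x01\x00\x00\x00\x00"),
    "hdf5": identify_format(b"\x89HDF\r\n\x1a\n"),
    "grib": identify_format(b"GRIB\x00\x00\x00\x02"),
    "text": identify_format(b"station,date"),
}
print(examples)
```

A plain TIFF and a GeoTIFF share the same signature; telling them apart requires reading the tags inside the file, which this sketch does not attempt.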

Select a data format

Let’s have a look at the case study that you selected in the introduction of this tutorial. Assume that the authors want to publish their data. What suggestions would you give the authors about the data format?

Key Points

  • Choose formats that are common to your field/community to ensure the interoperability and reusability of your data.

  • Make sure that the file formats you choose can hold the necessary data elements and information.

  • Decide how long you intend to preserve your data.

  • Make sure to check the requirements of the repository where the data will be stored.