Content from Introduction to Raster Data


Last updated on 2024-06-25 | Edit this page

Estimated time: 20 minutes

Overview

Questions

  • What format should I use to represent my data?
  • What are the main data types used for representing geospatial data?
  • What are the main attributes of raster data?

Objectives

  • Describe the difference between raster and vector data.
  • Describe the strengths and weaknesses of storing data in raster format.
  • Distinguish between continuous and categorical raster data and identify types of datasets that would be stored in each format.

Introduction


This episode introduces the two primary types of data models that are used to digitally represent the earth’s surface: raster and vector. After briefly introducing these data models, this episode focuses on the raster representation, describing some major features and types of raster data. This workshop will focus on how to work with both raster and vector data sets, therefore it is essential that we understand the basic structures of these types of data and the types of phenomena that they can represent.

Data Structures: Raster and Vector


The two primary data models that are used to represent the earth’s surface digitally are the raster and vector. Raster data is stored as a grid of values which are rendered on a map as pixels—also known as cells—where each pixel—or cell—represents a value of the earth’s surface. Examples of raster data are satellite images or aerial photographs. Data stored according to the vector data model are represented by points, lines, or polygons. Examples of vector representation are points of interest, buildings—often represented as building footprints—or roads.

Representing phenomena as vector data allows you to add attribute information to them. For instance, a polygon of a house can contain multiple attributes containing information about the address like the street name, zip code, city, and number. More explanations about vector data will be discussed in the next episode.

When working with spatial information, you will experience that many phenomena can be represented as vector data and raster data. A house, for instance, can be represented by a set of cells in a raster having all the same value or by a polygon as vector containing attribute information (figure 1). It depends on the purpose for which the data is collected and intended to be used which data model it is stored in. But as a rule of thumb, you can apply that discrete phenomena like buildings, roads, trees, signs are represented as vector data, whereas continuous phenomena like temperature, wind speed, elevation are represented as raster data. Yet, one of the things a spatial data analyst often has to do is to transform data from vector to raster or the other way around. Keep in mind that this can cause problems in the data quality.

Raster Data

Raster data is any pixelated (or gridded) data where each pixel has a value and is associated with a specific geographic location. The value of a pixel can be continuous (e.g., elevation, temperature) or categorical (e.g., land-use type). If this sounds familiar, it is because this data structure is very common: it’s how we represent any digital image. A geospatial raster is only different from a digital photo in that it is accompanied by spatial information that connects the data to a particular location. This includes the raster’s extent and cell size, the number of rows and columns, and its Coordinate Reference System (CRS), which will be explained in episode 3 of this workshop.

raster concept
Raster Concept (Source: National Ecological Observatory Network (NEON))

Some examples of continuous rasters include:

  1. Precipitation maps.
  2. Elevation maps.

A map of elevation for Harvard Forest derived from the NEON AOP LiDAR sensor is below. Elevation is represented as a continuous numeric variable in this map. The legend shows the continuous range of values in the data from around 300 to 420 meters.

elevation Harvard forest
Continuous Elevation Map: HARV Field Site

Some rasters contain categorical data where each pixel represents a discrete class such as a landcover type (e.g., “forest” or “grassland”) rather than a continuous value such as elevation or temperature. Some examples of classified maps include:

  1. Landcover / land-use maps.
  2. Elevation maps classified as low, medium, and high elevation.
USA landcover classification
USA landcover classification

The map above shows the contiguous United States with landcover as categorical data. Each color is a different landcover category. (Source: Homer, C.G., et al., 2015, Completion of the 2011 National Land Cover Database for the conterminous United States-Representing a decade of land cover change information. Photogrammetric Engineering and Remote Sensing, v. 81, no. 5, p. 345-354)

Advantages and Disadvantages

With your neighbor, brainstorm potential advantages and disadvantages of storing data in raster format. Add your ideas to the Etherpad. The Instructor will discuss and add any points that weren’t brought up in the small group discussions.

Raster data has some important advantages:

  • representation of continuous surfaces
  • potentially very high levels of detail
  • data is ‘unweighted’ across its extent - the geometry doesn’t implicitly highlight features
  • cell-by-cell calculations can be very fast and efficient

The downsides of raster data are:

  • very large file sizes as cell size gets smaller
  • currently popular formats don’t embed metadata well (more on this later!)
  • can be difficult to represent complex information

Important Attributes of Raster Data

Extent

The spatial extent is the geographic area that the raster data covers. The spatial extent of an object represents the geographic edge or location that is the furthest north, south, east and west. In other words, extent represents the overall geographic coverage of the spatial object.

spatial extent objects
Spatial extent image (Image Source: National Ecological Observatory Network (NEON))

Extent Challenge

In the image above, the dashed boxes around each set of objects seems to imply that the three objects have the same extent. Is this accurate? If not, which object(s) have a different extent?

The lines and polygon objects have the same extent. The extent for the points object is smaller in the vertical direction than the other two because there are no points on the line at y = 8.

Resolution

A resolution of a raster represents the area on the ground that each pixel of the raster covers. The image below illustrates the effect of changes in resolution.

resolution image
Resolution image (Source: National Ecological Observatory Network (NEON))

Raster Data Format for this Workshop

Raster data can come in many different formats. For this workshop, we will use the GeoTIFF format which has the extension .tif, since this is one of the most common formats to be used. A .tif file stores metadata or attributes about the file as embedded tif tags. For instance, your camera might store a tag that describes the make and model of the camera or the date the photo was taken when it saves a .tif. A GeoTIFF is a standard .tif image format with additional spatial (georeferencing) information embedded in the file as tags. These tags include the following raster metadata:

  1. Extent
  2. Resolution
  3. Coordinate Reference System (CRS) - we will introduce this concept in a later episode
  4. Values that represent missing data (NoDataValue) - we will introduce this concept in a later episode.

We will discuss these attributes in more detail in a later episode. In that episode, we will also learn how to use Python to extract raster attributes from a GeoTIFF file.

More Resources on the .tif format

Multi-band Raster Data

A raster can contain one or more bands. One type of multi-band raster dataset that is familiar to many of us is a color image. A basic color image often consists of three bands: red, green, and blue (RGB). Each band represents light reflected from the red, green or blue portions of the electromagnetic spectrum. The pixel brightness for each band, when composited creates the colors that we see in an image.

multi-band raster
RGB multi-band raster image (Source: National Ecological Observatory Network (NEON).)

We can plot each band of a multi-band image individually.

Or we can composite all three bands together to make a color image.

In a multi-band dataset, the rasters will always have the same extent, resolution, and CRS.

Other Types of Multi-band Raster Data

Multi-band raster data might also contain: 1. Time series: the same variable, over the same area, over time. 2. Multi or hyperspectral imagery: image rasters that have 4 or more (multi-spectral) or more than 10-15 (hyperspectral) bands. We won’t be working with this type of data in this workshop, but you can check out the NEON Data Skills Imaging Spectroscopy HDF5 in R tutorial if you’re interested in working with hyperspectral data cubes.

Key Points

  • Raster data is pixelated data where each pixel is associated with a specific location.
  • Raster data always has an extent and a resolution.
  • The extent is the geographical area covered by a raster.
  • The resolution is the area covered by each pixel of a raster.

Content from Introduction to Vector Data


Last updated on 2024-06-25 | Edit this page

Estimated time: 15 minutes

Overview

Questions

  • What are the main attributes of vector data?

Objectives

  • Describe the strengths and weaknesses of storing data in vector format.
  • Describe the three types of vectors and identify types of data that would be stored in each.

About Vector Data


Vector data structures represent specific features on the Earth’s surface, and assign attributes to those features. Vectors are composed of discrete geometric locations (x, y values) known as vertices that define the shape of the spatial object. The organization of the vertices determines the type of vector that we are working with: point, line or polygon.

vector data types
Types of vector objects (Image Source: National Ecological Observatory Network (NEON))
  • Points: Each point is defined by a single x, y coordinate. There can be many points in a vector point file. Examples of point data include: sampling locations, the location of individual trees, or the location of survey plots.

  • Lines: Lines are composed of many (at least 2) points that are connected. For instance, a road or a stream may be represented by a line. This line is composed of a series of segments, each “bend” in the road or stream represents a vertex that has a defined x, y location.

  • Polygons: A polygon consists of 3 or more vertices that are connected and closed. The outlines of survey plot boundaries, lakes, oceans, and states or countries are often represented by polygons. Note, that polygons can also contain one or multiple holes, for instance a plot boundary with a lake in it. These polygons are considered complex or donut polygons.

Data Tip

Sometimes, boundary layers such as states and countries, are stored as lines rather than polygons. However, these boundaries, when represented as a line, will not create a closed object with a defined area that can be filled.

Identify Vector Types

The plot below includes examples of two of the three types of vector objects. Use the definitions above to identify which features are represented by which vector type.

vector type examples
Vector Type Examples

State boundaries are shown as polygons. The Fisher Tower location is represented by a purple point. There are no line features shown. Note, that at a different scale the Fischer Tower coudl also have been represented as a polygon. Keep in mind that the purpose for which the dataset is created and aimed to be used for determines which vector type it uses.

Vector data has some important advantages:

  • The geometry itself contains information about what the dataset creator thought was important
  • The geometry structures hold information in themselves - why choose point over polygon, for instance?
  • Each geometry feature can carry multiple attributes instead of just one, e.g. a database of cities can have attributes for name, country, population, etc
  • Data storage can, depending on the scale, be very efficient compared to rasters
  • When working with network analysis, for instance to calculate the shortest route between A and B, topologically correct lines are essential. This is not possible through raster data.

The downsides of vector data include:

  • Potential bias in datasets - what didn’t get recorded? Often vector data are interpreted datasets like topographical maps and have been collected by someone else, for another purpose.
  • Calculations involving multiple vector layers need to do math on the geometry as well as the attributes, which potentially can be slow compared to raster calculations.

Vector datasets are in use in many industries besides geospatial fields. For instance, computer graphics are largely vector-based, although the data structures in use tend to join points using arcs and complex curves rather than straight lines. Computer-aided design (CAD) is also vector- based. The difference is that geospatial datasets are accompanied by information tying their features to real-world locations.

Vector Data Format for this Workshop


Like raster data, vector data can also come in many different formats. For this workshop, we will use the GeoPackage format. GeoPackage is developed by the Open Geospatial Consortium and is is an open, standards-based, platform-independent, portable, self-describing, compact format for transferring geospatial information. (source: https://www.geopackage.org/ ) A GeoPackage file, .gpkg, is a single file that contains the geometries of features, their attributes and information about the CRS used.

Another vector format that you will probably come accross quite often is a Shapefile. Although we will not be using that format in this workshop we do believe it is useful to understand how that format works. A Shapefile format consists of multiple files in the same directory, of which .shp, .shx, and .dbf files are mandatory. Other non-mandatory but very important files are .prj and shp.xml files.

  • The .shp file stores the feature geometry itself
  • .shx is a positional index of the feature geometry to allow quickly searching forwards and backwards the geographic coordinates of each vertex in the vector
  • .dbf contains the tabular attributes for each shape.
  • .prj file indicates the Coordinate reference system (CRS)
  • .shp.xml contains the Shapefile metadata.

Together, the Shapefile includes the following information:

  • Extent - the spatial extent of the shapefile (i.e. geographic area that the shapefile covers). The spatial extent for a shapefile represents the combined extent for all spatial objects in the shapefile.
  • Object type - whether the shapefile includes points, lines, or polygons.
  • Coordinate reference system (CRS)
  • Other attributes - for example, a line shapefile that contains the locations of streams, might contain the name of each stream.

Because the structure of points, lines, and polygons are different, each individual shapefile can only contain one vector type (all points, all lines or all polygons). You will not find a mixture of point, line and polygon objects in a single shapefile.

More Resources on Shapefiles

More about shapefiles can be found on Wikipedia. Shapefiles are often publicly available from government services, such as this page containing all administrative boundaries for countries in the world or topographical vector data from Open Street Maps.

Why not both?

Very few formats can contain both raster and vector data - in fact, most are even more restrictive than that. Vector datasets are usually locked to one geometry type, e.g. points only. Raster datasets can usually only encode one data type, for example you can’t have a multiband GeoTIFF where one layer is integer data and another is floating-point. There are sound reasons for this - format standards are easier to define and maintain, and so is metadata. The effects of particular data manipulations are more predictable if you are confident that all of your input data has the same characteristics.

Key Points

  • Vector data structures represent specific features on the Earth’s surface along with attributes of those features.
  • Vector data is often interpreted data and collected for a different purpose than you would want to use it for.
  • Vector objects are either points, lines, or polygons.

Content from Coordinate Reference Systems


Last updated on 2024-06-25 | Edit this page

Estimated time: 25 minutes

Overview

Questions

  • What is a coordinate reference system and how do I interpret one?

Objectives

  • Name some common schemes for describing coordinate reference systems.
  • Interpret a PROJ4 coordinate reference system description.

Coordinate Reference Systems


A data structure cannot be considered geospatial unless it is accompanied by coordinate reference system (CRS) information, in a format that geospatial applications can use to display and manipulate the data correctly. CRS information connects data to the Earth’s surface using a mathematical model.

CRS vs SRS

CRS (coordinate reference system) and SRS (spatial reference system) are synonyms and are commonly interchanged. We will use only CRS throughout this workshop.

The CRS associated with a dataset tells your mapping software where the raster is located in geographic space. It also tells the mapping software what method should be used to flatten or project the raster in geographic space.

The image below (figure 3.1) shows maps of the United States in different projections. Notice the differences in shape associated with each projection. These differences are a direct result of the calculations used to flatten the data onto a 2-dimensional map.

US difference projections
Figure 3.1: Maps of the United States in different projections (Source: opennews.org)

There are lots of great resources that describe coordinate reference systems and projections in greater detail. For the purposes of this workshop, what is important to understand is that data from the same location but saved in different projections will not line up. Thus, it is important when working with spatial data to identify the coordinate reference system applied to the data and retain it throughout data processing and analysis.

Components of a CRS


CRS information has three components:

  • Datum: A model of the shape of the earth. It has angular units (i.e. degrees) and defines the starting point (i.e. where is [0,0]?) so the angles reference a meaningful spot on the earth. Common global datums are WGS84 and NAD83. Datums can also be local - fit to a particular area of the globe, but ill-fitting outside the area of intended use For instance local Cadastre, Land Registry and Mapping Agencies require a high quality of their datasets, which can be obtained using a local system. In this workshop, we will use the WGS84 datum. The datum is often also refered to as the Geographical Coordinate System.

  • Projection: A mathematical transformation of the angular measurements on a round earth to a flat surface (i.e. paper or a computer screen). The units associated with a given projection are usually linear (feet, meters, etc.). In this workshop, we will see data in two different projections. Note that the projection is also often refered to as Projected Coordinate System.

  • Additional Parameters: Additional parameters are often necessary to create the full coordinate reference system. One common additional parameter is a definition of the center of the map. The number of required additional parameters depends on what is needed by each specific projection.

Orange Peel Analogy

A common analogy employed to teach projections is the orange peel analogy. If you imagine that the Earth is an orange, how you peel it and then flatten the peel is similar to how projections get made.

  • A datum is the choice of fruit to use. Is the Earth an orange, a lemon, a lime, a grapefruit?
datum fruit example
Datum Fruit Example (Image source)

A projection is how you peel your orange and then flatten the peel.

projection citrus peel
Projection Citrus Peel Example (Image from Prof Drika Geografia, Projeções Cartográficas)
  • An additional parameter could include a definition of the location of the stem of the fruit. What other parameters could be included in this analogy?

Which projection should I use?


A well know projection is the Mercator projection introduced by the Flamisch Cartographer Gerardus Mercator in the 16th Century. This so-called cilindrical projection, meaning that a virtual cilinder is place on the globe to flatten it, it relatively accurate near to the equator, but towards the poles blows things up see:Cylindrical projection. The main advantage of the Mercator projection is that it is very suitable for navigation purpuses since it always north as up and south and as down, in the 17th century this projection was essential for sailors to navigate the oceans.

To decide if a projection is right for your data, answer these questions:

  • What is the area of minimal distortion?
  • What aspect of the data does it preserve?

Peter Dana from the University of Colorado at Boulder and the Department of Geo-Information Processing have a good discussion of these aspects of projections. Online tools like Projection Wizard can also help you discover projections that might be a good fit for your data.

Data Tip

Take the time to identify a projection that is suited for your project. You don’t have to stick to the ones that are popular.

Describing Coordinate Reference Systems


There are several common systems in use for storing and transmitting CRS information, as well as translating among different CRSs. These systems generally comply with ISO 19111. Common systems for describing CRSs include EPSG, OGC WKT, and PROJ strings.

EPSG

The EPSG system is a database of CRS information maintained by the International Association of Oil and Gas Producers. The dataset contains both CRS definitions and information on how to safely convert data from one CRS to another. Using EPSG is easy as every CRS has an integer identifier, e.g. WGS84 is EPSG:4326. The downside is that you can only use the CRSs defined by EPSG and cannot customise them (some datasets do not have EPSG codes). epsg.io is an excellent website for finding suitable projections by location or for finding information about a particular EPSG code.

Well-Known Text

The Open Geospatial Consortium WKT standard is used by a number of important geospatial apps and software libraries. WKT is a nested list of geodetic parameters. The structure of the information is defined on their website. WKT is valuable in that the CRS information is more transparent than in EPSG, but can be more difficult to read and compare than PROJ since it is meant to necessarily represent more complex CRS information. Additionally, the WKT standard is implemented inconsistently across various software platforms, and the spec itself has some known issues.

PROJ

PROJ is an open-source library for storing, representing and transforming CRS information. PROJ strings continue to be used, but the format is deprecated by the PROJ C maintainers due to inaccuracies when converting to the WKT format. The data and python libraries we will be working with in this workshop use different underlying representations of CRSs under the hood for reprojecting. CRS information can still be represented with EPSG, WKT, or PROJ strings without consequence, but it is best to only use PROJ strings as a format for viewing CRS information, not for reprojecting data.

PROJ represents CRS information as a text string of key-value pairs, which makes it easy to read and interpret.

A PROJ4 string includes the following information:

  • proj: the projection of the data
  • zone: the zone of the data (this is specific to the UTM projection)
  • datum: the datum used
  • units: the units for the coordinates of the data
  • ellps: the ellipsoid (how the earth’s roundness is calculated) for the data

Note that the zone is unique to the UTM projection. Not all CRSs will have a zone.

UTM zones across the CONUS
The UTM zones across the continental United States (Chrismurf at English Wikipedia, via Wikimedia Commons (CC-BY))

Reading a PROJ4 String

Here is a PROJ4 string for one of the datasets we will use in this workshop:

+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0

  • What projection, zone, datum, and ellipsoid are used for this data?
  • What are the units of the data?
  • Using the map above, what part of the United States was this data collected from?
  • Projection is UTM, zone 18, datum is WGS84, ellipsoid is WGS84.
  • The data is in meters.
  • The data comes from the eastern US seaboard.

Format interoperability


Many existing file formats were invented by GIS software developers, often in a closed-source environment. This led to the large number of formats on offer today, and considerable problems transferring data between software environments. The Geospatial Data Abstraction Library (GDAL) is an open-source answer to this issue.

GDAL is a set of software tools that translate between almost any geospatial format in common use today (and some not so common ones). GDAL also contains tools for editing and manipulating both raster and vector files, including reprojecting data to different CRSs. GDAL can be used as a standalone command-line tool, or built in to other GIS software. Several open-source GIS programs use GDAL for all file import/export operations.

Metadata


Spatial data is useless without metadata. Essential metadata includes the CRS information, but proper spatial metadata encompasses more than that. History and provenance of a dataset (how it was made), who is in charge of maintaining it, and appropriate (and inappropriate!) use cases should also be documented in metadata. This information should accompany a spatial dataset wherever it goes. In practice this can be difficult, as many spatial data formats don’t have a built-in place to hold this kind of information. Metadata often has to be stored in a companion file, and generated and maintained manually.

More Resources on CRS

Key Points

  • All geospatial datasets (raster and vector) are associated with a specific coordinate reference system.
  • A coordinate reference system includes datum, projection, and additional parameters specific to the dataset.
  • All maps are distored because of the projection.

Content from The Geospatial Landscape


Last updated on 2024-06-25 | Edit this page

Estimated time: 10 minutes

Overview

Questions

  • What programs and applications are available for working with geospatial data?

Objectives

  • Describe the difference between various approaches to geospatial computing, and their relative strengths and weaknesses.
  • Name some commonly used GIS applications.
  • Name some commonly used Python packages that can access and process spatial data.
  • Describe pros and cons for working with geospatial data using a command-line versus a graphical user interface.

Standalone Software Packages


Most traditional GIS work is carried out in standalone applications that aim to provide end-to-end geospatial solutions. These applications are available under a wide range of licenses and price points. Some of the most common are listed below.

Open-source software

The Open Source Geospatial Foundation (OSGEO) supports several actively managed GIS platforms:

  • QGIS is a professional GIS application that is built on top of and proud to be itself Free and Open Source Software (FOSS). QGIS is written in Python and C++, has a python console interface, allows to develop plugins and has several interfaces written in R including RQGIS.
  • GRASS GIS, commonly referred to as GRASS (Geographic Resources Analysis Support System), is a FOSS-GIS software suite used for geospatial data management and analysis, image processing, graphics and maps production, spatial modeling, and visualization. GRASS GIS is currently used in academic and commercial settings around the world, as well as by many governmental agencies and environmental consulting companies. It is a founding member of the Open Source Geospatial Foundation (OSGeo). GRASS GIS can be installed along with and made accessible within QGIS 3.
  • GDAL is a multiplatform set of tools for translating between geospatial data formats. It can also handle reprojection and a variety of geoprocessing tasks. GDAL is built in to many applications both FOSS and commercial, including GRASS and QGIS.
  • SAGA-GIS, or System for Automated Geoscientific Analyses, is a FOSS-GIS application developed by a small team of researchers from the Dept. of Physical Geography, Göttingen, and the Dept. of Physical Geography, Hamburg. SAGA has been designed for an easy and effective implementation of spatial algorithms, offers a comprehensive, growing set of geoscientific methods, provides an easily approachable user interface with many visualisation options, and runs under Windows and Linux operating systems. Like GRASS GIS, it can also be installed and made accessible in QGIS3.
  • PostGIS is a geospatial extension to the PostGreSQL relational database and is especially suited to work with large vector datasets.
  • GeoDMS is a powerful Open sources GIS which allows for fast calculations and calculations with large datasets. Furthermore it allows for complex scenario analyses.

Commercial software

Online + Cloud computing

  • PANGEO is a community organization dedicated to open and reproducible data science with python. They focus on the Pangeo software ecosystem for working with big data in the geosciences.
  • Google developed Google Earth Engine which combines a multi-petabyte catalog of satellite imagery and geospatial datasets with planetary-scale analysis capabilities and makes it available for scientists, researchers, and developers to detect changes, map trends, and quantify differences on the Earth’s surface. Earth Engine API runs in both Python and JavaScript.
  • ArcGIS Online provides access to thousands of maps and base layers.

Private companies have released SDK platforms for large scale GIS analysis:

  • Kepler.gl is Uber’s toolkit for handling large datasets (i.e. Uber’s data archive).
  • Boundless Geospatial is built upon OSGEO software for enterprise solutions.

Publicly funded open-source platforms for large scale GIS analysis:

  • PANGEO for the Earth Sciences. This community organization also supports python libraries like xarray, iris, dask, jupyter, and many other packages.
  • Sepal.io by FAO Open Foris utilizing EOS satellite imagery and cloud resources for global forest monitoring.

GUI vs CLI


The earliest computer systems operated without a graphical user interface (GUI), relying only on the command-line interface (CLI). Since mapping and spatial analysis are strongly visual tasks, GIS applications benefited greatly from the emergence of GUIs and quickly came to rely heavily on them. Most modern GIS applications have very complex GUIs, with all common tools and procedures accessed via buttons and menus.

Benefits of using a GUI include:

  • Tools are all laid out in front of you
  • Complex commands are easy to build
  • Don’t need to learn a coding language
  • Cartography and visualisation is more intuitive and flexible

Downsides of using a GUI include:

  • Low reproducibility - you can record your actions and replay, but this is limited to the functionalities of the software
  • Batch-processing is possible, but limited to the funstionalities of the software and hard to be integrated with other workflows
  • Limited ability to customise functions or write your own
  • Intimidating interface for new users - so many buttons!

In scientific computing, the lack of reproducibility in point-and-click software has come to be viewed as a critical weakness. As such, scripted CLI-style workflows are becoming popular, which leads us to another approach to doing GIS — via a programming language. Therefore this is the approach we will be using throughout this workshop.

GIS in programming languages


A number of powerful geospatial processing libraries exist for general-purpose programming languages like Java and C++. However, the learning curve for these languages is steep and the effort required is excessive for users who only need a subset of their functionality.

Higher-level scripting languages like Python and R are considered easier to learn and use. Both now have their own packages that wrap up those geospatial processing libraries and make them easy to access and use safely. A key example is the Java Topology Suite (JTS), which is implemented in C++ as GEOS. GEOS is accessible in Python via the shapely package (and geopandas, which makes use of shapely) and in R via sf. R and Python also have interface packages for GDAL, and for specific GIS apps.

This last point is a huge advantage for GIS-by-programming; these interface packages give you the ability to access functions unique to particular programs, but have your entire workflow recorded in a central document - a document that can be re-run at will. Below are lists of some of the key spatial packages for Python, which we will be using in the remainder of this workshop.

  • geopandas and geocube for working with vector data
  • rasterio and rioxarray for working with raster data

These packages along with the matplotlib package are all we need for spatial data visualisation. Python also has many fundamental scientific packages that are relevant in the geospatial domain. Below is a list of particularly fundamental packages. numpy, scipy, and scikit-image are all excellent options for working with rasters, as arrays.

An overview of these and other Python spatial packages can be accessed here.

As a programming language, Python can be a CLI tool. However, using Python together with an Integrated Development Environment (IDE) application allows some GUI features to become part of your workflow. IDEs allow the best of both worlds. They provide a place to visually examine data and other software objects, interact with your file system, and draw plots and maps, but your activities are still command-driven: recordable and reproducible. There are several IDEs available for Python. JupyterLab is well-developed and the most widely used option for data science in Python. VSCode and Spyder are other popular options for data science.

Traditional GIS apps are also moving back towards providing a scripting environment for users, further blurring the CLI/GUI divide. ESRI have adopted Python into their software by introducing arcpy, and QGIS is both Python and R-friendly.

GIS File Types


There are a variety of file types that are used in GIS analysis. Depending on the program you choose to use some file types can be used while others are not readable. Below is a brief table describing some of the most common vector and raster file types.

Vector

File Type Extensions Description
Esri Shapefile .SHP .DBF .SHX The most common geospatial file type. This has become the industry standard. The three required files are: SHP is the feature geometry. SHX is the shape index position. DBF is the attribute data.
GeoPackage .gpkg As an alternative for a Shapfile. This open file format is gaining terrain and exists of one file containing all necessary attribute information.
Geographic JavaScript Object Notation (GeoJSON) .GEOJSON .JSON Used for web-based mapping and uses JavaScript Object Notation to store the coordinates as text.
Google Keyhole Markup Language (KML) .KML .KMZ KML stands for Keyhole Markup Language. This GIS format is XML-based and is primarily used for Google Earth.
GPX or GPS Exchange Format .gpx Is an XML schema designed as a common GPS data format for software applications. This format is often used for tracking activities e.g. hiking, cycling, running etc.
OpenStreetMap .OSM OSM files are the native file for OpenStreetMap which had become the largest crowdsourcing GIS data project in the world. These files are a collection of vector features from crowd-sourced contributions from the open community.

Raster

File Type Extensions Description
ERDAS Imagine .IMG ERDAS Imagine IMG files is a proprietary file format developed by Hexagon Geospatial. IMG files are commonly used for raster data to store single and multiple bands of satellite data.Each raster layer as part of an IMG file contains information about its data values. For example, this includes projection, statistics, attributes, pyramids and whether or not it’s a continuous or discrete type of raster.
GeoTIFF .TIF .TIFF .OVR The GeoTIFF has become an industry image standard file for GIS and satellite remote sensing applications. GeoTIFFs may be accompanied by other files:TFW is the world file that is required to give your raster geolocation.XML optionally accompany GeoTIFFs and are your metadata.AUX auxiliary files store projections and other information.OVR pyramid files improves performance for raster display.
Cloud Optimized GeoTIFF (COG) .TIF .TIFF Based on the GeoTIFF standard, COGs incorporate tiling and overviews to support HTTP range requests where users can query and load subsets of the image without having to transfer the entire file.

Key Points

  • Many software packages exist for working with geospatial data.
  • Command-line programs allow you to automate and reproduce your work.
  • JupyterLab provides a user-friendly interface for working with Python.

Content from Access satellite imagery using Python


Last updated on 2025-01-15 | Edit this page

Estimated time: 45 minutes

Overview

Questions

  • Where can I find open-access satellite data?
  • How do I search for satellite imagery with the STAC API?
  • How do I fetch remote raster datasets using Python?

Objectives

  • Search public STAC repositories of satellite imagery using Python.
  • Inspect search result’s metadata.
  • Download (a subset of) the assets available for a satellite scene.
  • Open satellite imagery as raster data and save it to disk.

Considerations for the position of this episode in the workshop


When this workshop is taught to learners with limited prior knowledge of Python, it might be better to place this episode after episode 11 and before episode 12. This episode contains an introduction to working with APIs and dictionaries, which can be perceived as challenging by some learners. Another consideration for placing this episode later in the workshop is when it is taught to learners with prior GIS knowledge who want to perform GIS-like operations with data they have already collected or for learners interested in working with raster data but less interested in satellite images.

Introduction


A number of satellites take snapshots of the Earth’s surface from space. The images recorded by these remote sensors represent a very precious data source for any activity that involves monitoring changes on Earth. Satellite imagery is typically provided in the form of geospatial raster data, with the measurements in each grid cell (“pixel”) being associated to accurate geographic coordinate information.

In this episode we will explore how to access open satellite data using Python. In particular, we will consider the Sentinel-2 data collection that is hosted on Amazon Web Services (AWS). This dataset consists of multi-band optical images acquired by the constellation of two satellites from the Sentinel-2 mission and it is continuously updated with new images.

Search for satellite imagery


The SpatioTemporal Asset Catalog (STAC) specification

Current sensor resolutions and satellite revisit periods are such that terabytes of data products are added daily to the corresponding collections. Such datasets cannot be made accessible to users via full-catalog download. Therefore, space agencies and other data providers often offer access to their data catalogs through interactive Graphical User Interfaces (GUIs), see for instance the Copernicus Browser for the Sentinel missions. Accessing data via a GUI is a nice way to explore a catalog and get familiar with its content, but it represents a heavy and error-prone task that should be avoided if carried out systematically to retrieve data.

A service that offers programmatic access to the data enables users to reach the desired data in a more reliable, scalable and reproducible manner. An important element in the software interface exposed to the users, which is generally called the Application Programming Interface (API), is the use of standards. Standards, in fact, can significantly facilitate the reusability of tools and scripts across datasets and applications.

The SpatioTemporal Asset Catalog (STAC) specification is an emerging standard for describing geospatial data. By organizing metadata in a form that adheres to the STAC specifications, data providers make it possible for users to access data from different missions, instruments and collections using the same set of tools.

STAC browser screenshots
Views of the STAC browser

Search a STAC catalog


The STAC browser is a good starting point to discover available datasets, as it provides an up-to-date list of existing STAC catalogs. From the list, let’s click on the “Earth Search” catalog, i.e. the access point to search the archive of Sentinel-2 images hosted on AWS.

Exercise: Discover a STAC catalog

Let’s take a moment to explore the Earth Search STAC catalog, which is the catalog indexing the Sentinel-2 collection that is hosted on AWS. We can interactively browse this catalog using the STAC browser at this link.

  1. Open the link in your web browser. Which (sub-)catalogs are available?
  2. Open the Sentinel-2 Level 2A collection, and select one item from the list. Each item corresponds to a satellite “scene”, i.e. a portion of the footage recorded by the satellite at a given time. Have a look at the metadata fields and the list of assets. What kind of data do the assets represent?
earth-search stac catalog views
Views of the Earth Search STAC endpoint
  1. 8 sub-catalogs are available. In the STAC nomenclature, these are actually “collections”, i.e. catalogs with additional information about the elements they list: spatial and temporal extents, license, providers, etc. Among the available collections, we have Landsat Collection 2, Level-2 and Sentinel-2 Level 2A (see left screenshot in the figure above).
  2. When you select the Sentinel-2 Level 2A collection, and randomly choose one of the items from the list, you should find yourself on a page similar to the right screenshot in the figure above. On the left side you will find a list of the available assets: overview images (thumbnail and true color images), metadata files and the “real” satellite images, one for each band captured by the Multispectral Instrument on board Sentinel-2.

When opening a catalog with the STAC browser, you can access the API URL by clicking on the “Source” button on the top right of the page. By using this URL, you have access to the catalog content and, if supported by the catalog, to the functionality of searching its items. For the Earth Search STAC catalog the API URL is:

PYTHON

api_url = "https://earth-search.aws.element84.com/v1"

You can query a STAC API endpoint from Python using the pystac_client library. To do so we will first import Client from pystac_client and use the method open from the Client object:

PYTHON

from pystac_client import Client

client = Client.open(api_url)

For this episode we will focus at scenes belonging to the sentinel-2-l2a collection. This dataset is useful for our case and includes Sentinel-2 data products pre-processed at level 2A (bottom-of-atmosphere reflectance).

In order to see which collections are available in the provided api_url the get_collections method can be used on the Client object.

PYTHON

collections = client.get_collections()

To print the collections we can make a for loop doing:

PYTHON

for collection in collections:
    print(collection)

OUTPUT

<CollectionClient id=cop-dem-glo-30>
<CollectionClient id=naip>
<CollectionClient id=sentinel-2-l2a>
<CollectionClient id=sentinel-2-l1c>
<CollectionClient id=cop-dem-glo-90>
<CollectionClient id=landsat-c2-l2>
<CollectionClient id=sentinel-1-grd>
<CollectionClient id=sentinel-2-c1-l2a>

As said, we want to focus to the sentinel-2-l2a collection. To do so, we set this collection into a variable:

PYTHON

collection_sentinel_2_l2a = "sentinel-2-l2a"

The data in this collection is stored in the Cloud Optimized GeoTIFF (COG) format and as JPEG2000 images. In this episode we will focus at COGs, as these offer useful functionalities for our purpose.

Cloud Optimized GeoTIFFs

Cloud Optimized GeoTIFFs (COGs) are regular GeoTIFF files with some additional features that make them ideal to be employed in the context of cloud computing and other web-based services. This format builds on the widely-employed GeoTIFF format, already introduced in Episode 1: Introduction to Raster Data. In essence, COGs are regular GeoTIFF files with a special internal structure. One of the features of COGs is that data is organized in “blocks” that can be accessed remotely via independent HTTP requests. Data users can thus access the only blocks of a GeoTIFF that are relevant for their analysis, without having to download the full file. In addition, COGs typically include multiple lower-resolution versions of the original image, called “overviews”, which can also be accessed independently. By providing this “pyramidal” structure, users that are not interested in the details provided by a high-resolution raster can directly access the lower-resolution versions of the same image, significantly saving on the downloading time. More information on the COG format can be found here.

In order to get data for a specific location you can add longitude latitude coordinates (World Geodetic System 1984 EPSG:4326) in your request. In order to do so we are using the shapely library to define a geometrical point. Below we have included a point on the island of Rhodes, which is the location of interest for our case study (i.e. Longitude: 27.95 | Latitude 36.20).

PYTHON

from shapely.geometry import Point
point = Point(27.95, 36.20)  # Coordinates of a point on Rhodes

Note: at this stage, we are only dealing with metadata, so no image is going to be downloaded yet. But even metadata can be quite bulky if a large number of scenes match our search! For this reason, we limit the search by the intersection of the point (by setting the parameter intersects) and assign the collection (by setting the parameter collections). More information about the possible parameters to be set can be found in the pystac_client documentation for the Client’s search method.

We now set up our search of satellite images in the following way:

PYTHON

search = client.search(
    collections=[collection_sentinel_2_l2a],
    intersects=point,
)

Now we submit the query in order te find out how many scenes match our search criteria with the parameters assigned above (please note that this output can be different as more data is added to the catalog to when this episode was created):

PYTHON

print(search.matched())

OUTPUT

611

You will notice that more than 500 scenes match our search criteria. We are however interested in the period right before and after the wildfire of Rhodes. In the following exercise you will therefore have to add a time filter to our search criteria to narrow down our search for images of that period.

Exercise: Search satellite scenes with a time filter

Search for all the available Sentinel-2 scenes in the sentinel-2-c1-l2a collection that have been recorded between 1st of July 2023 and 31st of August 2023 (few weeks before and after the time in which the wildfire took place).

Hint: You can find the input argument and the required syntax in the documentation of client.search (which you can access from Python or online)

How many scenes are available?

PYTHON

search = client.search(
    collections=[collection_sentinel_2_l2a],
    intersects=point,
    datetime='2023-07-01/2023-08-31'
)
print(search.matched())

OUTPUT

12

This means that 12 scenes satisfy the search criteria.

Now that we have added a time filter, we retrieve the metadata of the search results by calling the method item_collection:

PYTHON

items = search.item_collection()

The variable items is an ItemCollection object. More information can be found at the pystac documentation

Now let us check the size using len:

PYTHON

print(len(items))

OUTPUT

12

which is consistent with the number of scenes matching our search results as found with search.matched(). We can iterate over the returned items and print these to show their IDs:

PYTHON

for item in items:
    print(item)

OUTPUT

<Item id=S2A_35SNA_20230827_0_L2A>
<Item id=S2B_35SNA_20230822_0_L2A>
<Item id=S2A_35SNA_20230817_0_L2A>
<Item id=S2B_35SNA_20230812_0_L2A>
<Item id=S2A_35SNA_20230807_0_L2A>
<Item id=S2B_35SNA_20230802_0_L2A>
<Item id=S2A_35SNA_20230728_0_L2A>
<Item id=S2B_35SNA_20230723_0_L2A>
<Item id=S2A_35SNA_20230718_0_L2A>
<Item id=S2B_35SNA_20230713_0_L2A>
<Item id=S2A_35SNA_20230708_0_L2A>
<Item id=S2B_35SNA_20230703_0_L2A>

Each of the items contains information about the scene geometry, its acquisition time, and other metadata that can be accessed as a dictionary from the properties attribute. To see which information Item objects can contain you can have a look at the pystac documentation.

Let us inspect the metadata associated with the first item of the search results. Let us first look at the collection date of the first item::

PYTHON

item = items[0]
print(item.datetime)

OUTPUT

2023-08-27 09:00:21.327000+00:00

Let us now look at the geometry and other properties as well.

PYTHON

print(item.geometry)
print(item.properties)

OUTPUT

{'type': 'Polygon', 'coordinates': [[[27.290401625602243, 37.04621863329741], [27.23303872472207, 36.83882218126937], [27.011145718480538, 36.05673246264742], [28.21878905911668, 36.05053734221328], [28.234426643135546, 37.04015200857309], [27.290401625602243, 37.04621863329741]]]}
{'created': '2023-08-27T18:15:43.106Z', 'platform': 'sentinel-2a', 'constellation': 'sentinel-2', 'instruments': ['msi'], 'eo:cloud_cover': 0.955362, 'proj:epsg': 32635, 'mgrs:utm_zone': 35, 'mgrs:latitude_band': 'S', 'mgrs:grid_square': 'NA', 'grid:code': 'MGRS-35SNA', 'view:sun_azimuth': 144.36354987218, 'view:sun_elevation': 59.06665363921, 's2:degraded_msi_data_percentage': 0.0126, 's2:nodata_pixel_percentage': 12.146327, 's2:saturated_defective_pixel_percentage': 0, 's2:dark_features_percentage': 0.249403, 's2:cloud_shadow_percentage': 0.237454, 's2:vegetation_percentage': 6.073786, 's2:not_vegetated_percentage': 18.026696, 's2:water_percentage': 74.259061, 's2:unclassified_percentage': 0.198216, 's2:medium_proba_clouds_percentage': 0.613614, 's2:high_proba_clouds_percentage': 0.341423, 's2:thin_cirrus_percentage': 0.000325, 's2:snow_ice_percentage': 2.3e-05, 's2:product_type': 'S2MSI2A', 's2:processing_baseline': '05.09', 's2:product_uri': 'S2A_MSIL2A_20230827T084601_N0509_R107_T35SNA_20230827T115803.SAFE', 's2:generation_time': '2023-08-27T11:58:03.000000Z', 's2:datatake_id': 'GS2A_20230827T084601_042718_N05.09', 's2:datatake_type': 'INS-NOBS', 's2:datastrip_id': 'S2A_OPER_MSI_L2A_DS_2APS_20230827T115803_S20230827T085947_N05.09', 's2:granule_id': 'S2A_OPER_MSI_L2A_TL_2APS_20230827T115803_A042718_T35SNA_N05.09', 's2:reflectance_conversion_factor': 0.978189079756816, 'datetime': '2023-08-27T09:00:21.327000Z', 's2:sequence': '0', 'earthsearch:s3_path': 's3://sentinel-cogs/sentinel-s2-l2a-cogs/35/S/NA/2023/8/S2A_35SNA_20230827_0_L2A', 'earthsearch:payload_id': 'roda-sentinel2/workflow-sentinel2-to-stac/af0287974aaa3fbb037c6a7632f72742', 'earthsearch:boa_offset_applied': True, 'processing:software': {'sentinel2-to-stac': '0.1.1'}, 'updated': '2023-08-27T18:15:43.106Z'}

You can access items from the properties dictionary as usual in Python. For instance, for the EPSG code of the projected coordinate system:

PYTHON

print(item.properties['proj:epsg'])

Exercise: Search satellite scenes using metadata filters

Let’s add a filter on the cloud cover to select the only scenes with less than 1% cloud coverage. How many scenes do now match our search?

Hint: generic metadata filters can be implemented via the query input argument of client.search, which requires the following syntax (see docs): query=['<property><operator><value>'].

PYTHON

search = client.search(
    collections=[collection_sentinel_2_l2a],
    intersects=point,
    datetime='2023-07-01/2023-08-31',
    query=['eo:cloud_cover<1']
)
print(search.matched())

OUTPUT

11

Once we are happy with our search, we save the search results in a file:

PYTHON

items = search.item_collection()
items.save_object("rhodes_sentinel-2.json")

This creates a file in GeoJSON format, which we can reuse here and in the next episodes. Note that this file contains the metadata of the files that meet out criteria. It does not include the data itself, only their metadata.

To load the saved search results as a ItemCollection we can use pystac.ItemCollection.from_file(). Through this, we are instructing Python to use the from_file method of the ItemCollection class from the pystac library to load data from the specified GeoJSON file:

PYTHON

import pystac
items_loaded = pystac.ItemCollection.from_file("rhodes_sentinel-2.json")

The loaded item collection (items_loaded) is equivalent to the one returned earlier by search.item_collection() (items). You can thus perform the same actions on it: you can check the number of items (len(items_loaded)), you can loop over items (for item in items_loaded: ...), and you can access individual elements using their index (items_loaded[0]).

Access the assets


So far we have only discussed metadata - but how can one get to the actual images of a satellite scene (the “assets” in the STAC nomenclature)? These can be reached via links that are made available through the item’s attribute assets. Let’s focus on the last item in the collection: this is the oldest in time, and it thus corresponds to an image taken before the wildfires.

PYTHON

assets = items[-1].assets  # last item's asset dictionary
print(assets.keys())

OUTPUT

dict_keys(['aot', 'blue', 'coastal', 'granule_metadata', 'green', 'nir', 'nir08', 'nir09', 'red', 'rededge1', 'rededge2', 'rededge3', 'scl', 'swir16', 'swir22', 'thumbnail', 'tileinfo_metadata', 'visual', 'wvp', 'aot-jp2', 'blue-jp2', 'coastal-jp2', 'green-jp2', 'nir-jp2', 'nir08-jp2', 'nir09-jp2', 'red-jp2', 'rededge1-jp2', 'rededge2-jp2', 'rededge3-jp2', 'scl-jp2', 'swir16-jp2', 'swir22-jp2', 'visual-jp2', 'wvp-jp2'])

We can print a minimal description of the available assets:

PYTHON

for key, asset in assets.items():
    print(f"{key}: {asset.title}")

OUTPUT

aot: Aerosol optical thickness (AOT)
blue: Blue (band 2) - 10m
coastal: Coastal aerosol (band 1) - 60m
granule_metadata: None
green: Green (band 3) - 10m
nir: NIR 1 (band 8) - 10m
nir08: NIR 2 (band 8A) - 20m
nir09: NIR 3 (band 9) - 60m
red: Red (band 4) - 10m
rededge1: Red edge 1 (band 5) - 20m
rededge2: Red edge 2 (band 6) - 20m
rededge3: Red edge 3 (band 7) - 20m
scl: Scene classification map (SCL)
swir16: SWIR 1 (band 11) - 20m
swir22: SWIR 2 (band 12) - 20m
thumbnail: Thumbnail image
tileinfo_metadata: None
visual: True color image
wvp: Water vapour (WVP)
aot-jp2: Aerosol optical thickness (AOT)
blue-jp2: Blue (band 2) - 10m
coastal-jp2: Coastal aerosol (band 1) - 60m
green-jp2: Green (band 3) - 10m
nir-jp2: NIR 1 (band 8) - 10m
nir08-jp2: NIR 2 (band 8A) - 20m
nir09-jp2: NIR 3 (band 9) - 60m
red-jp2: Red (band 4) - 10m
rededge1-jp2: Red edge 1 (band 5) - 20m
rededge2-jp2: Red edge 2 (band 6) - 20m
rededge3-jp2: Red edge 3 (band 7) - 20m
scl-jp2: Scene classification map (SCL)
swir16-jp2: SWIR 1 (band 11) - 20m
swir22-jp2: SWIR 2 (band 12) - 20m
visual-jp2: True color image
wvp-jp2: Water vapour (WVP)

Among the other data files, assets include multiple raster data files (one per optical band, as acquired by the multi-spectral instrument), a thumbnail, a true-color image (“visual”), instrument metadata and scene-classification information (“SCL”). Let’s get the URL link to the thumbnail, which gives us a glimpse of the Sentinel-2 scene:

PYTHON

print(assets["thumbnail"].href)

OUTPUT

https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/35/S/NA/2023/7/S2A_35SNA_20230708_0_L2A/thumbnail.jpg

This can be used to download the corresponding file:

thumbnail of the sentinel-2 scene before the wildfires
Overview of the true-color image (“thumbnail”) before the wildfires on Rhodes

For comparison, we can check out the thumbnail of the most recent scene of the sequence considered (i.e. the first item in the item collection), which has been taken after the wildfires:

PYTHON

print(items[0].assets["thumbnail"].href)

OUTPUT

https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/35/S/NA/2023/8/S2A_35SNA_20230827_0_L2A/thumbnail.jpg
thumbnail of the sentinel-2 scene after the wildfires
Overview of the true-color image (“thumbnail”) after the wildfires on Rhodes

From the thumbnails alone we can already observe some dark spots on the island of Rhodes at the bottom right of the image!

In order to open the high-resolution satellite images and investigate the scenes in more detail, we will be using the rioxarray library. Note that this library can both work with local and remote raster data. At this moment, we will only quickly look at the functionality of this library. We will learn more about it in the next episode.

Now let us focus on the red band by accessing the item red from the assets dictionary and get the Hypertext Reference (also known as URL) attribute using .href after the item selection.

For now we are using rioxarray to open the raster file.

PYTHON

import rioxarray
red_href = assets["red"].href
red = rioxarray.open_rasterio(red_href)
print(red)

OUTPUT

<xarray.DataArray (band: 1, y: 10980, x: 10980)> Size: 241MB
[120560400 values with dtype=uint16]
Coordinates:
  * band         (band) int32 4B 1
  * x            (x) float64 88kB 5e+05 5e+05 5e+05 ... 6.098e+05 6.098e+05
  * y            (y) float64 88kB 4.1e+06 4.1e+06 4.1e+06 ... 3.99e+06 3.99e+06
    spatial_ref  int32 4B 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0

Now we want to save the data to our local machine using the to_raster method:

PYTHON

# save whole image to disk
red.rio.to_raster("red.tif")

That might take a while, given there are over 10000 x 10000 = a hundred million pixels in the 10-meter NIR band. But we can take a smaller subset before downloading it. Because the raster is a COG, we can download just what we need!

In order to do that, we are using rioxarray´s clip_box with which you can set a bounding box defining the area you want.

PYTHON

red_subset = red.rio.clip_box(
    minx=560900,
    miny=3995000,
    maxx=570900,
    maxy=4015000
)

Next, we save the subset using to_raster again.

PYTHON

red_subset.rio.to_raster("red_subset.tif")

The difference is 241 Megabytes for the full image vs less than 10 Megabytes for the subset.

Exercise: Downloading Landsat 8 Assets

In this exercise we put in practice all the skills we have learned in this episode to retrieve images from a different mission: Landsat 8. In particular, we browse images from the Harmonized Landsat Sentinel-2 (HLS) project, which provides images from NASA’s Landsat 8 and ESA’s Sentinel-2 that have been made consistent with each other. The HLS catalog is indexed in the NASA Common Metadata Repository (CMR) and it can be accessed from the STAC API endpoint at the following URL: https://cmr.earthdata.nasa.gov/stac/LPCLOUD.

  • Using pystac_client, search for all assets of the Landsat 8 collection (HLSL30.v2.0) from February to March 2021, intersecting the point with longitude/latitute coordinates (-73.97, 40.78) deg.
  • Visualize an item’s thumbnail (asset key browse).

PYTHON

# connect to the STAC endpoint
cmr_api_url = "https://cmr.earthdata.nasa.gov/stac/LPCLOUD"
client = Client.open(cmr_api_url)

# setup search
search = client.search(
    collections=["HLSL30.v2.0"],
    intersects=Point(-73.97, 40.78),
    datetime="2021-02-01/2021-03-30",
) # nasa cmr cloud cover filtering is currently broken: https://github.com/nasa/cmr-stac/issues/239

# retrieve search results
items = search.item_collection()
print(len(items))

OUTPUT

5

PYTHON

items_sorted = sorted(items, key=lambda x: x.properties["eo:cloud_cover"]) # sorting and then selecting by cloud cover
item = items_sorted[0]
print(item)

OUTPUT

<Item id=HLS.L30.T18TWL.2021039T153324.v2.0>

PYTHON

print(item.assets["browse"].href)

OUTPUT

'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-public/HLSL30.020/HLS.L30.T18TWL.2021039T153324.v2.0/HLS.L30.T18TWL.039T153324.v2.0.jpg'
thumbnail of the landsat-8 scene
Thumbnail of the Landsat-8 scene

Public catalogs, protected data

Publicly accessible catalogs and STAC endpoints do not necessarily imply publicly accessible data. Data providers, in fact, may limit data access to specific infrastructures and/or require authentication. For instance, the NASA CMR STAC endpoint considered in the last exercise offers publicly accessible metadata for the HLS collection, but most of the linked assets are available only for registered users (the thumbnail is publicly accessible).

The authentication procedure for dataset with restricted access might differ depending on the data provider. For the NASA CMR, follow these steps in order to access data using Python:

  • Create a NASA Earthdata login account here;
  • Set up a netrc file with your credentials, e.g. by using this script;
  • Define the following environment variables:

PYTHON

import os
os.environ["GDAL_HTTP_COOKIEFILE"] = "./cookies.txt"
os.environ["GDAL_HTTP_COOKIEJAR"] = "./cookies.txt"

Key Points

  • Accessing satellite images via the providers’ API enables a more reliable and scalable data retrieval.
  • STAC catalogs can be browsed and searched using the same tools and scripts.
  • rioxarray allows you to open and download remote raster files.

Content from Read and visualize raster data


Last updated on 2024-06-30 | Edit this page

Estimated time: 100 minutes

Overview

Questions

  • How is a raster represented by rioxarray?
  • How do I read and plot raster data in Python?
  • How can I handle missing data?

Objectives

  • Describe the fundamental attributes of a raster dataset.
  • Explore raster attributes and metadata using Python.
  • Read rasters into Python using the rioxarray package.
  • Visualize single/multi-band raster data.

Raster Data


In the first episode of this course we provided an introduction on what Raster datasets are and how these divert from vector data. In this episode we will dive more into raster data and focus on how to work with them. We introduce fundamental principles, python packages, metadata and raster attributes for working with this type of data. In addition, we will explore how Python handles missing and bad data values.

The Python package we will use throughout this episode to handle raster data is rioxarray. This package is based on the popular rasterio (which is build upon the GDAL library) for working with raster data and xarray for working with multi-dimensional arrays.

rioxarray extends xarray by providing top-level functions like the open_rasterio function to open raster datasets. Furthermore, it adds a set of methods to the main objects of the xarray package like the Dataset and the DataArray. These methods are made available via the rio accessor and become available from xarray objects after importing rioxarray.

Exploring rioxarray and getting help

Since a lot of the functions, methods and attributes from rioxarray originate from other packages (mostly rasterio), the documentation is in some cases limited and requires a little puzzling. It is therefore recommended to foremost focus at the notebook´s functionality to use tab completion and go through the various functionalities. In addition, adding a question mark ? after every function or method offers the opportunity to see the available options.

For instance if you want to understand the options for rioxarray´s open_rasterio function:

PYTHON

rioxarray.open_rasterio?

Introduce the data

In this episode, we will use satellite images from the search that we have carried out in the episode: “Access satellite imagery using Python”. Briefly, we have searched for Sentinel-2 scenes of Rhodes from July 1st to August 31st 2023 that have less than 1% cloud coverage. The search resulted in 11 scenes. We focus here on the most recent scene (August 27th), since that would show the situation after the wildfire, and use this as an example to demonstrate raster data loading and visualization.

For your convenience, we included the scene of interest among the datasets that you have already downloaded when following the setup instructions (the raster data files should be in the data/sentinel2 directory). You should, however, be able to download the same datasets “on-the-fly” using the JSON metadata file that was created in the previous episode (the file rhodes_sentinel-2.json).

If you choose to work with the provided data (which is advised in case you are working offline or have a slow/unstable network connection) you can skip the remaining part of the block and continue with the following section: Load a Raster and View Attributes.

If you want instead to experiment with downloading the data on-the-fly, you need to load the file rhodes_sentinel-2.json, which contains information on where and how to access the target images from the remote repository:

PYTHON

import pystac
items = pystac.ItemCollection.from_file("rhodes_sentinel-2.json")

The loaded item collection is equivalent to the one returned by pystac_client when querying the API in the the episode: “Access satellite imagery using Python”. You can thus perform the same actions on it, like accessing the individual items using their index. Here we select the first item in the collection, which is the most recent:

PYTHON

item = items[0]
print(item)

OUTPUT

<Item id=S2A_35SNA_20230827_0_L2A>

In this episode we will consider the red band and the true color image associated with this scene. They are labelled with the red and visual keys, respectively, in the asset dictionary. For each asset, we extract the URL / href (Hypertext Reference) that point to the file, and store it in a variable that we can use later on to access the data instead of the raster data paths:

PYTHON

rhodes_red_href = item.assets["red"].href  # red band
rhodes_visual_href = item.assets["visual"].href  # true color image

Load a Raster and View Attributes


To analyse the burned areas, we are interested in the red band of the satellite scene. In episode 9 we will further explain why the characteristics of that band are interesting in relation to wildfires. For now, we can load the red band using the function rioxarray.open_rasterio():

PYTHON

import rioxarray
rhodes_red = rioxarray.open_rasterio("data/sentinel2/red.tif")

The first call to rioxarray.open_rasterio() opens the file and it returns a xarray.DataArray object. The object is stored in a variable, i.e. rhodes_red. Reading in the data with xarray instead of rioxarray also returns a xarray.DataArray, but the output will not contain the geospatial metadata (such as projection information). You can use numpy functions or built-in Python math operators on a xarray.DataArray just like a numpy array. Calling the variable name of the DataArray also prints out all of its metadata information.

By printing the variable we can get a quick look at the shape and attributes of the data.

PYTHON

print(rhodes_red)

OUTPUT

<xarray.DataArray (band: 1, y: 10980, x: 10980)> Size: 241MB
[120560400 values with dtype=uint16]
Coordinates:
  * band         (band) int32 4B 1
  * x            (x) float64 88kB 5e+05 5e+05 5e+05 ... 6.098e+05 6.098e+05
  * y            (y) float64 88kB 4.1e+06 4.1e+06 4.1e+06 ... 3.99e+06 3.99e+06
    spatial_ref  int32 4B 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0

The output tells us that we are looking at an xarray.DataArray, with 1 band, 10980 rows, and 10980 columns. We can also see the number of pixel values in the DataArray, and the type of those pixel values, which is unsigned integer (or uint16). The DataArray also stores different values for the coordinates of the DataArray. When using rioxarray, the term coordinates refers to spatial coordinates like x and y but also the band coordinate. Each of these sequences of values has its own data type, like float64 for the spatial coordinates and int64 for the band coordinate.

This DataArray object also has a couple of attributes that are accessed like .rio.crs, .rio.nodata, and .rio.bounds() (in jupyter you can browse through these attributes by using tab for auto completion or have a look at the documentation here), which contains the metadata for the file we opened. Note that many of the metadata are accessed as attributes without (), however since bounds() is a method (i.e. a function in an object) it requires these parentheses this is also the case for .rio.resolution().

PYTHON

print(rhodes_red.rio.crs)
print(rhodes_red.rio.nodata)
print(rhodes_red.rio.bounds())
print(rhodes_red.rio.width)
print(rhodes_red.rio.height)
print(rhodes_red.rio.resolution())

OUTPUT

EPSG:32635
0
(499980.0, 3990240.0, 609780.0, 4100040.0)
10980
10980
(10.0, -10.0)

The Coordinate Reference System, or rhodes_red.rio.crs, is reported as the string EPSG:32635. The nodata value is encoded as 0 and the bounding box corners of our raster are represented by the output of .bounds() as a tuple (like a list but you can’t edit it). The height and width match what we saw when we printed the DataArray, but by using .rio.width and .rio.height we can access these values if we need them in calculations.

Visualize a Raster


After viewing the attributes of our raster, we can examine the raw values of the array with .values:

PYTHON

rhodes_red.values

OUTPUT

array([[[    0,     0,     0, ...,  8888,  9075,  8139],
        [    0,     0,     0, ..., 10444, 10358,  8669],
        [    0,     0,     0, ..., 10346, 10659,  9168],
        ...,
        [    0,     0,     0, ...,  4295,  4289,  4320],
        [    0,     0,     0, ...,  4291,  4269,  4179],
        [    0,     0,     0, ...,  3944,  3503,  3862]]], dtype=uint16)

This can give us a quick view of the values of our array, but only at the corners. Since our raster is loaded in python as a DataArray type, we can plot this in one line similar to a pandas DataFrame with DataArray.plot().

PYTHON

rhodes_red.plot()
raster plot with defualt setting
Raster plot with rioxarray

Notice that rioxarray helpfully allows us to plot this raster with spatial coordinates on the x and y axis (this is not the default in many cases with other functions or libraries). Nice plot! However, it probably took a while for it to load therefore it would make sense to resample it.

Resampling the raster image

The red band image is available as a raster file with 10 m resolution, which makes it a relatively large file (few hundreds MBs). In order to keep calculations “manageable” (reasonable execution time and memory usage) we select here a lower resolution version of the image, taking advantage of the so-called “pyramidal” structure of cloud-optimized GeoTIFFs (COGs). COGs, in fact, typically include multiple lower-resolution versions of the original image, called “overviews”, in the same file. This allows us to avoid downloading high-resolution images when only quick previews are required.

Overviews are often computed using powers of 2 as down-sampling (or zoom) factors. So, typically, the first level overview (index 0) corresponds to a zoom factor of 2, the second level overview (index 1) corresponds to a zoom factor of 4, and so on. Here, we open the third level overview (index 2, zoom factor 8) and check that the resolution is about 80 m:

PYTHON

import rioxarray
rhodes_red_80 = rioxarray.open_rasterio("data/sentinel2/red.tif", overview_level=2)
print(rhodes_red_80.rio.resolution())

OUTPUT

(79.97086671522214, -79.97086671522214)

Lets plot this one.

PYTHON

rhodes_red_80.plot()
raster plot with defualt setting
Raster plot 80 x 80 meter resolution with rioxarray

This plot shows the satellite measurement of the band red for Rhodes before the wildfire. According to the Sentinel-2 documentaion, this is a band with the central wavelength of 665nm. It has a spatial resolution of 10m. Note that the band=1 in the image title refers to the ordering of all the bands in the DataArray, not the Sentinel-2 band number 04 that we saw in the pystac search results.

Tool Tip

The option robust=True always forces displaying values between the 2nd and 98th percentile. Of course, this will not work for every case.

PYTHON

rhodes_red_80.plot(robust=True)
raster plot with robust setting
Raster plot using the “robust” setting

Now the color limit is set in a way fitting most of the values in the image. We have a better view of the ground pixels.

For a customized displaying range, you can also manually specifying the keywords vmin and vmax. For example ploting between 100 and 2000:

PYTHON

rhodes_red_80.plot(vmin=100, vmax=2000)
raster plot with robust setting
Raster plot using vmin 100 and vmax 2000

More options can be consulted here. You will notice that these parameters are part of the imshow method from the plot function. Since plot originates from matplotlib and is so widely used, your python environment helps you to interpret the parameters without having to specify the method. It is a service to help you, but can be confusing when teaching it. We will explain more about this below.

View Raster Coordinate Reference System (CRS) in Python

Another information that we’re interested in is the CRS, and it can be accessed with .rio.crs. We introduced the concept of a CRS in an earlier episode. Now we will see how features of the CRS appear in our data file and what meanings they have. We can view the CRS string associated with our DataArray’s rio object using the crs attribute.

PYTHON

print(rhodes_red_80.rio.crs)

OUTPUT

EPSG:32635

To print the EPSG code number as an int, we use the .to_epsg() method (which originally is part of rasterio to_epsg):

PYTHON

rhodes_red_80.rio.crs.to_epsg()

OUTPUT

32635

EPSG codes are great for succinctly representing a particular coordinate reference system. But what if we want to see more details about the CRS, like the units? For that, we can use pyproj , a library for representing and working with coordinate reference systems.

PYTHON

from pyproj import CRS
epsg = rhodes_red_80.rio.crs.to_epsg()
crs = CRS(epsg)
crs

OUTPUT

<Projected CRS: EPSG:32635>
Name: WGS 84 / UTM zone 35N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 24°E and 30°E, northern hemisphere between equator and 84°N, onshore and offshore. Belarus. Bulgaria. Central African Republic. Democratic Republic of the Congo (Zaire). Egypt. Estonia. Finland. Greece. Latvia. Lesotho. Libya. Lithuania. Moldova. Norway. Poland. Romania. Russian Federation. Sudan. Svalbard. Türkiye (Turkey). Uganda. Ukraine.
- bounds: (24.0, 0.0, 30.0, 84.0)
Coordinate Operation:
- name: UTM zone 35N
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

The CRS class from the pyproj library allows us to create a CRS object with methods and attributes for accessing specific information about a CRS, or the detailed summary shown above.

A particularly useful attribute is area_of_use, which shows the geographic bounds that the CRS is intended to be used.

PYTHON

crs.area_of_use

OUTPUT

AreaOfUse(west=24.0, south=0.0, east=30.0, north=84.0, name='Between 24°E and 30°E, northern hemisphere between equator and 84°N, onshore and offshore. Belarus. Bulgaria. Central African Republic. Democratic Republic of the Congo (Zaire). Egypt. Estonia. Finland. Greece. Latvia. Lesotho. Libya. Lithuania. Moldova. Norway. Poland. Romania. Russian Federation. Sudan. Svalbard. Türkiye (Turkey). Uganda. Ukraine.')

Exercise: find the axes units of the CRS

What units are our data in? See if you can find a method to examine this information using help(crs) or dir(crs)

crs.axis_info tells us that the CRS for our raster has two axis and both are in meters. We could also get this information from the attribute rhodes_red_80.rio.crs.linear_units.

Understanding pyproj CRS Summary

Let’s break down the pieces of the pyproj CRS summary. The string contains all of the individual CRS elements that Python or another GIS might need, separated into distinct sections, and datum.

OUTPUT

<Projected CRS: EPSG:32635>
Name: WGS 84 / UTM zone 35N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 24°E and 30°E, northern hemisphere between equator and 84°N, onshore and offshore. Belarus. Bulgaria. Central African Republic. Democratic Republic of the Congo (Zaire). Egypt. Estonia. Finland. Greece. Latvia. Lesotho. Libya. Lithuania. Moldova. Norway. Poland. Romania. Russian Federation. Sudan. Svalbard. Türkiye (Turkey). Uganda. Ukraine.
- bounds: (24.0, 0.0, 30.0, 84.0)
Coordinate Operation:
- name: UTM zone 35N
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
  • Name of the projection is UTM zone 35N (UTM has 60 zones, each 6-degrees of longitude in width). The underlying datum is WGS84.
  • Axis Info: the CRS shows a Cartesian system with two axes, easting and northing, in meter units.
  • Area of Use: the projection is used for a particular range of longitudes 24°E to 30°E in the northern hemisphere (0.0°N to 84.0°N)
  • Coordinate Operation: the operation to project the coordinates (if it is projected) onto a cartesian (x, y) plane. Transverse Mercator is accurate for areas with longitudinal widths of a few degrees, hence the distinct UTM zones.
  • Datum: Details about the datum, or the reference point for coordinates. WGS 84 and NAD 1983 are common datums. NAD 1983 is set to be replaced in 2022.

Note that the zone is unique to the UTM projection. Not all CRSs will have a zone. Below is a simplified view of US UTM zones.

UTM zones across the CONUS
The UTM zones across the continental United States (Chrismurf at English Wikipedia, via Wikimedia Commons (CC-BY))

Calculate Raster Statistics

It is useful to know the minimum or maximum values of a raster dataset. We can compute these and other descriptive statistics with min, max, mean, and std.

PYTHON

print(rhodes_red_80.min())
print(rhodes_red_80.max())
print(rhodes_red_80.mean())
print(rhodes_red_80.std())

OUTPUT

<xarray.DataArray ()> Size: 2B
array(0, dtype=uint16)
Coordinates:
    spatial_ref  int32 4B 0
<xarray.DataArray ()> Size: 2B
array(7277, dtype=uint16)
Coordinates:
    spatial_ref  int32 4B 0
<xarray.DataArray ()> Size: 8B
array(404.07532588)
Coordinates:
    spatial_ref  int32 4B 0
<xarray.DataArray ()> Size: 8B
array(527.5557502)
Coordinates:
    spatial_ref  int32 4B 0

The information above includes a report of the min, max, mean, and standard deviation values, along with the data type. If we want to see specific quantiles, we can use xarray’s .quantile() method. For example for the 25% and 75% quantiles:

PYTHON

print(rhodes_red_80.quantile([0.25, 0.75]))

OUTPUT

<xarray.DataArray (quantile: 2)> Size: 16B
array([165., 315.])
Coordinates:
  * quantile  (quantile) float64 16B 0.25 0.75

Data Tip - NumPy methods

You could also get each of these values one by one using numpy.

PYTHON

import numpy
print(numpy.percentile(rhodes_red_80, 25))
print(numpy.percentile(rhodes_red_80, 75))

OUTPUT

165.0
315.0

You may notice that rhodes_red_80.quantile and numpy.percentile didn’t require an argument specifying the axis or dimension along which to compute the quantile. This is because axis=None is the default for most numpy functions, and therefore dim=None is the default for most xarray methods. It’s always good to check out the docs on a function to see what the default arguments are, particularly when working with multi-dimensional image data. To do so, we can usehelp(rhodes_red_80.quantile) or ?rhodes_red_80.percentile if you are using jupyter notebook or jupyter lab.

Dealing with Missing Data

So far, we have visualized a band of a Sentinel-2 scene and calculated its statistics. However, as you can see on the image it also contains an artificial band to the top left where data is missing. In order to calculate meaningfull statistics, we need to take missing data into account. Raster data often has a “no data value” associated with it and for raster datasets read in by rioxarray. This value is referred to as nodata. This is a value assigned to pixels where data is missing or no data were collected. There can be different cases that cause missing data, and it’s common for other values in a raster to represent different cases. The most common example is missing data at the edges of rasters.

By default the shape of a raster is always rectangular. So if we have a dataset that has a shape that isn’t rectangular, like most satellite images, some pixels at the edge of the raster will have no data values. This often happens when the data were collected by a sensor which only flew over some part of a defined region and is also almost by default because of the fact that the earth is not flat and that we work with geographic and projected coordinate system.

To check the value of nodata of this dataset you can use:

PYTHON

rhodes_red_80.rio.nodata

OUTPUT

0

You will find out that this is 0. When we have plotted the band data, or calculated statistics, the missing value was not distinguished from other values. Missing data may cause some unexpected results.

To distinguish missing data from real data, one possible way is to use nan(which stands for Not a Number) to represent them. This can be done by specifying masked=True when loading the raster. Let us reload our data and put it into a different variable with the mask:

PYTHON

rhodes_red_mask_80 = rioxarray.open_rasterio("data/sentinel2/red.tif", masked=True, overview_level=2)

Let us have a look at the data.

PYTHON

print(rhodes_red_mask_80)

One can also use the where function, which is standard python functionality, to select all the pixels which are different from the nodata value of the raster:

PYTHON

rhodes_red_altmask_80 = rhodes_red_80.where(rhodes_red_80!=rhodes_red_80.rio.nodata)

Either way will change the nodata value from 0 to nan. Now if we compute the statistics again, the missing data will not be considered. Let´s compare them:

PYTHON

print(rhodes_red_80.min())
print(rhodes_red_mask_80.min())
print(rhodes_red_80.max())
print(rhodes_red_mask_80.max())
print(rhodes_red_80.mean())
print(rhodes_red_mask_80.mean())
print(rhodes_red_80.std())
print(rhodes_red_mask_80.std())

OUTPUT

<xarray.DataArray ()> Size: 2B
array(0, dtype=uint16)
Coordinates:
    spatial_ref  int32 4B 0
<xarray.DataArray ()> Size: 4B
array(1., dtype=float32)
Coordinates:
    spatial_ref  int32 4B 0
<xarray.DataArray ()> Size: 2B
array(7277, dtype=uint16)
Coordinates:
    spatial_ref  int32 4B 0
<xarray.DataArray ()> Size: 4B
array(7277., dtype=float32)
Coordinates:
    spatial_ref  int32 4B 0
<xarray.DataArray ()> Size: 8B
array(404.07532588)
Coordinates:
    spatial_ref  int32 4B 0
<xarray.DataArray ()> Size: 4B
array(461.78833, dtype=float32)
Coordinates:
    spatial_ref  int32 4B 0
<xarray.DataArray ()> Size: 8B
array(527.5557502)
Coordinates:
    spatial_ref  int32 4B 0
<xarray.DataArray ()> Size: 4B
array(539.82855, dtype=float32)
Coordinates:
    spatial_ref  int32 4B 0

And if we plot the image, the nodata pixels are not shown because they are not 0 anymore:

raster plot masking missing values
Raster plot after masking out missing values

One should notice that there is a side effect of using nan instead of 0 to represent the missing data: the data type of the DataArray was changed from integers to float (as can be seen when we printed the statistics). This needs to be taken into consideration when the data type matters in your application.

Raster Bands

So far we looked into a single band raster, i.e. the red band of a Sentinel-2 scene. However, for certain applications it is helpful to visualize the true-color image of the region. This is provided as a multi-band raster – a raster dataset that contains more than one band.

multi-band raster
Sketch of a multi-band raster image

The visual asset in the Sentinel-2 scene is a multiband asset. Similar to the red band, we can load it by:

PYTHON

rhodes_visual = rioxarray.open_rasterio('data/sentinel2/visual.tif', overview_level=2)
rhodes_visual

OUTPUT

<xarray.DataArray (band: 3, y: 1373, x: 1373)> Size: 6MB
[5655387 values with dtype=uint8]
Coordinates:
  * band         (band) int32 12B 1 2 3
  * x            (x) float64 11kB 5e+05 5.001e+05 ... 6.097e+05 6.097e+05
  * y            (y) float64 11kB 4.1e+06 4.1e+06 4.1e+06 ... 3.99e+06 3.99e+06
    spatial_ref  int32 4B 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0

The band number comes first when GeoTiffs are read with the .open_rasterio() function. As we can see in the xarray.DataArray object, the shape is now (band: 3, y: 1373, x: 1373), with three bands in the band dimension. It’s always a good idea to examine the shape of the raster array you are working with and make sure it’s what you expect. Many functions, especially the ones that plot images, expect a raster array to have a particular shape. One can also check the shape using the .shape attribute:

PYTHON

rhodes_visual.shape

OUTPUT

(3, 1373, 1373)

One can visualize the multi-band data with the DataArray.plot.imshow() function:

PYTHON

rhodes_visual.plot.imshow()
true-color image overview
Overview of the true-color image (multi-band raster)

Note that the DataArray.plot.imshow() function makes assumptions about the shape of the input DataArray, that since it has three channels, the correct colormap for these channels is RGB. It does not work directly on image arrays with more than 3 channels. One can replace one of the RGB channels with another band, to make a false-color image.

Exercise: set the plotting aspect ratio

As seen in the figure above, the true-color image is stretched. Let’s visualize it with the right aspect ratio. You can use the documentation of DataArray.plot.imshow().

Since we know the height/width ratio is 1:1 (check the rio.height and rio.width attributes), we can set the aspect ratio to be 1. For example, we can choose the size to be 5 inches, and set aspect=1. Note that according to the documentation of DataArray.plot.imshow(), when specifying the aspect argument, size also needs to be provided.

PYTHON

rhodes_visual.plot.imshow(size=5, aspect=1)
raster plot with correct aspect ratio
Overview of the true-color image with the correct aspect ratio

Key Points

  • rioxarray and xarray are for working with multidimensional arrays like pandas is for working with tabular data.
  • rioxarray stores CRS information as a CRS object that can be converted to an EPSG code or PROJ4 string.
  • Missing raster data are filled with nodata values, which should be handled with care for statistics and visualization.

Content from Vector data in Python


Last updated on 2024-06-25 | Edit this page

Estimated time: 50 minutes

Overview

Questions

  • How can I read, inspect, and process spatial objects, such as points, lines, and polygons?

Objectives

  • Load spatial objects.
  • Select the spatial objects within a bounding box.
  • Perform a CRS conversion of spatial objects.
  • Select features of spatial objects.
  • Match objects in two datasets based on their spatial relationships.

Introduction


In the preceding episodes, we have prepared, selected and downloaded raster data from before and after the wildfire event in the summer of 2023 on the Greek island of Rhodes. To evaluate the impact of this wildfire on the vital infrastructure and built-up areas we are going to create a subset of vector data representing these assets. In this episode you will learn how to extract vector data with specific characteristics like the type of attributes or their locations. The dataset that we will generate in this episode can lateron be confronted with scorched areas which we determine by analyzing the satellite images Episode 9: Raster Calculations in Python.

We’ll be examining vector datasets that represent the valuable assests of Rhodes. As mentioned in Episode 2: Introduction to Vector Data, vector data uses points, lines, and polygons to depict specific features on the Earth’s surface. These geographic elements can have one or more attributes, like ‘name’ and ‘population’ for a city. In this episode we’ll be using two open data sources: the Database of Global Administrative Areas (GADM) dataset to generate a polygon for the island of Rhodes and and Open Street Map data for the vital infrastructure and valuable assets.

To handle the vector data in python we use the package geopandas. This package allows us to open, manipulate, and write vector dataset through python.

Pandas and Geopandas

geopandas enhances the widely-used pandas library for data analysis by extending its functionality to geospatial applications. The primary pandas objects (Series and DataFrame) are extended to geopandas objects (GeoSeries and GeoDataFrame). This extension is achieved by incorporating geometric types, represented in Python using the shapely library, and by offering dedicated methods for spatial operations like union, spatial joins and intersect. In order to understand how geopandas works, it is good to provide a brief explanation of the relationship between Series, a DataFrame, GeoSeries, and a GeoDataFrame:

  • A Series is a one-dimensional array with an axis that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.)
  • A DataFrame is a two-dimensional labeled data structure with columns that can potentially hold different types of data.
  • A GeoSeries is a Series object designed to store shapely geometry objects.
  • A GeoDataFrame is an extended pandas.DataFrame that includes a column with geometry objects, which is a GeoSeries.

Introduce the Vector Data

In this episode, we will use the downloaded vector data from the data directory. Please refer to the setup page on where to download the data. Note that we manipulated that data a little for the purposes of this workshop. The link to the original source can be found on the setup page.

Get the administration boundary of study area


The first thing we want to do is to extract a polygon containing the boundary of the island of Rhodes from Greece. For this we will use the GADM dataset layer ADM_ADM_3.gpkg for Greece. For your convenience we saved a copy at: data/data/gadm/ADM_ADM_3.gpkg We will use the geopandas package to load the file and use the read_file function see. Note that geopandas is often abbreviated as gpd.

PYTHON

import geopandas as gpd
gdf_greece = gpd.read_file('../data/gadm/ADM_ADM_3.gpkg')

We can print out the gdf_greecevariable:

PYTHON

gdf_greece

OUTPUT

GID_3 GID_0 COUNTRY    GID_1                       NAME_1  \
0     GRC.1.1.1_1   GRC  Greece  GRC.1_1                       Aegean
1     GRC.1.1.2_1   GRC  Greece  GRC.1_1                       Aegean
2     GRC.1.1.3_1   GRC  Greece  GRC.1_1                       Aegean
..            ...   ...     ...      ...                          ...
324  GRC.8.2.24_1   GRC  Greece  GRC.8_1  Thessaly and Central Greece
325  GRC.8.2.25_1   GRC  Greece  GRC.8_1  Thessaly and Central Greece

                      NL_NAME_1      GID_2        NAME_2      NL_NAME_2  \
0                        Αιγαίο  GRC.1.1_1  North Aegean  Βόρειο Αιγαίο
1                        Αιγαίο  GRC.1.1_1  North Aegean  Βόρειο Αιγαίο
2                        Αιγαίο  GRC.1.1_1  North Aegean  Βόρειο Αιγαίο
..                          ...        ...           ...            ...
324  Θεσσαλία και Στερεά Ελλάδα  GRC.8.2_1      Thessaly       Θεσσαλία
325  Θεσσαλία και Στερεά Ελλάδα  GRC.8.2_1      Thessaly       Θεσσαλία
...
324  POLYGON ((22.81903 39.27344, 22.81884 39.27332...
325  POLYGON ((23.21375 39.36514, 23.21272 39.36469...

[326 rows x 17 columns]

The data are read into the variable fields as a GeoDataFrame. This is an extened data format of pandas.DataFrame, with an extra column geometry. To explore the dataframe you can call this variable just like a pandas dataframe by using functions like .shape, .head and .tail etc.

To visualize the polygons we can use the plot() function to the GeoDataFrame we have loaded gdf_greece:

PYTHON

gdf_greece.plot()
greece_administrations

If you want to interactively explore your data you can use the .explore function in geopandas:

PYTHON

gdf_greece.explore()

In this interactive map you can easily zoom in and out and hover over the polygons to see which attributes, stored in the rows of your GeoDataFrame, are related to each polygon.

Next, we’ll focus on isolating the administrative area of Rhodes Island. Once you hover over the polygon of Rhodos (the relatively big island to the east) you will find out that the label Rhodos is stored in the NAME_3 column of gdf_greece, where Rhodes Island is listed as Rhodos. Since our goal is to have a boundary of Rhodes, we’ll now create a new variable that exclusively represents Rhodes Island.

To select an item in our GeoDataFrame with a specific value is done the same way in which this is done in a pandas DataFrame using .loc.

PYTHON

gdf_rhodes = gdf_greece.loc[gdf_greece['NAME_3']=='Rhodos']

And we can plot the overview by (or show it interactively using .explore):

PYTHON

gdf_rhodes.plot()
rhodes_administrations

Now that we have the administrative area of Rhodes Island. We can use the to_file() function save this file for future use.

PYTHON

# Save the rhodes_boundary to gpkg
gdf_rhodes.to_file('rhodes.gpkg')

Get the vital infrastructure and built-up areas


Road data from Open Street Map (OSM)

Now that we have the boundary of our study area, we will make use this to select the main roads in our study area. We will make the following processing steps:

  1. Select roads of study area
  2. Select key infrastruture: ‘primary’, ‘secondary’, ‘tertiary’
  3. Create a 100m buffer around the rounds. This buffer will be regarded as the infrastructure region. (note that this buffer is arbitrary and can be changed afterwards if you want!)

Step 1: Select roads of study area

For this workshop, in particular to not have everyone downloading too much data, we created a subset of the Openstreetmap data we downloaded for Greece from the Geofabrik. This data comes in the form of a shapefile (see episode 2) from which we extracted all the roads for Rhodes and some surrounding islands. The data is stored in the osm folder as osm_roads.gpkg, but contains all the roads on the island (so also hiking paths, private roads etc.), whereas we in particular are interested in the key infrastructure which we consider to be roads classified as primary, secondary or tertiary roads.

Let’s load the file and plot it:

PYTHON

gdf_roads = gpd.read_file('../data/osm/osm_roads.gpkg')

We can explore it using the same commands as above:

PYTHON

gdf_roads.plot()
greece_highways

As you may have noticed, loading and plotting osm_roads.gpkg takes a bit long. This is because the file contains all the roads of Rhodos and some surrounding islands as well. Since we are only interested in the roads on Rhodes Island. We will use the mask parameter of the read_file() function to load only the roads on Rhodes Island.

Now let us overwrite the GeoDataframe gdf_roads using the mask with the GeoDataFrame gdf_rhodes we created above.

PYTHON

gdf_roads = gpd.read_file('../data/osm/osm_roads.gpkg', mask=gdf_rhodes)

Now let us explore these roads using .explore (or .plot):

PYTHON

gdf_roads.explore()
rhodes_highways

Step 2: Select key infrastruture

As you will find out while exploring the roads dataset, information about the type of roads is stored in the fclass column. To get an overview of the different values that are present in the collumn fclass , we can use the unique() function from pandas:

PYTHON

gdf_roads['fclass'].unique()

OUTPUT

array(['residential', 'service', 'unclassified', 'footway',
       'track_grade4', 'primary', 'track', 'tertiary', 'track_grade3',
       'path', 'track_grade5', 'steps', 'secondary', 'primary_link',
       'track_grade2', 'track_grade1', 'pedestrian', 'tertiary_link',
       'secondary_link', 'living_street', 'cycleway'], dtype=object)

It seems the variable gdf_roads contains all kind of hiking paths and footpaths as well. Since we are only interested in vital infrastructure, classified as “primary”, “secondary” and “tertiary” roads, we need to make a subselection.

Let us first create a list with the labels we want to select.

PYTHON

key_infra_labels = ['primary', 'secondary', 'tertiary']

Now we are using this list make a subselection of the key infrastructure using pandas´ .isin function.

PYTHON

key_infra = gdf_roads.loc[gdf_roads['fclass'].isin(key_infra_labels)]

We can plot the key infrastructure :

PYTHON

key_infra.plot()
rhodes_infra_highways

Step 3: Create a 100m buffer around the key infrastructure

Now that we selected the key infrastructure, we want to create a 100m buffer around them. This buffer will be regarded as the infrastructure region.

As you might have notice, the numbers on the x and y axis of our plots represent Lon Lat coordinates, meaning that the data is not yet projected. The current data has a geographic coordinate system with measures in degrees but not meter. Creating a buffer of 100 meters is not possible. Therfore, in order to create a 100m buffer, we first need to project our data. In our case we decided to project the data as WGS 84 / UTM zone 31N, with EPSG code 32631 (see chapter 03 for more information about the CRS and EPSG codes.

To project our data we use .to_crs. We first define a variable with the EPSG value (in our case 32631), which we then us in the to_crs function.

PYTHON

epsg_code = 32631
key_infra_meters = key_infra.to_crs(epsg_code)

Now that our data is projected, we can create a buffer. For this we make use of geopandas´ .buffer function :

PYTHON

key_infra_meters_buffer = key_infra_meters.buffer(100)
key_infra_meters_buffer

OUTPUT

53       POLYGON ((2779295.383 4319805.295, 2779317.029...
54       POLYGON ((2779270.962 4319974.441, 2779272.393...
55       POLYGON ((2779172.341 4319578.062, 2779165.312...
84       POLYGON ((2779615.109 4319862.058, 2779665.519...
140      POLYGON ((2781330.698 4320046.538, 2781330.749...
                               ...
19020    POLYGON ((2780193.230 4337691.133, 2780184.279...
19021    POLYGON ((2780330.823 4337772.262, 2780324.966...
19022    POLYGON ((2780179.850 4337917.135, 2780188.871...
19024    POLYGON ((2780516.550 4339028.863, 2780519.340...
19032    POLYGON ((2780272.050 4338213.937, 2780274.519...
Length: 1386, dtype: geometry

Note that the type of the key_infra_meters_buffer is a GeoSeries and not a GeoDataFrame. This is because the buffer() function returns a GeoSeries object. You can check that by calling the type of the variable.

PYTHON

type(key_infra_meters_buffer)

OUTPUT

geopandas.geoseries.GeoSeries

Now that we have a buffer, we can convert it back to the geographic coordinate system to keep the data consistent. Note that we are now using the crs information from the key_infra, instead of using the EPSG code directly (EPSG:4326):

PYTHON

key_infra_buffer = key_infra_meters_buffer.to_crs(key_infra.crs)
key_infra_buffer

OUTPUT

45       POLYGON ((27.72826 36.12409, 27.72839 36.12426...
58       POLYGON ((27.71666 36.11678, 27.71665 36.11678...
99       POLYGON ((27.75485 35.95242, 27.75493 35.95248...
100      POLYGON ((27.76737 35.95086, 27.76733 35.95086...
108      POLYGON ((27.76706 35.95199, 27.76702 35.95201...
                               ...
18876    POLYGON ((28.22855 36.41890, 28.22861 36.41899...
18877    POLYGON ((28.22819 36.41838, 28.22825 36.41845...
18878    POLYGON ((28.22865 36.41904, 28.22871 36.41912...
18879    POLYGON ((28.23026 36.41927, 28.23034 36.41921...
18880    POLYGON ((28.23020 36.41779, 28.23007 36.41745...
Length: 1369, dtype: geometry

As you can see, the buffers created in key_infra_buffer have the Polygon geometry type.

To double check the EPSG code of key_infra:

PYTHON

print(key_infra.crs)

OUTPUT

EPSG:4326

Reprojecting and buffering our data is something that we are going to do multiple times during this episode. To avoid have to call the same functions multiple times it would make sense to create a function. Therefore, let us create a function in which we can add the buffer as a variable.

PYTHON

def buffer_crs(gdf, size, meter_crs=32631, target_crs=4326):
    return gdf.to_crs(meter_crs).buffer(size).to_crs(target_crs)

For example, we can use this function to create a 200m buffer around the infrastructure the key infrastructure by doing:

PYTHON

key_infra_buffer_200 = buffer_crs(key_infra, 200)

Get built-up regions from Open Street Map (OSM)

Now that we have a buffered dataset for the key infrastructure of Rhodes, our next step is to create a dataset with all the built-up areas. To do so we will use the land use data from OSM, which we prepared for you in the file data/osm_landuse.gpkg. This file includes the land use data for the entire Greece. We assume the built-up regions to be the union of three types of land use: “commercial”, “industrial”, and “residential”.

Note that for the simplicity of this course, we limit the built-up regions to these three types of land use. In reality, the built-up regions can be more complex also there is definately more high quality (e.g. local government).

Now it will be up to you to create a dataset with valueable assets. You should be able to complete this task by yourself with the knowledge you have gained from the previous steps and links to the documentation we provided.

Exercise: Get the built-up regions

Create a builtup_buffer from the file data/osm/osm_landuse.gpkg by the following steps:

  1. Load the land use data from data/osm/osm_landuse.gpkg and mask it with the administrative boundary of Rhodes Island (gdf_rhodes).
  2. Select the land use data for “commercial”, “industrial”, and “residential”.
  3. Create a 10m buffer around the land use data.
  4. Visualize the results.

After completing the exercise, answer the following questions:

  1. How many unique land use types are there in osm_landuse.gpkg?
  2. After selecting the three types of land use, how many entries (rows) are there in the results?

Hints:

  • data/osm_landuse.gpkg contains the land use data for the entire Greece. Use the administrative boundary of Rhodes Island (gdf_rhodes) to select the land use data for Rhodes Island.
  • The land use attribute is stored in the fclass column.
  • Reuse buffer_crs function to create the buffer.

PYTHON

# Read data with a mask of Rhodes
gdf_landuse = gpd.read_file('./data_workshop/osm/osm_landuse.gpkg', mask=gdf_rhodes)

# Find number of unique landuse types
print(len(gdf_landuse['fclass'].unique()))

# Extract built-up regions
builtup_labels = ['commercial', 'industrial', 'residential']
builtup = gdf_landuse.loc[gdf_landuse['fclass'].isin(builtup_labels)]

# Create 10m buffer around the built-up regions
builtup_buffer = buffer_crs(builtup, 10)

# Get the number of entries
print(len(builtup_buffer))

# Visualize the buffer
builtup_buffer.plot()

OUTPUT

19
1336
rhodes_builtup_buffer

Merge the infrastructure regions and built-up regions


Now that we have the infrastructure regions and built-up regions, we can merge them into a single region. We would like to keep track of the type after merging, so we will add two new columns: type and code by converting the GeoSeries to GeoDataFrame.

First we convert the buffer around key infrastructure:

PYTHON

data = {'geometry': key_infra_buffer, 'type': 'infrastructure', 'code': 1}
gdf_infra = gpd.GeoDataFrame(data)

Then we convert the built-up buffer:

PYTHON

data = {'geometry': builtup_buffer, 'type': 'builtup', 'code': 2}
gdf_builtup = gpd.GeoDataFrame(data)

After that, we can merge the two GeoDataFrame into one:

PYTHON

import pandas as pd
gdf_assets = pd.concat([gdf_infra, gdf_builtup]).reset_index(drop=True)

In gdf_assets, we can distinguish the infrastructure regions and built-up regions by the type and code columns. We can plot the gdf_assets to visualize the merged regions. See the geopandas documentation on how to do this:

PYTHON

gdf_assets.plot(column='type', legend=True)
rhodes_assets

Finally, we can save the gdf_assets to a file for future use:

PYTHON

gdf_assets.to_file('assets.gpkg')

Key Points

  • Load spatial objects into Python with geopandas.read_file() function.
  • Spatial objects can be plotted directly with GeoDataFrame’s .plot() method.
  • Convert CRS of spatial objects with .to_crs(). Note that this generates a GeoSeries object.
  • Create a buffer of spatial objects with .buffer().
  • Merge spatial objects with pd.concat().

Content from Crop raster data with rioxarray and geopandas


Last updated on 2024-06-25 | Edit this page

Estimated time: 40 minutes

Overview

Questions

  • How can I crop my raster data to the area of interest?

Objectives

  • Align the CRS of geopatial data.
  • Crop raster data with a bounding box.
  • Crop raster data with a polygon.

It is quite common that the raster data you have in hand is too large to process, or not all the pixels are relevant to your area of interest (AoI). In both situations, you should consider cropping your raster data before performing data analysis.

In this episode, we will introduce how to crop raster data into the desired area. We will use one Sentinel-2 image over Rhodes Island as the example raster data, and introduce how to crop your data to different types of AoIs.

Introduce the Data

In this episode, we will work with both raster and vector data.

As raster data, we will use satellite images from the search that we have carried out in the episode: “Access satellite imagery using Python” as well as Digital Elevation Model (DEM) data from the Copernicus DEM GLO-30 dataset.

For the satellite images, we have searched for Sentinel-2 scenes of Rhodes from July 1st to August 31st 2023 that have less than 1% cloud coverage. The search resulted in 11 scenes. We focus here on the most recent scene (August 27th), since that would show the situation after the wildfire, and use this as an example to demonstrate raster data cropping.

For your convenience, we have included the scene of interest among the datasets that you have already downloaded when following the setup instructions. You should, however, be able to download the satellite images “on-the-fly” using the JSON metadata file that was created in the previous episode (the file rhodes_sentinel-2.json).

If you choose to work with the provided data (which is advised in case you are working offline or have a slow/unstable network connection) you can skip the remaining part of the block and continue with the following section: Align the CRS of the raster and the vector data.

If you want instead to experiment with downloading the data on-the-fly, you need to load the file rhodes_sentinel-2.json, which contains information on where and how to access the target satellite images from the remote repository:

PYTHON

import pystac
items = pystac.ItemCollection.from_file("rhodes_sentinel-2.json")

You can then select the first item in the collection, which is the most recent in the sequence:

PYTHON

item = items[0]
print(item)

OUTPUT

<Item id=S2A_35SNA_20230827_0_L2A>

In this episode we will consider the true color image associated with this scene, which is labelled with the visual key in the asset dictionary. We extract the URL / href (Hypertext Reference) that point to the file, and store it in a variable that we can use later on instead of the raster data path to access the data:

PYTHON

rhodes_visual_href = item.assets["visual"].href  # true color image

As vector data, we will use the assets.gpkg, which was generated in an exercise from Episode 7: Vector data in python.

Align the CRS of the raster and the vector data


Data loading

First, we will load the visual image of Sentinel-2 over Rhodes Island, which we downloaded and stored in data/sentinel2/visual.tif.

We can open this asset with rioxarray, and specify the overview level, since this is a Cloud-Optimized GeoTIFF (COG) file. As explained in episode 6 raster images can be quite big, therefore we decided to resample the data using ´rioxarray’s´ overview parameter and set it to overview_level=1.

PYTHON

import rioxarray
path_visual = 'data/sentinel2/visual.tif'
visual = rioxarray.open_rasterio(path_visual, overview_level=1)
visual

OUTPUT

<xarray.DataArray (band: 3, y: 2745, x: 2745)>
[22605075 values with dtype=uint8]
Coordinates:
  * band         (band) int64 1 2 3
  * x            (x) float64 5e+05 5e+05 5.001e+05 ... 6.097e+05 6.098e+05
  * y            (y) float64 4.1e+06 4.1e+06 4.1e+06 ... 3.99e+06 3.99e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0

As we introduced in the raster data introduction episode, this will perform a “lazy” loading of the image meaning that the image will not be loaded into the memory until necessary.

Let’s also load the assets file generated in the vector data episode:

PYTHON

import geopandas as gpd
assets = gpd.read_file('assets.gpkg')

Crop the raster with a bounding box

The assets file contains the information of the vital infrastructure and built-up areas on the island Rhodes. The visual image, on the other hand, has a larger extent. Let us check this by visualizing the raster image:

PYTHON

visual.plot.imshow()
Large visual raster

Let’s check the extent of the assets to find out its rough location in the raster image. We can use the total_bounds attribute from GeoSeries of geopandas to get the bounding box:

PYTHON

assets.total_bounds

OUTPUT

array([27.7121001 , 35.87837949, 28.24591124, 36.45725024])

The bounding box is composed of the [minx, miny, maxx, maxy] values of the raster. Comparing these values with the raster image, we can identify that the magnitude of the bounding box coordinates does not match the coordinates of the raster image. This is because the two datasets have different coordinate reference systems (CRS). This will cause problems when cropping the raster image, therefore we first need to align the CRS-s of the two datasets

Considering the raster image has larger data volume than the vector data, we will reproject the vector data to the CRS of the raster data. We can use the to_crs method:

PYTHON

# Reproject
assets = assets.to_crs(visual.rio.crs)

# Check the new bounding box
assets.total_bounds

OUTPUT

array([ 564058.0257114, 3970719.4080227,  611743.71498815, 4035358.56340039])

Now the bounding box coordinates are updated. We can use the clip_box function, through the rioaxarray accessor, to crop the raster image to the bounding box of the vector data. clip_box takes four positional input arguments in the order of xmin, ymin, xmax, ymax, which is exactly the same order in the assets.total_bounds. Since assets.total_bounds is an numpy.array, we can use the symbol * to unpack it to the relevant positions in clip_box.

PYTHON

# Crop the raster with the bounding box
visual_clipbox = visual.rio.clip_box(*assets.total_bounds)

# Visualize the cropped image
visual_clipbox.plot.imshow()
Clip box results

Code Tip

Cropping a raster with a bounding box is a quick way to reduce the size of the raster data. Since this operation is based on min/max coordinates, it is not as computational extensive as cropping with polygons, which requires more accurate overlay operations.

Crop the raster with a polygon

We can also crop the raster with a polygon. In this case, we will use the raster clip function through the rio accessor. For this we will use the geometry column of the assets GeoDataFrame to specify the polygon:

PYTHON

# Crop the raster with the polygon
visual_clip = visual_clipbox.rio.clip(assets["geometry"])

# Visualize the cropped image
visual_clip.plot.imshow()
Clip results

Exercise: Get the red band for the Rhodes

Now that you have seen how clip a raster using a polygon, we want you to do this for the red band of the sattelite image. Use the shape of Rhodes from GADM and clip the red band with it. Furthermore, make sure to transform the no data values to Not a Number values.

PYTHON

# Solution

# Step 1 - Load the datasets - Vector data

import geopandas as gpd
gdf_greece = gpd.read_file('./data_workshop/gadm/ADM_ADM_3.gpkg')
gdf_rhodes = gdf_greece.loc[gdf_greece['NAME_3']=='Rhodos']

# Step 2 - Load the raster red band
import rioxarray
path_red = './data_workshop/sentinel2/red.tif'
red = rioxarray.open_rasterio(path_red, overview_level=1)

# Step 3 - It will not work, since it is not projected yet

gdf_rhodes = gdf_rhodes.to_crs(red.rio.crs)

# Step 4 - Clip the two

red_clip = red.rio.clip(gdf_rhodes["geometry"])

# Step 5 - assing nan values to no data

red_clip_nan = red_clip.where(red_clip!=red_clip.rio.nodata)

# Step 6 - Visualize the result

red_clip_nan.plot()
rhodes_builtup_buffer

Match two rasters

Sometimes you need to match two rasters with different extents, resolutions, or CRS. For this you can use the reproject_match function . We will demonstrate this by matching the cropped raster visual_clip with the Digital Elevation Model (DEM),rhodes_dem.tif of Rhodes.

First, let’s load the DEM:

PYTHON

dem = rioxarray.open_rasterio('./data/dem/rhodes_dem.tif')

And visualize it:

PYTHON

dem.plot()
DEM

From the visualization, we can see that the DEM has a different extent, resolution and CRS compared to the cropped visual image. We can also confirm this by checking the CRS of the two images:

PYTHON

print(dem.rio.crs)
print(visual_clip.rio.crs)

OUTPUT

EPSG:4326
EPSG:32635

We can use the reproject_match function to match the two rasters. One can choose to match the dem to the visual image or vice versa. Here we will match the DEM to the visual image:

PYTHON

dem_matched = dem.rio.reproject_match(visual_clip)

And then visualize the matched DEM:

PYTHON

dem_matched.plot()
Matched DEM

As we can see, reproject_match does a lot of helpful things in one line of code:

  1. It reprojects.
  2. It matches the extent.
  3. It matches the resolution.

Finally, we can save the matched DEM for later use. We save it as a Cloud-Optimized GeoTIFF (COG) file:

PYTHON

dem_matched.rio.to_raster('dem_rhodes_match.tif', driver='COG')

Code Tip

There is also a method in rioxarray: reproject(), which only reprojects one raster to another projection. If you want more control over how rasters are resampled, clipped, and/or reprojected, you can use the reproject() method individually.

Key Points

  • Use clip_box to crop a raster with a bounding box.
  • Use clip to crop a raster with a given polygon.
  • Use reproject_match to match two raster datasets.

Content from Raster Calculations in Python


Last updated on 2024-06-25 | Edit this page

Estimated time: 75 minutes

Overview

Questions

  • How do I perform calculations on rasters and extract pixel values for defined locations?

Objectives

  • Carry out operations with two rasters using Python’s built-in math operators.

Introduction


We often want to combine values of and perform calculations on rasters to create a new output raster. This episode covers how to perform basic math operations using raster datasets. It also illustrates how to match rasters with different resolutions so that they can be used in the same calculation. As an example, we will calculate a binary classification mask to identify burned area over a satellite scene.

The classification mask requires the following of the Sentinel-2 bands (and derived indices):

\[ NDVI = \frac{NIR - red}{NIR + red} \]

\[ NDWI = \frac{green - NIR}{green + NIR} \]

  • A custom index derived from two of the short-wave infrared (SWIR) bands (with wavelenght ~1600 nm and ~2200 nm, respectively):

\[ INDEX = \frac{SWIR_{16} - SWIR_{22}}{SWIR_{16} + SWIR_{22}}\]

  • The blue, NIR, and SWIR (1600 nm) bands.

In the following, we start by computing the NDVI.

Load and crop the Data

For this episode, we will use one of the Sentinel-2 scenes that we have already employed in the previous episodes.

Introduce the Data

We will use satellite images from the search that we have carried out in the episode: “Access satellite imagery using Python”. Briefly, we have searched for Sentinel-2 scenes of Rhodes from July 1st to August 31st 2023 that have less than 1% cloud coverage. The search resulted in 11 scenes. We focus here on the most recent scene (August 27th), since that would show the situation after the wildfire, and use this as an example to demonstrate raster calculations.

For your convenience, we have included the scene of interest among the datasets that you have already downloaded when following the setup instructions. You should, however, be able to download the satellite images “on-the-fly” using the JSON metadata file that was created in the previous episode (the file rhodes_sentinel-2.json).

If you choose to work with the provided data (which is advised in case you are working offline or have a slow/unstable network connection) you can skip the remaining part of the block.

If you want instead to experiment with downloading the data on-the-fly, you need to load the file rhodes_sentinel-2.json, which contains information on where and how to access the target satellite images from the remote repository:

PYTHON

import pystac
items = pystac.ItemCollection.from_file("rhodes_sentinel-2.json")

You can then select the first item in the collection, which is the most recent in the sequence:

PYTHON

item = items[0]
print(item)

OUTPUT

<Item id=S2A_35SNA_20230827_0_L2A>

In this episode we will consider a number of bands associated with this scene. We extract the URL / href (Hypertext Reference) that point to each of the raster files, and store these in variables that we can use later on instead of the raster data paths to access the data:

PYTHON

rhodes_red_href = item.assets["red"].href  # red band
rhodes_green_href = item.assets["green"].href  # green band
rhodes_blue_href = item.assets["blue"].href  # blue band
rhodes_nir_href = item.assets["nir"].href  # near-infrared band
rhodes_swir16_href = item.assets["swir16"].href  # short-wave infrared (1600 nm) band
rhodes_swir22_href = item.assets["swir22"].href  # short-wave infrared (2200 nm) band
rhodes_visual_href = item.assets["visual"].href  # true-color image

Let’s load the red and NIR band rasters with open_rasterio using the argument masked=True:

PYTHON

import rioxarray
red = rioxarray.open_rasterio("data/sentinel2/red.tif", masked=True)
nir = rioxarray.open_rasterio("data/sentinel2/nir.tif", masked=True)

Let’s restrict our analysis to the island of Rhodes - we can extract the bounding box from the vector file written in an earlier episode (note that we need to match the CRS to the one of the raster files):

PYTHON

# determine bounding box of Rhodes, in the projected CRS
import geopandas
rhodes = geopandas.read_file('rhodes.gpkg')
rhodes_reprojected = rhodes.to_crs(red.rio.crs)
bbox = rhodes_reprojected.total_bounds

# crop the rasters
red_clip = red.rio.clip_box(*bbox)
nir_clip = nir.rio.clip_box(*bbox)

We can now plot the two rasters. Using robust=True color values are stretched between the 2nd and 98th percentiles of the data, which results in clearer distinctions between high and low reflectances:

PYTHON

red_clip.plot(robust=True)
red band image

PYTHON

nir_clip.plot(robust=True)
near infra-red band image

The burned area is immediately evident as a dark spot in the NIR wavelength, due to the lack of reflection from the vegetation in the scorched area.

Raster Math


We can perform raster calculations by subtracting (or adding, multiplying, etc.) two rasters. In the geospatial world, we call this “raster math”, and typically it refers to operations on rasters that have the same width and height (including nodata pixels). We can check the shapes of the two rasters in the following way:

PYTHON

print(red_clip.shape, nir_clip.shape)

OUTPUT

(1, 1131, 1207) (1, 1131, 1207)

The shapes of the two rasters match (and, not shown, the coordinates and the CRSs match too).

Let’s now compute the NDVI as a new raster using the formula presented above. We’ll use DataArray objects so that we can easily plot our result and keep track of the metadata.

PYTHON

ndvi = (nir_clip - red_clip)/ (nir_clip + red_clip)
print(ndvi)

OUTPUT

<xarray.DataArray (band: 1, y: 1131, x: 1207)>
array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]], dtype=float32)
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 5.615e+05 5.616e+05 ... 6.097e+05 6.098e+05
  * y            (y) float64 4.035e+06 4.035e+06 4.035e+06 ... 3.99e+06 3.99e+06
    spatial_ref  int64 0

We can now plot the output NDVI:

PYTHON

ndvi.plot()
NDVI map

Notice that the range of values for the output NDVI is between -1 and 1. Does this make sense for the selected region?

Maps are great, but it can also be informative to plot histograms of values to better understand the distribution. We can accomplish this using a built-in xarray method we have already been using: plot.hist()

PYTHON

ndvi.plot.hist()
NDVI histogram

Exercise: NDWI and custom index to detect burned areas

Calculate the other two indices required to compute the burned area classification mask, specifically:

\[ NDWI = \frac{green - NIR}{green + NIR} \]

  • A custom index derived from the 1600 nm and the 2200 nm short-wave infrared (SWIR) bands ( “swir16.tif” and “swir22.tif”, respectively):

\[ INDEX = \frac{SWIR_{16} - SWIR_{22}}{SWIR_{16} + SWIR_{22}}\]

What challenge do you foresee in combining the data from the two indices?

PYTHON

def get_band_and_clip(band_path, bbox):
    band = rioxarray.open_rasterio(band_path, masked=True)
    return band.rio.clip_box(*bbox)

PYTHON

data_path = 'data/sentinel2'
green_clip = get_band_and_clip(f'{data_path}/green.tif', bbox)
swir16_clip = get_band_and_clip(f'{data_path}/swir16.tif', bbox)
swir22_clip = get_band_and_clip(f'{data_path}/swir22.tif', bbox)

PYTHON

ndwi = (green_clip - nir_clip)/(green_clip + nir_clip)
index = (swir16_clip - swir22_clip)/(swir16_clip + swir22_clip)

PYTHON

ndwi.plot(robust=True)
NDWI index

PYTHON

index.plot(robust=True)
custom index

The challenge in combining the different indices is that the SWIR bands (and thus the derived custom index) have lower resolution:

PYTHON

ndvi.rio.resolution(), ndwi.rio.resolution(), index.rio.resolution()

OUTPUT

((10.0, -10.0), (10.0, -10.0), (20.0, -20.0))

In order to combine data from the computed indices, we use the reproject_match method, which reprojects, clips and match the resolution of a raster using another raster as a template. We use the index raster as a template, and match ndvi and ndwi to its resolution and extent:

PYTHON

ndvi_match = ndvi.rio.reproject_match(index)
ndwi_match = ndwi.rio.reproject_match(index)

Finally, we also fetch the blue band data and match this and the NIR band data to the template:

PYTHON

blue_clip = get_band_and_clip(f'{data_path}/blue.tif', bbox)
blue_match = blue_clip.rio.reproject_match(index)
nir_match = nir_clip.rio.reproject_match(index)

We can now go ahead and compute the binary classification mask for burned areas. Note that we need to convert the unit of the Sentinel-2 bands from digital numbers to reflectance (this is achieved by dividing by 10,000):

PYTHON

burned = (
    (ndvi_match <= 0.3) &
    (ndwi_match <= 0.1) &
    ((index + nir_match/10_000) <= 0.1) &
    ((blue_match/10_000) <= 0.1) &
    ((swir16_clip/10_000) >= 0.1)
)

The classification mask has a single element along the “band” axis, we can drop this dimension in the following way:

PYTHON

burned = burned.squeeze()

Let’s now fetch and visualize the true color image of Rhodes, after coloring the pixels identified as burned area in red:

PYTHON

visual = rioxarray.open_rasterio(f'{data_path}/visual.tif')
visual_clip = visual.rio.clip_box(*bbox)

PYTHON

# set red channel to max (255), green and blue channels to min (0).
visual_clip[0] = visual_clip[0].where(~burned, 255)
visual_clip[1:3] = visual_clip[1:3].where(~burned, 0)

PYTHON

visual_clip.plot.imshow()
RGB image with burned area in red

We can save the burned classification mask to disk after converting booleans to integers:

PYTHON

burned.rio.to_raster('burned.tif', dtype='int8')

Key Points

  • Python’s built-in math operators are fast and simple options for raster math.

Content from Calculating Zonal Statistics on Rasters


Last updated on 2024-06-25 | Edit this page

Estimated time: 40 minutes

Overview

Questions

  • How to compute raster statistics on different zones delineated by vector data?

Objectives

  • Extract zones from the vector dataset
  • Convert vector data to raster
  • Calculate raster statistics over zones

Introduction


Statistics on predefined zones of the raster data are commonly used for analysis and to better understand the data. These zones are often provided within a single vector dataset, identified by certain vector attributes. For example, in the previous episodes, we defined infrastructure regions and built-up regions on Rhodes Island as polygons. Each region can be respectively identified as a “zone”, resulting in two zones. One can evualuate the effect of the wild fire on the two zones by calculating the zonal statistics.

Data loading


We have created assets.gpkg in Episode “Vector data in Python”, which contains the infrastructure regions and built-up regions . We also calculated the burned index in Episode “Raster Calculations in Python” and saved it in burned.tif. Lets load them:

PYTHON

# Load burned index
import rioxarray
burned = rioxarray.open_rasterio('burned.tif')

# Load assests polygons
import geopandas as gpd
assets = gpd.read_file('assets.gpkg')

Align datasets


Before we continue, let’s check if the two datasets are in the same CRS:

PYTHON

print(assets.crs)
print(burned.rio.crs)

OUTPUT

EPSG:4326
EPSG:32635

The two datasets are in different CRS. Let’s reproject the assets to the same CRS as the burned index raster:

PYTHON

assets = assets.to_crs(burned.rio.crs)

Rasterize the vector data


One way to define the zones, is to create a grid space with the same extent and resolution as the burned index raster, and with the numerical values in the grid representing the type of infrastructure, i.e., the zones. This can be done by rasterize the vector data assets to the grid space of burned.

Let’s first take two elements from assets, the geometry column, and the code of the region.

PYTHON

geom = assets[['geometry', 'code']].values.tolist()
geom

OUTPUT

[[<POLYGON ((569708.927 3972332.358, 569709.096 3972333.79, 569710.406 3972341...>,
  1],
 [<MULTIPOLYGON (((567767.095 3970740.732, 567767.548 3970741.604, 567772.083 ...>,
  2]]

The raster image burned is a 3D image with a “band” dimension.

PYTHON

burned.shape

OUTPUT

(1, 1131, 1207)

To create the grid space, we only need the two spatial dimensions. We can used .squeeze() to drop the band dimension:

PYTHON

burned_squeeze = burned.squeeze()
burned_squeeze.shape

OUTPUT

(1131, 1207)

Now we can use features.rasterize from rasterio to rasterize the vector data assets to the grid space of burned:

PYTHON

from rasterio import features
assets_rasterized = features.rasterize(geom, out_shape=burned_squeeze.shape, transform=burned.rio.transform())
assets_rasterized

OUTPUT

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)

Perform zonal statistics


The rasterized zones assets_rasterized is a numpy array. The Python package xrspatial, which is the one we will use for zoning statistics, accepts xarray.DataArray. We need to first convert assets_rasterized. We can use burned_squeeze as a template:

PYTHON

assets_rasterized_xarr = burned_squeeze.copy()
assets_rasterized_xarr.data = assets_rasterized
assets_rasterized_xarr.plot()
Rasterized zones

Then we can calculate the zonal statistics using the zonal_stats function:

PYTHON

from xrspatial import zonal_stats
stats = zonal_stats(assets_rasterized_xarr, burned_squeeze)
stats

OUTPUT

    zone	  mean	    max	min	sum	    std	      var	      count
0	    0	    0.022570	1.0	0.0	28929.0	0.148528	0.022061	1281749.0
1	    1	    0.009607	1.0	0.0	568.0	  0.097542	0.009514	59125.0
2	    2	    0.000000	0.0	0.0	0.0	    0.000000	0.000000	24243.0

The results provide statistics for three zones: 1 represents infrastructure regions, 2 represents built-up regions, and 0 represents the rest of the area.

Key Points

  • Zones can be extracted by attribute columns of a vector dataset
  • Zones can be rasterized using rasterio.features.rasterize
  • Calculate zonal statistics with xrspatial.zonal_stats over the rasterized zones.

Content from Parallel raster computations using Dask


Last updated on 2024-06-25 | Edit this page

Estimated time: 55 minutes

Overview

Questions

  • How can I parallelize computations on rasters with Dask?
  • How can I determine if parallelization improves calculation speed?
  • What are good practices in applying parallelization to my raster calculations?

Objectives

  • Profile the timing of the raster calculations.
  • Open raster data as a chunked array.
  • Recognize good practices in selecting proper chunk sizes.
  • Setup raster calculations that take advantage of parallelization.

Introduction


Very often raster computations involve applying the same operation to different pieces of data. Think, for instance, to the “pixel”-wise sum of two raster datasets, where the same sum operation is applied to all the matching grid-cells of the two rasters. This class of tasks can benefit from chunking the input raster(s) into smaller pieces: operations on different pieces can be run in parallel using multiple computing units (e.g., multi-core CPUs), thus potentially speeding up calculations. In addition, working on chunked data can lead to smaller memory footprints, since one may bypass the need to store the full dataset in memory by processing it chunk by chunk.

In this episode, we will introduce the use of Dask in the context of raster calculations. Dask is a Python library for parallel and distributed computing. It provides a framework to work with different data structures, including chunked arrays (Dask Arrays). Dask is well integrated with (rio)xarray, which can use Dask arrays as underlying data structures.

Dask

This episode shows how Dask can be used to parallelize operations on local CPUs. However, the same library can be configured to run tasks on large compute clusters.

More resources on Dask:

It is important to realize, however, that many details determine the extent to which using Dask’s chunked arrays instead of regular Numpy arrays leads to faster calculations (and lower memory requirements). The actual operations to carry out, the size of the dataset, and parameters such as the chunks’ shape and size, all affects the performance of our computations. Depending on the specifics of the calculations, serial calculations might actually turn out to be faster! Being able to profile the computational time is thus essential, and we will see how to do that in a Jupyter environment in the next section.

Introduce the Data

We will use satellite images from the search that we have carried out in the episode: “Access satellite imagery using Python”. Briefly, we have searched for Sentinel-2 scenes of Rhodes from July 1st to August 31st 2023 that have less than 1% cloud coverage. The search resulted in 11 scenes. We focus here on the most recent scene (August 27th), since that would show the situation after the wildfire, and use this as an example to demonstrate parallel raster calculations.

For your convenience, we have included the scene of interest among the datasets that you have already downloaded when following the setup instructions. You should, however, be able to download the satellite images “on-the-fly” using the JSON metadata file that was created in the previous episode (the file rhodes_sentinel-2.json).

If you choose to work with the provided data (which is advised in case you are working offline or have a slow/unstable network connection) you can skip the remaining part of the block and continue with the following section: Dask-powered rasters.

If you want instead to experiment with downloading the data on-the-fly, you need to load the file rhodes_sentinel-2.json, which contains information on where and how to access the target satellite images from the remote repository:

PYTHON

import pystac
items = pystac.ItemCollection.from_file("rhodes_sentinel-2.json")

You can then select the first item in the collection, which is the most recent in the sequence:

PYTHON

item = items[0]
print(item)

OUTPUT

<Item id=S2A_35SNA_20230827_0_L2A>

In this episode we will consider the red and near infrared bands associated with this scene. We extract the URL / href (Hypertext Reference) that point to each of the raster files, and store these in variables that we can use later on instead of the raster data paths to access the data:

PYTHON

rhodes_red_href = item.assets["red"].href  # red band
rhodes_nir_href = item.assets["nir"].href  # near-infrared band

Dask-powered rasters


Chunked arrays

As we have mentioned, rioxarray supports the use of Dask’s chunked arrays as underlying data structure. When opening a raster file with open_rasterio and providing the chunks argument, Dask arrays are employed instead of regular Numpy arrays. chunks describes the shape of the blocks which the data will be split in. As an example, we open the red band raster using a chunk shape of (1, 4000, 4000) (block size of 1 in the first dimension and of 4000 in the second and third dimensions):

PYTHON

import rioxarray
red = rioxarray.open_rasterio("data/sentinel2/red.tif", chunks=(1, 4000, 4000))

Xarray and Dask also provide a graphical representation of the raster data array and of its blocked structure.

DataArray with Dask
Xarray Dask-backed DataArray

Exercise: Chunk sizes matter

We have already seen how COGs are regular GeoTIFF files with a special internal structure. Another feature of COGs is that data is organized in “blocks” that can be accessed remotely via independent HTTP requests, enabling partial file readings. This is useful if you want to access only a portion of your raster file, but it also allows for efficient parallel reading. You can check the blocksize employed in a COG file with the following code snippet:

PYTHON

import rasterio
with rasterio.open("/path/or/URL/to/file.tif") as r:
    if r.is_tiled:
        print(f"Chunk size: {r.block_shapes}")

In order to optimally access COGs it is best to align the blocksize of the file with the chunks enployed when loading the file. Which other elements do you think should be considered when choosing the chunk size? What do you think are suitable chunk sizes for the red band raster?

PYTHON

import rasterio
with rasterio.open("data/sentinel2/red.tif") as r:
    if r.is_tiled:
        print(f"Chunk size: {r.block_shapes}")

OUTPUT

Chunk size: [(1024, 1024)]

Ideal chunk size values for this raster are multiples of 1024. An element to consider is the number of resulting chunks and their size. While the optimal chunk size strongly depends on the specific application, chunks should in general not be too big nor too small (i.e. too many). As a rule of thumb, chunk sizes of 100 MB typically work well with Dask (see, e.g., this blog post). Also, the shape might be relevant, depending on the application! Here, we might select a chunks shape of (1, 6144, 6144)::

PYTHON

red = rioxarray.open_rasterio("data/sentinel2/red.tif", chunks=(1, 6144, 6144))

which leads to chunks 72 MB large: ((1 x 6144 x 6144) x 2 bytes / 2^20 = 72 MB). Also, we can let rioxarray and Dask figure out appropriate chunk shapes by setting chunks="auto":

PYTHON

red = rioxarray.open_rasterio("data/sentinel2/red.tif", chunks="auto")

which leads to (1, 8192, 8192) chunks (128 MB).

Parallel computations

Operations performed on a DataArray that has been opened as a chunked Dask array are executed using Dask. Dask coordinates how the operations should be executed on the individual chunks of data, and runs these tasks in parallel as much as possible.

Let us set up an example where we calculate the NDVI for a full Sentinel-2 tile, and try to estimate the performance gain by running the calculation in parallel on a multi-core CPU.

To run the calculation serially, we open the relevant raster bands, as we have learned in the previous episodes:

PYTHON

red = rioxarray.open_rasterio('data/sentinel2/red.tif', masked=True)
nir = rioxarray.open_rasterio('data/sentinel2/nir.tif', masked=True)

We then compute the NDVI. Note the Jupyter magic %%time, which returns the time required to run the content of a cell (note that commands starting with %% needs to be on the first line of the cell!):

PYTHON

%%time
ndvi = (nir - red)/(nir + red)

OUTPUT

CPU times: user 4.99 s, sys: 3.44 s, total: 8.43 s
Wall time: 8.53 s

We note down the calculation’s “Wall time” (actual time to perform the task).

Now we run the same task in parallel using Dask. To do so, we open the relevant rasters as chunked arrays:

PYTHON

red_dask = rioxarray.open_rasterio('data/sentinel2/red.tif', masked=True, lock=False, chunks=(1, 6144, 6144))
nir_dask = rioxarray.open_rasterio('data/sentinel2/nir.tif', masked=True, lock=False, chunks=(1, 6144, 6144))

Setting lock=False tells rioxarray that the individual data chunks can be loaded simultaneously from the source by the Dask workers.

We now continue to the actual calculation: Note how the same syntax as for its serial version is employed for computing the NDVI. Don’t forget to add the Jupyter magic to record the timing!

PYTHON

%%time
ndvi_dask = (nir_dask - red_dask)/(nir_dask + red_dask)

OUTPUT

CPU times: user 7.71 ms, sys: 1.71 ms, total: 9.42 ms
Wall time: 8.61 ms

Did we just observe a 1000x speed-up when comparing to the serial calculation (~8 s vs ~8 ms)? Actually, no calculation has run yet. This is because operations performed on Dask arrays are executed “lazily”, i.e. they are not immediately run.

Dask graph

The sequence of operations to carry out is stored in a task graph, which can be visualized with:

PYTHON

import dask
dask.visualize(ndvi_dask)
dask graph
Dask graph

The task graph gives Dask the complete “overview” of the calculation, thus enabling a better management of tasks and resources when dispatching calculations to be run in parallel.

Most methods of DataArray’s run operations lazily when Dask arrays are employed. In order to trigger calculations, we can use either .persist() or .compute(). The former keeps data in the form of chunked Dask arrays, and it should thus be used to run intermediate steps that will be followed by additional calculations. The latter merges instead the chunks in a single Numpy array, and it should be used at the very end of a sequence of calculations. Both methods accept the same parameters. Here, we explicitly tell Dask to parallelize the required workload over 4 threads. Let’s again time the cell execution:

PYTHON

%%time
ndvi_dask = ndvi_dask.persist(scheduler="threads", num_workers=4)

OUTPUT

CPU times: user 4.18 s, sys: 2.19 s, total: 6.37 s
Wall time: 2.32 s

The timing that we have recorded makes much more sense now. When running the task on a 4-core CPU laptop, we observe a x3.6 speed-up when comparing to the analogous serial calculation (8.53 s vs 2.32 s).

Once again, we stress that one does not always obtain similar performance gains by exploiting the Dask-based parallelization. Even if the algorithm employed is well suited for parallelization, Dask introduces some overhead time to manage the tasks in the Dask graph. This overhead, which is typically of the order of few milliseconds per task, can be larger than the parallelization gain. This is the typical situation with calculations with many small chunks.

Finally, let’s have a look at how Dask can be used to save raster files. When calling .to_raster(), we provide the as additional argument lock=threading.Lock(). This is because the threads which are splitting the workload must “synchronise” when writing to the same file (they might otherwise overwrite each other’s output).

PYTHON

from threading import Lock
ndvi_dask.rio.to_raster('ndvi.tif', driver='COG', lock=Lock())

Note that .to_raster() is among the methods that trigger immediate calculations (one can change this behaviour by specifying compute=False).

Key Points

  • The %%time Jupyter magic command can be used to profile calculations.
  • Data ‘chunks’ are the unit of parallelization in raster calculations.
  • (rio)xarray can open raster files as chunked arrays.
  • The chunk shape and size can significantly affect the calculation performance.
  • Cloud-optimized GeoTIFFs have an internal structure that enables performant parallel read.

Content from Data cubes with ODC-STAC


Last updated on 2024-06-25 | Edit this page

Estimated time: 45 minutes

Overview

Questions

  • a

Objectives

  • a

Introduction


In the previous episodes we worked with satellite images with a fixed boundary on how they have been collected, however in many cases you would want to have an image that covers your area of interest which often does not align with boundaries of the collected images. If the phenomena you are interested in covers two images you could manually mosaic them, but sometimes you are interested in multiple images that overlapping.

ODC-STAC offers functionality that allows you to get a mosaic-ed image based on the a bounding box or a polygon containing the area of interest. In this lesson we show how odc-stac can be employed to re-tile and stack satellite images in what are sometimes referred to as “data cubes”.

Create a data cube with ODC-STAC


As you might have noticed in the previous episodes, the satellite images we have used until now do actually not cover the whole island of Rhodes. They miss the southern part of the island. Using ODC-STAC we can obtain an image for the whole island. We use the administrative boundary of Rhodes to define our area of interest (AoI). This way we are sure to have the whole island.

More important, using ODC-STAC you can also load multiple images (lazy) into one datacube allowing you to perform all kind of interesting analyses as will be demonstrated below.

But first we need to upload the geometry of Rhodes. To do so we use geopandas and load the geometry we previously stored in a geopackage.

PYTHON

import geopandas
rhodes = geopandas.read_file('rhodes.gpkg')
bbox = rhodes.total_bounds

Next, we search for satellite images that cover our AoI (i.e. Rhodes) in the Sentinel-2 L2A collection that is indexed in the Earth Search STAC API. Since we are interested the period right before and after the wild fire we include as dates the 1st of July until the 31st of August 2023 :

PYTHON

import pystac_client

api_url = "https://earth-search.aws.element84.com/v1"
collection_id = "sentinel-2-c1-l2a"

client = pystac_client.Client.open(api_url)
search = client.search(
    collections=[collection_id],
    datetime="2023-07-01/2023-08-31",
    bbox=bbox
)

item_collection = search.item_collection()

odc-stac can ingest directly our search results and create a Xarray DataSet object from the STAC metadata that are present in the item_collection. By specifying groupby='solar_day', odc-stac automatically groups and merges images corresponding to the same date of acquisition. chunks={...} sets up the resulting data cube using Dask arrays, thus enabling lazy loading (and further operations). use_overviews=True tells odc-stac to direcly load lower-resolution versions of the images from the overviews, if these are available in Cloud Optimized Geotiffs (COGs). We set the resolution of the data cube using the resolution argument, and define the AoI using the bounding box (bbox). We decided to set the resolution to 20 in order to limit the size of the images a bit.

PYTHON

import odc.stac
ds = odc.stac.load(
    item_collection,
    groupby='solar_day',
    chunks={'x': 2048, 'y': 2048},
    use_overviews=True,
    resolution=20,
    bbox=rhodes.total_bounds,
)

odc-stac builds a data cube representation from all the relevant files linked in item_collection as a Xarray DataSet. Let us have a look at it:

PYTHON

print(ds)

OUTPUT

<xarray.Dataset> Size: 7GB
Dimensions:      (y: 3255, x: 2567, time: 25)
Coordinates:
  * y            (y) float64 26kB 4.035e+06 4.035e+06 ... 3.97e+06 3.97e+06
  * x            (x) float64 21kB 5.613e+05 5.613e+05 ... 6.126e+05 6.126e+05
    spatial_ref  int32 4B 32635
  * time         (time) datetime64[ns] 200B 2023-07-01T09:10:15.805000 ... 20...
Data variables: (12/18)
    red          (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    green        (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    blue         (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    visual       (time, y, x) float32 836MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    nir          (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    swir22       (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    ...           ...
    scl          (time, y, x) uint8 209MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    aot          (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    coastal      (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    nir09        (time, y, x) uint16 418MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    cloud        (time, y, x) uint8 209MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>
    snow         (time, y, x) uint8 209MB dask.array<chunksize=(1, 2048, 2048), meta=np.ndarray>

Working with the data cube


Like we did in the previous episode, let us calculate the NDVI for our study area. To do so we need to focus on the variables: the red band (red), the near infrared band (nir) and the scene classification map (scl). We will use the former two to calculated the NDVI for the AoI. The latter, we use as a classification mask provided together with Sentinel-2 L2A products. In this mask, each pixel is classified according to a set of labels (see Figure 3 in classification mask documentation ).

First we define the bands that we are interested in:

PYTHON

red = ds['red']
nir = ds['nir']
scl = ds['scl']

Next we will use the mask to drop pixels that are labeled as clouds and water. For this we use the classification map to mask out pixels recognized by the Sentinel-2 processing algorithm as cloud or water:

PYTHON

# generate mask ("True" for pixel being cloud or water)
mask = scl.isin([
    3,  # CLOUD_SHADOWS
    6,  # WATER
    8,  # CLOUD_MEDIUM_PROBABILITY
    9,  # CLOUD_HIGH_PROBABILITY
    10  # THIN_CIRRUS
])
red_masked = red.where(~mask)
nir_masked = nir.where(~mask)

Then, we calculate the NDVI:

PYTHON

ndvi = (nir_masked - red_masked) / (nir_masked + red_masked)

We can visualize the calculated NDVI for the AoI at two given dates (before and after the wildfires) by selecting the date:

PYTHON

ndvi_before = ndvi.sel(time="2023-07-13")
ndvi_before.plot()
NDVI before the wildfire
NDVI before the wildfire

PYTHON

ndvi_after = ndvi.sel(time="2023-08-27")
ndvi_after.plot()
NDVI after the wildfire
NDVI after the wildfire

Another feature of having the data available in a datacube is that you can for instance questions multiple layers. If you want for instance to see how the NDVI changed over time for a specific point you can do the following. Let us first define a point in the region where we know it was affected by the wildfile. To check that it is after the fire we plot it:

PYTHON

x = 585_000
y = 3_995_000

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ndvi_after.plot(ax=ax)
ax.scatter(x, y, marker="o", c="k")
NDVI plot with selected point
NDVI plot with selected point

Now let us extract the NDVI values computed at that point for the full time series:

PYTHON

ndvi_xy = ndvi.sel(x=x, y=y, method="nearest")
print(ndvi_xy)

OUTPUT

<xarray.DataArray (time: 25)> Size: 100B
dask.array<getitem, shape=(25,), dtype=float32, chunksize=(1,), chunktype=numpy.ndarray>
Coordinates:
    y            float64 8B 3.995e+06
    x            float64 8B 5.85e+05
    spatial_ref  int32 4B 32635
  * time         (time) datetime64[ns] 200B 2023-07-01T09:10:15.805000 ... 20...

We now trigger computation. Note that we run this in parallel (probably not much of an effect here, but definitely helpful for larger calculations):

PYTHON

%%time
ndvi_xy = ndvi_xy.compute(scheduler="threads", num_workers=4)

OUTPUT

CPU times: user 15.6 s, sys: 4.9 s, total: 20.5 s
Wall time: 1min 8s

The result is a time series representing the NDVI value computed for the selected point for all the available scenes in the time range. We drop the NaN values, and plot the final result:

PYTHON

ndvi_xy.dropna(dim="time").plot()
NDVI time series
NDVI time series