
Tutorial 1: Creating DataArrays and Datasets to Assess Global Climate Data#

Week 1, Day 1, Climate System Overview

Content creators: Sloane Garelick, Julia Kent

Content reviewers: Yosmely Bermúdez, Katrina Dobson, Younkap Nina Duplex, Danika Gupta, Maria Gonzalez, Will Gregory, Nahid Hasan, Paul Heubel, Sherry Mi, Beatriz Cosenza Muralles, Jenna Pearson, Chi Zhang, Ohad Zivan

Content editors: Paul Heubel, Jenna Pearson, Chi Zhang, Ohad Zivan

Production editors: Wesley Banfield, Paul Heubel, Jenna Pearson, Konstantine Tsafatinos, Chi Zhang, Ohad Zivan

Our 2024 Sponsors: NFDI4Earth and CMIP


Pythia credit: Rose, B. E. J., Kent, J., Tyle, K., Clyne, J., Banihirwe, A., Camron, D., May, R., Grover, M., Ford, R. R., Paul, K., Morley, J., Eroglu, O., Kailyn, L., & Zacharias, A. (2023). Pythia Foundations (Version v2023.05.01) https://zenodo.org/record/8065851


Tutorial Objectives#

Estimated timing of tutorial: 20 minutes

As you just learned in the Introduction to Climate video, variations in global climate involve various forcings, feedbacks, and interactions between multiple processes and systems. Because of this complexity, global climate datasets are often very large with multiple dimensions and variables.

One useful computational tool for organizing, analyzing, and interpreting large global datasets is Xarray, an open-source project and Python package that makes working with labeled multi-dimensional arrays simple and efficient.

In this first tutorial, we will use Xarray’s DataArray and Dataset objects, which represent and manipulate spatial data, to practice organizing large global climate datasets and to understand variations in Earth’s climate system.

Setup#

Just as numpy is commonly imported as np and pandas as pd, you will often see xarray imported under the shortened name xr.

# imports
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt

Install and import feedback gadget#

# @title Install and import feedback gadget

!pip3 install vibecheck datatops --quiet

from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "comptools_4clim",
            "user_key": "l5jpxuee",
        },
    ).render()


feedback_prefix = "W1D1_T1"

Figure Settings#

# @title Figure Settings
import ipywidgets as widgets  # interactive display

%config InlineBackend.figure_format = 'retina'
plt.style.use(
    "https://raw.githubusercontent.com/neuromatch/climate-course-content/main/cma.mplstyle"
)

Video 1: Introduction to Climate#

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_Intro_to_Climate_Video")
If you want to download the slides: https://osf.io/download/4suf5/

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_Intro_to_Climate_Slides")

Introducing the DataArray and Dataset#

Xarray expands on the capabilities of NumPy arrays, providing many streamlined tools for data manipulation. In that respect it is similar to Pandas, but whereas Pandas excels at working with tabular data, Xarray is focused on N-dimensional arrays of data (i.e. grids). Its interface is based largely on the netCDF data model (variables, attributes, and dimensions), but it goes beyond the traditional netCDF interfaces to provide functionality similar to netCDF-java’s Common Data Model (CDM).

Section 1: Creation of a DataArray Object#

The DataArray is one of the basic building blocks of Xarray (see docs here). It provides a numpy.ndarray-like object that expands to provide two critical pieces of functionality:

  1. Coordinate names and values are stored with the data, making slicing and indexing much more powerful

  2. It has a built-in container for attributes

Here we’ll initialize a DataArray object by wrapping a plain NumPy array, and explore a few of its properties.

Section 1.1: Generate a Random Numpy Array#

For our first example, we’ll just create a random array of “temperature” data in units of Kelvin:

rand_data = 283 + 5 * np.random.randn(5, 3, 4)
rand_data
array([[[273.95339551, 284.6268224 , 281.74269872, 288.05417371],
        [274.99627883, 284.94917827, 278.79821952, 282.2044037 ],
        [280.28023518, 280.1482903 , 281.55700416, 291.28190087]],

       [[285.59856835, 278.89210919, 282.99427187, 285.06414043],
        [287.61689761, 293.97696305, 293.37257413, 286.44269925],
        [283.61196201, 284.27400146, 276.23271007, 296.06134392]],

       [[282.52726267, 280.00078967, 289.44139957, 284.82476282],
        [282.13356916, 280.585562  , 279.86618307, 278.43750857],
        [282.56212811, 289.51164003, 286.65860281, 281.19414213]],

       [[285.90135433, 284.27057757, 279.19969835, 288.94563814],
        [291.24475511, 281.64122933, 277.12763721, 279.52753637],
        [285.22180334, 278.45388608, 287.19798773, 287.12908447]],

       [[287.03066552, 280.64791975, 276.15263507, 292.06662196],
        [277.18493476, 277.35471965, 288.54870198, 287.26445002],
        [282.81033218, 283.97510043, 286.95976875, 284.43395494]]])

Section 1.2: Wrap the Array: First Attempt#

Now we create a basic DataArray just by passing our plain data as an input:

temperature = xr.DataArray(rand_data)
temperature
<xarray.DataArray (dim_0: 5, dim_1: 3, dim_2: 4)> Size: 480B
array([[[273.95339551, 284.6268224 , 281.74269872, 288.05417371],
        [274.99627883, 284.94917827, 278.79821952, 282.2044037 ],
        [280.28023518, 280.1482903 , 281.55700416, 291.28190087]],

       [[285.59856835, 278.89210919, 282.99427187, 285.06414043],
        [287.61689761, 293.97696305, 293.37257413, 286.44269925],
        [283.61196201, 284.27400146, 276.23271007, 296.06134392]],

       [[282.52726267, 280.00078967, 289.44139957, 284.82476282],
        [282.13356916, 280.585562  , 279.86618307, 278.43750857],
        [282.56212811, 289.51164003, 286.65860281, 281.19414213]],

       [[285.90135433, 284.27057757, 279.19969835, 288.94563814],
        [291.24475511, 281.64122933, 277.12763721, 279.52753637],
        [285.22180334, 278.45388608, 287.19798773, 287.12908447]],

       [[287.03066552, 280.64791975, 276.15263507, 292.06662196],
        [277.18493476, 277.35471965, 288.54870198, 287.26445002],
        [282.81033218, 283.97510043, 286.95976875, 284.43395494]]])
Dimensions without coordinates: dim_0, dim_1, dim_2

Note two things:

  1. Xarray generates some basic dimension names for us (dim_0, dim_1, dim_2). We’ll improve this with better names in the next example.

  2. Wrapping the numpy array in a DataArray gives us a rich display in the notebook! (Try clicking the array symbol to expand or collapse the view)
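To explore a few properties of such a wrapped array, a minimal sketch might look like the following. The names rng, demo_data, and demo, as well as the fixed seed, are assumptions made for this illustration and are not part of the tutorial's own cells.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for rand_data; the seed is an assumption so the
# example is reproducible.
rng = np.random.default_rng(0)
demo_data = 283 + 5 * rng.standard_normal((5, 3, 4))

# Wrap the plain NumPy array without giving any dimension names
demo = xr.DataArray(demo_data)

print(demo.dims)          # the default dimension names generated by Xarray
print(demo.shape)         # same shape as the wrapped NumPy array
print(type(demo.values))  # the underlying data is still a NumPy array
```

The .values property returns the wrapped NumPy array, so existing NumPy-based code can still be applied to the data.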

Section 1.3: Assign Dimension Names#

Much of the power of Xarray comes from making use of named dimensions. So let’s add some more useful names! We can do that by passing an ordered list of names using the keyword argument dims:

temperature = xr.DataArray(rand_data, dims=["time", "lat", "lon"])
temperature
<xarray.DataArray (time: 5, lat: 3, lon: 4)> Size: 480B
array([[[273.95339551, 284.6268224 , 281.74269872, 288.05417371],
        [274.99627883, 284.94917827, 278.79821952, 282.2044037 ],
        [280.28023518, 280.1482903 , 281.55700416, 291.28190087]],

       [[285.59856835, 278.89210919, 282.99427187, 285.06414043],
        [287.61689761, 293.97696305, 293.37257413, 286.44269925],
        [283.61196201, 284.27400146, 276.23271007, 296.06134392]],

       [[282.52726267, 280.00078967, 289.44139957, 284.82476282],
        [282.13356916, 280.585562  , 279.86618307, 278.43750857],
        [282.56212811, 289.51164003, 286.65860281, 281.19414213]],

       [[285.90135433, 284.27057757, 279.19969835, 288.94563814],
        [291.24475511, 281.64122933, 277.12763721, 279.52753637],
        [285.22180334, 278.45388608, 287.19798773, 287.12908447]],

       [[287.03066552, 280.64791975, 276.15263507, 292.06662196],
        [277.18493476, 277.35471965, 288.54870198, 287.26445002],
        [282.81033218, 283.97510043, 286.95976875, 284.43395494]]])
Dimensions without coordinates: time, lat, lon

This is already an improvement over a NumPy array because we have names for each of the dimensions (or axes). Even better, we can associate arrays representing the values for the coordinates for each of these dimensions with the data when we create the DataArray. We’ll see this in the next example.
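One practical benefit of named dimensions is that reductions can refer to a dimension by name rather than by axis position. The sketch below illustrates this with a hypothetical array demo (the names and the seed are assumptions for illustration only).

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for rand_data with a fixed seed (an assumption)
rng = np.random.default_rng(0)
demo_data = 283 + 5 * rng.standard_normal((5, 3, 4))
demo = xr.DataArray(demo_data, dims=["time", "lat", "lon"])

# Average over the time dimension by name ...
time_mean = demo.mean(dim="time")

# ... which corresponds to the NumPy call that needs the axis position
numpy_time_mean = demo_data.mean(axis=0)

print(time_mean.dims)  # the remaining dimensions after averaging over time
```

With the named version, the code stays correct even if the order of the dimensions changes.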

Section 2: Create a DataArray with Named Coordinates#

Section 2.1: Make Time and Space Coordinates#

Here we will use Pandas to create an array of datetime data, which we will then use to create a DataArray with a named coordinate time.

times_index = pd.date_range("2018-01-01", periods=5)
times_index
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05'],
              dtype='datetime64[ns]', freq='D')

We’ll also create arrays to represent sample longitude and latitude:

lons = np.linspace(-120, -60, 4)
lats = np.linspace(25, 55, 3)

Section 2.1.1: Initialize the DataArray with Complete Coordinate Info#

When we create the DataArray instance, we pass in the arrays we just created:

temperature = xr.DataArray(
    rand_data, coords=[times_index, lats, lons], dims=["time", "lat", "lon"]
)
temperature
<xarray.DataArray (time: 5, lat: 3, lon: 4)> Size: 480B
array([[[273.95339551, 284.6268224 , 281.74269872, 288.05417371],
        [274.99627883, 284.94917827, 278.79821952, 282.2044037 ],
        [280.28023518, 280.1482903 , 281.55700416, 291.28190087]],

       [[285.59856835, 278.89210919, 282.99427187, 285.06414043],
        [287.61689761, 293.97696305, 293.37257413, 286.44269925],
        [283.61196201, 284.27400146, 276.23271007, 296.06134392]],

       [[282.52726267, 280.00078967, 289.44139957, 284.82476282],
        [282.13356916, 280.585562  , 279.86618307, 278.43750857],
        [282.56212811, 289.51164003, 286.65860281, 281.19414213]],

       [[285.90135433, 284.27057757, 279.19969835, 288.94563814],
        [291.24475511, 281.64122933, 277.12763721, 279.52753637],
        [285.22180334, 278.45388608, 287.19798773, 287.12908447]],

       [[287.03066552, 280.64791975, 276.15263507, 292.06662196],
        [277.18493476, 277.35471965, 288.54870198, 287.26445002],
        [282.81033218, 283.97510043, 286.95976875, 284.43395494]]])
Coordinates:
  * time     (time) datetime64[ns] 40B 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 24B 25.0 40.0 55.0
  * lon      (lon) float64 32B -120.0 -100.0 -80.0 -60.0
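With coordinate values attached, data can be selected by label instead of by integer position. The following sketch shows one way this might look using .sel and .isel. The array demo, its seed, and the chosen query points are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical array built the same way as temperature (seed is an assumption)
rng = np.random.default_rng(0)
times = pd.date_range("2018-01-01", periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
demo = xr.DataArray(
    283 + 5 * rng.standard_normal((5, 3, 4)),
    coords=[times, lats, lons],
    dims=["time", "lat", "lon"],
)

# Select a single day by its label; the time dimension is dropped
one_day = demo.sel(time="2018-01-03")

# Select the grid point closest to a location that is not on the grid
point = demo.sel(lat=41.0, lon=-95.0, method="nearest")

# Positional selection is still available through isel
first_day = demo.isel(time=0)

print(one_day.dims)
print(float(point.lat), float(point.lon))
```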

Section 2.1.2: Set Useful Attributes#

We can also set some attribute metadata, which will help provide clear descriptions of the data. In this case, we can specify that we’re looking at ‘air_temperature’ data and the units are ‘Kelvin’.

temperature.attrs["units"] = "Kelvin"
temperature.attrs["standard_name"] = "air_temperature"

temperature
<xarray.DataArray (time: 5, lat: 3, lon: 4)> Size: 480B
array([[[273.95339551, 284.6268224 , 281.74269872, 288.05417371],
        [274.99627883, 284.94917827, 278.79821952, 282.2044037 ],
        [280.28023518, 280.1482903 , 281.55700416, 291.28190087]],

       [[285.59856835, 278.89210919, 282.99427187, 285.06414043],
        [287.61689761, 293.97696305, 293.37257413, 286.44269925],
        [283.61196201, 284.27400146, 276.23271007, 296.06134392]],

       [[282.52726267, 280.00078967, 289.44139957, 284.82476282],
        [282.13356916, 280.585562  , 279.86618307, 278.43750857],
        [282.56212811, 289.51164003, 286.65860281, 281.19414213]],

       [[285.90135433, 284.27057757, 279.19969835, 288.94563814],
        [291.24475511, 281.64122933, 277.12763721, 279.52753637],
        [285.22180334, 278.45388608, 287.19798773, 287.12908447]],

       [[287.03066552, 280.64791975, 276.15263507, 292.06662196],
        [277.18493476, 277.35471965, 288.54870198, 287.26445002],
        [282.81033218, 283.97510043, 286.95976875, 284.43395494]]])
Coordinates:
  * time     (time) datetime64[ns] 40B 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 24B 25.0 40.0 55.0
  * lon      (lon) float64 32B -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          Kelvin
    standard_name:  air_temperature

Section 2.1.3: Attributes Are Not Preserved by Default!#

Notice what happens if we perform a mathematical operation with the DataArray: the coordinate values persist, but the attributes are lost. This is done because it is very challenging to know if the attribute metadata is still correct or appropriate after arbitrary arithmetic operations.

To illustrate this, we’ll do a simple unit conversion from Kelvin to Celsius:

temperature_in_celsius = temperature - 273.15
temperature_in_celsius
<xarray.DataArray (time: 5, lat: 3, lon: 4)> Size: 480B
array([[[ 0.80339551, 11.4768224 ,  8.59269872, 14.90417371],
        [ 1.84627883, 11.79917827,  5.64821952,  9.0544037 ],
        [ 7.13023518,  6.9982903 ,  8.40700416, 18.13190087]],

       [[12.44856835,  5.74210919,  9.84427187, 11.91414043],
        [14.46689761, 20.82696305, 20.22257413, 13.29269925],
        [10.46196201, 11.12400146,  3.08271007, 22.91134392]],

       [[ 9.37726267,  6.85078967, 16.29139957, 11.67476282],
        [ 8.98356916,  7.435562  ,  6.71618307,  5.28750857],
        [ 9.41212811, 16.36164003, 13.50860281,  8.04414213]],

       [[12.75135433, 11.12057757,  6.04969835, 15.79563814],
        [18.09475511,  8.49122933,  3.97763721,  6.37753637],
        [12.07180334,  5.30388608, 14.04798773, 13.97908447]],

       [[13.88066552,  7.49791975,  3.00263507, 18.91662196],
        [ 4.03493476,  4.20471965, 15.39870198, 14.11445002],
        [ 9.66033218, 10.82510043, 13.80976875, 11.28395494]]])
Coordinates:
  * time     (time) datetime64[ns] 40B 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 24B 25.0 40.0 55.0
  * lon      (lon) float64 32B -120.0 -100.0 -80.0 -60.0

We usually want to keep metadata with our dataset, even after manipulating the data. For example, it can tell us the units of a variable of interest. So when you perform operations on your data, check that all the information you want is carried over. If it isn’t, you can add it back by following the steps in Section 2.1.2. For an in-depth discussion of how Xarray handles metadata, you can find more information in the Xarray documentation here.
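As a rough sketch of two ways to handle metadata after an operation, the example below either re-attaches corrected attributes with assign_attrs or asks Xarray to carry the attributes through with xr.set_options(keep_attrs=True). The small array demo and the unit string "degC" are assumptions chosen for this illustration.

```python
import numpy as np
import xarray as xr

# Small hypothetical temperature series in Kelvin (values are assumptions)
demo = xr.DataArray(
    np.array([273.15, 283.15, 293.15]),
    dims=["time"],
    attrs={"units": "Kelvin", "standard_name": "air_temperature"},
)

# Option 1: re-attach metadata explicitly, with the units corrected
celsius = (demo - 273.15).assign_attrs(
    units="degC", standard_name="air_temperature"
)

# Option 2: carry the original attributes through the operation
with xr.set_options(keep_attrs=True):
    carried = demo - 273.15

print(celsius.attrs["units"])
print(carried.attrs["units"])  # still "Kelvin", which is now wrong
```

The second option shows why attributes deserve a check after arithmetic: the values are now in Celsius, but the carried-over units attribute still says Kelvin.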

Section 3: The Dataset: a Container for DataArrays with Shared Coordinates#

Along with DataArray, the other key object type in Xarray is the Dataset, which is a dictionary-like container that holds one or more DataArrays, which can also optionally share coordinates (see docs here).

The most common way to create a Dataset object is to load data from a file (which we will practice in a later tutorial). Here, instead, we will create another DataArray and combine it with our temperature data.

This will illustrate how the information about common coordinate axes is used.

Section 3.1: Create a Pressure DataArray Using the Same Coordinates#

For our next DataArray example, we’ll create a random array of pressure data in units of hectopascal (hPa). This code mirrors how we created the temperature object above.

pressure_data = 1000.0 + 5 * np.random.randn(5, 3, 4)
pressure = xr.DataArray(
    pressure_data, coords=[times_index, lats, lons], dims=["time", "lat", "lon"]
)
pressure.attrs["units"] = "hPa"
pressure.attrs["standard_name"] = "air_pressure"

pressure
<xarray.DataArray (time: 5, lat: 3, lon: 4)> Size: 480B
array([[[ 998.65215526, 1000.37676813,  998.87493028, 1006.40790679],
        [ 991.85169879,  992.6599739 ,  997.68192663,  998.66982344],
        [1001.42969169,  995.30757558, 1002.76970817, 1002.33510028]],

       [[1002.70402781, 1004.4804087 ,  995.62147015, 1002.22738496],
        [1001.20581091,  999.48336639,  997.92665652, 1009.47901947],
        [1004.53680946, 1002.06884899, 1003.89421437,  988.54842017]],

       [[1005.54705467, 1001.41279246, 1010.81196967,  999.99064121],
        [1002.28448898, 1007.06778495, 1003.0859697 ,  992.60417697],
        [ 994.06430507, 1001.57810188,  999.09961031, 1005.65653522]],

       [[1002.91248314, 1000.65354538,  998.26454313,  994.57790549],
        [ 995.6753857 ,  999.4445824 ,  994.76902799, 1007.93786387],
        [1001.33220658, 1005.52046376,  990.19411255, 1005.20158207]],

       [[1003.63647625,  997.67147935, 1010.92041105, 1005.96883314],
        [ 998.73434611, 1006.6388429 ,  998.57446581,  995.90151923],
        [ 997.77289013, 1002.39281066, 1001.82318955, 1001.96416805]]])
Coordinates:
  * time     (time) datetime64[ns] 40B 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 24B 25.0 40.0 55.0
  * lon      (lon) float64 32B -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          hPa
    standard_name:  air_pressure

Section 3.2: Create a Dataset Object#

Each DataArray in our Dataset needs a name!

The most straightforward way to create a Dataset with our temperature and pressure arrays is to pass a dictionary using the keyword argument data_vars:

ds = xr.Dataset(data_vars={"Temperature": temperature, "Pressure": pressure})
ds
<xarray.Dataset> Size: 1kB
Dimensions:      (time: 5, lat: 3, lon: 4)
Coordinates:
  * time         (time) datetime64[ns] 40B 2018-01-01 2018-01-02 ... 2018-01-05
  * lat          (lat) float64 24B 25.0 40.0 55.0
  * lon          (lon) float64 32B -120.0 -100.0 -80.0 -60.0
Data variables:
    Temperature  (time, lat, lon) float64 480B 274.0 284.6 281.7 ... 287.0 284.4
    Pressure     (time, lat, lon) float64 480B 998.7 1e+03 ... 1.002e+03

Notice that the Dataset object ds is aware that both data arrays sit on the same coordinate axes.
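Because the variables share coordinates, one operation on the Dataset can be applied to every variable at once. The sketch below assumes a hypothetical dataset ds_demo built like ds, with a fixed seed chosen only for reproducibility.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical dataset mirroring ds (seed and names are assumptions)
rng = np.random.default_rng(0)
times = pd.date_range("2018-01-01", periods=5)
lats = np.linspace(25, 55, 3)
lons = np.linspace(-120, -60, 4)
dims = ["time", "lat", "lon"]

temp = xr.DataArray(
    283 + 5 * rng.standard_normal((5, 3, 4)),
    coords=[times, lats, lons],
    dims=dims,
)
pres = xr.DataArray(
    1000.0 + 5 * rng.standard_normal((5, 3, 4)),
    coords=[times, lats, lons],
    dims=dims,
)
ds_demo = xr.Dataset(data_vars={"Temperature": temp, "Pressure": pres})

# Select one day for both variables in a single call
day = ds_demo.sel(time="2018-01-02")

# Average both variables over time in a single call
time_mean = ds_demo.mean(dim="time")

print(day["Temperature"].dims)
print(time_mean["Pressure"].shape)
```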

Section 3.3: Access Data Variables and Coordinates in a Dataset#

We can pull out any of the individual DataArray objects in a few different ways.

Using the “dot” notation:

ds.Pressure
<xarray.DataArray 'Pressure' (time: 5, lat: 3, lon: 4)> Size: 480B
array([[[ 998.65215526, 1000.37676813,  998.87493028, 1006.40790679],
        [ 991.85169879,  992.6599739 ,  997.68192663,  998.66982344],
        [1001.42969169,  995.30757558, 1002.76970817, 1002.33510028]],

       [[1002.70402781, 1004.4804087 ,  995.62147015, 1002.22738496],
        [1001.20581091,  999.48336639,  997.92665652, 1009.47901947],
        [1004.53680946, 1002.06884899, 1003.89421437,  988.54842017]],

       [[1005.54705467, 1001.41279246, 1010.81196967,  999.99064121],
        [1002.28448898, 1007.06778495, 1003.0859697 ,  992.60417697],
        [ 994.06430507, 1001.57810188,  999.09961031, 1005.65653522]],

       [[1002.91248314, 1000.65354538,  998.26454313,  994.57790549],
        [ 995.6753857 ,  999.4445824 ,  994.76902799, 1007.93786387],
        [1001.33220658, 1005.52046376,  990.19411255, 1005.20158207]],

       [[1003.63647625,  997.67147935, 1010.92041105, 1005.96883314],
        [ 998.73434611, 1006.6388429 ,  998.57446581,  995.90151923],
        [ 997.77289013, 1002.39281066, 1001.82318955, 1001.96416805]]])
Coordinates:
  * time     (time) datetime64[ns] 40B 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 24B 25.0 40.0 55.0
  * lon      (lon) float64 32B -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          hPa
    standard_name:  air_pressure

… or using dictionary access like this:

ds["Pressure"]
<xarray.DataArray 'Pressure' (time: 5, lat: 3, lon: 4)> Size: 480B
array([[[ 998.65215526, 1000.37676813,  998.87493028, 1006.40790679],
        [ 991.85169879,  992.6599739 ,  997.68192663,  998.66982344],
        [1001.42969169,  995.30757558, 1002.76970817, 1002.33510028]],

       [[1002.70402781, 1004.4804087 ,  995.62147015, 1002.22738496],
        [1001.20581091,  999.48336639,  997.92665652, 1009.47901947],
        [1004.53680946, 1002.06884899, 1003.89421437,  988.54842017]],

       [[1005.54705467, 1001.41279246, 1010.81196967,  999.99064121],
        [1002.28448898, 1007.06778495, 1003.0859697 ,  992.60417697],
        [ 994.06430507, 1001.57810188,  999.09961031, 1005.65653522]],

       [[1002.91248314, 1000.65354538,  998.26454313,  994.57790549],
        [ 995.6753857 ,  999.4445824 ,  994.76902799, 1007.93786387],
        [1001.33220658, 1005.52046376,  990.19411255, 1005.20158207]],

       [[1003.63647625,  997.67147935, 1010.92041105, 1005.96883314],
        [ 998.73434611, 1006.6388429 ,  998.57446581,  995.90151923],
        [ 997.77289013, 1002.39281066, 1001.82318955, 1001.96416805]]])
Coordinates:
  * time     (time) datetime64[ns] 40B 2018-01-01 2018-01-02 ... 2018-01-05
  * lat      (lat) float64 24B 25.0 40.0 55.0
  * lon      (lon) float64 32B -120.0 -100.0 -80.0 -60.0
Attributes:
    units:          hPa
    standard_name:  air_pressure
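Coordinates can be accessed in much the same way as data variables. The sketch below uses a small hypothetical dataset ds_demo (the names and the seed are assumptions) to list the variable and coordinate names and to pull out the latitude values.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical dataset mirroring ds (seed and names are assumptions)
rng = np.random.default_rng(0)
coords = {
    "time": pd.date_range("2018-01-01", periods=5),
    "lat": np.linspace(25, 55, 3),
    "lon": np.linspace(-120, -60, 4),
}
dims = ["time", "lat", "lon"]
ds_demo = xr.Dataset(
    data_vars={
        "Temperature": (dims, 283 + 5 * rng.standard_normal((5, 3, 4))),
        "Pressure": (dims, 1000.0 + 5 * rng.standard_normal((5, 3, 4))),
    },
    coords=coords,
)

# Names of the data variables and of the coordinates
print(list(ds_demo.data_vars))
print(list(ds_demo.coords))

# A coordinate can be retrieved like a data variable or through .coords
lat_from_key = ds_demo["lat"]
lat_from_coords = ds_demo.coords["lat"]
print(lat_from_key.values)
```

Dictionary-style access works for any variable name, whereas the dot notation only works when the name is a valid Python identifier and does not clash with an existing Dataset attribute or method.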

We’ll return to the Dataset object when we start loading data from files in later tutorials today.

Summary#

In this initial tutorial, the DataArray and Dataset objects were utilized to create and explore synthetic examples of climate data.

Resources#

Code and data for this tutorial are based on existing content from Project Pythia.