Tutorial 1: ClimateBench Dataset and How Machine Learning Can Help


Week 2, Day 5, AI and Climate Change

Content creators: Deepak Mewada, Grace Lindsay

Content reviewers: Mujeeb Abdulfatai, Nkongho Ayuketang Arreyndip, Jeffrey N. A. Aryee, Paul Heubel, Jenna Pearson, Abel Shibu

Content editors: Deepak Mewada, Grace Lindsay

Production editors: Konstantine Tsafatinos

Our 2024 Sponsors: CMIP, NFDI4Earth

Tutorial Objectives#

Estimated timing of tutorial: 25 minutes

Today, you will work through a total of 6 short tutorials. In Tutorial 1, you will cover the fundamentals, including climate model emulators and the ClimateBench dataset. You will gain insights into Earth System Models (ESMs) and Shared Socioeconomic Pathways (SSPs), alongside practical visualization techniques for ClimateBench features. Tutorial 2 expands on these foundations, exploring decision trees, hyperparameters, and random forest models. You will learn to evaluate regression models, focusing on the coefficient of determination (\(R^2\)), and gain hands-on experience implementing models using scikit-learn. Tutorial 3 shifts focus to mitigating overfitting in machine learning models. Here, you will learn the importance of model generalization and acquire practical skills for splitting data into training and test sets. In Tutorial 4, you will refine your understanding of model robustness, with emphasis on within-distribution generalization and testing model performance on similar data. Tutorial 5 challenges you to test your models on various types of out-of-distribution data, while also exploring the role of climate model emulators in climate science research. Finally, Tutorial 6 concludes the series by discussing practical applications of AI and machine learning in addressing climate change-related challenges, and by introducing available resources and tools in the field of climate change AI.

In this tutorial, you will

Learn about the basics of data science and machine learning.
Define “climate model emulators”.
Be introduced to the ClimateBench dataset.
Visualize features from this dataset.

Setup#

# imports
import matplotlib.pyplot as plt     # For plotting graphs
import pandas as pd                 # For data manipulation
import xarray as xr                 # For multidimensional data manipulation
import seaborn as sns               # For advanced visualizations
import cartopy.crs as ccrs          # for geospatial visualizations

Figure Settings#

Set random seed#

Executing set_seed(seed=seed) sets the random seed.

Random seed 42 has been set.

Video 1: Machine Learning on ClimateBench data#

Section 1: ClimateBench Dataset and How Machine Learning Can Help#

Section Objectives:

Understand how machine learning can be helpful generally
Understand the climate model data we will be working with
Understand the concept of a climate model emulator
Learn how to explore the dataset

Section 1.1: About the ClimateBench dataset#

The ClimateBench dataset offers a comprehensive collection of hypothetical climate data derived from sophisticated computer simulations (specifically, the NorESM2 model, available via CMIP6). It includes information on key climate variables such as temperature, precipitation, and diurnal temperature range. These values are collected by running simulations that represent the different Shared Socioeconomic Pathways (SSPs). Each pathway is associated with a different projected emissions profile over time. This data thus provides insights into how these climate variables may change in the future under different emission scenarios. By utilizing this dataset, researchers can develop predictive models to better understand and anticipate the impacts of climate change, ultimately aiding in the development of effective mitigation strategies. Specifically, this dataset is well-formatted for training machine learning models, which is exactly what you will do here.

A brief overview of the ClimateBench dataset is provided below; for additional details, please refer to the full paper:

ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections

Spatial Resolution:#

The simulations are conducted on a grid with a spatial resolution of approximately 2°, allowing for analysis of regional climate patterns and phenomena.

Variables:#

The dataset includes four main variables defined for each point on the grid:

Temperature (TAS): Represents the annual mean surface air temperature.
Diurnal Temperature Range (DTR): Reflects the difference between the maximum and minimum temperatures within a day averaged annually.
Precipitation (PR): Indicates the annual total precipitation.
90th Percentile of Precipitation (PR90): Captures extreme precipitation events by identifying the 90th percentile of daily precipitation values.
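To make the definitions of DTR and PR90 above concrete, here is a minimal sketch that computes both from daily values for a single grid cell and a single year. The daily series are randomly generated placeholders, not ClimateBench data, and the names (tmin, tmax, precip) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the toy data is reproducible
n_days = 365

# Hypothetical daily series for ONE grid cell and ONE year (not ClimateBench data).
tmin = 10.0 + rng.normal(0.0, 2.0, n_days)             # daily minimum temperature
tmax = tmin + rng.uniform(5.0, 12.0, n_days)           # daily maximum, always above tmin
precip = rng.gamma(shape=0.5, scale=4.0, size=n_days)  # skewed daily precipitation

# DTR: daily (max - min) temperature difference, averaged over the year.
dtr = (tmax - tmin).mean()

# PR90: 90th percentile of the daily precipitation values.
pr90 = np.percentile(precip, 90)

print(f"DTR  = {dtr:.2f}")
print(f"PR90 = {pr90:.2f}")
```

Roughly 10% of the days have precipitation above pr90, which is why this variable is used as an indicator of extreme precipitation.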

ScenarioMIP Simulations:#

The dataset incorporates ScenarioMIP simulations, exploring various future emission pathways under different socio-economic scenarios. Each scenario is defined by a set of annual emissions values over future years. We will look at 5 different scenarios in total here.

Emissions Inputs:#

Emissions scenarios are defined according to the following four types of emissions:

Carbon dioxide (CO₂) emissions.
Methane (CH₄) emissions.
Sulfur dioxide (SO₂) emissions, a precursor to sulfate aerosols.
Black carbon (BC) emissions.

Note: In the ClimateBench dataset, sulfur dioxide and black carbon emissions are provided as a spatial map over grid locations, but we will just look at global totals here.
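The tutorial does not say how the spatial maps were reduced to global totals, so the following is only a sketch of one common approach: multiply a per-area emission flux by the area of each grid cell and sum over the globe. The regular grid, the uniform flux value, and the helper name cell_areas are all assumptions for illustration.

```python
import numpy as np

EARTH_RADIUS_M = 6.371e6  # mean Earth radius in metres


def cell_areas(lat_edges_deg, lon_edges_deg):
    """Area (m^2) of each latitude-longitude cell on a sphere."""
    lat = np.deg2rad(lat_edges_deg)
    lon = np.deg2rad(lon_edges_deg)
    band = np.sin(lat[1:]) - np.sin(lat[:-1])   # shape (n_lat,)
    width = lon[1:] - lon[:-1]                  # shape (n_lon,)
    return EARTH_RADIUS_M**2 * np.outer(band, width)


# Hypothetical regular grid: 96 latitude bands x 144 longitude bands.
lat_edges = np.linspace(-90.0, 90.0, 97)
lon_edges = np.linspace(0.0, 360.0, 145)
areas = cell_areas(lat_edges, lon_edges)

# Hypothetical uniform emission flux in kg m^-2 s^-1 (placeholder value).
flux = np.full(areas.shape, 1.0e-12)

# Global total in kg s^-1: flux times cell area, summed over all cells.
global_total = (flux * areas).sum()
print(global_total)
```

Cells near the poles cover less area than cells near the equator, so a plain unweighted sum or mean over grid cells would over-count high latitudes.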

Model Specifications:#

Simulation Model: the NorESM2 model is run in its low atmosphere-medium ocean resolution (LM) configuration.
Model Components: Fully coupled earth system including the atmosphere, land, ocean, ice, and biogeochemistry components.
Ensemble Averaging: Target variables are averaged over three ensemble members to mitigate internal variability contributions.
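The effect of ensemble averaging can be illustrated with a small synthetic example. The sketch below assumes each ensemble member is a shared forced signal plus independent noise standing in for internal variability. The trend, noise level, and array names are invented for illustration and are not taken from NorESM2.

```python
import numpy as np

rng = np.random.default_rng(42)
n_members, n_years = 3, 36

# Hypothetical forced warming signal shared by all members.
forced = np.linspace(0.0, 1.5, n_years)

# Member-specific noise standing in for internal variability.
noise = rng.normal(0.0, 0.3, size=(n_members, n_years))
members = forced + noise                    # shape (3, 36)

# Ensemble mean: average over the member axis.
ensemble_mean = members.mean(axis=0)        # shape (36,)

# Root-mean-square deviation from the forced signal.
member_rmse = np.sqrt(((members - forced) ** 2).mean(axis=1))
mean_rmse = np.sqrt(((ensemble_mean - forced) ** 2).mean())

print("per-member RMSE:", member_rmse)
print("ensemble-mean RMSE:", mean_rmse)
```

The ensemble mean sits closer to the forced signal than the members do on average, because independent noise partially cancels when averaged.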

By leveraging the ClimateBench dataset, researchers gain insights into climate dynamics, enabling the development and evaluation of predictive models crucial for understanding and addressing climate change challenges.

[Figure: W2D5_Tutorial1_climatebench_Scenario, overview of the ClimateBench scenarios (“experiments”)]

For simplicity’s sake, we’ll utilize a condensed version of the ClimateBench dataset. As mentioned above, we will be looking at only 5 scenarios (‘SSPs’, listed above as “experiments”), and all emissions will be given as global annual averages for the years 2015 to 2050. Furthermore, we will include climate variables for each spatial location (as defined by latitude and longitude for a restricted region) for the year 2015. The target for our model prediction will be temperature in the year 2050 for each spatial location.

Section 1.2: Load the Dataset (Condensed Version)#

We will use pandas to interact with the data, which is shared in the .csv format. First, let us load the environmental data into a pandas dataframe and print its contents.

# Load the dataset
url_Climatebench_train_val = "https://osf.io/y2pq7/download"
training_data = pd.read_csv(url_Climatebench_train_val)

Section 1.3: Explore Data Structure#

Next, we will quickly explore the size of the data, check for missing data, and inspect the column names.

print(training_data.shape)

(3240, 152)

This tells us we have 3240 rows and 152 columns.
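Beyond the overall shape, it can help to count how many rows belong to each scenario and how many distinct locations appear. The sketch below uses a tiny hand-made frame (toy) with placeholder values so that it runs offline; the same calls should work on training_data.

```python
import pandas as pd

# Toy stand-in for training_data: same kind of columns, placeholder values.
toy = pd.DataFrame({
    "scenario": ["ssp126", "ssp126", "ssp370-lowNTCF", "ssp370-lowNTCF"],
    "lat": [-19.9, -19.9, -19.9, -19.9],
    "lon": [0.0, 2.5, 0.0, 2.5],
    "tas_2015": [0.55, 0.65, 0.50, 0.61],
    "tas_FINAL": [0.85, 0.74, 1.90, 2.10],
})

# Number of rows for each scenario.
rows_per_scenario = toy["scenario"].value_counts()

# Number of distinct (lat, lon) locations.
n_locations = toy[["lat", "lon"]].drop_duplicates().shape[0]

print(rows_per_scenario)
print("distinct locations:", n_locations)
```

If every scenario covers the same grid, the number of rows should equal the number of scenarios times the number of distinct locations.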

Let’s look at what these rows and columns mean:

training_data

	scenario	lat	lon	tas_2015	pr_2015	pr90_2015	dtr_2015	tas_FINAL	CO2_2015	SO2_2015	...	CH4_2048	BC_2048	CO2_2049	SO2_2049	CH4_2049	BC_2049	CO2_2050	SO2_2050	CH4_2050	BC_2050
0	ssp126	-19.894737	0.0	0.547699	-4.770247e-07	-1.412226e-07	0.034963	0.848419	1536.072222	6.686393e-08	...	0.206332	1.434831e-09	2585.223981	1.603985e-08	0.203214	1.398414e-09	2604.946519	1.547451e-08	0.200096	1.361996e-09
1	ssp126	-19.894737	2.5	0.648376	-2.947038e-07	-4.729113e-07	0.039381	0.737915	1536.072222	6.686393e-08	...	0.206332	1.434831e-09	2585.223981	1.603985e-08	0.203214	1.398414e-09	2604.946519	1.547451e-08	0.200096	1.361996e-09
2	ssp126	-19.894737	5.0	0.696808	-2.691091e-07	-5.525026e-07	0.021043	0.588806	1536.072222	6.686393e-08	...	0.206332	1.434831e-09	2585.223981	1.603985e-08	0.203214	1.398414e-09	2604.946519	1.547451e-08	0.200096	1.361996e-09
3	ssp126	-19.894737	7.5	0.721252	-4.967706e-08	-5.830042e-07	0.020420	0.522766	1536.072222	6.686393e-08	...	0.206332	1.434831e-09	2585.223981	1.603985e-08	0.203214	1.398414e-09	2604.946519	1.547451e-08	0.200096	1.361996e-09
4	ssp126	-19.894737	10.0	0.898682	-3.642627e-07	-9.914260e-07	-0.033305	0.776642	1536.072222	6.686393e-08	...	0.206332	1.434831e-09	2585.223981	1.603985e-08	0.203214	1.398414e-09	2604.946519	1.547451e-08	0.200096	1.361996e-09
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3235	ssp370-lowNTCF	63.473684	32.5	0.525085	1.653533e-06	4.044508e-06	0.107734	1.626139	1536.072222	6.686393e-08	...	0.530093	2.932431e-09	3231.101144	2.975203e-08	0.534263	2.840629e-09	3291.118087	2.854076e-08	0.538434	2.748826e-09
3236	ssp370-lowNTCF	63.473684	35.0	0.643158	1.000110e-06	3.569633e-06	0.020086	1.804036	1536.072222	6.686393e-08	...	0.530093	2.932431e-09	3231.101144	2.975203e-08	0.534263	2.840629e-09	3291.118087	2.854076e-08	0.538434	2.748826e-09
3237	ssp370-lowNTCF	63.473684	37.5	0.819377	8.274455e-07	3.599522e-06	-0.055249	1.925557	1536.072222	6.686393e-08	...	0.530093	2.932431e-09	3231.101144	2.975203e-08	0.534263	2.840629e-09	3291.118087	2.854076e-08	0.538434	2.748826e-09
3238	ssp370-lowNTCF	63.473684	40.0	0.795258	6.147420e-07	-4.846323e-07	0.078986	2.026601	1536.072222	6.686393e-08	...	0.530093	2.932431e-09	3231.101144	2.975203e-08	0.534263	2.840629e-09	3291.118087	2.854076e-08	0.538434	2.748826e-09
3239	ssp370-lowNTCF	63.473684	42.5	0.889465	1.107282e-06	2.231149e-06	0.076956	2.162618	1536.072222	6.686393e-08	...	0.530093	2.932431e-09	3231.101144	2.975203e-08	0.534263	2.840629e-09	3291.118087	2.854076e-08	0.538434	2.748826e-09

3240 rows × 152 columns

Each row represents a combination of spatial location and scenario. The scenario can be found in the ‘scenario’ column while the location is given in the ‘lat’ and ‘lon’ columns. Climate variables for 2015 are given in the following columns and tas_FINAL represents the temperature in 2050. After these columns, we get the annual global emissions values for each of the 4 emissions types included in ClimateBench, starting in 2015 and ending in 2050.
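The column layout described above can be checked with a little arithmetic: 4 emissions types over the 36 years from 2015 to 2050 give 144 emissions columns, and the 8 remaining columns bring the total to 152. The sketch below rebuilds the column names on a placeholder frame (all values are zeros, not real data) and separates the target from the inputs. This is only one plausible split; later tutorials may select features differently.

```python
import pandas as pd

years = range(2015, 2051)              # 2015..2050 inclusive: 36 years
gases = ["CO2", "SO2", "CH4", "BC"]    # the 4 emissions types
emission_cols = [f"{g}_{y}" for y in years for g in gases]
other_cols = ["scenario", "lat", "lon", "tas_2015", "pr_2015",
              "pr90_2015", "dtr_2015", "tas_FINAL"]

# Toy frame: column names as in the tutorial, values are placeholders.
toy = pd.DataFrame(0.0, index=range(3), columns=other_cols + emission_cols)
toy["scenario"] = "ssp126"

# Count emissions columns by their name prefix (text before the underscore).
n_emission = sum(col.split("_")[0] in gases for col in toy.columns)
print(toy.shape[1], "columns in total,", n_emission, "of them emissions")

# Separate the prediction target (temperature in 2050) from the inputs.
target = toy["tas_FINAL"]
features = toy.drop(columns=["tas_FINAL"])
```

On the real data, replacing toy with training_data should report the same column counts.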

Handle Missing Values (if necessary):

Many machine learning models cannot be trained if values are missing anywhere in the dataset. Therefore, we will check for missing values using training_data.isnull().sum(), which counts the ‘null’ (missing) values in each column. If missing values exist, we can consider imputation techniques (e.g., fillna, interpolate) based on the nature of the data and the specific column.

training_data.isnull().sum()

scenario    0
lat         0
lon         0
tas_2015    0
pr_2015     0
           ..
BC_2049     0
CO2_2050    0
SO2_2050    0
CH4_2050    0
BC_2050     0
Length: 152, dtype: int64

Here, isnull().sum() returns zero for every column, so there are no missing values. We are good to go!
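Because the printed per-column output is truncated, a single overall count is a convenient way to confirm the result. The sketch below also shows the two imputation options mentioned earlier (fillna and interpolate) on a small hand-made frame with deliberately missing entries. The frame toy and its values are placeholders, not ClimateBench data.

```python
import numpy as np
import pandas as pd

# Toy frame with two deliberately missing entries.
toy = pd.DataFrame({
    "tas_2015": [0.55, np.nan, 0.70, 0.72],
    "pr_2015": [1.0e-7, 2.0e-7, np.nan, 4.0e-7],
})

# One number for the whole frame instead of one per column.
total_missing = int(toy.isnull().sum().sum())
has_missing = bool(toy.isnull().values.any())
print("missing entries:", total_missing)

# Option 1: replace each missing value with its column mean.
filled_mean = toy.fillna(toy.mean())

# Option 2: linear interpolation between neighbouring rows.
filled_interp = toy.interpolate()
print(filled_interp)
```

Note that interpolate fills along the row order, which only makes sense when neighbouring rows are related (for example, consecutive years); for unordered rows, a column-wise statistic such as the mean is usually the safer default.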

Summary#

In this tutorial, you acquainted yourself with the ClimateBench dataset and explored how machine learning can contribute to climate analysis. We discussed the versatility of machine learning and its role in predicting climate variables, and we saw how ClimateBench makes climate model data accessible. We also emphasized the importance of data visualization and worked through practical exercises to explore the dataset.

Resources#

ClimateBench v1.0: A Benchmark for Data-Driven Climate Projections