Bonus Tutorial 7: Deep Learning for Climate Prediction with CNN-LSTMs (PyTorch)


Week 2, Day 4, AI and Climate Change

Content creators: Deepak Mewada, Grace Lindsay

Content reviewers: Jenna Pearson

Content editors: Deepak Mewada, Grace Lindsay

Production editors: Jenna Pearson, Konstantine Tsafatinos

Our 2024 Sponsors: CMIP, NFDI4Earth

Tutorial Objectives#

Estimated timing of tutorial: 60 minutes

Welcome back! You’ve skillfully applied scikit-learn to climate modeling in Tutorial 1 and Tutorial 2. Now, get ready to dive into the world of Deep Learning using PyTorch! This tutorial focuses on a Convolutional Neural Network (CNN) combined with a Long Short-Term Memory (LSTM) network, a powerful architecture for spatiotemporal data.

In this tutorial, you will learn:

  • Deep Learning Fundamentals

  • PyTorch Primer

  • Climate Data in Tensors

  • Defining the DL model - CNN-LSTM

  • Training the Model

  • Making Predictions with the Trained Model

Setup#

import numpy as np # Numerical computing
import xarray as xr # Labeled multi-dimensional arrays
import pandas as pd # Data analysis and manipulation
import cartopy.crs as ccrs # Geospatial plotting
import matplotlib.pyplot as plt # Plotting
from types import MethodType
from IPython.display import clear_output
import types

import torch # PyTorch!
import torch.nn as nn # Neural network layers
import torch.optim as optim # Optimization algorithms
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

import random # Random number generation

#from tqdm import tqdm
from ipywidgets import interact, IntSlider

import os, sys, contextlib
import pooch
from IPython.display import display
import logging
import plotly.express as px

Install and import feedback gadget#

Hide code cell source
# @title Install and import feedback gadget

!pip3 install vibecheck datatops --quiet

from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "comptools_4clim",
            "user_key": "l5jpxuee",
        },
    ).render()


feedback_prefix = "W2D4_T7"

Figure settings#

Hide code cell source
# @title Figure settings
import ipywidgets as widgets  # interactive display

%config InlineBackend.figure_format = 'retina'
plt.style.use(
    "https://raw.githubusercontent.com/neuromatch/climate-course-content/main/cma.mplstyle"
)

Data retrieval helper function#

Hide code cell source
# @title  Data retrieval helper function

# Helper functions to download selected climate scenarios and test data (hidden setup)

# Silence pooch warnings (e.g., SHA256 hash prints)
import logging
logging.getLogger("pooch").setLevel(logging.CRITICAL)
import warnings
warnings.filterwarnings("ignore")


# Scenario-to-OSF code mapping
scenario_code_map = {
    'ssp126': ('jvqg5', '9jmsy'),
    #'ssp245': ('hqvkz', 'k7fqu'),
    'ssp370': ('4snxb', 'zcafm'),
    'ssp585': ('sejxt', 'vwg39'),
    'hist-GHG': ('p84hg', 'ys7nu'),
    'hist-aer': ('q7skr', 'bq3k8'),
    'historical': ('kqxet', 'une23')
}

osf_base_url = "https://osf.io/download/"

# Suppress printing (used in background downloads)

@contextlib.contextmanager
def suppress_output():
    # Save the current file descriptors
    devnull = os.open(os.devnull, os.O_WRONLY)
    old_stdout = os.dup(1)
    old_stderr = os.dup(2)

    try:
        # Redirect stdout and stderr to /dev/null
        os.dup2(devnull, 1)
        os.dup2(devnull, 2)
        yield
    finally:
        # Restore original stdout and stderr
        os.dup2(old_stdout, 1)
        os.dup2(old_stderr, 2)
        os.close(devnull)
        os.close(old_stdout)
        os.close(old_stderr)

# Function to download train/val files for selected scenarios
def download_selected_scenarios(selected_climate_input_vars, train_val_dir="Data/train_val/"):
    os.makedirs(train_val_dir, exist_ok=True)
    file_map = {
        'inputs_historical.nc': scenario_code_map['historical'][0],
        'outputs_historical.nc': scenario_code_map['historical'][1]
    }

    for s in selected_climate_input_vars:
        code_in, code_out = scenario_code_map[s]
        file_map[f'inputs_{s}.nc'] = code_in
        file_map[f'outputs_{s}.nc'] = code_out

    with suppress_output():
        for fname, code in file_map.items():
            url = osf_base_url + code + "/"
            _ = pooch.retrieve(url=url, known_hash=None, fname=fname, path=train_val_dir, progressbar=False)

# Function to download test files (must be called explicitly)
def download_test_data(test_dir="Data/test/"):
    os.makedirs(test_dir, exist_ok=True)
    file_map = {
        'inputs_ssp245.nc': '8gpvw',
        'outputs_ssp245.nc': '9pmtx'
    }
    with suppress_output():
        for fname, code in file_map.items():
            url = osf_base_url + code + "/"
            _ = pooch.retrieve(url=url, known_hash=None, fname=fname, path=test_dir, progressbar=False)

Helper functions {“run”:“auto”,“display-mode”:“form”}#

Hide code cell source
# @title Helper functions  {"run":"auto","display-mode":"form"}

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML, display, clear_output
from ipywidgets import interact
import warnings
from matplotlib.colors import LogNorm
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", message=".*FrozenMappingWarningOnValuesAccess.*")
# Ensure interactive mode
%matplotlib inline

flag=0 #to be used in section 10.1


# Helper function for climate scenario selector
def setup_scenario_selector():
    # Available scenarios
    available_scenarios = ['historical','ssp126',  'ssp370', 'ssp585', 'hist-GHG', 'hist-aer']
    all_scenarios_set = set(available_scenarios)

    # Declare global to store selected scenarios
    global scenario_selected
    scenario_selected = ['historical']  # Default

    # Dropdown widget
    selector = widgets.SelectMultiple(
        options=available_scenarios,
        value=tuple(scenario_selected),
        description='Scenarios:',
        rows=len(available_scenarios),
        layout=widgets.Layout(width='50%'),
        style={'description_width': 'initial'}
    )

    # Output area
    output = widgets.Output()

    # Callback on selection change
    def on_change(change):
        global scenario_selected
        scenario_selected = list(selector.value)
        with output:
            clear_output(wait=True)
            print(f"Selected scenarios: {scenario_selected}")
            if set(scenario_selected) != all_scenarios_set:
                missing = sorted(all_scenarios_set - set(scenario_selected))
                print(f"Tip: You did not select: {missing}")
                print("For better model performance, select **all** scenarios.")
        # Call downstream logic (if defined)
        download_selected_scenarios(scenario_selected)

    # Attach observer
    selector.observe(on_change, names='value')

    # Initial display
    on_change(None)
    display(selector, output)



def setup_climate_input_selector():
    # Make accessible outside the function
    global selected_climate_input_vars

    climate_vars = ['CO2', 'CH4', 'SO2', 'BC']

    dropdown = widgets.SelectMultiple(
        options=climate_vars,
        value=('CO2',),  # Default selection
        description='Inputs:',
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='50%'),
    )

    output = widgets.Output()

    def on_change(change):
        with output:
            clear_output()
            selected = list(dropdown.value)
            missed = list(set(climate_vars) - set(selected))

            print(f"Selected input variables: {selected}")
            if missed:
                print(f"⚠️ You missed: {missed}")
                print("🔎 Tip: For best model performance, please select **all** input variables.")
            else:
                print("✅ All input variables selected. Great for training!")

        # Update global variable
        global selected_climate_input_vars
        selected_climate_input_vars = selected

    dropdown.observe(on_change, names='value')

    display(dropdown, output)
    on_change(None)  # Initial trigger

def plot_climate_heatmap(Y_train, climate_var='pr'):
    """
    Creates an interactive heatmap for visualizing climate variables over time.

    Parameters:
    - Y_train (xarray.Dataset): Dataset containing climate variables.
    - climate_var (str): Variable to visualize (default: 'pr' for precipitation).
    """
    climate_data = Y_train[0][climate_var]  # Extract data for the first simulation

    def plot_data(time_step=0):
        fig = px.imshow(
            climate_data.isel(time=time_step).values,
            color_continuous_scale='viridis',
            labels={'x': "Longitude", 'y': "Latitude"},
            title=f"{climate_var.upper()} at Time Step {time_step}"
        )
        fig.show()

    return widgets.interactive(plot_data, time_step=(0, climate_data.sizes['time'] - 1, 1))


def plot_climate_timeseries(Y_train, climate_var='pr', latitude=50.0, longitude=-120.0):
    """
    Creates an interactive time series plot for a specific location.

    Parameters:
    - Y_train (xarray.Dataset): Climate dataset.
    - climate_var (str): Climate variable to visualize.
    - latitude (float): Latitude of the location.
    - longitude (float): Longitude of the location.
    """
    # Find the closest grid point
    lat_idx = np.abs(Y_train[0]['latitude'] - latitude).argmin()
    lon_idx = np.abs(Y_train[0]['longitude'] - longitude).argmin()

    # Extract time series data for this location
    climate_time_series = Y_train[0][climate_var][:, lat_idx, lon_idx]

    # Create the interactive plot
    fig = px.line(
        x=Y_train[0]['time'],
        y=climate_time_series,
        labels={'x': "Time", 'y': f"{climate_var.upper()}"},
        title=f"{climate_var.upper()} Time Series at ({latitude}, {longitude})"
    )
    fig.show()

def interactive_variable_selector(Y_train, plot_function):
    """
    Creates an interactive dropdown menu to select a climate variable
    and updates the visualization accordingly.

    Parameters:
    - Y_train (xarray.Dataset): Dataset containing climate variables.
    - plot_function (function): Function to visualize the selected climate variable.
    """
    variable_selector = widgets.Dropdown(
        options=list(Y_train[0].data_vars.keys()),
        description="Variable:"
    )

    def update_variable(selected_var):
        plot_function(Y_train, selected_var)  # Call the plotting function with the selected variable

    return widgets.interactive(update_variable, selected_var=variable_selector)

def compare_inputs_outputs(X_train_torch, Y_train_torch):
    """
    Creates an interactive widget to compare input climate variables with
    the predicted surface air temperature (TAS).

    Parameters:
    - X_train_torch (torch.Tensor): Input climate variables (samples, time, variables, height, width)
    - Y_train_torch (torch.Tensor): Target temperature values (samples, time, height, width)
    """

    def plot_sample(sample_idx, time_step):
        """
        Helper function to plot climate variables and TAS for a given sample and time step.
        """
        input_sample = X_train_torch[sample_idx, time_step].cpu().numpy()  # Shape: (4, 96, 144)
        output_sample = Y_train_torch[sample_idx, 0].cpu().numpy()  # Shape: (96, 144)
        variables = ["CO₂", "CH₄", "SO₂", "Black Carbon"]

        fig, axes = plt.subplots(1, 5, figsize=(20, 4))

        for i in range(4):
            im = axes[i].imshow(input_sample[i], cmap="coolwarm", origin="lower")
            axes[i].set_title(variables[i])
            fig.colorbar(im, ax=axes[i], shrink=0.6)

        # Plot output TAS
        im = axes[4].imshow(output_sample, cmap="coolwarm", origin="lower")
        axes[4].set_title("Surface Air Temperature (TAS)")
        fig.colorbar(im, ax=axes[4], shrink=0.6)

        plt.suptitle(f"Comparison at Time Step {time_step}, Sample {sample_idx}")
        plt.tight_layout()
        plt.show()

    # Interactive Widget
    interact(plot_sample,
             sample_idx=widgets.IntSlider(min=0, max=X_train_torch.shape[0]-1, step=1, value=0, description="Sample"),
             time_step=widgets.IntSlider(min=0, max=X_train_torch.shape[1]-1, step=1, value=0, description="Time Step"))


def animate_climate_variables(X_train_torch, sample_idx=0, scale_mode='auto'):
    """
    Creates an interactive animation to visualize selected climate input variables
    (CH₄, Black Carbon) over time, with optional log or linear scaling.
    """

    # Clear previous figures and outputs
    plt.close('all')
    clear_output(wait=True)

    # Extract input variables for the given sample
    input_seq = X_train_torch[sample_idx].cpu().numpy()  # Shape: (time, variables, height, width)

    # Only use CH₄ (1) and Black Carbon (3)
    selected_indices = [1, 3]
    variables = ["CH₄", "Black Carbon"]

    # Prepare selected input
    input_seq = input_seq[:, selected_indices, :, :]  # Now shape is (time, 2, H, W)

    # Create figure and axes
    fig, axes = plt.subplots(1, 2, figsize=(10, 4), constrained_layout=True)

    # Precompute color scale limits or normalization
    vmins, vmaxs, norms = [], [], []
    for i in range(len(variables)):
        data = input_seq[:, i, :, :]
        data_flat = data.flatten()
        all_positive = np.all(data_flat > 0)

        if scale_mode in ('log', 'auto') and all_positive:
            norm = LogNorm(vmin=np.percentile(data_flat, 5), vmax=np.percentile(data_flat, 95))
            norms.append(norm)
            vmins.append(None)
            vmaxs.append(None)
        else:
            norm = None
            norms.append(norm)
            vmins.append(np.percentile(data_flat, 5))
            vmaxs.append(np.percentile(data_flat, 95))

    # Initialize image plots
    ims = []
    for i, ax in enumerate(axes):
        if norms[i]:
            im = ax.imshow(input_seq[0, i], cmap="coolwarm", origin="lower", norm=norms[i])
        else:
            im = ax.imshow(input_seq[0, i], cmap="coolwarm", origin="lower", vmin=vmins[i], vmax=vmaxs[i])
        ax.set_title(f"{variables[i]} (Year 1)")
        fig.colorbar(im, ax=ax, shrink=0.6)
        ims.append(im)

    # Animation update function
    def update(frame):
        for i, im in enumerate(ims):
            im.set_data(input_seq[frame, i])
            axes[i].set_title(f"{variables[i]} (Year {frame+1})")

    ani = FuncAnimation(fig, update, frames=input_seq.shape[0], interval=500, repeat=True)

    clear_output(wait=True)
    display(HTML(ani.to_jshtml()))


#  1. Live Loss & Validation Tracking
def plot_loss():
    plt.figure(figsize=(8, 5))
    plt.plot(range(1, len(train_losses) + 1), train_losses, label="Training Loss", marker="o")
    plt.plot(range(1, len(val_losses) + 1), val_losses, label="Validation Loss", marker="s")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Training & Validation Loss Over Time")
    plt.legend()
    plt.grid()
    plt.show()

#  2. Weight & Gradient Evolution (Histogram)
def plot_weight_gradients(epoch):
    if epoch < len(weights_history):
        weights = np.concatenate([w.flatten() for w in weights_history[epoch]])
        grads = np.concatenate([g.flatten() for g in grads_history[epoch]]) if grads_history[epoch] else None

        fig, ax = plt.subplots(1, 2, figsize=(12, 5))

        ax[0].hist(weights, bins=50, color="blue", alpha=0.7)
        ax[0].set_title(f"Model Weights Distribution (Epoch {epoch+1})")
        ax[0].set_xlabel("Weight Value")
        ax[0].set_ylabel("Frequency")

        if grads is not None:
            ax[1].hist(grads, bins=50, color="red", alpha=0.7)
            ax[1].set_title(f"Gradient Distribution (Epoch {epoch+1})")
            ax[1].set_xlabel("Gradient Value")
            ax[1].set_ylabel("Frequency")

        plt.show()

#  3. Sample Predictions Over Time (Slider)
def plot_predictions1(epoch):
    cnn_model.eval()
    with torch.no_grad():
        X_input = X_train_torch[:10].to(next(cnn_model.parameters()).device)
        Y_pred = cnn_model(X_input).cpu().numpy()

        #Y_pred = cnn_model(X_train_torch[:10]).cpu().numpy()
        Y_true = Y_train_torch[:10].cpu().numpy()

    plt.figure(figsize=(8, 5))
    plt.plot(Y_true[epoch].flatten(), label="Ground Truth", marker="o")
    plt.plot(Y_pred[epoch].flatten(), label="Predicted", linestyle="dashed", marker="x")
    plt.xlabel("Time Steps")
    plt.ylabel("Climate Variable")
    plt.title(f"Predictions vs. Ground Truth (Sample {epoch+1})")
    plt.legend()
    plt.grid()
    plt.show()

def plot_predictions(y_true, y_pred, title):
    """Plots predicted vs. actual values as spatial maps."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    vmin, vmax = np.percentile(y_true, [5, 95])

    axes[0].imshow(y_true.squeeze(), cmap='coolwarm', vmin=vmin, vmax=vmax)
    axes[0].set_title('Ground Truth')
    axes[0].axis('off')

    axes[1].imshow(y_pred.squeeze(), cmap='coolwarm', vmin=vmin, vmax=vmax)
    axes[1].set_title('Prediction')
    axes[1].axis('off')

    plt.show()

#for widgets
def compare_inputs_outputs(sample_idx, time_step):
    """
    Compare selected input climate variables (CH₄, Black Carbon) with predicted temperature change (TAS).
    """
    input_sample = X_train_torch[sample_idx, time_step].cpu().numpy()  # (4, 96, 144)
    output_sample = Y_train_torch[sample_idx, 0].cpu().numpy()  # (96, 144)

    # Only include CH₄ (index 1) and Black Carbon (index 3)
    selected_indices = [1, 3]
    variables = ["CH₄", "Black Carbon"]

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    for i, idx in enumerate(selected_indices):
        im = axes[i].imshow(input_sample[idx], cmap="coolwarm", origin="lower")
        axes[i].set_title(variables[i])
        fig.colorbar(im, ax=axes[i], shrink=0.6)

    # Plot output TAS
    im = axes[2].imshow(output_sample, cmap="coolwarm", origin="lower")
    axes[2].set_title("Surface Air Temperature (TAS)")
    fig.colorbar(im, ax=axes[2], shrink=0.6)

    plt.suptitle(f"Comparison at Time Step {time_step}, Sample {sample_idx}")
    plt.tight_layout()
    plt.show()

Set random seed, when using pytorch {“run”:“auto”,“display-mode”:“form”}#

Executing `set_seed(seed=seed)` sets the seed.

Hide code cell source
# @title Set random seed, when using `pytorch` {"run":"auto","display-mode":"form"}

# @markdown Executing `set_seed(seed=seed)` you are setting the seed

# for DL its critical to set the random seed so that students can have a
# baseline to compare their results to expected results.
# Read more here: https://pytorch.org/docs/stable/notes/randomness.html

# Call `set_seed` function in the exercises to ensure reproducibility.


def set_seed(seed=None, seed_torch=True):
  if seed is None:
    seed = np.random.choice(2 ** 32)
  random.seed(seed)
  np.random.seed(seed)
  if seed_torch:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

  print(f'Random seed {seed} has been set.')


# In case that `DataLoader` is used
def seed_worker(worker_id):
  worker_seed = torch.initial_seed() % 2**32
  np.random.seed(worker_seed)
  random.seed(worker_seed)


set_seed(seed=2021, seed_torch=False)  # change 2021 with any number you like
Random seed 2021 has been set.
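As a hypothetical usage sketch (not part of this tutorial's pipeline), `seed_worker` can be passed to a `DataLoader` together with a seeded `torch.Generator`, so that both shuffling and worker-side RNG are reproducible. The helper is repeated here so the snippet is self-contained:

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Same helper as above: derive per-worker seeds from torch's base seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(2021)  # seeds the shuffling order

dataset = TensorDataset(torch.arange(10).float())
loader = DataLoader(dataset, batch_size=2, shuffle=True,
                    worker_init_fn=seed_worker, generator=g)

first_batch = next(iter(loader))[0]
print(first_batch)
```

With the same generator seed, the shuffled batch order is identical across runs.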

Set device (GPU or CPU). Execute set_device()#

Hide code cell source
# @title Set device (GPU or CPU). Execute `set_device()`
# especially if torch modules used.

# inform the user if the notebook uses GPU or CPU.

def set_device():
  device = "cuda" if torch.cuda.is_available() else "cpu"
  if device != "cuda":
    print("GPU is not enabled in this notebook. But it will help make training faster if GPU is enabled. \n"
          "If you want to enable it, in the menu under `Runtime` -> \n"
          "`Hardware accelerator.` and select `GPU` from the dropdown menu")
  else:
    print("GPU is enabled in this notebook. \n"
          "If you want to disable it, in the menu under `Runtime` -> \n"
          "`Hardware accelerator.` and select `None` from the dropdown menu")

  return device
device = set_device()
GPU is not enabled in this notebook. But it will help make training faster if GPU is enabled. 
If you want to enable it, in the menu under `Runtime` -> 
`Hardware accelerator.` and select `GPU` from the dropdown menu

Note:
GPU acceleration is optional for this tutorial. All code can run on CPUs, though some steps (especially model training) will take longer. Please allow additional time in CPU-only environments; you can read ahead in the tutorial while training runs.

Video 1: Deep Learning Techniques#

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Deep_Learning_Techniques")
If you want to download the slides: https://osf.io/download/abem5/


Section 1. Transitioning to Deep Learning with PyTorch: From Machine Learning to Deep Learning in Climate Data Analysis#

Section 1.1 Why Move from Machine Learning to Deep Learning?#

In our previous tutorials, we used Machine Learning (ML) models, such as Random Forests and Gradient Boosting Machines, to analyze climate data. These models are effective when working with structured, tabular datasets but come with limitations:

1️⃣ Manual Feature Engineering: ML models require carefully selected and engineered features, which may not fully capture hidden patterns in climate data.
2️⃣ Limited Spatial & Temporal Awareness: Climate data is highly spatiotemporal, meaning relationships exist across both space and time—something ML models struggle to capture.
3️⃣ Scalability Issues: ML techniques work well on small to medium-sized datasets, but struggle with high-dimensional and large-scale climate datasets.

🔹 Deep Learning (DL), on the other hand, is designed to overcome these challenges by automatically learning features from raw, unstructured climate data. Unlike ML, DL models can handle large datasets, capture complex dependencies, and extract meaningful insights without the need for manual feature engineering.

| 🔍 Aspect | Machine Learning (Previous Tutorials) | 🚀 Deep Learning (This Tutorial) |
| --- | --- | --- |
| Input Data | Structured tabular format | Raw climate data (NetCDF) |
| Feature Engineering | Manual selection required | Automatic feature extraction |
| Spatial Awareness | Limited or none | Captures spatial dependencies |
| Temporal Awareness | Limited | Captures long-term patterns |
| Scalability | Suitable for small datasets | Efficient for large datasets |

Section 1.1.1 ML Input vs. DL Input: What Changes?#

Machine Learning in Previous Tutorials

  • Input: Climate variables from 2015 and projected emissions (2015–2050)

  • Output: Predicted 2050 temperature anomaly

  • Data Format: Tabular representation with location-scenario pairs

  • Limitation: Spatial and temporal dependencies were not explicitly preserved

While ML models performed well, they lacked the ability to capture complex spatiotemporal relationships present in climate data.

Deep Learning for Spatiotemporal Data

Deep learning enables us to work with high-dimensional climate data while maintaining its spatial and temporal structure. Instead of using tabular data, we now process climate data in its original NetCDF format, which includes:

  • Variables: CO₂, CH₄, SO₂, BC

  • Dimensions: (time, latitude, longitude)

  • Input: Entire climate maps over time

  • Output: Future climate projections at a grid level

Why This Transition?

| 🚀 Advantage | 🔍 Benefit in Climate Modeling |
| --- | --- |
| Retains Spatial Structure | Climate data is naturally spatial—DL can learn patterns across regions |
| Captures Temporal Trends | Climate events are time-dependent—DL can model long-term patterns |
| Works with Raw Data | No need for manual feature engineering—model extracts features directly |
| Uses CNNs & LSTMs | Specialized layers handle both spatial (CNNs) and temporal (LSTMs) relationships |


Section 1.2 What is Deep Learning?#

Deep Learning (DL) is a specialized branch of Machine Learning that uses Artificial Neural Networks (ANNs) to learn from data in a hierarchical manner. Instead of relying on handcrafted features, DL models extract patterns directly from raw data through multiple processing layers.

Key Components of Deep Learning

🔹 1. Neural Networks

  • Deep learning models are composed of neurons (inspired by the human brain).

  • Neurons are organized into layers—each transforming the input data into meaningful representations.

  • More layers = Deeper learning, hence the term “deep learning”.

🔹 2. Training via Backpropagation

  • The model learns by adjusting weights using a technique called gradient descent.

  • The error is propagated backward to refine the model iteratively.
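To make this concrete, here is a minimal sketch of gradient descent with PyTorch's autograd on a toy one-parameter problem (the data and learning rate are invented for illustration):

```python
import torch

# Toy problem: fit w in y = w * x, where the data was generated with w = 3.
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = torch.tensor(0.0, requires_grad=True)  # the single learnable weight
lr = 0.01                                  # learning rate

for _ in range(200):
    y_pred = w * x
    loss = ((y_pred - y) ** 2).mean()  # mean squared error
    loss.backward()                    # backpropagation: computes dloss/dw
    with torch.no_grad():
        w -= lr * w.grad               # gradient descent step
        w.grad.zero_()                 # reset the gradient for the next pass

print(round(w.item(), 2))  # w converges toward 3.0
```

Each iteration propagates the error backward (`loss.backward()`) and nudges the weight against the gradient, exactly the loop that `torch.optim` automates for full networks.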

🔹 3. Deep Learning Architectures

  • Convolutional Neural Networks (CNNs): Ideal for processing spatial climate data (e.g., satellite images).

  • Recurrent Neural Networks (RNNs) & LSTMs: Designed for sequential data (e.g., temperature trends over time).

  • Hybrid CNN-LSTM Models: Capture both spatial and temporal dependencies—perfect for climate prediction.
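As a concrete sketch of the hybrid idea, a CNN can encode each spatial map and an LSTM can model the sequence of encodings. This toy model is for illustration only; all layer sizes here are arbitrary and are not the architecture used later in the tutorial:

```python
import torch
import torch.nn as nn

class TinyCNNLSTM(nn.Module):
    """Illustrative hybrid: a CNN encodes each map, an LSTM models the sequence."""
    def __init__(self, in_channels=4, hidden=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),    # collapse each map to (8, 1, 1)
        )
        self.lstm = nn.LSTM(input_size=8, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):               # x: (batch, time, channels, H, W)
        b, t, c, h, w = x.shape
        feats = self.cnn(x.view(b * t, c, h, w)).view(b, t, -1)  # (B, T, 8)
        out, _ = self.lstm(feats)       # (B, T, hidden)
        return self.head(out[:, -1])    # predict from the last time step

model = TinyCNNLSTM()
dummy = torch.randn(2, 10, 4, 96, 144)  # (batch, time, vars, lat, lon)
print(model(dummy).shape)               # torch.Size([2, 1])
```

The CNN handles the spatial pattern within each time step, while the LSTM carries information across time steps.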


Section 1.3 Why Deep Learning?#

  • Automated Feature Extraction: Deep learning models automatically discover complex relationships in the data, reducing reliance on manual feature engineering.

  • Spatiotemporal Modeling: CNN-LSTMs can simultaneously analyze spatial patterns and temporal dependencies, surpassing the capabilities of simpler models.

  • Handles Complex Data: Able to handle high-dimensional climate data more effectively than previous approaches.
Section 1.4 Why PyTorch?#

To build our deep learning models, we will use PyTorch—one of the most widely used deep learning frameworks. PyTorch provides flexibility, intuitive coding, and GPU acceleration, making it an excellent choice for research and production applications.

Advantages of PyTorch

  • Dynamic Computation Graphs: Unlike TensorFlow, PyTorch builds computational graphs dynamically, making debugging easier.

  • Easy-to-Use API: Simple, Pythonic syntax that integrates seamlessly with NumPy.

  • Efficient GPU Acceleration: Allows rapid training on GPUs, making deep learning models highly scalable.

  • Robust Library Ecosystem: Includes built-in modules for automatic differentiation, optimization, and dataset handling.
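"Dynamic computation graphs" means the graph is rebuilt on every forward pass, so ordinary Python control flow can change the model's structure from call to call. A small illustrative sketch (the weight shapes and step count are arbitrary):

```python
import torch

w = torch.randn(4, 4, requires_grad=True)

def dynamic_model(x, n_steps):
    # The graph is built on the fly: a plain Python loop decides its depth.
    for _ in range(n_steps):
        x = torch.relu(x @ w)
    return x.sum()

x = torch.randn(1, 4)
loss = dynamic_model(x, n_steps=3)  # a 3-layer graph on this call...
loss.backward()                     # ...and autograd still tracks all of it
print(w.grad.shape)                 # torch.Size([4, 4])
```

Because the graph follows the code path actually executed, you can inspect intermediate tensors with standard debuggers and `print` statements.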

PyTorch Essentials for This Tutorial

| 🔧 PyTorch Module | Purpose |
| --- | --- |
| `torch.Tensor` | Core data structure for PyTorch models |
| `torch.nn` | Provides layers like CNN, LSTM, etc. |
| `torch.optim` | Optimizers for training models |
| `torch.autograd` | Automatic differentiation for backpropagation |
| `torch.utils.data` | Handles datasets and dataloaders |

💡 In this tutorial, we will use PyTorch to implement a CNN-LSTM model for climate prediction, leveraging both spatial and temporal patterns in raw climate data.


Section 1.5 Critique of the Previous ML Tutorial: What Can Be Improved?#

While our previous ML-based approach was effective, it had some limitations that we aim to address with deep learning:

🔹 Limited Generalization: The ML model was trained on a condensed dataset, meaning it might not generalize well to real-world, large-scale climate data.
🔹 Feature Engineering Dependency: The performance of ML models heavily depends on manual feature selection, which is time-consuming and requires domain expertise.
🔹 Inability to Capture Spatial/Temporal Dependencies: Tree-based ML models treat input features as independent variables, ignoring crucial spatial and temporal correlations in climate data.
🔹 Scalability Issues: As climate datasets grow in size, traditional ML methods struggle to handle the increasing data complexity efficiently.

By moving to deep learning, we address these shortcomings by:
✅ Using raw, high-dimensional climate data instead of condensed versions.
✅ Leveraging CNNs and LSTMs to automatically learn patterns from spatial and temporal data.
✅ Utilizing GPU-accelerated PyTorch models to efficiently handle large datasets.

🚀 Next, let's load and preprocess our climate dataset for deep learning!

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_1")

Section 2: ClimateBench Data Reloaded - Now in PyTorch!#

In this section, we transition from the pandas and scikit-learn world of Tutorials 1 and 2 to the tensor-centric universe of PyTorch. We’ll load a similar ClimateBench dataset, but prepare it for our CNN-LSTM architecture.

As before, we need a set of tools. Note the key change: we bring in torch and its related modules:

Note: For a detailed understanding of PyTorch and deep learning, refer to the Neuromatch Deep Learning Course's W1D1 Tutorial 1.

Working with Tensors: PyTorch's Core Data Structure

PyTorch revolves around tensors, which are multi-dimensional arrays similar to NumPy arrays, but with the added benefit of GPU acceleration. Think of a tensor as the fundamental building block for representing data in neural networks.

Why Tensors?

  • GPU Acceleration: Enable lightning-fast computations for complex models.

  • Automatic Differentiation: Seamlessly compute gradients for training.

  • Flexibility: Represent various data types (floats, integers, etc.).
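These points can be seen in a few lines. The tiny "climate field" below is invented for illustration; real data will come from the ClimateBench NetCDF files:

```python
import numpy as np
import torch

# A toy "climate field": 2 time steps on a 3x4 lat-lon grid.
field = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

t = torch.from_numpy(field)  # NumPy array -> tensor (shares memory)
print(t.shape, t.dtype)      # torch.Size([2, 3, 4]) torch.float32

# Tensors support NumPy-like operations...
spatial_mean = t.mean(dim=(1, 2))  # mean over lat/lon per time step (5.5 and 17.5 here)
print(spatial_mean)

# ...plus gradient tracking and (when available) GPU acceleration.
device = "cuda" if torch.cuda.is_available() else "cpu"
t = t.to(device).requires_grad_()
```

The same `.to(device)` call works for entire models, which is how the rest of this tutorial switches between CPU and GPU.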

Essential Components for Climate-Informed Deep Learning
  • torch.Tensor: The base data structure for representing climate variables (temperature, emissions, etc.).

  • torch.nn: A module containing building blocks for defining our CNN-LSTM model architecture (convolutional layers, LSTM layers, etc.).

  • torch.optim: Optimization algorithms (e.g., Adam) to train the model effectively.

  • torch.utils.data.Dataset & torch.utils.data.DataLoader: Powerful tools for managing large climate datasets and efficiently feeding them into our model during training.
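The `Dataset`/`DataLoader` pair can be sketched with dummy tensors; the grid here is shrunk from 96x144 to 8x12 purely to keep the example light:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins for climate inputs/targets (shapes are illustrative).
X = torch.randn(100, 10, 4, 8, 12)  # (samples, time, vars, lat, lon)
Y = torch.randn(100, 1, 8, 12)      # (samples, 1, lat, lon)

dataset = TensorDataset(X, Y)       # pairs each input with its target
loader = DataLoader(dataset, batch_size=16, shuffle=True)

xb, yb = next(iter(loader))         # one shuffled mini-batch
print(xb.shape, yb.shape)
```

During training, iterating over `loader` yields shuffled mini-batches, so the model never has to hold the full dataset in one forward pass.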


PyTorch Core Component Breakdown

| Component | Symbol | Purpose | Climate Application Example |
| --- | --- | --- | --- |
| Tensors |  | GPU-accelerated multidimensional arrays | Store 3D atmospheric data cubes |
| `nn.Module` | 🧱 | Neural network building blocks | Create CNN-LSTM hybrid architectures |
| Optimizers | 🎯 | Parameter update strategies | Adam for stable climate model training |
| DataLoaders | 📂 | Batch processing & shuffling | Handle decades of climate observations |
| Loss Functions | 📉 | Model performance quantification | MSE for temperature prediction |

🔄 Workflow Insight: Typical development pattern: 1. Tensor Preparation → 2. Model Architecture → 3. Loss/Optimizer Setup → 4. Training Loop → 5. Validation
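The five steps above fit in a few dozen lines. This sketch uses tiny synthetic data (learning to sum 8 numbers) rather than climate data, and an arbitrary two-layer network, just to show the skeleton every PyTorch project shares:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # reproducibility

# 1. Tensor preparation (synthetic toy data; target = row sums)
X = torch.randn(64, 8)
Y = X.sum(dim=1, keepdim=True)
train = DataLoader(TensorDataset(X[:48], Y[:48]), batch_size=16, shuffle=True)
val_X, val_Y = X[48:], Y[48:]

# 2. Model architecture
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

# 3. Loss / optimizer setup
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# 4. Training loop
for epoch in range(50):
    for xb, yb in train:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

# 5. Validation
model.eval()
with torch.no_grad():
    val_loss = loss_fn(model(val_X), val_Y).item()
print(f"validation MSE: {val_loss:.3f}")
```

The CNN-LSTM training later in this tutorial follows the same five stages, only with climate tensors and a larger model.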

Section 2.1 The Shift to Spatiotemporal Data: ClimateBench in Native Format#

In Tutorials 1 and 2, we trained machine learning models using a simplified, spatially-averaged dataset. While this approach was useful, it had limitations:

  • Loss of spatial information, reducing the model’s ability to capture regional climate variations

  • Limited temporal structure, as time-series emissions were flattened into tabular form

Now, we transition to deep learning, unlocking the full potential of the ClimateBench dataset by preserving its original spatial and temporal structure.


Recap: What Was the Previous Data Format?

What Was the Previous Data Format?#

Previously, we averaged across spatial dimensions and flattened the emissions time series, resulting in:

  • Shape: (3240, 152)

    • 3240 rows → location-scenario combinations

    • 152 columns → climate variable values in 2015 + time-averaged emissions

This simplified dataset was easier to process with scikit-learn, but it sacrificed critical spatial and temporal dependencies.


Recap: What Data Are We Going to Use Now?

Recap: What Data Are We Going to Use Now?#

We now work with the original NetCDF structure, which explicitly retains all spatial and temporal information. This dataset will be structured as:

  • Input (X) → (766, 10, 96, 144, 4)

    • 766 sequences → extracted from all climate simulations

    • 10 time steps → sliding window approach over years

    • 96 latitude points → range: -90° to 90°

    • 144 longitude points → range: 0° to 357.5° (2.5° increments)

    • 4 climate variables → CO₂, CH₄, SO₂, BC

  • Target (Y) → (766, 1, 96, 144)

    • Single time-step temperature anomaly prediction

This structure allows deep learning models to capture spatial dependencies (across latitude and longitude) and learn temporal trends (using recurrent or convolutional layers).


Overall#

In Tutorials 1 & 2, we worked with a pre-processed, tabular dataset optimized for scikit-learn.
Now, we will directly load and process the raw ClimateBench dataset using xarray. This ensures that our deep learning models can fully leverage the spatiotemporal structure of climate data.

💡 Next Step: Let’s dive into the code and see how to load and preprocess this data! 🚀

First, define the path to the training data, then define the climate scenarios.

Section 2.1.1 Data Retrieval#

Selection of Climate Scenarios for Model Training#

Choose one or more climate scenarios to train the model. More data can improve performance.
This cell can be re-executed to modify selections.
Available scenarios: ['ssp126', 'ssp370', 'ssp585', 'hist-GHG', 'hist-aer', 'historical']

Tip: Select all scenarios for the best possible model performance.

Select and Download Climate Scenarios of your interest (Training Data)#

Run this cell to enable the Climate Scenarios selector dropdown.

Hide code cell source
# @title Select and Download Climate Scenarios of your interest (Training Data)
# @markdown Run this cell to enable the Climate Scenarios selector dropdown.

setup_scenario_selector()
Downloading data from 'https://osf.io/download/kqxet/' to file '/home/runner/work/climate-course-content/climate-course-content/tutorials/W2D4_AIandClimateChange/student/Data/train_val/inputs_historical.nc'.
SHA256 hash of downloaded file: 85f206cda4846841c3b6a7814961682125b8239834ef4007eb1c8fadb143ba19
Use this value as the 'known_hash' argument of 'pooch.retrieve' to ensure that the file hasn't changed if it is downloaded again in the future.
Downloading data from 'https://osf.io/download/une23/' to file '/home/runner/work/climate-course-content/climate-course-content/tutorials/W2D4_AIandClimateChange/student/Data/train_val/outputs_historical.nc'.
SHA256 hash of downloaded file: 28df86a8a3131289d99ff661e78f52a15071a141e96b22c2d8c6542cc5e6b2a3
Use this value as the 'known_hash' argument of 'pooch.retrieve' to ensure that the file hasn't changed if it is downloaded again in the future.
  • The above cell downloads the required data to the temporary storage of your Colab session.

  • You can view the downloaded files under the folder icon 📁 on the left sidebar inside the sub-folder ‘Data’.

Select Input Variables#

Run this cell to enable the input variables selector dropdown; select all for best performance.

Hide code cell source
# @title Select Input Variables
# @markdown Run this cell to enable the input variables selector dropdown; select all for best performance

# Call the function already defined under helper functions
setup_climate_input_selector()
data_path = "Data/train_val/"  # Path to data in Colab RAM
len_historical = 165  # Historical period length

Section 2.2 Loading & Processing The Full Raw ClimateBench Data#

In this section, we transition from pre-processed datasets to raw NetCDF files while ensuring correct alignment and standardization for deep learning.

What We Are Doing in the Next Code Cell:
✔️ Load historical & future climate data using xarray from NetCDF files.
✔️ Merge historical and scenario-specific simulations to create a continuous time series.
✔️ Standardize dimensions, variable names, and units for consistency.
✔️ Rescale precipitation data from kg/m²/s to mm/day.
✔️ Drop unnecessary variables to optimize memory usage.

New Functions & Methods to Look Out For:
🔹 xr.open_dataset() → Loads single NetCDF files into an xarray.Dataset.
🔹 xr.open_mfdataset() → Efficiently loads and merges multiple NetCDF files.
🔹 xr.concat() → Concatenates datasets along a specified dimension.
🔹 mean(dim='member') → Computes the ensemble mean across climate model members.
🔹 .transpose() → Ensures correct ordering of dimensions (time, latitude, longitude).

This step prepares the dataset for deep learning, ensuring that temporal and spatial relationships are correctly preserved.
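Before running the full loading cell, these xarray operations can be previewed on synthetic data. The dataset below is a toy stand-in (made-up dimensions and random values, not the ClimateBench files): it shows `mean(dim='member')` collapsing the ensemble dimension and `xr.concat()` joining a "historical" and a "future" segment along `time`.

```python
import numpy as np
import xarray as xr

# Toy datasets with an ensemble 'member' dimension (synthetic values)
hist = xr.Dataset(
    {"tas": (("member", "time", "latitude", "longitude"), np.random.rand(3, 5, 4, 6))},
    coords={"time": np.arange(1850, 1855)},
)
future = xr.Dataset(
    {"tas": (("member", "time", "latitude", "longitude"), np.random.rand(3, 5, 4, 6))},
    coords={"time": np.arange(2015, 2020)},
)

# mean(dim='member') collapses the ensemble; xr.concat joins along time
combined = xr.concat(
    [hist.mean(dim="member"), future.mean(dim="member")], dim="time"
)
print(combined.sizes)  # e.g. Frozen({'time': 10, 'latitude': 4, 'longitude': 6})
```

The real cell below does the same thing, but with NetCDF files opened via `xr.open_mfdataset()` instead of in-memory toy arrays.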


Make sure you execute this cell to load the data in two variables X_train and Y_train

Hide code cell source
# @markdown Make sure you execute this cell to load the data in two variables `X_train` and `Y_train`

# Initialize empty lists to store training input (X) and output (Y) data
X_train = []
Y_train = []

# Iterate through each climate simulation in the 'simus' list
for i, simu in enumerate(scenario_selected):
    # Construct filenames for input and output NetCDF files
    input_name = 'inputs_' + simu + '.nc'
    output_name = 'outputs_' + simu + '.nc'

    # Load input and output data using xarray
    if 'hist' in simu:  # Check if the simulation is historical
        # Open the input dataset for historical simulation
        input_xr = xr.open_mfdataset(data_path + input_name)

        # Open the output dataset and compute the mean across 'member' dimension
        # 'member' refers to different realizations (ensemble members) of the climate model
        output_xr = xr.open_mfdataset(data_path + output_name).mean(dim='member')

    else:  # If it's a future scenario simulation
        # Open historical input and scenario-specific input, merging them together
        # 'open_mfdataset' allows handling multiple files efficiently
        input_xr = xr.open_mfdataset([data_path + 'inputs_historical.nc', data_path + input_name]).compute()

        # Open and concatenate historical & scenario-specific output data along the 'time' dimension
        # Compute the mean across ensemble members for consistency
        output_xr = xr.concat([
            xr.open_mfdataset(data_path + 'outputs_historical.nc').mean(dim='member'),
            xr.open_mfdataset(data_path + output_name).mean(dim='member')
        ], dim='time').compute()

    # Standardizing variable names and units for consistency
    output_xr = output_xr.assign({
        "pr": output_xr.pr * 86400,  # Convert precipitation from kg/m²/s to mm/day
        "pr90": output_xr.pr90 * 86400  # Convert 90th percentile precipitation similarly
    }).rename({'lon': 'longitude', 'lat': 'latitude'})  # Rename dimensions for clarity

    # Ensure the dataset follows the correct ordering of dimensions
    output_xr = output_xr.transpose('time', 'latitude', 'longitude')
    input_xr = input_xr.transpose('time', 'latitude', 'longitude')
    # Drop unnecessary variables (like 'quantile') to save memory
    output_xr = output_xr.drop(['quantile'])

    # Print the dataset dimensions to verify correctness
    #print(input_xr.dims, simu)


    # Append processed input and output datasets to training lists
    X_train.append(input_xr)
    Y_train.append(output_xr)

print("The data has been successfully loaded into `X_train` and `Y_train`!")
# Print the shape of the first element (xarray.Dataset) in the list
print("The shape of the `X_train[0]` is: ", X_train[0].sizes)
print("The shape of the `Y_train[0]` is: ", Y_train[0].sizes)
The data has been successfully loaded into `X_train` and `Y_train`!
The shape of the `X_train[0]` is:  Frozen({'time': 165, 'longitude': 144, 'latitude': 96})
The shape of the `Y_train[0]` is:  Frozen({'time': 165, 'latitude': 96, 'longitude': 144})
Code explanation

We have now successfully loaded and preprocessed the ClimateBench dataset in its native NetCDF format. Here’s a breakdown:

📌 Historical vs. Future Simulations

  • If the dataset is historical (hist in simu), we load it directly.

  • If it’s a future scenario, we merge it with historical data to maintain continuity.

📌 Preprocessing Steps Applied

  • Ensured temporal consistency using concatenation (xr.concat()).

  • Standardized spatial dimensions (longitude, latitude).

  • Converted precipitation units for better interpretability.

  • Dropped unnecessary variables to optimize storage and speed.

💡 Key Takeaways
Why This Matters? Our model will now work with the full spatiotemporal structure instead of a condensed dataset, capturing richer patterns.
What’s Next? Now that the data is correctly formatted, we move to data transformation and model input preparation! 🚀

Next: Preparing for Data Normalization. Before feeding our climate data into a deep learning model, we need to ensure that all features are on a comparable scale. Different climate variables have different units and magnitudes (e.g., CO₂ in ppm vs. precipitation in mm/day), which can negatively impact model performance.

In the next section, we’ll explore data normalization, understand why it is essential, and implement a standardization technique to transform our dataset for optimal learning.

We have just loaded the dataset into two variables, X_train and Y_train.

Let's check the contents of both variables.

print(X_train)
[<xarray.Dataset> Size: 37MB
Dimensions:    (time: 165, longitude: 144, latitude: 96)
Coordinates:
  * time       (time) int64 1kB 1850 1851 1852 1853 1854 ... 2011 2012 2013 2014
  * longitude  (longitude) float64 1kB 0.0 2.5 5.0 7.5 ... 352.5 355.0 357.5
  * latitude   (latitude) float64 768B -90.0 -88.11 -86.21 ... 86.21 88.11 90.0
Data variables:
    CO2        (time) float64 1kB dask.array<chunksize=(165,), meta=np.ndarray>
    SO2        (time, latitude, longitude) float64 18MB dask.array<chunksize=(165, 96, 144), meta=np.ndarray>
    CH4        (time) float64 1kB dask.array<chunksize=(165,), meta=np.ndarray>
    BC         (time, latitude, longitude) float64 18MB dask.array<chunksize=(165, 96, 144), meta=np.ndarray>]
print(Y_train)
[<xarray.Dataset> Size: 55MB
Dimensions:                    (time: 165, latitude: 96, longitude: 144)
Coordinates:
  * time                       (time) int64 1kB 1850 1851 1852 ... 2013 2014
  * latitude                   (latitude) float64 768B -90.0 -88.11 ... 90.0
  * longitude                  (longitude) float64 1kB 0.0 2.5 ... 355.0 357.5
Data variables:
    diurnal_temperature_range  (time, latitude, longitude) float32 9MB dask.array<chunksize=(165, 96, 144), meta=np.ndarray>
    tas                        (time, latitude, longitude) float32 9MB dask.array<chunksize=(165, 96, 144), meta=np.ndarray>
    pr                         (time, latitude, longitude) float64 18MB dask.array<chunksize=(165, 96, 144), meta=np.ndarray>
    pr90                       (time, latitude, longitude) float64 18MB dask.array<chunksize=(165, 96, 144), meta=np.ndarray>]

Section 2.3 Visualization of the Data#

1️⃣ Interactive Climate Data Heatmap

  • Display a heatmap of climate variables (e.g., temperature, precipitation) at different time steps.

  • Use a slider to navigate through time dynamically.

Run this cell to activate Interactive Climate Data Heatmap {“run”:“auto”,“vertical-output”:true,“display-mode”:“form”}#

Hide code cell source
# @title Run this cell to activate Interactive Climate Data Heatmap {"run":"auto","vertical-output":true,"display-mode":"form"}

# Select a variable to visualize (e.g., 'pr' for precipitation)
climate_var = 'pr'  # Change to any variable you want
scale_mode = 'log'  # or 'linear'

# Extract data from the first simulation (Modify based on your needs)
climate_data = Y_train[0][climate_var]

# Define an interactive function to update the heatmap
def plot_climate_data(time_step=0):
    data = climate_data.isel(time=time_step).values

    if scale_mode == 'log':
        data_clipped = np.clip(data, a_min=1e-6, a_max=None)
        data = np.log10(data_clipped)
        colorbar_title = f"log10({climate_var.upper()})"
    else:
        colorbar_title = climate_var.upper()

    fig = px.imshow(
        data,
        color_continuous_scale='viridis',
        labels={'x': "Longitude", 'y': "Latitude"},
        title=f"{climate_var.upper()} at Time Step {time_step}",
    )
    fig.update_coloraxes(colorbar_title=colorbar_title)
    fig.show()
# Create an interactive slider for time navigation
widgets.interactive(plot_climate_data, time_step=(0, climate_data.sizes['time'] - 1, 1))

2️⃣ Interactive Time Series for a Specific Location
Allow users to select a location (lat, lon) and see how a climate variable changes over time.

Run this cell to activate widget#

Hide code cell source
# @title Run this cell to activate widget
plot_climate_timeseries(Y_train, climate_var='pr', latitude=30.0, longitude=-100.0)

3️⃣ Widget for Selecting Different Climate Variables
Use the dropdown widget to let yourself explore different variables dynamically.

Run this cell to activate Interactive widget {“run”:“auto”,“vertical-output”:true,“display-mode”:“form”}#

Hide code cell source
# @title Run this cell to activate Interactive widget {"run":"auto","vertical-output":true,"display-mode":"form"}

# Create a dropdown menu to select a climate variable
variable_selector = widgets.Dropdown(
    options=list(Y_train[0].data_vars.keys()),
    description="Variable:"
)

# Define a function to update the visualization based on selection
def update_variable(selected_var):
    global climate_var
    climate_var = selected_var
    plot_climate_data(0)

# Link the dropdown to the function
widgets.interactive(update_variable, selected_var=variable_selector)

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_2")

Section 3: Data Normalization#

Data normalization is a fundamental preprocessing step in deep learning, ensuring that input features are on a similar scale. This improves model training stability and overall performance.

Why is Data Normalization Essential?
In deep learning, input features can have vastly different scales. For instance, CO₂ levels are measured in ppm, while SO₂ concentrations are in ppb. Without normalization, models struggle to learn effectively due to imbalanced feature magnitudes.

Section 3.1 Benefits of Normalization

Section 3.1 Benefits of Normalization#

Faster & More Stable Training

  • Gradient-based optimizers converge efficiently when features are on a similar scale.

  • Avoids issues like vanishing/exploding gradients.

Improved Model Generalization

  • Prevents dominance of high-magnitude features over low-magnitude ones.

  • Reduces model sensitivity to varying data distributions.

Numerical Stability

  • Helps prevent extreme weight updates during backpropagation.

  • Standardized inputs maintain consistent learning dynamics across different datasets.


Section 3.2 Standardization: The Chosen Normalization Technique

Section 3.2 Standardization: The Chosen Normalization Technique#

We apply Z-score normalization, a widely used method in deep learning:
$$X_{\text{normalized}} = \frac{X - \mu}{\sigma}$$

Where:

  • $X$ = Original feature value

  • $\mu$ = Mean of the feature

  • $\sigma$ = Standard deviation of the feature

Why Z-score Normalization?
✔ Ensures a mean of 0 and a standard deviation of 1, making features comparable.
✔ Retains outlier sensitivity while improving learning efficiency.


Section 3.3 Implementing Normalization#

What We’ll Do Next:
🔹 Define functions for normalizing and unnormalizing the data.
🔹 Compute mean and standard deviation for key climate variables (CO₂, CH₄, SO₂, BC).
🔹 Apply Z-score transformation to the input dataset.

This ensures our data is preprocessed optimally for deep learning. Let’s proceed! 🚀

# **3. Data Normalization**

# Utility function to normalize data using Z-score normalization
def normalize(data, var, meanstd_dict):
    """
    Applies standardization (Z-score normalization) to the given variable.
    Formula: (X - mean) / std
    - data: The input array for the variable.
    - var: Name of the variable being normalized.
    - meanstd_dict: Dictionary containing mean and standard deviation for each variable.

    Returns:
        Normalized data with mean 0 and standard deviation 1.
    """
    mean = meanstd_dict[var][0]  # Extract mean for the variable
    std = meanstd_dict[var][1]   # Extract standard deviation for the variable
    return (data - mean) / std    # Apply normalization


# Utility function to revert normalized data back to its original scale
def unnormalize(data, var, meanstd_dict):
    """
    Converts standardized data back to its original scale.
    Formula: X = (X_normalized * std) + mean
    - data: The normalized array.
    - var: Name of the variable to be unnormalized.
    - meanstd_dict: Dictionary containing mean and standard deviation for each variable.

    Returns:
        Unnormalized data in its original scale.
    """
    mean = meanstd_dict[var][0]  # Extract mean for the variable
    std = meanstd_dict[var][1]   # Extract standard deviation for the variable
    return data * std + mean      # Apply inverse transformation
# **Step 1: Compute mean and standard deviation for each variable across the dataset**
meanstd_inputs = {}  # Dictionary to store computed mean and standard deviation


#selected_climate_input_vars = ['CO2', 'CH4', 'SO2', 'BC'] # uncomment this for better result
selected_simulations_ids = range(0, len(X_train))
for var in selected_climate_input_vars:  # Iterate over selected climate variables; selected_climate_input_vars comes from the hidden cell's dropdown selection
    # Concatenate relevant data samples across historical and future simulations
    array = np.concatenate(
        [X_train[i][var].data for i in selected_simulations_ids] +  # Directly use data from selected simulations
        [X_train[i][var].sel(time=slice(len_historical, None)).data for i in selected_simulations_ids]  # Use post-historical data
    )

    print((array.mean(), array.std()))  # Display computed mean and standard deviation
    meanstd_inputs[var] = (array.mean(), array.std())  # Store computed values in dictionary


# **Step 2: Normalize the input dataset using computed statistics**
X_train_norm = []  # List to store normalized training data

for i, train_xr in enumerate(X_train):  # Iterate over each training sample
    for var in selected_climate_input_vars:  # Process each selected climate variable
        var_dims = train_xr[var].dims  # Retrieve variable's dimension structure (e.g., time, lat, lon)

        # Apply normalization and assign the transformed values back to the dataset
        train_xr = train_xr.assign({var: (var_dims, normalize(train_xr[var].data, var, meanstd_inputs))})

    X_train_norm.append(train_xr)  # Append the normalized dataset to the list
(dask.array<mean_agg-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>, dask.array<_sqrt, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>)
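As a standalone sanity check of the round-trip property of `normalize`/`unnormalize`, the same formulas can be applied to a short synthetic series (the CO₂-like values below are invented for illustration):

```python
import numpy as np

# Hypothetical CO2-like values, chosen only for illustration
co2 = np.array([280.0, 300.0, 350.0, 400.0, 420.0])
meanstd = {"CO2": (co2.mean(), co2.std())}

# Z-score normalization: (X - mean) / std
normalized = (co2 - meanstd["CO2"][0]) / meanstd["CO2"][1]

# Inverse transform: X = X_normalized * std + mean
restored = normalized * meanstd["CO2"][1] + meanstd["CO2"][0]

print(normalized.mean(), normalized.std())  # ~0.0 and 1.0
assert np.allclose(restored, co2)  # unnormalize exactly inverts normalize
```

Note that in the cell above, the printed mean and standard deviation appear as lazy dask arrays because the underlying data has not been computed yet; the normalization still works because dask evaluates the expressions when needed.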

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_3")

Section 4: Reshaping Data for the CNN-LSTM Model#

Deep learning models require input data in a structured format to efficiently learn temporal dependencies.
Here we will reshape our time-series data using a sliding window approach.

Section 4.1 The Sliding Window Approach#

The CNN-LSTM model captures temporal dependencies by processing overlapping sequences of data.
Each sequence consists of 10 consecutive years (as defined by slider = 10).
The model then predicts the next time step based on the previous sequence.

Example: Understanding the Sliding Window Mechanism

Let’s say we have a time series dataset:

| Year | Feature Value |
| --- | --- |
| 2000 | 2.5 |
| 2001 | 2.8 |
| 2002 | 3.1 |
| 2003 | 3.4 |
| 2004 | 3.7 |
| 2005 | 4.0 |

With a window size of 3, the sequences formed are:

| Input Sequence | Target Value |
| --- | --- |
| [2.5, 2.8, 3.1] | 3.4 |
| [2.8, 3.1, 3.4] | 3.7 |
| [3.1, 3.4, 3.7] | 4.0 |

This transformation allows the model to learn from past values and predict future trends.
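The example above can be reproduced in a few lines of NumPy (window size 3, next-step targets):

```python
import numpy as np

series = np.array([2.5, 2.8, 3.1, 3.4, 3.7, 4.0])
window = 3

# Overlapping input sequences and the value that follows each one
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

print(X)
# [[2.5 2.8 3.1]
#  [2.8 3.1 3.4]
#  [3.1 3.4 3.7]]
print(y)  # [3.4 3.7 4. ]
```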


Section 4.2 Implementing the Sliding Window Transformation#

Below, we define two key functions:

  • input_for_training → Reshapes the input dataset into overlapping sequences.

  • output_for_training → Extracts the corresponding target values.

Now, let’s implement the reshaping process:

# **4. Reshape Data to Feed into the Model**
slider = 10  # Defines the sliding window size (10 years of past data used for prediction)
# Function to reshape input data for training
def input_for_training(X_train_xr, skip_historical=False, len_historical=None):
    """
    Reshapes input climate data into a time-series format suitable for CNN-LSTM.

    - Converts xarray dataset to NumPy and rearranges axes to (time, lat, lon, variable).
    - Uses a sliding window approach to create overlapping sequences.
    - Can skip historical data based on the `skip_historical` flag.

    Parameters:
        X_train_xr: xarray dataset containing input climate variables.
        skip_historical: Whether to exclude historical data and start from the first future scenario point.
        len_historical: Number of historical time steps, required if `skip_historical=True`.

    Returns:
        5D NumPy array of shape (num_samples, slider, lat, lon, variables)
    """
    X_train_np = X_train_xr.to_array().transpose('time', 'latitude', 'longitude', 'variable').data  # Convert xarray to NumPy

    time_length = X_train_np.shape[0]  # Total time steps in dataset

    if skip_historical:
        # Create sequences starting from the first non-historical point
        X_train_to_return = np.array(
            [X_train_np[i:i + slider] for i in range(len_historical - slider + 1, time_length - slider + 1)]
        )
    else:
        # Create sequences spanning both historical and future scenario data
        X_train_to_return = np.array(
            [X_train_np[i:i + slider] for i in range(0, time_length - slider + 1)]
        )

    return X_train_to_return  # Return reshaped input sequences


# Function to reshape output data (target variable) for training
def output_for_training(Y_train_xr, var, skip_historical=False, len_historical=None):
    """
    Reshapes the target variable into a time-series format, aligned with input sequences.

    - Extracts the target variable as a NumPy array.
    - Uses the sliding window approach to create target values corresponding to each input sequence.
    - Can skip historical data based on the `skip_historical` flag.

    Parameters:
        Y_train_xr: xarray dataset containing target climate variable.
        var: The specific variable to predict (e.g., temperature, CO₂ levels).
        skip_historical: Whether to exclude historical data and start from the first future scenario point.
        len_historical: Number of historical time steps, required if `skip_historical=True`.

    Returns:
        4D NumPy array of shape (num_samples, 1, lat, lon), where each entry is the target field for a sequence.
    """
    Y_train_np = Y_train_xr[var].data  # Extract the target variable

    time_length = Y_train_np.shape[0]  # Total time steps in dataset

    if skip_historical:
        # Extract the last time step in each sequence as the target value, starting from the future scenario
        Y_train_to_return = np.array(
            [[Y_train_np[i + slider - 1]] for i in range(len_historical - slider + 1, time_length - slider + 1)]
        )
    else:
        # Extract the last time step in each sequence as the target value, covering historical + future data
        Y_train_to_return = np.array(
            [[Y_train_np[i + slider - 1]] for i in range(0, time_length - slider + 1)]
        )

    return Y_train_to_return  # Return reshaped target values
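To verify the output shapes of this reshaping without loading real data, the same sliding-window logic can be run on a small synthetic array (toy dimensions, random values):

```python
import numpy as np

slider = 10  # window size, as in the tutorial
time_length, lat, lon, n_vars = 30, 4, 6, 2  # toy dimensions for the sketch

X_np = np.random.rand(time_length, lat, lon, n_vars)
Y_np = np.random.rand(time_length, lat, lon)

# Same windowing as input_for_training / output_for_training
X_seq = np.array([X_np[i:i + slider] for i in range(time_length - slider + 1)])
Y_seq = np.array([[Y_np[i + slider - 1]] for i in range(time_length - slider + 1)])

print(X_seq.shape)  # (21, 10, 4, 6, 2): (num_samples, slider, lat, lon, variables)
print(Y_seq.shape)  # (21, 1, 4, 6): (num_samples, 1, lat, lon)
```

With 30 time steps and a window of 10, exactly 30 − 10 + 1 = 21 overlapping sequences are produced, which is the same arithmetic that yields 766 sequences from the full dataset.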

Expected Output: Why Is This Important?

After reshaping, the dataset will have:

  • A 5D input structure → (num_samples, 10, lat, lon, features)

  • A 4D target structure → (num_samples, 1, lat, lon)

This ensures: ✅ The model learns from past trends effectively.
✅ It preserves spatial and temporal dependencies within the dataset.
✅ The CNN-LSTM network can process structured time-series data for accurate climate predictions.

Key Takeaways

🔹 Without proper formatting, deep learning models cannot learn time dependencies effectively.
🔹 The sliding window technique ensures each sequence captures meaningful temporal patterns.
🔹 Our data is now ready for training! 🎯

🚀 Next Steps: Data Preparation
To ensure stable training, it’s essential to structure the dataset correctly. This involves careful batching to maintain the expected input shape, normalization to prevent unstable gradients, and reshaping to align outputs properly for meaningful predictions.

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_4")

Section 5: Preparing the Training Data#

Why Does Data Preparation Matter in Deep Learning?

Unlike traditional machine learning, where data is often structured in simple tabular form, deep learning models—especially those handling spatiotemporal data—require a more complex input structure. ClimateBench provides high-dimensional climate data with both spatial (latitude, longitude) and temporal (years) components. To effectively train our CNN-LSTM model, we must carefully organize and format this data.

How Does PyTorch Handle This?

In PyTorch, deep learning models work with tensors, multi-dimensional arrays optimized for GPU computations. For models like CNN-LSTM, the order and structure of these tensors are critical. The model expects data in a specific format: (batch_size, timesteps, channels, height, width). This structure ensures that:

  • CNN layers process spatial patterns independently for each time step.

  • LSTM layers capture long-term dependencies over time.

  • The model receives inputs in a way that aligns with its internal computations.

What Needs to Be Done?

To get our dataset ready for training, we must:

  • 1️⃣ Concatenate data across multiple climate scenarios → Since our dataset comes from different emissions scenarios, we need to merge them into a single dataset.

  • 2️⃣ Convert NumPy arrays to PyTorch tensors → This allows efficient tensor operations and enables GPU acceleration.

  • 3️⃣ Rearrange dimensions → PyTorch tensors follow a specific order, so we use permute() to align the input format with what our CNN-LSTM expects.

By preparing our data correctly, we ensure smooth training and avoid shape mismatches that could cause errors.

🔜 Next: Implementing these preprocessing steps in code.

# Section 5: Preparing Training Data for CNN-LSTM Model

# Define the target climate variable (surface air temperature)
var_to_predict = 'tas'

if len(scenario_selected) == 1 and 'hist' in scenario_selected[0]:
    skip_flags = [False]  # don't skip
else:
    skip_flags = [(i < 2) for i in range(len(scenario_selected))]

# Step 1: Concatenate input data across all climate scenarios
# - Climate data comes from multiple emission scenarios (scenario_selected list)
# - Historical data is skipped for early simulations (i < 2)
# - `len_historical` ensures proper alignment across different scenarios
X_train_all = np.concatenate([
    input_for_training(X_train_norm[i],
                       skip_historical=skip_flags[i],
                       len_historical=len_historical)
    for i in range(len(scenario_selected))], axis=0)

# Step 2: Concatenate output data (target variable: TAS)
# - The same logic is applied to ensure consistent alignment
Y_train_all = np.concatenate([
    output_for_training(Y_train[i], var_to_predict,
                        skip_historical=skip_flags[i],
                        len_historical=len_historical)
    for i in range(len(scenario_selected))], axis=0)

# Verify dataset shapes before conversion
print("Input shape:", X_train_all.shape)   # Expected with all scenarios selected: (726, 10, 96, 144, 4)
print("Output shape:", Y_train_all.shape)  # Expected with all scenarios selected: (726, 1, 96, 144)

# Step 3: Convert input dataset to PyTorch tensor & reorder dimensions
# - `.permute(0, 1, 4, 2, 3)` rearranges axes to match (batch, timesteps, channels, height, width)
X_train_torch = torch.tensor(X_train_all, dtype=torch.float32).permute(0, 1, 4, 2, 3)

# Step 4: Convert output dataset to PyTorch tensor
Y_train_torch = torch.tensor(Y_train_all, dtype=torch.float32)

# Final shape verification after conversion
print("PyTorch Input Shape:", X_train_torch.shape)   # Expected with all scenarios selected: torch.Size([726, 10, 4, 96, 144])
print("PyTorch Output Shape:", Y_train_torch.shape)  # Expected with all scenarios selected: torch.Size([726, 1, 96, 144])
Input shape: (156, 10, 96, 144, 4)
Output shape: (156, 1, 96, 144)
PyTorch Input Shape: torch.Size([156, 10, 4, 96, 144])
PyTorch Output Shape: torch.Size([156, 1, 96, 144])

The above code prepares climate data for training a CNN-LSTM model by structuring it into PyTorch tensors. First, it concatenates climate data from multiple emission scenarios to ensure a unified dataset. Next, it converts NumPy arrays to PyTorch tensors, enabling GPU acceleration. Finally, it rearranges tensor dimensions using .permute() to match the model’s expected format: (batch_size, timesteps, channels, height, width). These steps ensure the data is properly formatted for deep learning, preventing shape mismatches during training.
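The `.permute()` step can be checked in isolation on a small random tensor (a toy batch of 2 samples):

```python
import torch

# Toy batch shaped (num_samples, timesteps, lat, lon, variables)
x = torch.rand(2, 10, 96, 144, 4)

# Reorder to (batch, timesteps, channels, height, width) for the CNN-LSTM
x_perm = x.permute(0, 1, 4, 2, 3)
print(x_perm.shape)  # torch.Size([2, 10, 4, 96, 144])

# permute only reorders axes; the underlying values are unchanged
assert torch.equal(x[0, 0, :, :, 0], x_perm[0, 0, 0, :, :])
```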

X_train_torch[0][0][0][0][0]
tensor(-0.7965)
Recap: Understanding the ClimateBench Dataset Shape for Deep Learning

The ClimateBench dataset originates from advanced Earth System Model simulations (NorESM2) and captures spatiotemporal climate patterns across different emission scenarios.

Key Differences from Traditional ML Approaches

  • ML Approaches (Tutorial 1 & 2): Used globally averaged emission data.

  • Deep Learning (This Tutorial): Maintains full-resolution climate grids, preserving spatial variations.


Dataset Structure

  • Input Variables: CO₂, CH₄, SO₂, and Black Carbon.

  • Target Variable: Surface Air Temperature (TAS).

  • Time Dimension: Each input contains 10 years of sequential data.

  • Spatial Grid: 96 × 144 (latitude × longitude).


5D Tensor Representation Unlike simpler tabular ML datasets, our input is a 5D tensor:

  • (726, 10, 96, 144, 4), where:

    • 726 → Sequences from multiple climate simulations.

    • 10 → Temporal length (years per sequence).

    • 96 × 144 → Spatial resolution.

    • 4 → Climate input variables.

This setup allows the model to learn spatiotemporal climate dynamics, capturing both spatial and temporal dependencies for robust predictions.

Now that we have prepared the training data, we need to define the model, loss function, and optimizer.

Section 5.1: Interactive Climate Data Explorer#

This widget allows you to explore climate input data dynamically. You can select a specific sample sequence and adjust the time step (0–9) to see how the data evolves. Toggle between different climate variables—CO₂, CH₄, SO₂, and Black Carbon—to compare their spatial distributions on a 96 × 144 grid. The visualization updates instantly, providing an intuitive way to analyze spatiotemporal patterns without rerunning code. 🚀

Make sure you execute this cell to enable the widget!

Hide code cell source
# @markdown Make sure you execute this cell to enable the widget!

# Enable the widget
interact(compare_inputs_outputs,
         sample_idx=IntSlider(min=0, max=5, step=1, value=0, description="Sample"),
         time_step=IntSlider(min=0, max=9, step=1, value=0, description="Time Step"));

Note: CO₂ and CH₄ are global values with no spatial variation, so we omit them from the plots for clarity.

Section 5.2: Spatiotemporal Climate Data Explorer#

This widget visualizes the evolution of CO₂, CH₄, SO₂, and Black Carbon over 10 years across a 96×144 global grid.

  • Use the slider to navigate through time.

  • Color scale: Red = High, Blue = Low.

  • Observe spatial patterns: Identify emission hotspots, variations, and trends.

This helps understand how climate variables evolve over space and time, crucial for predictive modeling. 🚀

Hide code cell source
# @markdown Make sure you execute this cell to enable the widget!<br>
# @markdown * Use the control buttons (`<`, `<<`, `>`, `>>`) to enable the animation
animate_climate_variables(X_train_torch, sample_idx=0, scale_mode='auto')  # May toggle to scale_mode='log'

Interpreting the Animation#

This animation displays the spatial distribution of four climate variables (CO₂, CH₄, SO₂, Black Carbon) side by side over a 10-year sequence. Each panel corresponds to one variable, and the colors represent its relative concentration across the globe for each year. As the animation progresses, you can observe how the spatial patterns of each variable evolve over time.

How to use the animation#

Use the control buttons (`<`, `<<`, `>`, `>>`) to start the animation.

Simply watch the animation. It automatically cycles through the 10-year period, updating all four panels simultaneously. Focus on how patterns change across years and compare differences between variables. No manual interaction is required.

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_5")

Section 6: Defining the CNN-LSTM Model Architecture#

🚀 Next Step: Moving to Defining the CNN-LSTM Model Architecture!

The Challenge: Modeling Spatiotemporal Climate Data

Climate data exhibits both spatial patterns (e.g., temperature distributions) and temporal dependencies (e.g., seasonal trends). Standard deep learning models struggle to effectively capture both aspects.

To overcome this, we use a hybrid CNN-LSTM architecture, which is specially designed to process spatiotemporal data.


Section 6.1 Understanding the CNN-LSTM Hybrid Model#

The CNN-LSTM model integrates two powerful deep learning components:

  1. Convolutional Neural Networks (CNNs): Extracting Spatial Features

  • CNNs are highly effective at recognizing spatial structures in images.

  • Here, CNNs are applied independently at each time step to extract spatial patterns from climate data.

  • This ensures that each time step has a structured representation before moving forward in the model.

  2. Long Short-Term Memory Networks (LSTMs): Capturing Temporal Dependencies

  • LSTMs are specialized for handling sequential data, such as time series.

  • After spatial features are extracted by CNNs, the LSTM learns how these features evolve over time.

  • This enables the model to understand long-term climate trends.


Model Workflow: Step-by-Step
1️⃣ Input Data: Climate data is structured as a sequence of frames, where each frame represents a climate variable snapshot.
2️⃣ CNN Feature Extraction: A CNN is applied to each time step to extract spatial features.
3️⃣ Time-Distributed Processing: The CNN operates in a Time Distributed manner, ensuring spatial features remain distinct at each time step.
4️⃣ LSTM Sequence Learning: The extracted spatial features are passed into an LSTM, which captures temporal dependencies.
5️⃣ Prediction Output: The model predicts climate variables for the next time step, enabling forecasting.


Why Use Time-Distributed Layers?#

🔹 Without a Time-Distributed Layer, CNNs would treat the entire sequence as a single image, losing temporal information.
🔹 With Time-Distributed CNN, spatial features are extracted independently at each time step before entering the LSTM.
🔹 This preserves both spatial and temporal relationships, leading to better climate forecasting accuracy.
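The reshaping trick behind a Time-Distributed layer can be sketched in a few lines. The shapes below mirror this tutorial's data; folding time into the batch axis and unfolding it afterwards is exactly what the `TimeDistributed` wrapper defined later automates:

```python
import torch
import torch.nn as nn

# One Conv2d applied "time-distributed": fold time into the batch axis,
# convolve every frame independently, then restore the time dimension.
conv = nn.Conv2d(in_channels=4, out_channels=20, kernel_size=3, padding=1)

x = torch.randn(2, 10, 4, 96, 144)        # (batch, time, C, H, W)
b, t = x.shape[:2]
y = conv(x.reshape(b * t, 4, 96, 144))    # treat each timestep as its own image
y = y.reshape(b, t, *y.shape[1:])         # unfold back to (batch, time, ...)

print(y.shape)  # torch.Size([2, 10, 20, 96, 144])
```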

📖 Further Reading:
👉 Understanding Time-Distributed Layers in Deep Learning


Section 6.2 Implementing the CNN-LSTM Model in PyTorch#

Before diving into the code, let’s understand its structure and the tools we will use.

What Are We Doing?

We are implementing the CNN-LSTM architecture using PyTorch, which will:

  • Extract spatial features from climate data using CNN layers.

  • Preserve time-step independence with Time-Distributed layers.

  • Model sequential dependencies using an LSTM.

  • Output a structured climate forecast.

Why Use PyTorch?

PyTorch provides:

  • Flexibility: Easy debugging and dynamic computation graphs.

  • Modular Design: Enables clean implementation of CNN and LSTM layers.

  • Efficient Computation: Optimized tensor operations for large climate datasets.

How Will We Implement This in Code?

We will define a PyTorch model that:

  • Uses nn.Conv2d for spatial feature extraction.

  • Applies Time-Distributed processing using reshaping techniques.

  • Employs nn.LSTM for capturing temporal dependencies.

  • Outputs climate forecasts using nn.Linear and a custom reshape layer (PyTorch has no built-in nn.Reshape).

Recap: Key PyTorch Functions and Methods Used in the Next Code Cell

| Component       | PyTorch Function  | Purpose                        |
|-----------------|-------------------|--------------------------------|
| CNN Layer       | `nn.Conv2d()`     | Extracts spatial features      |
| Pooling Layer   | `nn.AvgPool2d()`  | Reduces spatial dimensions     |
| Flattening      | `torch.flatten()` | Prepares data for LSTM input   |
| LSTM Layer      | `nn.LSTM()`       | Captures temporal dependencies |
| Fully Connected | `nn.Linear()`     | Outputs predictions            |
| Activation      | `nn.ReLU()`       | Introduces non-linearity       |

With this understanding, let’s now define the model!

class TimeDistributed(nn.Module):
    """
    A wrapper to apply a given module independently to each time step.
    This is useful for processing sequential climate data with CNN layers.
    """

    def __init__(self, module):
        """
        Initializes the TimeDistributed wrapper.

        Args:
        module (nn.Module): The neural network layer (e.g., Conv2D, Pooling)
                            to be applied across time steps.
        """
        super(TimeDistributed, self).__init__()
        self.module = module  # Store the provided module

    def forward(self, x):
        """
        Forward pass that applies the module independently across time steps.

        Args:
        x (tensor): Input of shape (batch_size, time_steps, channels, height, width).

        Returns:
        tensor: Output tensor with the same time dimension preserved.
        """
        batch_size, time_steps = x.size(0), x.size(1)  # Extract batch and time steps
        input_size = tuple(x.size()[2:])  # Get spatial dimensions (C, H, W)

        # Flatten time steps into the batch dimension for independent processing
        x = x.contiguous().view(batch_size * time_steps, *input_size)

        # Apply the module (e.g., CNN) separately to each time step
        x = self.module(x)

        # Restore original (batch, time, ...) structure after processing
        output_size = tuple(x.size()[1:])  # Extract new feature dimensions
        return x.view(batch_size, time_steps, *output_size)
Breaking Down the Code

🔍 Understanding the TimeDistributed Wrapper

Neural networks process data in a structured way, and PyTorch provides tools to make this easier. In sequence-based tasks, like climate forecasting, we often need to apply the same neural network layer to each time step separately. This is where the TimeDistributed wrapper comes in.

Normally, CNNs process entire images at once, but in time-series tasks, we need CNNs to extract features from each timestep independently before passing them to an LSTM. The TimeDistributed class helps achieve this by automatically applying CNNs or other layers to each time step separately while keeping the time structure intact.

Step-by-Step Breakdown

🔹 Defining the Class (__init__())

  • In Python, a class is like a blueprint for creating reusable components.

  • This class inherits from nn.Module, which is a standard PyTorch way to define neural network layers.

  • The TimeDistributed wrapper takes a PyTorch layer (like Conv2D) as input and ensures that it is applied individually to each time step in a sequence.

🔹 How Data Moves Through forward()
1️⃣ The batch size and number of time steps are extracted from the input tensor.
2️⃣ The input is reshaped so that all timesteps are treated as separate images.
3️⃣ The selected PyTorch module (e.g., CNN) is applied to the reshaped data.
4️⃣ The output is then reshaped back into its original format, keeping time steps intact.

Key Benefit: This method allows CNNs to extract spatial features without losing the sequence structure, making it ready for LSTM processing.

By combining CNNs (for spatial feature extraction) and LSTMs (for temporal learning), we ensure that the model learns both where and how climate patterns change over time.

Section 6.2.1. Building the CNN-LSTM Model#

Now, we define the full CNN-LSTM model, integrating:

  • TimeDistributed CNN to extract spatial features per timestep.

  • LSTM to learn temporal patterns.

  • Fully Connected Layer to predict climate variables.

Key Components in The following Code cell:

  • TimeDistributed: Applies CNN layers independently to each timestep.

  • nn.LSTM: Captures long-term dependencies across timesteps.

  • nn.Linear: Maps LSTM outputs to climate predictions.

  • Reshaping techniques: Ensures correct data format before feeding into layers.

Section 6.2.2 Helper Modules: TimeDistributed and ReshapeLayer#

We have already defined the TimeDistributed wrapper, which ensures the CNN processes each timestep separately. Let's recap why it matters, then define one more helper: a ReshapeLayer for the model output.

Understanding the `TimeDistributed` Wrapper

When working with climate data, each time step contains spatial information that must be processed separately before considering temporal dependencies. A CNN alone does not recognize time steps: it treats inputs as independent images. We need a way to apply CNN layers individually at each time step before passing the output to an LSTM.


🏗 Why Do We Need TimeDistributed?

❌ Without TimeDistributed:

  • A CNN would see all time steps as one large image, ignoring the temporal structure.

✅ With TimeDistributed:

  • The CNN applies its filters individually to each time step.

  • This ensures the spatial features are extracted separately before temporal modeling.


Let's now define the remaining helper! 🚀

# --- Define helper modules ---
# (TimeDistributed is defined above; here we add a reshape helper for the output.)
class ReshapeLayer(nn.Module):
    """
    Custom PyTorch layer to reshape a tensor into a structured spatial format.
    Ensures the model outputs a climate prediction grid of shape (batch, 1, height, width).
    """
    def __init__(self, shape):
        """
        Initializes the reshape layer.

        Args:
        shape (tuple): The target shape (excluding batch size), e.g., (1, 96, 144).
        """
        super().__init__()
        self.shape = shape  # Store the target shape for reshaping

    def forward(self, x):
        """
        Defines the forward pass for reshaping.

        Args:
        x (tensor): The input tensor of shape (batch, features).

        Returns:
        tensor: Reshaped tensor with dimensions (batch, 1, height, width).
        """
        return x.view(-1, *self.shape)  # Reshape while preserving batch size
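A quick sketch of what ReshapeLayer does under the hood: `view` reinterprets the flat fully connected output as a spatial grid without copying any data.

```python
import torch

flat = torch.randn(3, 96 * 144)     # (batch, features) as produced by the FC layer
grid = flat.view(-1, 1, 96, 144)    # same data, reinterpreted as (batch, 1, H, W)

print(grid.shape)  # torch.Size([3, 1, 96, 144])
```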

Coding Exercise 6.2.2: Implementing the forward pass of the CNN_LSTM model for spatiotemporal climate forecasting#

In this exercise, you’ll implement the complete forward method of the CNN_LSTM model for spatiotemporal climate forecasting.

This model:

  • Applies CNN layers independently to each timestep using TimeDistributed wrappers.

  • Uses an LSTM to capture temporal dependencies from the CNN-extracted spatial features.

  • Applies a fully connected (linear) layer to project the LSTM output to a spatial climate grid.

  • Reshapes the output to a 2D spatial prediction map of shape (1, 96, 144) per batch.

You will use the following layers in the same order:

  1. self.time_distributed_conv - for convolution over time-distributed spatial input.

  2. self.time_distributed_pool - for reducing spatial resolution.

  3. self.time_distributed_global_pool - for compressing feature maps to size (1,1).

  4. squeeze - to remove redundant spatial dimensions after pooling.

  5. self.lstm - to process the sequence of CNN features.

  6. self.fc - to map to the flattened spatial map (96x144).

  7. self.reshape - to reshape the output to (batch, 1, 96, 144).

📌 Input shape: (batch, timesteps, channels, height, width)
📌 Expected output shape: (batch, 1, 96, 144)

# --- Define the CNN-LSTM Model ---
class CNN_LSTM(nn.Module):
    """
    CNN-LSTM Model for spatiotemporal climate forecasting.

    - CNN (TimeDistributed) extracts spatial features per timestep.
    - LSTM captures temporal dependencies from CNN-extracted features.
    - Fully Connected Layer maps LSTM output to a climate prediction grid.
    """

    def __init__(self):
        """Initialize CNN-LSTM model layers."""
        super(CNN_LSTM, self).__init__()

        # CNN feature extraction applied to each timestep independently
        self.time_distributed_conv = TimeDistributed(nn.Conv2d(in_channels=4, out_channels=20, kernel_size=3, padding=1))
        self.time_distributed_pool = TimeDistributed(nn.AvgPool2d(2))  # Reduces spatial size
        self.time_distributed_global_pool = TimeDistributed(nn.AdaptiveAvgPool2d((1, 1)))  # Compresses feature maps

        # LSTM processes extracted spatial features across time
        self.lstm = nn.LSTM(input_size=20, hidden_size=25, batch_first=True)

        # Fully connected layer maps LSTM outputs to flattened spatial grid (96x144)
        self.fc = nn.Linear(25, 96 * 144)

        # Reshape layer formats output to match spatial map dimensions
        self.reshape = ReshapeLayer((1, 96, 144))

     # Specify the computations performed on the data
    def forward(self, x):
        """
        Defines forward pass for data flow through the CNN-LSTM model.

        Args:
          x (tensor): Input tensor of shape (batch, timesteps, channels, height, width)

        Returns:
          tensor: Output prediction map of shape (batch, 1, 96, 144)
        """
        batch_size, timesteps, C, H, W = x.size()

        #################################################
        ## TODO for students: complete the forward pass ##
        # 1. Apply TimeDistributed Conv2D layer to extract features from each timestep.
        # 2. Reduce the spatial dimensions using average pooling.
        # 3. Apply global average pooling to compress each feature map to size (1,1).
        # 4. Squeeze out the singleton spatial dimensions.
        # 5. Pass the resulting sequence of features to the LSTM.
        # 6. Use the last output of the LSTM.
        # 7. Project to flattened grid using the fully connected layer.
        # 8. Reshape the output to (batch, 1, 96, 144) spatial format.
        raise NotImplementedError("Student exercise: complete the CNN-LSTM forward pass")
        #################################################

        # Apply convolution over each timestep
        x = self.time_distributed_conv(...)  # <-- Replace ... with input tensor

        # Reduce spatial resolution
        x = self.time_distributed_pool(...)  # <-- Pass the result of previous layer

        # Compress feature maps to (1,1) spatial dimensions
        x = self.time_distributed_global_pool(...)  # <-- Pass the pooled output

        # Remove (1,1) spatial dimensions
        x = x.squeeze(-1).squeeze(-1)  # <-- Output now (batch, timesteps, features)

        # Process the sequence with LSTM
        r_out, (h_n, c_n) = self.lstm(...)  # <-- Feed squeezed output to LSTM

        # Use the final LSTM time step
        r_out = r_out[:, -1, :]  # <-- Take last timestep’s output

        # Map to spatial output
        x = self.fc(...)  # <-- Apply FC to final LSTM output

        # Reshape to final map
        x = self.reshape(...)  # <-- Reshape to (batch, 1, 96, 144)

        return x

Click for solution

# test your solution

# Test the output shape
model = CNN_LSTM()  # Instantiate Model
# Generate random test input of shape (batch=2, timesteps=5, channels=4, height=32, width=48)
x_test = torch.randn(2, 5, 4, 32, 48)
output = model(x_test) # Run forward pass
# Check output shape
print("Output shape:", output.shape)  # Should print: torch.Size([2, 1, 96, 144])
Breaking Down the Code

🔍 Understanding the CNN_LSTM Model

Deep learning models in PyTorch are built using classes that inherit from nn.Module, which acts as a blueprint for defining and organizing model components. PyTorch provides a special method called forward(), which specifies how input data flows through the network. Instead of manually applying layers step by step, forward() ensures that the right computations happen in the correct order automatically.

Step-by-Step Explanation of the Model

🔹 Defining the Model (__init__())

  • We use TimeDistributed CNN layers to process each timestep independently. CNNs help extract spatial patterns, such as temperature or pressure variations in climate data.

  • Pooling layers (AvgPool2d, AdaptiveAvgPool2d) reduce the size of feature maps while keeping essential information, making the model efficient.

  • The LSTM (nn.LSTM) takes these features over time and learns how climate patterns evolve.

  • Finally, a fully connected layer (Linear) converts the LSTM outputs into a climate forecast grid for prediction.

🔹 How Data Flows Through the Model (forward())
1️⃣ The input first moves through the CNN layers inside TimeDistributed, ensuring that each timestep is processed separately.
2️⃣ The CNN output is flattened, meaning unnecessary spatial dimensions are removed, keeping only the most important information.
3️⃣ This processed data is then passed to the LSTM, which learns how the extracted climate features change over time.
4️⃣ The last LSTM output is selected because it contains the most useful information for prediction.
5️⃣ Finally, the fully connected layer generates a numerical output, which is reshaped into a spatial climate map—a structured prediction of climate variables.

By combining CNNs (for spatial feature extraction) and LSTMs (for tracking changes over time), this model effectively learns both where and how climate patterns evolve.
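As a sanity check, the shape transformations described above can be traced step by step with freshly constructed layers. The layer sizes match the model in this tutorial; the small 32×48 input grid is arbitrary, since the global pooling makes the output size independent of it:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 5, 4, 32, 48)                   # (batch, time, C, H, W)
b, t = x.shape[:2]

h = nn.Conv2d(4, 20, 3, padding=1)(x.reshape(b * t, 4, 32, 48))  # (10, 20, 32, 48)
h = nn.AvgPool2d(2)(h)                             # (10, 20, 16, 24)
h = nn.AdaptiveAvgPool2d((1, 1))(h)                # (10, 20, 1, 1)
h = h.reshape(b, t, 20)                            # back to (batch, time, features)

r_out, _ = nn.LSTM(20, 25, batch_first=True)(h)    # (2, 5, 25)
pred = nn.Linear(25, 96 * 144)(r_out[:, -1, :])    # last timestep -> flat grid
pred = pred.view(-1, 1, 96, 144)

print(pred.shape)  # torch.Size([2, 1, 96, 144])
```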

Section 6.2.4 Final Step: Instantiating and Checking Model#

Before training, let’s instantiate our model and check its structure.

# Instantiate Model
model = CNN_LSTM()

# Print Model Summary
print(model)

🔎 What do you observe in the architecture?

The model follows a structured approach where CNN layers process spatial features independently for each time step before LSTMs capture temporal relationships. Notably, the use of TimeDistributed ensures that convolutional layers operate consistently across time, maintaining spatial integrity. The final fully connected layer acts as a bridge between the sequential features learned by the LSTM and the structured climate prediction output.

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_6")

Section 7: Defining the Model, Loss Function, and Optimizer#

With the training data ready, we now define the core components needed for model training:

  • The Model: A CNN-LSTM hybrid to learn spatiotemporal climate patterns.

  • The Loss Function: Measures how well the model predicts climate variables.

  • The Optimizer: Updates model parameters to minimize error.


Section 7.1 Defining the Model#

We initialize an instance of CNN_LSTM, which consists of:

  1. CNN Layers (TimeDistributed Conv2D): Extracts spatial features from climate data at each time step.

  2. LSTM: Captures long-term dependencies across time steps.

  3. Fully Connected Layer: Maps LSTM outputs to the climate variable predictions.


Section 7.2 Defining the Loss Function (Mean Squared Error)#

We use Mean Squared Error (MSE) as our loss function:

\[ \mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_{\text{pred}, i} - y_{\text{true}, i})^2 \]

where:

  • \( N \) is the number of grid points.

  • \( y_{\text{pred}} \) is the model’s predicted climate variable.

  • \( y_{\text{true}} \) is the ground truth.

  • The function penalizes larger errors more strongly, stabilizing model learning.
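The formula can be checked numerically against `nn.MSELoss()`; the tensors below are made-up toy values:

```python
import torch
import torch.nn as nn

y_pred = torch.tensor([1.0, 2.0, 3.0])
y_true = torch.tensor([1.5, 2.0, 2.0])

manual = ((y_pred - y_true) ** 2).mean()   # the MSE formula, written out
builtin = nn.MSELoss()(y_pred, y_true)     # PyTorch's implementation

print(manual.item())  # 0.4166... = (0.25 + 0.0 + 1.0) / 3
```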


Section 7.3 Defining the Optimizer (Adam Algorithm)#

The Adam optimizer updates model parameters by computing adaptive learning rates for each weight:

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \]
\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \]
\[ \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{v_t} + \epsilon} m_t \]

where:

  • \( g_t \) is the gradient at time \( t \).

  • \( m_t \) and \( v_t \) are moving averages of the gradients and squared gradients.

  • \( \beta_1 = 0.9 \) and \( \beta_2 = 0.999 \) control momentum.

  • \( \eta \) is the learning rate (set to 0.001).

  • Advantages: Adam adapts learning rates dynamically, making it robust for deep learning.
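A single update step following these equations can be traced on one scalar parameter. Note this follows the simplified form above (without Adam's bias-correction terms, which the full algorithm and `torch.optim.Adam` also apply); the gradient value is purely illustrative:

```python
beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8
m, v, theta = 0.0, 0.0, 1.0            # fresh moment estimates and a starting weight

g = 0.5                                # illustrative gradient g_t
m = beta1 * m + (1 - beta1) * g        # first-moment update
v = beta2 * v + (1 - beta2) * g ** 2   # second-moment update
theta = theta - eta / (v ** 0.5 + eps) * m

print(m, v, round(theta, 6))  # 0.05 0.00025 0.996838
```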

# 7. Define Model, Loss Function, and Optimizer

# Initialize the CNN-LSTM model
cnn_model = CNN_LSTM()
print(cnn_model)  # Display model architecture

# Define Mean Squared Error (MSE) Loss for regression-based climate prediction
criterion = nn.MSELoss()

# Adam optimizer with an adaptive learning rate (default β₁=0.9, β₂=0.999)
optimizer = torch.optim.Adam(cnn_model.parameters(), lr=0.001)

# Print key components
print(f"Loss Function: {criterion}")
print(f"Optimizer: {optimizer}")

We saw the CNN-LSTM model architecture, where a TimeDistributed CNN extracts spatial patterns, an LSTM captures temporal dependencies, and a fully connected layer reconstructs spatial grids.

Loss Function: MSE minimizes prediction errors by penalizing squared differences.
Optimizer: Adam ensures efficient training with adaptive learning rates:

With these components defined, we can now train the model and evaluate its performance on climate data.

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_7")

Section 8: Training Loop with Validation and Early Stopping#

Now that we have defined our model, optimizer, and loss function, we are ready to train the network. Training a deep learning model involves iterating over the dataset multiple times, adjusting the model’s parameters to minimize error. To ensure the model generalizes well to unseen data, we implement validation and early stopping during training.


Section 8.1 Understanding the Training Process#

Each deep learning model is trained using a loop that performs the following steps:

  1. Forward Pass – The model processes the input and makes predictions.

  2. Loss Computation – The difference between predictions and actual values is calculated using a loss function.

  3. Backward Pass (Backpropagation) – The gradients of the loss function with respect to the model’s parameters are computed.

  4. Parameter Update – The optimizer updates the model parameters using these gradients.

  5. Validation – After each training epoch, the model is evaluated on a separate validation set to check its generalization ability.

  6. Early Stopping – Training is stopped when the validation loss stops improving to prevent overfitting.
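These steps map one-to-one onto PyTorch calls. A minimal sketch on a toy linear model (a stand-in for the CNN-LSTM, with made-up data) shows the pattern:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)                      # toy stand-in for cnn_model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
x, y = torch.randn(32, 4), torch.randn(32, 1)

loss_before = criterion(model(x), y).item()
for _ in range(20):                          # a few full-batch training steps
    optimizer.zero_grad()                    # reset accumulated gradients
    loss = criterion(model(x), y)            # forward pass + loss computation
    loss.backward()                          # backpropagation
    optimizer.step()                         # parameter update
loss_after = criterion(model(x), y).item()

print(loss_after < loss_before)  # True: the loss decreases on this toy problem
```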


Section 8.2 Key Hyperparameters in Training#

  • Epochs: The number of times the model goes through the entire training dataset.

  • Batch Size: Number of samples processed before the model’s weights are updated.

  • Learning Rate: Determines how much the model’s weights are adjusted during each optimization step.

  • Patience: The number of epochs to wait before stopping if validation loss doesn’t improve.
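The patience mechanism can be sketched on its own with made-up validation losses:

```python
# Stop when the validation loss hasn't improved for `patience` consecutive epochs.
patience = 3
best_val_loss = float("inf")
epochs_no_improve = 0
val_losses = [1.00, 0.80, 0.81, 0.82, 0.83, 0.84]   # illustrative values
stopped_at = None

for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1
    if epochs_no_improve == patience:
        stopped_at = epoch
        break

print(stopped_at)  # 5: three epochs without improvement after the best (epoch 2)
```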


Section 8.3 Preparing for Training#

We first split the dataset into training and validation sets. This allows us to monitor the model’s performance on unseen data during training. We also set up the optimizer and early stopping parameters.

# Early stopping parameters
patience = 5
best_val_loss = float('inf')
epochs_no_improve = 0

# Split into training and validation sets
batch_size = 16
train_size = int(0.8 * len(X_train_torch))
val_size = len(X_train_torch) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(
    torch.utils.data.TensorDataset(X_train_torch, Y_train_torch),
    [train_size, val_size]
)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size)
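The split-and-load logic can be verified on a tiny synthetic dataset (the shapes here are arbitrary):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

X = torch.randn(10, 3)                # 10 tiny synthetic samples
Y = torch.randn(10, 1)
dataset = TensorDataset(X, Y)

train_ds, val_ds = random_split(dataset, [8, 2])   # 80/20 split
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True)

print(len(train_ds), len(val_ds), len(train_loader))  # 8 2 2
```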

Section 8.4 Training Loop with Early Stopping#

The training loop processes the dataset in mini-batches. The model learns by computing gradients and updating parameters. After each epoch, we evaluate the model on the validation set and apply early stopping.

The Training Loop

Now, let’s implement the core training loop. The loop performs the following steps:

  1. Set the model to training mode – This ensures that certain layers (like dropout) behave correctly.

  2. Iterate through the training batches – The model processes small subsets of data at a time.

  3. Compute predictions – The model generates output for each input batch.

  4. Calculate the loss – The difference between predicted and actual values is computed.

  5. Perform backpropagation – Gradients are computed and used to update the model’s weights.

  6. Track validation performance – After each epoch, the model is evaluated on the validation set.

  7. Apply early stopping – Training stops if validation loss does not improve for a predefined number of epochs.

Let’s implement this step by step.

Coding Exercise 8.4: Training a CNN with weight and gradient tracking#

Implement a PyTorch training loop for a CNN. Track the weights and gradients after every batch for analysis.

You are given: a model (cnn_model), loss function (criterion), optimizer, and data loaders. Your task:

  • Run forward pass on batches

  • Compute loss and backpropagate

  • Update model weights

  • Record weights and gradients each batch

  • Compute training and validation losses per epoch

  • Apply early stopping after patience epochs with no val improvement

Hint: Use model.train(), optimizer.zero_grad(), loss.backward(), and optimizer.step().

def train_model(cnn_model, train_loader, val_loader, optimizer, criterion, num_epochs, patience):
    """Train a CNN model and store weights/gradients per batch

    Args:
        cnn_model (nn.Module): the model to train
        train_loader (DataLoader): dataloader for training data
        val_loader (DataLoader): dataloader for validation data
        optimizer (torch.optim.Optimizer): optimizer
        criterion (nn.Module): loss function
        num_epochs (int): total number of training epochs
        patience (int): early stopping patience

    Returns:
        tuple: (train_losses, val_losses, weights_history, grads_history)
    """

    #################################################
    # TODO:Implement the training loop with weight/gradient tracking
    # 1. Forward pass using X_batch
    # 2. Compute loss between predictions and targets
    # 3. Backpropagate and update weights
    # 4. Track weights and gradients for each batch
    # Remove the following line of code once you have completed the exercise:
    raise NotImplementedError("Student exercise: implement training loop with weight/grad tracking")
    #################################################

    # Initialize trackers for losses and history
    train_losses = []
    val_losses = []
    weights_history = []
    grads_history = []

    best_val_loss = float("inf")
    epochs_no_improve = 0

    print("Training Started!")
    for epoch in range(num_epochs):
        cnn_model.train()  # Set model to training mode
        train_loss = 0.0

        for X_batch, Y_batch in train_loader:
            optimizer.zero_grad()  # Reset gradients
            Y_pred = cnn_model(...)  # Forward pass-X_batch
            loss = criterion(...,...)  # Compute loss between Y_pred  Y_batch
            loss.backward()  # Backpropagation
            optimizer.step()  # Update weights

            # Accumulate batch loss
            train_loss += loss.item()

            # Store weights and gradients for each layer
            weights_history.append([p.clone().detach().cpu().numpy() for p in cnn_model.parameters()])
            grads_history.append([p.grad.clone().detach().cpu().numpy() for p in cnn_model.parameters() if p.grad is not None])

        # Compute average training loss
        train_loss /= len(train_loader)
        train_losses.append(train_loss)

        # Evaluate on validation set
        cnn_model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for X_val, Y_val in val_loader:
                Y_val_pred = cnn_model(X_val)
                val_loss += criterion(Y_val_pred, Y_val).item()
        val_loss /= len(val_loader)
        val_losses.append(val_loss)

        # Early stopping logic
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
        if epochs_no_improve == patience:
            print(f"Early stopping at epoch {epoch+1}.")
            break

        print(f"Epoch {epoch+1} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")

    print("Training Complete!")
    return train_losses, val_losses, weights_history, grads_history

Click for solution

## Uncomment below to test your implementation
#train_losses, val_losses, weights_history, grads_history = train_model(
#     cnn_model, train_loader, val_loader, optimizer, criterion, num_epochs=1, patience=3
# )

⚠️ Note: Training will take some time depending on the number of epochs.
💡 Run the training and, in the meantime, explore the remaining sections of the tutorial.

# Call the training function
train_losses, val_losses, weights_history, grads_history = train_model(
    cnn_model, train_loader, val_loader, optimizer, criterion, num_epochs=1, patience=3
)  # Please change num_epochs to at least 30
# Increase the number of epochs/patience to improve performance, but note that training will take longer

Loss Trend#

Does the loss go down as training continues?📉

In the above code cell where train_model is called, set num_epochs = 30.

If the loss still does not decrease, one possible reason is:

  • The dataset may be too small

Before training, go to Section 2.1 – Data Retrieval and from the dropdown select:

  • All Climate Scenarios

  • All Input Variables

Then rerun the remaining notebook and watch the loss.
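Once training completes, plotting the loss histories makes the trend easy to read. A minimal sketch follows; the loss values below are hypothetical stand-ins for the `train_losses` and `val_losses` lists returned by `train_model`:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical per-epoch losses; replace with the lists from train_model(...)
train_losses = [1.20, 0.85, 0.62, 0.51, 0.47, 0.46]
val_losses = [1.10, 0.90, 0.70, 0.60, 0.58, 0.59]

epochs = range(1, len(train_losses) + 1)
best_epoch = int(np.argmin(val_losses)) + 1  # epoch with lowest validation loss

plt.plot(epochs, train_losses, label="Train")
plt.plot(epochs, val_losses, label="Validation")
plt.axvline(best_epoch, color="gray", linestyle="--",
            label=f"Best epoch ({best_epoch})")
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.legend()
plt.title("Loss curves")

print(best_epoch)  # here: 5
```

A healthy run shows both curves falling; validation loss flattening (or rising) while training loss keeps falling is a sign of overfitting.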

Explanation:

  • cnn_model.train(): This line sets the model to training mode. This is important because some layers (like dropout) behave differently during training and evaluation.

  • Inner loop: The inner loop iterates over the training data in batches.

  • optimizer.zero_grad(): This line clears the gradients from the previous iteration. PyTorch accumulates gradients, so we need to clear them before calculating the gradients for the current iteration.

  • outputs = cnn_model(inputs): This line performs the forward pass, feeding the input data through the model to obtain the predictions.

  • loss = criterion(outputs, targets): This line calculates the loss between the predictions and the true values.

  • loss.backward(): This line performs the backward pass, calculating the gradients of the loss function with respect to the model’s parameters.

  • optimizer.step(): This line updates the model’s parameters using the calculated gradients.

  • cnn_model.eval(): This line sets the model to evaluation mode. This is important because some layers (like dropout) behave differently during training and evaluation.

  • with torch.no_grad():: This block disables gradient calculation during validation. We don’t need to calculate gradients during validation, so disabling them saves memory and computation time.

  • Early Stopping: After each validation pass, the code checks whether the current validation loss is lower than the best validation loss seen so far. If it is, the best validation loss is updated and the epochs_no_improve counter is reset. If the validation loss doesn’t improve for patience consecutive epochs, the training loop stops early.
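The early-stopping bookkeeping in the last bullet can be isolated into a few lines. This sketch uses a hypothetical `val_losses` history in place of real per-epoch validation results:

```python
# Minimal early-stopping logic, isolated from the full training loop.
# val_losses is a hypothetical per-epoch validation-loss history.
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73]
patience = 3

best_val_loss = float("inf")
epochs_no_improve = 0
stopped_at = None

for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best_val_loss:
        best_val_loss = val_loss  # new best: remember it...
        epochs_no_improve = 0     # ...and reset the counter
    else:
        epochs_no_improve += 1    # no improvement this epoch
    if epochs_no_improve == patience:
        stopped_at = epoch        # patience exhausted: stop training
        break

print(stopped_at, best_val_loss)  # here: 6 0.7
```

Here the loss stops improving after epoch 3, so with `patience=3` training halts at epoch 6 while the best validation loss (0.7) is retained.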

Now that the model is trained, let’s check its predictions and visualize them to see how the model actually performs.

Run this cell to activate Interactive widget {“run”:“auto”,“vertical-output”:true,“display-mode”:“form”}#

Hide code cell source
# @title Run this cell to activate Interactive widget {"run":"auto","vertical-output":true,"display-mode":"form"}
# Interactive Widgets
print("Use the widgets below to visualize different aspects of training!")
interact(plot_weight_gradients, epoch=IntSlider(min=0, max=len(weights_history)-1, step=1, value=0, description="Epoch"));

Run this cell to activate Interactive widget {“run”:“auto”,“display-mode”:“form”}#

Sample Predictions Over Time (Slider)

Hide code cell source
# @title Run this cell to activate Interactive widget {"run":"auto","display-mode":"form"}
# @markdown Sample Predictions Over Time (Slider)
interact(plot_predictions1, epoch=IntSlider(min=0, max=9, step=1, value=0, description="Sample"));

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_8")

Section 9: Making Predictions and Visualizing Results#

Now that we have trained our model, let’s use it to make predictions and visualize the results.

Section 9.1 Making Predictions#

We can use the trained model to make predictions on new data. For example, we can predict the temperature for a future time period based on the historical data and projected emission scenarios.

# Example prediction (using the training data for demonstration)
cnn_model.to(device)
cnn_model.eval()
with torch.no_grad():
    X_train_torch = X_train_torch.to(device)  # Move input data to the same device as model
    predicted_tas = cnn_model(X_train_torch[:5]).cpu().detach().numpy()  # Move output back to CPU for NumPy
print("Predictions recorded!")
Explanation:

This code snippet demonstrates how to make predictions with the trained model. It takes the first 5 sequences from the training data (X_train_torch[:5]) and feeds them through the model. The .cpu().detach().numpy() call moves the output back to the CPU and converts the PyTorch tensor to a NumPy array for easier manipulation and visualization.

Important: In a real-world scenario, you would use new, unseen data to evaluate the model’s performance and make predictions.
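The inference pattern for unseen data is the same `eval()`/`no_grad()` recipe used above. A minimal, self-contained sketch with a stand-in model and tensors (all names below are hypothetical; in the notebook you would use `cnn_model` and a properly preprocessed test set):

```python
import torch
import torch.nn as nn

# Stand-ins: a tiny model and random "unseen" tensors for illustration only.
model = nn.Linear(4, 2)
X_unseen = torch.randn(5, 4)
Y_unseen = torch.randn(5, 2)

model.eval()              # evaluation mode (affects dropout, batch norm)
with torch.no_grad():     # no gradient tracking needed at inference time
    preds = model(X_unseen)
    mse = nn.functional.mse_loss(preds, Y_unseen).item()

print(tuple(preds.shape), mse >= 0)
```

The key point is that the held-out inputs never touch `optimizer` or `loss.backward()`, so the reported MSE reflects generalization rather than memorization.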

Section 9.2 Visualizing Results#

Visualizing the model’s predictions is crucial for understanding its performance and identifying potential issues. We can use libraries like matplotlib and cartopy to plot the predicted temperature and compare it to the actual temperature.

Visualization example (plotting the first prediction)

Hide code cell source
# @markdown Visualization example (plotting the first prediction)
plt.imshow(predicted_tas[0].squeeze(), origin='lower', cmap='RdBu_r')
plt.colorbar()
plt.title('Predicted Temperature Anomaly')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
Explanation:

This code uses matplotlib.pyplot.imshow to display the first predicted temperature map. origin='lower' ensures that the latitude increases from bottom to top in the plot. cmap='RdBu_r' sets the color map to a red-blue diverging colormap, which is commonly used for visualizing temperature anomalies. A colorbar is added to the plot to show the range of temperature values.

Further Visualization Ideas:

  • Plotting actual vs. predicted temperatures: Create scatter plots or time series plots to compare the model’s predictions to the actual temperature values.

  • Mapping the difference (error): Plot the difference between the predicted and actual temperatures to visualize the model’s errors.

  • Using Cartopy for Geographic Projections: Use cartopy to project the data onto a map, providing a more realistic view of the climate patterns.
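The first of these ideas, a direct comparison of actual and predicted values, can be sketched as a scatter plot. The arrays below are synthetic stand-ins; in the notebook you would flatten `actual_tas` and `predicted_tas` instead:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Synthetic stand-ins for flattened actual/predicted temperature anomalies
rng = np.random.default_rng(0)
actual = rng.normal(0.0, 1.0, 500)
predicted = actual + rng.normal(0.0, 0.3, 500)  # predictions with some noise

r = float(np.corrcoef(actual, predicted)[0, 1])  # quick skill measure

plt.scatter(actual, predicted, s=5, alpha=0.4)
lims = [actual.min(), actual.max()]
plt.plot(lims, lims, "k--", label="y = x (perfect prediction)")
plt.xlabel("Actual anomaly")
plt.ylabel("Predicted anomaly")
plt.legend()
plt.title(f"Actual vs. predicted (r = {r:.2f})")
```

Points hugging the dashed `y = x` line (and a correlation near 1) indicate skillful predictions; systematic offsets from the line reveal bias.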

1. Using Cartopy for Geographic Projections
Use cartopy to project the data onto a map for a more realistic view of the climate patterns.

Run this cell to view plot of Predicted Temperature Anomaly

Hide code cell source
# @markdown Run this cell to view plot of `Predicted Temperature Anomaly`

# Set up the plot with a PlateCarree projection
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()}, figsize=(8, 5))
# Plot the predicted temperature anomaly
img = ax.imshow(predicted_tas[0].squeeze(), origin='lower', cmap='RdBu_r', extent=[-180, 180, -90, 90])
# Add coastlines for geographic context
ax.coastlines()
# Add a colorbar
plt.colorbar(img, ax=ax, orientation='vertical', label='Temperature Anomaly')
# Titles and labels
plt.title('Predicted Temperature Anomaly')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()

2. Difference (Error) Map
To see where the model deviates from the actual temperature.

Run this cell to visualise the Difference Map

Hide code cell source
# @markdown Run this cell to visualise the Difference Map


actual_tas = Y_train_torch.cpu().numpy()  # Move to CPU and convert to NumPy if it's a tensor
# Compute error map between predicted and actual temperature anomalies
error_map = (predicted_tas[0, 0] - actual_tas[0]).squeeze()  # Ensure 2D shape

# Plot the difference map
fig, ax = plt.subplots(subplot_kw={'projection': ccrs.PlateCarree()})
img = ax.imshow(error_map, origin='lower', cmap='coolwarm', extent=[-180, 180, -90, 90])

# Add coastlines and colorbar
ax.coastlines()
plt.colorbar(img, ax=ax, orientation='vertical', label="Temperature Difference")
plt.title("Prediction Error Map")
plt.show()

🔥 Insight: This highlights over- and under-predictions across the globe.

3. Latitudinal Mean Temperature Profile
Check how temperature varies across latitude.

Run this cell to visualise Latitudinal Mean Temperature Profile

Hide code cell source
# @markdown Run this cell to visualise Latitudinal Mean Temperature Profile

# Compute zonal mean (average over longitudes)
predicted_lat_mean = np.mean(predicted_tas[0].squeeze(), axis=1)  # Shape: (96,)
actual_lat_mean = np.mean(actual_tas.squeeze(), axis=(0, 2))  # Ensure shape (96,)

# Generate latitude array
latitudes = np.linspace(-90, 90, predicted_lat_mean.shape[0])  # Shape: (96,)

# Ensure both arrays have the same shape
print("Shapes:", latitudes.shape, actual_lat_mean.shape, predicted_lat_mean.shape)

# Plot
plt.figure(figsize=(7, 5))
plt.plot(latitudes, actual_lat_mean, label='Actual', linestyle='--', color='blue')
plt.plot(latitudes, predicted_lat_mean, label='Predicted', linestyle='-', color='red')
plt.xlabel('Latitude')
plt.ylabel('Temperature Anomaly')
plt.title('Latitudinal Mean Temperature Profile')
plt.legend()
plt.grid()
plt.show()

📈 Insight: Does the model capture temperature variations across latitudes accurately?

4. Histogram of Prediction Errors
Assess model bias in temperature anomalies.

Run this cell to visualise the Histogram of Prediction Errors

Hide code cell source
# @markdown Run this cell to visualise the Histogram of Prediction Errors

plt.figure(figsize=(7, 5))

# Flatten arrays and compute errors
errors = (predicted_tas[0].squeeze() - actual_tas).flatten()

plt.hist(errors, bins=30, color='purple', alpha=0.7, edgecolor='black')
plt.axvline(x=0, color='black', linestyle='dashed')

plt.xlabel('Prediction Error')
plt.ylabel('Frequency')
plt.title('Histogram of Prediction Errors')
plt.show()

📊 Insight: If errors are centered around 0, the model has no systematic bias.
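That bias check can also be made numeric: the mean error measures systematic bias, while the RMSE measures typical error magnitude. A sketch with stand-in arrays (in the notebook, use `predicted_tas` and `actual_tas`; the values here are random placeholders):

```python
import numpy as np

# Stand-in fields on a 96 x 144 grid, mimicking the tutorial's output shape
rng = np.random.default_rng(1)
predicted = rng.normal(0.1, 0.5, (96, 144))
actual = rng.normal(0.0, 0.5, (96, 144))

errors = (predicted - actual).ravel()
bias = float(errors.mean())                  # systematic over/under-prediction
rmse = float(np.sqrt(np.mean(errors ** 2)))  # typical error magnitude

print(f"bias={bias:.3f}  rmse={rmse:.3f}")
```

A bias near zero with a small RMSE is the goal; note that RMSE is always at least as large as the absolute bias, since it also captures random error.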

5. Prediction vs. Ground Truth Visualization

Hide code cell source
# @markdown 5. Prediction vs. Ground Truth Visualization
# Compares model predictions with actual data

# Example usage (assuming the trained model is available)
sample_idx = 0  # Choose an index from your dataset (e.g., the first sample)
sample_input = X_train_torch[sample_idx].unsqueeze(0).to(device)  # Assuming X_train_torch holds your input data
y_pred_sample = cnn_model(sample_input).detach().cpu().numpy()
plot_predictions(Y_train_all[0], y_pred_sample, 'Model Prediction vs. Ground Truth')

Noticed poor or vague predictions above? Follow these quick steps to improve them:


🔧 Improve Prediction & Visualization#

1. Select all scenarios: In Section 2.1.1, choose all climate scenarios, then rerun the notebook.

More data → better predictions.

2. Set epochs ≥ 30: In the training section, increase num_epochs to at least 30.

Allows the model to learn better.


Skipping these can lead to weak or misleading results. Try both steps and see the difference in plots and performance.


Questions 9.2: How can we “trust” an emulator’s prediction?#

Click for solution

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_9")

Section 10: Make Final Predictions on the Test Data from the ClimateBench Repository#

Section 10.2 Run Training and Predictions for Each Target Variable from the Test Set of the ClimateBench Dataset#

This section demonstrates how to extend training across multiple target variables (['tas', 'diurnal_temperature_range', 'pr', 'pr90']) using the same workflow introduced earlier for single-variable training.

# #Run Training and Predictions for Each Target Variable from the test set of the Climatebench dataset

# # Load the test data
# download_test_data()

# # Choose the target variable (default: 'tas'); uncomment the line below for more target variables
# target_vars= ['tas']
# #target_vars= ['tas', 'diurnal_temperature_range', 'pr', 'pr90']

# # #Define the path to the test data
# test_data_path = "Data/test/"

# # Load and normalize test data
# X_test = xr.open_mfdataset([data_path + 'inputs_historical.nc',
#                             test_data_path + 'inputs_ssp245.nc']).compute()
# for var in selected_climate_input_vars:
#     X_test[var].data = normalize(X_test[var].data, var, meanstd_inputs)

# X_test_np = input_for_training(X_test, skip_historical=False, len_historical=len_historical)
# X_test_tensor = torch.tensor(X_test_np, dtype=torch.float32).to(device)

# # Define the model
# class CNN_LSTM(nn.Module):
#     def __init__(self):
#         super(CNN_LSTM, self).__init__()
#         self.conv = nn.Conv2d(in_channels=4, out_channels=20, kernel_size=3, padding=1)
#         self.pool = nn.AvgPool2d(kernel_size=2)
#         self.lstm = nn.LSTM(input_size=20 * 48 * 72, hidden_size=25, batch_first=True)
#         self.fc = nn.Linear(25, 96 * 144)

#     def forward(self, x):
#         batch_size, seq_len, H, W, C = x.shape  # (batch, time, height, width, channels)

#         # Permute to (batch * time, channels, height, width) for Conv2D
#         x = x.permute(0, 1, 4, 2, 3).contiguous().view(batch_size * seq_len, C, H, W)

#         x = F.relu(self.conv(x))
#         x = self.pool(x)

#         # Reshape for LSTM: (batch, time, features)
#         x = x.view(batch_size, seq_len, -1)

#         x, _ = self.lstm(x)  # LSTM expects (batch, seq, features)
#         x = self.fc(x[:, -1, :])  # Take last timestep output

#         return x.view(-1, 1, 96, 144)  # Reshape to match output shape

# # Training loop
# for var_to_predict in target_vars:
#     print(var_to_predict)

#     # Prepare training data
#     X_train_all = np.concatenate([input_for_training(X_train_norm[i], skip_historical=skip_flags[i], len_historical=len_historical) for i in range(len(scenario_selected))], axis=0)
#     Y_train_all = np.concatenate([output_for_training(Y_train[i], var_to_predict, skip_historical=skip_flags[i], len_historical=len_historical) for i in range(len(scenario_selected))], axis=0)

#     # Convert to PyTorch tensors
#     X_train_tensor = torch.tensor(X_train_all, dtype=torch.float32).to(device)
#     Y_train_tensor = torch.tensor(Y_train_all, dtype=torch.float32).to(device)

#     dataset = TensorDataset(X_train_tensor, Y_train_tensor)
#     dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

#     # Initialize model, loss, and optimizer
#     model = CNN_LSTM().to(device)
#     optimizer = optim.RMSprop(model.parameters(), lr=0.001)
#     criterion = nn.MSELoss()
#     epochs = 1  # change this value to at least 30

#     # Train the model
#     for epoch in range(epochs):
#         model.train()
#         epoch_loss = 0.0
#         for X_batch, Y_batch in dataloader:
#             optimizer.zero_grad()
#             outputs = model(X_batch)
#             loss = criterion(outputs, Y_batch.view(-1, 1, 96, 144))
#             loss.backward()
#             optimizer.step()
#             epoch_loss += loss.item()
#         print(f"Epoch {epoch+1}/{epochs}, Loss: {epoch_loss/len(dataloader):.4f}")

#     # Make predictions
#     model.eval()
#     with torch.no_grad():
#         m_pred = model(X_test_tensor).cpu().numpy()

#     # Reshape to xarray
#     m_pred = xr.DataArray(m_pred[:, 0, :, :], dims=['time', 'lat', 'lon'],
#                           coords=[X_test.time.data[slider-1:], X_test.latitude.data, X_test.longitude.data])
#     xr_prediction = m_pred.transpose('lat', 'lon', 'time').sel(time=slice(2015, 2101)).to_dataset(name=var_to_predict)

#     if var_to_predict in ["pr90", "pr"]:
#         xr_prediction[var_to_predict] /= 86400

#     # Save predictions
#     output_path = data_path + f'outputs_ssp245_predict_{var_to_predict}.nc'
#     #xr_prediction.to_netcdf(output_path, 'w')
#     xr_prediction.close()
#     print(f"Saved {var_to_predict} predictions to {output_path}")

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_10")

Section 11: Conclusion#

Congratulations! You’ve successfully built and trained a CNN-LSTM model for climate modeling. You’ve learned how to load and preprocess climate data, define a CNN-LSTM architecture, train the model with validation and early stopping, and visualize the results.

Next Steps:

  • Experiment with different architectures: Try changing the number of layers, the kernel sizes, or the number of LSTM units.

  • Explore different optimizers: Experiment with different optimization algorithms, such as SGD or RMSProp.

  • Add more features: Include additional climate variables as input features.

  • Evaluate on a larger dataset: Train and evaluate the model on a larger and more diverse dataset.

  • Fine-tune hyperparameters: Use techniques like grid search or random search to find the optimal hyperparameter settings.

This notebook provides a solid foundation for exploring the exciting world of deep learning for climate modeling. Keep experimenting, keep learning, and keep building!

Submit your feedback#

Hide code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Section_11")

Summary#

In this tutorial, we demonstrated how to apply a CNN-LSTM deep learning architecture for climate prediction tasks using spatiotemporal climate datasets. Moving beyond traditional machine learning models, we worked directly with raw NetCDF files from the ClimateBench dataset, preserving spatial (latitude, longitude) and temporal (year) structures.

We implemented the following core steps:

  • Data Preprocessing: Loaded historical and scenario-based climate variables (CO₂, CH₄, SO₂, BC) using xarray, applied normalization (Z-score), and structured data into overlapping 10-year sliding windows for sequence modeling.

  • Model Design: Built a hybrid CNN-LSTM model in PyTorch. Spatial features were extracted independently at each timestep using TimeDistributed CNNs, and temporal dependencies were modeled using an LSTM. Final outputs were mapped to gridded climate projections.

  • Data Preparation for Training: Converted sequences into PyTorch tensors and aligned dimensions to match model input requirements.

This workflow enables learning complex spatiotemporal patterns directly from climate maps, improving long-term climate variable forecasting at the grid level.
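The 10-year sliding-window preprocessing summarized above can be sketched as a small NumPy helper. This is a simplified stand-in for the notebook’s `input_for_training` function, shown on toy dimensions:

```python
import numpy as np

def sliding_windows(arr, window=10):
    """Stack overlapping windows of arr (time, lat, lon, channels) into
    shape (time - window + 1, window, lat, lon, channels).

    Simplified stand-in for the tutorial's input_for_training helper.
    """
    return np.stack([arr[i:i + window]
                     for i in range(arr.shape[0] - window + 1)])

# Toy dimensions; the real ClimateBench grid is 96 x 144 with 4 input channels
years, lat, lon, channels = 15, 4, 6, 4
data = np.zeros((years, lat, lon, channels))

X = sliding_windows(data, window=10)
print(X.shape)  # (6, 10, 4, 6, 4)
```

Each of the 6 samples is a 10-year sequence of spatial maps, which is exactly the (batch, time, height, width, channels) layout the CNN-LSTM expects.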

Resources#

  1. Neuromatch Deep Learning Course

  2. ClimateBench CNN-LSTM baseline model

  3. how-to-work-with-time-distributed-data-in-a-neural-network

  4. Link to the full ClimateBench Dataset