
Tutorial 3: Testing Model Generalization#

Week 2, Day 4, AI and Climate Change

Content creators: Deepak Mewada, Grace Lindsay

Content reviewers: Mujeeb Abdulfatai, Nkongho Ayuketang Arreyndip, Jeffrey N. A. Aryee, Paul Heubel, Jenna Pearson, Abel Shibu

Content editors: Deepak Mewada, Grace Lindsay

Production editors: Paul Heubel, Konstantine Tsafatinos

Our 2024 Sponsors: CMIP, NFDI4Earth

Tutorial Objectives#

Estimated timing of tutorial: 25 minutes

In this tutorial, you will

  • Understand the problem of overfitting

  • Understand generalization

  • Learn to split data into train and test data

  • Evaluate trained models on held-out test data

  • Think about the relationship between model capacity and overfitting

Setup#

# imports:

import pandas as pd                                       # For data manipulation
from sklearn.model_selection import train_test_split      # For splitting dataset into train and test sets
from sklearn.ensemble import RandomForestRegressor        # For Random Forest Regression
from sklearn.tree import DecisionTreeRegressor            # For Decision Tree Regression

Install and import feedback gadget#

# @title Install and import feedback gadget

!pip3 install vibecheck datatops --quiet

from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
    return DatatopsContentReviewContainer(
        "",  # No text prompt
        notebook_section,
        {
            "url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
            "name": "comptools_4clim",
            "user_key": "l5jpxuee",
        },
    ).render()


feedback_prefix = "W2D4_T3"

Figure Settings#

# @title Figure Settings
import ipywidgets as widgets  # interactive display
import matplotlib.pyplot as plt

%config InlineBackend.figure_format = 'retina'
plt.style.use(
    "https://raw.githubusercontent.com/neuromatch/climate-course-content/main/cma.mplstyle"
)

Set random seed#

By executing set_seed(seed=seed) you set the random seed.

# @title Set random seed

# @markdown By executing `set_seed(seed=seed)` you set the random seed

# Call `set_seed` function in the exercises to ensure reproducibility.
import random
import numpy as np

def set_seed(seed=None):
    if seed is None:
        seed = np.random.choice(2 ** 32)
    random.seed(seed)
    np.random.seed(seed)
    print(f'Random seed {seed} has been set.')

# Set a global seed value for reproducibility
random_state = 42 # replace 42 with any number you like

set_seed(seed=random_state)
Random seed 42 has been set.

Video 1: Testing model generalization#

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_Testing_model_generalization_Video")

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_Testing_model_generalization_Slides")

Section 1: Model generalization#

As discussed in the video, machine learning models can overfit. This means they essentially memorize the data points they were trained on. As a result, they perform very well on those data points, but when presented with data they weren't trained on, their predictions are poor. Therefore, we need to evaluate our models according to how well they perform on data they weren't trained on.

To do this, we will split the data into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate how well the model performs on unseen data. This helps us ensure that our model can generalize well to new data and avoid overfitting.

Section 1.1: Load and Prepare the Data#

As in the previous tutorial, we load our dataset and prepare it by removing unnecessary columns and extracting the target variable tas_FINAL, which represents temperature anomalies in 2050. In every case, the anomalies are calculated by subtracting the annual means of the pre-industrial scenario from the annual means of the respective scenario of interest.

# Load and Prepare the Data
url_Climatebench_train_val = "https://osf.io/y2pq7/download" # Dataset URL
training_data = pd.read_csv(url_Climatebench_train_val)  # Load the training data from the provided URL
training_data.pop('scenario')  # drop the 'scenario' column: it is just a label and will not be passed into the model
target = training_data.pop('tas_FINAL')  # Extract the target variable 'tas_FINAL' which we aim to predict
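
As an optional sanity check (assuming the cell above has run), you can inspect what remains in the prepared data:

# Optional sanity check: inspect the remaining input features and the target
print(training_data.shape)              # (number of samples, number of input features)
print(training_data.columns.tolist())   # names of the remaining feature columns
print(target.describe())                # summary statistics of the temperature anomalies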

Section 1.2: Data Splitting for Training and Testing#

Our next objective is to prepare the dataset for model training and evaluation. To achieve this, we'll use the train_test_split function from Scikit-learn, which conveniently splits a dataset into training and testing subsets.

We already imported train_test_split from Scikit-learn in the Setup section:

from sklearn.model_selection import train_test_split      

We will randomly allocate 20% of the data for testing and reserve the remaining 80% for training. Evaluating the model on unseen data is crucial for assessing its real-world performance.

With this function in hand, let's split our dataset and move on to model training and evaluation.

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    training_data, target, test_size=0.2, random_state=1
)

We have now separated the input features (X) and the target variable (y) into a training set (X_train, y_train) and a test set (X_test, y_test).
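
As a quick, optional check, you can confirm that the split matches the 80/20 ratio we requested:

# Optional check: the split should be roughly 80% train / 20% test
print('Training samples:', X_train.shape[0])
print('Test samples    :', X_test.shape[0])
print('Test fraction   :', X_test.shape[0] / len(training_data))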

Section 1.3: Train a decision tree model on the training data and evaluate it#

# Training the model on the training data
dt_regressor = DecisionTreeRegressor(random_state=random_state, max_depth=20)
dt_regressor.fit(X_train, y_train)
DecisionTreeRegressor(max_depth=20, random_state=42)

Now we will evaluate the model on both the training and test data.

print('Performance on training data:', dt_regressor.score(X_train, y_train))
print('Performance on test data    :', dt_regressor.score(X_test, y_test))
Performance on training data: 0.9990899735488357
Performance on test data    : 0.7857776327773902

We can see here that our model is overfitting: it is performing much better on the data it was trained on than on held-out test data.
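
The .score() method of Scikit-learn regressors returns the coefficient of determination (R²), so the gap between the two numbers above is one way to quantify overfitting. If you prefer an error in the units of the target, here is an optional sketch using Scikit-learn's mean_squared_error (not otherwise used in this tutorial):

# Optional: express the train/test gap as a root-mean-squared error in kelvin
from sklearn.metrics import mean_squared_error

rmse_train = np.sqrt(mean_squared_error(y_train, dt_regressor.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, dt_regressor.predict(X_test)))
print(f'Train RMSE: {rmse_train:.3f} K')
print(f'Test RMSE : {rmse_test:.3f} K')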

Section 1.4: Train a random forest model on the training data and evaluate it#

Use what you know to train a random forest model on the training data and evaluate it on both the training and test data. We have already imported RandomForestRegressor in the Setup section via

from sklearn.ensemble import RandomForestRegressor  

Coding Exercise 1.4:#

def train_random_forest_model(X_train, y_train, X_test, y_test, random_state):
    """Train a Random Forest model and evaluate its performance.

    Args:
        X_train (ndarray): Training features.
        y_train (ndarray): Training labels.
        X_test (ndarray): Test features.
        y_test (ndarray): Test labels.
        random_state (int): Random seed for reproducibility.

    Returns:
        RandomForestRegressor: Trained Random Forest regressor model.
    """
    #################################################
    ## TODO for students: Train a random forest model on the training data and evaluate it ##
    # Implement training a RandomForestRegressor model using X_train and y_train
    # Then, evaluate its performance on both training and test data using .score() method
    # Print out the performance on training and test data
    # Please remove the following line of code once you have completed the exercise:
    raise NotImplementedError("Student exercise: Implement the training and evaluation process.")
    #################################################

    # train the model on the training data
    rf_regressor = RandomForestRegressor(random_state=random_state)

    # fit the model
    _ = rf_regressor.fit(..., ...)

    print('Performance on training data :', rf_regressor.score(..., y_train))
    print('Performance on test data     :', rf_regressor.score(X_test, ...))

    return rf_regressor

# test the function
rf_model = ...

Click for solution

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_Coding_Exercise_1_4")

Question 1.4: Overfitting - Decision Tree vs Random Forest#

  1. Does the random forest model overfit less than a single decision tree?

Click for solution

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_Questions_1_4")

Section 1.5: Explore Parameters of the Random Forest Model#

In the previous tutorial, you saw how to control the depth of a single decision tree.
We can likewise control the depth of the trees in our random forest model by passing a max_depth argument, and the number of trees by setting n_estimators.

Intuitively, these variables control the capacity of the model. Capacity loosely refers to the number of trainable parameters in the model. The more trees and the deeper they are, the more free parameters the model has to fit the training data. If the model's capacity is too low, it won't be powerful enough to capture complex relationships between the input features and the target variable. If it has too many free parameters, however, it may end up memorizing every single training point and therefore overfit.
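
Before turning to the sliders, here is a minimal, non-interactive sketch of the same idea: sweeping max_depth at a fixed n_estimators and watching the train/test gap grow with capacity (the depth values are arbitrary choices for illustration):

# Optional: sweep model capacity and compare train vs. test performance
for depth in [2, 5, 10, 20]:  # illustrative depths; feel free to change them
    rf = RandomForestRegressor(n_estimators=50, max_depth=depth,
                               random_state=random_state)
    rf.fit(X_train, y_train)
    print(f'max_depth={depth:2d}  '
          f'train R^2={rf.score(X_train, y_train):.3f}  '
          f'test R^2={rf.score(X_test, y_test):.3f}')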

Use the sliders below to experiment with different values of n_estimators and max_depth and see how they impact performance on training and test data.

Interactive Demo 1.5: Performance of the Random Forest Regression#

In this activity, you can adjust the sliders for n_estimators and max_depth to observe their effect on model performance:

  • n_estimators: Controls the number of trees in the Random Forest.

  • max_depth: Sets the maximum depth of each tree.

After adjusting the sliders, the code fits a new Random Forest model and prints the training and testing scores, showing how changes in these parameters impact model performance.

Use the slider to change the values of 'n_estimators' and 'max_depth' and observe the effect on performance.#

Make sure you execute this cell to enable the widget!

# @title Use the slider to change the values of 'n_estimators' and 'max_depth' and observe the effect on performance.
# @markdown Make sure you execute this cell to enable the widget!
# Uncomment the code below to run the widget.

# Function to train random forest and display scatter plot
def train_rf_and_plot(X_tr, y_train, X_test, y_test, max_depth, n_estim):
    global rf_regressor, X_train

    # Instantiate and train the random forest regressor
    rf_regressor = RandomForestRegressor(n_estimators=n_estim, max_depth=max_depth)
    rf_regressor.fit(X_tr, y_train)

    # Calculate and print the scores
    score_train = rf_regressor.score(X_tr, y_train)
    score_test = rf_regressor.score(X_test, y_test)
    print(f"\n\tTraining Score: {score_train}")
    print(f"\tTesting Score  : {score_test}\n")

    # Generate scatter plot: Predicted vs. True Temperatures
    predicted = rf_regressor.predict(X_tr)

    fig, ax = plt.subplots()

    # Scatter plot
    ax.scatter(predicted, y_train, color='blue', alpha=0.7, label='Comparison of Predicted and True Temperatures', edgecolors='black')
    ax.plot([min(y_train), max(y_train)], [min(y_train), max(y_train)], color='red', linestyle='--', label='Ideal Prediction Line')
    ax.set_xlabel('Predicted Temperature (K)')
    ax.set_ylabel('True Temperature (K)')
    ax.set_title('Annual mean temperature anomaly')
    # add a caption
    caption_text = 'The anomalies are calculated by subtracting the annual means of the pre-industrial scenario from \nthe annual means of the respective scenario.'
    plt.figtext(0.5, -0.03, caption_text, ha='center', fontsize=10)  # Adjusted y-coordinate to create space
    ax.legend()
    ax.grid(True)

    plt.tight_layout()
    plt.show()


# Interactive widget to control max_depth and n_estimators
# @widgets.interact(max_depth=(1, 41, 1), n_estimators=(10,100,5))
# def visualize_scores_with_max_depth(max_depth=20, n_estimators=50):
#     train_rf_and_plot(X_train, y_train, X_test, y_test, max_depth, n_estimators)

Interactive Demo 1.5: Discussion#

  1. Did you observe any trends in how the performance changes?

  2. Try to explain in your own words the concepts of capacity and overfitting and how they relate.

  3. In addition to model capacity, what else could be changed to prevent overfitting?

Click for solution

Submit your feedback#

# @title Submit your feedback
content_review(f"{feedback_prefix}_Discussion_Interactive_Demo_1_5")

Summary#

In this tutorial, we explored the importance of training and testing sets in constructing robust machine learning models. We saw how overfitting arises and why models must be assessed on separate, held-out test sets. Through practical exercises, we gained hands-on proficiency in data partitioning, model training, and performance evaluation.