Tutorial 1: Distributions#
Week 2, Day 3, Extremes & Variability
Content creators: Matthias Aengenheyster, Joeri Reinders
Content reviewers: Younkap Nina Duplex, Sloane Garelick, Zahra Khodakaramimaghsoud, Peter Ohue, Laura Paccini, Jenna Pearson, Agustina Pesce, Derick Temfack, Peizhen Yang, Cheng Zhang, Chi Zhang, Ohad Zivan
Content editors: Paul Heubel, Jenna Pearson, Chi Zhang, Ohad Zivan
Production editors: Wesley Banfield, Paul Heubel, Jenna Pearson, Konstantine Tsafatinos, Chi Zhang, Ohad Zivan
Our 2024 Sponsors: CMIP, NFDI4Earth
Tutorial Objectives#
Estimated timing of tutorial: 30 minutes
In this initial tutorial, your focus will be on examining the distribution of annual extreme precipitation levels in Germany. Your objective is to explore various aspects of the distribution, including the mean, variance, and skewness. By the end of this tutorial, you will be able to:
Visualize an observational record as both a time series and a distribution.
Compute the moments of a record.
Generate and plot a distribution with predefined moments.
Setup#
# imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os
import pooch
import tempfile
from scipy import stats
Install and import feedback gadget#
Show code cell source
# @title Install and import feedback gadget
!pip3 install vibecheck datatops --quiet
from vibecheck import DatatopsContentReviewContainer
def content_review(notebook_section: str):
return DatatopsContentReviewContainer(
"", # No text prompt
notebook_section,
{
"url": "https://pmyvdlilci.execute-api.us-east-1.amazonaws.com/klab",
"name": "comptools_4clim",
"user_key": "l5jpxuee",
},
).render()
feedback_prefix = "W2D3_T1"
[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: pip install --upgrade pip
Figure Settings#
Show code cell source
# @title Figure Settings
import ipywidgets as widgets # interactive display
%config InlineBackend.figure_format = 'retina'
plt.style.use(
"https://raw.githubusercontent.com/neuromatch/climate-course-content/main/cma.mplstyle"
)
Helper functions#
Show code cell source
# @title Helper functions
def pooch_load(filelocation=None, filename=None, processor=None):
shared_location = "/home/jovyan/shared/Data/tutorials/W2D3_ExtremesandVariability" # this is different for each day
user_temp_cache = tempfile.gettempdir()
if os.path.exists(os.path.join(shared_location, filename)):
file = os.path.join(shared_location, filename)
else:
file = pooch.retrieve(
filelocation,
known_hash=None,
fname=os.path.join(user_temp_cache, filename),
processor=processor,
)
return file
Video 1: Distributions#
Submit your feedback#
Show code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Distributions_Video")
If you want to download the slides: https://osf.io/download/6vu3z/
Submit your feedback#
Show code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Distributions_Slides")
Section 1: Inspect a Precipitation Record and Plot it Over Time#
Extreme rainfall can pose flood hazards with disasterous consequences for society, the economy, and ecosystems. Annual maximum daily precipitation is a valuable measurement for assessing flood risks, and in this tutorial you will start the journey of statistically analyzing a dataset of annual maximum daily precipitation records for Germany.
First, download the precipitation files:
# download file: 'precipitationGermany_1920-2022.csv'
filename_precipitationGermany = "precipitationGermany_1920-2022.csv"
url_precipitationGermany = "https://osf.io/xs7h6/download"
data = pd.read_csv(
pooch_load(url_precipitationGermany, filename_precipitationGermany), index_col=0
).set_index("years")
data.columns = ["precipitation"]
precipitation = data.precipitation
Downloading data from 'https://osf.io/xs7h6/download' to file '/tmp/precipitationGermany_1920-2022.csv'.
SHA256 hash of downloaded file: ef9f29e709a1db0745c28bcb9460dac22f6c9624de41d9e2187ab547e9aaa3c8
Use this value as the 'known_hash' argument of 'pooch.retrieve' to ensure that the file hasn't changed if it is downloaded again in the future.
precipitation
years
1920 24.5
1921 27.7
1922 15.6
1923 23.5
1924 59.9
...
2018 31.6
2019 24.9
2020 33.3
2021 57.4
2022 25.4
Name: precipitation, Length: 103, dtype: float64
You can see that for each year between 1920 and 2022, there is one number. This represents the largest amount of precipitation that was observed at a point in Germany at 51 N 6 E. In other words, one looked at each day’s precipitation for each year and included the highest amount in this time series.
Now we can plot a time series of the data from 1920-2022.
precipitation.plot.line(style=".-",grid=True)
plt.xlabel("Time (years)")
plt.ylabel("Annual Maximum Daily Precipitation \n(mm/day)")
Text(0, 0.5, 'Annual Maximum Daily Precipitation \n(mm/day)')
Click here for a description of the plot
A time series of the annual maximum daily precipitation in millimeters per day per year. The data ranges from approximately 16 mm/day in the 1920s to almost 70 mm/day around 1970. About four peaks show unusual events of extreme precipitation (in 1924, 1966, 1982 and 2021). Moreover, a positive trend over time seems to be visible.To make this dataset more interpretable, we can plot a histogram of the data. Recall that we can make a histogram of this data by plotting the y-axis from the previous figure on the x-axis of this new figure versus the count of how many data points fall within a ‘bin’ on the x-axis.
# create the bins (x axis) for the data
fig, ax = plt.subplots()
bins = np.arange(0, precipitation.max(), 2)
# make the histogram
sns.histplot(precipitation, bins=bins, ax=ax)
# set limits and labels
ax.set_xlim(bins[0], bins[-1])
ax.set_xlabel("Annual Maximum Daily Precipitation \n(mm/day)")
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
Text(0.5, 0, 'Annual Maximum Daily Precipitation \n(mm/day)')
Click here for a description of the plot
Histogram plot of the annual maximum daily precipitation in millimeters per day. The horizontal axis of the histogram represents the data's range via bins, while the vertical axis represents the frequency of occurrences within each interval by counting the number of data points that fall into each bin. By visually inspecting the histogram we get an impression of the shape, central tendency, and spread of the dataset. The bin with the largest amount of occurrences shows 12 counts of annual maximum daily precipitation between 20 and 22 millimeters per day. Bins that show at least one count range from 14 millimeters per day to 60 millimeters per day.Next let’s calculate the moments of our dataset. A moment helps us define the center of mass (mean), scale (variance), and shape (skewness and kurtosis) of a distribution. The scale is how the data is stretched or compressed along the x-axis, while the shape parameters help us answer questions about the geometry of the distribution, for example if data points lie more frequently to one side of the mean than the other or if the tails are ‘heavy’ (i.e. larger chance of getting extreme values).
Let’s compute the mean, the variance and the standard deviation of your precipitation data. Plot the mean as a vertical line on the histogram.
# mean
mean_pr = precipitation.mean()
# variance
var_pr = precipitation.var()
# standard deviation
std_pr = precipitation.std()
mean_pr, var_pr, std_pr
(30.97184466019418, 98.55047401484866, 9.927259139100212)
# re-plot histogram from above
fig, ax = plt.subplots()
bins = np.arange(0, precipitation.max(), 2)
sns.histplot(precipitation, bins=bins, ax=ax) # this will plot the counts for each bin
ax.set_xlim(bins[0], bins[-1])
ylim = ax.get_ylim()
# add in vertical line at mean
ax.vlines(mean_pr, ymin=ylim[0], ymax=ylim[1], color="C3", lw=3, label="mean")
ax.set_xlabel("Annual Maximum Daily Precipitation \n(mm/day)")
ax.legend()
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
<matplotlib.legend.Legend at 0x7ff7e79e95e0>
Click here for a description of the plot
Histogram plot of the annual maximum daily precipitation in millimeters per day as above. Here, the computed mean is highlighted as a red vertical line. The mean is shown in the 30 to 32 bin as the calculated value is approximately 30.97 millimeters per day.As you can observe, the range of values on either side of the mean-line you added earlier is unequal, suggesting a skewed distribution. To assess the extent of the potential skewness, we will use the .skew()
function. We will also generate a set of 100 random values from a normal distribution (mean = 0, standard deviation = 1) and compare its skewness to that of the precipitation data.
# calculate the skewness of our precipitation data
precipitation.skew()
1.1484425874858337
# generate data following a normal distribution (mean = 0, standard deviation = 1)
normal_data = np.random.normal(0, 1, size=data.index.size)
# calculate the skewness of a normal distribution
stats.skew(normal_data)
-0.48328953063855207
Note that a more positive value of skewness means the tail on the right is longer/heavier and extends more towards the positive values, and vice versa for a more negative skewness. An unskewed dataset would have a value of zero - such as the normal distribution.
By comparing the precipitation data to the skewness of the normal distribution, we can see that our data is positively skewed, meaning that the right tail of the distribution contains more data than the left. This is consistent with our findings from the histogram.
To delve deeper into this observation, let’s try fitting a normal distribution to our precipitation data. This entails computing the mean and standard deviation of the ‘precipitation’ variable, which serve as the two parameters for a normal distribution.
By utilizing the scipy
function norm.pdf
, we can generate a probability density function (pdf) that accompanies our histogram. The pdf provides insight into the probability of encountering various levels of precipitation based on the available data. For a normal distribution the mean value exhibits the highest probability (this is not the case for all distributions!).
fig, ax = plt.subplots()
bins = np.arange(0, precipitation.max(), 2)
sns.histplot(
precipitation, bins=bins, ax=ax, stat="density"
) # notice the different stat being ploted
ax.set_xlim(bins[0], bins[-1])
ylim = ax.get_ylim()
# add in vertical line at mean
ax.vlines(mean_pr, ymin=ylim[0], ymax=ylim[1], color="C3", lw=3, label="mean")
# add PDF
x_r100 = np.arange(0, 100, 1)
ax.plot(x_r100, stats.norm.pdf(x_r100, mean_pr, std_pr), c="k", lw=3)
ax.set_xlabel("Annual Maximum Daily Precipitation \n(mm/day)")
ax.legend()
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
<matplotlib.legend.Legend at 0x7ff7e78d10a0>
Click here for a description of the plot
Histogram plot of the annual maximum daily precipitation in millimeters per day as above. Here, the computed mean is highlighted as a red vertical line. Furthermore, the computed probability density function is overlaid in black.Coding Exercises 1#
Add uncertainty bands to the distribution.
Create 1000 records of 100 samples each that are drawn from a normal distribution with the mean and standard deviation of the precipitation record.
Compute the 5-th and 95-th percentiles across the 1000-member ensemble and add them to the figure above to get an idea of the uncertainty.
Hint: you can use the function np.random.normal()
to draw from a normal distribution. Call np.random.normal()
or help(np.random.normal)
to understand how to use it. np.quantile()
is useful for computing quantiles.
# take 1000 records of 100 samples
random_samples = ...
# create placeholder for pdfs
pdfs = np.zeros([x_r100.size, 1000])
# loop through all 1000 records and create a pdf of each sample
for i in range(1000):
# find pdfs
pdfi = ...
# add to array
...
fig, ax = plt.subplots()
# make histogram
_ = ...
# set x limits
_ = ...
# get y lims for plotting mean line
ylim = ...
# add vertical line with mean
_ = ...
# plot pdf
_ = ...
# plot 95th percentile
_ = ...
# plot 5th percentile
_ = ...
# set xlabel
ax.set_xlabel("Annual Maximum Daily Precipitation \n(mm/day)")
Text(0.5, 0, 'Annual Maximum Daily Precipitation \n(mm/day)')
Example output:
Click to see description of plot
Histogram plot of the annual maximum daily precipitation in millimeters per day. Here, the computed mean is highlighted as a red vertical line. The computed probability density function is overlaid in black as above. Furthermore, the 5th and 95th percentiles across the 1000-member ensemble are computed and added to the figure as dashed black lines to get an idea of the uncertainty.Submit your feedback#
Show code cell source
# @title Submit your feedback
content_review(f"{feedback_prefix}_Coding_Exercise_1")
Summary#
In this tutorial, you focused on the analysis of the annual maximum daily precipitation levels in Germany. You started by visualizing the observational record as both a time series and a distribution. This led you to compute the moments of the record, specifically, the mean, variance, and standard deviation. You then compared the skewness of the precipitation data to a normally distributed set of values. Finally, you assessed the fit of a normal distribution to your dataset.
Resources#
Data from this tutorial uses the 0.25 degree precipitation dataset E-OBS. It combines precipitation observations to generate a gridded (i.e. no “holes”) precipitation over Europe. We used the precipitation data from the gridpoint at 51°N, 6°E.
The dataset can be accessed using the KNMI Climate Explorer here. The Climate Explorer is a great resource to access, manipulate and visualize climate data, including observations and climate model simulations. It is freely accessible - feel free to explore!