Tutorial 5: Drafting the Analysis

Good Research Practices

Content creators: Marguerite Brown, Zane Mitrevica, Natalie Steinemann, Yuxin Zhou

Content reviewers: Katrina Dobson, Sloane Garelick, Maria Gonzalez, Nahid Hasan, Paul Heubel, Beatriz Cosenza Muralles, Sherry Mi, Cheng Zhang

Content editors: Jenna Pearson, Chi Zhang, Ohad Zivan

Production editors: Wesley Banfield, Paul Heubel, Jenna Pearson, Chi Zhang, Ohad Zivan

Our 2024 Sponsors: CMIP, NFDI4Earth

Tutorial Objectives

In Tutorials 5-8, you will learn about the research process. This includes how to:

Draft analyses of data to test a hypothesis
Implement analysis of data
Interpret results in the context of existing knowledge
Communicate your results and conclusions

By the end of these tutorials you will be able to:

Understand the principles of good research practices
View a scientific data set or question through the lens of equity: Who is represented by this data and who is not? Who has access to this information? Who is in a position to use it?

# imports

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
from scipy import interpolate
from scipy import stats

Video 1: Drafting the Analysis

Coding Exercise 1

To explore the relationship between CO₂ and temperature, you may want to make a scatter plot of the two variables, where the x-axis represents CO₂ and the y-axis represents temperature. Then you can see if a linear regression model fits the data well.

Before you do that, let’s learn how to apply a linear regression model using generated data.

If you aren’t familiar with linear regression, it is a way of describing the relationship between two variables (e.g. x and y) with a straight line. For example, giraffes run at different speeds, and you might wonder whether taller giraffes run faster than shorter ones. How do we describe the relationship between a giraffe’s height and its running speed? A linear regression model provides a mathematical equation:

\[ \text{speed} = a \cdot \text{height} + b \]

where \(a\) and \(b\) are the slope and intercept of the equation, respectively. Such an equation lets us predict an unmeasured giraffe’s running speed by plugging its height into the equation. Not every giraffe will fit the relationship exactly, because other factors such as health, diet, and age also influence speed. However, because of their simplicity, linear regression models are usually the first thing scientists try when quantifying the relationship between two variables.
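The giraffe example can be sketched in a few lines. This is only an illustration: the heights and speeds below are invented numbers, not real measurements, and the variable names (`height_m`, `speed_kmh`, `new_height_m`) are assumptions made for this sketch. It uses `scipy.stats.linregress` to estimate \(a\) and \(b\), then plugs a new height into the fitted equation.

```python
import numpy as np
from scipy import stats

# Hypothetical giraffe measurements, invented purely for illustration
height_m = np.array([3.9, 4.2, 4.5, 4.8, 5.1, 5.4])         # height in metres
speed_kmh = np.array([44.0, 46.5, 47.0, 50.5, 51.0, 54.0])  # running speed in km/h

# Fit speed = a * height + b by ordinary least squares
fit = stats.linregress(height_m, speed_kmh)
a, b = fit.slope, fit.intercept

# Predict the speed of an unmeasured giraffe by plugging in its height
new_height_m = 4.6
predicted_speed_kmh = a * new_height_m + b

print(f"speed = {a:.2f} * height + {b:.2f}")
print(f"predicted speed at {new_height_m} m: {predicted_speed_kmh:.1f} km/h")
```

The prediction is only as trustworthy as the linear assumption itself, and it is safest for heights inside the range that was actually measured.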

For more information on linear regression models, see the Wikipedia page on linear regression, especially the first figure on that page.

# set up a random number generator
rng = np.random.default_rng()
# x is one hundred random numbers between 0 and 1
x = rng.random(100)
# y depends linearly on x (slope 1.6) plus uniform noise in [0, 1);
# the noise has mean 0.5, so on average y = 1.6x + 0.5
y = 1.6 * x + rng.random(100)

# plot
fig, ax = plt.subplots()
ax.scatter(x, y, color="gray")

# regression
res = stats.linregress(x, y)  # ordinary least squares

ax.plot(x, x * res.slope + res.intercept, color="k")
ax.set_xlabel("x")
ax.set_ylabel("y")

Figure: scatter plot of y against x (gray points) with the fitted regression line (black).
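Because the random number generator above is unseeded, the scatter and the fitted line change on every run. The sketch below is one way to check that the regression recovers the relationship used to generate the data. The fixed seed and the larger sample size are choices made for this illustration, not part of the tutorial's own code.

```python
import numpy as np
from scipy import stats

# Seed chosen arbitrarily so that the result is reproducible (an assumption of this sketch)
rng = np.random.default_rng(seed=0)

n = 10_000                 # larger sample than above, to reduce sampling error
x = rng.random(n)          # uniform in [0, 1)
noise = rng.random(n)      # uniform in [0, 1), so its mean is 0.5
y = 1.6 * x + noise

res = stats.linregress(x, y)

# The fitted slope should be close to 1.6, and the intercept close to the noise mean of 0.5
print(f"slope: {res.slope:.3f} (generating value 1.6)")
print(f"intercept: {res.intercept:.3f} (noise mean 0.5)")
```

With only one hundred points, as in the cell above, the estimates scatter more widely around 1.6 and 0.5 from run to run.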

To get a sense of how well the model fits the data, you can look at the regression results.

# summarize model

print(
    f"Squared Pearson correlation (r^2): {res.rvalue**2:.2f}\n"
    f"with a p-value of: {res.pvalue:.2e}"
)

Squared Pearson correlation (r^2): 0.65
with a p-value of: 2.39e-24
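The squared correlation can be read as the fraction of the variance in y that the fitted line explains, and the p-value reported by `linregress` tests the null hypothesis that the slope is zero. As a minimal sketch (with an arbitrarily chosen seed, so the numbers will differ from the output above), the same r² can be computed directly from the residuals:

```python
import numpy as np
from scipy import stats

# Regenerate data like the example above; the seed is an arbitrary choice for reproducibility
rng = np.random.default_rng(seed=0)
x = rng.random(100)
y = 1.6 * x + rng.random(100)

res = stats.linregress(x, y)

# Residual sum of squares: variation left over after the fit
y_hat = res.slope * x + res.intercept
ss_res = np.sum((y - y_hat) ** 2)

# Total sum of squares: variation of y around its mean
ss_tot = np.sum((y - y.mean()) ** 2)

# Fraction of the variance explained by the line
r_squared = 1 - ss_res / ss_tot

print(f"r^2 from residuals:  {r_squared:.4f}")
print(f"r^2 from linregress: {res.rvalue**2:.4f}")
```

The two values agree for an ordinary least squares line with an intercept. An r² near 1 means the points lie close to the line, while a value near 0 means the line explains little of the spread in y.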

Now that we know how to write code to analyze the linear relationship between two variables, we’re ready to move on to real-world data!