# Tutorial 5: Drafting the Analysis#

**Good Research Practices**

**Content creators:** Yuxin Zhou, Marguerite Brown, Natalie Steinemann, Zane Mitrevica

**Content reviewers:** Sherry Mi, Maria Gonzalez, Nahid Hasan, Beatriz Cosenza Muralles, Katrina Dobson, Sloane Garelick, Cheng Zhang

**Content editors:** Jenna Pearson, Chi Zhang, Ohad Zivan

**Production editors:** Wesley Banfield, Jenna Pearson, Chi Zhang, Ohad Zivan

**Our 2023 Sponsors:** NASA TOPS and Google DeepMind

# Tutorials Objectives#

In Tutorials 5-8, you will learn about the research process. This includes how to

Draft analyses of data to test a hypothesis

Implement analysis of data

Interpret results in the context of existing knowledge

Communicate your results and conclusions

By the end of these tutorials you will be able to:

Understand the principles of good research practices

Learn to view a scientific data set or question through the lens of equity: Who is represented by this data and who is not? Who has access to this information? Who is in a position to use it?

```
# imports
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
from scipy import interpolate
from scipy import stats
```

# Coding Exercise 1#

To explore the relationship between CO_{2} and temperature, you may want to make a scatter plot of the two variables, where the x-axis represents CO_{2} and the y-axis represents temperature. Then you can see if a linear regression model fits the data well.

Before you do that, let’s learn how to apply a linear regression model using generated data.

If you aren’t familiar with a linear regression model, it is simply a way of isolating a relationship between two variables (e.g. x and y). For example, each giraffe might have different running speeds. You might wonder if taller giraffes run faster than shorter ones. How do we describe the relationship between a giraffe’s height and its running speed? A linear regression model will be able to provide us a mathematical equation:

`speed = a * height + b`

where a and b are the slope and intercept of the equation, respectively. Such an equation allows us to predict an unknown giraffe’s running speed by simply plugging its height into the equation. Not all giraffes will fit the relationship and other factors might influence their speeds, such as health, diet, age, etc. However, because of its simplicity, linear regression models are usually first attempted by scientists to quantify the relationship between variables.

For more information on linear regression models, see the Wikipedia page, especially the first figure on that page:

```
# set up a random number generator
rng = np.random.default_rng()
# x is one hundred random numbers between 0 and 1
x = rng.random(100)
# y is one hundred random numbers according to the relationship y = 1.6x + 0.5
y = 1.6 * x + rng.random(100)
# plot
fig, ax = plt.subplots()
ax.scatter(x, y, color="gray")
# regression
res = stats.linregress(x, y) # ordinary least sqaure
ax.plot(x, x * res.slope + res.intercept, color="k")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

```
Text(0, 0.5, 'y')
```

To get a sense of how our model fits the data, you can look at the regression results.

```
# summarize model
print(
"pearson (r^2) value: "
+ "{:.2f}".format(res.rvalue**2)
+ " \nwith a p-value of: "
+ "{:.2e}".format(res.pvalue)
)
```

```
pearson (r^2) value: 0.69
with a p-value of: 7.18e-27
```

Now that we know how to write codes to analyze the linear relationship between two variables, we’re ready to move on to real world data!