
Tutorial 4: Public Opinion on the Climate Emergency and Why it Matters#

Week 2, Day 3: IPCC Socio-economic Basis

Content creators: Maximilian Puelma Touzel

Content reviewers: Peter Ohue, Derick Temfack, Zahra Khodakaramimaghsoud, Peizhen Yang, Younkap Nina Duplex, Laura Paccini, Sloane Garelick, Abigail Bodner, Manisha Sinha, Agustina Pesce, Dionessa Biton, Cheng Zhang, Jenna Pearson, Chi Zhang, Ohad Zivan

Content editors: Jenna Pearson, Chi Zhang, Ohad Zivan

Production editors: Wesley Banfield, Jenna Pearson, Chi Zhang, Ohad Zivan

Our 2023 Sponsors: NASA TOPS and Google DeepMind

Tutorial Objectives#

In this tutorial, we will explore a dataset derived from Twitter, focusing on public sentiment surrounding the Conference of the Parties (COP) climate change conferences. We will use data from a published study by Falkenberg et al. (Nature Climate Change, 2022). This dataset comprises tweets mentioning the COP conferences, which bring together world governments, NGOs, and businesses to discuss and negotiate progress on climate change. Our main objective is to understand public sentiment about climate change and how it has evolved over time by analyzing changing word usage on social media. In the process, we will also learn how to manage and analyze large quantities of text data.

The tutorial is divided into sections, where we first delve into loading and inspecting the data, examining the timing and languages of the tweets, and analyzing sentiments associated with specific words, including those indicating ‘hypocrisy’. We’ll also look at sentiments regarding institutions within these tweets and compare the sentiment of tweets containing ‘hypocrisy’-related words versus those without. This analysis is supplemented with visualization techniques like word clouds and distribution plots.

By the end of this tutorial, you will have developed a nuanced understanding of how text analysis can be used to study public sentiment on climate change and other environmental issues, helping us to navigate the intricate and evolving landscape of climate communication and advocacy.

Setup#

# imports
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# notebook config
from IPython.display import display, HTML
import datetime
import re
import nltk
from nltk.corpus import stopwords
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
import urllib.request  # the lib that handles the url stuff
from afinn import Afinn
import pooch
import os
import tempfile

Figure settings#

# @title Figure settings
import ipywidgets as widgets  # interactive display

%config InlineBackend.figure_format = 'retina'
plt.style.use(
    "https://raw.githubusercontent.com/ClimateMatchAcademy/course-content/main/cma.mplstyle"
)

sns.set_style("ticks", {"axes.grid": False})
display(HTML("<style>.container { width:100% !important; }</style>"))

Video 2: A Simple Greenhouse Model#

# @title Video 2: A Simple Greenhouse Model
# Tech team will add code to format and display the video
# helper functions


def pooch_load(filelocation=None, filename=None, processor=None):
    """Return a local path to `filename`, downloading it from `filelocation` if needed."""
    shared_location = "/home/jovyan/shared/Data/tutorials/W2D3_FutureClimate-IPCCII&IIISocio-EconomicBasis"  # this is different for each day
    user_temp_cache = tempfile.gettempdir()

    if os.path.exists(os.path.join(shared_location, filename)):
        # use the copy in the shared data directory when it exists
        file = os.path.join(shared_location, filename)
    else:
        # otherwise download the file (or reuse a cached download) in the temp directory
        file = pooch.retrieve(
            filelocation,
            known_hash=None,
            fname=os.path.join(user_temp_cache, filename),
            processor=processor,
        )

    return file

Section 1: Data Preprocessing#

We have performed the following preprocessing steps for you (simply follow along; there is no need to execute any commands in this section):

Every Twitter message (hereafter called a tweet) has an ID. The IDs of all tweets mentioning COPx (x=20-26, which refers to the session number of each COP meeting) used in Falkenberg et al. (2022) were placed by the authors in an OSF archive. You can download the 7 .csv files (one for each COP) here.

The twarc2 program serves as an interface with the Twitter API, allowing users to retrieve full tweet content and metadata by providing the tweet ID. Similar to GitHub, you need to create a Twitter API account and configure twarc on your local machine by providing your account authentication keys. To rehydrate a set of tweets using their IDs, you can use the following command: twarc2 hydrate source_file.txt store_file.jsonl. In this command, each line of the source_file.txt represents a Twitter ID, and the hydrated tweets will be stored in the store_file.jsonl.

  • First, format the downloaded IDs and split them into separate files (batches) so that each hydration call to the API takes a manageable amount of time (hours rather than days; hydration is slow because of an API-imposed limit of 100 tweets/min.).

# import os
# dir_name='Falkenberg2022_data/'
# if not os.path.exists(dir_name):
#     os.mkdir(dir_name)
# batch_size = int(1e5)
# download_pathname=''#~/projects/ClimateMatch/SocioEconDay/Polarization/COP_Twitter_IDs/
# for copid in range(20,27):
#     df_tweetids=pd.read_csv(download_pathname+'tweet_ids_cop'+str(copid)+'.csv')
#     for batch_id,break_id in enumerate(range(0,len(df_tweetids),batch_size)):
#         file_name="tweetids_COP"+str(copid)+"_b"+str(batch_id)+".txt"
#         # .loc slicing includes the end label, so subtract 1 to keep batches from overlapping
#         df_tweetids.loc[break_id:break_id+batch_size-1,'id'].to_csv(dir_name+file_name,index=False,header=False)
  • Make the hydration calls for COP26 (this took 4 days to download 50GB of data for COP26).

# import glob
# import time
# copid=26
# filename_list = glob.glob('Falkenberg2022_data/'+"tweetids_COP"+str(copid)+"*")
# dir_name='tweet_data/'
# if not os.path.exists(dir_name):
#     os.mkdir(dir_name)
# for itt,tweet_id_batch_filename in enumerate(filename_list):
#     strvars=tweet_id_batch_filename.split('/')[1].split('.')[0].split('_')
#     tweet_store_filename = dir_name+'tweets_'+strvars[1]+'_'+strvars[2]+'.json'
#     if not os.path.exists(tweet_store_filename):
#         st=time.time()
#         os.system('twarc2 hydrate '+tweet_id_batch_filename+' '+tweet_store_filename)
#         print(str(itt)+' '+str(strvars[2])+" "+str(time.time()-st))
  • Load the data, then inspect and pick a chunk size. Note, by default, there are 100 tweets per line in the .json files returned by the API. Given we asked for 1e5 tweets/batch, there should be 1e3 lines in these files.

# copid=26
# batch_id = 0
# tweet_store_filename = 'tweet_data/tweets_COP'+str(copid)+'_b'+str(batch_id)+'.json'
# num_lines = sum(1 for line in open(tweet_store_filename))
# num_lines
  • Now we read in the data, iterating over chunks in each batch and storing only the needed data in a dataframe (takes 10-20 minutes to run). Let’s look at when the tweets were posted, what language they are in, and the tweet text:

# selected_columns = ['created_at','lang','text']
# st=time.time()
# filename_list = glob.glob('tweet_data/'+"tweets_COP"+str(copid)+"*")
# df=[]
# for tweet_batch_filename in filename_list[:-1]:
#     reader = pd.read_json(tweet_batch_filename, lines=True,chunksize=1)
# #     df.append(pd.DataFrame([item[selected_columns] for sublist in reader.data.values.tolist()[:-1] for item in sublist] )[selected_columns])
#     dfs=[]
#     for chunk in reader:
#         if 'data' in chunk.columns:
#             dfs.append(pd.DataFrame(list(chunk.data.values)[0])[selected_columns])
#     df.append(pd.concat(dfs,ignore_index=True))
# #     df.append(pd.DataFrame(list(reader.data)[0])[selected_columns])
# df=pd.concat(df,ignore_index=True)
# df.created_at=pd.to_datetime(df.created_at)
# print(str(len(df))+' tweets took '+str(time.time()-st))
# df.head()
  • Finally, store the data in the efficiently compressed feather format

# df.to_feather('stored_tweets')
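
The batching and line-count reasoning in the steps above can be checked with a small standard-library sketch. The helper name `split_into_batches` and the toy ID list are hypothetical; the figure of 100 tweets per line is taken from the text.

```python
import math


def split_into_batches(ids, batch_size):
    """Split a list of IDs into consecutive, non-overlapping batches."""
    return [ids[start:start + batch_size] for start in range(0, len(ids), batch_size)]


# toy stand-in for one COP's tweet IDs (made-up values)
tweet_ids = list(range(250_000))
batches = split_into_batches(tweet_ids, batch_size=100_000)

# every ID should land in exactly one batch
total_ids_in_batches = sum(len(batch) for batch in batches)

# with 100 tweets per line, a full batch of 1e5 IDs gives 1e3 lines
tweets_per_line = 100
lines_per_batch = [math.ceil(len(batch) / tweets_per_line) for batch in batches]
print(len(batches), total_ids_in_batches, lines_per_batch)
```

Python list slices exclude the end index, so consecutive batches built this way do not share an ID.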

Section 2: Load and Inspect Data#

Now that we have reviewed the steps that were taken to generate the preprocessed data, we can load the data. It may take a few minutes to download the data.

filename_tweets = "stored_tweets"
url_tweets = "https://osf.io/download/8p52x/"
df = pd.read_feather(
    pooch_load(url_tweets, filename_tweets)
)  # takes a couple minutes to download

Let’s check the timing of the tweets relative to the COP26 event (duration shaded in blue in the plot you will make) to see how the number of tweets varies over time.

total_tweetCounts = (
    df.created_at.groupby(df.created_at.apply(lambda x: x.date()))  # call .date() to group by calendar day
    .count()
    .rename("counts")
)
fig, ax = plt.subplots()
total_tweetCounts.reset_index().plot(
    x="created_at", y="counts", figsize=(20, 5), style=".-", ax=ax
)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
ax.set_yscale("log")
COPdates = [
    datetime.datetime(2021, 10, 31),
    datetime.datetime(2021, 11, 12),
]  # shade the duration of the COP26 to guide the eye
ax.axvspan(*COPdates, alpha=0.3)  # shaded region marks the duration of COP26

In addition to assessing the number of tweets, we can also explore who was tweeting about this COP. Look at how many tweets were posted in various languages:

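# name the columns explicitly so the result does not depend on the pandas version
counts = df.lang.value_counts().rename_axis("index").reset_index(name="counts")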

The language of each tweet is stored as a language code. We can pull a language code dictionary from the web and use it to translate each language code to the language name.

target_url = "https://gist.githubusercontent.com/carlopires/1262033/raw/c52ef0f7ce4f58108619508308372edd8d0bd518/gistfile1.txt"
exec(urllib.request.urlopen(target_url).read())
lang_code_dict = dict(iso_639_choices)
counts = counts.replace({"index": lang_code_dict})
counts
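
As a minimal sketch of the same code-to-name translation without running code fetched from the web, the lookup can also be written with a hand-typed dictionary. The table `lang_names` is a hypothetical four-entry subset of the ISO 639-1 codes, and `tweet_langs` is toy data standing in for the `lang` column.

```python
from collections import Counter

# hypothetical subset of ISO 639-1 codes, typed by hand for illustration
lang_names = {"en": "English", "fr": "French", "es": "Spanish", "de": "German"}

# toy language codes standing in for the `lang` column
tweet_langs = ["en", "en", "fr", "es", "en", "und", "fr"]

code_counts = Counter(tweet_langs)

# translate each code, falling back to the raw code when it is not in the table
named_counts = {
    lang_names.get(code, code): n for code, n in code_counts.most_common()
}
print(named_counts)
```

The fallback in `lang_names.get(code, code)` keeps codes that are missing from the table visible instead of silently dropping them.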

Coding Exercise 2#

Run the following cell to print the dictionary for the language codes:

lang_code_dict

Find your native language code in the dictionary you just printed and use it to select the COP tweets that were written in your language!

language_code = ...
df_tmp = df.loc[df.lang == language_code, :].reset_index(drop=True)
pd.options.display.max_rows = 100  # see up to 100 entries
pd.options.display.max_colwidth = 250  # widen how much text is presented of each tweet
samples = ...
samples

Click for solution

df = df_tmp

Section 3: Word Set Prevalence#

Falkenberg et al. investigated the hypothesis that public sentiment around the COP conferences has increasingly framed them as hypocritical (“political hypocrisy as a topic of cross-ideological appeal”). The authors operationalized hypocrisy language as any tweet containing any of the following words:

selected_words = [
    "hypocrisy",
    "hypocrite",
    "hypocritical",
    "greenwash",
    "green wash",
    "blah",
]  # the last 3 words don't add much. Greta Thunberg's 'blah, blah blah' speech on Sept. 28th 2021.

Questions 3#

  1. How might this matching procedure be limited in its ability to capture this sentiment?

Click for solution

The authors then searched for these words within a distinct dataset across all COP conferences (this dataset was not made openly accessible but the figure using that data is here). They found that hypocrisy has been mentioned more in recent COP conferences.

Here, we will shift our focus to their accessible COP26 dataset and analyze the nature of comments related to specific topics, such as political hypocrisy. First, let’s look through the whole dataset and pull tweets that mention any of the selected words.

selectwords_detector = re.compile(
    r"\b(?:{0})\b".format("|".join(selected_words)),
    flags=re.IGNORECASE,  # flags belong in compile(); search()'s 2nd argument is a start position
)  # to make a word detector for a wordlist faster to run, compile it!
df["select_talk"] = df.text.apply(
    lambda x: selectwords_detector.search(x)
)  # look through whole dataset, flagging tweets with select_talk (computes in under a minute)
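
A detail of Python's `re` module is worth checking on a toy example: flags such as `re.IGNORECASE` take effect when given to `re.compile`, whereas the second positional argument of a compiled pattern's `search` method is a start position. The sketch below uses a shortened word list and made-up tweet text, so the variable names are illustrative only.

```python
import re

selected = ["hypocrisy", "hypocrite", "greenwash"]
joined = "|".join(selected)

# a case-sensitive pattern and one compiled with the IGNORECASE flag
case_sensitive = re.compile(r"\b(?:{0})\b".format(joined))
case_insensitive = re.compile(r"\b(?:{0})\b".format(joined), flags=re.IGNORECASE)

tweet = "Hypocrisy at its finest"

# here re.IGNORECASE (the integer 2) is read as `pos`, so the search starts at index 2
as_pos = case_sensitive.search(tweet, re.IGNORECASE)
# the flag changes matching only when given at compile time
with_flag = case_insensitive.search(tweet)
# the trailing \b means longer word forms such as "greenwashing" are not matched
longer_form = case_insensitive.search("stop greenwashing now")

print(as_pos, with_flag.group(0), longer_form)
```

The last line also illustrates one limitation of whole-word matching: inflected forms of a listed word are missed unless they are listed too.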

Let’s extract these tweets and examine their occurrence statistics in relation to the entire dataset that we calculated above.

selected_tweets = df.loc[~df.select_talk.isnull(), :]
selected_tweet_counts = (
    selected_tweets.created_at.groupby(
        selected_tweets.created_at.apply(lambda x: x.date())  # call .date() to group by calendar day
    )
    .count()
    .rename("counts")
)
selected_tweet_fraction = selected_tweet_counts / total_tweetCounts
fig, ax = plt.subplots(figsize=(20, 5))
selected_tweet_fraction.reset_index().plot(
    x="created_at", y="counts", style=[".-"], ax=ax
)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
ax.axvspan(*COPdates, alpha=0.3)  # shaded region marks the duration of COP26
ax.set_ylabel("fraction talking about hypocrisy")

Please note that these values are fractions of each day’s total number of tweets. Because the total number of tweets is orders of magnitude larger close to the COP26 dates (shaded in blue), a larger fraction there indicates a much greater absolute number of tweets talking about hypocrisy.
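
To make the distinction between fractions and absolute numbers concrete, here is a small standard-library sketch. The `(date, flag)` pairs are made-up toy data, not values from the dataset.

```python
from collections import Counter
from datetime import date

quiet_day = date(2021, 10, 1)
busy_day = date(2021, 11, 1)

# toy (day, mentions_topic) pairs standing in for tweets; all numbers are made up
tweets = (
    [(quiet_day, False)] * 9
    + [(quiet_day, True)] * 1
    + [(busy_day, False)] * 900
    + [(busy_day, True)] * 100
)

total_per_day = Counter(day for day, _ in tweets)
topic_per_day = Counter(day for day, flagged in tweets if flagged)

# fraction of each day's tweets that mention the topic
fraction_per_day = {
    day: topic_per_day[day] / total_per_day[day] for day in total_per_day
}

for day in sorted(total_per_day):
    print(day, topic_per_day[day], total_per_day[day], fraction_per_day[day])
```

Both days have the same fraction, yet the busy day has a hundred times more topic tweets in absolute terms.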

Now, let’s examine the content of these tweets by randomly sampling 100 of them.

selected_tweets.text.sample(100).values

Coding Exercise 3#

  1. Please select another topic and provide a list of topic words. We will then conduct the same analysis for that topic. For example, if the topic is “renewable technology,” please provide a list of relevant words.

selected_words_2 = [..., ..., ..., ..., ...]

selectwords_detector_2 = re.compile(
    r"\b(?:{0})\b".format("|".join([str(word) for word in selected_words_2])),
    flags=re.IGNORECASE,  # flags belong in compile(); search()'s 2nd argument is a start position
)
df["select_talk_2"] = df.text.apply(
    lambda x: selectwords_detector_2.search(x)
)

selected_tweets_2 = df.loc[~df.select_talk_2.isnull(), :]
selected_tweet_counts_2 = (
    selected_tweets_2.created_at.groupby(
        selected_tweets_2.created_at.apply(lambda x: x.date())  # call .date() to group by calendar day
    )
    .count()
    .rename("counts")
)
selected_tweet_fraction_2 = ...

samples = ...
samples

Click for solution

Section 4: Sentiment Analysis#

Let’s test this hypothesis from Falkenberg et al. (that public sentiment around the COP conferences has increasingly framed them as political hypocrisy). To do so, we can use sentiment analysis, which is a method for computing the proportion of words that have positive connotations, negative connotations or are neutral. Some sentiment analysis systems can measure other word attributes as well. In this case, we will analyze the sentiment of the subset of tweets that mention international organizations central to globalization (e.g., G7), focusing specifically on the tweets related to hypocrisy.

Note: part of the computation flow in what follows is from Caren Neal’s tutorial.

We’ll assign each tweet a sentiment score using a dictionary method, i.e. the score is based on the sentiment scores of the words in the tweet that appear in a given word-sentiment dictionary. The particular dictionary we will use is compiled in the AFINN package and scores words between -5 (negative connotation) and 5 (positive connotation). The English language dictionary consists of 2,477 coded words.
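
The idea of a dictionary method can be sketched in a few lines. This is a simplified illustration, not the AFINN package's implementation: `toy_lexicon` and its scores are made up for the example, and the tokenizer is a plain regular expression that may differ from the one the package uses.

```python
import re

# toy word-score dictionary on a -5..5 scale; the values are illustrative only
toy_lexicon = {"good": 3, "great": 3, "bad": -3, "disaster": -2, "hope": 2}


def dictionary_score(text, lexicon):
    """Sum the scores of all words in `text` that appear in `lexicon`."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


mixed = dictionary_score("Great speech, but the outcome is a disaster", toy_lexicon)
neutral = dictionary_score("The conference starts on Sunday", toy_lexicon)
print(mixed, neutral)
```

Words that are absent from the dictionary contribute nothing, so a tweet with no dictionary words scores 0 regardless of its actual tone.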

Let’s initialize the dictionary for the selected language. For example, the language code for English is ‘en’.

afinn = Afinn(language=language_code)

Now we can load the dictionary:

filename_afinn_wl = "AFINN-111.txt"
url_afinn_wl = (
    "https://raw.githubusercontent.com/fnielsen/afinn/master/afinn/data/AFINN-111.txt"
)

afinn_wl_df = pd.read_csv(
    pooch_load(url_afinn_wl, filename_afinn_wl),
    header=None,  # no column names
    sep="\t",  # tab separated
    names=["term", "value"],
)  # new column names
seed = 808  # seed for sample so results are stable
afinn_wl_df.sample(10, random_state=seed)

Let’s look at the distribution of scores over all words in the dictionary

fig, ax = plt.subplots()
afinn_wl_df.value.value_counts().sort_index().plot.bar(ax=ax)
ax.set_xlabel("Finn score")
ax.set_ylabel("dictionary counts")

These scores were assigned to words based on labeled tweets (validation paper).

Before focussing on sentiments about institutions within the hypocrisy tweets, let’s look at the hypocrisy tweets in comparison to non-hypocrisy tweets. This will take some more intensive computation, so let’s only perform it on a 1% subsample of the dataset.

smalldf = df.sample(frac=0.01)
smalldf["afinn_score"] = smalldf.text.apply(
    afinn.score
)  # intensive computation! We have reduced the data set to frac=0.01 of its size so it takes ~1 min. (the full dataset takes 1 hr 50 min.)
smalldf["afinn_score"].describe()  # generate descriptive statistics.

From this, we can see that the maximum score is 24 and the minimum score is -33. The score is computed by summing the scores of all dictionary words present in the tweet, which means that longer tweets tend to have scores of larger magnitude.

To make the scores comparable across tweets of different lengths, a rough approach is to convert them to a per-word score. This is done by normalizing each tweet’s score by its word count. It’s important to note that this per-word score is not specific to the dictionary words used, so this approach introduces a bias that depends on the proportion of dictionary words in each tweet. We will refer to this normalized score as afinn_adjusted.

def word_count(text_string):
    """Calculate the number of words in a string"""
    return len(text_string.split())


smalldf["word_count"] = smalldf.text.apply(word_count)
smalldf["afinn_adjusted"] = (
    smalldf["afinn_score"] / smalldf["word_count"]
)  # note this isn't a percentage
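
The effect of the per-word normalization can be seen on two toy tweets. In this sketch the raw scores are supplied by hand (a hypothetical -3 for both texts) rather than computed with AFINN, and the helper name `per_word_score` is an assumption for illustration.

```python
def per_word_score(raw_score, text):
    """Divide a summed sentiment score by the whitespace-separated word count."""
    n_words = len(text.split())
    # guard against empty text, which would otherwise divide by zero
    return raw_score / n_words if n_words else 0.0


short_text = "terrible"
long_text = "terrible outcome for everyone at the summit"

# both texts contain the same single scored word, so the raw score is the same
short_adjusted = per_word_score(-3, short_text)
long_adjusted = per_word_score(-3, long_text)
print(short_adjusted, long_adjusted)
```

The longer text ends up with a weaker adjusted score only because it contains more words that are not in the dictionary, which is the bias described above.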
smalldf["afinn_adjusted"].describe()

After normalizing the scores, we find that the maximum score is now 2 and the minimum score is now -1.5.

Now let’s look at the sentiment of tweets with hypocrisy words versus those without those words. For reference, we’ll first make cumulative distribution plots of score distributions for some other possibly negative words: fossil, G7, Boris and Davos.

for sel_words in [["Fossil"], ["G7"], ["Boris"], ["Davos"], selected_words]:
    sel_name = sel_words[0] if len(sel_words) == 1 else "select_talk"
    selectwords_detector = re.compile(
        r"\b(?:{0})\b".format("|".join(sel_words)),
        flags=re.IGNORECASE,  # flags belong in compile(); search()'s 2nd argument is a start position
    )  # compile for speed!
    smalldf[sel_name] = smalldf.text.apply(
        lambda x: selectwords_detector.search(x) is not None
    )  # flag if tweet has word(s)
for sel_words in [["Fossil"], ["G7"], ["Boris"], ["Davos"], selected_words]:
    sel_name = sel_words[0] if len(sel_words) == 1 else "select_talk"
    fig, ax = plt.subplots()
    ax.set_xlim(-1, 1)
    ax.set_xlabel("adjusted Finn score")
    ax.set_ylabel("probability")
    counts, bins = np.histogram(
        smalldf.loc[smalldf[sel_name], "afinn_adjusted"],
        bins=np.linspace(-1, 1, 101),
        density=True,
    )
    # density * bin width = probability per bin, so the cumulative sum ends at 1
    ax.plot(
        bins[1:],
        np.cumsum(counts) * np.diff(bins),
        color="C0",
        label=sel_name + " tweets",
    )
    counts, bins = np.histogram(
        smalldf.loc[~smalldf[sel_name], "afinn_adjusted"],
        bins=np.linspace(-1, 1, 101),
        density=True,
    )
    ax.plot(
        bins[1:],
        np.cumsum(counts) * np.diff(bins),
        color="C1",
        label="non-" + sel_name + " tweets",
    )
    ax.axvline(0, color=[0.7] * 3, zorder=1)
    ax.legend()
    ax.set_title("cumulative Finn score distribution for " + sel_name + " occurrence")
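
When a histogram is computed as a density, each bin height is a probability per unit of score, so a cumulative probability needs each height multiplied by its bin width before summing. The sketch below shows this step in plain Python with made-up density values; the helper name `cumulative_from_density` is hypothetical.

```python
def cumulative_from_density(density, bin_edges):
    """Turn per-bin densities into cumulative probabilities at the right bin edges."""
    cumulative = []
    running = 0.0
    for height, left, right in zip(density, bin_edges[:-1], bin_edges[1:]):
        running += height * (right - left)  # probability mass in this bin
        cumulative.append(running)
    return cumulative


# toy density over four bins of width 0.5 (made-up values that integrate to 1)
bin_edges = [-1.0, -0.5, 0.0, 0.5, 1.0]
density = [0.2, 0.8, 0.6, 0.4]

cdf = cumulative_from_density(density, bin_edges)
unscaled = sum(density)  # summing densities without the bin width does not end at 1
print(cdf, unscaled)
```

With the bin width included, the curve is non-decreasing and ends at 1, which is what a "probability" axis label implies.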

Recall from our previous calculations that the tweets containing the selected hypocrisy-associated words have a minimum adjusted score of -1.5. This score is much more negative than the scores of all four reference words we just plotted. So what is it about the content of these selected tweets that makes them so negative? To explore this, we can use word clouds to assess the usage of specific words.

Section 5: Word Clouds#

To analyze word usage, let’s first vectorize the text data. Vectorization here means giving each word in the vocabulary an index and transforming each word sequence into the sequence of corresponding word indices (e.g. the response ['I','love','icecream'] maps to something like [34823,5937,79345]). It builds on tokenization, the step that splits the text into individual words.
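As a rough illustration of this mapping, here is a minimal sketch in plain Python. The toy sentences, the naive whitespace tokenizer, and the helper names (`tokenize`, `to_indices`, `to_count_vector`) are assumptions made for this example only; they are not part of the tutorial's pipeline, which relies on scikit-learn's vectorizers.

```python
# Toy corpus (hypothetical sentences, not taken from the tweet dataset).
docs = [
    "I love icecream",
    "I love climate action",
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


# Build the vocabulary: each distinct word gets the next free integer index.
vocabulary = {}
for doc in docs:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)


def to_indices(text):
    """Map a word sequence to the sequence of its vocabulary indices."""
    return [vocabulary[word] for word in tokenize(text)]


def to_count_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = [0] * len(vocabulary)
    for word in tokenize(text):
        counts[vocabulary[word]] += 1
    return counts


print(vocabulary)
print(to_indices("I love icecream"))
print(to_count_vector("I love climate action"))
```

The count vector has one entry per vocabulary word, which is why the vectorized corpus grows quickly with vocabulary size and number of documents.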

We’ll use and compare two methods: term-frequency (\(\mathrm{tf}\)) and term-frequency inverse document frequency (\(\mathrm{Tfidf}\)). Both use vectorization to transform words into numbers that measure how important a term is: \(\mathrm{tf}\) within a single document, and \(\mathrm{Tfidf}\) within a document relative to a collection of documents.

Term Frequency (\(\mathrm{tf}\)): the number of times the word appears in a document compared to the total number of words in the document.

\[\mathrm{tf}=\frac{\mathrm{number \; of \; times \; the \; term \; appears \; in \; the \; document}}{\mathrm{total \; number \; of \; terms \; in \; the \; document}}\]

Inverse Document Frequency (\(\mathrm{idf}\)): reflects the proportion of documents in the collection of documents that contain the term. Words unique to a small percentage of documents (e.g., technical jargon terms) receive higher importance values than words common across all documents (e.g., a, the, and).

\[\mathrm{idf}=\log\left(\frac{\mathrm{number \; of \; documents \; in \; the \; collection}}{\mathrm{number \; of \; documents \; in \; the \; collection \; containing \; the \; term}}\right)\]

Thus the overall term-frequency inverse document frequency can be calculated by multiplying the term-frequency and the inverse document frequency:

\[\mathrm{Tfidf}=\mathrm{tf} \times \mathrm{idf}\]

\(\mathrm{Tfidf}\) aims to add more discriminability to frequency as a word relevance metric by downweighting words that appear in many documents since these common words are less discriminative. In other words, the importance of a term is high when it occurs a lot in a given document and rarely in others.
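To make the downweighting concrete, here is a small worked sketch in plain Python. It assumes the standard definition \(\mathrm{idf}=\log(N/n_t)\), where \(N\) is the number of documents and \(n_t\) the number of documents containing the term. The three toy "tweets" and the function names are hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` use smoothed and normalized variants, so their numbers will differ from this sketch.

```python
import math

# Three toy tokenized "tweets" (hypothetical, for illustration only).
docs = [
    ["private", "jet", "hypocrisy"],
    ["climate", "hypocrisy"],
    ["hypocrisy", "summit", "hypocrisy", "leaders"],
]


def tf(term, doc):
    """Term frequency: occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency: log(N / number of documents with the term).

    Assumes the term occurs in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tfidf(term, doc, docs):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)


# "hypocrisy" is frequent in the third document but appears in every document,
# so its idf (and hence its Tfidf) is zero in this unsmoothed variant.
print(tf("hypocrisy", docs[2]), tfidf("hypocrisy", docs[2], docs))

# "jet" appears in only one document, so it keeps a positive weight.
print(tf("jet", docs[0]), tfidf("jet", docs[0], docs))
```

A word that occurs in every document carries no information for telling documents apart, which is the intuition behind giving it a low weight.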

If you are interested in learning more about the mathematical equations used to develop these two methods, please refer to the additional details in the “Further Reading” section for this day.

Let’s run both of these methods and store the vectorized data in a dictionary:

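# imports used in this cell (harmless if already imported at the top of the notebook)
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from wordcloud import WordCloud

vectypes = ["counts", "Tfidf"]


def vectorize(doc_data, ngram_range=(1, 1), remove_words=None, min_doc_freq=1):
    """
    Vectorize the tweets with both a count and a Tfidf vectorizer.

    Returns a dictionary of DataFrames (one row per tweet, one column per term),
    keyed by vectorizer type, and the array of feature names.
    """
    remove_words = [] if remove_words is None else remove_words
    vectorized_data_dict = {}
    for vectorizer_type in vectypes:
        if vectorizer_type == "counts":
            vectorizer = CountVectorizer(
                stop_words=remove_words, min_df=min_doc_freq, ngram_range=ngram_range
            )
        elif vectorizer_type == "Tfidf":
            vectorizer = TfidfVectorizer(
                stop_words=remove_words, min_df=min_doc_freq, ngram_range=ngram_range
            )

        # NOTE: as written, this fits on the global `data` (the subsampled tweet
        # texts defined in a later cell), not on the `doc_data` argument
        vectorized_doc_list = vectorizer.fit_transform(data).todense().tolist()
        feature_names = (
            vectorizer.get_feature_names_out()
        )  # get_feature_names() in scikit-learn versions before 1.0
        print("vocabulary size:" + str(len(feature_names)))
        wdf = pd.DataFrame(vectorized_doc_list, columns=feature_names)
        vectorized_data_dict[vectorizer_type] = wdf
    return vectorized_data_dict, feature_names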


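def plot_wordcloud_and_freqdist(wdf, title_str, feature_names):
    """
    Plots a bar chart of the 50 highest-weighted terms (summed over all tweets)
    with a circular word cloud of the same weights as an inset
    """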
    pixel_size = 600
    x, y = np.ogrid[:pixel_size, :pixel_size]
    mask = (x - pixel_size / 2) ** 2 + (y - pixel_size / 2) ** 2 > (
        pixel_size / 2 - 20
    ) ** 2
    mask = 255 * mask.astype(int)
    wc = WordCloud(
        background_color="rgba(255, 255, 255, 0)", mode="RGBA", mask=mask, max_words=50
    )  # ,relative_scaling=1)
    wordfreqs = wdf.T.sum(axis=1)
    num_show = 50
    sorted_ids = np.argsort(wordfreqs)[::-1]

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(x=range(num_show), height=wordfreqs[sorted_ids][:num_show])
    ax.set_xticks(range(num_show))
    ax.set_xticklabels(
        feature_names[sorted_ids][:num_show], rotation=45, fontsize=8, ha="right"
    )
    ax.set_ylabel("total frequency")
    ax.set_title(title_str + " vectorizer")
    ax.set_ylim(0, 10 * wordfreqs[sorted_ids][int(num_show / 2)])

    ax_wc = inset_axes(ax, width="90%", height="90%")
    wc.generate_from_frequencies(wordfreqs)
    ax_wc.imshow(wc, interpolation="bilinear")
    ax_wc.axis("off")


nltk.download(
    "stopwords"
)  # downloads basic stop words, i.e. words with little semantic value  (e.g. "the"), to be used as words to be removed
remove_words = stopwords.words("english")

We can now vectorize and look at the word clouds for single-word statistics. Let’s explicitly exclude some words and implicitly exclude those that appear in fewer than some threshold number of tweets.
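The two kinds of exclusion can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the toy tokenized tweets and the helper names (`document_frequency`, `filter_vocabulary`) are hypothetical, and the threshold is expressed as a fraction of documents, which is how scikit-learn interprets a float `min_df`.

```python
from collections import Counter

# Toy tokenized "tweets" (hypothetical, for illustration only).
tokenized_docs = [
    ["cop26", "private", "jet"],
    ["private", "jet", "hypocrisy"],
    ["cop26", "leaders", "hypocrisy"],
    ["hypocrisy", "flying"],
]


def document_frequency(tokenized_docs):
    """Count in how many documents each term occurs (not how many times in total)."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # set(): a term counts at most once per document
    return df


def filter_vocabulary(tokenized_docs, stop_words, min_doc_fraction):
    """Keep terms that are not stop words and occur in enough documents."""
    df = document_frequency(tokenized_docs)
    n_docs = len(tokenized_docs)
    return sorted(
        term
        for term, count in df.items()
        if term not in stop_words and count / n_docs >= min_doc_fraction
    )


# Explicit exclusion via the stop word list, implicit exclusion via the threshold.
kept = filter_vocabulary(tokenized_docs, stop_words={"cop26"}, min_doc_fraction=0.5)
print(kept)
```

Here "cop26" is dropped because it is listed explicitly, while "leaders" and "flying" are dropped because each appears in only one of the four documents.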

data = (
    selected_tweets["text"].sample(frac=0.1).values
)  # reduce size since the vectorization computation transforms the corpus into an array of large size (vocabulary size x number of tweets)
# let's add some more words that we don't want to track (you can generate this kind of list iteratively by looking at the results and adding to this list):
remove_words += [
    "cop26",
    "http",
    "https",
    "30",
    "000",
    "je",
    "rt",
    "climate",
    "limacop20",
    "un_climatetalks",
    "climatechange",
    "via",
    "ht",
    "talks",
    "unfccc",
    "peru",
    "peruvian",
    "lima",
    "co",
]
print(str(len(data)) + " tweets")
min_doc_freq = 5 / len(data)
ngram_range = (1, 1)  # start and end number of words
vectorized_data_dict, feature_names = vectorize(
    selected_tweets,
    ngram_range=ngram_range,
    remove_words=remove_words,
    min_doc_freq=min_doc_freq,
)
for vectorizer_type in vectypes:
    plot_wordcloud_and_freqdist(
        vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names
    )

Note in the histograms how the \(\mathrm{Tfidf}\) vectorizer has scaled down the hypocrisy words such that they are less prevalent relative to the count vectorizer.

There are some words here (e.g. private and jet) that look like they likely would appear in pairs. Let’s tell the vectorizer to also look for high frequency pairs of words.
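For intuition about what a range of one to two words means, here is a minimal sketch in plain Python that lists all single words (unigrams) and adjacent word pairs (bigrams) of a token sequence. The function name `ngrams` and the example sentence are assumptions for this illustration; scikit-learn's vectorizers perform the equivalent step internally when given an `ngram_range`.

```python
def ngrams(tokens, n_min, n_max):
    """Return all contiguous word sequences of length n_min to n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start : start + n]))
    return grams


# Hypothetical tokenized tweet.
tokens = ["leaders", "arrive", "by", "private", "jet"]

unigrams_and_bigrams = ngrams(tokens, 1, 2)
print(unigrams_and_bigrams)
```

With pairs included, a phrase such as "private jet" becomes its own vocabulary entry and can be counted separately from "private" and "jet".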

ngram_range = (1, 2)  # start and end number of words
vectorized_data_dict, feature_names = vectorize(
    selected_tweets,
    ngram_range=ngram_range,
    remove_words=remove_words,
    min_doc_freq=min_doc_freq,
)
for vectorizer_type in vectypes:
    plot_wordcloud_and_freqdist(
        vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names
    )

The hypocrisy words are so frequent that it is hard to see what the remaining words are. To clear this list a bit more, let’s also remove the hypocrisy words altogether.

remove_words += selected_words
ngram_range = (1, 2)  # start and end number of words
vectorized_data_dict, feature_names = vectorize(
    selected_tweets,
    ngram_range=ngram_range,
    remove_words=remove_words,
    min_doc_freq=min_doc_freq,
)
for vectorizer_type in vectypes:
    plot_wordcloud_and_freqdist(
        vectorized_data_dict[vectorizer_type], vectorizer_type, feature_names
    )

Observe that terms we might expect to be associated with hypocrisy, e.g. “flying”, are still present. Even when allowing for pairs, the semantics are hard to extract from this analysis, which ignores the correlations in usage among multiple words.

To further assess these statistics, one approach is to use a generative model with latent structure.

Topic models (the structural topic model in particular) are a nice modelling framework to start analyzing those correlations.

For a modern introduction to text analysis in the social sciences, I recommend the textbook:

Text as Data: A New Framework for Machine Learning and the Social Sciences (2022) by Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart

Summary#

In this tutorial, you’ve learned how to analyze large amounts of text data from social media to understand public sentiment about climate change. You’ve been introduced to the process of loading and examining Twitter data, specifically relating to the COP climate change conferences. You’ve also gained insights into identifying and analyzing sentiments associated with specific words, with a focus on those indicating ‘hypocrisy’.

We used techniques to normalize sentiment scores and to compare sentiment among different categories of tweets. You have also learned about text vectorization methods, term-frequency (tf) and term-frequency inverse document frequency (tfidf), and their applications in word usage analysis. This tutorial provided you with a valuable stepping stone to delve further into text analysis, which could help deepen our understanding of public sentiment on climate change. Such analysis helps us track how global perceptions and narratives about climate change evolve over time, which is crucial for policy planning and climate communication strategies.

This tutorial therefore not only provided you with valuable tools for text analysis but also demonstrated their potential in contributing to our understanding of climate change perceptions, a key factor in driving climate action.

Resources#

The data for this tutorial can be accessed from Falkenberg et al. Nature Clim. Chg. 2022.