HW5. Natural Language Processing
Topics: NLP(Natural Language Processing)
Background
(From https://www.kaggle.com/headsortails/us-election-2020-presidential-debates.)
The US 2020 election saw the incumbent Republican president Donald Trump facing off against his Democrat challenger Joe Biden. Both candidates agreed to debate their political stances in the lead-up to the vote on November 3rd.
The 1st debate took place on September 29th at Case Western Reserve University in Cleveland, Ohio. It was moderated by Fox News anchor Chris Wallace.
After Donald Trump tested positive for Covid-19 following the first debate and subsequently recovered, the 2nd presidential debate was initially converted into a virtual event. However, Trump declined to participate in a virtual debate and the event was transformed into a (socially distanced) town hall with Joe Biden alone. This event was moderated by George Stephanopoulos on the ABC network and took place on October 15th. Donald Trump then decided to hold a town hall of his own, which took place on the same day and at the same time as the Biden town hall and was aired on NBC. This event was moderated by Savannah Guthrie.
The actual 2nd, and final, presidential debate between Biden and Trump took place on October 22nd at Belmont University, Nashville, Tennessee and was moderated by NBC News’ Kristen Welker. This debate implemented stricter conditions on the candidates’ speaking times to limit the frequent interruptions that had characterised the 1st debate.
The goal of this homework assignment is to investigate the transcripts of those events.
The data files are available via https://github.com/umsi-data-science/data/tree/main/us_election_2020. Note that each file is available in two formats: CSV and TXT. You can work from either format – the content of the corresponding CSV and TXT files is the same.
You can use the URLs to the Raw versions of the data files, as we’ve done with other data files in the course so far; clone the repo if you’re comfortable with git; or click on each file, view the “Raw” format, and download the resulting file. You do not need the us_election_2020_vice_presidential_debate files, as we are only concerned with the presidential debates and town halls for this assignment.
Note that you will need to make decisions about how you proceed with the analysis. For example, you will need to decide whether you include or omit stopwords, whether you stem or lemmatize words before analysis, and how you handle weird situations.
You will need to decide when to exclude speech from the moderators and when to include it (note that in most cases, the assignment asks about the candidates, which means Trump and Biden and not the moderators).
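For instance, stemming and lemmatization can produce different normal forms, which will affect your counts. A quick illustrative check (assuming `ps = PorterStemmer()` and `nlp = spacy.load('en_core_web_sm')` as set up later in this notebook):

ps.stem("studies")          # 'studi' -- stemming can produce non-words
nlp("studies")[0].lemma_    # typically 'study' -- lemmatization returns a dictionary form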
Please fill in your uniqname in the next code cell:
MY_UNIQNAME = "yjwoo"
Answer each of the questions below. You are encouraged to use as many code and markdown cells as you need for each question.
We strongly suggest running all cells from top to bottom before you submit your notebook.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import spacy
import string
from spacy.lang.en.stop_words import STOP_WORDS
from nltk.stem import PorterStemmer
import matplotlib.patches as mpatches
from wordcloud import WordCloud
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
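# note: VADER (used in question 7) needs its lexicon; on a fresh environment,
# run nltk.download('vader_lexicon') once before creating SentimentIntensityAnalyzer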
from spacy.tokens import Doc
import statsmodels.api as sm
import statsmodels.formula.api as smf
1. How many words did each candidate (Trump & Biden) use in each debate?
1-0. prepare the data for each candidate
txt_1st = pd.read_csv("./data/us_election_2020_1st_presidential_debate.csv")
txt_2nd = pd.read_csv("./data/us_election_2020_2nd_presidential_debate.csv")
txt_townhall_biden = pd.read_csv("./data/us_election_2020_biden_town_hall.csv")
txt_townhall_trump = pd.read_csv("./data/us_election_2020_trump_town_hall.csv")
Let’s find all rows spoken by Trump and Biden, and make two data frames, one for only Trump and the other for only Biden.
Data for Biden
txt_1st.speaker.unique()
array(['Chris Wallace', 'Vice President Joe Biden',
'President Donald J. Trump', 'Chris Wallace:'], dtype=object)
txt_1st_biden = txt_1st[txt_1st.speaker == "Vice President Joe Biden"][["text"]]
txt_1st_biden["source"] = "1st_debate"
txt_2nd.speaker.unique()
array(['Kristen Welker', 'Donald Trump', 'Joe Biden'], dtype=object)
txt_2nd_biden = txt_2nd[txt_2nd.speaker == "Joe Biden"][["text"]]
txt_2nd_biden["source"] = "2nd_debate"
txt_townhall_biden.speaker.unique()
array(['George Stephanopoulos', 'Joe Biden', 'Nicholas Fed', 'Kelly Lee',
'Anthony Archer', 'Voice Over', 'Cedric Humphrey',
'George Stephanopoulus', 'Angelia Politarhos', 'Speaker 1',
'Nathan Osburn', 'Andrew Lewis', 'Speaker 2', 'Michele Ellison',
'Mark Hoffman', 'Mieke Haeck', 'Speaker 3', 'Keenan Wilson'],
dtype=object)
txt_townhall_biden_only = txt_townhall_biden[txt_townhall_biden.speaker == "Joe Biden"][["text"]]
txt_townhall_biden_only["source"] = "townhall"
txt_biden = pd.concat([txt_1st_biden, txt_2nd_biden, txt_townhall_biden_only])
txt_biden
| | text | source |
|---|---|---|
| 2 | How you doing, man? | 1st_debate |
| 4 | I’m well. | 1st_debate |
| 9 | Well, first of all, thank you for doing this a... | 1st_debate |
| 11 | The American people have a right to have a say... | 1st_debate |
| 12 | Now, what’s at stake here is the President’s m... | 1st_debate |
| ... | ... | ... |
| 212 | Every single solitary generation, the dial has... | townhall |
| 214 | We’d better be able to do it again. | townhall |
| 216 | Well, it could say I’m a lousy candidate, and ... | townhall |
| 218 | Yeah. And by the way, before I came up here, I... | townhall |
| 220 | I expect to be there. | townhall |
507 rows × 2 columns
Data for Trump
txt_1st.speaker.unique()
array(['Chris Wallace', 'Vice President Joe Biden',
'President Donald J. Trump', 'Chris Wallace:'], dtype=object)
txt_1st_trump = txt_1st[txt_1st.speaker == "President Donald J. Trump"][["text"]]
txt_1st_trump["source"] = "1st_debate"
txt_2nd.speaker.unique()
array(['Kristen Welker', 'Donald Trump', 'Joe Biden'], dtype=object)
txt_2nd_trump = txt_2nd[txt_2nd.speaker == "Donald Trump"][["text"]]
txt_2nd_trump["source"] = "2nd_debate"
txt_townhall_trump.speaker.unique()
array(['Savannah Guthrie', 'Speaker 1', 'President Trump',
'Jacqueline Lugo', 'Barbara Peña', 'Isabella Peña', 'Savannah',
'Cristy Montesinos Alonso', 'Adam Schucher', 'Moriah Geene',
'Cindy Velez', 'Paulette Dale'], dtype=object)
txt_townhall_trump_only = txt_townhall_trump[txt_townhall_trump.speaker == "President Trump"][["text"]]
txt_townhall_trump_only["source"] = "townhall"
txt_trump = pd.concat([txt_1st_trump, txt_2nd_trump, txt_townhall_trump_only])
txt_trump
| | text | source |
|---|---|---|
| 3 | How are you doing? | 1st_debate |
| 6 | Thank you very much, Chris. I will tell you ve... | 1st_debate |
| 7 | And we won the election and therefore we have ... | 1st_debate |
| 10 | Thank you, Joe. | 1st_debate |
| 14 | There aren’t a hundred million people with pre... | 1st_debate |
| ... | ... | ... |
| 395 | It’s very happy. The fact is, we got rid of ca... | townhall |
| 397 | Sure. | townhall |
| 399 | That’s right. | townhall |
| 401 | Because I’ve done a great job. We have the str... | townhall |
| 403 | Thank you very much. | townhall |
706 rows × 2 columns
1-1. lowering the case
First, let’s lower the case for all text.
txt_biden.text = txt_biden.text.str.lower()
txt_trump.text = txt_trump.text.str.lower()
1-2. remove punctuation and special characters
Then, let’s remove punctuation and special characters like `` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ `` (Python’s `string.punctuation`).
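# dict.fromkeys(string.punctuation) maps every punctuation character to None, so translate() deletes them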
table = str.maketrans(dict.fromkeys(string.punctuation))
txt_biden.text = txt_biden.text.str.translate(table)
txt_trump.text = txt_trump.text.str.translate(table)
1-3. tokenization
For cases like “aren’t” and “doesn’t”, let’s tokenize the text with spaCy so contractions are split consistently.
nlp = spacy.load('en_core_web_sm')
txt_biden.text = txt_biden.text.apply(lambda x: ' '.join([token.text for token in nlp(x)]))
txt_trump.text = txt_trump.text.apply(lambda x: ' '.join([token.text for token in nlp(x)]))
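For instance, spaCy splits a contraction into its parts (a quick illustrative check):

[t.text for t in nlp("aren't we done")]   # ['are', "n't", 'we', 'done']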
Since I am interested in how many words each candidate used, a stemming step is needed so that words like ‘cat’ and ‘cats’ count as the same word. First, a quick check of how the Porter stemmer behaves on a contraction:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("i'm")
"i'm"
txt_biden.text.apply(lambda x: [ps.stem(y) for y in x.split()])
2 [how, you, do, man]
4 [i’m, well]
9 [well, first, of, all, thank, you, for, do, th...
11 [the, american, peopl, have, a, right, to, hav...
12 [now, what’, at, stake, here, is, the, preside...
...
212 [everi, singl, solitari, gener, the, dial, ha,...
214 [we’d, better, be, abl, to, do, it, again]
216 [well, it, could, say, i’m, a, lousi, candid, ...
218 [yeah, and, by, the, way, befor, i, came, up, ...
220 [i, expect, to, be, there]
Name: text, Length: 507, dtype: object
1-4. remove stop words
Since we want to count how many words each candidate used, let’s filter out stopwords that have little lexical content.
txt_biden.text = txt_biden.text.apply(lambda x: ' '.join([word for word in x.split() if word not in (STOP_WORDS)]))
txt_trump.text = txt_trump.text.apply(lambda x: ' '.join([word for word in x.split() if word not in (STOP_WORDS)]))
1-5. stemming
Since I am interested in how many words each candidate used, a stemming step is needed so that words like ‘thinks’ and ‘thinking’ are counted as the same word, ‘think’.
ps = PorterStemmer()
txt_biden.text = txt_biden.text.apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
txt_trump.text = txt_trump.text.apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
1-6. count words
data = [("1st_debate", "Biden", len(txt_biden[txt_biden.source == "1st_debate"].text.str.split(expand = True).stack().unique())),
("1st_debate", "Trump", len(txt_trump[txt_trump.source == "1st_debate"].text.str.split(expand = True).stack().unique())),
("2nd_debate", "Biden", len(txt_biden[txt_biden.source == "2nd_debate"].text.str.split(expand = True).stack().unique())),
("2nd_debate", "Trump", len(txt_trump[txt_trump.source == "2nd_debate"].text.str.split(expand = True).stack().unique())),
("townhall", "Biden", len(txt_biden[txt_biden.source == "townhall"].text.str.split(expand = True).stack().unique())),
("townhall", "Trump", len(txt_trump[txt_trump.source == "townhall"].text.str.split(expand = True).stack().unique()))]
words_count = pd.DataFrame(data, columns = ["source", "speaker", "word_counts"])
words_count
| | source | speaker | word_counts |
|---|---|---|---|
| 0 | 1st_debate | Biden | 860 |
| 1 | 1st_debate | Trump | 809 |
| 2 | 2nd_debate | Biden | 956 |
| 3 | 2nd_debate | Trump | 831 |
| 4 | townhall | Biden | 1161 |
| 5 | townhall | Trump | 728 |
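The six nearly identical expressions above could also be collapsed with explode/groupby; a sketch under the same tokenization assumptions:

def distinct_word_counts(df, speaker):
    # one row per token, then count distinct tokens per event
    tokens = df.assign(word = df.text.str.split()).explode("word")
    return tokens.groupby("source").word.nunique().rename(speaker)

pd.concat([distinct_word_counts(txt_biden, "Biden"),
           distinct_word_counts(txt_trump, "Trump")], axis = 1)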
plt.figure(figsize = (10, 8))
sns.barplot(x = "source", y = "word_counts", hue = "speaker", data = words_count)
plt.xlabel("Source", fontsize = 16)
plt.ylabel("Word counts", fontsize = 16)
plt.show()

- Biden used 860 distinct words in the first debate, 956 in the second debate, and 1161 in the town hall.
- Trump used 809 distinct words in the first debate, 831 in the second debate, and 728 in the town hall.
- Trump used roughly the same number of distinct words in every event.
- Biden used more distinct words than Trump in every event.
2. How many turns speaking did each candidate (Trump & Biden) take during each of the debates?
A “turn” is an uninterrupted period of speech. For example, consider the following exchange:
Chris: Big data is really interesting.
Colleague: Actually, it’s a bit boring.
Chris: Really? Why would you say that?
Colleague: Your choice of tools is really limited.
Colleague: I mean, you’re basically stuck with Spark, right?
Chris: Yeah, but Spark provides most of the functionality you need to do really cool data science work.
In it, Chris had 3 turns, while his colleague had 2.
Your answer should include a dataframe that looks like the following (with real values in place of __):

| | First Debate | Second Debate |
|---|---|---|
| Candidate | | |
| Trump | __ | __ |
| Biden | __ | __ |
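As a quick sanity check of the shift-based approach used below, applying it to the toy exchange above gives 3 turns for Chris and 2 for his colleague:

# a turn starts whenever the speaker differs from the previous row
toy = pd.DataFrame({"speaker": ["Chris", "Colleague", "Chris",
                                "Colleague", "Colleague", "Chris"]})
(toy.speaker != toy.speaker.shift()).groupby(toy.speaker).sum()
# Chris: 3, Colleague: 2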
txt_1st["before_speaker"] = txt_1st.speaker.shift()
txt_1st.loc[txt_1st.speaker != txt_1st.before_speaker, "is_turn"] = 1
first_debate_turn_count = pd.DataFrame(txt_1st.groupby("speaker").is_turn.sum() \
[["President Donald J. Trump", "Vice President Joe Biden"]]).rename(columns = {"is_turn" : "First Debate"})
first_debate_turn_count.index = ["Trump", "Biden"]
txt_2nd["before_speaker"] = txt_2nd.speaker.shift()
txt_2nd.loc[txt_2nd.speaker != txt_2nd.before_speaker, "is_turn"] = 1
second_debate_turn_count = pd.DataFrame(txt_2nd.groupby("speaker").is_turn.sum() \
[["Donald Trump", "Joe Biden"]]).rename(columns = {"is_turn" : "Second Debate"})
second_debate_turn_count.index = ["Trump", "Biden"]
debate_turn_count = pd.concat([first_debate_turn_count, second_debate_turn_count], axis = 1)
debate_turn_count.index.name = "Candidate"
debate_turn_count
| | First Debate | Second Debate |
|---|---|---|
| Candidate | | |
| Trump | 305.0 | 169.0 |
| Biden | 240.0 | 113.0 |
- In the first debate, Trump had 305 turns and Biden had 240 turns.
- In the second debate, Trump had 169 turns and Biden had 113 turns.
- In both debates, Trump had more turns than Biden.
3. How many different nouns did each candidate (Trump & Biden) use in each of the debates?
Present your results using a visualization (hint: a simple bar chart, with one bar per candidate per debate would be a good choice; you should decide whether you want to somehow stack or group the bars in your bar chart, or just leave them all side by side). You do not need to report which nouns were used. Provide a written interpretation of your visualization.
data = [("1st_debate", "Biden", np.sum(pd.DataFrame(txt_biden[txt_biden.source == "1st_debate"].text.str.split(expand = True).stack() \
.unique(), columns = ["word"]).word.apply(lambda x: [x.pos_ for x in nlp(x)]).str[0] == "NOUN")),
("1st_debate", "Trump", np.sum(pd.DataFrame(txt_trump[txt_trump.source == "1st_debate"].text.str.split(expand = True).stack() \
.unique(), columns = ["word"]).word.apply(lambda x: [x.pos_ for x in nlp(x)]).str[0] == "NOUN")),
("2nd_debate", "Biden", np.sum(pd.DataFrame(txt_biden[txt_biden.source == "2nd_debate"].text.str.split(expand = True).stack() \
.unique(), columns = ["word"]).word.apply(lambda x: [x.pos_ for x in nlp(x)]).str[0] == "NOUN")),
("2nd_debate", "Trump", np.sum(pd.DataFrame(txt_trump[txt_trump.source == "2nd_debate"].text.str.split(expand = True).stack() \
.unique(), columns = ["word"]).word.apply(lambda x: [x.pos_ for x in nlp(x)]).str[0] == "NOUN")),
("townhall", "Biden", np.sum(pd.DataFrame(txt_biden[txt_biden.source == "townhall"].text.str.split(expand = True).stack() \
.unique(), columns = ["word"]).word.apply(lambda x: [x.pos_ for x in nlp(x)]).str[0] == "NOUN")),
("townhall", "Trump", np.sum(pd.DataFrame(txt_trump[txt_trump.source == "townhall"].text.str.split(expand = True).stack() \
.unique(), columns = ["word"]).word.apply(lambda x: [x.pos_ for x in nlp(x)]).str[0] == "NOUN"))]
noun_counts = pd.DataFrame(data, columns = ["source", "speaker", "distinct_noun_counts"])
noun_counts
| | source | speaker | distinct_noun_counts |
|---|---|---|---|
| 0 | 1st_debate | Biden | 360 |
| 1 | 1st_debate | Trump | 316 |
| 2 | 2nd_debate | Biden | 401 |
| 3 | 2nd_debate | Trump | 333 |
| 4 | townhall | Biden | 481 |
| 5 | townhall | Trump | 285 |
plt.figure(figsize = (10, 8))
sns.barplot(x = "source", y = "distinct_noun_counts", hue = "speaker", data = noun_counts)
plt.xlabel("Source", fontsize = 16)
plt.ylabel("Different noun counts", fontsize = 16)
plt.show()

- Biden used 360 distinct nouns in the first debate, 401 in the second debate, and 481 in the town hall.
- Trump used 316 distinct nouns in the first debate, 333 in the second debate, and 285 in the town hall.
- Trump used roughly the same number of distinct nouns in every event.
- Biden used more distinct nouns than Trump in every event.
4. Report the type-token ratio for each of the candidates (Trump & Biden), based on pooling the data from the two debates, and interpret the differences between the two candidates.
In text analysis, a ‘type’ is a unique word, whereas a ‘token’ is any word. For example, the sentence
“One fish, two fish, red fish, blue fish”
contains 8 tokens, but 5 types (one, two, red, blue, fish).
Interestingly, the type-token ratio can be used to monitor changes in the use of vocabulary items in children with under-developed vocabulary and/or word finding difficulties and, for example, in adults who have suffered a stroke and who consequently exhibit word retrieval difficulties and naming difficulties. Ref
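As a quick check of the definition, the fish sentence works out to 5/8:

tokens = "one fish two fish red fish blue fish".split()
len(set(tokens)) / len(tokens)   # 5 types / 8 tokens = 0.625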
data = [("Biden",
len(txt_biden[(txt_biden.source == "1st_debate") | (txt_biden.source == "2nd_debate")].text.str.split(expand = True).stack()),
len(txt_biden[(txt_biden.source == "1st_debate") | (txt_biden.source == "2nd_debate")].text.str.split(expand = True).stack().unique())),
("Trump",
len(txt_trump[(txt_trump.source == "1st_debate") | (txt_trump.source == "2nd_debate")].text.str.split(expand = True).stack()),
len(txt_trump[(txt_trump.source == "1st_debate") | (txt_trump.source == "2nd_debate")].text.str.split(expand = True).stack().unique()))]
type_token_ratio = pd.DataFrame(data, columns = ["speaker", "num_of_token", "num_of_type"])
type_token_ratio["ratio"] = type_token_ratio.num_of_type / type_token_ratio.num_of_token
type_token_ratio
| | speaker | num_of_token | num_of_type | ratio |
|---|---|---|---|---|
| 0 | Biden | 5379 | 1346 | 0.250232 |
| 1 | Trump | 5702 | 1212 | 0.212557 |
plt.figure(figsize = (15, 7))
# bar graph for token
color = "darkblue"
ax1 = sns.barplot(x = "speaker", y = "num_of_token", color = color, alpha = 0.8, \
data = type_token_ratio)
ax1.set_xlabel("Speaker", fontsize = 16)
top_bar = mpatches.Patch(color = color, label = 'Num of Token')
# bar graph for type
color = "lightblue"
ax2 = sns.barplot(x = "speaker", y = "num_of_type", color = color, alpha = 0.8, \
data = type_token_ratio)
ax2.set_ylabel("Number of Token or Type", fontsize = 16)
low_bar = mpatches.Patch(color = color, label = 'Num of Type')
plt.legend(handles=[top_bar, low_bar])
plt.show()

- Trump’s number of tokens is greater than Biden’s.
- But Biden’s number of types is greater than Trump’s.
- As a result, Biden’s type-token ratio (0.25) is higher than Trump’s (0.21). That is, even though Trump spoke more words than Biden, Biden’s speech showed more lexical variety.
5. Compare the use of named entities by the candidates (Trump & Biden) in their town hall meetings. What patterns, if any, do you see?
spaCy supports the following Named Entity types: PERSON, NORP (nationalities, religious and political groups), FAC (buildings, airports etc.), ORG (organizations), GPE (countries, cities etc.), LOC (mountain ranges, water bodies etc.), PRODUCT (products), EVENT (event names), WORK_OF_ART (books, song titles), LAW (legal document titles), LANGUAGE (named languages), DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL.
You might want to use some visualization techniques to support your claims.
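For reference, a minimal sketch of pulling entity labels from a single sentence (the example sentence and its labels are illustrative, not from the transcripts):

doc = nlp("Joe Biden spoke at Belmont University on October 22nd.")
print([(ent.text, ent.label_) for ent in doc.ents])
# likely something like [('Joe Biden', 'PERSON'), ('Belmont University', 'ORG'), ('October 22nd', 'DATE')]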
biden_label_count = pd.DataFrame(txt_biden[txt_biden.source == "townhall"].text.apply(lambda x: ' '.join([x.label_ for x in nlp(x).ents])) \
.str.split(expand = True).stack().value_counts()).reset_index().rename(columns = {"index" : "named_entity_type", 0 : "count"})
biden_label_count["speaker"] = "Biden"
biden_label_count["description"] = biden_label_count.named_entity_type.apply(spacy.explain)
biden_label_count = biden_label_count[["speaker", "named_entity_type", "count", "description"]]
biden_label_count
| | speaker | named_entity_type | count | description |
|---|---|---|---|---|
| 0 | Biden | CARDINAL | 63 | Numerals that do not fall under another type |
| 1 | Biden | PERSON | 49 | People, including fictional |
| 2 | Biden | NORP | 27 | Nationalities or religious or political groups |
| 3 | Biden | ORG | 24 | Companies, agencies, institutions, etc. |
| 4 | Biden | GPE | 23 | Countries, cities, states |
| 5 | Biden | DATE | 22 | Absolute or relative dates or periods |
| 6 | Biden | MONEY | 9 | Monetary values, including unit |
| 7 | Biden | ORDINAL | 5 | "first", "second", etc. |
| 8 | Biden | PRODUCT | 2 | Objects, vehicles, foods, etc. (not services) |
| 9 | Biden | LAW | 1 | Named documents made into laws. |
| 10 | Biden | EVENT | 1 | Named hurricanes, battles, wars, sports events... |
| 11 | Biden | TIME | 1 | Times smaller than a day |
| 12 | Biden | QUANTITY | 1 | Measurements, as of weight or distance |
plt.figure(figsize = (20, 8))
sns.barplot(x = "named_entity_type", y = "count", data = biden_label_count)
plt.xlabel("Named Entity types", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
plt.title("Biden's named entity type count", fontsize = 20)
plt.show()

- Biden used 13 distinct named-entity types.
- Among the entity types Biden used, the most common is CARDINAL, followed by PERSON.
- Types related to specific groups, such as NORP, ORG, and GPE, came next, with roughly 25 occurrences each.
trump_label_count = pd.DataFrame(txt_trump[txt_trump.source == "townhall"].text.apply(lambda x: ' '.join([x.label_ for x in nlp(x).ents])) \
.str.split(expand = True).stack().value_counts()).reset_index().rename(columns = {"index" : "named_entity_type", 0 : "count"})
trump_label_count["speaker"] = "Trump"
trump_label_count["description"] = trump_label_count.named_entity_type.apply(spacy.explain)
trump_label_count = trump_label_count[["speaker", "named_entity_type", "count", "description"]]
trump_label_count
| | speaker | named_entity_type | count | description |
|---|---|---|---|---|
| 0 | Trump | GPE | 41 | Countries, cities, states |
| 1 | Trump | CARDINAL | 38 | Numerals that do not fall under another type |
| 2 | Trump | PERSON | 29 | People, including fictional |
| 3 | Trump | DATE | 22 | Absolute or relative dates or periods |
| 4 | Trump | NORP | 16 | Nationalities or religious or political groups |
| 5 | Trump | ORG | 15 | Companies, agencies, institutions, etc. |
| 6 | Trump | ORDINAL | 3 | "first", "second", etc. |
| 7 | Trump | LAW | 3 | Named documents made into laws. |
| 8 | Trump | MONEY | 1 | Monetary values, including unit |
| 9 | Trump | QUANTITY | 1 | Measurements, as of weight or distance |
plt.figure(figsize = (20, 8))
sns.barplot(x = "named_entity_type", y = "count", data = trump_label_count)
plt.xlabel("Named Entity types", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
plt.title("Trump's named entity type count", fontsize = 20)
plt.show()

- Trump used 10 distinct named-entity types.
- Among the entity types Trump used, the most common is GPE, followed by CARDINAL.
- Among the group-related types (NORP, ORG, GPE), Trump relied most on GPE (countries, cities, and states), with 41 occurrences; NORP and ORG appeared about 15 times each, roughly half of GPE.
label_count = pd.concat([biden_label_count, trump_label_count])
plt.figure(figsize = (20, 8))
sns.barplot(x = "named_entity_type", y = "count", hue = "speaker", data = label_count)
plt.xlabel("Named Entity types", fontsize = 16)
plt.ylabel("Count", fontsize = 16)
plt.title("Named entity type count comparison", fontsize = 20)
plt.show()

- Biden used a wider variety of entity types than Trump.
- Biden used PERSON-type words heavily, while Trump especially favored GPE-type words (countries, cities, and states).
- For every entity type except GPE and LAW, Biden’s counts match or exceed Trump’s.
6. Generate a word cloud for Trump’s town hall (use only Trump’s words) and one for Biden’s town hall (again, use only Biden’s words). Compare and contrast the two word clouds.
Install a word cloud generating package like the one from https://github.com/amueller/word_cloud
biden_wordcloud = WordCloud().generate(" ".join(txt_biden[txt_biden.source == "townhall"].text))
trump_wordcloud = WordCloud().generate(" ".join(txt_trump[txt_trump.source == "townhall"].text))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (30, 8))
# Biden's word cloud
ax1.imshow(biden_wordcloud, interpolation='bilinear')
ax1.set_title("Biden's word cloud", fontsize = 30)
ax1.axis("off")
# Trump's word cloud
ax2.imshow(trump_wordcloud, interpolation='bilinear')
ax2.set_title("Trump's word cloud", fontsize = 30)
ax2.axis("off")
plt.show()

- Looking at the two word clouds, the stems ‘go’ and ‘peopl’ (from ‘go/going’ and ‘people’) are prominent for both Biden and Trump.
- In Biden’s words, ‘thing’, ‘presid’, and ‘abl’ appear frequently; in Trump’s, ‘know’, ‘want’, and ‘happen’ do.
7. Sentiment analysis of Trump vs. Biden in second debate
Note that you will want to review the video on Sentiment Analysis with NLTK and spaCy from the pre-recorded lectures.
Perform a sentiment analysis that compares the sentiment of the two candidates (Trump & Biden) in the second debate.
There are several ways to do this. In its simplest form, you could determine the sentiment for each utterance, then perform some mathematical summary (sum, mean, etc.). A slightly more elegant solution would involve tracking how sentiment changes over the course of the debate for each of the candidates. A further extension of your work might use the pandas.DataFrame.rolling() method to calculate rolling averages, although using that function will probably require you to also learn about time functions in pandas.
No matter what approach you choose, you should be sure to use visualizations in your answer. As always, provide an interpretation of your results, including any visualizations (i.e. don’t just provide a visualization, explain what’s interesting that you see in your visualizations).
sent_analyzer = SentimentIntensityAnalyzer()
def sentiment_scores(docx):
return sent_analyzer.polarity_scores(docx.text)
nlp = spacy.load('en_core_web_sm')
Doc.set_extension("sentimenter", getter = sentiment_scores, force = True)
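A quick check that the extension works (illustrative; exact scores vary by VADER version):

nlp("I love this!")._.sentimenter
# expect a dict with 'neg', 'neu', 'pos', 'compound'; 'compound' should be clearly positive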
# calculate the running (cumulative) average sentiment score for Biden's 2nd debate
txt_2nd_biden["rating"] = txt_2nd_biden.text.apply(lambda x: nlp(x)._.sentimenter['compound'])
txt_2nd_biden['time_index'] = np.arange(1, txt_2nd_biden.shape[0] + 1)
txt_2nd_biden["rating_cumsum"] = txt_2nd_biden.rating.cumsum()
txt_2nd_biden["rating_rolling_average"] = txt_2nd_biden.rating_cumsum / txt_2nd_biden.time_index
txt_2nd_biden
| | text | source | rating | time_index | rating_cumsum | rating_rolling_average |
|---|---|---|---|---|---|---|
| 9 | 220,000 Americans dead. You hear nothing else ... | 2nd_debate | 0.1518 | 1 | 0.1518 | 0.151800 |
| 10 | The expectation is we’ll have another 200,000 ... | 2nd_debate | -0.5423 | 2 | -0.3905 | -0.195250 |
| 11 | What I would do is make sure we have everyone ... | 2nd_debate | 0.8834 | 3 | 0.4929 | 0.164300 |
| 12 | We’re in a situation now where the New England... | 2nd_debate | -0.6449 | 4 | -0.1520 | -0.038000 |
| 20 | Make sure it’s totally transparent. Have the s... | 2nd_debate | -0.0516 | 5 | -0.2036 | -0.040720 |
| ... | ... | ... | ... | ... | ... | ... |
| 495 | He takes everything out of context, but the po... | 2nd_debate | 0.3919 | 129 | -0.8321 | -0.006450 |
| 501 | No, I’m going to rejoin Paris Accord and make ... | 2nd_debate | -0.0258 | 130 | -0.8579 | -0.006599 |
| 508 | I will say, I’m an American President. I repre... | 2nd_debate | 0.9001 | 131 | 0.0422 | 0.000322 |
| 509 | We can grow this economy, we can deal with the... | 2nd_debate | 0.9628 | 132 | 1.0050 | 0.007614 |
| 511 | Thank you. | 2nd_debate | 0.3612 | 133 | 1.3662 | 0.010272 |
133 rows × 6 columns
# calculate the running (cumulative) average sentiment score for Trump's 2nd debate
txt_2nd_trump["rating"] = txt_2nd_trump.text.apply(lambda x: nlp(x)._.sentimenter['compound'])
txt_2nd_trump['time_index'] = np.arange(1, txt_2nd_trump.shape[0] + 1)
txt_2nd_trump["rating_cumsum"] = txt_2nd_trump.rating.cumsum()
txt_2nd_trump["rating_rolling_average"] = txt_2nd_trump.rating_cumsum / txt_2nd_trump.time_index
txt_2nd_trump
| | text | source | rating | time_index | rating_cumsum | rating_rolling_average |
|---|---|---|---|---|---|---|
| 1 | How are you doing? How are you? | 2nd_debate | 0.0000 | 1 | 0.0000 | 0.000000 |
| 5 | So as you know, 2.2 million people modeled out... | 2nd_debate | -0.9190 | 2 | -0.9190 | -0.459500 |
| 6 | There was a very big spike in Texas. It’s now ... | 2nd_debate | 0.3612 | 3 | -0.5578 | -0.185933 |
| 7 | I can tell you from personal experience, I was... | 2nd_debate | 0.5574 | 4 | -0.0004 | -0.000100 |
| 14 | No, it’s not a guarantee, but it will be by th... | 2nd_debate | 0.6736 | 5 | 0.6732 | 0.134640 |
| ... | ... | ... | ... | ... | ... | ... |
| 499 | Is he going to get China to do it? | 2nd_debate | 0.0000 | 186 | 6.1531 | 0.033081 |
| 502 | [Crosstalk 00:24:31]. | 2nd_debate | 0.0000 | 187 | 6.1531 | 0.032904 |
| 504 | We have to make our country totally successful... | 2nd_debate | 0.3708 | 188 | 6.5239 | 0.034702 |
| 505 | Before the plague came in, just before, I was ... | 2nd_debate | 0.6808 | 189 | 7.2047 | 0.038120 |
| 506 | Success is going to bring us together. We are ... | 2nd_debate | -0.9655 | 190 | 6.2392 | 0.032838 |
190 rows × 6 columns
plt.figure(figsize = (15,7))
# line plot of Biden's running average sentiment
color = "tab:blue"
sns.lineplot(data = txt_2nd_biden.rating_rolling_average, color = color, linewidth = 3)
blue_line = mpatches.Patch(color = color, label = 'Biden')
# line plot of Trump's running average sentiment
color = "tab:red"
sns.lineplot(data = txt_2nd_trump.rating_rolling_average, color = color, linewidth = 3)
red_line = mpatches.Patch(color = color, label = 'Trump')
plt.legend(handles=[blue_line, red_line])
plt.xlabel("Time index", fontsize = 16)
plt.ylabel("Rolling average of sentimental rate", fontsize = 16)
plt.show()

- For both candidates, the running average swings widely at the beginning and flattens as the debate progresses, as expected for a cumulative average over a growing number of utterances.
- Trump’s swings in sentiment are larger than Biden’s.
- In general, Trump’s running average sentiment score stays higher than Biden’s.
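Note that `rating_rolling_average` above is in fact a cumulative (expanding) mean. A fixed-window rolling mean, as the prompt suggests, could be sketched as follows (the window size of 10 utterances is an arbitrary choice, and `rating_window10` is a hypothetical column name):

# windowed rolling mean of per-utterance sentiment; window=10 is arbitrary
txt_2nd_biden["rating_window10"] = txt_2nd_biden.rating.rolling(window = 10, min_periods = 1).mean()
txt_2nd_trump["rating_window10"] = txt_2nd_trump.rating.rolling(window = 10, min_periods = 1).mean()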