Text Analysis with Bias

I will try using tidytext on a new dataset about Russian troll tweets. These are tweets from Twitter handles that are connected to the Internet Research Agency (IRA), a Russian “troll factory.”

Iván López Torres
2022-02-15

The majority of these tweets were posted between 2015 and 2017, but the dataset covers tweets from February 2012 to May 2018. Documentation can be found at https://github.com/fivethirtyeight/russian-troll-tweets/.

We will focus on three of the main categories of troll tweets: Left Trolls, Right Trolls, and News Feed. Left Trolls usually pretend to be BLM activists, aiming to divide the Democratic Party. Right Trolls imitate Trump supporters. News Feed handles pose as “local news aggregators” and typically link to legitimate news.

For our upcoming analyses, some important variables are:

Metadata:

Original Dataset Dimension: [Columns=21, Rows=239350]

English Tweets Dataset Dimension: [Columns=21, Rows=175966]
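The drop from 239350 to 175966 rows reflects keeping only English-language tweets. A minimal sketch of that filtering step with dplyr is shown here; the toy `tweets` table, its handles, and its rows are made up for illustration, and the column names are assumed to match those shown in Table 1.

```r
library(dplyr)

# Toy stand-in for the full dataset (hypothetical rows; column names
# are assumed to follow Table 1).
tweets <- tibble(
  author   = c("HANDLE_A", "HANDLE_B", "HANDLE_C"),
  content  = c("Breaking news from downtown",
               "Noticias del dia",
               "Stay tuned for more"),
  language = c("English", "Spanish", "English")
)

# Keep only the tweets labelled as English.
english_tweets <- tweets %>%
  filter(language == "English")

dim(tweets)          # rows and columns before filtering
dim(english_tweets)  # rows and columns after filtering
```

Filtering changes only the number of rows, which is consistent with both dimension lines above reporting 21 columns.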

Table 1: Summary of Latest 5 Tweets

| author | content | language | creation_date |
|---|---|---|---|
| WOKELUISA | Michelle Obama is the most academically accomplished First Lady. She skipped second grade, graduated salutatorian at her Magnet high school for gifted students, went to Princeton (graduating cum laude) and then Harvard Law School. https://t.co/ee9RhFCpJx | English | 2018-03-21 23:29:00 |
| WOKELUISA | There’s actually one good thing about the Trump presidency. It has finally exposed ‘evangelical Christians’ for what they are - misogynists, pedophile supporters and Nazi sympathizers. | English | 2018-03-21 20:25:00 |
| WOKELUISA | This administration is more willing to ban entire RACES of people than to ban assault rifles | English | 2018-03-20 23:04:00 |
| WOKELUISA | Stay tuned to Dicki-Leaks! https://t.co/H3f65T4K4S | English | 2018-03-20 21:57:00 |
| WOKELUISA | Does anyone believe @CamAnalytica CEO Alexander Nix committed the misconduct all by himself? NO. Others had to be involved. Did Jared Kushner, who hired Cambridge Analytica for the Trump campaign, know? Also, why is Kushner still a Senior White House Advisor? https://t.co/k0k1lrKfIb | English | 2018-03-20 21:54:00 |

Next, I will plot the locations from which the tweets were posted:

Exploratory plot of the account categories of the tweets:

A subset of the tweets was created to see how often the top words appear:
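A rough sketch of how such word counts can be produced with tidytext follows. It is not necessarily the exact code behind the plot: the toy `tweets` table is hypothetical, and removing stop words with tidytext's built-in `stop_words` table is an assumption about the preprocessing.

```r
library(dplyr)
library(tidytext)

# Hypothetical mini-corpus standing in for the subset of tweets.
tweets <- tibble(
  author  = c("HANDLE_A", "HANDLE_B"),
  content = c("The troll posted a tweet",
              "Another troll, another tweet about the troll")
)

top_words <- tweets %>%
  unnest_tokens(word, content) %>%        # one lowercased word per row
  anti_join(stop_words, by = "word") %>%  # drop common English stop words
  count(word, sort = TRUE)                # most frequent words first

top_words
```

`unnest_tokens()` lowercases the text and strips punctuation by default, so "troll," and "troll" are counted as the same word.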

Sentiment analysis is a common next step once text data has been collected. Its primary objective is to investigate the opinions, attitudes, and emotions that users express towards a subject of interest.

The bing lexicon categorizes words in a binary fashion into positive and negative categories.

I will report how many positive and negative words there are in the dataset:

Table 2: Positive and Negative Words in the Dataset
sentiment n
negative 91183
positive 61695
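Counts like those in Table 2 can be obtained by joining the tokenized tweets against the bing lexicon. This is a minimal sketch on a hypothetical two-tweet table, not the original analysis code; `get_sentiments("bing")` is the tidytext call that returns the lexicon as a table with `word` and `sentiment` columns.

```r
library(dplyr)
library(tidytext)

# Hypothetical tweets used only to illustrate the join.
tweets <- tibble(
  content = c("What a wonderful, happy day",
              "This is a terrible and awful mess")
)

sentiment_counts <- tweets %>%
  unnest_tokens(word, content) %>%                     # one word per row
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only lexicon words
  count(sentiment)                                     # tally by sentiment label

sentiment_counts
```

Because `inner_join()` keeps only words that appear in the lexicon, the totals describe sentiment-bearing words, not all words in the tweets.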

According to the table above, negative words make up approximately 60% of the sentiment-bearing words in these tweets (91,183 of 152,878), a skew that can introduce misconceptions and biases into the data.
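The 60% figure follows directly from the counts in Table 2. A short check, with the counts typed in by hand from the table:

```r
library(dplyr)

# Counts copied from Table 2.
sentiment_counts <- tibble(
  sentiment = c("negative", "positive"),
  n         = c(91183, 61695)
)

# Share of each sentiment among all words matched by the lexicon.
sentiment_shares <- sentiment_counts %>%
  mutate(share = n / sum(n))

round(100 * sentiment_shares$share, 1)  # 59.6 40.4
```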

Using the words above, I created the following word cloud:

And this is a wordcloud colored by sentiment:

“Undoing” bias

Four things to know about race and gender bias in algorithms:

Cognitive science research shows that humans are unable to identify their own biases. And since humans create algorithms, bias blind-spots will multiply unless we create systems to shine a light, gauge risks, and systematically eliminate them.

Some people use these misconceptions to escape responsibility, so it’s important to be careful.

Understanding the various causes of biases is the first step in the adoption of effective algorithmic hygiene. Even when flaws in the training data are corrected, the results may still be problematic because context matters during the bias detection phase.

Some decisions will be best served by algorithms and other AI tools, while others may need thoughtful consideration before computer models are designed. Further, testing and review of certain algorithms will also identify and, at best, mitigate discriminatory outcomes.