I will try using tidytext on a new dataset about Russian troll tweets. These are tweets from Twitter handles that are connected to the Internet Research Agency (IRA), a Russian “troll factory.”
The majority of these tweets were posted from 2015 to 2017, but the dataset spans February 2012 to May 2018. Documentation can be found at https://github.com/fivethirtyeight/russian-troll-tweets/.
We will focus on three of the main categories of troll tweets: Left Trolls, Right Trolls, and News Feed. Left Trolls usually pretend to be Black Lives Matter activists, aiming to divide the Democratic Party; Right Trolls imitate Trump supporters; and News Feed handles act as “local news aggregators”, typically linking to legitimate news.
For our upcoming analyses, some important variables are:
Metadata:

- Original dataset: 239,350 rows × 21 columns
- English tweets subset: 175,966 rows × 21 columns
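As a sketch of how these dimensions could be obtained (the file name below is illustrative; the repository splits the data across several `IRAhandle_tweets_*.csv` files, and the exact file(s) behind the numbers above are not specified):

```r
library(tidyverse)

# Read one of the published CSV files (assumed downloaded locally)
troll_tweets <- read_csv("IRAhandle_tweets_1.csv")

dim(troll_tweets)  # full dataset: rows x columns

# Keep only tweets labelled as English
english_tweets <- troll_tweets %>%
  filter(language == "English")

dim(english_tweets)  # English-only subset
```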
author | content | language | creation_date |
---|---|---|---|
WOKELUISA | Michelle Obama is the most academically accomplished First Lady. She skipped second grade, graduated salutatorian at her Magnet high school for gifted students, went to Princeton (graduating cum laude) and then Harvard Law School. https://t.co/ee9RhFCpJx | English | 2018-03-21 23:29:00 |
WOKELUISA | There’s actually one good thing about the Trump presidency. It has finally exposed ‘evangelical Christians’ for what they are - misogynists, pedophile supporters and Nazi sympathizers. | English | 2018-03-21 20:25:00 |
WOKELUISA | This administration is more willing to ban entire RACES of people than to ban assault rifles | English | 2018-03-20 23:04:00 |
WOKELUISA | Stay tuned to Dicki-Leaks! https://t.co/H3f65T4K4S | English | 2018-03-20 21:57:00 |
WOKELUISA | Does anyone believe @CamAnalytica CEO Alexander Nix committed the misconduct all by himself? NO. Others had to be involved. Did Jared Kushner, who hired Cambridge Analytica for the Trump campaign, know? Also, why is Kushner still a Senior White House Advisor? https://t.co/k0k1lrKfIb | English | 2018-03-20 21:54:00 |
Next, I will plot basic exploratory data on the locations from which the tweets were posted:
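A sketch of how such a plot could be built with ggplot2, assuming a `region` column as in the published CSVs (the column name and the choice of top ten are my own):

```r
english_tweets %>%
  count(region, sort = TRUE) %>%   # tweets per location
  slice_max(n, n = 10) %>%         # keep the ten most common
  ggplot(aes(x = reorder(region, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Region", y = "Number of tweets",
       title = "Tweet counts by region")
```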
Exploratory plot of the account categories of the tweets:
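The same pattern works for account categories (again a sketch, assuming an `account_category` column as in the published data):

```r
english_tweets %>%
  count(account_category, sort = TRUE) %>%
  ggplot(aes(x = reorder(account_category, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Account category", y = "Number of tweets",
       title = "Tweet counts by account category")
```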
A subset of the tweets was created to see how often the top words appear:
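A sketch of the tidytext tokenization behind this step, assuming the `content` column holds the tweet text; the `tidy_tweets` name and the URL-stripping regex are my own choices:

```r
library(tidytext)

tidy_tweets <- english_tweets %>%
  mutate(content = str_remove_all(content, "https?://\\S+")) %>%  # drop URLs
  unnest_tokens(word, content) %>%    # one word per row
  anti_join(stop_words, by = "word")  # remove common stop words

# Most frequent words in the subset
tidy_tweets %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20)
```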
Sentiment analysis is a popular research topic once data has been collected. Its primary objective is to investigate users’ opinions, attitudes, and emotions toward a subject of interest.
The bing lexicon categorizes words in a binary fashion into positive and negative categories.
I will report how many positive and negative words there are in the dataset:
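A sketch of the bing join that would produce counts like the table below, assuming the tokenized `tidy_tweets` frame from above:

```r
tidy_tweets %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep sentiment-bearing words
  count(sentiment)                                     # tally positive vs. negative
```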
sentiment | n |
---|---|
negative | 91183 |
positive | 61695 |
According to the table above, these tweets contain more negative than positive words (negative words make up approximately 60% of the sentiment matches), which can introduce bias into downstream analyses.
Using these words, the following wordcloud was created:
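A sketch using the wordcloud package (the word limit is an arbitrary choice):

```r
library(wordcloud)

tidy_tweets %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
```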
And this is a wordcloud colored by sentiment:
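A sentiment-colored cloud can be built with `comparison.cloud()`, following the common tidytext recipe (the colors are arbitrary):

```r
library(wordcloud)
library(reshape2)

tidy_tweets %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%  # word-by-sentiment matrix
  comparison.cloud(colors = c("firebrick", "darkgreen"),
                   max.words = 100)
```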
Four things to know about race and gender bias in algorithms:

1. Automation bias occurs when the mere introduction of an algorithm amplifies the bias already present in human discretion.
2. Algorithms don’t create the bias, but they do transmit it.
3. There are many other biases; race and gender bias are just the most obvious. As legally protected attributes, they are the least likely to appear in metadata, and we are not good at identifying accurate proxies for them.
4. It’s fixable! All data embeds a worldview and all models have some bias; most interventions simply try to bias the model toward more inclusive (and legal) outcomes.
Cognitive science research shows that humans are unable to identify their own biases. Since humans create algorithms, bias blind spots will multiply unless we build systems to shine a light on them, gauge risks, and systematically eliminate them.
Some people use these misconceptions about algorithmic bias to escape responsibility, so it is important to be careful.
Understanding the various causes of biases is the first step in the adoption of effective algorithmic hygiene. Even when flaws in the training data are corrected, the results may still be problematic because context matters during the bias detection phase.
Some decisions will be best served by algorithms and other AI tools, while others may need thoughtful human consideration before computer models are designed. Further, testing and review of algorithms can identify and, at best, mitigate discriminatory outcomes.