Upon learning about the Trump Twitter Archive, I immediately recognised the opportunity to subject the President’s tweets to some analysis. Without doubt President Trump, before many other politicians, understood the power of Twitter as a means of disseminating his message. His message has many characteristic features. Perhaps this is not surprising given the President’s former career as a reality television star. One phrase from the election campaign that stood out to me, an uninformed observer from outside the US, was ‘Crooked Hillary’. It seemed to me characteristic of the many short, emotive phrases that the President has become famous for.
This brevity and distinctness seem to lend themselves to Twitter’s hashtag system. The archive also collects plenty of other data that can be used alongside the hashtags. Of course, inferring all the complexities of the President’s communication style from hashtags would be a foolish oversimplification. That said, I believe there may be some interesting insight to be gleaned from the data. The first step, then, would be to identify the most common hashtags and explore the data.
A quick search shows that the site archives the President’s tweets by year in JSON format, ranging all the way back to his first tweet on May 4th, 2009 up to the present day. With the help of Scrapy it is relatively easy to acquire and process the data for analysis.
In the interest of orderliness, I decided to use the period from 2009-2018. This accounts for 36307 tweets. Prior to 27/01/2017, the site did not update the Twitter feed in real time, so approximately 4000 tweets that the President deleted before they could be archived are missing.
JSON files convert into a dictionary in Python, which is then converted into a Pandas data frame. I passed the annual archive of the tweets into an empty list and then concatenated them all into a single Pandas data frame. In the interest of understanding, I also created a hashtag column, with a binary classification, to get a feel for the number of tweets that contain a hashtag. I graphed this by year as a quick insight into Trump’s hashtag behaviour.
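The consolidation step described above can be sketched roughly as follows. This is a minimal illustration rather than the original code: the `archives` dictionary stands in for the parsed yearly JSON files, and the field names (`text`, `retweet_count`) and the column name `has_hashtag` are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the parsed yearly JSON archives: each year maps
# to a list of tweet records. The field names are assumptions.
archives = {
    2015: [{"text": "Thank you Iowa! #Trump2016", "retweet_count": 1200}],
    2016: [
        {"text": "Big rally tonight in Ohio.", "retweet_count": 800},
        {"text": "#MAGA #DrainTheSwamp", "retweet_count": 9500},
    ],
}

# Collect one data frame per year in a list, then concatenate them.
frames = []
for year, records in archives.items():
    frame = pd.DataFrame(records)
    frame["year"] = year
    frames.append(frame)

tweets = pd.concat(frames, ignore_index=True)

# Crude binary flag: 1 if the text contains a '#' anywhere, else 0.
tweets["has_hashtag"] = tweets["text"].str.contains("#", regex=False).astype(int)

# Share of tweets per year that contain at least one '#'.
share_by_year = tweets.groupby("year")["has_hashtag"].mean()
```

Because the flag only checks for the `#` character, it cannot tell which hashtag was used or how many appear in a tweet.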
This is a far from perfect method, but it does provide some insight. Judged solely on these charts, the proportion of the President’s tweets containing a hashtag seems to have increased in 2015 and 2016. These years correspond to his election campaign. It could be that Trump found hashtags to be a particularly useful means of communication on Twitter. So, with the data consolidated, let’s take a look at what we can do with it.
As mentioned above, the method by which I created a hashtag column left too much room for error. It was unable to identify which hashtags were used, whether the # referred to something else, or whether there were multiple hashtags in a tweet.
Luckily, spaCy, a library for Natural Language Processing in Python, is able to do this. I added the pattern of a hashtag (#, followed by text) to the Matcher class, then used the Counter to make note of the number of hashtags that arose in the tweets.
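As a rough illustration of the counting step, here is a sketch that swaps the spaCy Matcher for a plain regular expression from the standard library, so it is a stand-in for the approach described above rather than the same code. The pattern, the function name `count_hashtags` and the sample texts are all assumptions.

```python
import re
from collections import Counter

# '#' followed by word characters, at least one of which is an ASCII letter
# or underscore, so that "#1" is not treated as a hashtag. This is a
# simplification of Twitter's actual hashtag rules.
HASHTAG_PATTERN = re.compile(r"#(?=\w*[A-Za-z_])\w+")


def count_hashtags(texts):
    """Count every hashtag occurrence across an iterable of tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(HASHTAG_PATTERN.findall(text))
    return counts


sample_texts = [
    "Thank you Iowa! #Trump2016 #MAGA",
    "We are #1 in the polls",
    "#MAGA rally tonight",
]
hashtag_counts = count_hashtags(sample_texts)

# The ten most frequent hashtags, as (hashtag, count) pairs.
top_hashtags = hashtag_counts.most_common(10)
```

Unlike the binary column, this records which hashtags appear, handles several hashtags in one tweet, and skips a `#` that is followed only by digits.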
To ensure that I was on the right path, I took note of the top 10. Interestingly, the majority of these are hashtags that refer to the election campaign, or arose during that time.
Next came the task of processing this data for use in analysis. I decided to use a method called ‘one hot encoding’. Much like the approach I used earlier for my hashtag column, one hot encoding creates a binary indicator for a categorical variable. For example, if a tweet contains ‘#trump2016’, it is recorded as 1; if it does not contain ‘#trump2016’, it is recorded as 0.
In the interest of brevity, I decided to only add Trump’s 100 most common hashtags. This was done by looping through every tweet and creating a dictionary for every row. Within this loop I would then loop through all of the 100 most common hashtags and use the hashtag as a key and 1 or 0 as a value depending on whether the tweet included the hashtag in question. I then appended the dictionary to an empty list and converted it into a Pandas data frame. Finally, I concatenated these new entries into the original data frame.
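The loop described above might look something like the following sketch. The sample tweets, the two-item `top_hashtags` list (standing in for the 100 most common hashtags) and the simple `#\w+` pattern are assumptions made for illustration.

```python
import re

import pandas as pd

tweets = pd.DataFrame({"text": [
    "Thank you Iowa! #Trump2016 #MAGA",
    "Big rally tonight in Ohio.",
    "#MAGA rally tonight",
]})

# Stands in for the list of the 100 most common hashtags.
top_hashtags = ["#MAGA", "#Trump2016"]

rows = []
for text in tweets["text"]:
    # Compare whole hashtags so that '#MAGA' does not match '#MAGARally'.
    found = set(re.findall(r"#\w+", text))
    # One dictionary per tweet: hashtag -> 1 if present, else 0.
    rows.append({tag: int(tag in found) for tag in top_hashtags})

indicators = pd.DataFrame(rows, index=tweets.index)

# Attach the indicator columns to the original data frame.
encoded = pd.concat([tweets, indicators], axis=1)
```

Each hashtag becomes its own column, so a tweet with several hashtags simply has a 1 in several columns.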
A quick check to have a look at the new headers.
Now let’s have some fun. Twitter has two functions that can be thought of as, to an extent at least, numerically reflecting popularity. These are the retweet and favourite functions. As someone who doesn’t use Twitter, I was unsure which better reflected popularity. Luckily, some smart people at Wellesley College have already written a paper about this, suggesting that retweets indicate agreement, endorsement and trust, so I decided to follow their lead.
Plotting the distribution of the retweet count of the President’s tweets, unsurprisingly, shows a positive skew with a long tail on the right. This tells us nothing more than that the vast majority of the President’s tweets have received fewer than 1000 retweets. But it also suggests that there are some tweets that are several orders of magnitude more popular than the mean.
This is confirmed by the descriptive statistics of the retweet count.
This got me thinking about whether there are particular hashtags associated with these far more popular tweets. If there are, what is the lifecycle of such a hashtag? Are they messages the President can lean upon for a show of support when he sees fit? Or do they refer to something more specific to the news cycle?
One way of identifying the hashtags that are most popular with the President’s Twitter followers would be to find those with the highest ‘retweet ratio’. That is, those hashtags that garner the highest number of retweets relative to their use. Next, for the hashtags with the highest retweet ratio, one could look at the frequency of their use over time and investigate their lifecycle. If the President had some ‘go-to’ hashtags, it is conceivable their use and popularity would endure over time.
Obtaining the retweet ratio was as simple as summing the retweets for a particular hashtag and dividing the total by the number of tweets that include that hashtag. I then graphed these in descending order on a bar chart. As this is an exploration, I decided to work with only the top 10.
Here is a list of the top ten, with the corresponding retweet ratio.
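The calculation can be sketched as below, assuming a data frame with a `retweet_count` column and one indicator column per hashtag. The column names and the numbers are made up for illustration.

```python
import pandas as pd

# Illustrative data only: one row per tweet, one indicator column per hashtag.
encoded = pd.DataFrame({
    "retweet_count": [100, 300, 50, 1000],
    "#MAGA": [1, 1, 0, 0],
    "#FakeNews": [0, 0, 0, 1],
})
hashtag_columns = ["#MAGA", "#FakeNews"]

ratios = {}
for tag in hashtag_columns:
    # Number of tweets that use this hashtag.
    uses = encoded[tag].sum()
    # Total retweets across the tweets that use this hashtag.
    total_retweets = encoded.loc[encoded[tag] == 1, "retweet_count"].sum()
    ratios[tag] = total_retweets / uses

# Sort in descending order and keep the ten highest ratios.
retweet_ratio = pd.Series(ratios).sort_values(ascending=False)
top_ten = retweet_ratio.head(10)
```

Defined this way, the retweet ratio is the mean retweet count of the tweets containing a given hashtag, so a hashtag used only a handful of times can rank very highly.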
Unsurprisingly, given the President’s enmity towards the media, #FakeNews tops the bill. A glance at the hashtag’s descriptive statistics for retweet count shows it to be an outlier, with a mean of 19279 retweets and a standard deviation of 8971, compared to a sample mean of 4397 and a standard deviation of 9443.
The shape of its distribution is also quite different to that of the sample.
Most striking, to me at least, was the number of entries: only 19. To get a better idea of this hashtag’s lifecycle, I graphed a cumulative sum of the rolling 7-day mean of the hashtag’s retweet count from its first use to its last. This suggests a relatively enduring popularity since its inception. The hashtag’s use roughly spans the time from the beginning of Donald Trump’s Presidency until the end of the period under study.
Looking at the other hashtags with the highest retweet ratios, it became apparent that they roughly fit into two categories: those like #FakeNews that spanned multiple years, and much shorter-term hashtags that typically relate to an event in the news cycle.
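One possible reading of that calculation is sketched below, using a time-based rolling window in Pandas. The dates, retweet counts and column names are invented for illustration, and the original analysis may have defined the window differently (for example, over a fixed number of tweets rather than calendar days).

```python
import pandas as pd

# Illustrative data only: tweets containing a single hashtag.
fake_news = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2017-02-01", "2017-02-03", "2017-02-20", "2017-02-22"]
    ),
    "retweet_count": [10000, 20000, 30000, 10000],
})

# A time-based window needs a sorted datetime index.
series = (
    fake_news.sort_values("created_at")
    .set_index("created_at")["retweet_count"]
)

# Mean retweet count over the 7 days up to and including each tweet.
rolling_mean = series.rolling("7D").mean()

# Running total of those rolling means, from first use to last.
cumulative = rolling_mean.cumsum()
```

A cumulative curve that keeps climbing at a steady rate suggests sustained popularity, while a curve that flattens suggests the hashtag has fallen out of use.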
The longer-term hashtags included: #FakeNews, #DrainTheSwamp, #MAGA (only this variation), #LESM, #CrookedHillary.
The shorter-term hashtags, whose period of use ranged from a day to two weeks, included: #JobsNotMobs, #HurricaneHarvey, #MAGARally, #DebateNight.
The outlier was #debate, which has had irregular use and popularity. It also pre-dates Trump’s Presidency.
Of the longer-term hashtags, #FakeNews, #DrainTheSwamp, #LESM (Law Enforcement Social Media) and #MAGA all date from just before Trump’s Presidency or during the election campaign and appear to run on to the end of the period in question. To an uninformed observer of US politics such as myself, they also seem to thematically reflect the core of the President’s message. So, perhaps it is unsurprising that they rank amongst his most popular hashtags.
#CrookedHillary, on the other hand, was used for just over a year, starting during the election campaign, and now appears to be defunct. It might be felt that, with Mrs. Clinton’s defeat in the Presidential election of 2016 and her announcement that she will not run for President in 2020, she no longer warrants the attention she hitherto received from the President.
This all suggests that the core of the President’s message has remained the same from the election up until now.
In terms of the methodology used, here are some thoughts for future investigations. Of the tweets that number amongst those with the highest retweet ratio, most date from Trump’s Presidency or his campaign. In the case of further investigation, it might be wise to separate tweets by period: pre-politics, Presidential nominee/President elect and Presidency. This might reveal some interesting insights into how Trump’s message has changed.
A weakness of the methodology was its inability to differentiate between variations of the same hashtag. Among the 100 most frequent hashtags, there are 3 variations of #MAGA (#MAGA, #MAKEAMERICAGREATAGAIN, #MakeAmericaGreatAgain). With a little tweaking, I’m sure this problem could be solved, and the variations aggregated.
For future investigation, it might be interesting to run some regression models and see if we can predict the popularity of a tweet based on the inclusion of certain hashtags.
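One way the variations might be aggregated is sketched here: lower-case every hashtag and map known variants onto a single canonical form before counting. The alias table, the function name `normalise` and the counts are hypothetical, and this was not part of the analysis above.

```python
from collections import Counter

# Hypothetical alias table: canonical hashtag -> its lower-cased variants.
ALIASES = {"#maga": {"#maga", "#makeamericagreatagain"}}

# Invert the table so each variant points to its canonical form.
CANONICAL = {
    variant: canon
    for canon, variants in ALIASES.items()
    for variant in variants
}


def normalise(tag):
    """Lower-case a hashtag and map known variants to one canonical form."""
    lowered = tag.lower()
    return CANONICAL.get(lowered, lowered)


# Made-up counts for illustration.
raw_counts = Counter({
    "#MAGA": 5,
    "#MAKEAMERICAGREATAGAIN": 3,
    "#MakeAmericaGreatAgain": 2,
    "#FakeNews": 4,
})

# Re-count after normalising so the variants are summed together.
merged = Counter()
for tag, n in raw_counts.items():
    merged[normalise(tag)] += n
```

Lower-casing alone merges variants that differ only in capitalisation. The alias table is needed for an abbreviation such as #MAGA and its spelled-out form, and it has to be maintained by hand.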
However, this seems like a good place to pause the analysis for now. Thank you for reading!