Recently, Matthias Radtke wrote a very nice blog post on Topic Modeling of the codecentric Blog Articles, in which he gives a comprehensive introduction to topic modeling. In this article, I show a real-world example of how we can use data science to gain insights from text data and social network analysis.
I am using publicly available Twitter data to characterize codecentric’s friends and followers for
- identifying the most “influential” followers and using text analysis tools like sentiment analysis to characterize their interests from their user descriptions
- performing Social Network Analysis on friends, followers and a subset of second degree connections to identify key players who will be able to pass on information to a wide reach of other users and
- combining this network analysis with topic modeling to identify meta-groups with similar interests.
Knowing the interests and social network positions of our followers allows us to identify key users who are likely to retweet posts that fall within their range of interests and who will reach a wide audience.
Via the Twitter REST API, anybody can access Tweets, Timelines, Friends and Followers of users or hash-tags. One drawback of the REST API is its rate limit of 15 requests per application per rate limit window (15 minutes). An alternative would be to use Twitter’s Streaming API, if you wanted to continuously stream data of specific users, topics or hash-tags. Here though, I want to look at a snapshot of codecentric’s Twitter followers to show some of the possibilities that analyzing this information holds.
On July 15th, codecentric had 449 friends (users whom codecentric follows) and 2732 followers (users who follow codecentric); 261 of these users were both friends and followers.
We now have the following information about these friends and followers:
- user name
- user screen name
- user description (the short introduction that each user can write about themselves)
- number of tweets per user
- number of followers per user
- number of friends per user
- date of account creation
- account location
- account language
This data can tell us a lot about who is interested in codecentric and what we do. We can e.g. start with a simple exploratory data analysis and look at what languages the accounts are set to – no need for fancy models (just yet)!
As we can see, the vast majority of friends and followers have English or German account settings. The insight derived from this is that tweeting in both German and English will find an audience among our followers (even though English would probably be more inclusive, assuming that most, if not all, German followers will also be able to understand English tweets).
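As a minimal sketch of this kind of exploratory count, written in Python and not the code used for the analysis above: the sample records and the field name `lang` are assumptions made for illustration, not real follower data.

```python
from collections import Counter

# Hypothetical sample records; the "lang" field name is an assumption
users = [
    {"screen_name": "alice", "lang": "en"},
    {"screen_name": "bob", "lang": "de"},
    {"screen_name": "carol", "lang": "en"},
    {"screen_name": "dave", "lang": "fr"},
]


def language_counts(users):
    """Count how many accounts use each language setting, most common first."""
    return Counter(user["lang"] for user in users).most_common()


for lang, count in language_counts(users):
    print(lang, count)
```

The resulting list of (language, count) pairs can be passed directly to any plotting library to produce a bar chart.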
Who are codecentric’s most influential followers and what are they interested in?
We can also try to identify our most influential followers. These would be followers with a big network (i.e. who have many followers) and who also tweet/re-tweet a lot. If we capture these followers’ interests with one of our tweets, they a) are more likely to re-tweet and b) will reach a bigger audience by doing so!
The plot above shows the correlation between the number of followers codecentric’s followers have and how often they tweet.
Now that we know who our most influential followers are, we can analyze their short descriptions about themselves to find out what they are interested in. By proxy, this will give us an idea about which kind of tweets are most likely to capture their interest. Of course, this is not to say that these are the only people who (should) matter and that tweets should be tailored towards these interests only! Covering a wide range of topics makes for an interesting and authentic profile but since “knowledge is power”, it can be extremely valuable to know which tweets/posts are likely to increase visibility!
In order to extract information from the descriptions of the most influential followers (defined as the top 100 followers based on a score of follower count * average tweets per day), I am making use of text analysis and natural language processing tools.
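The score described above could be computed roughly as follows. This is a sketch under stated assumptions: the field names (`followers_count`, `statuses_count`, `created_at`), the sample users and the snapshot date are all hypothetical, and average tweets per day is taken here as total tweets divided by account age in days.

```python
from datetime import date


def influence_score(followers_count, tweet_count, created, snapshot):
    """Follower count multiplied by average tweets per day since account creation."""
    days_active = max((snapshot - created).days, 1)  # guard against division by zero
    return followers_count * (tweet_count / days_active)


def top_influencers(users, snapshot, n=100):
    """Return the n users with the highest influence score, highest first."""
    return sorted(
        users,
        key=lambda u: influence_score(
            u["followers_count"], u["statuses_count"], u["created_at"], snapshot
        ),
        reverse=True,
    )[:n]


# Hypothetical sample data; the dates are illustrative, not those of the analysis
snapshot = date(2020, 1, 1)
users = [
    {"screen_name": "a", "followers_count": 1000, "statuses_count": 100,
     "created_at": date(2019, 1, 1)},
    {"screen_name": "b", "followers_count": 100, "statuses_count": 3650,
     "created_at": date(2019, 1, 1)},
    {"screen_name": "c", "followers_count": 5000, "statuses_count": 10,
     "created_at": date(2019, 1, 1)},
]

print([u["screen_name"] for u in top_influencers(users, snapshot, n=2)])
```

Note how user "c" has by far the most followers but ranks last because they rarely tweet; multiplying both factors rewards accounts that combine reach with activity.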
To prepare the data, I am splitting the user descriptions into words, converting each word to its word stem and removing stop words.
We can now identify the most common words in these descriptions.
Not surprisingly, software development, agile and business are among the most common words. But also IoT, data and science occur frequently in our influential followers’ descriptions!
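The preparation and counting steps can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the stop word list is a tiny hand-picked sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm (such as Porter's), which would handle many more cases.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; real analyses use much larger ones
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "at", "for", "to", "on",
              "i", "am", "is", "with"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not a replacement."""
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(description):
    """Split into lowercase words, drop stop words, reduce each word to its stem."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def most_common_words(descriptions, n=10):
    """Count stems across all descriptions and return the n most frequent."""
    counts = Counter()
    for description in descriptions:
        counts.update(preprocess(description))
    return counts.most_common(n)


# Hypothetical user descriptions
descriptions = [
    "Software developer and agile coach",
    "Developing software for the IoT",
    "Data science at scale",
]
print(most_common_words(descriptions, n=2))
```

Stemming is what lets "developer" and "developing" be counted as the same word here.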
Instead of looking for the most common words, we can also look for the most common word pairs (bigrams).
This graph shows the most common word pairs in our influential followers’ descriptions (arrow colors represent how often the pair occurs). Because we are looking at a relatively small set of followers, none of the word pairs occur exceptionally often. Still, data science is the most common word pair!
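Counting word pairs works much like counting single words. The following sketch (again in Python, with hypothetical sample descriptions) pairs each word with its successor within a description; stemming and stop word removal are left out to keep it short.

```python
import re
from collections import Counter


def bigrams(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


def most_common_bigrams(descriptions, n=10):
    """Count adjacent word pairs within each description, most frequent first."""
    counts = Counter()
    for description in descriptions:
        tokens = re.findall(r"[a-z]+", description.lower())
        counts.update(bigrams(tokens))
    return counts.most_common(n)


# Hypothetical user descriptions
sample = ["Data science and agile", "I love data science"]
print(most_common_bigrams(sample, n=1))
```

Because pairs are built per description, the last word of one description is never paired with the first word of the next.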
Sentiment analysis describes a collection of natural language processing tools and resources that are used to identify subjective information in text, like positive or negative sentiment, joy, disgust, fear, anger, etc.
Here, we can also use bigram analysis to identify negated meanings, i.e. words preceded by “not”, “no”, etc. In sentiment analysis, the meanings of negated words can then be reversed.
This plot shows the overall sentiment in the user descriptions of the most influential followers. Based on Bing Liu’s sentiment lexicon, we can score how many positive and negative words were used in each follower’s description. Because this lexicon is only available for the English language, we can only get reliable scores for followers with an English description (68 out of 100 followers have an English language setting). As we can see, the majority of followers have predominantly positive descriptions.
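To illustrate lexicon-based scoring with negation handling, here is a minimal Python sketch. The lexicon below is a tiny hypothetical word list and NOT Bing Liu's actual lexicon; the negation words and the net score (positive minus negative words) are likewise simplifying assumptions.

```python
import re

# Tiny hypothetical lexicon for illustration only: +1 positive, -1 negative
LEXICON = {"love": 1, "great": 1, "passionate": 1, "happy": 1,
           "bad": -1, "hate": -1, "boring": -1}
NEGATIONS = {"not", "no", "never", "without"}


def sentiment_score(description):
    """Sum word polarities, reversing any word directly preceded by a negation."""
    tokens = re.findall(r"[a-z']+", description.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = LEXICON.get(token, 0)
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity  # e.g. "not bad" counts as positive
        score += polarity
    return score


print(sentiment_score("Passionate developer, love great code"))  # 3
print(sentiment_score("not bad at all"))  # 1
```

Checking only the directly preceding word is the bigram approach described above; negations further away in the sentence are not caught by this sketch.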