The major part of big data is unstructured data. When an organization wants to leverage its data or external information from social media with the goal to make better business decisions, a challenge is to retrieve important information from unstructured text documents written in natural language. The main goal of techniques from natural language processing (NLP) is to turn text into structured data that can be used for further analysis.
A particular example of NLP is probabilistic topic modeling, which seeks to discover common topics in a collection of documents. Unsupervised machine learning algorithms have been developed to find such topics, which can be used for organizing and managing the collection of documents. Topic models allow us to address interesting data science questions concerning, for instance, recommendations: “What articles are most relevant for a certain topic?”, and clustering: “What are newly published articles discussing and how similar are two articles?”. The derived topics can also be viewed as a dimensionality reduction and used as features for subsequent machine learning tasks (feature engineering).
In this article, we present results from topic modeling of the codecentric blog. The topics are used to analyze the blog content and how it changes over time. Of course, one could argue that authors usually assign their blog posts to a category and might use additional tags that give hints about the content. But when no such labels are available in a very large collection of documents, or if one wants to obtain a more objective clustering, topic modeling is an appropriate tool.
We perform the analysis using Apache Spark with its Python API in a Jupyter Notebook, which you may download here. Spark allows us to build a scalable machine learning (ML) pipeline containing latent Dirichlet allocation (LDA) topic modeling from its machine learning library (MLlib). A small Spark cluster can be easily set up, as described in this post. Another advantage in using Spark is that the developed prototypes of a data product can be easily translated to a production environment.
This post is organized in five sections:
LDA Topic Model
Data Preparation
Model Training and Evaluation
Results
Summary and Conclusion
The first three rather technical sections describe theoretical concepts of LDA topic modeling as well as the implementation of data preprocessing and model training. Some readers might want to jump directly to the Results section.
LDA Topic Model
In natural language processing, a probabilistic topic model describes the semantic structure of a collection of documents, the so-called corpus. Latent Dirichlet allocation (LDA) is one of the most popular and successful models to discover common topics as a hidden structure of the collection of documents. According to the LDA model, text documents are represented by mixtures of topics. This means that a document concerns one or multiple topics in different proportions. A topic can be viewed as a cluster of similar words. More formally, the model assumes each topic to be characterized by a distribution over a fixed vocabulary, and each text document to be generated by a distribution of topics.
The basic assumption of LDA is that the documents have been generated in a two-step random process rather than having been written by a human. The generative process for a document consisting of N words is as follows. The most important model parameter is the number of topics k that has to be chosen in advance. In the first step, the mixture of topics is generated according to a Dirichlet distribution of k topics. Second, from the previously determined topic distribution, a topic is randomly chosen, which then generates a word from its distribution over the vocabulary. The second step is repeated for the N words of the document. Note that LDA is a bag-of-words model and the order of words appearing in the text as well as the order of the documents in the collection is neglected.
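The two-step generative process can be sketched with NumPy. This is a toy illustration only: the six-word vocabulary and the two hand-picked topic distributions are made up for the example and are not part of the Spark pipeline used below.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["spark", "data", "agile", "team", "test", "code"]
k = 2  # number of topics, chosen in advance
# model assumption: each topic is a distribution over the fixed vocabulary
topics = np.array([[0.4, 0.3, 0.05, 0.05, 0.1, 0.1],
                   [0.05, 0.05, 0.35, 0.35, 0.1, 0.1]])

def generate_document(n_words, alpha=(0.5, 0.5)):
    # step 1: draw the document's mixture of topics from a Dirichlet prior
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # step 2: draw a topic from the mixture, then draw a word
        # from that topic's distribution over the vocabulary
        z = rng.choice(k, p=theta)
        words.append(str(rng.choice(vocab, p=topics[z])))
    return words

doc = generate_document(8)
```

Note that the word order plays no role in this process, which is exactly the bag-of-words assumption mentioned above.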
When starting with a collection of documents and considering the reverse direction of the generative process, LDA topic modeling is the method to infer what topics might have generated the collection. Since there is no exact solution for these distributions, a suite of algorithms has been developed to estimate them from a corpus of text documents. Further details about LDA can be found in the original paper by Blei et al. or in this nice review about probabilistic topic models.
Data Preparation
We follow a typical workflow of data preparation for natural language processing (NLP): the textual data is transformed into the numerical feature vectors required as input for the LDA machine learning algorithm. A similar approach is described in a recent blog post about spam detection.
A MySQL table of the blog posts is loaded into a Spark DataFrame using JDBC; an additional Spark submit argument contains the MySQL Connector jar file.
# read from mysql table, only use published posts sorted by date
# (connection options are placeholders; adjust url, dbtable and credentials)
df_posts = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://localhost/blog")
            .option("dbtable", "wp_posts")
            .load()
            .filter("post_type == 'post'")
            .filter("post_status == 'publish'")
            .orderBy("post_date"))
From the post content, we first have to extract the text that is decorated with various HTML tags. A beautiful Python library to achieve this is BeautifulSoup. An example raw text is shown in the notebook for the first entry of the post content. The textual data extracted from the HTML is then normalized by removing punctuation and other special characters and by converting to lowercase. A so-called tokenizer splits the sentences into words (tokens) separated by whitespace. These operations on the Spark DataFrame columns are performed via Spark’s user-defined functions (UDFs).
import re
from bs4 import BeautifulSoup
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.feature import RegexTokenizer

# extract plain text from the HTML post content
extractText = udf(
    lambda d: BeautifulSoup(d, "lxml").get_text(strip=False), StringType())
# replace special characters by whitespace and convert to lower case
removePunct = udf(
    lambda s: re.sub(r'[^a-zA-Z0-9]', r' ', s).strip().lower(), StringType())
# normalize the post content (remove html tags, punctuation and lower case)
df_posts_norm = df_posts.withColumn(
    "text", removePunct(extractText(df_posts.post_content)))

# breaking text into words
tokenizer = RegexTokenizer(inputCol="text", outputCol="words",
                           gaps=True, pattern=r'\s+', minTokenLength=2)
df_tokens = tokenizer.transform(df_posts_norm)
RegexTokenizer is an example of a Spark transformer. Inspired by the concept of scikit-learn, transformers and estimators can be connected to a pipeline, i.e., a machine learning workflow comprising the various stages of preprocessing, feature generation, and model training and evaluation.
We only want to analyze English blog posts and have to identify the language, since no such tag is available in our data set. A simple classification between English and German as the primary language is achieved by comparing the fractions of stop words in the text. Stop words are the most common words of a given language, such as “a”, “of”, “the”, and “and” in English. Lists of stop words for different languages are provided by NLTK. The fraction of English stop words in a given article is obtained by counting the number of English stop words that appear at least once in the text, divided by the total number of stop words in the list. Similarly, we calculate the fraction of German stop words and decide which language an article mainly uses by the larger of the two fractions.
from nltk.corpus import stopwords
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

englishSW = set(stopwords.words('english'))
germanSW = set(stopwords.words('german'))
nEngSW = len(englishSW)
nGerSW = len(germanSW)
RatioEng = udf(lambda l: len(set(l).intersection(englishSW)) / nEngSW, DoubleType())
RatioGer = udf(lambda l: len(set(l).intersection(germanSW)) / nGerSW, DoubleType())
# keep only the articles classified as English
df_tokens_en = (df_tokens.withColumn("ratio_en", RatioEng(df_tokens['words']))
                .withColumn("ratio_ge", RatioGer(df_tokens['words']))
                .withColumn("Eng", col('ratio_en') > col('ratio_ge'))
                .filter(col('Eng')))
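Outside Spark, the same stop-word heuristic can be illustrated in plain Python. The tiny stop-word lists here are toy stand-ins for NLTK’s, purely for illustration:

```python
# toy stop-word lists; in the pipeline above they come from NLTK
ENGLISH_SW = {"a", "of", "the", "and", "in", "is", "to"}
GERMAN_SW = {"der", "die", "das", "und", "ist", "ein", "zu"}

def main_language(tokens):
    # fraction of each language's stop-word list that occurs in the text
    toks = set(tokens)
    ratio_en = len(toks & ENGLISH_SW) / len(ENGLISH_SW)
    ratio_de = len(toks & GERMAN_SW) / len(GERMAN_SW)
    return "en" if ratio_en > ratio_de else "de"

main_language("the cat is in the house and happy".split())  # -> "en"
```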
Filtering out stop words and stemming
The last preprocessing steps are filtering out the English stop words, as these common words presumably do not help in identifying meaningful topics, and stemming the words such that, for instance, “test”, “tests”, “tested”, and “testing” are all reduced to their word stem “test”. The list of stop words is expanded by moreStopWords, which we collect manually as follows: after having trained an LDA model, we inspect the topics and identify additional stop words, which are then filtered out for the subsequent model training. This procedure is repeated until no more stop words appear in the lists of top words.
from pyspark.ml.feature import StopWordsRemover

swRemover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
swRemover.setStopWords(swRemover.getStopWords() + moreStopWords)
df_finalTokens = swRemover.transform(df_tokens_en)
from nltk.stem.snowball import SnowballStemmer
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

stemmer = SnowballStemmer("english", ignore_stopwords=False)
udfStemmer = udf(lambda l: [stemmer.stem(s) for s in l], ArrayType(StringType()))
df_finalTokens = df_finalTokens.withColumn("filteredStemmed",
                                           udfStemmer(df_finalTokens.filtered))
The feature vectors are then generated following a simple bag-of-words approach using Spark’s CountVectorizer. Each document is represented as a vector of counts whose length is given by the number of words in the vocabulary, which we set to 2500. CountVectorizer is an estimator that generates a model from which the tokenized documents are transformed into count vectors. Words have to appear in at least two different documents and at least four times within a document to be taken into account.
from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="filteredStemmed", outputCol="features",
                     vocabSize=2500, minDF=2, minTF=4)
cvModel = cv.fit(df_finalTokens)
countVectors = (cvModel.transform(df_finalTokens)
                .select("features"))
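To make the minDF/minTF thresholds concrete, here is a minimal pure-Python sketch of the counting logic. It illustrates the semantics described above, not Spark’s actual implementation:

```python
from collections import Counter

def count_vectors(docs, min_df=2, min_tf=4):
    # document frequency: the number of documents a term appears in
    df = Counter()
    for d in docs:
        df.update(set(d))
    # keep only terms appearing in at least min_df documents
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    vectors = []
    for d in docs:
        tf = Counter(d)
        # within a document, counts below min_tf are ignored (set to zero)
        vectors.append([tf[t] if tf[t] >= min_tf else 0 for t in vocab])
    return vocab, vectors

vocab, vecs = count_vectors([["a"] * 4 + ["b"], ["a"] * 5 + ["c"]])
# "b" and "c" each occur in only one document, so only "a" survives
```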
Model Training and Evaluation
The Spark implementation of LDA allows online variational inference as a method for learning the model. Data is incrementally processed in small batches, which allows scaling to very large data sets that might even arrive in a streaming fashion.
from pyspark.ml.clustering import LDA

df_training, df_testing = countVectors.randomSplit([0.9, 0.1], 1)

numTopics = 20  # number of topics
lda = LDA(k=numTopics, seed=1, optimizer="online", optimizeDocConcentration=True,
          maxIter=50,            # number of iterations
          learningDecay=0.51,    # kappa, learning rate
          learningOffset=64.0,   # tau_0, larger values downweigh early iterations
          subsamplingRate=0.05)  # mini batch fraction
ldaModel = lda.fit(df_training)
lperplexity = ldaModel.logPerplexity(df_testing)
In general, the data set is split into a training set and a testing set in order to evaluate the model performance via a measure such as the perplexity, i.e., a measure of how well the word counts of the test documents are represented by the topics’ word distributions. However, we find it more useful to evaluate the model manually by looking at the resulting topics and the corresponding distributions of words. A good result is obtained by training a 20-topic LDA model on the entire corpus of English codecentric blog articles. Using a more quantitative performance measure would allow hyper-parameter tuning; a grid search for optimal parameters such as the number of topics is facilitated by Spark’s pipeline concept. The ML models are saved for later usage.
Results
In the following, we present results of a 20-topic model trained on the entire data set of English codecentric blog articles published until and including November 2016. A visualization of the distribution of words for the two top topics is given by the word clouds in Fig. 1 and Fig. 2. The size of a word corresponds to its relative weight; words having a large weight are more often generated by this topic. With the top words and by inspecting some documents discussing a given topic, it is often possible to manually assign summarizing labels to the topics. The topics that correspond to the word clouds in Fig. 1 and Fig. 2 are labeled “Agility” and “Testing”, respectively. Note that some words are reduced to an invalid word stem like “stori” or “softwar”.
Figure 1. Word cloud of the topic labeled “Agility”.
Figure 2. Word cloud of the topic labeled “Testing”.
Labeling of topics and identifying top documents
The twelve most meaningful topics of our 20-topic model are listed in Tab. 1. These topics are selected by hand, and “meaningful” is of course a quite subjective criterion; we exclude, for instance, topics in which two very different themes appear. For each topic, we suggest a label that summarizes what the topic is about and provide the top words in the order of their probability to be generated. In order to identify the top document for a given topic, we order the documents by their probability to discuss that topic; the top document is the one having the largest contribution from the given topic compared to all other documents.
Table 1. The twelve top topics of a 20-topic model trained on all English codecentric blog posts.
| topic | label | top words | top document |
|---|---|---|---|
| 1 | DevOps | build, plugin, run, imag, maven | How to enter a Docker container, Alexander Berresch |
| 2 | Memory Management | java, gc, time, jvm, memory | Useful JVM Flags – Part 2 (Flag Categories and JIT Compiler Diagnostics), Patrick Peschlow |
| 3 | Data/Search | data, index, field, query, operator | Big Data – What to do with it? (Part 1 of 2), Jan Malcomess |
| 4 | Reactive Systems | state, node, system, cluster, data | A Map of Akka, Heiko Seeberger |
| 5 | Math | method, latex, value, point, parameter | The Machinery behind Machine Learning – Part 1, Stefan Kühn |
| 6 | Spring | spring, public, class, configure, batch | Boot your own infrastructure – Extending Spring Boot in five steps, Tobias Flohre |
| 7 | Frontend | module, type, grunt, html, import | Elm Friday: Imports (Part VIII), Bastian Krol |
| 8 | Database | mongodb, document, id, db, name | Spring Batch and MongoDB, Tobias Trelle |
| 10 | Agility | team, develop, time, agile, work | What Agile Software Development has in common with Sailing, Thomas Jaspers |
| 11 | Mobile App | app, notif, object, return, null | New features in iOS 10 Notifications, Martin Berger |
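The top-document selection described above amounts to an argmax over one column of the document-topic matrix. A toy sketch with a hypothetical 3x3 matrix (made-up numbers, not model output):

```python
import numpy as np

# rows: documents, columns: topics; entries: P(topic | document)
doc_topic = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4]])

def top_document(topic_idx):
    # index of the document with the largest proportion of the given topic
    return int(np.argmax(doc_topic[:, topic_idx]))

top_document(1)  # -> 1, the second document is dominated by topic 1
```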
Next we determine the number of documents having the same main topic. Remember that a document usually concerns several topics in different proportions. The main topic of a document is defined as the topic with the largest probability.
import numpy
from pyspark.sql.functions import udf, desc
from pyspark.sql.types import IntegerType

getMainTopicIdx = udf(lambda l: int(numpy.argmax([float(x) for x in l])), IntegerType())
# take the argmax of each topic distribution, then group by main topic and count
countTopDocs = (ldaModel.transform(countVectors)
                .withColumn("mainTopic", getMainTopicIdx("topicDistribution"))
                .groupBy("mainTopic").count()
                .orderBy(desc("count")))
For each document in our data set, we identify the topic index with the largest probability, i.e., the main topic. Grouping by the topic index, counting, and sorting results in the counts of documents per topic plotted in Fig. 3. The most discussed topics in the entire collection of blog articles are topic 0 – “Testing”, topic 6 – “Spring”, and topic 10 – “Agility”.
Figure 3. For each topic, we count the number of documents that discuss the topic with the largest probability (main topic). Only the 12 most meaningful topics of the 20-topic model are shown.
Evolution of blog content over time
How many blog articles were published on a specific topic during one year? This question is addressed in Fig. 4, which illustrates, for the top topics “Testing”, “Spring”, and “Agility”, the number of documents that discuss the topic with the largest probability as a function of time. At first glance, it appears that “Agility” became less important after a hype in 2009, as seen by the red line in Fig. 4. However, another explanation would be that in later years agile methodologies are not exclusively discussed as the main topic of a document but rather co-appear with other topics in smaller proportions. A growing number of articles is dedicated to both the topics “Spring” and “Testing”, with some oscillations for the latter. It might also be interesting to look at the number of documents that discuss a specified topic with a probability larger than some threshold value, rather than considering only the largest probability as in Fig. 4. However, we do not go into detail here and only provide a glimpse of possible analyses.
Figure 4. Time evolution of the number of documents with the same main topic. Results are shown for the three top topics obtained from the LDA model trained on the entire data set.
Evolution of topics over time
Another interesting question is at what time topics appear or disappear and how the words representing a topic change over time. For the results in Fig.4 only a single LDA model was trained on the entire data set. The resulting topic distribution is fixed and does not change over time. In order to study the evolution of topics over time in a systematic way, machine learning researchers have developed dynamic topic models.
Here, we take a simpler approach to investigate how the distribution of topics changes over time. Several LDA models are trained, each on the blog articles of a specific year together with the articles from all previous years. Thus, we obtain topic distributions for the collections of documents published during the years 2008-2010, 2008-2011, …, 2008-2016. We then try to identify the same topics, which might contain different words. In principle, this approach allows us to predict next year’s topics given all the articles from the previous years. Without going into details, we present as an example in Tab. 2 the top ten words for the topic “Agility” from different LDA models trained with data until and including consecutive years.
Table 2. Top ten words for the topic “Agility” from different LDA models trained on blog articles until and including the given year. The order of the words from top to bottom represents the probability to be generated.
As can be seen in Fig.5, the probability of top words to appear in a text about “Agility” changes over time. For instance, there is a slight decrease in the use of the words “agile” and “scrum” in the period from 2010 until 2016.
Figure 5. Time evolution of some words in the topic “Agility”. The weights of the words, shown as a function of time, correspond to the probability to appear in a document about agility.
The topic distribution of this article
In order to test the trained LDA topic model, we now predict the topics for the present article. We use the LDA model trained on the entire data set and predict on the text of this article written up to this paragraph. As a result, we obtain the topic distribution depicted in Fig. 6 as a pie chart. The two main topics, each with about 20 percent, are “Functional Programming” and “Data/Search”, which is quite appropriate. All other topics having less than 5 percent probability are collected in the “Other” part.
Figure 6. The topic distribution for this article predicted by the LDA model trained on the entire dataset of all English codecentric blog articles.
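The grouping of minor topics into the “Other” slice can be sketched in plain Python; the 5 percent threshold matches the description above, while the function itself is just an illustration of the bookkeeping behind the pie chart:

```python
def collapse_small_topics(dist, threshold=0.05):
    # dist maps topic label -> probability; small topics are merged into "Other"
    kept = {t: p for t, p in dist.items() if p >= threshold}
    other = sum(p for p in dist.values() if p < threshold)
    if other > 0:
        kept["Other"] = other
    return kept

collapse_small_topics({"A": 0.5, "B": 0.3, "C": 0.04, "D": 0.16})
```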
Summary and Conclusion
In this article, we analyze the content of the codecentric blog by means of Spark’s implementation of LDA topic modeling. The data preprocessing steps necessary for NLP are described. Training a 20-topic model on all blog posts allows us to identify a number of meaningful topics. Some exploratory investigations of the time evolution of the blog content and the topics are performed using different LDA models trained on articles until a specified year. We thereby obtain hints on how topics and words have changed over time. In the last part, we successfully predict the topics of the present blog article.
In a follow-up post, it would be interesting to use the German blog posts and see whether the topics depend on the language. It might also be worth comparing LDA with, e.g., non-negative matrix factorization, and trying more elaborate (dynamic) topic models with different features such as tf-idf. Further insight into how topics tend to co-occur could be gained by modeling the connections between topics in a graph. As a concluding remark, note that topic modeling is not restricted to text documents but can also be applied to other unstructured data such as images or video clips, e.g., video behavior mining, where visual features are interpreted as words.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993-1022. 2003.
David M. Blei. “Probabilistic Topic Models.” Communications of the ACM 55.4: 77-84. 2012.
Matthew Hoffman, Francis R. Bach, and David M. Blei. “Online Learning for Latent Dirichlet Allocation.” Advances in Neural Information Processing Systems. 2010.
David M. Blei and John D. Lafferty. “Dynamic Topic Models.” Proceedings of the 23rd International Conference on Machine Learning, 113-120. ACM, 2006.