Different Techniques in Topic Modeling: LDA, Mallet LDA, STM & HDP
In natural language processing, topic modeling is a type of statistical modeling used to discover abstract topics in a collection of documents. Although several techniques are available, evaluating topic models is challenging because they are trained without supervision: there are no ground-truth labels, and no standard set of metrics applies to every corpus. It is nevertheless important to be able to tell whether a trained model is good or bad and to compare different models and techniques. In this blog, we will explore techniques and evaluation methods for topic modeling through four of the most popular approaches: Latent Dirichlet Allocation (LDA), Mallet LDA, the Structural Topic Model (STM), and the Hierarchical Dirichlet Process (HDP).
In response to the murder of George Floyd, today’s leaders are starting new and different conversations across their organizations about systemic oppression and accountability. To understand whether corporate EDI statements serve as cosmetic cover, a conversation starter, or an indicator of commitment, our research team set out to answer the following question:
- What do corporations commit to doing? Are there differences in themes in those commitments?
We built a web scraping application to collect the statements that Fortune 100 companies and the CEO Action group issued in 2020 in response to systemic racism. We confirmed and analyzed the 202 statements available from 228 organizations. The sample count is visualized in the image below.
To answer our research question, I used several topic modeling techniques to uncover the themes in the corporate statements efficiently.
Data cleaning and preprocessing are as important as building a sophisticated machine learning model, because the reliability of the model depends heavily on the quality of the data. Raw text may include hyperlinks, punctuation, and unwanted symbols that interfere with the model's performance, so it is important to preprocess the data before modeling. I implemented the following to remove noise from the dataset.
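To make the kinds of noise mentioned above concrete, here is a minimal sketch of a cleaning function using only Python's standard library. It is an illustration under assumptions, not the code used in this project: the function name `clean_text`, the regular expressions, and the sample sentence are all hypothetical, and the character filter assumes English text (it would also strip accented letters and digits).

```python
import re

# Assumed pattern for hyperlinks: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def clean_text(raw: str) -> str:
    """Lowercase the text, drop hyperlinks, punctuation and symbols, and tidy whitespace."""
    text = raw.lower()
    # Remove hyperlinks first, before punctuation stripping breaks them apart.
    text = URL_PATTERN.sub(" ", text)
    # Replace every character that is not a-z or whitespace
    # (punctuation, digits, symbols such as the copyright sign) with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the runs of spaces left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample sentence, not taken from the scraped statements.
sample = "We are committed to change!! Read more: https://example.com/statement (2020) © Acme"
print(clean_text(sample))
```

Removing hyperlinks before punctuation matters here: stripping punctuation first would leave fragments such as "https" and "example" behind as spurious tokens. Whether digits should be dropped as well is a judgment call that depends on the corpus.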