Practical tutorial on how to model political statements with a state-of-the-art transformer-based topic model
Topic modeling (i.e., the identification of themes in a corpus of text data) has developed rapidly since the Latent Dirichlet Allocation (LDA) model was published. This classical topic model, however, does not capture the relationships between words well because it is based on the statistical concept of a bag of words. Building on recent advances in embeddings, the Top2Vec and BERTopic models address this drawback by exploiting pre-trained language models to generate topics.
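To make the bag-of-words limitation concrete, here is a minimal sketch in plain Python. The two example sentences and the helper name `bag_of_words` are made up for illustration. The sketch shows that a model which only sees word counts cannot tell apart two statements with opposite meanings.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase, whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


# Hypothetical example sentences: same words, different order, opposite meaning.
a = "the government cut taxes not spending"
b = "the government cut spending not taxes"

# Both sentences reduce to exactly the same word counts.
same_bag = bag_of_words(a) == bag_of_words(b)
print(same_bag)
```

Embedding-based models sidestep this by representing each document with a vector produced by a pre-trained language model, which is sensitive to word order and context.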
In this article, we will use BERTopic by Maarten Grootendorst (2022) to identify the terms that represent themes in the transcripts of political speeches. It outperforms most traditional and modern topic models in topic modeling metrics on various corpora and has been used in companies, academia (Chagnon, 2024), and the public sector. We will explore in Python code:
- how to preprocess data effectively
- how to create a bigram topic model
- how to explore the most frequent terms over time.
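As a rough preview, the three steps might be sketched as follows. This is not the tutorial's actual code: the cleaning rules, the helper names `preprocess` and `fit_bigram_topics`, and the example sentence are assumptions for illustration. The sketch also assumes a recent BERTopic release in which `topics_over_time` accepts `docs` and `timestamps` as keyword arguments, and that `bertopic` and `scikit-learn` are installed when `fit_bigram_topics` is called.

```python
import re


def preprocess(text: str) -> str:
    """Lowercase, drop URLs and non-letter characters, collapse whitespace.

    These cleaning rules are illustrative assumptions, not the article's recipe.
    """
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def fit_bigram_topics(docs, timestamps):
    """Fit BERTopic with bigram topic terms and compute topic terms over time."""
    # Imported lazily so that preprocess() works without these packages.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    # ngram_range=(2, 2) restricts the topic representation to bigrams.
    vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
    topic_model = BERTopic(vectorizer_model=vectorizer)
    topic_model.fit_transform(docs)

    # One row per topic and time bin, with the top terms and their frequency.
    over_time = topic_model.topics_over_time(docs=docs, timestamps=timestamps)
    return topic_model, over_time


# Hypothetical input sentence, used only to show the cleaning step.
cleaned = preprocess("We WILL cut taxes by 5%! See https://example.org")
print(cleaned)
```

Fitting needs a corpus of reasonable size, so `fit_bigram_topics` is only defined here and not called on toy data.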
As an example data set, we will use the Empoliticon: Dataset of Political Speeches, Context and Emotions, released under the…