Topic modeling is an umbrella term for a host of semi-automated or fully automated corpus linguistic methods that aim to map the content of texts by identifying dominant themes. When driven by an algorithm trained in either a supervised or unsupervised manner on a dataset, the method makes it possible to retrieve a given number of topics from a collection of texts based on the vocabulary used in them. This blogpost presents the potential (and challenges) of using topic modeling in CORECON.
Topic modeling outputs several groups of words that are statistically related within a given corpus, providing insights into how lexical items (most commonly, nouns) co-occur. This is because it analyzes the probabilities, relationships and patterns of co-occurring or clustering words (Brookes and McEnery, 2019; Grundmann, 2022). Because the co-occurrence is statistical, not semantic, there is sometimes a degree of randomness in the way topics are identified, which may give a sense of a “black box” research output, one that cannot be easily explained or replicated (Bednarek, 2024). At the same time, however, this method can uncover unexpected patterns in the use of words that a human researcher would most likely overlook.
A “topic” is a probability distribution of words that a topic modeling algorithm identifies as more likely to co-occur than others, based on the data it was trained on. For example, “if a topic model were to be trained on the last decade of the Guardian, it would not be surprising to find a topic in which the words Amazon, Google, Facebook, and Microsoft ranked as highly probable, corresponding to the Guardian’s coverage of the tech sector” (Recchia, 2020, p. 399).
In discourse studies, topic modeling has been used to complement keyword analysis (identifying words with higher salience or pervasiveness in a given corpus vis-à-vis the standard uses of the language in comparable genres). For some researchers (cf. Törnberg and Törnberg, 2016; Vanichkina and Bednarek, 2022), topic modeling can serve as a complementary step in a larger analysis, enabling a more or less detailed probing of the “aboutness” of the sample. This can then lead to a more fine-grained analysis of topic-related concordance lines (Busso et al., 2022).
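For readers curious about how keyword analysis quantifies salience, one widely used statistic is the log-likelihood (G2) keyness measure, which compares a word’s frequency in a target corpus against a reference corpus. The sketch below, in plain Python, is only an illustration of that statistic; the frequencies in the usage note are invented, and CORECON’s own keyword extraction may rely on different tooling.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood (G2) keyness for one word: how unexpectedly
    frequent it is in the target corpus relative to the reference."""
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the word were used at the same rate everywhere.
    expected_t = size_target * total_freq / total_size
    expected_r = size_ref * total_freq / total_size
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / expected_t)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_r)
    return 2 * ll
```

For example, a word occurring 50 times per 1,000 tokens in the target corpus but only 5 times per 1,000 in the reference yields a high score, flagging it as a keyword candidate, whereas equal relative frequencies yield a score near zero.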
CORECON’s large English-language corpora can lend themselves to structural topic modeling, especially when it comes to the diverse sources included in the dataset through scraping the websites of various media outlets (see blogpost by Jędrzej Olejniczak). For example, topic modeling can be used to establish a small number of thematic domains in a large corpus, or to assign an appropriate label to a group of words that recur in the corpus, before delving into the nuances of collocation patterns (see blogpost by Katarzyna Molek-Kozakowska).
In a preliminary exercise devoted to topic modeling with the natural language processing framework BERT, based on the first two years of Ukraine war coverage in the global English-language media, it was possible to identify topics related to, among others: (1) Russian and Ukrainian battles, placenames and the capture of territory along the war frontlines; (2) nuclear power and energy issues, as well as electricity blackouts and infrastructure destruction due to attacks; (3) financial, legal and military aid packages and assistance from the US and Western countries; (4) hosting refugees and granting visas according to special protection and solidarity schemes; (5) (sham) referendums and voting disturbances, to name just a few.
While they have not yielded any surprising results in CORECON so far, (structural) topic modeling techniques could also be combined with sentiment analysis to check whether certain sentiments occur more frequently with certain topics. The words “annexation” or “sham,” for example, can hardly be seen as positive, unlike the words “aid” or “assistance.”
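One simple way to cross-tabulate sentiment with topics, assuming topic labels have already been assigned to documents, could look like the sketch below. The mini-lexicons and the function are hypothetical illustrations, not CORECON code; a real analysis would use a full sentiment resource and the model’s actual topic assignments.

```python
from collections import Counter

# Hypothetical mini-lexicons for illustration only.
NEGATIVE = {"annexation", "sham", "destruction", "attack"}
POSITIVE = {"aid", "assistance", "solidarity", "protection"}

def sentiment_by_topic(docs_with_topics):
    """Count positive/negative lexicon hits per assigned topic.

    docs_with_topics: iterable of (topic_label, text) pairs.
    Returns a Counter keyed by (topic_label, "pos"/"neg").
    """
    table = Counter()
    for topic, text in docs_with_topics:
        for word in text.lower().split():
            if word in NEGATIVE:
                table[(topic, "neg")] += 1
            elif word in POSITIVE:
                table[(topic, "pos")] += 1
    return table
```

Applied to, say, documents labeled “aid” versus “referendums,” such a tally would show whether negatively loaded vocabulary clusters around particular topics.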
Meanwhile, specific outlets could be analyzed in more detail to check how, and how often, they report on certain identified topics, or how the coverage of these topics has changed over time. For example, the connections between the Russian Federation and the North Korean regime (captured in one prominent topic throughout 2022 and 2023) evolved into a de facto military alliance in 2024, so over time the thematic and terminological choices in the coverage are likely to have shifted as well.
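Tracking how often an outlet covers a given topic per year can be sketched as a simple tally over (outlet, year, topic) records. The data structure and record format are assumptions made for illustration, not a description of CORECON’s actual storage.

```python
from collections import defaultdict

def topic_trend(records):
    """Tally topic coverage over time per outlet.

    records: iterable of (outlet, year, topic) tuples, one per
    document assigned to a topic.
    Returns {(outlet, topic): {year: count}}.
    """
    trend = defaultdict(lambda: defaultdict(int))
    for outlet, year, topic in records:
        trend[(outlet, topic)][year] += 1
    return trend
```

Plotting these yearly counts (or their share of the outlet’s total output) would make visible, for instance, whether a topic prominent in 2022 and 2023 grows or fades in 2024.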
It is often recommended in the literature (cf. Bednarek, 2024) to look beyond wordlist-generated topics based on single words and to explore the possibilities of obtaining genre-specific topics. It is also a good idea to revisit the corpus-compilation parameters and to identify the core contextual factors that allowed a given number of statistically significant topics to emerge from the data. In CORECON, the prominent actions of the Russian military or the American administration have been reflected in the identified topics. Yet it is important to bear in mind that the algorithms will not see “topics” in the semantic way that humans pick up, generalize and interpret during actual encounters with texts.
Nevertheless, topic modeling can assist CORECON researchers in capturing the content dimensions, and the corresponding forms that represent this content, in a corpus of data too large to read in its entirety, and in inspiring further study designs that lead to in-depth analyses.
Text by Katarzyna Molek-Kozakowska
Tags: methodology, research, corpus, Katarzyna Molek-Kozakowska
Bednarek, M. (2024). Topic modeling in corpus-based discourse analysis: Uses and critiques. Discourse Studies. https://doi.org/10.1177/14614456241293075
Brookes, G. and McEnery, T. (2019). The utility of topic modeling for discourse studies. Discourse Studies 21(1): 3–21.
Busso, L., Petyko, M., Atkins, S., et al. (2022). Operation Heron: Latent topic changes in an abusive letter series. Corpora 17(2): 225–258.
Grundmann, R. (2022). Using large text news archives for the analysis of climate change discourse: Some methodological observations. Journal of Risk Research 25(3): 395–406.
Recchia, G. (2020). The fall and rise of AI: Investigating AI narratives with computational methods. In AI Narratives: A History of Imaginative Thinking about Intelligent Machines, S. Cave, K. Dihal, and S. Dillon (eds), pp. 382-407. Oxford University Press.
Törnberg, A. and Törnberg, P. (2016). Combining CDA and topic modeling: Analyzing discursive connections between Islamophobia and anti-feminism on an online forum. Discourse & Society 27(4): 401–422.
Vanichkina, D. and Bednarek, M. (2022). Australian obesity corpus manual. Available at: https://osf.io/h6n82.