Data collection is the basis of any empirical study, and corpus analysis is a well-established method within critical discourse studies. In his blog post, Jędrzej Olejniczak wrote about data collection, management, and processing in CORECON. Here I explain the motivations behind our methodological choices and the rationales for the data collection and annotation protocols used when building the CORECON corpus.
There are many approaches to corpus compilation in linguistics: they may rely on automated or manual data collection, or a mix of both that combines computation with human annotation (Mitra & Gilbert, 2021). The possibilities and limitations of the analysis depend on the choice of research objectives (Baker et al., 2008; Subtirelu & Baker, 2020). As CORECON aims to trace communication patterns and rhetorical strategies (rather than counting how often they appear in the media sphere), we opted for manual collection of Polish and Romanian media texts, both to keep control over what we include in or exclude from the corpus and because reliable data-scraping tools are lacking for these languages. The English-language corpus was collected mostly automatically to serve as a reference corpus.
The collected data should be representative of the reality and phenomena under investigation, but what does representativeness even mean? At its core, representativeness refers to depicting or symbolizing something accurately. In discourse studies, the results of the analysis depend on what we include in the corpus. On the one hand, there is a risk of amassing data that reflect a single phenomenon, leading to redundancy when discerning communication patterns. On the other hand, there is a risk of fragmentation: collecting diverse types of data that map various strategies but do not represent any pattern well enough to support a comprehensive analysis.
At the beginning of the project, we faced questions about representation in data collection: How do we grasp what is most important and typical for the media in the chosen countries? How do we collect datasets that adequately represent what we want to describe? How do we collect data systematically enough to enable comparisons between languages? How do we prevent excessively detailed or repetitive data from hindering new insights? As these are methodological choices, there is no single correct answer to these questions. We chose to prioritize data typical of widely accessible media outlets rather than focusing on outliers, which, though interesting, reach only marginal audiences. We aimed for the most opinion-forming media, so we selected the outlets that are most popular (watched, listened to, read, browsed) and most frequently cited by other media. To identify them, we consulted media consumption reports by media tracking organizations (e.g., the Polish Institute of Media Monitoring). At the outset of the project we also assumed that data from social media platforms could be included, so we proposed that the sourced material consist of approximately 80% texts from media outlets and 20% from social media, for both Polish and Romanian.
Subsequently, the Polish team designed the corpus compilation protocol for media outlets and piloted it on a small sample of selected media. The Romanian sub-team later followed the same procedure for choosing, collecting, and annotating Romanian media texts. In this post I draw mostly on information from the Polish sub-team. In further steps we chose to compile proportional amounts of texts from specific media, selected to diversify ideological stances and to include outlets already studied by other scholars (to enable comparisons or further developments), namely: (TV stations) Polsat News, TVN24; (newspapers) Gazeta Wyborcza, Polityka, Wprost, Newsweek, Do Rzeczy, wpolityce.pl (Tygodnik Sieci); (business press) Business Insider, Money, Bankier, Forsal, Forbes; (tabloids) SuperExpress and Fakt. We made these choices based on circulation figures from the turn of 2023 and 2024. In this way, we laid the foundation for common criteria across the two corpora (Polish and Romanian), ensuring systematic and comparable data collection across languages. We also launched data scraping for the English-language corpus (using the World News API), which serves as a point of reference in our studies.
We collected individual articles and posts for the Polish and Romanian corpora from each media outlet’s website separately (e.g., newsweek.pl, tvn24.pl). We used Google’s advanced search in incognito (private) mode to avoid personalized results. In the search field, we used the keyword “Ukraina [Ukraine]” (required to appear in the headline), then switched to the “News” tab and sorted the articles by relevance. We gathered articles published between February 24–28, 2022, and then for each full month from March 2022 through June 2024 (and will continue to do so). From each outlet, we selected six articles for February–June 2022 and five articles for each following month featuring “Ukraina [Ukraine]” in the title. This gives us comparable amounts of data for each six-month period of the Ukraine war. Whenever we had doubts about how to tag a given article in terms of its genre or thematic domain, we highlighted it for further discussion and verification with other team members. We saved the dataset as an Excel spreadsheet with separate columns for the article headline, content, and metadata (e.g., publication date, source).
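For readers who prefer a concrete illustration, the per-outlet sampling quota and the spreadsheet layout described above can be sketched in a few lines of Python. The function and column names below are purely illustrative assumptions, not the project’s actual scripts or headers:

```python
def articles_quota(year: int, month: int) -> int:
    """Articles collected per outlet per month: six for February-June 2022
    (including the initial February 24-28 window), five for each later month."""
    if year == 2022 and 2 <= month <= 6:
        return 6
    return 5

def make_row(headline: str, content: str, pub_date: str, source: str) -> dict:
    """One spreadsheet row: headline, content, and metadata columns."""
    return {
        "headline": headline,
        "content": content,
        "publication_date": pub_date,  # e.g., "2022-03-01"
        "source": source,              # e.g., "tvn24.pl"
    }
```

The point of the fixed quota is the comparability mentioned above: each six-month period of the war contributes a similar amount of text per outlet.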
Media texts are quite diverse in terms of genre, and different genres serve different purposes, so we decided to add a basic genre classification to the annotation: report (a factual text about events, data, official statements, press releases); feature (an in-depth analysis and/or opinionated material drawing on political statements); (expert) interview, with military experts, diplomats, economists, scientists, or policy advisors; advocacy (an outlet’s position piece, editorial, or any text with a strong appeal or advocacy); social media video/podcast/post; and other. In most cases it was relatively easy to classify texts as reports or interviews, but hybrid texts made it harder to distinguish features from advocacy.
Texts about the Ukraine war often focus on military and political aspects, but CORECON is also about other dimensions of the conflict. To balance the corpus and avoid an exclusively political and military focus, we created two annotation categories for text focus: one for political and military texts (PolMil) and the other for economic and social issues (EcoSoc). We aimed for a 50/50 proportion of the two categories, but this proved difficult to achieve, as PolMil texts predominated. For long, multidimensional texts, e.g., a text that explains a political decision to send military equipment and also mentions the financial context of that decision, we implemented coding verification procedures.
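The annotation scheme described in the last two paragraphs can be summarized in a small sketch. The label spellings and helper names are our own illustration, assumed for this example rather than taken from the team’s actual tooling:

```python
# Illustrative controlled vocabularies for the two annotation layers.
GENRES = {"report", "feature", "interview", "advocacy", "social_media", "other"}
FOCUS = {"PolMil", "EcoSoc"}

def validate_annotation(genre: str, focus: str) -> bool:
    """Flag rows whose genre or focus label falls outside the agreed scheme."""
    return genre in GENRES and focus in FOCUS

def polmil_share(focus_labels: list) -> float:
    """Share of PolMil texts, used to monitor the intended 50/50 balance."""
    return focus_labels.count("PolMil") / len(focus_labels)
```

A validity check like this catches the kind of mislabeled rows that the later cleaning phase had to correct by hand, and the balance ratio makes the PolMil predominance visible at a glance.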
A part of the CORECON corpus consists of social media texts. The Internet is an unprecedented source of natural language data (Koteyko, 2010), but it is much more difficult to collect representative data from it. In the case of media outlets, it was possible to design fairly comprehensive instructions and be confident that the corpus included what is most important. The data on social media platforms is fragmented, irregular, and spontaneous, which ruled out fully systematic collection. Moreover, posts are published by professional media outlets and opinion leaders as well as grassroots media, influencers, parties, and politicians popularizing their agendas. It is also difficult to determine which of these actors matter in the social media landscape, because each finds its own niches and publics. Not everything published on social media about Ukraine is ‘media coverage’ of the conflict; sometimes it relates to its ‘reception.’ In the end, we decided not to collect posts from the profiles of official media outlets and to mix purposeful sampling (with keywords, e.g., Ukraine, Ukrainian, Russia, Russian, Kremlin, Putin, Zelensky, war, conflict, special military operation, attack, etc.) with snowball sampling where applicable.
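The keyword criterion of the purposeful sampling can be sketched as a simple filter. In practice the sampling was done manually; this hypothetical function only illustrates the inclusion test, using the keyword list quoted above:

```python
# Keyword list from the purposeful sampling criterion (lowercased for matching).
KEYWORDS = {
    "ukraine", "ukrainian", "russia", "russian", "kremlin", "putin",
    "zelensky", "war", "conflict", "special military operation", "attack",
}

def matches_keywords(post_text: str) -> bool:
    """True if the post mentions at least one of the sampling keywords."""
    text = post_text.lower()
    return any(keyword in text for keyword in KEYWORDS)
```

Note that plain substring matching is a deliberate simplification; a real filter would need language-specific word forms (Polish and Romanian inflect heavily), which is one more reason the team sampled by hand.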
This ‘messy’ character of social media platforms was not the only problem with data collection in this sphere. The situation is even more complicated when it comes to social media genres: blog posts, social media posts and comments (which usually mix traditional media genres, some having headlines and some not), online TV programs, podcasts, etc. Especially problematic were videos and podcasts, which would require transcription before analysis; this might be counterproductive given the length of, e.g., a transcript of a single podcast episode compared to a Facebook or X post. Including such material would bias the whole corpus through its overrepresentation.
Considering all these problems, and building on the annotation protocol for media outlets, we created similar instructions for social media, but with distinct categories. We included the header (where applicable) and the text. Metadata here included: entry number, social network site, publication and data collection dates, name of the profile/handle, link, numbers of subscribers or followers, views and reactions, and modality instead of genre. The annotation scheme also included the PolMil and EcoSoc text focus. This part of the corpus is fragmented, and it does not cover all the most important phenomena from the social media sphere.
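The metadata scheme for social media entries can be represented as a record like the following. The field names are our illustrative assumptions, not the actual spreadsheet headers:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SocialMediaEntry:
    """One social media text with the metadata listed above."""
    entry_no: int
    platform: str            # social network site
    published: str           # publication date
    collected: str           # data collection date
    profile: str             # name of the profile/handle
    link: str
    followers: int           # subscribers or followers
    views: int
    reactions: int
    modality: str            # used instead of "genre" in the outlet protocol
    focus: str               # "PolMil" or "EcoSoc"
    header: Optional[str]    # only some social media genres have headers
    text: str
```

Making `header` optional mirrors the observation above that some social media genres carry headlines and some do not.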
After data compilation and annotation, we verified the content of the corpus and cleaned it up. This phase included correcting mistakes made during data annotation (e.g., wrongly classified genres or text focus, or wrong data formats), which would otherwise be problematic when comparing the Polish corpus with the Romanian one. We identified and removed extraneous segments from the texts, e.g., advertisements we had not noticed during data collection. We uploaded the final corpus to the online repository https://corecon.omeka.net/, where it serves the CORECON team as a basic database for our studies. Sometimes we supplement this data by collecting additional smaller corpora for specific studies (see our results at https://grants.ulbsibiu.ro/corecon/results/). The CORECON corpus is an open-access resource, and other researchers may also use it for their own purposes.
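Removal of extraneous segments of the kind mentioned above can be sketched as a pattern-based cleanup. The marker “REKLAMA” (Polish for “advertisement”) and the patterns are hypothetical examples; in the project, extraneous segments were identified and removed manually:

```python
import re

# Hypothetical ad markers; a real cleanup list would be built per outlet.
AD_PATTERNS = [re.compile(r"REKLAMA.*?(?=\n|$)", re.IGNORECASE)]

def strip_ads(text: str) -> str:
    """Remove lines matching known ad markers, then collapse blank gaps."""
    for pattern in AD_PATTERNS:
        text = pattern.sub("", text)
    return re.sub(r"\n{2,}", "\n", text).strip()
```

Even a simple pass like this matters for cross-language comparison: leftover boilerplate inflates word counts and distorts keyword statistics differently in each sub-corpus.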
Text by Marcin Deutschmann
Baker, P., Gabrielatos, C., KhosraviNik, M., Krzyżanowski, M., McEnery, T., & Wodak, R. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society, 19(3), 273–306. https://doi.org/10.1177/0957926508088962
Koteyko, N. (2010). Mining the internet for linguistic and social data: An analysis of ‘carbon compounds’ in Web feeds. Discourse & Society, 21(6), 655–674. https://doi.org/10.1177/0957926510381220
Mitra, T., & Gilbert, E. (2021). CREDBANK: A Large-Scale Social Media Corpus With Associated Credibility Annotations. Proceedings of the International AAAI Conference on Web and Social Media, 9(1), 258–267. https://doi.org/10.1609/icwsm.v9i1.14625
Subtirelu, N. C., & Baker, P. (2020). Corpus-based approaches. In J. Flowerdew & J. E. Richardson (Eds.), The Routledge Handbook of Critical Discourse Studies (pp. 106–119). Routledge, Taylor & Francis Group.