natural language processing

BelarusianGLUE: Analyzing Performance of Open-weight Models
Scripta & e-Scripta vol. 25, 2025; posted by floyd, Tue, 08/19/2025 - 17:18
BelarusianGLUE: анализ на продуктивността на модели от отворен тип

We use BelarusianGLUE, a recently introduced benchmark, to analyze the performance of open-weight large language models (LLMs) on Belarusian language understanding tasks. We investigate the impact of prompting language, few-shot prompts, orthography (modern/classical/Latin), chat templates, and evaluation mode (discriminative/generative). Our findings suggest that more recent models generally perform better, but improvements are gradual. Fine-tuning on related Slavic languages does not always improve Belarusian understanding. Classical orthography has limited impact, while latinization degrades performance. Analysis of specific tasks (sentiment analysis, Winograd schema challenge) reveals biases in the models, difficulties with understanding linguistic structure, and gaps in world knowledge and cultural context.

Subject: e-Scripta
Keywords: natural language processing, Belarusian language, large language models, language understanding, evaluation
Evaluating Stanza and UDPipe for Morphosyntactic Annotation of Old Russian: A Case Study on Maximus the Greek
Scripta & e-Scripta vol. 25, 2025; posted by floyd, Tue, 08/19/2025 - 17:15
Оценка на Stanza и UDPipe за морфосинтактична анотация на староруски език: казусът Максим Грек

The automation of morphosyntactic annotation of Old Russian texts represents a key challenge in contemporary Slavistics, underscoring the need for computational tools capable of processing historical linguistic data with high accuracy. This study qualitatively evaluates the performance of two statistical taggers, Stanza and UDPipe, in annotating a text by Maximus the Greek, using the TOROT and RNC treebanks as reference corpora. The analysis assesses the accuracy of morphosyntactic annotation—specifically, part-of-speech tagging, morphological feature assignment, and lemmatisation—identifying recurring errors and structural limitations in applying these tools to historical Slavic texts. While both taggers facilitate annotation, they do not yet ensure a level of automation sufficient for fully reliable linguistic analysis. Key challenges include the misinterpretation of morphosyntactic relationships and inaccuracies in grammatical feature assignment. The comparison with their respective reference corpora highlights these issues, demonstrating the need for further refinement in automated annotation methods. This study critically examines the applicability of current NLP technologies to historical texts, emphasizing the necessity of adapting existing models.

Subject: e-Scripta
Keywords: Stanza, UDPipe, natural language processing, Morphosyntactic analysis, Annotation, Old Russian, Maximus the Greek
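
The abstract above assesses three annotation layers (part-of-speech tags, morphological features, lemmas) against reference corpora. As a rough illustration only, and not the study's own procedure (which is qualitative), per-layer agreement between a tagger's output and a gold annotation could be computed as follows. The function name `layer_accuracy`, the dictionary keys, and the two toy tokens are all assumptions made for this sketch; they are not taken from the paper, TOROT, or the RNC.

```python
from collections import Counter

def layer_accuracy(gold, predicted, layers=("upos", "feats", "lemma")):
    """Return per-layer accuracy for two token-aligned lists of dicts.

    Hypothetical helper: assumes both annotations share one tokenization.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be token-aligned")
    correct = Counter()
    for gold_tok, pred_tok in zip(gold, predicted):
        for layer in layers:
            if gold_tok[layer] == pred_tok[layer]:
                correct[layer] += 1
    total = len(gold)
    return {layer: (correct[layer] / total if total else 0.0) for layer in layers}

# Toy data invented for illustration (not from the study).
gold = [
    {"upos": "NOUN", "feats": "Case=Nom|Number=Sing", "lemma": "слово"},
    {"upos": "NOUN", "feats": "Case=Gen|Number=Sing", "lemma": "богъ"},
]
predicted = [
    {"upos": "NOUN", "feats": "Case=Nom|Number=Sing", "lemma": "слово"},
    {"upos": "NOUN", "feats": "Case=Dat|Number=Sing", "lemma": "бога"},
]

scores = layer_accuracy(gold, predicted)
print(scores)  # {'upos': 1.0, 'feats': 0.5, 'lemma': 0.5}
```

Reporting the layers separately, rather than as one combined score, keeps errors in feature assignment visible even when part-of-speech tagging is correct.
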
Effektiver Einsatz von NLP-Methoden am Beispiel des Codex Suprasliensis [Effective Use of NLP Methods: The Example of the Codex Suprasliensis]
Scripta & e-Scripta vol. 25, 2025; posted by floyd, Tue, 08/19/2025 - 17:07
Ефективно използване на методите на NLP въз основа на пример от Codex Suprasliensis

The integration of computational methods in historical philology is becoming increasingly essential, yet challenges persist in harmonizing linguistic and technical aspects of text analysis. This study presents a comprehensive and methodologically transparent use case that documents the entire computational philological workflow, from data acquisition and modeling to analysis and visualization, in a structured and reproducible manner. Using the Codex Suprasliensis, one of the most significant Old Slavic manuscripts, as a case study, we demonstrate how modern Natural Language Processing (NLP) techniques, particularly the Stanza library for morphosyntactic annotation and DataFrame-based corpus structuring, can facilitate the exploration of historical textual corpora. A special emphasis is placed on benchmarking Stanza’s performance in processing Old Church Slavonic, evaluating its segmentation, tagging, and parsing accuracy against existing Gold Standard datasets. Additionally, we discuss the role of DataFrame-based modeling in ensuring an efficient and transparent structuring of linguistic data, allowing for flexible transformations and reproducible analyses. To support further research and methodological validation, all functional and extensively annotated scripts, including the complete NLP pipeline, are permanently provided via the GitHub platform of the Berlin State Library. The findings highlight the importance of structured corpus processing in computational philology and contribute to the ongoing refinement of NLP methodologies for historical languages.

Subject: e-Scripta
Keywords: Computational Philology, natural language processing, Old Church Slavonic, Stanza and Corpus Annotation, DataFrame-Based Text Structuring
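
The abstract above describes structuring annotated text in tabular, DataFrame-like form. The sketch below is not the authors' published pipeline; it only illustrates the general idea under stated assumptions: the annotation is available as CoNLL-U text (the ten-column format used by Universal Dependencies tools), and a list of per-token dictionaries is an acceptable stand-in for a DataFrame. The function name `conllu_to_rows`, the added `sentence` column, and the two sample tokens are hypothetical.

```python
# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)

def conllu_to_rows(text):
    """Turn CoNLL-U text into one dict per token line, plus a sentence counter."""
    rows = []
    sentence = 0
    in_sentence = False
    for raw in text.splitlines():
        line = raw.rstrip("\r\n")
        if not line.strip():
            in_sentence = False  # a blank line closes the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comments are skipped in this sketch
        if not in_sentence:
            sentence += 1
            in_sentence = True
        values = line.split("\t")
        if len(values) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 tab-separated columns, got {len(values)}")
        row = dict(zip(CONLLU_FIELDS, values))
        row["sentence"] = sentence
        rows.append(row)
    return rows

# Two single-token sentences invented for illustration (not from the Codex).
SAMPLE = "\n".join([
    "# sent_id = 1",
    "\t".join(["1", "слово", "слово", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "",
    "# sent_id = 2",
    "\t".join(["1", "богъ", "богъ", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "",
])

rows = conllu_to_rows(SAMPLE)
print(len(rows), rows[1]["form"], rows[1]["sentence"])
```

If pandas is installed, `pandas.DataFrame(rows)` would turn the same list into a table with one row per token, which is the kind of structure that lends itself to filtering and grouping by sentence, lemma, or part of speech.
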

Wege zur verbesserten automatischen Annotation des mittelbulgarischen Kirchenslawischen [Ways towards Improved Automatic Annotation of Middle Bulgarian Church Slavonic]

Фабио Майо. Начини за подобряване на автоматичните анотации на среднобългарски църковнославянски текстове

  • Summary/Abstract

    The last decade has brought an upswing in research on natural language processing. However, it is well known that historical language stages are largely underrepresented. Middle Bulgarian Church Slavonic, a language variety with significant literary productivity, is a prime example. The current paper shows how annotated texts of related language varieties can be used to annotate texts written in Middle Bulgarian Church Slavonic, such as the 14th-century translation of the Dioptra. In particular, I present a way of adapting the available training data and of reducing the differences between training and test data, thereby improving the result of the automatic morphological annotation. Moreover, I demonstrate that a comparison with the original work, written in Byzantine Greek, can further improve the annotation results by carefully disambiguating homonymous word forms. The presented results can benefit research on Middle Bulgarian Church Slavonic, as they show how texts in this variety can be annotated without authentic training data. The proposed method may be of use beyond Slavonic Studies, however: using training data from genetically related language varieties in combination with translations may also serve to annotate other underrepresented language varieties.


From Annotation to Modeling: Computational Horizons for Medieval Slavic Studies

От анотиране към моделиране: компютърни хоризонти за славистичната медиевистика

  • Summary/Abstract

    This paper is a write-up of a keynote from El’Manuscript 2021, reflecting on the ways in which the field of computationally supported medieval Slavic studies has and has not changed since the mid-2000s. Looking towards developments in the broader fields of digital humanities and natural-language processing, it explores how recent improvements in the tools at our disposal for mass digitization of manuscripts and text analysis at scale open up possibilities for working with manuscripts that have received very little attention. For these advancements to be feasible, however, scholars will need to prepare and share their digitized texts and annotations in ways that are not currently the norm, though a number of projects provide exemplary models of how these new conventions could be put into practice.

