Stanza and Corpus Annotation | Scripta & e-Scripta

Scripta & e-Scripta vol. 25, 2025

Vladimir Neumann

Effektiver Einsatz von NLP-Methoden am Beispiel des Codex Suprasliensis

Ефективно използване на методите на NLP въз основа на пример от Codex Suprasliensis

[English: Effective Use of NLP Methods: The Example of the Codex Suprasliensis]

Summary/Abstract

The integration of computational methods into historical philology is becoming increasingly essential, yet challenges persist in harmonizing the linguistic and technical aspects of text analysis. This study presents a comprehensive and methodologically transparent use case that documents the entire computational-philological workflow, from data acquisition and modeling to analysis and visualization, in a structured and reproducible manner. Using the Codex Suprasliensis, one of the most significant Old Church Slavonic manuscripts, as a case study, we demonstrate how modern Natural Language Processing (NLP) techniques, particularly the Stanza library for morphosyntactic annotation and DataFrame-based corpus structuring, can facilitate the exploration of historical textual corpora. Special emphasis is placed on benchmarking Stanza’s performance on Old Church Slavonic by evaluating its segmentation, tagging, and parsing accuracy against existing gold-standard datasets. Additionally, we discuss the role of DataFrame-based modeling in structuring linguistic data efficiently and transparently, which allows for flexible transformations and reproducible analyses. To support further research and methodological validation, all functional and extensively annotated scripts, including the complete NLP pipeline, are made permanently available via the GitHub platform of the Berlin State Library. The findings highlight the importance of structured corpus processing in computational philology and contribute to the ongoing refinement of NLP methodologies for historical languages.
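
As an illustration of what evaluating tagging and parsing accuracy against a gold standard can involve, the following minimal sketch compares a predicted annotation with a gold annotation in CoNLL-U format and reports UPOS accuracy together with unlabeled and labeled attachment scores (UAS, LAS). It is not the author's evaluation code. The function names `parse_conllu` and `score` are hypothetical, the token forms and labels are invented placeholders rather than text from the Codex Suprasliensis, and the sketch assumes that both files share the same tokenization, so segmentation accuracy is not covered.

```python
# Toy CoNLL-U fragments. Forms, lemmas, and labels are invented placeholders,
# not data from the Codex Suprasliensis or from any treebank.
GOLD = (
    "1\tw1\tl1\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tw2\tl2\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tw3\tl3\tADJ\t_\t_\t4\tamod\t_\t_\n"
    "4\tw4\tl4\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)
PRED = (
    "1\tw1\tl1\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tw2\tl2\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tw3\tl3\tNOUN\t_\t_\t4\tamod\t_\t_\n"  # wrong UPOS
    "4\tw4\tl4\tNOUN\t_\t_\t2\tobl\t_\t_\n"   # wrong dependency label
)


def parse_conllu(text):
    """Return one dict per syntactic word; skip comments, blank lines,
    multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1)."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue
        rows.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return rows


def score(gold, pred):
    """Compute UPOS accuracy, UAS and LAS for identically tokenized input."""
    if not gold:
        raise ValueError("gold annotation is empty")
    if [g["form"] for g in gold] != [p["form"] for p in pred]:
        raise ValueError("tokenization differs; tokens must be aligned first")
    pairs = list(zip(gold, pred))
    n = len(pairs)
    upos = sum(g["upos"] == p["upos"] for g, p in pairs)
    uas = sum(g["head"] == p["head"] for g, p in pairs)
    las = sum(
        g["head"] == p["head"] and g["deprel"] == p["deprel"] for g, p in pairs
    )
    return {"upos": upos / n, "uas": uas / n, "las": las / n}


if __name__ == "__main__":
    print(score(parse_conllu(GOLD), parse_conllu(PRED)))
```

The row-per-token dictionaries returned by `parse_conllu` have the same tabular shape that a DataFrame-based corpus model would use, so they could be passed to a DataFrame constructor for further filtering and aggregation. A full benchmark would also need token alignment to measure segmentation quality, which this sketch deliberately omits.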

Subject: e-Scripta

Keywords: Computational Philology; Natural Language Processing; Old Church Slavonic; Stanza and Corpus Annotation; DataFrame-Based Text Structuring