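Effective Use of NLP Methods: The Example of the Codex Suprasliensis

Original title: Effektiver Einsatz von NLP-Methoden am Beispiel des Codex Suprasliensis (parallel Bulgarian title: Ефективно използване на методите на NLP въз основа на пример от Codex Suprasliensis)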

Author(s): Vladimir Neumann
Subject(s): e-Scripta
Published by: Institute for Literature BAS
Print ISSN: 1312-238X
Summary/Abstract:

Computational methods are becoming increasingly essential in historical philology, yet reconciling the linguistic and technical aspects of text analysis remains difficult. This study presents a methodologically transparent use case that documents the entire computational philological workflow, from data acquisition and modeling to analysis and visualization, in a structured and reproducible manner. Using the Codex Suprasliensis, one of the most significant Old Slavic manuscripts, as a case study, we demonstrate how modern Natural Language Processing (NLP) techniques, in particular the Stanza library for morphosyntactic annotation and DataFrame-based corpus structuring, can facilitate the exploration of historical text corpora. Special emphasis is placed on benchmarking Stanza's performance on Old Church Slavonic: its segmentation, tagging, and parsing accuracy are evaluated against existing gold-standard datasets. We also discuss how DataFrame-based modeling supports an efficient and transparent structuring of linguistic data, allowing flexible transformations and reproducible analyses. To support further research and methodological validation, all functional and extensively annotated scripts, including the complete NLP pipeline, are permanently available on the GitHub platform of the Berlin State Library. The findings highlight the importance of structured corpus processing in computational philology and contribute to the ongoing refinement of NLP methodologies for historical languages.
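
The benchmarking idea described in the abstract can be illustrated with a minimal sketch. It is not taken from the article's scripts: it assumes that both the tool output and the gold standard are available as CoNLL-U text, flattens each into one row per token (the kind of table that could then be handed to a DataFrame library such as pandas, which is itself an assumption here), and computes a simple per-column accuracy. The function names `make_line`, `parse_conllu`, and `accuracy` are hypothetical, and the tokens `w1` to `w3` are placeholders rather than real Old Church Slavonic data.

```python
from typing import Dict, List

# The ten standard CoNLL-U columns.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def make_line(idx: int, form: str, lemma: str, upos: str,
              head: int, deprel: str) -> str:
    """Build one tab-separated CoNLL-U token line (unused columns are '_')."""
    return "\t".join([str(idx), form, lemma, upos, "_", "_",
                      str(head), deprel, "_", "_"])


def parse_conllu(text: str) -> List[Dict[str, object]]:
    """Flatten CoNLL-U text into one row (dict) per token."""
    rows: List[Dict[str, object]] = []
    sent_id = 0
    in_sentence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank line ends a sentence
            in_sentence = False
            continue
        if line.startswith("#"):          # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue                      # skip ranges (1-2) and empty nodes (1.1)
        if not in_sentence:
            sent_id += 1
            in_sentence = True
        row: Dict[str, object] = dict(zip(CONLLU_FIELDS, cols))
        row["sent_id"] = sent_id
        rows.append(row)
    return rows


def accuracy(gold: List[Dict[str, object]],
             pred: List[Dict[str, object]], column: str) -> float:
    """Share of tokens whose value in `column` matches the gold row."""
    if len(gold) != len(pred):
        raise ValueError("token counts differ; align tokenization first")
    if not gold:
        return 0.0
    hits = sum(g[column] == p[column] for g, p in zip(gold, pred))
    return hits / len(gold)


# Toy data with placeholder tokens (hypothetical, not from the Codex).
gold_text = "\n".join([
    "# sent_id = toy-1",
    make_line(1, "w1", "l1", "ADV", 2, "advmod"),
    make_line(2, "w2", "l2", "VERB", 0, "root"),
    make_line(3, "w3", "l3", "NOUN", 2, "nsubj"),
    "",
])
pred_text = "\n".join([
    make_line(1, "w1", "l1", "ADP", 2, "advmod"),   # wrong UPOS tag
    make_line(2, "w2", "l2", "VERB", 0, "root"),
    make_line(3, "w3", "l3", "NOUN", 1, "nsubj"),   # wrong head
    "",
])

gold_rows = parse_conllu(gold_text)
pred_rows = parse_conllu(pred_text)

for col in ("lemma", "upos", "head"):
    print(col, round(accuracy(gold_rows, pred_rows, col), 3))
# If pandas is installed, pandas.DataFrame(gold_rows) gives a token table.
```

This sketch only handles the case where both files share the same tokenization; evaluating segmentation itself, as the abstract mentions, would need an alignment step that is left out here.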

Journal: Scripta & e-Scripta vol. 25, 2025

Page Range: 79-100
No. of Pages: 22
Language: German

Year: 2025
Issue No.: Scripta & e-Scripta vol. 25, 2025

Submitted on: 19 August 2025
Vladimir Neumann

Germany

Vladimir.neumann@sbb.spk-berlin.de

Staatsbibliothek zu Berlin

Description

Vladimir Neumann studied Slavic Studies in Bonn and received his doctorate in Berlin. He currently works as a subject specialist for Slavic Studies at the Berlin State Library, where for over 20 years he has been involved both in developing the East European collection and in continuously expanding the Slavistik-Portal, a central hub for scholarly information in the field. In recent years, his work has focused on converting all German-language and international Slavic bibliographies into database format, digitizing and OCR-indexing the library's collection of Church Slavonic prints (Kirchenslavica Digital, 17th to 19th century), and processing and transforming historical multilingual Slavic dictionaries (MultiSlavDict). The latter project currently comprises around a dozen sources with more than 300,000 lemmas and approximately five million word forms. Vladimir Neumann also regularly offers online training courses at the Berlin State Library on topics such as Natural Language Processing, text corpora in Slavic Studies, and OCR techniques for Church Slavonic sources.

KEYWORDS: computational philology // natural language processing // Old Church Slavonic // Stanza and corpus annotation // DataFrame-based text structuring
