48th Austrian Linguistics Conference (48. Österreichische Linguistiktagung), panel “Digitale Slawistik” (‘Digital Slavic Studies’)
Summary/Abstract
The integration of digital technologies has become increasingly important across academic disciplines in the 21st century, rapidly transforming research possibilities and methodologies in the humanities and social sciences. Since the emergence of transformer neural networks like BERT (Bidirectional Encoder Representations from Transformers) and the subsequent rise of Large Language Models, linguistic research has experienced yet another wave of profound changes, especially in the field of Natural Language Processing (NLP). Despite these developments, implementing digital technologies remains a desideratum in Slavic linguistics, particularly when working with low-resource historical varieties. The panel “Digitale Slawistik” (‘Digital Slavic Studies’), held as a part of the 48th Austrian Linguistics Conference (48. Österreichische Linguistiktagung) at the University of Innsbruck on December 18–19, 2024, aimed to address this gap by bringing together linguists who study Slavic languages and incorporate digital methods into their research practice. Over the two days, scholars from Germany, Austria, Italy, and the Czech Republic presented their research. The event was organised by Elias Bounatirou (University of Vienna), Anna Jouravel (University of Freiburg), Maximilian Grübsch (University of Vienna), and Ilia Afanasev (University of Vienna).
Computer-assisted Study of Historical Lemkian (Transcarpathian) Lects: Basic Vocabulary Approach
Scripta & e-Scripta vol. 25, 2025
Tue, 08/19/2025 - 17:28
Ilia Afanasev
Computer-Assisted Study of Historical Lemkian (Transcarpathian) Dialects: A Basic Vocabulary Approach
This research presents the first step in digitising texts of historical Lemkian (Transcarpathian) dialects, recorded in the 1930s, and transforming them into an open-access dataset. The developed dataset includes morphological tagging, lemmatisation, and data on the named entities and basic vocabulary items. This allows pre-existing models for automatic tagging of basic vocabulary in Slavic to be evaluated on the new material quantitatively (checking their efficiency), qualitatively (going example by example), and formally (by analysing the research design of previous studies). The present pilot study shows that existing models cannot efficiently detect enough Automatic Similarity Judgement Program (ASJP) basic vocabulary list items in the Lemkian texts (F1-score below 0.5); they find only the words that formally coincide completely with their cognates in other Slavic languages (personal pronouns). The bar chart-based visualisation shows that the previously hypothesised formalisation of basic vocabulary items as similar in distribution to the named entities is incorrect, and that a new formalisation is required. The main contribution of the work is an open-access dataset of historical Lemkian dialects.
Subject:
e-Scripta
Keywords:
Lemkian
Transcarpathian
dialectology
computer-assisted study
basic vocabulary
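The F1-score cited in the abstract above is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, not the evaluation code used in the study; the function name and the token indices are hypothetical and invented purely for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for two sets of tagged item positions."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical positions of basic vocabulary items in a text (invented data).
gold_items = {0, 3, 5, 8, 11, 14}
# Hypothetical positions flagged by a tagger: two correct, one spurious.
predicted_items = {0, 3, 7}

precision, recall, f1 = precision_recall_f1(gold_items, predicted_items)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this invented case the tagger is fairly precise but misses most gold items, so the F1-score stays below 0.5, which is the kind of outcome the abstract describes.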
BelarusianGLUE: Analyzing Performance of Open-weight Models
Scripta & e-Scripta vol. 25, 2025
Tue, 08/19/2025 - 17:18
Maksim Aparovich
Volha Harytskaya
Vladislav Poritski
Oksana Volchek
Pavel Smrž
BelarusianGLUE: Analysis of the Performance of Open-Weight Models
We use BelarusianGLUE, a recently introduced benchmark, to analyze the performance of open-weight large language models (LLMs) on Belarusian language understanding tasks. The impact of prompting language, few-shot prompts, orthography (modern/classical/Latin), chat templates, and evaluation mode (discriminative/generative) is investigated. Our findings suggest that more recent models generally perform better, but improvements are gradual. Fine-tuning on related Slavic languages doesn’t always improve Belarusian understanding. Classical orthography has limited impact, while latinization degrades performance. Analysis of specific tasks (sentiment analysis, Winograd schema challenge) reveals biases in the models, difficulties with understanding linguistic structure, and gaps in world knowledge and cultural context.
Subject:
e-Scripta
Keywords:
natural language processing
Belarusian language
large language models
language understanding evaluation
Evaluating Stanza and UDPipe for Morphosyntactic Annotation of Old Russian: A Case Study on Maximus the Greek
Scripta & e-Scripta vol. 25, 2025
Tue, 08/19/2025 - 17:15
Beatrice Bindi
Evaluation of Stanza and UDPipe for Morphosyntactic Annotation of Old Russian: The Case of Maximus the Greek
The automation of morphosyntactic annotation of Old Russian texts represents a key challenge in contemporary Slavistics, underscoring the need for computational tools capable of processing historical linguistic data with high accuracy. This study qualitatively evaluates the performance of two statistical taggers, Stanza and UDPipe, in annotating a text by Maximus the Greek, using the TOROT and RNC treebanks as reference corpora. The analysis assesses the accuracy of morphosyntactic annotation—specifically, part-of-speech tagging, morphological feature assignment, and lemmatisation—identifying recurring errors and structural limitations in applying these tools to historical Slavic texts. While both taggers facilitate annotation, they do not yet ensure a level of automation sufficient for fully reliable linguistic analysis. Key challenges include the misinterpretation of morphosyntactic relationships and inaccuracies in grammatical feature assignment. The comparison with their respective reference corpora highlights these issues, demonstrating the need for further refinement in automated annotation methods. This study critically examines the applicability of current NLP technologies to historical texts, emphasizing the necessity of adapting existing models.
Subject:
e-Scripta
Keywords:
Stanza
UDPipe
natural language processing
Morphosyntactic analysis
Annotation
Old Russian
Maximus the Greek
The Concept of “New Ethics” in Russian Media Discourse: A Corpus Analysis
Summary/Abstract
This paper analyzes the concept of novaja ėtika (‘new ethics’) as represented in Russian media discourse, focusing on a comparison between two prominent Russian media sources: Lenta.ru and Meduza.io. The study is based on a corpus of 86 texts published between 2019 and 2024. Using corpus-based methods – including frequency analysis, topic modeling, and named entity recognition (NER) – the study identifies distinct differences in how the term novaja ėtika is conceptualized, discussed, and positioned in public debates. Lenta.ru is one of Russia’s largest mainstream news sites, known for its pro-government neutrality and wide readership across the country. Meduza.io, on the other hand, is an exile-founded outlet that is often critical of Russian state policy and targets a younger, urban, liberal audience. Lenta.ru frames novaja ėtika predominantly negatively, associating it with external pressures, cultural conflicts, and moral censorship. In contrast, Meduza.io approaches the concept analytically, emphasizing its philosophical foundations, discursive development, and socio-cultural implications. The analysis also highlights differences in the representation of actors, revealing that Lenta.ru focuses on geopolitical actors and institutional structures, while Meduza.io also prioritizes individual commentators and cultural influencers. The results illustrate broader discursive strategies and cultural cleavages in contemporary Russian media that reflect competing visions of social norms, public morality, and identity politics. As an exploratory study, it is subject to methodological limitations, including the size and scope of the corpus.
Effektiver Einsatz von NLP-Methoden am Beispiel des Codex Suprasliensis
Scripta & e-Scripta vol. 25, 2025
Tue, 08/19/2025 - 17:07
Vladimir Neumann
Effective Use of NLP Methods: The Example of the Codex Suprasliensis
The integration of computational methods in historical philology is becoming increasingly essential, yet challenges persist in harmonizing linguistic and technical aspects of text analysis. This study presents a comprehensive and methodologically transparent use case that documents the entire computational philological workflow, from data acquisition and modeling to analysis and visualization, in a structured and reproducible manner. Using the Codex Suprasliensis, one of the most significant Old Slavic manuscripts, as a case study, we demonstrate how modern Natural Language Processing (NLP) techniques, particularly the Stanza library for morphosyntactic annotation and DataFrame-based corpus structuring, can facilitate the exploration of historical textual corpora. A special emphasis is placed on benchmarking Stanza’s performance in processing Old Church Slavonic, evaluating its segmentation, tagging, and parsing accuracy against existing Gold Standard datasets. Additionally, we discuss the role of DataFrame-based modeling in ensuring an efficient and transparent structuring of linguistic data, allowing for flexible transformations and reproducible analyses. To support further research and methodological validation, all functional and extensively annotated scripts, including the complete NLP pipeline, are permanently provided via the GitHub platform of the Berlin State Library. The findings highlight the importance of structured corpus processing in computational philology and contribute to the ongoing refinement of NLP methodologies for historical languages.
Subject:
e-Scripta
Keywords:
Computational Philology
natural language processing
Old Church Slavonic
Stanza and Corpus Annotation
DataFrame-Based Text Structuring
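The DataFrame-based structuring mentioned in the abstract above can be pictured as one table row per annotated token. The sketch below is not the author's pipeline: it assumes the annotation is available in the ten-column CoNLL-U format, and the function name `conllu_to_rows`, the added `sentence` field, and the sample tokens with their tags are all hypothetical and invented for illustration.

```python
# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def conllu_to_rows(text):
    """Turn CoNLL-U text into a list of dicts, one per token line."""
    rows = []
    sentence_index = 0
    seen_token = False
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if seen_token:
                sentence_index += 1
                seen_token = False
            continue
        if line.startswith("#"):
            continue  # skip sentence-level comments
        values = line.split("\t")
        if len(values) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 tab-separated fields: {line!r}")
        row = dict(zip(CONLLU_FIELDS, values))
        row["sentence"] = sentence_index
        rows.append(row)
        seen_token = True
    return rows


# Invented two-sentence sample; the annotation is illustrative only.
sample = "\n".join([
    "# sent_id = 1",
    "\t".join(["1", "слово", "слово", "NOUN", "_", "_", "0", "root", "_", "_"]),
    "",
    "# sent_id = 2",
    "\t".join(["1", "глагола", "глаголати", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "имъ", "и", "PRON", "_", "_", "1", "iobj", "_", "_"]),
    "",
])

rows = conllu_to_rows(sample)
for row in rows:
    print(row["sentence"], row["form"], row["lemma"], row["upos"])
```

Rows in this shape could be handed to a tabular library, for example `pandas.DataFrame(rows)`, after which filtering and grouping by lemma or part of speech become ordinary column operations.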
The Thematically Focused Complex Preposition “v lice”: A Diachronic Analysis Using BERT
Summary/Abstract
This article examines complex prepositions in Russian using the construction v lice as a case study. This denominal complex preposition, consisting of the primary preposition v and a noun, exemplifies the dynamic transitional processes between word classes. The central focus of the analysis is the question of which semantic properties are associated with the degree of establishment of such constructions. The article adopts both a synchronic and diachronic perspective, with particular attention to developments since the 19th century, during which complex prepositions increasingly entered scientific, technical, and journalistic writing styles. Using corpus-based methods and embedding-based techniques (BERT), the study reconstructs semantic shifts and identifies functional-semantic changes. In doing so, it contributes to the description of the internal dynamics of complex prepositions in Russian.
Digital Edition of Pop Punčov Sbornik: Project Note
Scripta & e-Scripta vol. 24, 2024
Thu, 10/03/2024 - 16:35
Ivan Šimko
The described resource is an online tool, designed for studying texts and diachronic variation of language. The core of its corpus is represented by the Pop Punčov Sbornik, a West Bulgarian manuscript from 1796, released together with smaller examples of 14th–19th century Balkan Slavic varieties. Aside from the data, providing a unique view of historical dialects, it also provides a user-friendly interface and modular structure, thus allowing both easy additions of new content and features, as well as training of students and lay people interested in historical literature. The resource also contains extensive documentation concerning both grammar and philological data about the sources.
Subject:
e-Scripta
Keywords:
Balkan Slavic
diachronic corpus
damaskini
Church Slavonic
Statistical significance of the components of lexical synonymous series in ancient Bulgarian written manuscripts: search for a method
Summary/Abstract
The results of statistical experiments to find the characteristics of words that are traditionally considered as the Ohrid-Moravian and Preslav components of synonymous series – иерѣи ‘priest’ – жьрьць ‘priest, cleric’ – свѧщеньникъ ‘priest, clergyman’, колѣно ‘knee’, ‘kindred’ – племѧ ‘tribe, genus’, коньчина ‘demise, end’ – коньць ‘end’, кънигы ‘books’ – писаниѥ ‘scripture’, любодѣица ‘adulteress, fornicator’ – блѫдьница ‘harlot’ are presented. The use of information about the relative number of words in a subcorpus, about significant deviations from the average values, and the calculation of statistical characteristics of lexemes in each of the subcorpora made it possible, in particular, to detect opposed and non-opposed components of synonymous series. The methods used to identify the statistical characteristics of words have shown that the degree of opposition of synonyms can be different – statistically significant or statistically insignificant. On this basis, it is concluded that it is necessary to move away from the unconditional attribution of the components of the synonymic series to the Ohrid-Moravian and Preslav vocabulary: the relations between the components of each synonymic series are individual and can range from statistically opposed in the texts of different schools to
Serbian Early Printed Books from Venice: Creating Models for Automatic Text Recognition Using Transkribus
Scripta & e-Scripta vol. 22, 2022
Wed, 08/17/2022 - 08:39
Vladimir Polomac
Vladimir R. Polomac. Serbian Early Printed Books from Venice: Creating Models for Automatic Text Recognition Using Transkribus
The paper describes the process of creating a model for the automatic recognition of Serbian Church Slavonic printed books from Venice (from Božidar and Vincenzo Vuković’s printery) using the Transkribus software platform, which is based on the principles of artificial intelligence and machine learning. Using the example of the Prayer Book (Euchologion) (1538–1540) from Božidar Vuković’s printery, it is shown that a successful model for the automatic recognition of individual books (with around 5% of unrecognized characters) can be trained on material consisting of approximately 4000 words, and that a larger amount of training material (in our case around 38000 words) improves the model and reduces the error rate (to between 1% and 2% of unrecognized characters). The most notable result of the paper is the creation of a generic model for the automatic text recognition of Serbian Church Slavonic books from Božidar and Vincenzo Vuković’s printery. The initial version of the generic model (called Dionisio 1.0 after Božidar Vuković’s Italian pseudonym, Dionisio della Vecchia) is the first resource for the automatic recognition of the Serbian medieval Cyrillic script that is publicly available to all users of the Transkribus software platform (see https://readcoop.eu/model/dionisio-1-0/).
Subject:
e-Scripta
Digital humanities
Keywords:
Transkribus
Automatic Text Recognition
Serbian Early Printed Books
Artificial Intelligence
Machine Learning
Venice
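The percentages of unrecognized characters quoted in the abstract above are character-level error figures. One common way to compute such a figure is the character error rate (CER): the edit distance between the recognized text and a ground-truth transcription, divided by the length of the ground truth. The sketch below assumes this definition; whether it matches the exact measure reported in the paper is an assumption, and the two strings are invented for illustration.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of single-character edits turning reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the ground-truth text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Invented ground truth and recognizer output differing in one character.
ground_truth = "молитва господня"
recognized = "молитва господна"

cer = character_error_rate(ground_truth, recognized)
print(f"CER = {cer:.2%}")
```

Here one wrong character in a sixteen-character line gives a CER of 6.25%, so an error rate of 1–2% corresponds to roughly one or two wrong characters per hundred.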
The Bamberg Cyrillic Alphabet – a Colour Facsimile
Scripta & e-Scripta vol. 22, 2022
Wed, 08/17/2022 - 08:37
Sebastian Kempgen
Sebastian Kempgen. The Bamberg Cyrillic Alphabet: A Colour Facsimile
The so-called Bamberg Cyrillic Alphabet (ca. 13th c.) is one of the oldest and most reliable xenographic Slavic alphabets, i.e. a Cyrillic alphabet added to a Latin manuscript of unrelated content. It has been published and edited before in black-and-white, and it is presented here for the first time in a high-quality colour photograph, accompanied by a slightly revised tabular representation.
Subject:
e-Scripta
Digital humanities
Keywords:
Cyrillic Script
Historic Alphabets
Facsimile
Janusz S. Bień. Parkosz’s Treatise from a Typographic Point of View
Summary/Abstract
The 15th-century Latin manuscript containing a treatise by Parkosz was the very first proposal for Polish spelling. To account for all the phonemes of Polish, some new letters were proposed that are not available in present-day fonts. This makes it difficult to quote the proposal when discussing the history of Polish spelling. The paper describes the transliteration proposed by the author, which uses the characters available in the Unicode standard. The ultimate solution is, of course, to create a specialized font, and the paper mentions some aspects of this task.
Conference “Book and Script. Tradition and Modernity,” April, 8-9, 2022, Sofia
Scripta & e-Scripta vol. 22, 2022
Wed, 08/17/2022 - 08:32
Stefan Peev
Stefan Peev. Conference “Book and Script. Tradition and Modernity”, April 8–9, 2022, Sofia
The article provides a thorough review of the authors and papers presented at the scientific conference “Book and Script. Tradition and Modernity”. The conference, held on the 8th and 9th of April, 2022, was the first attempt of its kind at an interdisciplinary examination of the development of books and scripts from historical and theoretical perspectives. Twenty-four papers were presented by participants from the following institutions: Sofia University “St. Kliment Ohridski”, National Academy of Arts, Plovdiv University “Paisii Hilendarski”, Southwestern University “Neofit Rilski”, UniBit (University of Library Science and Information Technology), New Bulgarian University, Institute of Bulgarian Language at the Bulgarian Academy of Sciences, National Library “St. Cyril and Methodius”, Regional Library “P. Pavlovich” (Silistra), University of Zurich, Typeflow (Rijeka). The main topics of the conference were the origin of script among the Bulgarians; early Slavic printed books; scripts and manuscripts; the Revival book and its characteristics; books and fonts in modern times; and libraries, books, and modern approaches to describing them. There was a general consensus that the interdisciplinary approach opens up new fields and horizons for research on books and scripts.
Subject:
e-Scripta
Digital humanities
Keywords:
book
printed book
Revival book
manuscript
manuscript book
Cyrillic Script
Glagolitic script
script
font
print
typography
paper
illustrations