Achim Rabus | Scripta & e-Scripta

Prof. Dr. Achim Rabus is the current Head of the Department of Slavonic Studies at the University of Freiburg, Germany. Rabus defended his PhD thesis on the language of East Slavic spiritual songs in 2008 and his Habilitationsschrift on Slavic language contact in 2014. Since 2009, Rabus has been a member of the Special Commission on the Computer-Supported Processing of Mediæval Slavonic Manuscripts and Early Printed Books to the International Committee of Slavists, and since 2018, the President of the Commission. His current research focuses on Slavic social dialectology, Handwritten Text Recognition, corpus and (digital) historical linguistics.

Rabus, Achim

achim.rabus@slavistik.uni-freiburg.de

Department of Slavic Linguistics, University of Freiburg, Germany

Germany

Scripta & e-Scripta vol. 23, 2023

Achim Rabus Walker R. Thompson Performance of Generic HTR Models on Historical Cyrillic and Glagolitic: Comparison of Engines

Ефективност на генерични модели HTR за историческа кирилица и глаголица: Сравнение на средства

The present study offers a comparative evaluation of the performance of different AI-based digital tools for handwritten text recognition (HTR) on historical manuscripts and prints. The focus is on generic models capable of transcribing a range of texts in a similar script. The training dataset for these comprises Old Cyrillic ustav and poluustav manuscripts, on the one hand, and early Glagolitic printed books, on the other. We give an overview of the performance statistics for the HTR platforms Transkribus and eScriptorium as well as for the command-line tool Calamari. In each case, we additionally offer a close, qualitative analysis of select examples in order to convey a sense of the models’ real-world performance. In this way, our study supplies comparative data on the respective capabilities of these technologies that ought to be of interest to scholars working with them in digital humanities projects.

Subject: Language studies; Language and Literature Studies; Theoretical Linguistics; Applied Linguistics; Historical Linguistics; Computational linguistics; South Slavic Languages; Philology; Translation Studies

Keywords: handwritten text recognition; TRANSKRIBUS; MACHINE LEARNING; Cyrillic palaeography; Glagolitic printings

Scripta & e-Scripta vol. 21, 2021

Achim Rabus Victor Baranov Alexandr Moldovan Foreword by the Guest Editors

Предговор от гост-редакторите

Summary/Abstract

The publications in the e-Scripta section are selected, reviewed and revised papers delivered at El’Manuscript 2021 (Freiburg/online) in April 2021 (www.elmanuscript2021.uni-freiburg.de). El’Manuscript is a series of biennial international conferences entitled “Textual Heritage and Information Technologies” that brings together linguists, specialists in historical source criticism, IT specialists, and others involved in publishing and studying our textual heritage. It is the official conference of the Special Commission on the Computer-Supported Processing of Mediæval Slavonic Manuscripts and Early Printed Books to the International Committee of Slavists. In the 2021 iteration, it coincided with the meeting of the Humboldt research group linkage program DigiPalSlav (Slavic Department at Freiburg University and Institute for the Russian Language, Russian Academy of Sciences, Moscow) devoted to developing and applying digital tools for pre-modern Orthodox Slavic such as neural taggers and Handwritten Text Recognition models. These issues are also reflected in the thematic focus of the 2021 iteration of El’Manuscript and, consequently, in the topics of the papers submitted and accepted for publication. We would like to thank our external reviewers for their thorough work and for meeting our tight deadlines. Furthermore, we thank Elena Renje for her valuable support. Many thanks are due to the Humboldt Foundation for financing the publication of this volume. Finally, we are grateful to the editor of Scripta & e-Scripta, Anissava Miltenova, for her tireless work and support.

Subject: Digital humanities

Scripta & e-Scripta vol. 21, 2021

Achim Rabus Juliane Besters-Dilger Neural Morphological Tagging for Slavic: Strengths and Weaknesses

Морфологично тагиране на стари славянски текстове с помощта на тагер, използващ невронни мрежи: предимства и недостатъци

Summary/Abstract

The neural network tagger CLStM has been applied to the Old Russian Žitie Evfimija Velikogo (GIM, Chud. 20), a copy from the second half of the 14th century. The strengths of this tagger lie in its ability to automatically annotate an orthographically non-normalized text of dozens of pages within a few minutes, yielding high accuracy with respect to part of speech and morphological features. Moreover, the tagger is capable of disambiguating case syncretism to a large extent, even in split constructions. Manual correction of the automatic tagging will result in a correctly tagged text considerably faster than when using a rule-based tagger or tagging completely manually. The weaknesses of the CLStM tagger include certain instances of incorrect POS tagging and sometimes incomplete or incorrect attribution of morphological categories to some parts of speech. Superscript letters and punctuation can pose special problems; normalization of punctuation will achieve better tagging results. The proportion of correct tags is higher when the token has been seen during the training process; unknown (out-of-vocabulary, OOV) words show a higher error rate. In the paper, we analyze the strengths and weaknesses of the tagger by providing specific examples. Furthermore, we demonstrate how to use automatically tagged, uncorrected data for quantitative analysis.

Subject: Digital humanities

Keywords: Neural network tagger; POS and full morphology tagging; context sensitivity; punctuation; quantitative analysis

Scripta & e-Scripta vol. 19, 2019

Achim Rabus Recognizing Handwritten Text in Slavic Manuscripts: a Neural-Network Approach Using Transkribus

Summary/Abstract

The paper discusses the automatic text recognition capabilities of neural network models specifically trained to recognize different styles of Church Slavonic handwriting within the software platform Transkribus. Computed character error rates of the models are in the range of 3 to 5 percent; real-life performance shows that specifically trained models, by and large, recognize simple (non-superscript) characters correctly most of the time. The error rate is higher with superscript letters, abbreviations, and word separation. Combined models consisting of training data from different sources are capable of transcribing different styles of Slavic handwriting with low error rates. Automatic text recognition using Transkribus and the models presented in this paper can help improve the efficiency of the process of digitizing Church Slavonic manuscripts and thus boost the number of digitized sources available in the future.

Subject: Language and Literature Studies; Language studies; Studies of Literature; Philology; Theory of Literature; Foreign languages learning; Applied Linguistics; Computational linguistics; Translation Studies

Keywords: CHURCH SLAVONIC; TRANSKRIBUS; AUTOMATIC TRANSCRIPTION; MACHINE LEARNING; NEURAL NETWORKS; ARTIFICIAL INTELLIGENCE

Scripta & e-Scripta vol. 18, 2018

Achim Rabus Susanne Mocken Yves Scherrer New Developments in Tagging Pre-modern Orthodox Slavic Texts

Pre-modern Orthodox Slavic texts pose certain difficulties when it comes to part-of-speech and full morphological tagging. Orthographic and morphological heterogeneity makes it hard to apply resources that rely on normalized data, which is why previous attempts to train part-of-speech (POS) taggers for pre-modern Slavic often apply normalization routines. In the current paper, we further explore the normalization path; at the same time, we use the statistical CRF-tagger MarMoT and a newly developed neural network tagger that cope better with variation than previously applied rule-based or statistical taggers. Furthermore, we conduct transfer experiments to apply Modern Russian resources to pre-modern data. Our experiments show that while transfer experiments could not improve tagging performance significantly, state-of-the-art taggers reach between 90% and more than 95% tagging accuracy and thus approach the tagging accuracy of modern standard languages with rich morphology. Remarkably, these results are achieved without the need for normalization, which makes our research of practical relevance to the Paleoslavistic community.

Subject: Church Slavonic; Natural language processing; Part of speech tagging; Old Russian; Neural networks; Language studies; Language and Literature Studies; Theoretical Linguistics; Studies of Literature; Eastern Slavic Languages; Philology; Theory of Literature

Scripta & e-Scripta vol. 14-15, 2015

Achim Rabus Ruprecht von Waldenfels Recycling the Metropolitan: Building an Electronic Corpus on the Basis of the Edition of the Velikie Minei Čet’i

Summary/Abstract

We describe the creation of the Velikie Minei Čet’i (VMČ) Corpus supplementing the latest volume of the printed edition of the Macarian Great Menaion Reader. Instead of an independently compiled historical corpus, the VMČ corpus is entirely based on the paper edition, thus following the principle of multiple use (‘recycling’) of textual data and the work invested in edition projects. We briefly describe the procedure of extraction from the edition text, dwell on the search interface designed to facilitate sophisticated yet intuitive queries, and give examples of issues that can be much more easily researched with this resource than with the paper edition. We conclude that such a supplementary corpus is both feasible and useful and hope that in the future, more editions will be accompanied by an electronic version.

Subject: Language studies; Language and Literature Studies; Library and Information Science; Electronic information storage and retrieval; Theoretical Linguistics; Historical Linguistics; Comparative Linguistics; Philology

Scripta & e-Scripta vol. 8-9, 2010

Ralf Cleminson Victor Baranov Achim Rabus David Birnbaum Heinz Miklas Proposal for a unified encoding of Early Cyrillic glyphs in the Unicode Private Use Area

The paper proposes an encoding standard for early Cyrillic characters and glyphs that are still missing from the Universal Character Set (UCS) of the Unicode Standard and, for various reasons, will probably never be included, but are nevertheless used by the paleoslavistic community. This micro-standard is meant to expand, not to replace, the Unicode Standard and follows the path chosen by the Medieval Unicode Font Initiative (MUFI) a few years ago for the Latin script (see http://www.hit.uib.no/mufi/). Starting from the inventory of Old Cyrillic originally proposed at the conference held in Belgrade on 15–17 October 2007 (see BP), and taking into account the recommendations given by Birnbaum et al. 2008 and the MUFI consortium, the chosen set is limited to 178 units with a specific function (characters and composites, superscript characters, modifier characters, and punctuation marks), which are located in the Private Use Area (PUA). Their positions (code points) are coordinated with MUFI. We call this set PUA1. In the future, a second set, PUA2, will be proposed for a number of ligatures and paleographic variants that may not be coordinated with MUFI and are intended for special publications addressed to Slavistic readers. It is hoped that the proposed PUA encoding for Early Cyrillic Symbols, for which we choose the abbreviation CYFI, will establish itself as a sort of micro-standardization. Designers of scholarly fonts are encouraged to include these symbols according to this proposal (see code points in the appendix).

Subject: Language and Literature Studies; Early Cyrillic glyphs; Unicode 5.1; Private Use Area
