Achim Rabus

Prof. Dr. Achim Rabus is the current Head of the Department of Slavonic Studies at the University of Freiburg, Germany. Rabus defended his PhD thesis on the language of East Slavic spiritual songs in 2008 and his Habilitationsschrift on Slavic language contact in 2014. Since 2009, Rabus has been a member of the Special Commission on the Computer-Supported Processing of Mediæval Slavonic Manuscripts and Early Printed Books to the International Committee of Slavists, and since 2018 he has served as its President. His current research focuses on Slavic social dialectology, Handwritten Text Recognition, corpus linguistics, and (digital) historical linguistics.

Department of Slavic Linguistics, University of Freiburg, Germany

Ефективност на генерични модели HTR за историческа кирилица и глаголица: Сравнение на средства

Performance of Generic HTR Models on Historical Cyrillic and Glagolitic: Comparison of Engines


Foreword by the Guest Editors

Предговор от гост-редакторите

  • Summary/Abstract

    The publications in the e-Scripta section are selected, reviewed, and revised papers delivered at El’Manuscript 2021 (Freiburg/online) in April 2021 (www.elmanuscript2021.uni-freiburg.de). El’Manuscript is a series of biennial international conferences entitled “Textual Heritage and Information Technologies” that brings together linguists, specialists in historical source criticism, IT specialists, and others involved in publishing and studying our textual heritage. It is the official conference of the Special Commission on the Computer-Supported Processing of Mediæval Slavonic Manuscripts and Early Printed Books to the International Committee of Slavists. The 2021 iteration coincided with the meeting of the Humboldt research group linkage program DigiPalSlav (Slavic Department at Freiburg University and Institute for the Russian Language, Russian Academy of Sciences, Moscow), which is devoted to developing and applying digital tools for pre-modern Orthodox Slavic, such as neural taggers and Handwritten Text Recognition models. These issues are also reflected in the thematic focus of El’Manuscript 2021 and, consequently, in the topics of the papers submitted and accepted for publication. We would like to thank our external reviewers for their thorough work and for meeting our tight deadlines. Furthermore, we thank Elena Renje for her valuable support. Many thanks are due to the Humboldt Foundation for financing the publication of this volume. Finally, we are grateful to the editor of Scripta & e-Scripta, Anissava Miltenova, for her tireless work and support.


Neural Morphological Tagging for Slavic: Strengths and Weaknesses

Морфологично тагиране на стари славянски текстове с помощта на тагер, използващ невронни мрежи: предимства и недостатъци

  • Summary/Abstract

    The neural network tagger CLStM has been applied to the Old Russian Žitie Evfimija Velikogo (GIM, Chud. 20), a copy from the second half of the 14th century. The strengths of this tagger lie in its ability to automatically annotate an orthographically non-normalized text of dozens of pages within a few minutes, with high accuracy for part of speech and morphological features. Moreover, the tagger is capable of disambiguating case syncretism to a large extent, even in split constructions. Manually correcting the automatic tagging yields a correctly tagged text considerably faster than using a rule-based tagger or tagging completely manually. The weaknesses of the CLStM tagger include certain cases of incorrect POS tagging and, for some parts of speech, incomplete or incorrect attribution of morphological categories. Superscript letters and punctuation can pose special problems; normalizing punctuation improves tagging results. The proportion of correct tags is higher when the token has been seen during the training process; unknown (out-of-vocabulary, OOV) words show a higher error rate. In the paper, we analyze the strengths and weaknesses of the tagger by providing specific examples. Furthermore, we demonstrate how to use automatically tagged, uncorrected data for quantitative analysis.
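    The contrast between tokens seen in training and OOV tokens can be illustrated with a minimal Python sketch. It is not the evaluation code used in the paper: the function name `accuracy_by_vocabulary`, the tag labels, and the toy tokens are hypothetical, and the sketch assumes that "seen" simply means an exact string match with a training token.

```python
def accuracy_by_vocabulary(train_tokens, test_tokens, gold_tags, predicted_tags):
    """Return tagging accuracy separately for seen and OOV test tokens.

    Assumption: a test token counts as "seen" if the identical string
    occurs anywhere in the training tokens (no normalization applied).
    """
    seen_vocab = set(train_tokens)
    # bucket -> [number of correct tags, number of tokens]
    counts = {"seen": [0, 0], "oov": [0, 0]}
    for token, gold, pred in zip(test_tokens, gold_tags, predicted_tags):
        bucket = "seen" if token in seen_vocab else "oov"
        counts[bucket][1] += 1
        if gold == pred:
            counts[bucket][0] += 1
    # None marks a bucket that received no tokens at all.
    return {
        bucket: (correct / total if total else None)
        for bucket, (correct, total) in counts.items()
    }


# Hypothetical toy data, invented purely for illustration.
train = ["i", "reche", "k"]
test = ["i", "reche", "bog", "slovo"]
gold = ["CONJ", "VERB", "NOUN", "NOUN"]
pred = ["CONJ", "VERB", "NOUN", "VERB"]
result = accuracy_by_vocabulary(train, test, gold, pred)
```

    With real data, the two values give a quick indication of how much of the overall error rate stems from unknown words.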


Recognizing Handwritten Text in Slavic Manuscripts: a Neural-Network Approach Using Transkribus


New Developments in Tagging Pre-modern Orthodox Slavic Texts

  • Summary/Abstract

    Pre-modern Orthodox Slavic texts pose certain difficulties when it comes to part-of-speech and full morphological tagging. Orthographic and morphological heterogeneity makes it hard to apply resources that rely on normalized data, which is why previous attempts to train part-of-speech (POS) taggers for pre-modern Slavic often apply normalization routines. In the current paper, we further explore the normalization path; at the same time, we use the statistical CRF tagger MarMoT and a newly developed neural network tagger, both of which cope better with variation than previously applied rule-based or statistical taggers. Furthermore, we conduct transfer experiments to apply Modern Russian resources to pre-modern data. Our results show that, while the transfer experiments did not improve tagging performance significantly, state-of-the-art taggers reach between 90% and more than 95% tagging accuracy and thus approach the tagging accuracy of modern standard languages with rich morphology. Remarkably, these results are achieved without the need for normalization, which makes our research of practical relevance to the Paleoslavistic community.


Recycling the Metropolitan: Building an Electronic Corpus on the Basis of the Edition of the Velikie Minei Čet’i


Proposal for a unified encoding of Early Cyrillic glyphs in the Unicode Private Use Area

  • Summary/Abstract

    The paper proposes an encoding standard for early Cyrillic characters and glyphs that are still missing from the Universal Character Set (UCS) of the Unicode Standard and, for various reasons, will probably never be included, but are nevertheless used by the paleoslavistic community. This micro-standard is meant to expand, not to replace, the Unicode Standard and follows the path chosen by the Medieval Unicode Font Initiative (MUFI) a few years ago for the Latin script (see http://www.hit.uib.no/mufi/). Starting from the inventory of Old Cyrillic originally proposed at the conference held in Belgrade on 15–17 October 2007 (see BP), and taking into account the recommendations given by Birnbaum et al. 2008 and the MUFI consortium, the chosen set is limited to 178 units with a specific function (characters and composites, superscript characters, modifier characters, and punctuation marks), which are located in the Private Use Area (PUA). Their positions (code points) are coordinated with MUFI. We call this set PUA1. In the future, a second set, PUA2, will be proposed for a number of ligatures and paleographic variants that may not be coordinated with MUFI and are intended for special publications addressed to Slavistic readers. It is hoped that the proposed PUA encoding for Early Cyrillic Symbols, for which we choose the abbreviation CYFI, will establish itself as a sort of micro-standardization. Designers of scholarly fonts are encouraged to include these symbols according to this proposal (see code points in the appendix).
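    Because PUA code points carry no meaning outside an agreement such as this one, it can be useful to locate them in a transcription before exchanging or converting it. The following Python sketch only checks the Basic Multilingual Plane Private Use Area, which Unicode defines as U+E000 to U+F8FF. It does not reproduce any code point assignments from the proposal; the function name `find_pua_characters` and the sample strings are hypothetical.

```python
# Private Use Area in the Basic Multilingual Plane, as defined by Unicode.
PUA_START = 0xE000
PUA_END = 0xF8FF


def find_pua_characters(text):
    """Return (index, 'U+XXXX') pairs for every BMP PUA character in text."""
    hits = []
    for index, char in enumerate(text):
        if PUA_START <= ord(char) <= PUA_END:
            hits.append((index, f"U+{ord(char):04X}"))
    return hits


# Hypothetical sample: ordinary Cyrillic letters plus one PUA code point.
# U+E000 is used here only because it is the first PUA position; it is
# not claimed to be one of the code points assigned in the proposal.
sample = "\u0441\u043b\u043e\u0432\u043e\ue000"
pua_hits = find_pua_characters(sample)
```

    A font or conversion tool that follows the proposal would then need the actual code point table from the paper's appendix to interpret each hit.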

