Achim Rabus
Department of Slavic Linguistics, University of Freiburg, Germany

New Developments in Tagging Pre-modern Orthodox Slavic Texts

  • Summary/Abstract

    Pre-modern Orthodox Slavic texts pose certain difficulties when it comes to part-of-speech and full morphological tagging. Orthographic and morphological heterogeneity makes it hard to apply resources that rely on normalized data, which is why previous attempts to train part-of-speech (POS) taggers for pre-modern Slavic often apply normalization routines. In the current paper, we further explore the normalization path; at the same time, we use the statistical CRF-tagger MarMoT and a newly developed neural network tagger, both of which cope better with variation than previously applied rule-based or statistical taggers. Furthermore, we conduct transfer experiments to apply Modern Russian resources to pre-modern data. Our results show that while the transfer experiments did not significantly improve tagging performance, state-of-the-art taggers reach between 90% and more than 95% tagging accuracy and thus approach the tagging accuracy of modern standard languages with rich morphology. Remarkably, these results are achieved without the need for normalization, which makes our research of practical relevance to the Paleoslavistic community.

Recycling the Metropolitan: Building an Electronic Corpus on the Basis of the Edition of the Velikie Minei Čet’i

  • Summary/Abstract

    We describe the creation of the Velikie Minei Čet’i (VMČ) corpus, which supplements the latest volume of the printed edition of the Macarian Great Menaion Reader. Rather than being an independently compiled historical corpus, the VMČ corpus is based entirely on the paper edition, thus following the principle of multiple use (‘recycling’) of textual data and of the work invested in edition projects. We briefly describe the procedure for extracting the data from the edition text, discuss the search interface designed to facilitate sophisticated yet intuitive queries, and give examples of issues that can be researched much more easily with this resource than with the paper edition. We conclude that such a supplementary corpus is both feasible and useful, and we hope that in the future more editions will be accompanied by an electronic version.

Proposal for a unified encoding of Early Cyrillic glyphs in the Unicode Private Use Area

  • Summary/Abstract

    The paper proposes an encoding standard for early Cyrillic characters and glyphs that are still missing from the Universal Character Set (UCS) of the Unicode Standard and for various reasons will probably never be included, but are nevertheless used by the Paleoslavistic community. This micro-standard is meant to expand, not to replace, the Unicode Standard and follows the path chosen by the Medieval Unicode Font Initiative (MUFI) a few years ago for the Latin script. Starting from the inventory of Old Cyrillic originally proposed at the conference held in Belgrade on 15–17 October 2007 (see BP), and taking into account the recommendations given by Birnbaum et al. 2008 and the MUFI consortium, the chosen set is limited to 178 units with a specific function (characters and composites, superscript characters, modifier characters, and punctuation marks), which are located in the Private Use Area (PUA). Their positions (code points) are coordinated with MUFI. We call this set PUA1. In the future, a second set, PUA2, will be proposed for a number of ligatures and paleographic variants that may not be coordinated with MUFI and are intended for special publications addressed to Slavistic readers. It is hoped that the proposed PUA encoding for Early Cyrillic symbols, for which we choose the abbreviation CYFI, will establish itself as a sort of micro-standardization. Designers of scholarly fonts are encouraged to include these symbols according to this proposal (see code points in the appendix).
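
    Because PUA code points carry no standardized meaning, a text that uses them displays correctly only with a font that follows the same convention. The following minimal Python sketch shows one way to locate private-use characters in a transcription before choosing a font. The function name `find_private_use` and the sample code point U+E000 are illustrative assumptions; they are not taken from the proposal or its appendix. The only facts relied on are that the Unicode Standard reserves U+E000 to U+F8FF as the Private Use Area of the Basic Multilingual Plane and assigns private-use characters the general category `Co`.

```python
import unicodedata

# Private Use Area of the Basic Multilingual Plane, as reserved by Unicode.
BMP_PUA_START = 0xE000
BMP_PUA_END = 0xF8FF


def is_bmp_private_use(ch: str) -> bool:
    """Return True if the single character ch lies in the BMP Private Use Area."""
    return BMP_PUA_START <= ord(ch) <= BMP_PUA_END


def find_private_use(text: str) -> list:
    """Return (index, 'U+XXXX') pairs for every private-use character in text.

    The general category 'Co' covers the BMP Private Use Area as well as the
    supplementary private-use planes.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


# Hypothetical sample: three standard Cyrillic letters followed by one
# private-use character (U+E000 is an arbitrary choice, not a CYFI position).
sample = "азъ\uE000"
hits = find_private_use(sample)
```

    Here `hits` lists the position and code point of each private-use character; whether a given code point belongs to PUA1 can only be decided against the table in the proposal's appendix.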

