Neural Morphological Tagging for Slavic: Strengths and Weaknesses

Морфологично тагиране на стари славянски текстове с помощта на тагер, използващ невронни мрежи: предимства и недостатъци

  • Author(s): Achim Rabus, Juliane Besters-Dilger
  • Subject(s): Digital humanities
  • Published by: Institute for Literature BAS
  • Print ISSN: 1312-238X
  • Summary/Abstract:

    The neural network tagger CLStM has been applied to the Old Russian Žitie Evfimija Velikogo (GIM, Chud. 20), a copy from the second half of the 14th century. The strengths of this tagger lie in its ability to automatically annotate an orthographically non-normalized text of dozens of pages within a few minutes, with high accuracy for part of speech and morphological features. Moreover, the tagger is capable of disambiguating case syncretism to a large extent, even in split constructions. Manually correcting the automatic tagging yields a correctly tagged text considerably faster than using a rule-based tagger or tagging completely manually. The weaknesses of the CLStM tagger include certain cases of incorrect POS tagging and the sometimes incomplete or incorrect attribution of morphological categories to some parts of speech. Superscript letters and punctuation can pose special problems; normalizing punctuation achieves better tagging results. The proportion of correct tags is higher when the token has been seen during training; unknown (out-of-vocabulary, OOV) words show a higher error rate. In the paper, we analyze the strengths and weaknesses of the tagger by providing specific examples. Furthermore, we demonstrate how to use automatically tagged, uncorrected data for quantitative analysis.

  • Page Range: 79-92
    No. of Pages: 14
    Language: English
    Year: 2021
    Issue No.: Scripta & e-Scripta vol. 21, 2021

  • Achim Rabus

    Department of Slavic Linguistics, University of Freiburg, Germany

    Prof. Dr. Achim Rabus is the current Head of the Department of Slavonic Studies at the University of Freiburg, Germany. Rabus defended his PhD thesis on the language of East Slavic spiritual songs in 2008 and his Habilitationsschrift on Slavic language contact in 2014. Since 2009, Rabus has been a member of the Special Commission on the Computer-Supported Processing of Mediæval Slavonic Manuscripts and Early Printed Books to the International Committee of Slavists, and since 2018 he has been the President of the Commission. His current research focuses on Slavic social dialectology, Handwritten Text Recognition, and corpus and (digital) historical linguistics.

    Juliane Besters-Dilger

    Slavonic Studies at the University of Freiburg

    Prof. Dr. Juliane Besters-Dilger is the former Head of the Department of Slavonic Studies at the University of Freiburg, Germany. Her research interests include, among others, editing Old/Middle Russian and Church Slavonic texts and glossaries, e.g. the “Commented Acts of the Apostles” (text, commentary, index of wordforms), extracted from the Great Menaion Reader of Macarius, Metropolitan of Moscow.
