Using Handwritten Text Recognition on bilingual Evenki-Russian manuscripts of Konstantin Rychkov

Using Handwritten Text Recognition (HTR) tools on bilingual Evenki-Russian manuscripts from the collection of Konstantin Rychkov

  • Author(s):
  • Subject(s): Manuscript // Digital humanities //
  • Published by: Institute for Literature BAS
  • Print ISSN: 1312-238X
  • Summary/Abstract:

    We report on applying Handwritten Text Recognition (HTR) to manuscripts from the archive of Konstantin Rychkov, preserved at IOM RAS, St. Petersburg, within the INEL project. The folklore texts in Evenki (Tungusic) were collected in Western Siberia in the 1910s. We used services provided by the Transkribus platform. The necessary step of Layout Analysis proved time-consuming because the parallel Evenki-Russian text is arranged on the page without a strict separation line. HTR models were trained successively on increasing amounts of data, up to 521 pages. The best Character Error Rate (CER) attained on validation data for the largest dataset is 4.50% for models trained on all characters. The distribution of errors is non-uniform: most errors stem from just a few problematic issues, especially diacritics such as the accent marking stress. This mark is written high above the line and is frequently cut off from the line images at the preprocessing stage. After excluding the stress mark from training data and recognition, the lowest CER dropped to 2.90%. We compared two recognition engines, HTR+ and PyLaia. The HTR+ model trained without stress marks made fewer errors in letters, while PyLaia performed better with respect to diacritics.
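
The Character Error Rate mentioned in the abstract is commonly defined as the character-level edit distance between a reference transcription and the recognized text, divided by the length of the reference. The sketch below illustrates that common definition and the effect of removing stress marks before scoring. It is not the Transkribus implementation: the function names, the sample strings, and the assumption that the stress mark is encoded as the combining acute accent (U+0301) are all hypothetical.

```python
import unicodedata

# Assumption: the stress mark is encoded as a combining acute accent.
COMBINING_ACUTE = "\u0301"


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Edit distance divided by the number of reference characters."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def strip_stress(text: str) -> str:
    """Remove combining acute accents, leaving other diacritics in place."""
    decomposed = unicodedata.normalize("NFD", text)
    without_stress = decomposed.replace(COMBINING_ACUTE, "")
    return unicodedata.normalize("NFC", without_stress)


# Hypothetical reference line with a stress mark, and a recognition
# result in which the mark was lost.
reference = "ама\u0301ка"
hypothesis = "амака"

cer_with_stress = cer(reference, hypothesis)
cer_without_stress = cer(strip_stress(reference), strip_stress(hypothesis))
```

In this toy case the only error is the missing stress mark, so scoring after `strip_stress` gives a CER of zero. This mirrors, in miniature, why excluding the stress mark from training and recognition can lower the reported error rate.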


  • Page Range: 233-244
    No. of Pages: 12
    Language: English
    Year: 2021
    Issue No: Scripta & e-Scripta vol. 21, 2021

  • Alexandre Arkhipov

    Germany
    Universität Hamburg
    Description

    Alexandre Arkhipov is a Research Fellow at the Institute for Finno-Ugric/Uralic Studies, Universität Hamburg, and the research coordinator of the INEL project. He is also the head of the Department of linguistic and cultural ecology at the Institute of the World Culture, Lomonosov Moscow State University. His research interests include linguistic typology, linguistic fieldwork and language documentation, and acoustic phonetics. He has studied languages of the Caucasus, the Volga region, Siberia and the Far East, as well as Basque.

    Anna Barinskaya

    Germany
    Universität Hamburg
    Description

    Anna Barinskaya is a Bachelor's student at Universität Hamburg.

    Roman Shtefura

    Germany
    Universität Hamburg
    Description
    Roman Shtefura was a Master's student at Universität Hamburg.