Digital humanities

Serbian Early Printed Books from Venice: Creating Models for Automatic Text Recognition Using Transkribus

Владимир Р. Поломац. Сръбски старопечатни книги от Венеция: създаване на модели за автоматично текстово разпознаване чрез Transkribus

  • Summary/Abstract

    The paper describes the process of creating a model for the automatic recognition of Serbian Church Slavonic printed books from Venice (from Božidar and Vincenzo Vuković’s printery) using the Transkribus software platform, which is based on the principles of artificial intelligence and machine learning. Using the example of the Prayer Book (Euchologion) (1538–1540) from Božidar Vuković’s printery, it shows that a successful model for the automatic recognition of individual books (with around 5% of unrecognized characters) can be trained on as little as approximately 4000 words, and that increasing the amount of training material (in our case to around 38000 words) improves the model and reduces the error rate (to between 1% and 2% of unrecognized characters). The most notable result of the paper is the creation of a generic model for the automatic text recognition of Serbian Church Slavonic books from Božidar and Vincenzo Vuković’s printery. The initial version of the generic model (called Dionisio 1.0 after Božidar Vuković’s Italian pseudonym, Dionisio della Vecchia) is the first resource for the automatic recognition of the Serbian medieval Cyrillic script publicly available to all users of the Transkribus software platform (see
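The share of unrecognized characters quoted in the abstract above is commonly reported as a character error rate (CER). The following is a minimal sketch of one usual definition, edit distance divided by the length of the reference transcription. It is an assumption for illustration: the abstract does not say how Transkribus computes its figure, and the function names are hypothetical. Note that Python counts code points, so a letter with a combining mark counts as two characters.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions
    needed to turn the reference string into the hypothesis string."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed CER definition)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, one wrong character in a 20-character reference line gives a CER of 0.05, which corresponds to the 5% level mentioned in the abstract.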

The Bamberg Cyrillic Alphabet – a Colour Facsimile

Себастиан Кемпген. Кирилската азбука от Бамберг – цветно факсимиле

  • Summary/Abstract

    The so-called Bamberg Cyrillic Alphabet (ca. 13th c.) is one of the oldest and most reliable xenographic Slavic alphabets, i.e. a Cyrillic alphabet added to a Latin manuscript of unrelated content. It has been published and edited before in black-and-white, and it is presented here for the first time in a high-quality colour photograph, accompanied by a slightly revised tabular representation.

Parkosz’s Treatise from a Typographic Point of View

Януш С. Биен. Трактатът на Паркош от типографска гледна точка

  • Summary/Abstract

    The 15th-century Latin manuscript containing a treatise by Parkosz was the very first proposal for Polish spelling. To account for all the phonemes of Polish, some new letters were proposed that are not available in present-day fonts. This makes it difficult to quote the proposal when discussing the history of Polish spelling. The paper describes the transliteration proposed by the author, which uses the characters available in the Unicode standard. The ultimate solution is of course to create a specialized font, and the paper mentions some aspects of this task.

Conference “Book and Script. Tradition and Modernity,” April 8–9, 2022, Sofia

Стефан Пеев. Конференция „Книга и шрифт. Традиция и съвременност”, 8-9 април 2022 г., София

  • Summary/Abstract

    The article provides a thorough review of the authors and papers presented at the scientific conference “Book and Script. Tradition and Modernity”. The conference, held on the 8th and 9th of April, 2022, is the first attempt of its kind to examine the development of books and scripts from historical and theoretical perspectives using an interdisciplinary approach. Twenty-four papers were presented by participants from the following institutions: Sofia University “St. Kliment Ohridski”, National Academy of Arts, Plovdiv University “Paisii Hilendarski”, Southwestern University “Neofit Rilski”, UniBit (University of Library Science and Information Technology), New Bulgarian University, Institute of Bulgarian Language at the Bulgarian Academy of Sciences, National Library “St. Cyril and Methodius”, Regional Library “P. Pavlovich” (Silistra), University of Zurich, Typeflow (Rijeka). The main topics of the conference were the origin of script among the Bulgarians; early Slavic printed books; scripts and manuscripts; the Revival book and its characteristics; books and fonts in modern times; and libraries, books and modern approaches to describing them. There was a general consensus that the interdisciplinary approach opens up new horizons for research on books and scripts.

Wege zur verbesserten automatischen Annotation des mittelbulgarischen Kirchenslawischen

Фабио Майо. Начини за подобряване на автоматичните анотации на среднобългарски църковнославянски текстове

  • Summary/Abstract

    The last decade has brought an upswing in research on natural language processing. However, it is well known that historical language stages are largely underrepresented. Middle Bulgarian Church Slavonic, a language variety with significant literary productivity, is a prime example. The current paper shows how annotated texts of related language varieties can be used to annotate texts written in Middle Bulgarian Church Slavonic, such as the 14th-century translation of the Dioptra. In particular, I present a way of adapting the available training data and of reducing the differences between training and test data, thereby improving the result of the automatic morphological annotation. Moreover, it is demonstrated that a comparison with the original work, written in Byzantine Greek, can further improve the results of the annotation by carefully disambiguating homonymous word forms. The presented results can benefit research on Middle Bulgarian Church Slavonic, as they show how texts in this variety can be annotated without authentic training data. The proposed method may be of use beyond Slavonic Studies, however: using training data from genetically related language varieties in combination with translations may serve to annotate other underrepresented language varieties as well.

Foreword by the Guest Editors

Предговор от гост-редакторите

  • Summary/Abstract

    The publications in the e-Scripta section are selected, reviewed and revised papers delivered at El’Manuscript 2021 (Freiburg/online) in April, 2021 (www. El’Manuscript is a series of biennial international conferences entitled “Textual Heritage and Information Technologies” that brings together linguists, specialists in historical source criticism, IT specialists, and others involved in publishing and studying our textual heritage. It is the official conference of the Special Commission on the Computer-Supported Processing of Mediæval Slavonic Manuscripts and Early Printed Books to the International Committee of Slavists. In the 2021 iteration, it coincided with the meeting of the Humboldt research group linkage program DigiPalSlav (Slavic Department at Freiburg University and Institute for the Russian Language, Russian Academy of Sciences, Moscow) devoted to developing and applying digital tools for pre-modern Orthodox Slavic such as neural taggers and Handwritten Text Recognition models. These issues are also reflected in the thematic focus of the 2021 iteration of El’Manuscript and, consequently, in the topics of the papers submitted and accepted for publication. We would like to thank our external reviewers for their thorough work and for meeting our tight deadlines. Furthermore, we thank Elena Renje for her valuable support. Many thanks are due to the Humboldt Foundation for financing the publication of this volume. Finally, we are grateful to the editor of Scripta & e-Scripta, Anissava Miltenova, for her tireless work and support.

From Annotation to Modeling: Computational Horizons for Medieval Slavic Studies.

От анотиране към моделиране: компютърни хоризонти за славистичната медиевистика

  • Summary/Abstract

    This paper is a write-up of a keynote from El’Manuscript 2021, reflecting on the ways in which the field of computationally supported medieval Slavic studies has and has not changed since the mid-2000s. Looking towards developments in the broader fields of digital humanities and natural-language processing, it explores how recent improvements in the tools at our disposal for mass digitization of manuscripts and text analysis at scale open up possibilities for working with manuscripts that have received very little attention. For these advancements to be feasible, however, scholars will need to prepare and share their digitized texts and annotations in ways that are not currently the norm, though a number of projects provide exemplary models of how these new conventions could be put into practice.

Interdisciplinary Analyses of the Codex Marianus, Vienna Part (Cod. Vind. slav. 146).

Интердисциплинарни анализи на Мариинското евангелие, виенската част (Cod. Vind. slav. 146)

  • Summary/Abstract

    The article presents some results of analyses of the Vienna part of the Codex Marianus (ÖNB, Vind. slav. 146), undertaken by an interdisciplinary group of scholars and scientists from the Centre of Image and Material Analysis in Cultural Heritage (CIMA ‒ within two Austrian Science Fund projects devoted to the ancient Glagolitic heritage. The investigation consisted of four parts: codicological, multispectral, chemical and philological. The codicological survey served to gather as much information as possible about the writing material (source of parchment, methods of preparation, writing process, deletions, condition), while color and multispectral recordings were made to record the manuscript at its best and to provide a suitable basis for further investigations. The chemical analysis was carried out with two portable spectroscopes (XRF and rFTIR) and aimed to obtain exact information on the parchment, inks, paints and binders, and to collect data for a comparative study of parchment degradation. The philologists compared the fragment with all other preserved Old Church Slavonic Glagolitic manuscripts to obtain as much information as possible about their scribes.

Towards Fundamental Principles for Creating Electronic Corpus of Serbian Medieval Charters and Letters
Scripta & e-Scripta vol. 21, 2021
За основните принципи за създаване на електронен корпус от сръбски средновековни грамоти и послания

The paper defines the elementary principles for creating an electronic corpus of Serbian medieval charters and letters. The commitment to the principle of maximum representativeness of the corpus, determined entirely by the preserved written legacy (based on manuscripts, microfilms or photographs), makes it unnecessary to apply the principle of balance, while simultaneously satisfying the principle of reliability, since charters and letters known solely from editions are not included in the corpus. The texts are selected according to the diplomatic criterion, which excludes transcripts and copies of documents already available in the original, as well as later transcripts that are chronologically and linguistically distant from the assumed original. This approach to the selection of texts is justified by the size of the corpus, as well as by the exceptional cultural and historical significance of medieval charters and letters. The metadata about the corpus texts are defined by their general diplomatic properties, as well as by the need to search the corpus for diatopic, diachronic and genre variation. The conversion of texts into electronic form strives for fidelity to the original: abbreviations, superscript letters and original punctuation are preserved, while accent marks and contemporary rules of capitalization are not applied.

Subject: Digital humanities
Keywords: Historical Corpus Linguistics; Old Serbian language; Serbian Church Slavonic; Serbian Medieval Charters and Letters; 12th–16th century
On Sentence Segmentation in Diachronic Texts
Scripta & e-Scripta vol. 21, 2021
Върху сегментирането на изреченията в диахронните текстове

The article proposes a minimal set of criteria for the sentence segmentation of medieval texts, an obligatory stage in corpus processing and annotation, especially with respect to syntactic annotation. In the context of a review of different definitions of a sentence (unit) and approaches to sentence segmentation, structural, thematic and graphic criteria are discussed in order to define the minimal set. The discussion of the different factors is illustrated by sample sentences from two texts, from the 14th and 17th c. The proposed criteria aim to consider mainly structural characteristics while trying to avoid textual and semantic interpretation, though these can also present challenges, because the interpretation of the (syntactic) structure is inevitably related to the interpretation of the (semantic) content.

Subject: Digital humanities
Keywords: corpus annotation; sentence; sentence segmentation

Content structuring in St Petersburg Corpus of Hagiographic Texts (SCAT)

Структуриране на съдържанието на Санкт-Петербургския корпус от агиографски текстове (СКАТ)

  • Summary/Abstract

    The St Petersburg Corpus of Hagiographic Texts (SCAT) has launched two new mark-up formats. The first innovation is a comprehensive format developed for the division of hagiographic texts into parts, which are both explicitly marked as section headings and extrapolated through comparison with texts of a similar genre. The second innovation is an elaborate format representing the full range of types of biblical, patristic and liturgical quotations occurring in the lives of saints. For the time being, three morphologically annotated manuscript texts have been marked up according to these guidelines, and we are planning to add two more texts in the near future. Close cooperation with the IHRIM research laboratory (Lyon) and wide use of their techniques and technology make it possible to obtain some illuminating cross-format statistical data and thus offer new insights into the canons and rules of Old Russian hagiography.

Neural Morphological Tagging for Slavic: Strengths and Weaknesses

Морфологично тагиране на стари славянски текстове с помощта на тагер, използващ невронни мрежи: предимства и недостатъци

  • Summary/Abstract

    The neural network tagger CLStM has been applied to the Old Russian Žitie Evfimija Velikogo (GIM, Chud. 20), a copy from the second half of the 14th century. The strengths of this tagger lie in its ability to automatically annotate an orthographically non-normalized text of dozens of pages within a few minutes, yielding high accuracy with respect to part of speech and morphological features. Moreover, the tagger is capable of disambiguating case syncretism to a large extent, even in split constructions. Manual correction of the automatic tagging produces a correctly tagged text considerably faster than using a rule-based tagger or tagging completely manually. The weaknesses of the CLStM tagger comprise certain examples of incorrect POS tagging and the sometimes incomplete or incorrect attribution of morphological categories to some parts of speech. Superscript letters and punctuation can pose special problems; normalizing the punctuation yields better tagging results. The proportion of correct tags is higher when the token has been seen during the training process; unknown (out-of-vocabulary, OOV) words show a higher error rate. In the paper, we analyze the strengths and weaknesses of the tagger by providing specific examples. Furthermore, we demonstrate how to use automatically tagged, uncorrected data for quantitative analysis.
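The contrast between tokens seen in training and out-of-vocabulary tokens can be measured by splitting tagging accuracy into two buckets. The sketch below is an illustration only: the function name and the toy data are hypothetical and do not come from the paper, and exact string identity of word forms is assumed as the criterion for "seen".

```python
from typing import Dict, Iterable, Optional, Sequence


def accuracy_by_vocabulary(
    training_tokens: Iterable[str],
    test_tokens: Sequence[str],
    gold_tags: Sequence[str],
    predicted_tags: Sequence[str],
) -> Dict[str, Optional[float]]:
    """Return tagging accuracy separately for seen and OOV test tokens.

    A bucket with no tokens is reported as None rather than 0.0.
    """
    if not (len(test_tokens) == len(gold_tags) == len(predicted_tags)):
        raise ValueError("tokens, gold tags and predicted tags must align")
    seen = set(training_tokens)
    counts = {"seen": [0, 0], "oov": [0, 0]}  # [correct, total] per bucket
    for token, gold, pred in zip(test_tokens, gold_tags, predicted_tags):
        bucket = "seen" if token in seen else "oov"
        counts[bucket][1] += 1
        if gold == pred:
            counts[bucket][0] += 1
    return {
        name: (correct / total if total else None)
        for name, (correct, total) in counts.items()
    }
```

Because the text is orthographically non-normalized, a spelling variant of a known word counts as OOV under this criterion; a lemma-based or normalized comparison would shift tokens between the two buckets.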

Integration of the Old East Slavic Epigraphical Databases, Corpora and Indices

Интегриране на бази данни, корпуси и индекси на стара източнославянска епиграфика

  • Summary/Abstract

    The paper presents results, including work in progress, related to two databases of “non-bookish” (vernacular) Old East Slavic writing, viz. the databases of birchbark letters and epigraphy. The aim of the project is to interlink visual, archeological/historical and linguistic information. The epigraphical database represents different interpretations of a single inscription, providing an outline of the versions proposed in the existing literature. These sources, an archeographical database and a linguistic corpus forming part of the larger Russian National Corpus, are intended to be easily synchronized, expanded and updated. An online workstation for the morphological annotation of texts is part of this project. An important function of this platform is to create an index to the corpus that can be used in the linguistic description of the dialect, and to verify the index and the data of the book Old Novgorod Dialect. Addenda by Andrei Zaliznjak, which is being prepared for posthumous publication. New linguistic discoveries have been made during the implementation of the project.

Eliminating variation of linguistic units of the Slavonic historical corpus to facilitate search, demonstration and statistical analysis
Scripta & e-Scripta vol. 21, 2021
Елиминиране на вариативността на лингвистичните единици в славянския исторически корпус с цел улесняване на търсенето, визуализирането и статистическия анализ

The work demonstrates methods and techniques for eliminating the variation of linguistic units in the transcriptions of the medieval Slavonic manuscripts of the historical corpus “Manuscript” ( The textual corpus, whose material consists of machine-readable copies that resemble the originals as closely as possible, provides the user with tools for transforming (modifying) linguistic units, which make it possible to formulate queries and retrieve results suited to the task at hand. For an inexact search, the user can delete titlos and diacritics, reduce letter variants to their basic form, specify a mask for the linguistic units being searched for as a regular expression, and use the letters of the contemporary Cyrillic alphabet. To allow the statistical modules of the corpus to operate on lemmas, each textual form must be automatically assigned to exactly one lemma. Because of grammatical homonymy, incorrect lemmatization would result in a situation where quantitative data based on word forms and data based on lemmas do not match each other. In order to assign word forms to the correct lemma, we apply a rule-based approach that takes into account the formal and quantitative characteristics of the linguistic units (such as their morphological variability or invariability, their frequency in the sub-corpus, whether or not they match the lemma form, the frequency of relationships between the textual forms and the dictionary paradigms of inflected words, and the results of manual disambiguation of homonymy). The reduction of textual forms to unified, normalized, transliterated or initial forms is a necessary step in extracting data from the historical corpus for the distributional-statistical analysis of the semantics of linguistic units.
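The inexact-search operations listed in the abstract above (deleting titlos and diacritics, reducing letter variants to a basic form, matching a regular-expression mask) can be sketched as follows. This is not the implementation of the “Manuscript” corpus: the set of marks, the variant table and the function names are hypothetical and deliberately tiny, and mapping yat to е is a simplification chosen only for the example.

```python
import re
from typing import Iterable, List

# Combining marks dropped for inexact search (assumed selection).
DROP_MARKS = {
    "\u0300",  # combining grave accent
    "\u0301",  # combining acute accent
    "\u0311",  # combining inverted breve
    "\u0483",  # combining Cyrillic titlo
    "\u0484",  # combining Cyrillic palatalization
    "\u0485",  # combining Cyrillic dasia pneumata
    "\u0486",  # combining Cyrillic psili pneumata
    "\u0487",  # combining Cyrillic pokrytie
}

# Letter variants reduced to a modern basic letter (hypothetical table).
LETTER_VARIANTS = {
    "\u0461": "\u043e",  # omega -> o
    "\ua64b": "\u0443",  # monograph uk -> u
    "\u0455": "\u0437",  # dze -> z
    "\u0456": "\u0438",  # decimal i -> i
    "\u0463": "\u0435",  # yat -> e (simplification)
}


def normalize(form: str) -> str:
    """Lowercase, drop the listed marks, and reduce letter variants."""
    kept = (ch for ch in form.lower() if ch not in DROP_MARKS)
    return "".join(LETTER_VARIANTS.get(ch, ch) for ch in kept)


def search_forms(mask: str, forms: Iterable[str]) -> List[str]:
    """Return the original forms whose normalized spelling fully matches
    the regular-expression mask."""
    pattern = re.compile(mask)
    return [form for form in forms if pattern.fullmatch(normalize(form))]
```

Only an explicit list of marks is removed, rather than every combining character, because superscript letters may themselves be encoded as combining characters and deleting them would lose letters. The search returns the original forms, so the transcription that resembles the manuscript stays untouched and normalization acts purely as a query-time view.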

Subject: Digital humanities
Keywords: historical corpus; search and demonstration of data; linguistic statistics

The annotation of verbal aspect in diachrony: parameters, algorithms and problems

Анотиране на глаголния вид в диахрония: параметри, алгоритми и проблеми

  • Summary/Abstract

    Digital annotation of verbal aspect in Old Russian and Church Slavonic texts is a challenging task that requires a multifaceted approach. When studying Slavic aspect systems synchronically, we always know whether a verb is perfective, imperfective or biaspectual; this is often not the case, however, when aspect is studied from a diachronic perspective. Determining the aspectual status of a particular verb at earlier stages is possible only after considering a range of parameters together, such as actionality, lexical semantics, morphology, functional distribution, syntactic restrictions, collocations, statistics etc. All essential parameters should be annotated sufficiently for a corpus to be used effectively. That would enable a researcher to quickly collect the information necessary to build the aspectual profile of a verb. It is also important to understand the hierarchy of the parameters, as they may differ in importance, and for this purpose a special algorithm should be developed. The paper discusses preliminary results related to the parameters of annotation and the algorithm for aspect determination (using ‘Morphy’, the system for digital morphological annotation of Old Russian and Church Slavonic manuscripts developed at the Vinogradov Russian Language Institute RAS).

Administrative documents of the Don Cossack Host in the 18th – 19th centuries: the issue of the creation of a linguistic corpus

Административните документи на Донската казашка армия от XVIII–XIX век: проблемът за изграждане на лингвистичен корпус

  • Summary/Abstract

    The article presents the basic principles of designing a diachronic linguistic corpus of documents of the Don Cossack Host offices from the State Archive of the Volgograd region, Russia, including collecting documents for the text corpus, arranging the technical base for automatic processing and text editing, scheduling automated tagging, morphological annotation, and corpus software tools. The authors explain some technical aspects of corpus processing and of the composition of the text corpus. It is considered reasonable to add any document to the corpus, including draft texts with crossed-out fragments, as this ensures accurate registration of the grammar and vocabulary of the language in a given historical period. A set of language marker types has been worked out for automated meta-tagging. The corpus software tools are defined to enable accurate annotation of obsolete fonts so that they can be processed together with regular language units and expressions in morphological and genre meta-tagging; in cases of partial text adaptation, the authentic old graphic symbols may have to be preserved.

Collocations with a component -ьн(o) in Russian Chronicles: the quantitative-statistical analysis (based on the corpus of Russian Chronicles of the IAS “Manuscript”)

Колокации с компонент -ьн(o) в руските летописи: количествено-статистически анализ (върху подкорпуса на руските летописи в ИАС «Манускрипт»)

  • Summary/Abstract

    The article is dedicated to the quantitative and statistical study of linguistic units in the ancient Russian chronicles. The relevant samples were obtained by using the n-gram module of the information-analytical system (IAS) “Manuscript”, which allows textual combinations with various numbers of components to be identified. The module makes it possible to carry out a statistical analysis of linguistic units using measures of association. The aim of this work is to prove that the remainder of an indivisible noun that has preserved semantic and grammatical unity is present in the chronicles. This gives insight into the formation of the part-of-speech system of the Old Russian language. The tools of the IAS “Manuscript” allowed the conclusion that the analyzed suffixal forms in -о perform predominantly a predicative function in the syntagmas. Within the framework of this research, collocations with a component in -ьн(о) were identified that are not lexically stable (not idiomatic) but grammatically stable, that is, they represent colligations. On the whole, this paper demonstrates the effectiveness of statistical measures in extracting collocations from Old Russian texts in order to perform a complex analysis.
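The abstract above mentions measures of association without naming them. As an illustration, the sketch below scores adjacent word pairs with pointwise mutual information (PMI), one widely used measure; the choice of PMI, the function name and its parameters are assumptions and do not describe the n-gram module of the IAS “Manuscript”.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bigram_pmi(tokens: List[str], min_count: int = 1) -> Dict[Tuple[str, str], float]:
    """Score each adjacent token pair with PMI = log2(p(x, y) / (p(x) * p(y))).

    Probabilities are relative frequencies in the token list; pairs that
    occur fewer than min_count times are skipped.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores
```

PMI tends to overrate rare pairs, so a frequency threshold such as `min_count` is usually applied before ranking candidates.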

Texts of corpus of Russian dialects of Udmurtia as a source of linguistic and culturological information
Scripta & e-Scripta vol. 21, 2021
Текстовете от корпуса на руските диалекти от Удмуртия като източник на лингвистична и културологична информация

The corpus of Russian dialects of Udmurtia, created on the platform of the linguistic and geographical information system (LGIS) “Dialect” (URL: http://dialect., contains recordings of the oral speech of residents of 166 localities of the Udmurt Republic made in the 1970s–1990s. The texts are presented mainly as scanned copies of notebook pages containing transcriptions of the collectors’ conversations with the informants. There are 9300 scanned copies of pages, and all records are certified. The existing markup supports the creation of token samples and the visualization of contexts at the user’s request ( FindQuestPage), which allows us to analyze the lexical composition as well as some phonetic and grammatical features of the Russian dialects of Udmurtia. Dialect words from the texts of the corpus can be plotted on lexical, word-formation and semantic maps of the Russian dialects of Udmurtia. At present, the texts of the corpus are publicly available at http://manuscripts.ru/dialect-test/notebooks. Recordings of dialect speech can also serve as a source of non-linguistic information, namely about historical events and personalities, material and spiritual culture, the customs and traditions of the local population, and the national composition and interethnic relations in Udmurtia in the 20th century. The paper gives examples of texts from the corpus in Russian transcription with all the lexical, phonetic and grammatical features of the dialects, together with information about the speakers and the times and places of the recordings.

Subject: Digital humanities
Keywords: linguistic corpus; Russian dialects of Udmurtia; LGIS Dialect; history; cultural science; ethnography
A Bilingual Digital Edition of La Belle et la Bête and its Russian Translation by Kh. Demidova
Scripta & e-Scripta vol. 21, 2021
Двуезичното дигитално издание на La Belle et la Bête и неговият руски превод от Х. Демидова
This paper presents a digital edition of the manuscript of the first Russian translation of Leprince de Beaumont’s fairy tale The Beauty and the Beast (1756), aligned with its French original. The translation was made in 1758 by a twelve-year-old girl, Khionia Demidova (1746-1792), and dedicated to her elder brother. The original manuscript is held at the scientific library of Saratov State University (no. 456). This document is interesting from several points of view: the “naive” translation made by a young girl allows us to understand how French literature was perceived in 18th-century Russia, what aspects of the French language and of the socio-cultural phenomena of Western Europe were difficult to understand, and how those socio-cultural phenomena were perceived. The peculiarities of Khionia’s spelling and punctuation provide data on her knowledge of Russian grammar and orthography. The digital edition includes a multi-layer transcription of the source document aligned with a digital facsimile and the original French text. It is published online on the TXM-IHRIM web portal ( The workflow of the edition (Microsoft Word, Oxgarage and TXM) may be reused for similar editions and text corpora.

Subject: Digital humanities
Keywords: digital editing; TXM; Russian literature of the 18th century; Russian-French literary contacts

Scribe vs. authorship clustering in historic manuscripts with LiViTo: A case study with visual & linguistic features

Групиране по преписвачи и авторство на ръкописи с историческо съдържание с помощта на LiViTo: казус с анализ на визуалните и лингвистичните особености

  • Summary/Abstract

    When the scribe or author of a larger collection of manuscripts is unknown or in doubt, analyzing the manuscripts can take a lot of time and effort. The more pages and potential writers are involved, the more complicated it is to get tangible results. LiViTo is a free tool that requires only minimal experience with the command line and allows a simplified search for keywords and revisions as well as clustering of historical manuscripts. We present the application of LiViTo to the “lab case” of the biographies of Czech Protestant refugees from the 18th–19th century. Most of these manuscripts contain stories of farmers’ and craftsmen’s families who fled to Berlin because of their religious beliefs. This is the first time that biographies and manuscripts of this type in Czech have been examined using the methods of Digital Humanities. The individual functions of the tool are explained using extracts from the research project in which LiViTo was developed. In addition, individual findings relating to the manuscripts and the potential further development of the tool are presented.

Deep Mining of the Collection of Old Prints ‘Kirchenslavica digital’

Цифровизиране с извличане на семантични данни на сбирката от старопечатни книги Kirchenslavica digital

  • Summary/Abstract
    The article deals with various efforts of the Staatsbibliothek zu Berlin (SBB) to make the content of its collection of about 250 Church Slavonic prints from the 17th to the 19th century accessible using methods of modern information technology from the Digital Humanities sector. The focus is on full-text indexing of the heterogeneous Church Slavonic prints using HTR+ language models from the programme Transkribus. Depending on whether they are Moscow, Kiev or Old Believer prints, these models require different approaches and corresponding adaptations that take into account the printing area and printing period. Prints such as Kirillova kniga (1644) or Gistorija Ioanna Damaskina (1637) and many others are processed at large scale, and the character recognition models developed are constantly refined by training on newly verified data. The full texts generated in this way are permanently stored in various XML formats (ALTO, PAGE), on the one hand in a central repository for subsequent use, and on the other hand merged with the original digital copies in the IIIF-compatible Digital Library of the SBB. As a further element, the Church Slavonic full texts will be indexed using special SOLR analyzers for efficient searches (tokenising, transliteration, n-grams) and made searchable in subject portals (including the Slavistik-Portal) using modern text-image web design.
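Full texts stored as ALTO XML can be read back line by line before indexing. The sketch below is an illustration under stated assumptions, not the SBB pipeline: it relies only on the ALTO convention that words are `String` elements with a `CONTENT` attribute inside `TextLine` elements, matches tags by local name so that it works across ALTO namespace versions, and ignores hyphenation elements. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def _local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> List[str]:
    """Return the text of every TextLine, joining its String words with spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if _local_name(element.tag) != "TextLine":
            continue
        words = [
            child.attrib["CONTENT"]
            for child in element
            if _local_name(child.tag) == "String" and "CONTENT" in child.attrib
        ]
        lines.append(" ".join(words))
    return lines
```

The resulting lines could then be passed to a search index; how the SOLR analyzers mentioned in the abstract are configured is not specified there and is not modeled here.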

Using Handwritten Text Recognition (HTR) Tools to Transcribe Historical Multilingual Lexica

Използване на приложения за разпознаване на ръкописни текстове (HTR) при транскрибиране на многоезични исторически лексикони

  • Summary/Abstract
    The paper discusses some results obtained as part of an ongoing project at the Slavic Institute of Heidelberg University to produce automatic transcriptions of an early 18th century trilingual printed dictionary (Fedor Polikarpov’s Leksikon trejazyčnyj) and, on a preliminary basis, of a 17th century trilingual manuscript (Epifanij Slavineckii’s working copy of his Greek–Slavic–Latin dictionary) using the handwritten text recognition (HTR) platforms Transkribus and eScriptorium. It is argued that there are considerable advantages to employing such tools in terms of the simplification and acceleration of work on multilingual edition projects. Moreover, a comparison of our experience working with Transkribus and eScriptorium is given, along with an overview of the practical benefits and challenges of working with each of these platforms.

Using Handwritten Text Recognition on bilingual Evenki-Russian manuscripts of Konstantin Rychkov

Използване на инструменти за разпознаване на ръкописни текстове (HTR) върху двуезични евенкско-руски ръкописи от колекцията на Константин Ричков

  • Summary/Abstract

    We report on applying Handwritten Text Recognition (HTR) to manuscripts from the archive of Konstantin Rychkov preserved at IOM RAS, St. Petersburg, within the INEL project. The folklore texts in Evenki (Tungusic) were collected in Western Siberia in the 1910s. We used services provided by the Transkribus platform. The necessary step of Layout Analysis proved to be time-consuming because the parallel Evenki-Russian text is arranged on the page without a strict separation line. HTR models were trained successively on different amounts of data, up to 521 pages. The best Character Error Rate (CER) attained on validation data for the largest dataset is 4.50% for models trained on all characters. The distribution of errors is non-uniform: most errors are due to just a few problematic issues, especially diacritics such as the accent marking stress. It is written high above the line and is frequently cut off from the line images at the preprocessing stage. After the stress mark was excluded from training data and recognition, the lowest CER dropped to 2.90%. We compared two recognition engines, HTR+ and PyLaia. The HTR+ model trained without stress marks made fewer errors in letters, while PyLaia performed better with respect to diacritics.
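The observation that most errors come from a few characters can be checked by tallying which characters are involved in the differences between a reference line and the recognized line. The sketch below uses the alignment from Python's `difflib`, which approximates a minimal edit script and is not the procedure used by the authors or by Transkribus; the function name is hypothetical, and the stress mark is assumed to be encoded as U+0301 (combining acute accent).

```python
from collections import Counter
from difflib import SequenceMatcher


def error_profile(ref: str, hyp: str) -> Counter:
    """Count the characters involved in mismatches between reference and hypothesis.

    Replaced and deleted spans are counted by their reference characters,
    inserted spans by their hypothesis characters.
    """
    profile = Counter()
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            profile.update(ref[i1:i2])
        elif tag == "insert":
            profile.update(hyp[j1:j2])
    return profile
```

Summing such profiles over a validation set and sorting by count shows whether a single mark, such as the stress accent, dominates the errors, which is the kind of evidence that would motivate excluding it from training and recognition.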
