TRANSKRIBUS | Scripta & e-Scripta

Scripta & e-Scripta vol. 23, 2023

Achim Rabus, Walker R. Thompson Ефективност на генерични модели HTR за историческа кирилица и глаголица: Сравнение на средства

Performance of Generic HTR Models on Historical Cyrillic and Glagolitic: Comparison of Engines

Summary/Abstract

The present study offers a comparative evaluation of the performance of different AI-based digital tools for handwritten text recognition (HTR) on historical manuscripts and prints. The focus is on generic models capable of transcribing a range of texts in a similar script. The training dataset for these comprises Old Cyrillic ustav and poluustav manuscripts, on the one hand, and early Glagolitic printed books, on the other. We give an overview of the performance statistics for the HTR platforms Transkribus and eScriptorium as well as for the command-line tool Calamari. In each case, we additionally offer a close, qualitative analysis of select examples in order to convey a sense of the models’ real-world performance. In this way, our study supplies comparative data on the respective capabilities of these technologies that ought to be of interest to scholars working with them in digital humanities projects.

Subject: Language studies Language and Literature Studies Theoretical Linguistics Applied Linguistics Historical Linguistics Computational linguistics South Slavic Languages Philology Translation Studies

Keywords: handwritten text recognition TRANSKRIBUS MACHINE LEARNING Cyrillic palaeography Glagolitic printings

Scripta & e-Scripta vol. 22, 2022

Vladimir Polomac Serbian Early Printed Books from Venice: Creating Models for Automatic Text Recognition Using Transkribus

Владимир Р. Поломац. Сръбски старопечатни книги от Венеция: създаване на модели за автоматично текстово разпознаване чрез Transkribus

The paper describes the process of creating a model for the automatic recognition of Serbian Church Slavonic printed books from Venice (from Božidar and Vincenzo Vuković’s printery) using the Transkribus software platform, which is based on the principles of artificial intelligence and machine learning. Using the example of the Prayer Book (Euchologion) (1538–1540) from Božidar Vuković’s printery, it is shown that a successful model for the automatic recognition of individual books (with around 5% of unrecognized characters) can be trained on as few as approximately 4000 words, and that increasing the amount of training material (in our case to around 38000 words) improves the model and reduces the error rate (to between 1 and 2% of unrecognized characters). The most notable result of the paper is the creation of a generic model for the automatic text recognition of Serbian Church Slavonic books from Božidar and Vincenzo Vuković’s printery. The initial version of the generic model (named Dionisio 1.0 after Božidar Vuković’s Italian pseudonym, Dionisio della Vecchia) is the first resource for the automatic recognition of the Serbian medieval Cyrillic script that is publicly available to all users of the Transkribus software platform (see https://readcoop.eu/model/dionisio-1-0/).

Subject: e-Scripta Digital humanities

Keywords: TRANSKRIBUS Automatic Text Recognition Serbian Early Printed Books Artificial Intelligence MACHINE LEARNING Venice

Scripta & e-Scripta vol. 21, 2021

Walker R. Thompson Using Handwritten Text Recognition (HTR) Tools to Transcribe Historical Multilingual Lexica

Използване на приложения за разпознаване на ръкописни текстове (HTR) при транскрибиране на многоезични исторически лексикони

Summary/Abstract

The paper discusses some results obtained as part of an ongoing project at the Slavic Institute of Heidelberg University to produce automatic transcriptions of an early 18th-century trilingual printed dictionary (Fedor Polikarpov’s Leksikon trejazyčnyj) and, on a preliminary basis, of a 17th-century trilingual manuscript (Epifanij Slavineckii’s working copy of his Greek–Slavic–Latin dictionary) using the handwritten text recognition (HTR) platforms Transkribus and eScriptorium. It is argued that employing such tools offers considerable advantages by simplifying and accelerating work on multilingual edition projects. Moreover, our experience working with Transkribus and eScriptorium is compared, along with an overview of the practical benefits and challenges of working with each of these platforms.

Subject: Digital humanities

Keywords: TRANSKRIBUS eScriptorium handwritten text recognition HTR pre-modern Early modern multilingual lexica

Alexandre Arkhipov, Anna Barinskaya, Roman Shtefura Using Handwritten Text Recognition on bilingual Evenki-Russian manuscripts of Konstantin Rychkov

Използване на инструменти за разпознаване на ръкописни текстове (HTR) върху двуезични евенкско-руски ръкописи от колекцията на Константин Ричков

We report on applying Handwritten Text Recognition (HTR) to manuscripts from the archive of Konstantin Rychkov preserved at IOM RAS, St. Petersburg, within the INEL project. Folklore texts in Evenki (Tungusic) were collected in Western Siberia in the 1910s. We used services provided by the Transkribus platform. The necessary step of Layout Analysis proved to be time-consuming because the parallel Evenki-Russian text is arranged on the page without a strict separation line. HTR models were trained successively on different amounts of data, up to 521 pages. The best Character Error Rate (CER) attained on validation data for the largest dataset is 4.50% for models trained on all characters. The distribution of errors is non-uniform: most errors stem from a few problem cases, especially diacritics such as the accent marking stress, which is written high above the line and is frequently cut off from the line images at the preprocessing stage. After excluding the stress mark from training data and recognition, the lowest CER dropped to 2.90%. We compared two recognition engines, HTR+ and PyLaia. The HTR+ model trained without stress marks made fewer errors in letters, while PyLaia performed better with respect to diacritics.

Subject: Manuscript Digital humanities

Keywords: TRANSKRIBUS PyLaia Russian Evenki bilingual manuscript
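
The Character Error Rate cited in the abstract above is conventionally defined as the edit distance between a reference transcription and the recognized text, divided by the length of the reference. The following is a minimal sketch of that definition, not the computation used by Transkribus or by the authors. It assumes that the stress mark is encoded as the combining acute accent (U+0301); the function names and the sample word are hypothetical. It illustrates how leaving the stress mark out of the comparison can lower the measured rate.

```python
STRESS_MARK = "\u0301"  # assumption: stress is encoded as a combining acute accent


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def strip_stress(text: str) -> str:
    """Remove every combining acute accent from the text."""
    return text.replace(STRESS_MARK, "")


def cer(reference: str, hypothesis: str, ignore_stress: bool = False) -> float:
    """Character error rate: edit distance divided by reference length."""
    if ignore_stress:
        reference = strip_stress(reference)
        hypothesis = strip_stress(hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: the recognizer drops the stress mark on one vowel.
reference = "мо\u0301ре"
hypothesis = "море"
print(cer(reference, hypothesis))                      # counts the missing accent
print(cer(reference, hypothesis, ignore_stress=True))  # accent left out of scoring
```

In this toy case the only error is the missing accent, so excluding the stress mark removes it from the count entirely; on real data the effect depends on how many of the errors involve that diacritic.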

Scripta & e-Scripta vol. 19, 2019

Achim Rabus Recognizing Handwritten Text in Slavic Manuscripts: a Neural-Network Approach Using Transkribus

Summary/Abstract

The paper discusses the automatic text recognition capabilities of neural network models specifically trained to recognize different styles of Church Slavonic handwriting within the software platform Transkribus. Computed character error rates of the models are in the range of 3 to 5 percent; real-life performance shows that specifically trained models, by and large, recognize simple (non-superscript) characters correctly most of the time. The error rate is higher with superscript letters, abbreviations, and word separation. Combined models consisting of training data from different sources are capable of transcribing different styles of Slavic handwriting with low error rates. Automatic text recognition using Transkribus and the models presented in this paper can help improve the efficiency of the process of digitizing Church Slavonic manuscripts and thus boost the number of digitized sources available in the future.

Subject: Language and Literature Studies Language studies Studies of Literature Philology Theory of Literature Foreign languages learning Applied Linguistics Computational linguistics Translation Studies

Keywords: CHURCH SLAVONIC TRANSKRIBUS AUTOMATIC TRANSCRIPTION MACHINE LEARNING NEURAL NETWORKS ARTIFICIAL INTELLIGENCE
