Computational linguistics | Scripta & e-Scripta

Scripta & e-Scripta vol. 23, 2023

Achim Rabus, Walker R. Thompson

Performance of Generic HTR Models on Historical Cyrillic and Glagolitic: Comparison of Engines

Summary/Abstract

The present study offers a comparative evaluation of the performance of different AI-based digital tools for handwritten text recognition (HTR) on historical manuscripts and prints. The focus is on generic models capable of transcribing a range of texts in a similar script. The training dataset for these comprises Old Cyrillic ustav and poluustav manuscripts, on the one hand, and early Glagolitic printed books, on the other. We give an overview of the performance statistics for the HTR platforms Transkribus and eScriptorium as well as for the command-line tool Calamari. In each case, we additionally offer a close, qualitative analysis of select examples in order to convey a sense of the models’ real-world performance. In this way, our study supplies comparative data on the respective capabilities of these technologies that ought to be of interest to scholars working with them in digital humanities projects.

Subject: Language studies Language and Literature Studies Theoretical Linguistics Applied Linguistics Historical Linguistics Computational linguistics South Slavic Languages Philology Translation Studies

Keywords: handwritten text recognition TRANSKRIBUS MACHINE LEARNING Cyrillic palaeography Glagolitic printings

Scripta & e-Scripta vol. 23, 2023

Victor Baranov, Roman M. Gnutikov, Maria Novak

Data Demonstration Techniques in Slavonic Historical Text Corpus “Manuscript”

Summary/Abstract

The article discusses theoretical and practical issues of creating tools for displaying a medieval Slavonic text corpus on the “Manuscript” website (http://manuscripts.ru/). The specific features of the historical corpus and its sources are the limited number of manuscripts, the variability of medieval graphics and orthography, and the complex structure and composition of the original documents. These features require special instruments and techniques for data preparation (information about a text and its physical medium, analytical tagging of fragments, variability, and so on) and for visualizing data samples, including texts. The article focuses on ways of solving two opposite tasks: displaying texts in a form as close as possible to the original and displaying them in a simplified form, and, consequently, on the possibilities of transforming them. The first task is solved by preparing a transcription in a specialized editing tool, which interacts with the full-text database and provides a complete set of required characters, text formatting, and a layout that fits the original page. To solve the second task, analytical tagging (chapters and verses, authors of texts, structure of the manuscript, main text and marginalia, and so forth) and linguistic tagging (including lemmatization) are performed so that data can be searched and transformed when displayed. The latter allows users to see a text in modern Cyrillic or Latin script, its syllables, the meaning of analytical fragments, links between the main text and its marginalia, and so forth. The ability to search data based on deep tagging is illustrated with the digital edition (LIM, MS 37, 13th c., 291 f.) that has been included in the “Manuscript” historical corpus (http://manuscripts.ru/mns/main?P_TEXT=94065041&p_lang=EN).

Subject: Language studies Language and Literature Studies Theoretical Linguistics Applied Linguistics Historical Linguistics Computational linguistics Philology Translation Studies

Keywords: Medieval Slavonic manuscripts digital edition transcription analytical and linguistic tagging Apostolus Christinopolitanus

Scripta & e-Scripta vol. 19, 2019

Achim Rabus

Recognizing Handwritten Text in Slavic Manuscripts: a Neural-Network Approach Using Transkribus

Summary/Abstract

The paper discusses the automatic text recognition capabilities of neural network models specifically trained to recognize different styles of Church Slavonic handwriting within the software platform Transkribus. Computed character error rates of the models are in the range of 3 to 5 percent; real-life performance shows that specifically trained models, by and large, recognize simple (non-superscript) characters correctly most of the time. The error rate is higher with superscript letters, abbreviations, and word separation. Combined models consisting of training data from different sources are capable of transcribing different styles of Slavic handwriting with low error rates. Automatic text recognition using Transkribus and the models presented in this paper can help improve the efficiency of the process of digitizing Church Slavonic manuscripts and thus boost the number of digitized sources available in the future.
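The character error rate mentioned in the abstract is commonly defined as the edit distance between the reference transcription and the model output, divided by the length of the reference. The following is a minimal sketch under that assumption; it is not the computation used inside Transkribus, the function names are hypothetical, and it counts Unicode code points, so a base letter and a combining superscript count as two characters.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances for the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref missing in hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    reference = "господь"   # ground-truth transcription (illustrative)
    hypothesis = "госпдь"   # model output with one dropped letter
    print(f"CER = {character_error_rate(reference, hypothesis):.3f}")
```

Under this definition, a rate of 3 to 5 percent corresponds to roughly three to five character edits per hundred reference characters.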

Subject: Language and Literature Studies Language studies Studies of Literature Philology Theory of Literature Foreign languages learning Applied Linguistics Computational linguistics Translation Studies

Keywords: CHURCH SLAVONIC TRANSKRIBUS AUTOMATIC TRANSCRIPTION MACHINE LEARNING NEURAL NETWORKS ARTIFICIAL INTELLIGENCE

Scripta & e-Scripta vol. 18, 2018

Tsvetana Dimitrova, Andrej Boyadzhiev

Electronic Edition and Linguistic Annotation of Slavic Fragments

Summary/Abstract

The paper introduces a project on the edition and linguistic annotation of Medieval and Early Modern South Slavic manuscript fragments. The main topic is the implementation of various approaches to integrating electronic edition, manuscript description, and linguistic annotation. The corpus will include fragments from parchment manuscripts kept in Bulgarian repositories. We will illustrate the approach with several pieces of text from various fragments. The representation will be supplied with textual as well as part-of-speech and basic syntactic annotation. On this basis, an attempt will be made at experimental annotation of anaphora and related morpho-syntactic phenomena. The work will offer a discussion of the features that will be useful for such annotation. The project relies on the eXist database (http://exist-db.org) and the initiatives Repertorium (http://repertorium.obdurodon.org/), PROIEL (http://www.hf.uio.no/ifikk/english/research/projects/proiel/), and TOROT (http://site.uit.no/slavhistcorp/files/2015/04/Eckhoff.pdf).

Subject: Language studies Language and Literature Studies Theoretical Linguistics Applied Linguistics Studies of Literature Computational linguistics South Slavic Languages Philology South Slavic manuscripts Fragments Linguistic annotation Linguistic corpora Electronic text edition Electronic description XML technologies

Scripta & e-Scripta vol. 18, 2018

Anissava Miltenova

Terminology in Palaeoslavistics and Set up Networking between Existing Digital Corpora

Summary/Abstract

The paper discusses problems and points of view related to setting up a network between the Scripta Bulgarica project (http://www.scripta-bulgarica.eu/bg), the Repertorium of Old Bulgarian literature and letters (http://repertorium.obdurodon.org/), and other corpora (e.g. Codex Suprasliensis from the 10th century: http://suprasliensis.obdurodon.org/) in order to further improve the linking between databases. The proposed networking will connect transcribed texts with terminology in palaeoslavistics and with other online resources, such as electronic editions of individual sites, electronic dictionaries, encyclopedias, bibliographic arrays, and so on. The networking will resolve a number of problems that cannot yet be solved in a satisfactory way. The results will be useful not only for palaeoslavists but also for librarians, teachers, students, representatives of mass media, and the general public interested in Slavic literacy.

Subject: Ontology of terms Palaeoslavistic Computer technologies Standardization of formats XML approach Language studies Language and Literature Studies Theoretical Linguistics Applied Linguistics Studies of Literature Computational linguistics Bulgarian Literature South Slavic Languages Philology

Scripta & e-Scripta vol. 14-15, 2015

Andrej Boyadzhiev

Writing Old Cyrillic and Glagolitic in GNU/Linux with the Bulgarian Phonetic Traditional Keyboard Layout

The paper proposes several approaches for extending the ability to write Medieval Slavonic Cyrillic and Glagolitic letters in a GNU/Linux environment. This is achieved by extending an existing keyboard layout, adding a newly defined Glagolitic one, and adding more key combinations through the multi key (compose key) technique. The proposal has been tested and works in openSUSE GNU/Linux versions 11.3 through 13.2 and in the rolling release Tumbleweed, with the KDE4, Plasma 5, and GNOME desktop environments.
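To give a sense of what the compose key technique looks like in practice, the sketch below generates entries in the syntax of an X11 Compose file (for example a user's `~/.XCompose`). The key sequences are hypothetical and chosen only for illustration; they are not the combinations proposed in the paper, and the helper name `compose_line` is likewise an assumption.

```python
import unicodedata

# Hypothetical sequences (NOT those from the paper): the keysyms typed
# after the compose key, mapped to the character they should produce.
SEQUENCES = {
    ("Cyrillic_ie", "Cyrillic_softsign"): "\u0463",  # small letter yat
    ("Cyrillic_o", "Cyrillic_u"): "\u046B",          # small letter big yus
    ("g", "a"): "\u2C00",                            # Glagolitic capital azu
}


def compose_line(keys: tuple[str, ...], char: str) -> str:
    """Format one Compose-file entry: key sequence, result string,
    Unicode keysym, and the character name as a trailing comment."""
    sequence = " ".join(f"<{k}>" for k in ("Multi_key", *keys))
    return f'{sequence} : "{char}" U{ord(char):04X} # {unicodedata.name(char)}'


if __name__ == "__main__":
    for keys, char in SEQUENCES.items():
        print(compose_line(keys, char))
```

Each printed line could be appended to a Compose file; whether a given desktop environment or input method honors that file depends on its configuration.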

Subject: Language and Literature Studies Library and Information Science Electronic information storage and retrieval Applied Linguistics Philology Computational linguistics Other Cataloguing Archiving
