Deep Mining of the Collection of Old Prints ‘Kirchenslavica digital’

Цифровизиране с извличане на семантични данни на сбирката от старопечатни книги Kirchenslavica digital

Author(s): Vladimir Neumann
Subject(s): Digital humanities
Published by: Institute for Literature BAS
Print ISSN: 1312-238X
Summary/Abstract:
The article describes the efforts of the Staatsbibliothek zu Berlin (SBB) to make the content of its collection of about 250 Church Slavonic prints from the 17th to the 19th century accessible using Digital Humanities methods. The focus is on full-text indexing of these heterogeneous prints with HTR+ language models from the Transkribus programme. Moscow, Kiev and Old Believer prints each require a different approach, with models adapted to the printing area and printing period. Prints such as Kirillova kniga (1644), Gistorija Ioanna Damaskina (1637) and many others are processed at large scale, and the character recognition models are continually refined by training on newly verified data. The resulting full texts are stored permanently in XML formats (ALTO, PAGE) in a central repository for subsequent use, and are also merged with the original digital copies in the IIIF-compatible Digital Library of the SBB. In addition, the Church Slavonic full texts will be indexed with special Solr analyzers for efficient search (tokenising, transliteration, n-grams) and made searchable in subject portals, including the Slavistik-Portal, through a modern text-and-image web interface.
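
The three analysis steps named in the abstract (tokenising, transliteration, n-grams) can be illustrated with a minimal Python sketch. This is not the Solr configuration used by the SBB, which the abstract does not describe. The function names (`tokenise`, `transliterate`, `char_ngrams`, `index_terms`) and the small transliteration table are hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Illustrative subset only (an assumption, not the SBB mapping): a real
# analyzer would cover the full Church Slavonic alphabet and follow a
# documented transliteration standard.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "і": "i", "к": "k", "л": "l",
    "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "у": "u", "ѣ": "ě",
}


def tokenise(text):
    """Lowercase the text, drop combining marks, and split it into words.

    Stripping combining marks removes accents and titla, but it also
    folds letters such as "й" into "и"; a production analyzer would
    probably treat these cases separately.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"\w+", stripped.lower())


def transliterate(token):
    """Map each Cyrillic letter to Latin; unknown characters pass through."""
    return "".join(TRANSLIT.get(ch, ch) for ch in token)


def char_ngrams(token, n=3):
    """Return all character n-grams of a token (the token itself if shorter)."""
    if len(token) < n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def index_terms(text, n=3):
    """Chain the three steps: tokenise, transliterate, then build n-grams."""
    terms = []
    for token in tokenise(text):
        terms.extend(char_ngrams(transliterate(token), n))
    return terms


if __name__ == "__main__":
    print(index_terms("Кни\u0301га Кири\u0301ллова"))
```

Indexing transliterated n-grams alongside the original tokens would let a Latin-script query or a partially remembered spelling match the Cyrillic full text. Whether the SBB index combines the steps in this order is an assumption.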

Journal: Scripta & e-Scripta vol. 21, 2021

Page Range: 207-216
No. of Pages: 10
Language: English

Year: 2021
Issue No: Scripta & e-Scripta vol. 21, 2021

Submitted on: 19 November 2021
Vladimir Neumann

Germany

Vladimir.neumann@sbb.spk-berlin.de

Staatsbibliothek zu Berlin

Description

Vladimir Neumann studied Slavic Studies in Bonn and received his doctorate in Berlin. He currently works as a subject specialist for Slavistics at the Berlin State Library, where he has been involved for over 20 years not only in the development of the East European collection, but also in the continuous expansion of the Slavistik-Portal, which serves as a central hub for scholarly information in the field. In recent years, his work has focused on the conversion of all German-language and international Slavic bibliographies into database format, the digitization and OCR-based indexing of the library’s collection of Church Slavonic prints (Kirchenslavica digital, 17th to 19th century), and the processing and transformation of historical multilingual Slavic dictionaries (MultiSlavDict). The latter project currently comprises around a dozen sources with more than 300,000 lemmas and approximately five million word forms. Vladimir Neumann also regularly offers online training courses at the Berlin State Library on topics such as Natural Language Processing, text corpora in Slavistics, and OCR techniques for Church Slavonic sources.
KEYWORDS: Church Slavonic // Old prints // Transkribus; automatic transcription // model training // data processing and retrieval
