Deep Mining of the Collection of Old Prints ‘Kirchenslavica digital’
Scripta & e-Scripta vol. 21, 2021
Vladimir Neumann
Bulgarian title, translated: Digitisation with Extraction of Semantic Data from the Collection of Old Printed Books Kirchenslavica digital
The article describes efforts by the Staatsbibliothek zu Berlin (SBB) to open up the
content of its collection of about 250 Church Slavonic prints from the 17th to the 19th
century using information technology methods from the Digital Humanities.
The focus is on full-text indexing of the heterogeneous
Church Slavonic prints using HTR+ recognition models from the Transkribus programme.
Depending on whether they are Moscow, Kiev or Old Believer prints, these models require
different approaches and corresponding adaptations that take into account the place
and period of printing. Prints such as Kirillova kniga (1644) or Gistorija Ioanna Damaskina
(1637) and many others are processed at scale, and the character recognition
models are continually refined by training on newly verified data. The full texts
generated in this way are stored permanently in XML formats (ALTO, PAGE)
in a central repository for subsequent use, and are also
merged with the original digital copies in the IIIF-compatible Digital Library of the SBB.
As a further element, the Church Slavonic full texts will be indexed using special Solr
analyzers for efficient searches (tokenising, transliteration, n-grams) and made searchable in
subject portals (including the Slavistik-Portal) using a modern text-image web design.
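
The three analysis steps named above (tokenising, transliteration, n-grams) can be illustrated with a minimal Python sketch. It is not the SBB's Solr configuration: the function names, the trigram size, and above all the small `TRANSLIT` table are assumptions made for illustration, and the table covers only a handful of letters.

```python
import re
import unicodedata

# Hypothetical, deliberately incomplete transliteration table (an assumption,
# not the mapping used by the SBB or by any Solr filter).
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "з": "z", "и": "i", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ъ": "", "ь": "", "ѣ": "ě", "і": "i", "ѡ": "o",
}


def tokenise(text):
    """Lower-case, strip combining marks (titlo, accents), split into words."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"\w+", stripped)


def transliterate(token):
    """Map each Cyrillic letter to Latin; unknown characters pass through."""
    return "".join(TRANSLIT.get(ch, ch) for ch in token)


def char_ngrams(token, n=3):
    """Return overlapping character n-grams; short tokens are kept whole."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def analyse(text, n=3):
    """Run the three steps and return one record per token."""
    records = []
    for token in tokenise(text):
        latin = transliterate(token)
        records.append({"token": token, "translit": latin,
                        "ngrams": char_ngrams(latin, n)})
    return records


if __name__ == "__main__":
    for record in analyse("Кириллова книга"):
        print(record)
```

Indexing the transliterated form and its n-grams alongside the original token is one way to let a Latin-script or partially spelled query still match a Church Slavonic word; a production Solr setup would express the same chain as tokenizer and filter definitions in the schema.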
Subject: Digital humanities
Keywords: Church Slavonic; old prints; Transkribus; automatic transcription; model training; data processing and retrieval