Victor Baranov

Victor A. Baranov, Kalashnikov Izhevsk State Technical University, Doctor of Philological Sciences, Professor. Fields of specialization: history of the Russian language, dialectology, phonetics of modern Russian, computational linguistics, full-text databases, publication of ancient Slavonic manuscripts, corpus linguistics, linguistic statistics. Head of the project The historical corpus “Manuscript” (manuscripts.ru).

Baranov, Victor, Prof., DSc. Kalashnikov Izhevsk State Technical University, Russia

Способы демонстрации данных славянского исторического полнотекстового корпуса “Манускрипт”

Data Demonstration Techniques in Slavonic Historical Text Corpus “Manuscript”

  • Summary/Abstract

    The article discusses theoretical and practical issues of creating tools for displaying a medieval Slavonic text corpus on the “Manuscript” website (http://manuscripts.ru/). The specific features of the historical corpus and its sources are the limited number of manuscripts, the variability of medieval graphics and orthography, and the complex structure and composition of the original documents. These features require special instruments and techniques for data preparation (information about a text and its physical medium, analytical tagging of fragments, variability, and so on) and for the visualization of retrieved data, including texts. The article focuses on ways of solving two opposite tasks: displaying texts in a form as close as possible to the original and displaying them in a simplified form, and, consequently, on the possibilities of transforming them. The first task is solved by preparing a transcription in a specialized editing tool that interacts with the full-text database and provides a complete set of required characters, text formatting, and a layout that fits the original page. To solve the second task, analytical tagging (chapters and verses, authors of texts, structure of the manuscript, main text and marginalia, and so forth) and linguistic tagging (including lemmatization) are performed, which make data search and data transformation available at display time. The latter allows users to see a text in modern Cyrillic or Latin script, its syllables, the meaning of analytical fragments, links between the main text and its marginalia, and so forth. The ability to search data on the basis of deep tagging is illustrated by the digital edition (LIM, MS 37, 13th c., 291 f.) that has been included in the “Manuscript” historical corpus (http://manuscripts.ru/mns/main?P_TEXT=94065041&p_lang=EN).


Foreword by the Guest Editors

Предговор от гост-редакторите

  • Summary/Abstract

    The publications in the e-Scripta section are selected, reviewed and revised papers delivered at El’Manuscript 2021 (Freiburg/online) in April 2021 (www.elmanuscript2021.uni-freiburg.de). El’Manuscript is a series of biennial international conferences entitled “Textual Heritage and Information Technologies” that brings together linguists, specialists in historical source criticism, IT specialists, and others involved in publishing and studying our textual heritage. It is the official conference of the Special Commission on the Computer-Supported Processing of Mediæval Slavonic Manuscripts and Early Printed Books to the International Committee of Slavists. In the 2021 iteration, it coincided with the meeting of the Humboldt research group linkage program DigiPalSlav (Slavic Department at Freiburg University and Institute for the Russian Language, Russian Academy of Sciences, Moscow) devoted to developing and applying digital tools for pre-modern Orthodox Slavic such as neural taggers and Handwritten Text Recognition models. These issues are also reflected in the thematic focus of the 2021 iteration of El’Manuscript and, consequently, in the topics of the papers submitted and accepted for publication. We would like to thank our external reviewers for their thorough work and for meeting our tight deadlines. Furthermore, we thank Elena Renje for her valuable support. Many thanks are due to the Humboldt Foundation for financing the publication of this volume. Finally, we are grateful to the editor of Scripta & e-Scripta, Anissava Miltenova, for her tireless work and support.


Eliminating variation of linguistic units of the Slavonic historical corpus to facilitate search, demonstration and statistical analysis
Scripta & e-Scripta vol. 21, 2021
Елиминиране на вариативността на лингвистичните единици в славянския исторически корпус с цел улесняване на търсенето, визуализирането и статистическия анализ

The work demonstrates methods and techniques for eliminating the variation of linguistic units in the transcriptions of medieval Slavonic manuscripts in the historical corpus “Manuscript” (manuscripts.ru). The textual corpus, whose material consists of machine-readable copies that resemble the originals as closely as possible, provides the user with tools for transforming (modifying) linguistic units, which make it possible to create queries and obtain retrievals that correspond to the task to be solved. In an inexact search, the user can delete titlos and diacritics, reduce letter variants to their basic form, specify a mask of the linguistic units sought in the form of a regular expression, and use letters of the contemporary Cyrillic alphabet. To ensure operations over lemmas by means of the statistical modules of the corpus, it is necessary to automatically assign a given textual form to exactly one lemma. Due to grammatical homonymy, incorrect lemmatization would result in a situation where quantitative data based on word forms and data based on lemmas do not match each other. In order to assign word forms to the correct lemma, we apply a rule-based approach that takes into account the formal and quantitative characteristics of the linguistic units (such as their morphological variation or invariance, their frequency in the sub-corpus, whether or not they match the lemma form, the frequency of relationships between the textual forms and the dictionary paradigms of variable words, and the results of manual elimination of homonymy). The reduction of textual forms to unified, normalized, transliterated or initial forms is a necessary procedure for extracting data from the historical corpus for the distributive-statistical analysis of the semantics of linguistic units.

Subject: Digital humanities
Keywords: historical corpus; search and demonstration of data; linguistic statistics
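
The inexact-search options listed in the abstract above (deleting titlos and diacritics, reducing letter variants to a basic form, and matching a regular-expression mask) can be illustrated with a minimal Python sketch. It is not the corpus's actual implementation: the `VARIANT_TO_BASE` table, the function names, and the sample word forms are hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Hypothetical, deliberately small table reducing early Cyrillic letter
# variants to a basic modern Cyrillic letter; a real corpus would need a
# much fuller and carefully justified mapping.
VARIANT_TO_BASE = {
    "ѡ": "о",  # omega
    "ѻ": "о",  # round omega
    "ꙋ": "у",  # monograph uk
    "ѕ": "з",  # dze
    "і": "и",  # decimal i
    "ѣ": "е",  # yat
}


def normalize(form: str) -> str:
    """Lower-case a word form, drop combining marks, unify letter variants."""
    # NFC keeps precomposed letters such as "й" intact, while titlos and
    # accents remain separate nonspacing marks (category "Mn").
    composed = unicodedata.normalize("NFC", form.lower())
    stripped = "".join(
        ch for ch in composed if unicodedata.category(ch) != "Mn"
    )
    return "".join(VARIANT_TO_BASE.get(ch, ch) for ch in stripped)


def search(mask: str, forms: list[str]) -> list[str]:
    """Return the original forms whose normalized shape matches the mask."""
    pattern = re.compile(mask)
    return [f for f in forms if pattern.fullmatch(normalize(f))]


if __name__ == "__main__":
    sample = ["ѡтьць", "о\u0301тьца", "бг\u0483ъ"]
    print([normalize(f) for f in sample])
    print(search(r"от[ьъ]ц.", sample))
```

The search returns the original spellings rather than the normalized ones, so the user still sees each form as it stands in the transcription; normalization is applied only for matching.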

Создание и использование исторических корпусов славянских письменных памятников

Creation and Using of Historical Corpora of Slavonic Manuscripts

  • Summary/Abstract

    The requirements for historical corpora of medieval texts 1) are determined by the properties of the data and by the historical-linguistic, textological and linguo-textological tasks to be solved, and 2) should be realized with the help of special tagging, processing procedures, query parameters and retrieval demonstrations. The corpus should a) have metadata concerning both texts and manuscripts, and involving both linguistic and analytical tagging; b) support the rendering of documents (facsimile and transcription), concordances, lists, and comparison of subcorpora data; c) simplify graphic-orthographic variation during data search and visualization; d) provide tools both for processing and searching linguistic material and for its further analysis according to traditional methods; and e) support problem description and resolution by applying corpus methods that engage with the quantity, distribution, co-occurrence, and variation of linguistic units in big data arrays. The realization of these requirements is demonstrated on a subcorpus of three copies of chronicles (Laurentian, Hypatian, Radzivilovsky) from the historical corpus project “Manuscript” (manuscripts.ru).

Исторический корпус как цель и инструмент корпусной палеославистики

Diachronic OCS Corpus as an Object and an Instrument of Corpus Palaeoslavistics


Proposal for a unified encoding of Early Cyrillic glyphs in the Unicode Private Use Area

  • Summary/Abstract

    The paper proposes an encoding standard for early Cyrillic characters and glyphs that are still missing from the Universal Character Set (UCS) of the Unicode Standard and, for various reasons, will probably never be included, but are nevertheless used by the paleoslavistic community. This micro-standard is meant to expand, not to replace, the Unicode Standard and follows the path chosen by the Medieval Unicode Font Initiative (MUFI) a few years ago for the Latin script (see http://www.hit.uib.no/mufi/). Starting from the inventory of Old Cyrillic originally proposed at the conference held in Belgrade on 15–17 October 2007 (see BP), and taking into account the recommendations given by Birnbaum et al. 2008 and the MUFI consortium, the chosen set is limited to 178 units with a specific function (characters and composites, superscript characters, modifier characters, and punctuation marks), which are located in the Private Use Area (PUA). Their positions (code points) are coordinated with MUFI. We will call this set PUA1. In the future, a second set, PUA2, will be proposed for a number of ligatures and paleographic variants that may not be coordinated with MUFI and are intended for special publications addressed to Slavistic readers. It is hoped that the proposed PUA encoding for Early Cyrillic Symbols, for which we choose the abbreviation CYFI, will establish itself as a sort of micro-standardization. Designers of scholarly fonts are encouraged to include these symbols according to this proposal (see code points in the appendix).
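
Because Private Use Area code points have no meaning assigned by Unicode itself, a transcription that relies on such an encoding can only be rendered correctly with a font that follows the same convention. The following Python sketch shows one way to locate PUA characters in a text, assuming the Basic Multilingual Plane PUA range U+E000–U+F8FF. The function name is hypothetical, and the sample code point U+E000 is arbitrary rather than taken from the proposal's appendix.

```python
# Boundaries of the Private Use Area in the Basic Multilingual Plane.
PUA_START = 0xE000
PUA_END = 0xF8FF


def find_pua_chars(text: str) -> list[tuple[int, str]]:
    """Return (position, 'U+XXXX') for every BMP PUA character in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if PUA_START <= ord(ch) <= PUA_END
    ]


if __name__ == "__main__":
    # U+E000 is used here only as an arbitrary PUA code point.
    sample = "а\uE000б"
    print(find_pua_chars(sample))
```

A check of this kind could be run before publication to list every character whose display depends on a font that follows the agreed PUA assignments.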


Полнотекстовые базы данных как основа для электронных изданий средневековых рукописей в Интернете: требования, реализация, перспективы

Full-text Data Bases as Foundation of Electronic Publications of Mediaeval Manuscripts in Internet: Requirements, Realization, Perspectives

  • Summary/Abstract

    The article addresses the storage and database processing of transcriptions of early Slavonic written monuments and their publication on the Internet. The main focus is on the requirements that should be placed on research-oriented information and analytical systems containing information both about the manuscripts and texts themselves and about their textological and linguistic units. In addition to the familiar user components, namely tools for (1) navigation, (2) query construction, and (3) ordering and visualization of retrievals, such systems should also have the components needed to create full-text electronic collections and libraries: (1) modules for entering and editing data (texts and their units) and information about them, (2) tools for establishing links between units of texts, manuscripts and their parts, (3) reference lists, dictionaries and authority files for meta-, analytical and linguistic description and parsing, and (4) tools for the automated transformation of units. An example of a multifunctional system that meets these requirements is the information and analytical system (IAS) “Manuscript”, under development since 2003 at Udmurt State University and Izhevsk State Technical University (project head: Victor A. Baranov; project portal URL: http://manuscripts.ru/). The article presents the functionality of the system's main modules: (1) the specialized editor OldEd, (2) the grammatical dictionaries module, (3) the web module for searching and lemmatizing textual occurrences, (4) the web form for searching on the basis of meta- and analytical information, and (5) the web module for queries and presentation of collection materials.

