Victor Baranov | Scripta & e-Scripta

Victor A. Baranov, Kalashnikov Izhevsk State Technical University, Doctor of Philological Sciences, Professor. Fields of specialization: history of the Russian language, dialectology, phonetics of modern Russian, computational linguistics, full-text databases, publication of ancient Slavonic manuscripts, corpus linguistics, and linguistic statistics. Head of the historical corpus project “Manuscript” (manuscripts.ru).

Baranov, Victor

victor.a.baranov@gmail.com

Baranov, Victor, Prof., DSc. Kalashnikov Izhevsk State Technical University, Russia


Scripta & e-Scripta vol. 24, 2024

Victor Baranov Cтатистическая значимость компонентов лексических синонимических рядов в древнеболгарских письменных памятниках: поиск метода

Statistical significance of the components of lexical synonymous series in ancient Bulgarian written manuscripts: search for a method

Summary/Abstract

The article presents the results of statistical experiments aimed at finding the characteristics of words traditionally regarded as the Ohrid-Moravian and Preslav components of synonymous series: иерѣи ‘priest’ – жьрьць ‘priest, cleric’ – свѧщеньникъ ‘priest, clergyman’; колѣно ‘knee’, ‘kindred’ – племѧ ‘tribe, genus’; коньчина ‘demise, end’ – коньць ‘end’; кънигы ‘books’ – писаниѥ ‘scripture’; любодѣица ‘adulteress, fornicator’ – блѫдьница ‘harlot’. Using information about the relative number of words in a subcorpus and about significant deviations from the average values, together with the statistical characteristics of lexemes calculated for each subcorpus, made it possible, in particular, to detect opposed and non-opposed components of synonymous series. The methods used show that the degree of opposition between synonyms varies: it can be statistically significant or statistically insignificant. On this basis, the article concludes that the components of synonymous series should no longer be attributed unconditionally to the Ohrid-Moravian or Preslav vocabulary: the relations between the components of each synonymous series are individual and can range from statistically opposed in the texts of different schools to

Subject: e-Scripta

Keywords: Old Bulgarian writing; Western Bulgarian; Eastern Bulgarian; lexical synonyms; statistics; text corpus
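The abstract above does not name the statistical test used, so the following is only a minimal sketch of one common way to check whether a lexeme is significantly more frequent in one subcorpus than in another. It assumes a Pearson chi-square test on a 2×2 table (1 degree of freedom, threshold p = 0.05); the function names and all counts are hypothetical and are not taken from the article.

```python
# Hypothetical sketch: the chi-square test is an assumption, not the
# method reported in the article. Counts used with it are invented.

CRITICAL_P05 = 3.841  # chi-square critical value for 1 d.f. at p = 0.05


def per_million(count, subcorpus_size):
    """Relative frequency of a lexeme, in occurrences per million words."""
    return count * 1_000_000 / subcorpus_size


def chi_square_2x2(count_a, size_a, count_b, size_b):
    """Pearson chi-square for one lexeme observed in two subcorpora."""
    observed = [
        [count_a, size_a - count_a],  # subcorpus A: lexeme / other words
        [count_b, size_b - count_b],  # subcorpus B: lexeme / other words
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, total - count_a - count_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                # The lexeme is absent (or is the only word) everywhere:
                # there is nothing to compare.
                return 0.0
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def is_opposed(count_a, size_a, count_b, size_b):
    """True if the frequency difference passes the assumed threshold."""
    return chi_square_2x2(count_a, size_a, count_b, size_b) > CRITICAL_P05
```

With invented counts, `is_opposed(50, 1000, 5, 1000)` reports a significant difference, while `is_opposed(10, 1000, 20, 2000)` does not, because the relative frequencies are identical.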

Scripta & e-Scripta vol. 23, 2023

Victor Baranov Roman M. Gnutikov Maria Novak Способы демонстрации данных славянского исторического полнотекстового корпуса “Манускрипт”

Data Demonstration Techniques in Slavonic Historical Text Corpus “Manuscript”

Summary/Abstract

The article discusses theoretical and practical issues of creating tools for displaying the medieval Slavonic text corpus at the “Manuscript” website (http://manuscripts.ru/). The historical corpus and its sources have specific features: a limited number of manuscripts, variable medieval graphics and orthography, and the complex structure and composition of the original documents. These features call for special instruments and techniques for data preparation (information about a text and its physical medium, analytical tagging of fragments, variability, and so on) and for visualizing retrieved data, including texts. The article focuses on ways of solving two opposite tasks, namely displaying texts in a form as close as possible to the original and displaying them in a simplified form, and, consequently, on the possibilities for transforming them. The first task should be solved by preparing a transcription in a specialized editing tool that interacts with the full-text database and provides the complete set of required characters, text formatting, and a layout that fits the original page. For the second task, analytical tagging (chapters and verses, authors of texts, structure of the manuscript, main text and marginalia, and so forth) and linguistic tagging (including lemmatization) are performed so that data can be searched and transformed when displayed. The latter allows users to see a text in modern Cyrillic or Latin script, its syllables, the meaning of analytical fragments, links between the main text and its marginalia, and so forth. Data search based on deep tagging is illustrated with the digital edition (LIM, MS 37, 13th c., 291 f.) that has been included in the “Manuscript” historical corpus (http://manuscripts.ru/mns/main?P_TEXT=94065041&p_lang=EN).

Subject: Language studies; Language and Literature Studies; Theoretical Linguistics; Applied Linguistics; Historical Linguistics; Computational linguistics; Philology; Translation Studies

Keywords: Medieval Slavonic manuscripts; digital edition; transcription; analytical and linguistic tagging; Apostolus Christinopolitanus

Scripta & e-Scripta vol. 21, 2021

Achim Rabus Victor Baranov Alexandr Moldovan Foreword by the Guest Editors

Предговор от гост-редакторите

The publications in the e-Scripta section are selected, reviewed and revised papers delivered at El’Manuscript 2021 (Freiburg/online) in April 2021 (www.elmanuscript2021.uni-freiburg.de). El’Manuscript is a series of biennial international conferences entitled “Textual Heritage and Information Technologies” that brings together linguists, specialists in historical source criticism, IT specialists, and others involved in publishing and studying our textual heritage. It is the official conference of the Special Commission on the Computer-Supported Processing of Mediæval Slavonic Manuscripts and Early Printed Books to the International Committee of Slavists. In its 2021 iteration, it coincided with the meeting of the Humboldt research group linkage program DigiPalSlav (Slavic Department at Freiburg University and Institute for the Russian Language, Russian Academy of Sciences, Moscow), which is devoted to developing and applying digital tools for pre-modern Orthodox Slavic, such as neural taggers and Handwritten Text Recognition models. These issues are also reflected in the thematic focus of the 2021 iteration of El’Manuscript and, consequently, in the topics of the papers submitted and accepted for publication. We would like to thank our external reviewers for their thorough work and for meeting our tight deadlines. Furthermore, we thank Elena Renje for her valuable support. Many thanks are due to the Humboldt Foundation for financing the publication of this volume. Finally, we are grateful to the editor of Scripta & e-Scripta, Anissava Miltenova, for her tireless work and support.

Subject: Digital humanities

Scripta & e-Scripta vol. 21, 2021

Victor Baranov Roman M. Gnutikov Eliminating variation of linguistic units of the Slavonic historical corpus to facilitate search, demonstration and statistical analysis

Елиминиране на вариативността на лингвистичните единици в славянския исторически корпус с цел улесняване на търсенето, визуализирането и статистическия анализ

Summary/Abstract

The work demonstrates methods and techniques for eliminating the variation of linguistic units in the transcriptions of medieval Slavonic manuscripts in the historical corpus “Manuscript” (manuscripts.ru). The corpus stores machine-readable copies that follow the originals as closely as possible, so it gives the user tools for transforming (modifying) linguistic units that make it possible to build queries and obtain retrievals suited to the task at hand. For inexact search, the user can delete titlos and diacritics, reduce letter variants to their basic forms, specify a mask for the units being searched as a regular expression, and use letters of the contemporary Cyrillic alphabet. To support operations over lemmas in the statistical modules of the corpus, each textual form must be assigned automatically to exactly one lemma. Because of grammatical homonymy, incorrect lemmatization would leave quantitative data based on word forms out of step with data based on lemmas. To assign word forms to the correct lemma, we apply a rule-based approach that takes into account the formal and quantitative characteristics of the linguistic units (such as their morphological variability or invariability, their frequency in the subcorpus, whether they match the lemma form, the frequency of relationships between the textual forms and the dictionary paradigms of variable words, and the results of manual homonymy resolution). Reducing textual forms to unified, normalized, transliterated, or initial forms is a necessary step in extracting data from the historical corpus for the distributive-statistical analysis of the semantics of linguistic units.

Subject: Digital humanities

Keywords: historical corpus; search and demonstration of data; linguistic statistics
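The inexact-search steps listed in the abstract above (deleting titlos and diacritics, reducing letter variants to a basic form, and matching a regular-expression mask) can be sketched roughly as follows. This is an illustration built on the Python standard library, not the corpus's actual implementation: the variant table `LETTER_VARIANTS` and the function names are hypothetical, and the real corpus presumably uses a far larger mapping.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny table of letter variants and the basic
# forms they are reduced to; it is an assumption made for illustration.
LETTER_VARIANTS = str.maketrans({
    "ѡ": "о",  # omega -> o
    "ꙋ": "у",  # monograph uk -> u
    "і": "и",  # decimal i -> i
    "ѕ": "з",  # dze -> ze
})


def strip_marks(text):
    """Remove titlos and other combining diacritics (category Mn).

    Side effect of this simple approach: precomposed letters such as
    'й' are decomposed too, so 'й' becomes 'и'.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


def normalize_form(text):
    """Reduce a transcribed word form to a simplified search form."""
    return strip_marks(text).lower().translate(LETTER_VARIANTS)


def search(mask, forms):
    """Return the original forms whose simplified form matches the mask."""
    regex = re.compile(mask)
    return [form for form in forms if regex.fullmatch(normalize_form(form))]
```

For example, `search("бо?гъ", ["бг\u0483ъ", "бѡгъ", "градъ"])` keeps the first two forms: the abbreviated spelling under a titlo and the spelling with omega both reduce to something the mask accepts. The matches are returned in their original spelling, so the transcription itself is never altered.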

Scripta & e-Scripta vol. 19, 2019

Victor Baranov Создание и использование исторических корпусов славянских письменных памятников

Creation and Using of Historical Corpora of Slavonic Manuscripts

Summary/Abstract

The requirements for historical corpora of medieval texts (1) are determined by the properties of the data and by the historical-linguistic, textological, and linguo-textological tasks to be solved, and (2) should be realized through special tagging, processing procedures, query parameters, and retrieval displays. The corpus should a) have metadata concerning both texts and manuscripts, with both linguistic and analytical tagging; b) support the rendering of documents (facsimile and transcription), concordances, lists, and comparison of subcorpora data; c) simplify graphic-orthographic variation during data search and visualization; d) provide tools both for processing and searching linguistic material and for its further analysis by traditional methods; and e) support the description and resolution of problems by corpus methods that deal with the quantity, distribution, co-occurrence, and variation of linguistic units in large data arrays. The realization of these requirements is demonstrated on a subcorpus of three copies of chronicles (Laurentian, Hypatian, Radzivilovsky) from the historical corpus project “Manuscript” (manuscripts.ru).

Subject: Language and Literature Studies; Language studies; Studies of Literature; Philology; Theory of Literature; Theoretical Linguistics; Applied Linguistics

Keywords: historical Slavonic corpus; Russian chronicles; linguistic statistics

Scripta & e-Scripta vol. 14-15, 2015

Victor Baranov Исторический корпус как цель и инструмент корпусной палеославистики

Diachronic OCS Corpus as an Object and an Instrument of Corpus Palaeoslavitic

Summary/Abstract

The article describes the features of diachronic corpora built on medieval Slavic codices. It presents their specific character with respect to fidelity to, and transcription of, the original objects, the relationship between markup standards and the characteristics of Old Church Slavonic texts, and the specialized forms for searching and for displaying samples. It argues for and defines a new applied branch of medieval studies: corpus palaeoslavistics supported by computer tools.

Subject: Language studies; Literature Studies; Library and Information Science; Philology; Information Architecture; Electronic information storage and retrieval; Theoretical Linguistics; Historical Linguistics; Comparative Linguistics; Western Slavic Languages; Eastern Slavic Languages; South Slavic Languages

Scripta & e-Scripta vol. 8-9, 2010

Ralf Cleminson Victor Baranov Achim Rabus David Birnbaum Heinz Miklas Proposal for a unified encoding of Early Cyrillic glyphs in the Unicode Private Use Area

Summary/Abstract

The paper proposes an encoding standard for early Cyrillic characters and glyphs that are still missing from the Universal Character Set (UCS) of the Unicode Standard and, for various reasons, will probably never be included, but are nevertheless used by the paleoslavistic community. This micro-standard is meant to expand, not to replace, the Unicode Standard and follows the path chosen by the Medieval Unicode Font Initiative (MUFI) a few years ago for the Latin script (see http://www.hit.uib.no/mufi/). Starting from the inventory of Old Cyrillic originally proposed at the conference held in Belgrade on 15–17 October 2007 (see BP), and taking into account the recommendations given by Birnbaum et al. 2008 and the MUFI consortium, the chosen set is limited to 178 units with a specific function (characters and composites, superscript characters, modifier characters, and punctuation marks), which are located in the Private Use Area (PUA). Their positions (code points) are coordinated with MUFI. We call this set PUA1. In the future, a second set, PUA2, will be proposed for a number of ligatures and paleographic variants that may not be coordinated with MUFI and are intended for special publications addressed to Slavistic readers. It is hoped that the proposed PUA encoding for Early Cyrillic Symbols, for which we choose the abbreviation CYFI, will establish itself as a sort of micro-standardization. Designers of scholarly fonts are encouraged to include these symbols according to this proposal (see code points in the appendix).

Subject: Language and Literature Studies; Early Cyrillic glyphs; Unicode 5.1; Private Use Area
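Because Private Use Area code points have no meaning assigned by Unicode itself, a transcription that uses them can only be read correctly with a font and an agreed table such as the one proposed in the paper above. A minimal sketch of how one might locate such characters in a text follows. It relies only on the fact that the Basic Multilingual Plane PUA spans U+E000 to U+F8FF; the function name is hypothetical, and the sketch does not reproduce the CYFI code-point table, which is given in the paper's appendix.

```python
# Range of the Private Use Area in the Basic Multilingual Plane.
PUA_START, PUA_END = 0xE000, 0xF8FF


def find_pua(text):
    """List (position, code point) for every BMP private-use character."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if PUA_START <= ord(ch) <= PUA_END
    ]
```

Calling `find_pua("аб\uE000в")` reports one private-use character at position 2, which an editor could then check against the agreed table before publishing the transcription.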

Scripta & e-Scripta vol. 6, 2008

Victor Baranov Полнотекстовые базы данных как основа для электронных изданий средневековых рукописей в Интернете: требования, реализация, перспективы

Full-text Data Bases as Foundation of Electronic Publications of Mediaeval Manuscripts in Internet: Requirements, Realization, Perspectives

Summary/Abstract

The article addresses the storage and processing in databases, and the publication on the Internet, of transcriptions of ancient Slavonic written monuments. It focuses on the requirements that research-oriented information and analytical systems must meet when they hold information both about the manuscripts and texts themselves and about their textological and linguistic units. Besides the familiar user-facing components, namely tools for (1) navigation, (2) query building, and (3) ordering and visualizing retrieved samples, such systems must also have the components needed to create full-text electronic collections and libraries: (1) modules for entering and editing data (texts and their units) and information about them, (2) tools for establishing links between units of texts, manuscripts, and their parts, (3) reference lists, dictionaries, and authority files for meta-, analytical, and linguistic description and parsing, and (4) tools for the automated transformation of units. An example of a multifunctional system that meets these requirements is the information and analytical system (IAS) “Manuscript”, under development since 2003 at Udmurt State University and Izhevsk State Technical University (project head: Victor A. Baranov; project portal: http://manuscripts.ru/). The article presents the functionality of the system's main modules: (1) the specialized editor OldEd, (2) the grammatical dictionaries module, (3) the web module for searching and lemmatizing text occurrences, (4) the web search form based on meta- and analytical information, and (5) the web module for querying and presenting collection materials.

Subject: Language and Literature Studies; Data bases; Mediaeval manuscripts; Electronic publications
