Eliminating variation of linguistic units of the Slavonic historical corpus to facilitate search, demonstration and statistical analysis

Елиминиране на вариативността на лингвистичните единици в славянския исторически корпус с цел улесняване на търсенето, визуализирането и статистическия анализ

Author(s): Victor Baranov, Roman M. Gnutikov
Subject(s): Digital humanities
Published by: Institute for Literature BAS
Print ISSN: 1312-238X
Summary/Abstract:

The paper demonstrates methods and techniques for eliminating the variation of linguistic units in the transcriptions of medieval Slavonic manuscripts in the historical corpus “Manuscript” (manuscripts.ru). The corpus, whose material consists of machine-readable copies that reproduce the originals as closely as possible, provides the user with tools for transforming (modifying) linguistic units; these make it possible to formulate queries and obtain results suited to the task at hand. For an inexact search, the user can delete titlos and diacritics, reduce letter variants to their basic form, specify a mask for the linguistic units being searched as a regular expression, and use letters of the contemporary Cyrillic alphabet. To support operations on lemmas in the statistical modules of the corpus, each textual form must be automatically assigned to exactly one lemma. Because of grammatical homonymy, incorrect lemmatization would produce quantitative data based on word forms that do not match the data based on lemmas. To assign word forms to the correct lemma, we apply a rule-based approach that takes into account formal and quantitative characteristics of the linguistic units (such as whether they are morphologically variable or invariable, their frequency in the sub-corpus, whether they match the lemma form, the frequency of relationships between textual forms and dictionary paradigms of variable words, and the results of manual homonymy resolution). Reducing textual forms to unified, normalized, transliterated or initial forms is a necessary step in extracting data from the historical corpus for the distributive-statistical analysis of the semantics of linguistic units.
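The inexact-search transformations named in the abstract (deleting titlos and diacritics, reducing letter variants, searching by a regular-expression mask) can be illustrated with a minimal Python sketch. It is not the corpus's actual implementation: the function names, the tiny variant table `LETTER_VARIANTS`, and the sample word forms are all hypothetical and chosen only for illustration.

```python
import re
import unicodedata

# Hypothetical table mapping letter variants to a basic form.
# The real corpus presumably uses a much larger, curated mapping.
LETTER_VARIANTS = str.maketrans({
    "\u0456": "и",  # і (decimal i)    -> и
    "\u0461": "о",  # ѡ (omega)        -> о
    "\ua64b": "у",  # ꙋ (monograph uk) -> у
})


def strip_marks(text):
    """Drop combining marks (titlo U+0483, accents, breathings).

    Precomposed letters such as й are left intact because the text
    is not decomposed before filtering.
    """
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("M"))


def normalize(form):
    """Lowercase, remove titlos/diacritics, reduce letter variants."""
    return strip_marks(form.lower()).translate(LETTER_VARIANTS)


def search(mask, forms):
    """Return the original forms whose normalized version fully matches the mask."""
    pattern = re.compile(mask)
    return [f for f in forms if pattern.fullmatch(normalize(f))]


# Sample forms: one with a titlo, one with an accent, one with omega.
forms = ["бг\u0483ъ", "бо\u0301гъ", "\u0461тьць"]
hits = search(r"б.?гъ", forms)  # matches the abbreviated and the full spelling
```

Matching is done on the normalized copy while the original transcription is what gets returned, so the search stays tolerant of variation without altering the stored text.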

Journal: Scripta & e-Scripta vol. 21, 2021

Page Range: 107-121
No. of Pages: 15
Language: English

Year: 2021
Issue No: Scripta & e-Scripta vol. 21, 2021

Submitted on: 20 November 2021
Victor Baranov

Russia

victor.a.baranov@gmail.com

Baranov, Victor, Prof., DSc. Kalashnikov Izhevsk State Technical University, Russia

Description

Victor A. Baranov, Kalashnikov Izhevsk State Technical University, Doctor of Philological Sciences, Professor. Fields of specialization: history of the Russian language, dialectology, phonetics of modern Russian, computational linguistics, full-text databases, publication of ancient Slavonic manuscripts, corpus linguistics, linguistic statistics. Head of the project The historical corpus “Manuscript” (manuscripts.ru).

Roman M. Gnutikov

romaashka@gmail.com

Udmurt State University

Description

Roman M. Gnutikov, Udmurt State University. Programmer, main developer of the components, modules and programs of The historical corpus “Manuscript”.

KEYWORDS: historical corpus // search and demonstration of data // linguistic statistics
