Victor Baranov
Roman M. Gnutikov
Eliminating variation of linguistic units of the Slavonic historical corpus to facilitate search, demonstration and statistical analysis
Елиминиране на вариативността на лингвистичните единици в славянския исторически корпус с цел улесняване на търсенето, визуализирането и статистическия анализ
Summary/Abstract
This paper presents methods and techniques for eliminating the variation of linguistic units in transcriptions of medieval Slavonic manuscripts in the historical corpus “Manuscript” (manuscripts.ru). The corpus, whose material consists of machine-readable copies that reproduce the originals as closely as possible, provides the user with tools for transforming (modifying) linguistic units, which make it possible to formulate queries and obtain results suited to the task at hand. For an inexact search, the user can delete titlos and diacritics, reduce letter variants to their basic form, specify a mask for the linguistic units sought in the form of a regular expression, and use the letters of the contemporary Cyrillic alphabet. To support operations on lemmas in the statistical modules of the corpus, each textual form must be automatically assigned to exactly one lemma. Because of grammatical homonymy, incorrect lemmatization would lead to quantitative data based on word forms not matching data based on lemmas. To assign word forms to the correct lemma, we apply a rule-based approach that takes into account the formal and quantitative characteristics of the linguistic units (such as their morphological variability or invariability, their frequency in the sub-corpus, whether they match the lemma form, the frequency of relationships between textual forms and the dictionary paradigms of variable words, and the results of manual homonymy resolution). Reducing textual forms to unified, normalized, transliterated or initial forms is a necessary step in extracting data from the historical corpus for the distributive-statistical analysis of the semantics of linguistic units.
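The inexact-search transformations listed in the abstract (deleting titlos and diacritics, reducing letter variants, matching a regular-expression mask) can be illustrated with a minimal Python sketch. It is not the corpus's actual implementation: the names `VARIANTS`, `normalize` and `search` and the small variant table are assumptions made for illustration, and treating every Unicode combining mark as a removable titlo or diacritic is a simplification.

```python
import re
import unicodedata

# Hypothetical variant-to-base table; the corpus's real table is not
# given in the abstract, so these pairs are illustrative assumptions.
VARIANTS = str.maketrans({
    "\u0461": "о",  # ѡ (omega)
    "\u047b": "о",  # ѻ (round omega)
    "\u0456": "и",  # і (decimal i)
    "\u0455": "з",  # ѕ (dze)
    "\ua64b": "у",  # ꙋ (monograph uk)
})

def normalize(form: str) -> str:
    """Reduce a transcribed word form to a simplified search key."""
    # Decompose so that diacritics become separate combining marks.
    decomposed = unicodedata.normalize("NFD", form)
    # Drop all nonspacing marks (category Mn): titlo U+0483, accents, etc.
    # Caveat: this also removes superscript letters encoded as combining
    # characters and turns "й" into "и".
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Lowercase, then map letter variants to their basic form.
    return stripped.lower().translate(VARIANTS)

def search(mask: str, forms: list[str]) -> list[str]:
    """Return the original forms whose normalized key fully matches the mask."""
    pattern = re.compile(mask)
    return [f for f in forms if pattern.fullmatch(normalize(f))]
```

With this sketch, a query written in contemporary Cyrillic such as `search("бо?гъ", forms)` would retrieve both an abbreviated spelling under a titlo and a fully written, accented spelling, while the returned items keep their original transcription.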
Subject: Digital humanities