Ilia Afanasev
Computer-assisted Study of Historical Lemkian (Transcarpathian) Lects: Basic Vocabulary Approach
Компютърно подпомагано изследване на исторически лемкийски (закарпатски) диалекти: подход към основния речник
Summary/Abstract
This research presents the first step in digitising texts in historical Lemkian (Transcarpathian) dialects, recorded in the 1930s, and transforming them into an open-access dataset. The dataset includes morphological tagging, lemmatisation, and annotation of named entities and basic vocabulary items. This makes it possible to evaluate pre-existing models for automatic tagging of basic vocabulary in Slavic on new material quantitatively (measuring their efficiency), qualitatively (going example by example), and formally (analysing the research design of previous studies). The pilot study shows that existing models cannot reliably detect items of the Automatic Similarity Judgement Program (ASJP) basic vocabulary list in the Lemkian texts (F1-score below 0.5); they find only words whose forms coincide completely with their cognates in other Slavic languages (personal pronouns). A bar chart-based visualisation shows that the previously hypothesised formalisation of basic vocabulary items as similar in distribution to named entities is incorrect, and that a new formalisation is required. The main contribution of the work is an open-access dataset of historical Lemkian dialects.
Subject: e-Scripta