The article presents the basic principles of designing a diachronic linguistic corpus of documents of the Don Cossack Host offices held in the State Archive of the Volgograd region, Russia. These principles cover collecting documents for the text corpus, setting up the technical base for automatic processing and text editing, and planning automated tagging, morphological annotation, and corpus software tools. The authors explain some technical aspects of corpus processing and of the composition of the text corpus. They consider it reasonable to include any document in the corpus, including drafts with crossed-out fragments, since this ensures an accurate record of the grammar and vocabulary of the language at a given historical period. A set of language marker types has been developed for automated meta-tagging. The corpus software tools are specified so that obsolete graphic symbols can be annotated accurately and processed together with regular language units and expressions during morphological and genre meta-tagging; where a text is only partially adapted, the authentic old graphic symbols may have to be preserved.
Subject: Digital humanities
Keywords: diachronic linguistic corpus; administrative documents; Don Cossack Host; meta-tagging; morphological tags
Copyright © 2024. All rights reserved.