This paper addresses software engineering tasks that arise when developing information systems for natural language processing, focusing on one such task: the generation of text corpora. It analyzes the basic CorDeGen method, a corpus generation method developed specifically for this purpose, and shows that its scope is limited because it fills the generated texts with “artificial” terms.

The paper proposes a modified method, DBCorDeGen, which removes this limitation by taking an additional dictionary of terms as input. DBCorDeGen preserves most of the features of the basic method that matter for software engineering tasks: determinism, speed of operation (including the possibility of combining it with a parallel modification), and the ability to describe the structure and properties of the generated corpus a priori. Its only disadvantage compared with the basic method is a larger number of input parameters. That number is still relatively small compared with other corpus generation methods presented in the literature, and the additional input significantly widens the scope of application of the generated corpora.

As an experimental test of DBCorDeGen, the paper considers sentiment analysis of the texts of a generated corpus. With the basic CorDeGen method, it is impossible to obtain any sentiment analysis result other than neutral polarity for any text. With the proposed method, different dictionaries yield different results. This confirms that DBCorDeGen has a wider scope than the basic method.