Abstract

Much attention is currently paid to processing textual information in order to form thematic groups and systematize documents. This interest is driven by the growing popularity of the Internet as a means of communication, and it creates a need to categorize short technical texts. For such texts the traditional stages of analysis are difficult: preprocessing and digitization of documents, and identification of "classifying" features. At each stage the specifics of the study are determined by the characteristics of the texts: small size, similar vocabulary, a large number of highly specialized symbols and signs, and synonymy of terms. A procedure is suggested for preparing texts for analysis and for reducing the dimension of the "term-document" matrix by singular value decomposition, which solves the problem of low-rank approximation of the original matrix. The classification methods used are the k-nearest neighbors method and discriminant analysis based on Fisher's classification functions (texts describing the purpose of instruments were taken as an example). The Fisher classification procedure uses discriminant variables and maximizes the differences between classes to obtain a classification function for each class. An object is assigned to the class for which the value of the classification function is greatest. The results are assessed, and the classification accuracy obtained with the TF-IDF measure under the experimental conditions is shown to be inadequate. To improve classification quality, a combined method is proposed: at the first stage, words are selected using the TF-IDF measure; at the second stage, a dictionary of terms and phrases is used to classify the texts. Based on the data obtained, classification is to be carried out by discriminant analysis and the k-nearest neighbors method. The proposed combined method is to be refined further in future work.
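
As an illustration of two of the steps named in the abstract, the following minimal Python sketch weights tokenized texts with a TF-IDF measure and classifies a new text with the k-nearest neighbors method. Everything in it is an assumption made for illustration and is not taken from the study: the toy instrument descriptions, the function names (`fit_idf`, `tf_idf`, `cosine`, `knn_classify`), the particular TF-IDF variant (relative term frequency times log(N/df)), and the use of cosine similarity as the proximity measure. The singular value decomposition step and the discriminant analysis are omitted for brevity.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute log(N / df) for every term in the tokenized training documents."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(doc, idf):
    """Return a sparse {term: weight} vector; terms unseen in training are skipped."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query, train_vectors, train_labels, k=3):
    """Assign the majority label among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical short technical texts (not data from the study).
train_docs = [
    "thermometer measures temperature of liquid".split(),
    "thermocouple sensor measures temperature of gas".split(),
    "manometer measures pressure of gas".split(),
    "barometer measures atmospheric pressure".split(),
]
train_labels = ["temperature", "temperature", "pressure", "pressure"]

idf = fit_idf(train_docs)
train_vectors = [tf_idf(doc, idf) for doc in train_docs]

query = tf_idf("sensor measures temperature of liquid".split(), idf)
print(knn_classify(query, train_vectors, train_labels, k=3))
```

In this toy corpus the word "measures" occurs in every text, so its IDF weight is zero and it does not influence the neighbor ranking. This mirrors, on a very small scale, the difficulty the abstract attributes to short texts with similar vocabulary.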

