The task of developing effective text information classification systems requires the thoughtful analysis and synthesis of variable components of technology. These components strongly affect the practical efficiency and the requirements to the data. For this purpose, a typical technology was discussed, comparing the regular “learning from features” approach versus the more advanced “deep learning” approach, that studies from data. In order to implement the technology, the first approach was tested, which included the means (methods, algorithms) for analysis of the features of the source text, by applying the dimensionality transformation, and building model solutions that allow the correct classification of data by a set of features. As a result, all the steps of the technology are described, which allowed to determine the way of presenting data in terms of hidden features in data, their presentation in a standard visual form and evaluate the solution, as well as its practical efficiency, based on this set of features. In a depth study, the informational core of the document was studied, using the regression and T-stochastic grouping of features for dimensionality reduction.The separate results contain estimation of practical efficiency of the algorithms in terms of time and relative performance for each step of the proposed technology. This estimation gives a possibility to obtain the best algorithm of intelligent data processing that is useful for a given dataset and application. In order to estimate the best suited algorithm for separation in reduced dimension an experiment was carried out which allowed the selection of the best range of data classification algorithms, in particular boosting methods. As a result of the analysis of the technology, the necessary steps of this technology were discussed and the classification on real text data was conducted, which allowed to identify the most important stages of the technology for text classification.
Read full abstract