Abstract

This paper proposes a new approach to verifying and adjusting the classes of multi-label text data. The approach supports semi-automated revision of class assignments to improve data quality. Data quality significantly influences the accuracy of the resulting models, for example in classification tasks, and the approach can also be useful for other data analysis tasks. The proposed approach combines a text similarity measure with two methods: latent semantic analysis and the self-organizing map. First, the text data are pre-processed with a selection of filters that remove unnecessary and irrelevant information. Latent semantic analysis is used to reduce the dimensionality of the vectors that correspond to each text in the analysed data. Cosine similarity is used to determine which classes of a multi-label text should be changed or adjusted. The self-organizing map is the key method for detecting similarity between texts and deciding on a new class assignment. The experimental investigation was performed on newly collected multi-label text data: financial news in the Lithuanian language, collected from four public websites and manually classified by experts into ten classes. Various parameters of the methods were analysed, and their influence on the final results was estimated. The final results were validated by experts. The research showed that the proposed approach can help to verify and adjust multi-label text data classes: 82% of the assignments are correct when the data dimensionality is reduced to 40 using latent semantic analysis and the self-organizing map size is reduced from 40 to 5 in steps of 5.
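To make the role of cosine similarity more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each text is already represented by a reduced vector (for example, after latent semantic analysis) and that an assigned class is compared against the centroid of that class. The centroid comparison, the threshold value, and the function names `centroid` and `flag_suspect_labels` are illustrative assumptions; the abstract does not specify these details.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def flag_suspect_labels(doc_vector, doc_labels, class_centroids, threshold=0.5):
    """Return the assigned labels whose class centroid is dissimilar to the text.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return [
        label
        for label in doc_labels
        if cosine_similarity(doc_vector, class_centroids[label]) < threshold
    ]


# Toy two-dimensional vectors standing in for reduced text vectors.
class_centroids = {
    "banks": centroid([[1.0, 0.0], [0.8, 0.2]]),
    "energy": centroid([[0.0, 1.0], [0.1, 0.9]]),
}
suspects = flag_suspect_labels([1.0, 0.1], ["banks", "energy"], class_centroids)
```

In this toy case the text lies close to the "banks" centroid and far from the "energy" centroid, so only the "energy" label would be put forward for expert review.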

Highlights

  • Nowadays, the amount of information is growing at a very high rate, and systems store it in various formats

  • The analysis shows that few studies have used Lithuanian-language text data, especially multi-label text data

  • We used the following pre-processing filters: numbers were removed, tokens were converted to lower case, the Lithuanian language snowball stemming algorithm (Jocas, 2020) was applied, punctuation was erased, tokens shorter than three characters were removed, and a Lithuanian language stop word list was applied
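The pre-processing filters listed above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the order of the filters is assumed, `STOP_WORDS` is a tiny hypothetical sample rather than the full Lithuanian list, and `identity_stem` is a placeholder that returns tokens unchanged where the Lithuanian snowball stemmer would be plugged in.

```python
import re

# Hypothetical sample only; the study used a full Lithuanian stop word list.
STOP_WORDS = {"ir", "kad", "bet", "tai"}


def identity_stem(token):
    """Placeholder for the Lithuanian snowball stemmer (not bundled here)."""
    return token


def preprocess(text, stop_words=STOP_WORDS, stem=identity_stem, min_len=3):
    """Apply the listed filters to one text and return its tokens."""
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.lower()               # convert to lower case
    # Erase punctuation: keep only letters (including Lithuanian ones) and spaces.
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    tokens = text.split()
    tokens = [t for t in tokens if len(t) >= min_len]     # drop short tokens
    tokens = [t for t in tokens if t not in stop_words]   # drop stop words
    return [stem(t) for t in tokens]                      # stem what remains


tokens = preprocess("Bankas 2020 m. paskelbė pelną, ir tai gerai!")
```

Passing a real stemmer as the `stem` argument would replace the placeholder without changing the rest of the function.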


Summary

Introduction

The amount of information is growing at a very high rate, and systems store it in various formats. Kurasova such as preparing, processing, and analysing such types of data. One of the unstructured data types is text. Text analysis is used in many different tasks, most commonly text data classification and clustering, semantic analysis, and context analysis. Sentiment analysis is a branch of classification in which text data are classified according to sentiment, usually positive, negative, or neutral. It is often applied to social network analysis and to the analysis of movie reviews and comments. Context analysis is the highest level of text data analysis: the text data are categorized or classified, and an attempt is also made to take the meaning of the text into account (Hernández-Alvarez and Gomez, 2016).
