Abstract

This research analyzes the qualitative characteristics of messages in the Telegram messenger, which serve as raw data for further analysis of textual content. The parameters of these messages, such as their format, size, presence of noise, and speed, are reviewed in detail. The main goal of the article is to model the optimal approach to storing a large amount of data before the text analysis stage. A detailed review of the literature devoted to this topic was carried out. The article examines the main advantages and disadvantages of existing data preprocessing algorithms, as well as problems related to data cleanliness and their impact on potential research results. In the software experiments, the impact of data preprocessing on the size of the stored data and on the speed of input data generation was evaluated. Among the proposed methods, two are highlighted: saving cleaned tokens in string format, and saving word codes in string format together with a word-code dictionary. The aim is to distribute the tasks of the text analysis system effectively over the course of the day.
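The two highlighted storage methods can be illustrated with a minimal Python sketch. The abstract does not specify the cleaning rules, the way codes are assigned, or the storage layout, so everything here is an assumption made for illustration only: the function names (`clean_tokens`, `save_as_token_string`, `save_as_code_string`, `decode`), the cleaning rule (lowercasing, dropping URLs, keeping letter and digit runs), and the choice of consecutive integer codes.

```python
import re

# Assumed cleaning rule (not from the article): drop URLs, keep letter/digit runs.
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[^\W_]+")


def clean_tokens(message):
    """Lowercase a raw message, remove URLs, and return the remaining tokens."""
    text = URL_RE.sub(" ", message.lower())
    return TOKEN_RE.findall(text)


def save_as_token_string(messages):
    """Method 1 (sketch): one space-joined string of cleaned tokens per message."""
    return [" ".join(clean_tokens(m)) for m in messages]


def save_as_code_string(messages):
    """Method 2 (sketch): one space-joined string of integer codes per message,
    returned together with the word-code dictionary needed to decode it."""
    vocab = {}
    rows = []
    for m in messages:
        codes = []
        for tok in clean_tokens(m):
            # Assign the next free integer the first time a word is seen.
            code = vocab.setdefault(tok, len(vocab))
            codes.append(str(code))
        rows.append(" ".join(codes))
    return rows, vocab


def decode(row, vocab):
    """Restore the token list of one stored row using the word-code dictionary."""
    inverse = {code: word for word, code in vocab.items()}
    return [inverse[int(c)] for c in row.split()]


messages = [
    "Breaking news: https://t.me/x Markets fall!!!",
    "Markets fall again, news at 9",
]
token_rows = save_as_token_string(messages)
code_rows, vocab = save_as_code_string(messages)
print(token_rows[0])  # breaking news markets fall
print(code_rows[1])   # 2 3 4 1 5 6
```

Which representation is smaller or faster to load depends on the vocabulary size and message lengths of the actual corpus. The sketch only shows that both forms carry the same cleaned tokens and that the code form needs its dictionary stored alongside it.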
