Effective systems for filtering media texts are needed for the development of artificial intelligence systems such as large language models, which should be trained on "correct" text samples that show no signs of disinformation, infodemic or unreliability. The article presents the results of automatically detecting high-quality media texts, as well as text samples with infodemic features, using a natural language model trained on a manually labeled corpus. Experts annotated the corpus manually on the basis of a parameterization of the text content. The goal of our work is to build a model of the language of media messages, assess its quality, and identify detection errors caused by the linguistic characteristics of the texts. Such a model is a precondition for increasing the efficiency and quality of artificial intelligence systems. Trial use of the trained natural language model shows that media texts can be filtered with fairly high accuracy. The support vector machine method proved the most effective. The share of incorrectly recognized informative texts, that is, texts meeting the criteria of reliability and novelty, is low at 6.2 percent. The share of incorrectly recognized uninformative texts is approximately 3.9 percent, which indicates that the developed model is fairly efficient. Errors in the detection of informative texts are associated with proper names (anthroponyms, toponyms) and numerals in the headlines. Misclassified texts that contain signs of fake news and misinformation typically include statements with verbs of speech, which are also common in informative texts.
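The two error figures above are per-class shares: the number of misclassified texts of a given true class divided by the total number of texts of that class. A minimal sketch of that arithmetic follows. The function name `per_class_error_share` and the toy labels are assumptions made for illustration only; they are not the study's code or data.

```python
from collections import Counter


def per_class_error_share(y_true, y_pred):
    """Return, for each true class, the share of its texts given another label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    totals = Counter(y_true)
    # Count errors by the TRUE class of each misclassified text.
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {label: errors[label] / totals[label] for label in totals}


# Toy labels, invented for illustration only (not data from the study).
y_true = ["informative"] * 5 + ["uninformative"] * 4
y_pred = (
    ["informative"] * 4 + ["uninformative"]    # one informative text missed
    + ["uninformative"] * 3 + ["informative"]  # one uninformative text missed
)

shares = per_class_error_share(y_true, y_pred)
for label, share in shares.items():
    print(f"{label}: {share:.1%} misclassified")
```

Reporting the share separately for each class, rather than a single overall error rate, keeps the two kinds of mistake apart: a good text wrongly filtered out and an unreliable text wrongly let through.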