Abstract
Background. In the twenty-first century, the information space has become a full-fledged battlefield. In the Ukrainian information space, text toxicity and hate speech are a growing problem, and researchers are increasingly interested in markers of negative textual tone, especially in media texts. The article describes the structure and results of one module of “TextAttributor 1.0”, an automatic system for the statistical parameterization of Ukrainian-language texts: the module that determines the text toxicity index. The task is approached with two methods: dictionaries and rules (calculation of statistical parameters) and machine learning. The results are based on a corpus of online media texts of political discourse comprising 10 million word occurrences. For this purpose, a lexicographic database of three dictionaries (Emotiogens, Hate Speech, and Toxic Compounds) was created, and training and control samples of texts were formed to estimate the parameters of the selected machine learning model. The project uses a computationally efficient architecture based on the fastText methodology and tools. The toxicity index is calculated by identifying verbal markers of negative sentiment in the text. It draws on the system-generated linguistic examination of the text, which displays a statistical map of the semantic classes of negative vocabulary according to the classification markers of the lexicographic lists, and on the output of the neural network.
Conclusions. The “TextAttributor 1.0” system is still being tested and its functionality improved, so the article describes an intermediate β-version of the system. Nevertheless, the results obtained in determining toxicity show that the developed methodology for quantifying verbal means by semantic parameters (negative emotionality), using dictionaries and rules together with machine learning, is effective for the tasks set. It makes it possible not only to determine the boundary between toxic and neutral text but also to approach the problem through the lexical categories inherent in the text. The methodology for developing the module that determines the toxicity of media text in “TextAttributor 1.0” was described and published on the web application page in April 2024; this article presents it for the first time as a research publication.
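The dictionary-and-rules part of such a pipeline can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the lexicon entries are English placeholders rather than entries from the Emotiogens, Hate Speech, or Toxic Compounds dictionaries, the function name `toxicity_profile` is hypothetical, and the index is defined here simply as the share of tokens found in any lexicon, which is not necessarily the formula used in “TextAttributor 1.0”. The fastText classifier is not shown.

```python
import re
from collections import Counter

# Placeholder lexicons (assumption): tiny stand-ins for the three dictionaries
# named in the abstract; the real entries are not reproduced here.
LEXICONS = {
    "emotiogens": {"horror", "disgrace"},
    "hate_speech": {"traitor"},
    "toxic_compounds": {"pseudoexpert"},
}


def toxicity_profile(text):
    """Count lexicon hits per semantic class and derive a simple index.

    The index (assumption) is the number of lexicon hits divided by the
    number of tokens in the text.
    """
    # Lowercase and split into word tokens; \w is Unicode-aware in Python 3.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for token in tokens:
        for class_name, words in LEXICONS.items():
            if token in words:
                counts[class_name] += 1
    total = len(tokens)
    index = sum(counts.values()) / total if total else 0.0
    return {"tokens": total, "by_class": dict(counts), "index": index}


if __name__ == "__main__":
    sample = "The traitor brought disgrace and horror to the city"
    print(toxicity_profile(sample))
```

The per-class counts play the role of the "statistical map" of negative vocabulary described above. In a fuller system this score would be reported alongside, or combined with, the classifier's output.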