Leading software providers typically operate customer technical support functions, which are crucial for promoting their products and services and keeping them competitive in global markets. The high volume and heterogeneity of support tickets (functional, temporal, linguistic, etc.) make efficient classification systems essential. Effective classification optimizes the distribution of tickets among support center specialists and allows their processing to be automated using an established knowledge base. However, classifying these tickets is a loosely formalized task. For companies that have accumulated substantial data on customer requests, automating classification with machine learning methods and natural language processing models, such as Word2Vec, FastText, BERT, and GPT, becomes feasible. It is generally accepted that classification effectiveness depends primarily on the model employed. Nevertheless, the quality of these models is significantly influenced by the nature of the training data. A review of the literature reveals significant research interest in methods for the automatic classification of tickets tailored to the operating conditions of software provider support centers. However, there is a noticeable gap regarding the impact of data preprocessing on the quality of these models. This article aims to clarify the techniques of data preprocessing and to analyze their impact on the effectiveness of text classification, taking into account the specifics of software provider support centers. The study examines the stages of the automatic ticket classification process, accounting for the unique characteristics of the data (customer text requests). A corresponding set of methodological and instrumental tools was developed and tested on open data from a global software provider (DevExpress). The testing involved a database of 165,000 tickets.
The study's results indicate that preprocessing can improve classification metrics such as F-measure, Precision, and Recall from 77% to 79%. Additionally, preprocessing significantly reduces the dimensionality of the text data (by 48.2%) and increases model training speed (by 26.5%) without loss of accuracy, making the use of computational resources more cost-efficient.