Abstract

Enterprises must handle a continually growing volume of text data generated internally by employees and externally by current or potential customers. Accordingly, managerial attention is shifting to the efficient use of this data to address related business challenges. However, automatically extracting meaning from unstructured text is usually hard. The research and practitioner communities have not reached consensus on how to design text classification tasks, specifically on the choice of text representation techniques and classification algorithms. One essential point in this discussion is building solutions that are both accurate and understandable to humans. Being able to evaluate the classification decision is a critical success factor for text classification in an enterprise setting, be it for legal documents, medical records, or IT tickets. Hence, our study investigates the core design elements of a typical text classification pipeline and their contribution to the overall performance of the system. In particular, we consider text representation techniques and classification algorithms in the context of their explainability, drawing on insights from our IT ticket complexity prediction case study. We compare the performance of a highly explainable text representation technique, based on linguistic features tailored to the case study, with a common TF-IDF approach. We apply interpretable machine learning algorithms such as kNN and its enhanced versions, decision trees, naïve Bayes, and logistic regression, as well as semi-supervised techniques, to predict a ticket class label of low, medium, or high complexity. As our study shows, simple, explainable algorithms, such as decision trees and naïve Bayes, perform remarkably well when applied with our linguistic features-based text representation.
Furthermore, we note that text classification is inherently related to Granular Computing.

Keywords: Text classification; Explainability; Linguistics; Machine learning; TF-IDF; IT tickets
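
To make the TF-IDF baseline mentioned in the abstract concrete, the following is a minimal sketch of TF-IDF weighting combined with a cosine-similarity kNN vote. It is not the implementation used in the study: the function names (`tfidf_vectors`, `cosine`, `knn_predict`), the smoothed IDF formula, and the toy tickets with their complexity labels are all assumptions made purely for illustration.

```python
import math
from collections import Counter

# Hypothetical toy tickets with assumed complexity labels (not data from the study).
TICKETS = [
    ("reset password for user account", "low"),
    ("unlock user account", "low"),
    ("install printer driver on laptop", "medium"),
    ("configure vpn access for remote team", "medium"),
    ("migrate production database cluster to new datacenter", "high"),
    ("redesign network firewall rules across all sites", "high"),
]

def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenized document, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(n / df) + 1, so terms present in every document keep some weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, train_vectors, train_labels, idf, k=3):
    """Label a query string by majority vote among its k most similar training tickets."""
    tf = Counter(t for t in query.split() if t in idf)  # ignore unseen terms
    total = sum(tf.values())
    q = {t: (c / total) * idf[t] for t, c in tf.items()} if total else {}
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(q, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

VECTORS, IDF = tfidf_vectors([text.split() for text, _ in TICKETS])
LABELS = [label for _, label in TICKETS]

print(knn_predict("reset user password", VECTORS, LABELS, IDF, k=1))
```

Part of the explainability appeal of kNN in this setting is that each prediction can be traced back to the specific neighbouring tickets that cast the votes; the linguistic-feature representation studied in the paper would replace the TF-IDF dictionaries with a small set of named, human-readable features.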
