Short text classification applied to item description: Some methods evaluation

Gilsiley Henrique Darú,Antonio Castelo,Felipe Daltrozo da Motta Motta,Gustavo Valentim Loch

doi:10.5433/1679-0375.2022v43n2p189

Abstract

The increasing demand for information classification based on content in the age of social media and e-commerce has led to the need for automated product classification using their descriptions. This study aims to evaluate various techniques for this task, with a focus on descriptions written in Portuguese. A pipeline is implemented to preprocess the data, including lowercasing, accent removal, and unigram tokenization. The bag of words method is then used to convert text into numerical data, and five classification techniques are applied: argmaxtf, argmaxtfnorm, argmaxtfidf from information retrieval, and two machine learning methods logistic regression and support vector machines. The performance of each technique is evaluated using simple accuracy via thirty-fold cross validation. The results show that logistic regression achieves the highest mean accuracy among the evaluated techniques.

Full Text