An Auto-ML Approach Applied to Text Classification

Douglas Nunes De Oliveira, Luiz Henrique De Campos Merschmann

doi:10.1145/3539637.3557054

Abstract

Automated Machine Learning (AutoML) is a research area that aims to help humans solve Machine Learning (ML) problems by automatically discovering good model pipelines (algorithms and their hyperparameters for every stage of a machine learning process) for a given dataset. Because this is a combinatorial optimization problem in which evaluating every possible pipeline is infeasible, most AutoML systems use an Evolutionary Algorithm (EA) or Bayesian Optimization (BO) to find a good solution. These systems usually evaluate pipeline performance with k-fold cross-validation, so the chance of selecting an overfitted solution increases with the number of pipelines evaluated. To address this issue, we propose an AutoML system, named Auto-ML System for Text Classification (ASTeC), that uses Bootstrap Bias Corrected CV (BBC-CV) to evaluate pipeline performance. More specifically, the proposed system combines EA, BO, and BBC-CV to find a good model pipeline for the text classification task. We evaluate our proposal by comparing it against two state-of-the-art systems, the Tree-based Pipeline Optimization Tool (TPOT) and the Google Cloud AutoML service, on seven public datasets of written Brazilian Portuguese texts from the sentiment analysis domain. Statistical tests show that our system is equivalent to or better than both of them on all evaluated datasets.
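As an illustration of why the cross-validated score of the best of many pipelines is optimistic, and of the general idea behind a bootstrap bias correction, the following is a minimal sketch. It is not the ASTeC implementation. The function names (`bbc_cv_estimate`, `demo`), the 0/1 correctness matrix, and the synthetic data are all assumptions made for this example. The sketch assumes the out-of-fold predictions of every candidate pipeline have already been pooled into one matrix. Each bootstrap round selects the best pipeline on the resampled rows and scores it on the rows left out.

```python
import numpy as np


def bbc_cv_estimate(correct, n_bootstraps=200, seed=0):
    """Bootstrap bias-corrected accuracy estimate (illustrative sketch).

    correct: array of shape (n_samples, n_configs); entry [i, j] is 1.0 if
    configuration j classified sample i correctly when i was held out
    during cross-validation, else 0.0.
    """
    rng = np.random.default_rng(seed)
    n_samples = correct.shape[0]
    oob_scores = []
    for _ in range(n_bootstraps):
        # Resample rows with replacement (the in-bag set).
        in_bag = rng.integers(0, n_samples, size=n_samples)
        # Rows never drawn form the out-of-bag set.
        out_mask = np.ones(n_samples, dtype=bool)
        out_mask[in_bag] = False
        if not out_mask.any():
            continue  # extremely unlikely, but avoids an empty mean
        # Select the configuration that looks best on the in-bag rows...
        best = int(np.argmax(correct[in_bag].mean(axis=0)))
        # ...and score that configuration on the out-of-bag rows only.
        oob_scores.append(correct[out_mask, best].mean())
    return float(np.mean(oob_scores))


def demo():
    # Hypothetical data: 500 configurations that all guess at random
    # (true accuracy 0.5) on 200 samples.
    rng = np.random.default_rng(42)
    correct = (rng.random((200, 500)) < 0.5).astype(float)
    naive = float(correct.mean(axis=0).max())  # best raw CV accuracy
    corrected = bbc_cv_estimate(correct)
    return naive, corrected


if __name__ == "__main__":
    naive, corrected = demo()
    print(f"naive best-of-many accuracy: {naive:.3f}")
    print(f"bias-corrected estimate:     {corrected:.3f}")
```

In this synthetic setting every configuration is equally uninformative, so the gap between the two printed numbers comes entirely from selecting the maximum over many noisy scores.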

Full Text