Abstract

Progress in natural language processing has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by deep learning Transformers. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. One way to ameliorate this problem is through data engineering, rather than from the algorithmic or hardware perspectives. Our focus here is an under-investigated data engineering technique with enormous potential in the current scenario: Instance Selection (IS) (a.k.a. Selective Sampling, Prototype Selection). The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. We survey classical and recent state-of-the-art IS techniques and provide a scientifically sound comparison of IS methods applied to an essential natural language processing task, Automatic Text Classification (ATC). IS methods have normally been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art ATC solutions and many datasets. We answer several research questions based on the tradeoffs induced by a tripod of criteria (training set reduction, effectiveness, and efficiency). Our answers reveal an enormous unfulfilled potential for IS solutions. Specifically, we show that in 12 out of 19 datasets, specific IS methods (namely, Condensed Nearest Neighbor, Local Set-based Smoother, and Local Set Border Selector) can reduce the size of the training set without effectiveness losses. Furthermore, when fine-tuning Transformer methods, the IS methods reduce the amount of data needed without losing effectiveness and with considerable training-time gains.
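To make the idea of instance selection concrete, the following is a minimal sketch of the classical Condensed Nearest Neighbor procedure, one of the IS methods named above. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function name `condensed_nearest_neighbor`, the Euclidean distance, and the toy 2-D points are all assumptions made here. In a text classification setting the points would instead be document vectors, such as TF-IDF rows or embeddings.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def condensed_nearest_neighbor(points, labels):
    """Return sorted indices of a subset that a 1-NN classifier can use
    to label every training point correctly (illustrative sketch)."""
    if not points:
        return []
    store = [0]  # seed the retained set with the first instance
    changed = True
    while changed:  # repeat full passes until no instance is added
        changed = False
        for i, (x, y) in enumerate(zip(points, labels)):
            if i in store:
                continue
            # Classify x with 1-NN using only the retained instances.
            nearest = min(store, key=lambda j: dist(points[j], x))
            if labels[nearest] != y:
                store.append(i)  # misclassified, so it must be kept
                changed = True
    return sorted(store)


# Toy data (hypothetical): two well-separated clusters, three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

kept = condensed_nearest_neighbor(points, labels)
print(kept)  # [0, 3]
```

On this toy set one instance per cluster is enough, so the training set shrinks from six instances to two while a 1-NN classifier built on the retained subset still labels every original point correctly. Real datasets with overlapping classes would retain far more instances, mostly near class borders.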
