Abstract

Text classification is an essential task in natural language processing. Preprocessing, which determines how text features are represented, is one of the key steps in a text classification architecture. This paper proposes a novel, efficient and effective preprocessing algorithm with three methods for text classification, paired with the Orthogonal Matching Pursuit algorithm, which performs the classification. The main idea of the preprocessing strategy is to combine stopword removal and/or regular filtering with tokenization and lowercase conversion, which effectively reduces the feature dimension and improves the quality of the text feature matrix. Simulation tests on the 20 Newsgroups dataset show that, compared with the existing state-of-the-art method, the new method reduces the number of features by 19.85%, 34.35%, 26.25% and 38.67%, improves accuracy by 7.36%, 8.8%, 5.71% and 7.73%, and increases the speed of text classification by 17.38%, 25.64%, 23.76% and 33.38% on the four data sets, respectively.
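The preprocessing strategy described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the tokenizer, the stopword list, or the regular-expression rule, so the whitespace tokenizer, the tiny STOPWORDS set, the TOKEN_FILTER pattern, and the function names preprocess and vocabulary are all assumptions made for this example. The Orthogonal Matching Pursuit classification stage is not shown.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption); a real run
# would use a much fuller list.
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "of", "to", "in", "on", "it"})

# Assumed "regular filtering" rule: keep only purely alphabetic tokens
# with at least two letters. The paper's actual rule is not stated.
TOKEN_FILTER = re.compile(r"[a-z]{2,}")

# Punctuation stripped from the edges of each token (assumption).
EDGE_PUNCTUATION = ".,;:!?\"'()[]{}<>"


def preprocess(text, remove_stopwords=True, regex_filter=True):
    """Lowercase and tokenize text, then optionally filter the tokens."""
    # Lowercase conversion followed by whitespace tokenization.
    tokens = [t.strip(EDGE_PUNCTUATION) for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    if regex_filter:
        # Drop tokens that do not fully match the assumed pattern.
        tokens = [t for t in tokens if TOKEN_FILTER.fullmatch(t)]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def vocabulary(docs, **options):
    """Return the sorted set of distinct tokens, i.e. the feature set."""
    return sorted({tok for doc in docs for tok in preprocess(doc, **options)})


if __name__ == "__main__":
    docs = [
        "The GPU is faster than the CPU in 2024!",
        "It costs $300, and the seller ships to Canada.",
    ]
    baseline = vocabulary(docs, remove_stopwords=False, regex_filter=False)
    filtered = vocabulary(docs)
    print(len(baseline), "features without filtering")
    print(len(filtered), "features with stopword removal and regex filtering")
```

Toggling remove_stopwords and regex_filter independently gives the "and/or" combinations the abstract mentions; each distinct token that survives becomes one column of the feature matrix, so a smaller vocabulary means a lower feature dimension.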
