The impact of NLP techniques in the multilabel text classification problem

Teresa Gonçalves, Paulo Quaresma

doi:10.1007/978-3-540-39985-8_46

Abstract

Support Vector Machines have been used successfully to classify text documents into sets of concepts. Typically, however, linguistic information is either not used in the classification process or its use has not been fully evaluated. We apply and evaluate two basic linguistic procedures (stop-word removal and stemming/lemmatization) on the multilabel text classification problem. These procedures are applied to the Reuters dataset and to Portuguese juridical documents from the Supreme Courts and the Attorney General’s Office.

Keywords: Support Vector Machine, Feature Selection, Class Imbalance, Linguistic Information, Supreme Court
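
As a rough illustration of the two linguistic procedures named in the abstract, the following Python sketch removes stop words and then reduces the remaining tokens to crude stems. It is not the authors' pipeline: the stop-word list, the suffix list, and the function names (`tokenize`, `remove_stop_words`, `naive_stem`, `preprocess`) are hypothetical, and the suffix stripping is only a stand-in for a real stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration);
# a real system would use a fuller, language-specific list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "by"}

# Naive suffix stripping, standing in for a real stemmer or lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Apply stop-word removal followed by naive stemming."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The courts are classifying the documents"))
```

Under these assumptions, the example sentence is reduced to the tokens `court`, `classify`, and `document`, which would then serve as features for a classifier.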

Full Text