Abstract

The main aim of this paper was to study the influence of training data quality on the text document classification performance of machine learning methods. A graded relevance corpus of ten classes and 957 text documents was classified with Self-Organising Maps (SOMs), learning vector quantisation, k-nearest neighbours searching, naïve Bayes and support vector machines. The relevance level of a document (irrelevant, marginally, fairly or highly relevant) was used as a measure of the quality of the document as a training example, which is a new approach. The classifiers were evaluated with micro- and macro-averaged classification accuracies. The results suggest that training data of higher quality should be preferred, but even low-quality data can improve a classifier if there is plenty of it. In addition, further means to facilitate classification with SOMs were explored. The novel set-of-SOMs approach performed clearly better than the original SOM and comparably to the supervised classification methods.
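The distinction between the two evaluation measures mentioned above matters for an imbalanced corpus: micro-averaged accuracy weights every document equally, while macro-averaged accuracy weights every class equally. A minimal sketch, using invented per-class counts (not figures from the paper), shows how the two can diverge:

```python
# Hypothetical per-class results: (correctly classified, total documents).
# The class sizes are invented purely for illustration.
per_class = {
    "class_a": (90, 100),
    "class_b": (8, 10),
    "class_c": (3, 10),
}

# Micro-average: pool all documents, then divide correct by total.
micro = (sum(c for c, _ in per_class.values())
         / sum(t for _, t in per_class.values()))

# Macro-average: compute accuracy per class, then take the unweighted mean.
macro = sum(c / t for c, t in per_class.values()) / len(per_class)

print(f"micro-averaged accuracy: {micro:.3f}")  # dominated by the large class
print(f"macro-averaged accuracy: {macro:.3f}")  # every class counts equally
```

Here the micro-average stays high because the large class is classified well, while the macro-average is pulled down by the poorly handled small class, which is why reporting both gives a fuller picture.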
