A subjectivity classification framework for sports articles using improved cortical algorithms

Nadine Hajj, Yara Rizk, Mariette Awad

doi:10.1007/s00521-018-3549-3

Abstract

Articles are published daily on the Internet in enormous numbers by a diverse array of authors, and they often offer misleading or unwanted information, making activities such as sports betting riskier. As a result, extracting meaningful and reliable information from these sources becomes a time-consuming and nearly impossible task. In this context, labeling articles as objective or subjective is not a simple natural language processing task because subjectivity can take several forms. With the rise of online sports betting driven by the revolution in Internet and mobile technology, an automated system capable of sifting through all these data and finding relevant sources in a reasonable amount of time is a desirable and marketable product. In this work, we present a framework for the classification of sports articles composed of three stages: the first stage extracts articles from web pages using text extraction libraries, parses the text and then tags words using Stanford’s part-of-speech tagger; the second stage extracts unique syntactic and semantic features and reduces them using our modified cortical algorithm (CA), hereafter CA*; and the third stage classifies these texts as objective or subjective. Our framework was tested on a database of 1000 articles, manually labeled using Amazon’s crowdsourcing tool, Mechanical Turk, and results are reported using CA, CA*, support vector machines and one of their soft computing variants (LMSVM) as classifiers. A testing accuracy of 85.6% was achieved under fourfold cross-validation, with a 40% reduction in features, using CA* trained with an entropy weight update rule and a cross-entropy cost function.
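As an illustration of the evaluation protocol named above, the following minimal Python sketch shows how fourfold cross-validation and a binary cross-entropy cost could be computed. It is not the CA* implementation: the function names, the fixed shuffling seed and the `train_fn` callable (any routine that returns a 0/1 predictor) are assumptions introduced only for this example.

```python
import math
import random


def k_fold_indices(n_samples, k=4, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffle once, fixed seed
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def cross_validate(features, labels, train_fn, k=4):
    """Average test accuracy over k folds.

    train_fn is a hypothetical callable: it receives training features and
    labels and returns a function predict(x) that outputs 0 or 1.
    """
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        predict = train_fn([features[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct = sum(predict(features[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

With 1000 labeled articles and `k=4`, each fold in this sketch holds out 250 articles for testing and trains on the remaining 750; the reported accuracy would then be the mean over the four folds.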

Full Text