A Comprehensive Methodology for Extracting Signal from Social Media Text Using Natural Language Processing and Machine Learning

Wenli Zhang

doi:10.2139/ssrn.2880834

Abstract

There has been increasing interest in using data from social media, search engines, and other web sources for predictive analytics in many different domains. Although using these datasets in different contexts has shown significant promise, mounting evidence suggests that many of the reported results could be misrepresented because the textual data is loosely structured and contains noise caused by anomalous media spikes and the use of misleading terms and phrases. We introduce a novel and efficient framework that combines natural language processing (NLP) and machine learning classification techniques to extract signal from social media text. We tested our methodology on two large real-world social media datasets, where it achieved an overall accuracy of 88% and high per-class precision and recall. The methodology described in this paper can be used for a variety of purposes to improve analyses of social media and web text, with a view to enabling better predictions.

Full Text