Abstract

Selecting keywords from Twitter as features to identify events is challenging due to language informality such as acronyms, misspelled words, synonyms, transliteration and ambiguous terms. In this paper, we compare keyword selection methods and identify the best ones for producing features for classification. Specifically, we study the aspects that affect keywords as features for identifying civil unrest and protests. These aspects include the word count; the word form, such as n-gram, skip-gram and bag-of-words; and the data association method, including correlation techniques and similarity techniques. To test the impact of these factors, we developed a framework that analyzed 641 days of tweets and extracted the words highly associated with event days over the same time frame. We then used the extracted words as features to classify any single day as either an event day or a non-event day in a specific location. In this framework, we used the same pipeline of data cleaning, preprocessing, feature selection, model learning and event classification for all combinations of keyword selection criteria. We used a Naive Bayes classifier to learn the selected features and predict the event days accordingly. The classification was evaluated using multiple metrics: accuracy, precision, recall, F-score and AUC. The study concluded that the best word form is bag-of-words, with an average AUC of 0.72; the best word count is two, with an average AUC of 0.74; and the best feature selection method is Spearman's correlation, with an average AUC of 0.89. It also concluded that the best classifier for event detection is the Naive Bayes classifier.
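The correlation-based keyword selection step can be illustrated with a minimal sketch. It is not the framework used in the study: the function names (`average_ranks`, `spearman`, `select_keywords`), the toy daily counts and the event-day labels below are all hypothetical, and the sketch assumes that each candidate word is scored by the Spearman rank correlation between its daily count and a 0/1 event-day indicator, with the highest-scoring words kept as features.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def select_keywords(daily_counts, event_days, k):
    """Keep the k words whose daily counts correlate most with event days."""
    scores = {word: spearman(counts, event_days)
              for word, counts in daily_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: six days, 1 marks an event day.
event_days = [0, 1, 0, 1, 1, 0]
daily_counts = {
    "protest": [1, 9, 2, 8, 7, 0],
    "march": [0, 4, 1, 5, 3, 1],
    "coffee": [5, 5, 6, 4, 5, 6],
}
top_words = select_keywords(daily_counts, event_days, k=2)
print(top_words)
```

In this toy example the two words whose counts rise on event days are kept and the unrelated word is dropped. The selected words would then serve as the features passed to a classifier such as Naive Bayes.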
