Abstract

Twitter has become a powerful knowledge source for data mining projects because of the volume of data its users generate, which lets researchers find content on almost any topic in real time. The quality of the extracted data, however, depends on the keywords used for filtering: poorly chosen keywords yield a high percentage of irrelevant content. In this paper, we introduce a time-aware, machine-learning-based approach to identify meaningful keywords that maximize the extraction of relevant emergency-related tweets through the Twitter API. We follow the CRISP-DM methodology. In the first stage, problem understanding, we identified the need for meaningful keywords to filter content, improve the quality of the extracted data, and reduce the percentage of irrelevant tweets. In the second stage, data collection, we used the official Twitter API to extract tweets and labeled them as “emergencia” (emergency) and “no emergencia” (non-emergency). We then analyzed the collected data (data understanding) to select preprocessing techniques and prepare the data for modeling. Finally, in the modeling and testing stages, we trained a restricted Boltzmann machine and four autoencoder variants, including an architecture proposed by a genetic algorithm, as keyword identifiers, and compared their performance to select the model to deploy to production (deployment stage). The results show slightly better performance for the autoencoder proposed by the genetic algorithm (GADAE), which achieved an R² score of 0.97, an MAE of 14 × 10⁻³, and an MSE of 4 × 10⁻⁴. GADAE, the best model, extracted 110% more relevant tweets than manual filtering in the context of emergency-related tweets in Ecuador.
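For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how the R² score, MAE, and MSE are conventionally computed from target and predicted values. It uses the standard textbook definitions; the abstract does not state which quantities the models were scored on, so the function names and the toy vectors below are illustrative assumptions rather than the authors' evaluation code.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAE: average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MSE: average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R^2: 1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data (not from the paper): targets and model outputs.
targets = [0.0, 1.0, 0.0, 1.0]
outputs = [0.1, 0.9, 0.0, 1.0]

mae_value = mean_absolute_error(targets, outputs)
mse_value = mean_squared_error(targets, outputs)
r2_value = r2_score(targets, outputs)
```

Lower MAE and MSE and an R² closer to 1 indicate outputs that track the targets more closely, which is the sense in which the reported values favor GADAE.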
