An effective approach for identifying keywords as high-quality filters to get emergency-implicated Twitter Spanish data

Joel Garcia-Arteaga, Jesús Zambrano-Zambrano, Jorge Parraga-Alava, Jorge Rodas-Silva

doi:10.1016/j.csl.2023.101579


https://doi.org/10.1016/j.csl.2023.101579


Abstract

Twitter has become a powerful source of data for data mining projects because of the volume of content its users generate, which lets researchers find material on almost any topic in real time. The usefulness of that material depends on the quality of the keywords used for filtering: with poor keywords, a high percentage of the extracted data is irrelevant. In this paper, we introduce a time-aware, machine-learning-based approach that identifies meaningful keywords to maximize the extraction of relevant emergency-related tweets through the Twitter API. We follow the CRISP-DM methodology. In the first stage, problem understanding, we identified the need for meaningful keywords to filter content, extract higher-quality data, and reduce the percentage of irrelevant tweets. In the second stage, data collection, we used the official Twitter API to extract tweets and labeled them as “emergencia” (emergency) or “no emergencia” (non-emergency). We then analyzed the collected data (data understanding) to choose preprocessing techniques and prepare the data for the model. Finally, in the modeling and testing stages, we trained a restricted Boltzmann machine and four autoencoder variants, including an architecture proposed by a genetic algorithm, as keyword identifiers, and compared them to select the best performer for production (deployment stage). The results show slightly better performance for the autoencoder proposed by the genetic algorithm (GADAE), which achieved an R² score of 0.97, an MAE of 14×10⁻³, and an MSE of 4×10⁻⁴. GADAE managed to extract 110% more relevant tweets than manual filtering in the context of emergency-implicated tweets in Ecuador.
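The three reported metrics (R², MAE, MSE) and the "110% more" comparison can be made concrete with a minimal sketch. The functions below use the standard textbook definitions. The sample values and the tweet counts are invented for illustration and are not taken from the paper, and the function names are hypothetical rather than the authors' code.

```python
# Standard regression/reconstruction metrics, written in plain Python.
# All inputs used below are illustrative assumptions, not the paper's data.

def mae(y_true, y_pred):
    """Mean absolute error: average of |target - prediction|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average of (target - prediction)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def relative_gain_percent(baseline_count, model_count):
    """Percentage increase of model_count over baseline_count."""
    return (model_count - baseline_count) / baseline_count * 100

# Hypothetical target values and model outputs.
y_true = [0.0, 0.5, 1.0]
y_pred = [0.1, 0.5, 0.9]

print("MAE:", mae(y_true, y_pred))
print("MSE:", mse(y_true, y_pred))
print("R2 :", r2_score(y_true, y_pred))

# "110% more" means 2.1 times the baseline, e.g. 210 relevant tweets
# against a hypothetical manual-filtering baseline of 100.
print("Gain (%):", relative_gain_percent(100, 210))
```

Read this way, an R² close to 1 together with small MAE and MSE indicates that a model's outputs track its targets closely, and a 110% gain means slightly more than double the number of relevant tweets obtained by the baseline.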


Journal: Computer Speech & Language
Publication Date: Oct 26, 2023
Citations: 2

Similar Papers

Uncovering the Reasons Behind COVID-19 Vaccine Hesitancy in Serbia: Sentiment-Based Topic Modeling.
Adela Ljajić ... Jelena Mitrović
Journal of medical Internet research | VOL. 24
17 Nov 2022

What Are People Tweeting About Zika? An Exploratory Study Concerning Its Symptoms, Treatment, Transmission, and Prevention.
Michele Miller ... Roopteja Muppalla
JMIR Public Health and Surveillance | VOL. 3
19 Jun 2017

Reorder user's tweets
Keyi Shen ... Xiaokang Yang
ACM Transactions on Intelligent Systems and Technology | VOL. 4
01 Jan 2013

A geographical and content-based approach to prioritize relevant and reliable tweets for emergency management
A Marcela Suarez ... Keith C Clarke
Cartography and Geographic Information Science | VOL. 49
06 Jul 2022
