News Classification for Identifying Traffic Incident Points in a Spanish-Speaking Country: A Real-World Case Study of Class Imbalance Learning

Gilberto Rivera, J Patricia Sánchez-Solís, Vicente García, Rogelio Florencia, Alejandro Ruiz

doi:10.3390/app10186253

Abstract

‘El Diario de Juárez’ is a local newspaper in a city of 1.5 million Spanish-speaking inhabitants; it publishes texts that citizens read on both a website and an RSS (Really Simple Syndication) service. This research applies natural-language-processing and machine-learning algorithms to the news items provided by the RSS service to classify them according to whether or not they report a traffic incident, with the final intention of notifying citizens where such accidents occur. The classification process explores the bag-of-words technique with five learners (Classification and Regression Tree (CART), Naïve Bayes, kNN, Random Forest, and Support Vector Machine (SVM)) on a class-imbalanced benchmark; this challenging issue is dealt with via five sampling algorithms: synthetic minority oversampling technique (SMOTE), borderline SMOTE, adaptive synthetic sampling, random oversampling, and random undersampling. Our final classifier reaches a sensitivity of 0.86 and an area under the precision-recall curve of 0.86, which is an acceptable performance considering the complexity of analyzing unstructured texts in Spanish.
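As a rough illustration of two ingredients named in the abstract, the sketch below builds bag-of-words count vectors and balances the classes by random oversampling. It uses only the Python standard library. The function names (`bag_of_words`, `random_oversample`), the whitespace tokenizer, and the toy headlines are assumptions made for illustration; they are not the authors' implementation or corpus.

```python
import random
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count vectors) from naive lowercase whitespace tokens."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {token: i for i, token in enumerate(vocabulary)}
    vectors = []
    for tokens in tokenized:
        vector = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            vector[index[token]] = count
        vectors.append(vector)
    return vocabulary, vectors


def random_oversample(vectors, labels, seed=0):
    """Duplicate random samples of the smaller classes until every class
    has as many samples as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for vector, label in zip(vectors, labels):
        by_class.setdefault(label, []).append(vector)
    target = max(len(group) for group in by_class.values())
    out_vectors, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for vector in group + extra:
            out_vectors.append(vector)
            out_labels.append(label)
    return out_vectors, out_labels


# Toy, made-up headlines (assumption): 1 = traffic incident, 0 = other news.
texts = [
    "choque en avenida tecnologico",
    "inauguran nueva biblioteca",
    "anuncian feria del libro",
    "suben precios de gasolina",
]
labels = [1, 0, 0, 0]

vocabulary, vectors = bag_of_words(texts)
balanced_vectors, balanced_labels = random_oversample(vectors, labels)
print(Counter(balanced_labels))
```

In practice, resampling of this kind is normally applied to the training split only, so that the evaluation data keeps the original class distribution.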

Highlights

Most people across the world now live in urban areas, and the UN expects this population to increase dramatically in the coming three decades.
This project is split into three blocks to address the problem of informing citizens about traffic accidents detected automatically in the news reports provided by the RSS service of the ‘El Diario de Juárez’ newspaper.
When the newspaper issues a news item through RSS, our application vectorizes it and applies the trained Support Vector Machine (SVM) model to determine whether it reports a traffic accident. For news reports classified as traffic accidents, a data extraction technique based on Spanish grammatical patterns is applied to the text to identify the location of those events.
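The two-stage flow described in the last highlight (classify a news item, then extract locations only from positive items) can be sketched as follows. Everything here is hypothetical: `is_traffic_incident` is a keyword placeholder standing in for the trained SVM, and `STREET_PATTERN` is a single illustrative regular expression, not the Spanish grammatical patterns used by the authors.

```python
import re

# Hypothetical pattern (assumption): a street-type word, matched case-insensitively,
# followed by one or more capitalized words that form the street name.
STREET_PATTERN = re.compile(
    r"\b(?i:avenida|calle|bulevar|carretera)\s+"
    r"[A-ZÁÉÍÓÚÑ]\w*(?:\s+[A-ZÁÉÍÓÚÑ]\w*)*"
)

# Placeholder vocabulary (assumption) used instead of a trained model.
INCIDENT_KEYWORDS = {"choque", "volcadura", "accidente"}


def is_traffic_incident(text):
    """Stand-in for the trained classifier: true if any keyword appears."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return bool(tokens & INCIDENT_KEYWORDS)


def extract_locations(text):
    """Return every street mention matched by the illustrative pattern."""
    return STREET_PATTERN.findall(text)


def process_news_item(text):
    """Classify first; run location extraction only on positive items."""
    if not is_traffic_incident(text):
        return []
    return extract_locations(text)


example = "Un choque se registró en la avenida Tecnológico y calle Vicente Guerrero."
print(process_news_item(example))
```

Running extraction only after a positive classification keeps the pattern matcher from reporting street names that appear in unrelated news.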

Summary

Introduction

Most people across the world live in urban areas, and the UN expects this population to increase dramatically in the coming three decades. The leading causes are (1) the shift in the residence of people from rural communities to urban ones and (2) the growth of the whole population, adding 2.5 billion people to cities [1]. This trend is marked in developing regions. During the last two decades, urban structures have become more digital and information-based, and there has been a decisive change in the living environment of citizens. They must be capable of further advances as ICT innovations emerge. With a fast-increasing population, this city must provide structures that rapidly broadcast knowledge among all of the members of its community. This fact brings to light that Ciudad Juárez should urgently evolve towards a smart city (SC).

Background

A Brief Review of the Related Literature

Our Proposal

Knowledge Discovery

Knowledge Application and Deployment

Results

Performance of the Classifiers on the Imbalanced Corpus

Impact of the Sampling Methods on SVM and Random Forest

Performance of the Location Extraction Module

Conclusions and Future Work


Journal: Applied Sciences
Publication Date: Sep 9, 2020
Citations: 14
License type: CC BY 4.0
