Abstract

‘El Diario de Juárez’ is a local newspaper in a city of 1.5 million Spanish-speaking inhabitants; it publishes news that citizens read on both a website and an RSS (Really Simple Syndication) service. This research applies natural-language-processing and machine-learning algorithms to the news items provided by the RSS service in order to classify them according to whether or not they report a traffic incident, with the ultimate aim of notifying citizens where such accidents occur. The classification process explores the bag-of-words technique with five learners (Classification and Regression Tree (CART), Naïve Bayes, k-nearest neighbors (kNN), Random Forest, and Support Vector Machine (SVM)) on a class-imbalanced benchmark; this imbalance is addressed with five sampling algorithms: synthetic minority oversampling technique (SMOTE), borderline SMOTE, adaptive synthetic sampling, random oversampling, and random undersampling. The final classifier reaches a sensitivity of 0.86 and an area under the precision-recall curve of 0.86, which is an acceptable performance considering the complexity of analyzing unstructured texts in Spanish.
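
To make the terms in the abstract concrete, the following is a minimal sketch of three of the ideas it names: a bag-of-words vector, random oversampling of the minority class, and sensitivity (recall on the positive class). It is not the authors' implementation; the vocabulary, the toy headlines, and the function names `bag_of_words`, `random_oversample`, and `sensitivity` are assumptions made only for illustration.

```python
import random
from collections import Counter


def bag_of_words(text, vocabulary):
    """Map a text to a vector of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes
    have as many samples as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


def sensitivity(y_true, y_pred, positive=1):
    """True positives divided by all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy corpus: label 1 = traffic incident, 0 = other news.
vocabulary = ["choque", "accidente", "lluvia"]
texts = [
    "Choque en la avenida",
    "Lluvia en la ciudad",
    "Pronostican lluvia",
    "Sin novedades hoy",
]
labels = [1, 0, 0, 0]

vectors = [bag_of_words(t, vocabulary) for t in texts]
balanced_x, balanced_y = random_oversample(vectors, labels)
```

After oversampling, both classes have the same number of samples, so a learner trained on `balanced_x` is no longer dominated by the majority class. SMOTE and its variants differ in that they synthesize new minority samples instead of duplicating existing ones.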

Highlights

  • Most people across the world live in urban areas, and the UN expects this population to increase dramatically over the coming three decades.

  • This project is split into three blocks to address the problem of informing citizens about traffic accidents detected automatically in the news reports provided by the RSS service of the ‘El Diario de Juárez’ newspaper.

  • When the newspaper publishes a news item through RSS, our application vectorizes it and applies the trained Support Vector Machine (SVM) model to determine whether it reports a traffic accident. For news reports classified as traffic accidents, a data extraction technique based on Spanish grammatical patterns is applied to the text to identify the location of the event.
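
As an illustration of the last step, the sketch below extracts a location from a news sentence with a single pattern. The grammatical patterns actually used in the paper are not given in this summary, so the rule here (a preposition, an article, a street-type noun, and one or more capitalized words) and the names `STREET_TYPES`, `LOCATION_PATTERN`, and `extract_locations` are hypothetical.

```python
import re

# Hypothetical list of street-type nouns; a real system would need a fuller one.
STREET_TYPES = r"(?:avenida|calle|bulevar|carretera)"

# Assumed pattern: "en"/"sobre" + article + street type + capitalized name words.
LOCATION_PATTERN = re.compile(
    rf"\b(?:en|sobre)\s+(?:la|el)\s+({STREET_TYPES}(?:\s+[A-ZÁÉÍÓÚÑ][\wáéíóúñ]*)+)"
)


def extract_locations(text):
    """Return every street mention in the text that matches the assumed pattern."""
    return LOCATION_PATTERN.findall(text)


example = "Un choque ocurrió en la avenida Tecnológico y dejó dos heridos."
locations = extract_locations(example)
```

For the example sentence, `locations` holds the single string `"avenida Tecnológico"`. A single regular expression like this misses intersections, neighborhoods, and landmarks, which is one reason a pattern-based extractor typically combines several rules.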

Summary

Introduction

Most people across the world live in urban areas, and the UN expects this population to increase dramatically over the coming three decades. The leading causes are (1) the shift in residence from rural communities to urban ones and (2) the growth of the overall population, adding 2.5 billion people to cities [1]. This trend is especially marked in developing regions. During the last two decades, urban structures have become more digital and information-based, and there has been a decisive change in the living environment of citizens. They must be capable of further advances as ICT innovations emerge. With a fast-increasing population, Ciudad Juárez must provide structures that rapidly broadcast knowledge among all of the members of its community. This shows that Ciudad Juárez urgently needs to evolve towards a smart city (SC).

Background
A Brief Review of the Related Literature
Our Proposal
Knowledge Discovery
Knowledge Application and Deployment
Results
Performance of the Classifiers on the Imbalanced Corpus
Impact of the Sampling Methods on SVM and Random Forest
Performance of the Location Extraction Module
Conclusions and Future Work
