Abstract
User-generated content (UGC) is an important source of information for governments, companies, political candidates, and consumers. However, most Natural Language Processing (NLP) tools and techniques are developed from and for standard-language texts, while UGC is a type of text especially rich in creativity and idiosyncrasies, which represent noise for NLP purposes. This paper presents UGCNormal, a lexicon-based tool for UGC normalization. It encompasses a tokenizer, a sentence segmenter, a phonetic-based spell checker, and several lexicons, which originated from an in-depth analysis of a corpus of product reviews in Brazilian Portuguese. The normalizer was evaluated on two different data sets and carried out between 31% and 89% of the appropriate corrections, depending on the type of text noise. The use of UGCNormal was also validated in a POS-tagging task, in which accuracy improved from 91.35% to 93.15%, and in an opinion-classification task, in which the average of the F1-scores (positive and negative) improved from 0.736 to 0.758.
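The core idea of lexicon-based normalization, as described above, can be sketched in a few lines: tokenize the text, look each token up in a lexicon mapping noisy forms to canonical ones, and keep in-vocabulary tokens unchanged. The lexicon entries and function names below are illustrative only, not UGCNormal's actual resources.

```python
import re

# Hypothetical mini-lexicon of noisy Brazilian Portuguese forms
# mapped to their canonical spellings (UGCNormal derives its
# lexicons from a corpus of product reviews).
NORMALIZATION_LEXICON = {
    "vc": "você",
    "naum": "não",
    "mto": "muito",
}

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text, flags=re.UNICODE)

def normalize(text):
    """Replace tokens found in the lexicon with their canonical
    forms; tokens not in the lexicon pass through unchanged."""
    tokens = tokenize(text.lower())
    return " ".join(NORMALIZATION_LEXICON.get(t, t) for t in tokens)

print(normalize("vc naum gosta mto ?"))  # -> você não gosta muito ?
```

A real normalizer would also handle sentence segmentation and phonetically motivated misspellings, which a plain lookup table cannot capture; this sketch covers only the lexicon-lookup step.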
Highlights
The increasing volume of text posted by users on the web is regarded as an extremely useful opportunity to reveal public opinion on many issues
As a result, processing and analyzing user-generated content (UGC) has become a task of NLP (Natural Language Processing)
The characteristics we describe have been observed in the Buscapé corpus of product reviews, built by Hartmann et al. (2014)
Summary
The increasing volume of text posted by users on the web is regarded as an extremely useful opportunity to reveal public opinion on many issues. For a variety of reasons, governments, companies, political candidates, and consumers want to explore such web content. This type of text is referred to in the literature as UGC (user-generated content) or eWoM (electronic word-of-mouth). The problem is that, until now, almost all NLP tools and techniques have been developed from, and for, standard-language text, whereas UGC displays a range of creative and idiosyncratic differences, which represent noise for NLP purposes. This work was preceded by the detection and analysis of out-of-vocabulary (OOV) words in a corpus of product reviews (Hartmann et al., 2014). In another preliminary investigation, we found other types of deviations and analyzed their impact on a tagging task (Duran et al., 2014).