Abstract

User-generated content (UGC) is an important source of information for governments, companies, political candidates, and consumers. However, most Natural Language Processing (NLP) tools and techniques are developed from and for standard-language text, whereas UGC is especially rich in creativity and idiosyncrasies, which act as noise for NLP purposes. This paper presents UGCNormal, a lexicon-based tool for UGC normalization. It comprises a tokenizer, a sentence segmentation tool, a phonetic-based speller, and several lexicons, which originated from an in-depth analysis of a corpus of product reviews in Brazilian Portuguese. The normalizer was evaluated on two different data sets and made from 31% to 89% of the appropriate corrections, depending on the type of text noise. UGCNormal was also validated on two tasks: POS tagging, where accuracy improved from 91.35% to 93.15%, and opinion classification, where the average of the positive-class and negative-class F1-scores improved from 0.736 to 0.758.

Highlights

  • The increasing volume of text posted by users on the web is regarded as an extremely useful opportunity to reveal public opinion on many issues

  • As a result, processing and analyzing user-generated content (UGC) has become an NLP (Natural Language Processing) task

  • The characteristics we describe were observed in the Buscapé corpus of product reviews, built by Hartmann et al. (2014)

Summary

Introduction

The increasing volume of text posted by users on the web is regarded as an extremely useful opportunity to reveal public opinion on many issues. For a variety of reasons, governments, companies, political candidates, and consumers want to explore such web content. This type of text is referred to in the literature as UGC (user-generated content) or EWoM (electronic word-of-mouth). The problem is that, until now, almost all NLP tools and techniques have been developed from, and for, standard-language text, whereas UGC displays a range of creative and idiosyncratic deviations, which represent noise for NLP purposes. This work was preceded by the detection and analysis of out-of-vocabulary (OOV) words in a corpus of product reviews (Hartmann et al., 2014). In another preliminary investigation, we found other types of deviations and assessed their impact on a tagging task (Duran et al., 2014).

Related works
Characteristics of User-Generated Content in product reviews
A lexicon-based approach to UGC normalization
Intrinsic Evaluation
Extrinsic Evaluation
Some limitations of the normalization tool
Findings
Final remarks and future work
