Discriminating between Brazilian and European Portuguese National Varieties on Twitter Texts

Dayvid Castro, Ellen Souza, Adriano L. I. De Oliveira

doi:10.1109/bracis.2016.056

Abstract

Twitter is one of the most widely used social media platforms, with users generating about 1 million messages per day. As the microblog has expanded, its users have come to write in many different languages, and many studies have aimed at identifying the language of tweets. The third most used language on Twitter is Portuguese, a pluricentric language with two national standard varieties: Brazilian Portuguese and European Portuguese. Identifying a language variety may benefit various Natural Language Processing tasks, but it is still regarded as one of the bottlenecks in this area, especially when combined with another bottleneck: language identification on short texts. Given these challenges, this paper provides a current view of the automatic discrimination of the two main Portuguese language varieties in Twitter texts, applying an established approach with different techniques and features to find the best configuration for the problem. The best result, an accuracy of 0.9271, was obtained with an ensemble method that combines character 6-grams with word unigrams and bigrams.
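To make the feature combination concrete, the following is a minimal sketch of an ensemble over two feature views: character 6-grams, and word unigrams plus bigrams. The abstract does not state which classifiers or combination rule were used, so the Naive Bayes models and the posterior averaging below are assumptions, as are the toy training tweets and all function and class names.

```python
import math
from collections import Counter

def char_ngrams(text, n=6):
    """Overlapping character n-grams of the lowercased text."""
    t = text.lower()
    return [t[i:i + n] for i in range(len(t) - n + 1)]

def word_ngrams(text, n_max=2):
    """Word n-grams for n = 1..n_max, using whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed classifier)."""

    def __init__(self, featurize):
        self.featurize = featurize

    def fit(self, texts, labels):
        self.n_docs = len(labels)
        self.priors = Counter(labels)
        self.counts = {y: Counter() for y in self.priors}
        for text, y in zip(texts, labels):
            self.counts[y].update(self.featurize(text))
        self.vocab = set().union(*self.counts.values())
        return self

    def posteriors(self, text):
        # Features never seen in training are ignored.
        feats = [f for f in self.featurize(text) if f in self.vocab]
        logs = {}
        for y, counter in self.counts.items():
            denom = sum(counter.values()) + len(self.vocab)
            logs[y] = math.log(self.priors[y] / self.n_docs) + sum(
                math.log((counter[f] + 1) / denom) for f in feats)
        # Normalize log scores into probabilities (softmax).
        top = max(logs.values())
        exps = {y: math.exp(v - top) for y, v in logs.items()}
        total = sum(exps.values())
        return {y: e / total for y, e in exps.items()}

def ensemble_predict(models, text):
    """Average the class posteriors of all models and return the best label."""
    avg = Counter()
    for model in models:
        for y, p in model.posteriors(text).items():
            avg[y] += p / len(models)
    return max(avg, key=avg.get)

# Hypothetical toy data, for illustration only.
train_texts = [
    "estou assistindo o jogo no ônibus",
    "você pegou o trem hoje",
    "estou a ver o jogo no autocarro",
    "tu apanhaste o comboio hoje",
]
train_labels = ["pt-BR", "pt-BR", "pt-PT", "pt-PT"]

models = [
    NaiveBayes(char_ngrams).fit(train_texts, train_labels),
    NaiveBayes(word_ngrams).fit(train_texts, train_labels),
]

print(ensemble_predict(models, "ela pegou o ônibus"))
print(ensemble_predict(models, "ele apanhou o autocarro"))
```

Character 6-grams can capture spelling and morphology cues that span word boundaries, while word n-grams capture lexical choices such as "ônibus" versus "autocarro"; averaging the two posteriors lets either view compensate when the other sees no known features in a short tweet.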

Full Text