Language Variety Prediction using Word Embeddings and Machine Learning Algorithms

Chennam Chandrika Surya, Murali Mohan T, R Prasanthi Kumari, Karunakar K

doi:10.22214/ijraset.2022.48280

Abstract

Author profiling is a technique for predicting demographic characteristics of an author, such as gender, age, location, native language, and educational background, by analysing their written texts. It is used in several text processing applications, including forensic analysis, marketing, and security. Author profiling techniques identify stylistic differences among authors' writing styles to infer their demographics. Researchers have experimented with various stylistic features, including lexical, content-based, syntactic, semantic, domain-specific, structural, and readability features, to identify the stylistic differences among texts by different authors. The dataset plays an important role in analysing these differences. PAN is a competition that organizes different tasks every year to encourage participants around the globe to provide solutions to text classification problems such as plagiarism detection, authorship attribution, authorship verification, author profiling, celebrity profiling, style change detection, fake news spreader detection, and hate speech spreader detection. The author profiling task was introduced in 2013 by the organizers of the PAN competition. The organizers carefully gather the datasets and make them available to researchers. Every year the organizers conduct competitions on different sub-tasks of author profiling and provide datasets in different languages and genres. In the 2017 competition, PAN introduced the task of predicting the language variety of an author and released the dataset in four languages. In this work, we propose an approach for the English-language dataset of the language variety prediction task. The proposed approach uses word embeddings generated by the Word2Vec model and the BERT (Bidirectional Encoder Representations from Transformers) model.
The word embeddings are used to generate document vectors by combining the embeddings of the words contained in each document. Two machine learning algorithms, support vector machine and random forest, are trained on the document vectors. Random forest attained the best accuracy, 96.87, for language variety prediction when the experiment was conducted with BERT embeddings.
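As a minimal sketch of how word embeddings can be combined into a document vector, the example below averages the vectors of the in-vocabulary words. The abstract does not state which combination rule is used, so averaging is an assumption here. The embedding table `TOY_EMBEDDINGS`, its values, and the helper `document_vector` are hypothetical and purely illustrative; in the setting described above the vectors would come from Word2Vec or BERT.

```python
from typing import Dict, List, Sequence

# Toy 3-dimensional embedding table; the values are invented for illustration.
# In practice these vectors would be produced by Word2Vec or BERT.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "colour": [1.0, 0.0, 0.0],
    "color": [0.0, 1.0, 0.0],
    "favourite": [1.0, 0.0, 1.0],
    "favorite": [0.0, 1.0, 1.0],
}


def document_vector(tokens: Sequence[str],
                    embeddings: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Average the embeddings of known tokens (assumed combination rule)."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vector = embeddings.get(token.lower())
        if vector is None:
            continue  # skip out-of-vocabulary words
        total = [t + v for t, v in zip(total, vector)]
        known += 1
    if known == 0:
        return total  # all-zero vector when no token is in the vocabulary
    return [t / known for t in total]


# Two short documents that differ only in spelling variety.
british = document_vector("My favourite colour".split(), TOY_EMBEDDINGS, 3)
american = document_vector("My favorite color".split(), TOY_EMBEDDINGS, 3)
print(british)
print(american)
```

Document vectors of this kind can be stacked into a feature matrix and passed to standard classifier implementations, for example scikit-learn's `SVC` or `RandomForestClassifier`. That library is named only as one possible choice, not as the toolkit used in this work.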

Full Text