Multilingual author profiling on Facebook

Mehwish Fatima, Komal Hasan, Saba Anwar, Rao Muhammad Adeel Nawab

doi:10.1016/j.ipm.2017.03.005

Abstract

Author profiling is the identification of an author's demographic features by examining their written text. It has recently attracted the attention of the research community because of its potential applications in forensics, security, marketing, identification of fake profiles on online social networking sites, and tracing the senders of harassing messages. Developing and evaluating author profiling techniques requires benchmark corpora. The majority of existing corpora are for English and other European languages, not for under-resourced South Asian languages such as Roman Urdu (Urdu written in the English alphabet). Roman Urdu is used in daily communication by a large number of native Urdu speakers around the world, particularly in Facebook posts and comments, tweets, blogs, chat blogs and SMS messages. Constructing Urdu sentences with the English alphabet changes the linguistic properties of the text. We aim to investigate how existing author profiling techniques behave on multilingual text consisting of English and Roman Urdu, specifically for gender and age identification. We focus on author profiling on Facebook by (i) developing a multilingual (Roman Urdu and English) corpus, (ii) manually building a bilingual dictionary for translating Roman Urdu words into English, (iii) applying existing state-of-the-art author profiling techniques, using content-based features (word and character N-grams) and 64 stylistic features (11 lexical word-based features, 47 lexical character-based features and 6 vocabulary richness measures), to age and gender identification on the multilingual and translated corpora, and (iv) evaluating and comparing the behavior of these techniques on the multilingual and translated corpora.
Our extensive empirical evaluation shows that (i) existing author profiling techniques can be used for multilingual text (Roman Urdu + English) as well as monolingual text (the corpus obtained by translating the multilingual corpus with the bilingual dictionary), (ii) content-based methods outperform stylistic methods for both the gender and the age identification tasks, and (iii) translating the multilingual corpus into monolingual text does not improve results.
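
The feature types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary entries, the function names, and the choice of type-token ratio as a vocabulary richness measure are assumptions made for illustration, since the abstract does not list the specific measures or the dictionary contents.

```python
from collections import Counter

# Hypothetical toy dictionary; the paper's manually built bilingual
# dictionary is not reproduced here.
ROMAN_URDU_TO_ENGLISH = {"acha": "good", "hai": "is", "kya": "what"}


def translate(tokens, dictionary):
    """Word-by-word lookup; tokens missing from the dictionary are kept as-is."""
    return [dictionary.get(tok.lower(), tok) for tok in tokens]


def word_ngrams(tokens, n):
    """Count contiguous word N-grams (content-based feature)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count contiguous character N-grams (content-based feature)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens; assumed here as one
    example of a vocabulary richness measure."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


post = "yeh movie acha hai"          # mixed Roman Urdu and English
tokens = post.split()
translated = translate(tokens, ROMAN_URDU_TO_ENGLISH)

print(translated)
print(word_ngrams(tokens, 2))
print(char_ngrams(post, 3).most_common(3))
print(type_token_ratio(tokens))
```

In a setup like this, the N-gram counts would be computed once on the original multilingual posts and once on the translated posts, then passed to a classifier for each task so the two corpora can be compared.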

Journal: Information Processing and Management
Publication Date: Apr 12, 2017
Citations: 43

Similar Papers

Multilingual SMS-based author profiling: Data and methods
Mehwish Fatima ... Alia Masood
Natural Language Engineering | VOL. 24
26 Jun 2018

Comparative Analysis of Machine Learning Algorithms for Author Age and Gender Identification
Zarah Zainab ... Feras Al-Obeidat
01 Jan 2023

On the role of syntactic dependencies and discourse relations for author and gender identification
Juan Soler-Company ... Leo Wanner
Pattern Recognition Letters | VOL. 105
06 Dec 2017

Detecting Cyberbullying in Roman Urdu Language Using Natural Language Processing Techniques
Fahad Rasheed ... Mehmoon Anwar
Pakistan Journal of Engineering and Technology | VOL. 5
19 Sep 2022
