Multilingual SMS-based author profiling: Data and methods

Mehwish Fatima,Rao Muhammad Adeel Nawab,Muntaha Iqbal,Waqas Arshad,Saba Anwar,Alia Masood,Amna Naveed

doi:10.1017/s1351324918000244

Abstract

AbstractIn the recent years, many benchmark author profiling corpora have been developed for various genres including Twitter, social media, blogs, hotel reviews and e-mail, etc. However, no such standard evaluation resource has been developed for Short Messaging Service (SMS), a popular medium of communication, which is very useful for author profiling. The primary aim of this study is to develop a large multilingual (English and Roman Urdu) benchmark SMS-based author profiling corpus. The proposed corpus contains 810 author profiles, wherein each profile consists of an aggregation of SMS messages as a single document of an author, along with seven demographic traits associated with each author profile: gender, age, native language, native city, qualification, occupation and personality type (introvert/extrovert). The secondary aims of this study include the following: (1) annotating the proposed corpus for code-switching annotations at the lexical level (approximately 0.69 million tokens are manually annotated for code-switching) and (2) applying the stylometry-based method (groups of sixty-four features) and the content-based method (twelve features) for gender identification in order to demonstrate how our proposed corpus can be used for the development and evaluation of various author profiling methods. The results show that the content-based character 5-gram feature outperformed all the other features by obtaining the accuracy score of 0.975 andF1score of 0.947 for gender identification while using the entire corpus. Furthermore, our proposed corpora (SMS–AP–18 and code-switched SMS–AP–18) are freely and publicly available for research purpose.

Full Text