Dataset Creation from Multilingual Data of Social Media: Challenges and Consequences

Mohammad Aman Ullah, Md Monirul Islam, Zulkifly Mohd Zaki, Norhidayah Azman

doi:10.1109/wiecon-ece52138.2020.9398002

Abstract

In recent years, social media platforms, especially Facebook, have seen massive growth in regular posts and their related comments. Users are free to post and comment any kind of information in any language, but there is no explicit mechanism to reconcile the information expressed in different languages into a useful dataset. As a result, in most cases, Facebook content expressed in different languages remains unused. This paper explains the motivation behind multilingual dataset creation and proposes a framework for it. It also illustrates the challenges associated with dataset generation, such as separating multilingual data. Finally, it presents the consequences that these challenges have for multilingual dataset creation. The contribution of this research is therefore the creation of a multilingual dataset using the proposed framework, together with a practical account of the loopholes and consequences encountered.

Full Text