Feature Selection on Noisy Twitter Short Text Messages for Language Identification

Mohd Zeeshan Ansari* (Computer Engineering, Jamia Millia Islamia, New Delhi, India), Ana Fatima, Tanvir Ahmad

doi:10.35940/ijrte.d4360.118419

Open Access

https://doi.org/10.35940/ijrte.d4360.118419

Abstract

Written language identification typically involves detecting the languages present in a sample of text. A sequence of text need not belong to a single language; it may be a mixture of text written in multiple languages. Such text is generated in large volumes on social media platforms because of their flexible and user-friendly environments. It contains a very large number of features, which are essential for developing statistical, probabilistic, and other kinds of language models. This large feature set includes informative features as well as irrelevant and redundant ones, and these have varying effects on the performance of the learning model. Feature selection methods are therefore important for choosing the features that are most relevant to an efficient model. In this article, we consider the Hindi-English language identification task, as Hindi and English are often regarded as the two most widely spoken languages of India. We apply different feature selection algorithms across various learning algorithms in order to analyze how both the choice of algorithm and the number of selected features affect performance on the task. The methodology focuses on word-level language identification using a novel dataset of 6903 tweets extracted from Twitter. Various n-gram profiles are examined with different feature selection algorithms over many classifiers. Finally, an exhaustive comparative analysis is presented covering all the experiments conducted for the task.
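The abstract does not name the specific feature selection algorithms or classifiers used, so the following is only a minimal sketch of the general pipeline it describes: build character n-gram profiles for individual words, rank the n-grams with a feature selection score, keep the top k, and train a classifier on the reduced feature set. The chi-square score, the naive Bayes classifier, the function names, the value of k, and the toy word list (romanized Hindi and English words invented for illustration) are all assumptions and are not taken from the paper or its dataset.

```python
import math
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Return the set of character n-grams of a word, with ^ and $ as boundary markers."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}


def chi_square_scores(samples, labels):
    """Chi-square statistic of each n-gram (present/absent) against a binary label."""
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles two languages only"
    first = classes[0]
    n = len(samples)
    n_first = sum(1 for y in labels if y == first)
    n_second = n - n_first
    seen_any, seen_first = Counter(), Counter()
    for feats, y in zip(samples, labels):
        for f in feats:
            seen_any[f] += 1
            if y == first:
                seen_first[f] += 1
    scores = {}
    for f, total in seen_any.items():
        a = seen_first[f]      # n-gram present, first class
        b = total - a          # n-gram present, second class
        c = n_first - a        # n-gram absent, first class
        d = n_second - b       # n-gram absent, second class
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[f] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring n-grams (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {f for f, _ in ranked[:k]}


def train_naive_bayes(samples, labels, vocab):
    """Naive Bayes with add-one smoothing, restricted to the selected n-grams."""
    model = {}
    for y in sorted(set(labels)):
        counts = Counter()
        for feats, label in zip(samples, labels):
            if label == y:
                counts.update(f for f in feats if f in vocab)
        total = sum(counts.values()) + len(vocab)
        log_prior = math.log(labels.count(y) / len(labels))
        log_like = {f: math.log((counts[f] + 1) / total) for f in vocab}
        model[y] = (log_prior, log_like)
    return model


def predict(model, word, vocab):
    """Return the label with the highest log-posterior for a single word."""
    feats = char_ngrams(word) & vocab
    return max(model, key=lambda y: model[y][0] + sum(model[y][1][f] for f in feats))


# Toy data, invented for illustration only (not from the paper's 6903-tweet dataset).
words = ["nahi", "kya", "hai", "acha", "bahut", "kyun", "mera", "pyaar",
         "the", "what", "good", "very", "why", "love", "this", "that"]
labels = ["hi"] * 8 + ["en"] * 8

samples = [char_ngrams(w) for w in words]
scores = chi_square_scores(samples, labels)
vocab = select_top_k(scores, k=15)   # k is an arbitrary choice for this sketch
model = train_naive_bayes(samples, labels, vocab)
print(predict(model, "that", vocab))
```

Varying `k` and the n-gram range (`n_min`, `n_max`) in a sketch like this mirrors, at toy scale, the kind of comparison the abstract describes: how the number of selected features and the n-gram profile affect classification performance.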

Journal: International Journal of Recent Technology and Engineering (IJRTE)
Publication Date: Nov 30, 2019
Citations: 1
