CMIR: A Corpus for Evaluation of Code Mixed Information Retrieval of Hindi-English Tweets

Kunal Chakma, Amitava Das

doi:10.13053/cys-20-3-2459

Abstract

Social media has become almost ubiquitous. This proliferation creates a need for automatic information processing, which brings various challenges. Social media content is mostly informal. Moreover, on Indian social media, users often prefer to write Roman transliterations of their native languages with embedded English. Information retrieval (IR) on such data is therefore challenging when the documents and the queries mix two or more languages written in the native scripts, in Roman transliterated form, or both. In this paper we highlight issues related to IR for code-mixed Indian social media texts, particularly texts from Twitter. We describe a corpus collection process, report the limitations of available state-of-the-art IR systems on such data, and formalize the problem of Code-Mixed Information Retrieval on informal texts.

Journal: Computación y Sistemas
Publication Date: Sep 30, 2016
Citations: 17

Similar Papers

Machine Learning Techniques for Sentiment Analysis of Code-Mixed and Switched Indian Social Media Text Corpus - A Comprehensive Review
Gazi Imtiyaz Ahmad ... Aijaz Ahmad Reshi
International Journal of Advanced Computer Science and Applications | VOL. 13
01 Jan 2021

An Effective Bi-LSTM Word Embedding System for Analysis and Identification of Language in Code-Mixed social Media Text in English and Roman Hindi
Shashi Shekhar ... M.M Sufyan Beg
Computación y Sistemas | VOL. 24
09 Dec 2020

Transliteration Characteristics in Romanized Assamese Language Social Media Text and Machine Transliteration
Hemanta Baruah ... Sanasam Ranbir Singh
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 23
08 Feb 2024

A Framework for Online Hate Speech Detection on Code-mixed Hindi-English Text and Hindi Text in Devanagari
Abhishek Chopra ... Aashna Jha
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 22
08 May 2023
