PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation

Vivek Srivastava, Mayank Singh

doi:10.18653/v1/2020.wnut-1.7

Abstract

Code-mixing is the phenomenon of using more than one language in a sentence. In multilingual communities, it is a frequently observed pattern of communication on social media platforms. The flexibility to use multiple languages in one text message might help to communicate efficiently with the target audience. However, noisy user-generated code-mixed text makes processing and understanding natural language considerably harder. Machine translation from a monolingual source to a target language is a well-studied research problem. Here, we demonstrate that widely used and sophisticated translation systems such as Google Translate fail at times to translate code-mixed text effectively. To address this challenge, we present a parallel corpus of 13,738 code-mixed Hindi-English sentences and their corresponding human translations in English. We also propose a translation pipeline built on top of Google Translate. The evaluation of the proposed pipeline on PHINC demonstrates an increase in the performance of the underlying system. With minimal effort, the dataset and the proposed approach can be extended to other code-mixing language pairs.

Highlights

Code-mixing is the phenomenon of switching between two or more languages by the speaker in a single sentence of a text or speech
Proposed Pipeline + Google Translate (PPGT): In addition to Bing Translate (BT) and GT, we propose a simple pipeline to use translation capabilities of already existing machine translation systems
We present a parallel corpus for the English-Hindi code-mixed machine translation task
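
The idea of wrapping an existing translation system in a lightweight pipeline can be illustrated with a minimal sketch. This is not the authors' PPGT pipeline, whose individual steps are not described in this summary. It only assumes a hypothetical toy vocabulary of romanized Hindi tokens (`hindi_vocab`) and a caller-supplied `base_translate` function standing in for whichever existing translation system is used. All names below are illustrative.

```python
from typing import Callable, List, Set, Tuple


def tag_tokens(sentence: str, hindi_vocab: Set[str]) -> List[Tuple[str, str]]:
    """Label each whitespace token 'hi' if it is in the toy vocabulary, else 'en'.

    Assumption: a real system would use a trained language identifier,
    not a lookup in a hand-made word list.
    """
    return [
        (tok, "hi" if tok.lower() in hindi_vocab else "en")
        for tok in sentence.split()
    ]


def hindi_fraction(tagged: List[Tuple[str, str]]) -> float:
    """Share of tokens tagged as Hindi (0.0 for an empty sentence)."""
    if not tagged:
        return 0.0
    return sum(1 for _, lang in tagged if lang == "hi") / len(tagged)


def translate_code_mixed(
    sentence: str,
    hindi_vocab: Set[str],
    base_translate: Callable[[str], str],
) -> str:
    """Route a sentence through the base translator only if it contains Hindi.

    `base_translate` is a hypothetical stand-in for an existing machine
    translation system; sentences with no Hindi tokens are returned as-is.
    """
    tagged = tag_tokens(sentence, hindi_vocab)
    if hindi_fraction(tagged) == 0.0:
        return sentence
    return base_translate(sentence)


if __name__ == "__main__":
    vocab = {"kal", "milte", "hain"}  # hypothetical toy vocabulary

    def stub_translator(text: str) -> str:
        # Placeholder: a real pipeline would call an MT system here.
        return "TRANSLATED:" + text

    print(tag_tokens("Kal milte hain friends", vocab))
    print(translate_code_mixed("Kal milte hain friends", vocab, stub_translator))
```

Keeping the translator behind a plain callable means the same wrapper could sit in front of any existing system, which matches the highlight's emphasis on reusing translation capabilities rather than training a new model.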

Summary

Introduction

Code-mixing is the phenomenon of switching between two or more languages by the speaker in a single sentence of a text or speech. Vyas et al. (2014) proposed various experiments to identify POS tags of 1,062 code-mixed Hindi-English Facebook posts. They collected data from three popular celebrity Facebook public pages: those of Mr Amitabh Bachchan (well-known actor), Mr Shahrukh Khan (well-known actor), and Mr Narendra Modi (current Indian Prime Minister). The proposed evaluation benchmark has six NLP tasks, i.e., language identification, POS tagging, named entity recognition, sentiment analysis, question answering, and natural language inference. These tasks have been part of recent shared tasks co-located with various NLP conferences or of the latest research works. Dhar et al. (2018) propose a machine translation augmentation pipeline to use on top of standard machine translation systems. They create a parallel corpus of 6,096 English-Hindi code-mixed sentences and their corresponding translations in English. We discuss various limitations of the corpus and the research opportunities.

Code-Mixing and Challenges in Machine Translation

Dataset

Annotation

Exploratory Analysis

Degree of Code-mixing

Message Length

Frequent words

Evaluation of Machine Translation Systems

Limitations and Opportunities

Conclusion and Future Work

Publication Date: Jan 1, 2020
Citations: 22
License type: CC BY
