Abstract

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts in closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, for very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model captures the sentence constructions of languages by applying attention to character relations, which helps to optimize the clustering of languages. We have performed our experiments on three short-text datasets from different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets show significant improvement over state-of-the-art unsupervised methods, and our model outperforms state-of-the-art LI and DI systems in supervised settings.
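The abstract page does not include an implementation, but the character-attention sentence encoder it describes resembles the structured self-attention of Lin et al. (2017) applied to character embeddings. Below is a minimal PyTorch sketch under that assumption; all module names, dimensions, and hyperparameters are illustrative choices of ours, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharAttentionEncoder(nn.Module):
    """Illustrative character-level encoder with structured self-attention
    in the style of Lin et al. (2017); not the UDLDI authors' implementation."""

    def __init__(self, n_chars, char_dim=64, hidden=128, att_dim=64, heads=4):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.bilstm = nn.LSTM(char_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.w1 = nn.Linear(2 * hidden, att_dim, bias=False)
        self.w2 = nn.Linear(att_dim, heads, bias=False)

    def forward(self, char_ids):                      # char_ids: (batch, seq)
        h, _ = self.bilstm(self.embed(char_ids))      # (batch, seq, 2*hidden)
        # A = softmax(W2 tanh(W1 H^T)): one attention distribution per head,
        # normalised over character positions (dim=1)
        a = F.softmax(self.w2(torch.tanh(self.w1(h))), dim=1)  # (batch, seq, heads)
        m = torch.einsum('bsh,bsd->bhd', a, h)        # (batch, heads, 2*hidden)
        return m.flatten(1)                           # fixed-size sentence embedding
```

The flattened multi-head embedding would then feed a clustering objective such as the one sketched after the highlights. Lin et al. also add a penalty term encouraging the attention heads to focus on different positions, omitted here for brevity.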

Highlights

  • Automatic Language Identification (LI) and Dialect Identification (DI) have become a crucial part of natural language processing (NLP) pipelines and feed into tasks such as language modelling, categorization, and analysis of code-mixed datasets.

  • These previous works raise some questions: What is the best way to construct sentence embeddings, and how can they be clustered efficiently for short texts when there is no labelled training data? How can the hard task of DI and LI for closely related languages be addressed in an unsupervised way, without any manual intervention? In this paper, we address these problems by taking inspiration from iterative clustering (Xie et al., 2016) and self-attention-based sentence embeddings (Lin et al., 2017); see the clustering sketch after this list.

  • We propose a novel character-attention-based unsupervised deep language and dialect identification (UDLDI) model for short texts of closely related languages.
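For reference, the iterative clustering the highlights cite is Deep Embedded Clustering (Xie et al., 2016), which alternates between a soft assignment of embeddings to cluster centroids and a sharpened target distribution. The PyTorch sketch below shows that published loss; the function and variable names are ours, and this is not the UDLDI training code.

```python
import torch
import torch.nn.functional as F

def dec_loss(z, centroids, alpha=1.0):
    """KL clustering loss of Deep Embedded Clustering (Xie et al., 2016).
    z: (batch, dim) sentence embeddings; centroids: (k, dim) trainable."""
    # Soft assignment q_ij: Student's t kernel between embedding i and centroid j
    d2 = torch.cdist(z, centroids).pow(2)              # squared Euclidean distances
    q = (1.0 + d2 / alpha).pow(-(alpha + 1.0) / 2.0)
    q = q / q.sum(dim=1, keepdim=True)
    # Target distribution p_ij: square q to sharpen it, normalise by cluster frequency
    p = q.pow(2) / q.sum(dim=0, keepdim=True)
    p = p / p.sum(dim=1, keepdim=True)
    # Minimising KL(P || Q) pulls embeddings toward high-confidence assignments
    return F.kl_div(q.log(), p.detach(), reduction='batchmean')
```

In DEC the centroids are trainable parameters initialised with k-means on the initial embeddings, and the target distribution is recomputed periodically rather than at every step; detaching p per batch, as above, is a simplification for the sketch.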


Summary

Introduction

Automatic Language Identification (LI) and Dialect Identification (DI) have become a crucial part of natural language processing (NLP) pipelines and feed into tasks such as language modelling, categorization, and analysis of code-mixed datasets. Unsupervised LI is an under-explored area, but it is highly useful as it can exploit the large amount of unlabelled data and, more importantly, can be employed when the languages to be identified are not known in advance. Ciobanu et al. (2018) studied German dialect identification. Unsupervised LI for short texts is a very difficult task, for which performance is still comparatively poor. Zhang et al. (2016) explored unsupervised language identification, but their approach does not work well with closely related short texts. The task is even harder when it comes to unsupervised DI.

