Abstract

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts in closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, for very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model captures the sentence constructions of languages by applying attention to character relations, which helps to optimize the clustering of languages. We have performed our experiments on three short-text datasets from different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets show significant improvement over state-of-the-art unsupervised methods, and our model outperforms state-of-the-art LI and DI systems in supervised settings.
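The abstract page does not include an implementation, but the character-attention sentence encoder it describes resembles the structured self-attention of Lin et al. (2017) applied to character embeddings. Below is a minimal PyTorch sketch under that assumption; all module names, dimensions, and hyperparameters are illustrative choices of ours, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharAttentionEncoder(nn.Module):
    """Illustrative character-level encoder with structured self-attention
    in the style of Lin et al. (2017); not the UDLDI authors' implementation."""

    def __init__(self, n_chars, char_dim=64, hidden=128, att_dim=64, heads=4):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.bilstm = nn.LSTM(char_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.w1 = nn.Linear(2 * hidden, att_dim, bias=False)
        self.w2 = nn.Linear(att_dim, heads, bias=False)

    def forward(self, char_ids):                      # char_ids: (batch, seq)
        h, _ = self.bilstm(self.embed(char_ids))      # (batch, seq, 2*hidden)
        # A = softmax(W2 tanh(W1 H^T)): one attention distribution per head,
        # normalised over character positions (dim=1)
        a = F.softmax(self.w2(torch.tanh(self.w1(h))), dim=1)  # (batch, seq, heads)
        m = torch.einsum('bsh,bsd->bhd', a, h)        # (batch, heads, 2*hidden)
        return m.flatten(1)                           # fixed-size sentence embedding
```

The flattened multi-head embedding would then feed a clustering objective such as the one sketched after the highlights. Lin et al. also add a penalty term encouraging the attention heads to focus on different positions, omitted here for brevity.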

Highlights

  • Automatic Language Identification (LI) and Dialect Identification (DI) have become a crucial part of natural language processing (NLP) pipelines and feed into tasks such as language modelling, categorization, and analysis of code-mixed datasets.

  • These previous works raise some questions: What is the best way to construct sentence embeddings, and how can they be clustered efficiently for short texts when there is no labelled training data? How can the hard task of DI and LI for closely related languages be addressed in an unsupervised way, without any manual intervention? In this paper, we address these problems by taking inspiration from iterative clustering (Xie et al., 2016) and self-attention-based sentence embeddings (Lin et al., 2017); see the clustering sketch after this list.

  • We propose a novel character-attention-based unsupervised deep language and dialect identification (UDLDI) model for short texts of closely related languages.
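For reference, the iterative clustering the highlights cite is Deep Embedded Clustering (Xie et al., 2016), which alternates between a soft assignment of embeddings to cluster centroids and a sharpened target distribution. The PyTorch sketch below shows that published loss; the function and variable names are ours, and this is not the UDLDI training code.

```python
import torch
import torch.nn.functional as F

def dec_loss(z, centroids, alpha=1.0):
    """KL clustering loss of Deep Embedded Clustering (Xie et al., 2016).
    z: (batch, dim) sentence embeddings; centroids: (k, dim) trainable."""
    # Soft assignment q_ij: Student's t kernel between embedding i and centroid j
    d2 = torch.cdist(z, centroids).pow(2)              # squared Euclidean distances
    q = (1.0 + d2 / alpha).pow(-(alpha + 1.0) / 2.0)
    q = q / q.sum(dim=1, keepdim=True)
    # Target distribution p_ij: square q to sharpen it, normalise by cluster frequency
    p = q.pow(2) / q.sum(dim=0, keepdim=True)
    p = p / p.sum(dim=1, keepdim=True)
    # Minimising KL(P || Q) pulls embeddings toward high-confidence assignments
    return F.kl_div(q.log(), p.detach(), reduction='batchmean')
```

In DEC the centroids are trainable parameters initialised with k-means on the initial embeddings, and the target distribution is recomputed periodically rather than at every step; detaching p per batch, as above, is a simplification for the sketch.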


Summary

Introduction

Automatic Language Identification (LI) and Dialect Identification (DI) have become a crucial part of natural language processing (NLP) pipelines and feed into tasks such as language modelling, categorization, and analysis of code-mixed datasets. Unsupervised LI is an under-explored area, but it is highly useful as it can exploit the large amount of unlabelled data and, more importantly, can be employed when the languages to be identified are not known in advance. Ciobanu et al. (2018) studied German dialect identification. Unsupervised LI for short texts is a very difficult task, for which performance is still comparatively poor. Zhang et al. (2016) explored unsupervised language identification, but their approach does not work well with closely related short texts. The task is even harder when it comes to unsupervised DI.

