Abstract

We present a machine learning approach for the Arabic Dialect Identification (ADI) and the German Dialect Identification (GDI) Closed Shared Tasks of the DSL 2017 Challenge. The proposed approach combines several kernels using multiple kernel learning. While most of our kernels are based on character p-grams (also known as n-grams) extracted from speech transcripts, we also use a kernel based on i-vectors, a low-dimensional representation of audio recordings, provided only for the Arabic data. In the learning stage, we independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge Regression (KRR). Our approach is shallow and simple, but the empirical results obtained in the shared tasks show that it performs very well. Indeed, we ranked first in the ADI Shared Task with a weighted F1 score of 76.32% (4.62% above the second place) and fifth in the GDI Shared Task with a weighted F1 score of 63.67% (2.57% below the first place).

Highlights

  • The recent 2016 Challenge on Discriminating between Similar Languages (DSL) (Malmasi et al., 2016) shows that dialect identification is a challenging NLP task that researchers are actively studying

  • We present a method based on learning with multiple kernels, which we designed for the Arabic Dialect Identification (ADI) and the German Dialect Identification (GDI) Shared Tasks of the DSL 2017 Challenge (Zampieri et al., 2017)

  • In a set of preliminary experiments performed on the GDI training set, we found that Kernel Discriminant Analysis (KDA) gives slightly better results than Kernel Ridge Regression (KRR)
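
To make the KRR side of this comparison concrete, here is a minimal sketch of Kernel Ridge Regression in its dual form, applied to a precomputed kernel matrix with one-vs-all targets. The function names (krr_fit, krr_predict), the regularization value lam, and the toy linear kernel are illustrative assumptions and are not taken from the paper; KDA is not shown.

```python
import numpy as np

def krr_fit(K, labels, n_classes, lam=1e-3):
    """Dual-form KRR: solve (K + lam * I) alpha = Y.

    K is an (n, n) training kernel matrix; labels are integer class ids.
    Y holds one-vs-all targets: +1 for the true class, -1 otherwise.
    lam is an assumed regularization value, not one from the paper.
    """
    n = K.shape[0]
    Y = -np.ones((n, n_classes))
    Y[np.arange(n), labels] = 1.0
    return np.linalg.solve(K + lam * np.eye(n), Y)

def krr_predict(K_test, alpha):
    """Pick the class with the highest regression score.

    K_test has shape (n_test, n_train): kernel values between
    test samples and training samples.
    """
    return np.argmax(K_test @ alpha, axis=1)

# Toy data (hypothetical): two well-separated classes, linear kernel.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
K = X @ X.T
alpha = krr_fit(K, labels, n_classes=2)
predictions = krr_predict(K, alpha)
```

Because the classifier only sees kernel values, the same two functions would work unchanged on a string kernel matrix or on a combination of several kernel matrices.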


Summary

Introduction

The recent 2016 Challenge on Discriminating between Similar Languages (DSL) (Malmasi et al., 2016) shows that dialect identification is a challenging NLP task that researchers are actively studying. The third kernel is derived from Local Rank Distance (LRD), a distance measure that was first introduced in computational biology (Ionescu, 2013; Dinu et al., 2014) and has since been applied in NLP (Popescu and Ionescu, 2013; Ionescu, 2015). All these string kernels have previously been used for Arabic dialect identification by Ionescu and Popescu (2016b), who obtained very good results, taking second place in the ADI Shared Task of the DSL 2016 Challenge (Malmasi et al., 2016).
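
As a rough illustration of what a kernel over character p-grams computes, the sketch below compares two strings by the p-grams they share. The two variants shown (counting shared distinct p-grams, and summing the minimum counts of shared p-grams) and the cosine-style normalization are common string-kernel choices assumed here for illustration; this excerpt does not specify the exact definitions used in the paper, and the LRD-based kernel is not shown.

```python
from collections import Counter
import math

def pgram_counts(text, p):
    """Count every contiguous character p-gram in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))

def presence_kernel(a, b, p):
    """Number of distinct p-grams that occur in both strings."""
    return len(pgram_counts(a, p).keys() & pgram_counts(b, p).keys())

def intersection_kernel(a, b, p):
    """Sum, over shared p-grams, of the smaller of the two counts."""
    ca, cb = pgram_counts(a, p), pgram_counts(b, p)
    return sum(min(ca[g], cb[g]) for g in ca.keys() & cb.keys())

def normalized(kernel, a, b, p):
    """Scale a kernel value so that a string compared with itself gives 1."""
    denom = math.sqrt(kernel(a, a, p) * kernel(b, b, p))
    return kernel(a, b, p) / denom if denom else 0.0

# "abab" has 2-grams ab (twice) and ba; "abba" has ab, bb, ba.
shared_distinct = presence_kernel("abab", "abba", 2)
shared_counts = intersection_kernel("abab", "abba", 2)
self_similarity = normalized(presence_kernel, "abab", "abab", 2)
```

Evaluating such a function on every pair of transcripts gives a kernel matrix that a kernel method such as KRR or KDA can consume directly.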

Arabic Dialect Identification
German Dialect Identification
String Kernels
Kernel based on Local Rank Distance
Kernel based on Audio Features
Learning Methods
Data Set
Parameter and System Choices
Method
Results
Conclusion
