Abstract
MOTIVATION Most NLP applications assume that a particular language is homogeneous across the regions where it is spoken. However, each language varies considerably throughout its geographical distribution. When dialectal variation is significant, the effectiveness of oral and written communication can be substantially affected. To make NLP sensitive to dialects, a reliable, representative and up-to-date source of information that quantitatively represents such variation is necessary. PROBLEM Some of the current approaches have disadvantages such as the subjectivity of the regions found, the need for parameters, disregard for the geographical coordinates in the analysis, and the lack of a statistical test for the existence of the identified dialectal regions. METHOD Detection of ecotones is an analogous problem in the field of ecology that focuses on the detection of boundaries in ecosystems instead of regions, which facilitates the construction of statistical tests. We adapted a popular ecotone detection technique called “wombling” to the detection of dialectal boundaries, using the Hilbert-Schmidt independence criterion (HSIC) as the underlying non-parametric statistical test. In addition to dealing with the aforementioned drawbacks, the use of HSIC provides robustness against non-linearities present in the linguistic and geographical variables. The proposed method was applied to a large corpus of Spanish tweets produced in 250 locations in Colombia through the analysis of unigram features. RESULTS The resulting dialectal boundaries (i.e. dialectones) proved to be meaningful and spatially correlated with regions identified by other authors using classic dialectology. CONCLUSION We conclude that the automatic detection of dialectones is a convenient alternative to classical methods in dialectology.
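As an illustration of the kind of test the METHOD refers to, the following is a minimal sketch of an HSIC-based permutation test of independence between two sets of observations, for example geographical coordinates and a linguistic feature. It is not the authors' implementation and does not reproduce the wombling adaptation. The function names (`rbf_gram`, `hsic`, `hsic_permutation_test`), the Gaussian kernel, the median-distance bandwidth and the permutation scheme are all assumptions made for this example.

```python
import numpy as np


def rbf_gram(x, sigma=None):
    """Gaussian (RBF) Gram matrix of the rows of x.

    Assumption: when sigma is not given, the bandwidth is set to the
    median of the non-zero pairwise distances (a common heuristic).
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq = np.sum(x ** 2, axis=1)
    # Squared Euclidean distances, clipped to remove tiny negative values.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    if sigma is None:
        positive = d2[d2 > 0]
        sigma = np.sqrt(np.median(positive)) if positive.size else 1.0
    return np.exp(-d2 / (2.0 * sigma ** 2))


def hsic(K, L):
    """Biased empirical HSIC: trace(K H L H) / n^2, H being the centering matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2


def hsic_permutation_test(x, y, n_perm=500, seed=0):
    """Return (observed HSIC, permutation p-value) for paired samples x and y.

    The null distribution is approximated by shuffling the pairing
    between x and y, which amounts to permuting rows and columns of L.
    """
    rng = np.random.default_rng(seed)
    K, L = rbf_gram(x), rbf_gram(y)
    observed = hsic(K, L)
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(K.shape[0])
        if hsic(K, L[np.ix_(p, p)]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

In this sketch, `x` could hold the coordinates of locations on either side of a candidate boundary and `y` a unigram frequency at those locations. A small p-value would suggest that the linguistic variable depends on location, including through non-linear relationships that a linear correlation test could miss.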