alineR: an R Package for Optimizing Feature-Weighted Alignments and Linguistic Distances

Sean S. Downey, Peter Norquest, Guowei Sun

doi:10.32614/rj-2017-005

Abstract

Linguistic distance measurements are commonly used in anthropology and biology when quantitative and statistical comparisons between words are needed. This need arises, for example, when linguistic and genetic data are compared. Such comparisons can provide insight into historical population patterns and into evolutionary processes more generally. However, the most commonly used linguistic distances are derived from edit distances, which do not weight phonetic features that may, for example, represent smaller-scale patterns in linguistic evolution. Thus, computational methods for calculating feature-weighted linguistic distances are needed for linguistic, biological, and evolutionary applications; additionally, the linguistic distances presented here are generic and may have broader applications in fields such as text mining and search. To facilitate similar research, we are making alineR available as an open-source R software package that performs feature-weighted linguistic distance calculations. The package includes a supervised learning methodology that uses a genetic algorithm and manually determined alignments to estimate 13 linguistic parameters, including feature weights and a skip penalty. Here we present the package and use it to demonstrate a supervised learning methodology that estimates the optimal linguistic parameters for a sample of Austronesian languages. Our results show that the methodology can estimate these parameters for both simulated and real language data, that optimizing feature weights improves alignment accuracy by approximately 29%, and that optimizing these parameters affects the resulting distance measurements. Availability: alineR is available on CRAN.

Highlights

Human speech patterns change through time in response to both cultural and demographic processes of speech communities such as migration and social contact.
The Levenshtein distance is parsimonious and robust, and it has been found to correlate with perceptions of dialectal distances (Gooskens and Heeringa, 2004); feature-based alignment approaches have been found to be a complementary approach to calculating linguistic distances (Kondrak, 2000).
We present simple instructions for basic alignment operations; for users who only want to calculate linguistic distances using this alternative to the Levenshtein distance, these instructions may be sufficient.
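
To make the contrast between a plain edit distance and a feature-weighted one concrete, here is a minimal Python sketch. It is not the ALINE algorithm and does not use the alineR API: the feature table, the weights, and the function names are all hypothetical and chosen only for illustration. The idea it shows is that an unweighted edit distance charges the same cost for every substitution, whereas a feature-weighted cost can treat phonetically close segments as cheaper to exchange.

```python
# Hypothetical toy feature table (illustrative values, not ALINE's).
FEATURES = {
    "p": {"place": 1.0, "voice": 0.0, "nasal": 0.0, "syllabic": 0.0},
    "b": {"place": 1.0, "voice": 1.0, "nasal": 0.0, "syllabic": 0.0},
    "m": {"place": 1.0, "voice": 1.0, "nasal": 1.0, "syllabic": 0.0},
    "t": {"place": 0.85, "voice": 0.0, "nasal": 0.0, "syllabic": 0.0},
    "a": {"place": 0.0, "voice": 1.0, "nasal": 0.0, "syllabic": 1.0},
}

# Hypothetical feature weights; in a real analysis these would be estimated.
WEIGHTS = {"place": 4.0, "voice": 1.0, "nasal": 1.0, "syllabic": 4.0}


def sub_cost(a, b, weights=WEIGHTS, features=FEATURES):
    """Substitution cost in [0, 1]: weighted feature difference, normalized."""
    if a == b:
        return 0.0
    diff = sum(w * abs(features[a][f] - features[b][f]) for f, w in weights.items())
    return diff / sum(weights.values())


def edit_distance(s, t, sub, skip=1.0):
    """Dynamic-programming edit distance with a pluggable substitution cost."""
    prev = [j * skip for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * skip]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + skip,          # skip a segment of s
                           cur[j - 1] + skip,       # skip a segment of t
                           prev[j - 1] + sub(a, b)))  # match or substitute
        prev = cur
    return prev[-1]


def levenshtein(s, t):
    """Unweighted edit distance: every substitution costs 1."""
    return edit_distance(s, t, lambda a, b: 0.0 if a == b else 1.0)


def feature_weighted(s, t):
    """Edit distance whose substitutions are weighted by phonetic features."""
    return edit_distance(s, t, sub_cost)


for other in ("bat", "mat"):
    print("pat", other, levenshtein("pat", other), feature_weighted("pat", other))
```

Under these assumed features, both word pairs have the same Levenshtein distance, but the feature-weighted cost ranks "pat"/"bat" (a voicing difference only) as closer than "pat"/"mat" (voicing and nasality differ).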

Summary

Introduction

Human speech patterns change through time in response to both cultural and demographic processes of speech communities such as migration and social contact. We present a supervised learning methodology that uses manual alignment determinations and a genetic algorithm (GA) to estimate the optimal feature weights for any set of paired word data. We describe the genetic algorithm and illustrate with simple examples how to use it with supervised learning to optimize ALINE’s feature-weight parameters.
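
The supervised learning idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the package's implementation: it tunes four parameters (three hypothetical feature weights and a skip penalty) rather than the 13 the package estimates, the feature table and gold alignments are invented for the example, and the aligner is a simple global dynamic-programming alignment rather than ALINE. Fitness is the number of manually determined alignments that a parameter set reproduces.

```python
import random

# Hypothetical binary features for a few segments (illustrative only).
FEATURES = {
    "t": {"voice": 0, "nasal": 0, "syllabic": 0},
    "d": {"voice": 1, "nasal": 0, "syllabic": 0},
    "n": {"voice": 1, "nasal": 1, "syllabic": 0},
    "a": {"voice": 1, "nasal": 0, "syllabic": 1},
}
NAMES = ["voice", "nasal", "syllabic"]

# Invented gold standard: (word 1, word 2, matched index pairs).
GOLD = [
    ("nta", "da", [(1, 0), (2, 1)]),  # d corresponds to t, not n
    ("dna", "ta", [(0, 0), (2, 1)]),  # t corresponds to d, not n
]


def align(w1, w2, weights, skip):
    """Global alignment by dynamic programming; returns matched index pairs."""
    def sub(a, b):
        return sum(weights[f] for f in NAMES if FEATURES[a][f] != FEATURES[b][f])

    n, m = len(w1), len(w2)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * skip
    for j in range(1, m + 1):
        cost[0][j] = j * skip
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + sub(w1[i - 1], w2[j - 1]),
                             cost[i - 1][j] + skip,
                             cost[i][j - 1] + skip)
    # Trace back from the corner to recover which segments were paired.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if cost[i][j] == cost[i - 1][j - 1] + sub(w1[i - 1], w2[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + skip:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def fitness(chrom):
    """Number of gold alignments reproduced by this parameter vector."""
    weights, skip = dict(zip(NAMES, chrom[:-1])), chrom[-1]
    return sum(align(a, b, weights, skip) == gold for a, b, gold in GOLD)


def evolve(pop_size=30, generations=20, mutation_sd=0.1, seed=1):
    """Minimal GA: truncation selection, one-point crossover, Gaussian mutation."""
    rng = random.Random(seed)
    n_genes = len(NAMES) + 1  # feature weights plus the skip penalty
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)
            child = p1[:cut] + p2[cut:]
            children.append([min(1.0, max(0.0, g + rng.gauss(0, mutation_sd)))
                             for g in child])
        pop = parents + children
    return max(pop, key=fitness)


best = evolve()
print(dict(zip(NAMES + ["skip"], best)), fitness(best), "of", len(GOLD))
```

In this toy setup the gold alignments can only be reproduced when the voicing weight is smaller than both the nasal weight and twice the skip penalty, so the search is recovering a relationship among parameters from the manual alignments rather than a single fixed set of values.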

Journal: The R Journal	Publication Date: Jan 1, 2017
Citations: 14	License type: cc-by
