Abstract

In recent years, Native Language Identification (NLI) has attracted significant interest in computational linguistics. NLI uses an author’s speech or writing in a second language to infer their native language. It has applications in forensic linguistics, language teaching, second language acquisition, authorship attribution, and the identification of spam emails or phishing websites. Conventional pairwise string comparison techniques are computationally expensive and time-consuming. This paper presents fast NLI techniques based on string kernels, namely the spectrum, presence bits, and intersection kernels, combined with different learners: a Support Vector Machine (SVM), Random Forest (RF), and Extreme Gradient Boosting (XGBoost, XGB). Feature sets for the proposed techniques are generated from different combinations of features such as word n-grams and noun phrases. Experimental analyses are carried out on 8,235 English-as-a-second-language articles from 10 different linguistic backgrounds, drawn from a standard NLP benchmark dataset. The experimental results show that the proposed NLI technique incorporating a spectrum string kernel with an RF classifier outperformed existing character n-gram string kernels incorporating SVM, RF, and XGB classifiers. Comparable results were observed across the different string-kernel combinations. Notably, the RF classifier outperformed the SVM and XGB classifiers across different feature sets. All the proposed NLI techniques demonstrated promising results with substantial improvements in training time; the best configuration reduced training time by more than 95 percent. The reduced training time makes the proposed techniques well suited to scaling NLI applications for production.
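The abstract does not give the kernel computations themselves; as a minimal illustrative sketch (assuming simple whitespace tokenization, word-level n-grams, and no kernel normalization, none of which are specified here), the three string kernels named above can be computed from n-gram counts as follows:

```python
from collections import Counter

def word_ngrams(text, n):
    """Count word n-grams in a text (illustrative whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def spectrum_kernel(a, b, n=2):
    """Spectrum kernel: sum over shared n-grams of the product of their counts."""
    ca, cb = word_ngrams(a, n), word_ngrams(b, n)
    return sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())

def presence_bits_kernel(a, b, n=2):
    """Presence-bits kernel: number of distinct n-grams occurring in both texts."""
    return len(word_ngrams(a, n).keys() & word_ngrams(b, n).keys())

def intersection_kernel(a, b, n=2):
    """Intersection kernel: sum over shared n-grams of the minimum count."""
    ca, cb = word_ngrams(a, n), word_ngrams(b, n)
    return sum(min(ca[g], cb[g]) for g in ca.keys() & cb.keys())

if __name__ == "__main__":
    x = "the student wrote the essay in english"
    y = "the student wrote an essay"
    print(spectrum_kernel(x, y), presence_bits_kernel(x, y), intersection_kernel(x, y))
```

In practice, the pairwise kernel values would be assembled into a kernel (similarity) matrix over the training articles and passed to a learner such as an SVM, RF, or XGB classifier.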
