Identification of language from multi-lingual dataset using classification algorithms

N Abinaya, P Jayadharshini, S Priyanka, S Keerthika, S Santhiya

doi:10.1088/1742-6596/2664/1/012009

Abstract

The process of automatically identifying the language used in a text or document is known as language identification. It can be a critical early step in Natural Language Processing (NLP), because the language of a text must be known before any further processing can be applied to it. The task is to build a model that predicts the language of a given text from the text itself, which supports many applications in computational linguistics and Artificial Intelligence (AI). In this study, the language of the provided text was identified using machine learning algorithms and vectorization techniques. The performance of several classification algorithms, namely Naïve Bayes, Logistic Regression, Decision Tree and Random Forest, has been compared and analyzed. Vectorization was used to convert the text into a numerical matrix. This paper presents a comparison of these algorithms across various performance measures.
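
The pipeline summarized in the abstract (vectorize the text, then classify it) can be illustrated with a minimal sketch. It assumes character bigram counts as the vectorization and a multinomial Naïve Bayes classifier with add-one smoothing. The abstract does not specify the vectorizer or the dataset, so the toy sentences and the names `char_ngrams` and `NaiveBayesLanguageId` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (assumed vectorization)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageId:
    """Multinomial Naive Bayes over character n-gram counts (illustrative only)."""

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)  # per-language n-gram counts
        self.doc_counts = Counter(labels)         # documents per language
        self.vocab = set()                        # all n-grams seen in training
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus add-one (Laplace) smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, made up for illustration (not the paper's dataset).
texts = [
    "the weather is nice today",
    "where is the train station",
    "el tiempo es agradable hoy",
    "donde esta la estacion de tren",
    "le temps est agréable aujourd'hui",
    "où est la gare s'il vous plaît",
]
labels = ["English", "English", "Spanish", "Spanish", "French", "French"]

model = NaiveBayesLanguageId().fit(texts, labels)
```

Calling `model.predict(some_text)` returns the label with the highest posterior score. In practice a library implementation and a much larger corpus would be used, and the other classifiers named in the abstract could be trained on the same count matrix for comparison.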

Full Text