Text Language Identification and Translator

Tejas Pinge, Aditya Nandurkar, Prajwal Patil, Prof. Ravindra Chilbule, Mayur Sherki

doi:10.48175/ijarsct-14055

Abstract

Language identification is the process of detecting the language(s) of the text in a document, based on the script used for writing and on the diacritics particular to a language. The area has attracted researchers since as early as 1970 and remains active because of its varied applications and growing demand. In this work, I address the problem of detecting the language of textual documents. I introduce a method intended to detect the language of a text more efficiently and accurately: it determines the proportion of the text attributable to each candidate language and takes the language with the greatest proportion as the language of the text. I compare the performance of three approaches: word-wise n-grams, character-wise n-grams, and a combination of word search and stop-word detection. The project currently contains language models for 4 languages. On average, the accuracy of the program is about 96.5%.
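The proportion-based decision rule described in the abstract can be illustrated with a minimal Python sketch of the stop-word variant. This is not the authors' implementation: the abstract does not name the four supported languages, so the languages, the truncated stop-word lists, and the function names `language_proportions` and `detect_language` are all assumptions made for illustration.

```python
import re

# Hypothetical, heavily truncated stop-word lists. The paper does not name
# its four languages; these are chosen only to make the sketch runnable.
STOP_WORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "french": {"le", "la", "les", "et", "est", "de", "un", "une", "que"},
    "spanish": {"el", "los", "y", "es", "de", "un", "una", "que", "en", "la"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu"},
}


def language_proportions(text):
    """Return each language's share of all stop-word hits found in the text."""
    # Letters-only tokens (Unicode aware), lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    hits = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOP_WORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        return {lang: 0.0 for lang in STOP_WORDS}
    return {lang: count / total for lang, count in hits.items()}


def detect_language(text):
    """Pick the language with the greatest proportion, or None if no hits."""
    proportions = language_proportions(text)
    best = max(proportions, key=proportions.get)
    return best if proportions[best] > 0 else None


if __name__ == "__main__":
    print(detect_language("The cat is in the garden and it is happy"))
    print(detect_language("El gato es de la casa y los perros"))
```

Lists this short will misclassify brief or mixed-language input; a practical system would likely need much larger word lists or the n-gram models that the abstract also mentions.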

Full Text