Abstract

Many language identification (LID) systems rely on language models built with machine learning (ML) approaches, and such systems typically require rather long recording periods to achieve satisfactory accuracy. This study aims to extract enough information from short recording intervals to successfully classify the spoken languages under test. Classification is based on frames of 2–18 seconds, whereas most previous LID systems were based on much longer time frames (from 3 seconds to 2 minutes). This research defined and implemented many low-level features using MFCC (Mel-frequency cepstral coefficients). The data source is voxforge.org, an open-source corpus of user-submitted audio clips in various languages, from which speech files in five languages (English, French, German, Italian, Spanish) were taken. A CNN (Convolutional Neural Network) algorithm was applied for classification, and the results were excellent: binary language classification achieved an accuracy of 100%, and classification across the five languages achieved an accuracy of 99.8%.
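The MFCC feature extraction the abstract refers to (framing and windowing, power spectrum, mel filterbank, log compression, DCT) can be sketched in plain numpy. This is a minimal illustration, not the authors' implementation; the parameter values (16 kHz sample rate, 512-point FFT, 26 mel filters, 13 coefficients) are common defaults assumed here, not taken from the paper:

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    # 1. Slice the signal into overlapping frames and apply a Hamming window.
    window = np.hamming(n_fft)
    frames = np.array([signal[s:s + n_fft] * window
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Triangular mel filterbank spanning 0 Hz .. sr/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # 4. Log mel energies, then DCT-II to decorrelate -> cepstral coefficients.
    mel_energy = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n + 0.5)[None, :] * np.arange(n_ceps)[:, None])
    return mel_energy @ dct.T  # shape: (num_frames, n_ceps)
```

In practice a library such as librosa or python_speech_features would compute these coefficients; the sketch only makes the pipeline's steps explicit.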

Highlights

  • Humans are currently the most accurate language recognition system on the planet and can detect whether a language is their mother tongue within seconds of hearing it

  • This paper aims to extract enough information from short recording intervals to successfully classify the spoken languages under test

  • The classification is based on frames of (2–18) seconds, whereas most of the previous language classification systems are based on much longer time frames



Introduction

Humans are currently the most accurate language recognition system on the planet and can detect whether a language is their mother tongue within seconds of hearing it. If it is a language they are unfamiliar with, they can often draw subjective comparisons with a language they do know to elucidate hidden knowledge [1]. The neural network is fed raw audio input, with spectrograms developing as each impulse is passed into it. Another benefit is that the technique can accurately classify brief audio samples (approximately 2–18 seconds), which is critical for voice assistants that need to detect the language as soon as the speaker starts speaking [4]. All activation functions are of the ReLU type, except for the last layer, which is of the SoftMax type, useful for producing probability outputs.
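The ReLU hidden activations and SoftMax output layer described above can be illustrated with a minimal numpy sketch. The score vector below is hypothetical (it stands in for the network's final-layer logits over the five VoxForge languages) and is not taken from the paper:

```python
import numpy as np

def relu(x):
    # ReLU, used in the hidden layers: negative activations are zeroed.
    return np.maximum(x, 0.0)

def softmax(logits):
    # SoftMax, used in the last layer: converts raw class scores
    # into a probability distribution over the languages.
    z = logits - np.max(logits)  # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical final-layer scores for (English, French, German, Italian, Spanish).
scores = np.array([2.1, 0.3, -1.0, 0.5, -0.4])
probs = softmax(scores)
predicted_language = int(np.argmax(probs))
```

Because SoftMax outputs sum to one, the network's prediction can be read directly as the most probable language, which is why it suits the final layer of a classifier.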

