Unsupervised stemmed text corpus for language modeling and transcription of Telugu broadcast news

Mythilisharan Pala, Venkataramana Appala, Laxminarayana Parayitam

doi:10.1007/s10772-020-09749-0

Abstract

In Indian languages, root words are combined or modified to match the context with respect to tense, number, and/or gender, so the number of unique words is larger than in many European languages. However large it is, a text corpus used for language modeling cannot contain every possible inflected word. A word that occurs during testing but not in the training data is called an out-of-vocabulary (OOV) word. Similarly, the text corpus cannot contain every possible sequence of words. Because of this data sparsity, an automatic speech recognition (ASR) system may not accommodate all words in its language model, irrespective of the size of the text corpus. Language modeling also becomes computationally challenging when the volume of data grows exponentially because of morphological changes to root words. To reduce the number of OOV words in the language model, this paper proposes a new unsupervised stemming method for one Indian language, Telugu, based on a method proposed for Hindi. Other issues in language modeling for Telugu are also analyzed, using techniques such as smoothing and interpolation with supervised and unsupervised stemmed data. The Witten–Bell and Kneser–Ney smoothing techniques are observed to perform well compared with other techniques on pre-processed data with supervised learning. ASR accuracy improves by 0.76% and 0.94% with supervised and unsupervised stemming, respectively.
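
The effect of stemming on the OOV rate can be illustrated with a minimal sketch. This is not the stemming method proposed in the paper: the suffix list, the romanized toy tokens, and the names `toy_stem` and `oov_rate` are all assumptions made for illustration. The sketch only shows why mapping inflected forms to a shared stem lets test words match training words.

```python
from typing import Iterable, Sequence

# Hypothetical suffix inventory, for illustration only; an unsupervised
# stemmer would induce word splits from corpus statistics instead.
TOY_SUFFIXES: Sequence[str] = ("lo", "ki", "ni")


def toy_stem(word: str, suffixes: Sequence[str] = TOY_SUFFIXES,
             min_stem_len: int = 3) -> str:
    """Strip the longest listed suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def oov_rate(train_tokens: Iterable[str], test_tokens: Iterable[str]) -> float:
    """Fraction of test tokens whose form never appears in the training tokens."""
    vocab = set(train_tokens)
    tokens = list(test_tokens)
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok not in vocab) / len(tokens)


# Toy romanized tokens (assumed data, not taken from the paper's corpus).
train_words = ["badilo", "uruki", "pustakam"]
test_words = ["badiki", "urulo", "pustakam"]

raw_rate = oov_rate(train_words, test_words)
stemmed_rate = oov_rate([toy_stem(w) for w in train_words],
                        [toy_stem(w) for w in test_words])
print(f"raw OOV rate: {raw_rate:.2f}, stemmed OOV rate: {stemmed_rate:.2f}")
```

In this toy case two of the three test words are unseen in their inflected form, but all three match once both sides are stemmed. A real system would also need to handle the stripped suffixes, for example by modeling them as separate tokens, which this sketch does not attempt.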

Journal: International Journal of Speech Technology
Publication Date: Sep 1, 2020
Citations: 3

Similar Papers

An Investigation of Multilingual TDNN-BLSTM Acoustic Modeling for Hindi Speech Recognition
Ankit Kumar ... Rajesh Kumar Aggarwal
International Journal of Sensors, Wireless Communications and Control | VOL. 12
01 Jan 2021

Topic-Dependent Language Model with Voting on Noun History
Welly Naptali ... Seiichi Nakagawa
ACM Transactions on Asian Language Information Processing | VOL. 9
01 Jun 2010

Native Language Identification from Spoken Indian English
...
Trends in Electrical Engineering | VOL. 9
30 Oct 2019

Adaptive Speech Recognition System Using Data From Keyboard
Sai Ajay Modukuri ... Gaurav Kumar
01 Nov 2018
