Abstract

Handling out-of-vocabulary (OOV) words, or unseen words, is one of the main challenges for machine translation (MT) and automatic speech recognition (ASR) systems. For morphologically rich languages with a high type-token ratio, the OOV percentage is also quite high. Sub-word segmentation is one of the major approaches to dealing with OOVs. In this paper we present a hybrid sub-word segmentation algorithm for handling OOVs, together with a methodology for evaluating sub-word segmentation. We also compare the results of our segmentation approach with those of some popular sub-word segmentation algorithms. Malayalam is a morphologically rich, low-resource Indic language with a very high type-token ratio. All experiments are performed on a conversational code-switched Malayalam-English corpus.
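For readers unfamiliar with the two quantities mentioned above, the following minimal sketch shows how a type-token ratio and a token-level OOV percentage are commonly computed. It is an illustration only, not the evaluation methodology of the paper: the function names, the whitespace tokenization, and the toy English sentences are all assumptions made for this example.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training tokens."""
    if not test_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy data (hypothetical); a real corpus would need proper tokenization.
train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()

print(f"TTR of training text: {type_token_ratio(train):.3f}")
print(f"OOV percentage on test text: {100 * oov_rate(train, test):.1f}%")
```

In a morphologically rich language, many inflected and compounded forms of the same stem count as separate types, which raises both numbers relative to a language such as English.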
