Hybrid sub-word segmentation for handling long tail in morphologically rich low resource languages

Sreeja Manghat,Tanja Schultz,Sreeram Manghat

doi:10.1109/icassp43922.2022.9746652

Abstract

Dealing with Out Of Vocabulary (OOV) words or unseen words is one of the main issues of Machine Translation (MT) as well as automatic speech recognition (ASR) systems. For morphologically rich languages having high type token ratio, the OOV percentage is also quite high. Sub-word segmentation has been found to be one of the major approaches in dealing with OOVs. In this paper we present a hybrid sub-word segmentation algorithm to deal with OOVs. A sub-word segmentation evaluation methodology is also presented. We also present results of our segmentation approach in comparison to some of the popular sub-word segmentation algorithms. Malayalam is a morphological rich low resource Indic language with very high type token ratio. All the experiments are done for conversational code-switched Malayalam-English corpus.

Full Text