Abstract
The success of many language modeling methods and applications relies heavily on the amount of data available. This problem is further exacerbated in statistical machine translation, where parallel data in the source and target languages is required. However, large amounts of data are available for only a small number of languages; as a result, many language modeling techniques are inadequate for the vast majority of languages. In this paper, we attempt to mitigate the lack of training data for low-resource languages by adding data from related high-resource languages in three experiments. First, we interpolate language models trained on the target language and on the related language. In our second experiment, we select the sentences most similar to the target language and add them to our training corpus. Finally, we integrate data from the related language into a translation model for a statistical machine translation application. Although we do not see many significant improvements over baselines trained on a small amount of data in the target language, we discuss some further experiments that could be attempted in order to augment language models and translation models with data from related languages.
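The first experiment, interpolating language models trained on the target and related languages, amounts to a weighted mixture of their probability estimates. The toy unigram models and mixing weight below are hypothetical stand-ins, not the paper's actual models or tuned weight; this is only a minimal sketch of linear interpolation.

```python
# Sketch of linearly interpolating two language models:
#   P(w) = lam * P_target(w) + (1 - lam) * P_related(w)
# The unigram probabilities and the weight lam are hypothetical.

def interpolate(p_target, p_related, lam):
    """Return the linearly interpolated unigram distribution."""
    vocab = set(p_target) | set(p_related)
    return {w: lam * p_target.get(w, 0.0) + (1 - lam) * p_related.get(w, 0.0)
            for w in vocab}

# Hypothetical unigram models for a target and a related language.
p_target = {"casa": 0.5, "perro": 0.5}
p_related = {"casa": 0.3, "gato": 0.7}

mixed = interpolate(p_target, p_related, lam=0.8)
print(round(mixed["casa"], 2))  # 0.8*0.5 + 0.2*0.3 = 0.46
```

In practice the weight would be tuned on held-out target-language data; with full n-gram models the same mixture is applied to each conditional probability rather than to unigrams.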
Highlights
Statistical language modeling methods are an essential part of many language processing applications, including automatic speech recognition (Stolcke, 2002), machine translation (Kirchhoff and Yang, 2005), and information retrieval (Liu and Croft, 2005).
This problem can be seen as a special case of domain adaptation, with the in-domain data being the data in the target language and the out-of-domain data being the data in the related language (Nakov and Ng, 2012).
We observed that a closely related language cannot be used to aid in modeling a low-resource language without being properly transformed.
Summary
Statistical language modeling methods are an essential part of many language processing applications, including automatic speech recognition (Stolcke, 2002), machine translation (Kirchhoff and Yang, 2005), and information retrieval (Liu and Croft, 2005). Their success is heavily dependent on the availability of suitably large text resources for training (Chen and Goodman, 1996). Domain adaptation is often used to leverage resources for a specific domain, such as biomedical text, from more general domains like newswire data (Dahlmeier and Ng, 2010). This idea can be applied to SMT, where the data from the related language serves as the out-of-domain data. We attempt to select the best out-of-domain data using perplexity, similar to what was done in Gao et al. (2002).