Abstract

Statistical and neural machine translation techniques depend on the parallel data used to train their models. The general belief is that more training data results in better models. We studied the available corpora and the noise present in them, and found the available data to be highly noisy. This paper describes the types of noise present in the corpora and how they are identified. Several noise filters were developed, and normalization processes were applied to the corpora. Statistical and neural machine translation models were then trained to study the impact of cleaning noisy data. We performed experiments with the noisy data and with the cleaned data obtained by discarding noisy segments from the training corpus. The standard WMT-14 test set was used for evaluation, and translation quality was measured with BLEU scores. We observed that, even after a significant volume of noisy data was discarded, models trained on the cleaned corpus performed better than models trained on the noisy corpus. This indicates that data quality has a significant impact and that merely having a large volume of uncleaned data is not a good choice. The test case presented here is for the English-Hindi language pair. It also suggests that, for low-resource language pairs, paying attention to data quality brings returns in the form of better translation performance. As the types of noise discussed in the paper are general in nature, the findings should also hold for other Indian language pairs, owing to the inherent similarity among Indian languages.
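To make the idea of filtering a parallel corpus concrete, the following is a minimal sketch in Python. It is not the paper's implementation: the abstract does not name the individual filters, so the checks used here (empty segments, length-ratio mismatch, a Hindi side with too little Devanagari text, exact duplicates), the thresholds, and all function names are illustrative assumptions.

```python
import unicodedata


def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def devanagari_ratio(text):
    """Fraction of alphabetic characters that lie in the Devanagari block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    devanagari = sum(1 for ch in letters if "\u0900" <= ch <= "\u097F")
    return devanagari / len(letters)


def is_noisy(en, hi, max_len_ratio=3.0, min_script_ratio=0.5):
    """Return True if an English-Hindi pair trips any hypothetical filter.

    The thresholds are assumptions chosen for illustration only.
    """
    en_len, hi_len = len(en.split()), len(hi.split())
    # Filter 1: one side is empty or whitespace only.
    if en_len == 0 or hi_len == 0:
        return True
    # Filter 2: token counts differ too much to be a plausible translation.
    if max(en_len, hi_len) / min(en_len, hi_len) > max_len_ratio:
        return True
    # Filter 3: the Hindi side is mostly not written in Devanagari.
    if devanagari_ratio(hi) < min_script_ratio:
        return True
    return False


def clean_corpus(pairs):
    """Normalize each pair, drop noisy pairs, and drop exact duplicates."""
    seen = set()
    kept = []
    for en, hi in pairs:
        en, hi = normalize(en), normalize(hi)
        if is_noisy(en, hi) or (en, hi) in seen:
            continue
        seen.add((en, hi))
        kept.append((en, hi))
    return kept


sample = [
    ("Hello world", "नमस्ते दुनिया"),
    ("Hello   world", "नमस्ते दुनिया"),   # duplicate after normalization
    ("Hello", ""),                        # empty target side
    ("Hello world", "Hello world"),       # untranslated target side
    ("Hi", "यह एक बहुत लंबा वाक्य है जो मेल नहीं खाता"),  # length mismatch
]
cleaned = clean_corpus(sample)
print(len(sample), "pairs in,", len(cleaned), "pairs kept")
```

In a setup like the one described, models would be trained once on the original pairs and once on the output of a cleaning step such as `clean_corpus`, and the two would be compared by BLEU on the same test set.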
