Abstract

Neural Machine Translation (NMT) is crucial for Cross-Lingual Information Retrieval (CLIR), and it is effective for translating English-language queries into Telugu. In this paper, we translate English queries to Telugu using an NMT system trained on a parallel corpus. Because Telugu is a resource-poor language, it is difficult to supply NMT with a large parallel corpus, which leads to the Out-Of-Vocabulary (OOV) problem. To overcome this problem, Byte Pair Encoding (BPE) is used together with Long Short-Term Memory (LSTM) networks: BPE segments rare words into sub-words so that the model can still translate them. The system nevertheless struggles with named entities. Some of these Named Entity Recognition (NER) issues can be addressed by using bidirectional LSTMs (BiLSTMs) in the sequence-to-sequence model, which process the input in both directions and help the system recognize named entities. Accuracy measures and the BLEU score show that the translation quality of NMT with BiLSTMs is slightly, but noticeably, higher than that of NMT with regular LSTMs.
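To illustrate the architectural change the abstract describes, the following is a minimal sketch (not the authors' implementation) of a bidirectional LSTM encoder for a sequence-to-sequence NMT model in PyTorch. The vocabulary size, dimensions, and toy batch of BPE sub-word ids are assumptions chosen purely for illustration.

# Sketch of a BiLSTM encoder for a seq2seq NMT model (illustrative only).
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # bidirectional=True reads the English source (as BPE sub-words)
        # left-to-right and right-to-left, giving the encoder context on
        # both sides of rare tokens such as named entities.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Project the concatenated forward/backward states back to
        # hidden_dim so a unidirectional Telugu decoder can consume them.
        self.bridge = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, src_ids):
        embedded = self.embedding(src_ids)         # (batch, src_len, emb_dim)
        outputs, (h, c) = self.lstm(embedded)      # (batch, src_len, 2*hidden_dim)
        # Combine the final forward and backward hidden states.
        h_cat = torch.cat([h[-2], h[-1]], dim=-1)  # (batch, 2*hidden_dim)
        return outputs, torch.tanh(self.bridge(h_cat))

# Toy usage: a batch of two source sentences already segmented into
# BPE sub-word ids (the ids here are made up for this sketch).
encoder = BiLSTMEncoder(vocab_size=8000)
src = torch.tensor([[5, 42, 17, 3], [7, 99, 2, 0]])
enc_outputs, dec_init = encoder(src)
print(enc_outputs.shape, dec_init.shape)  # torch.Size([2, 4, 1024]) torch.Size([2, 512])

Swapping this encoder for a unidirectional LSTM (bidirectional=False) reproduces the baseline the paper compares against; the decoder and BPE preprocessing stay the same.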
