Abstract

Parallel corpora are a critical resource for various Natural Language Processing (NLP) tasks, including neural machine translation (NMT). Translating low-resource languages with NMT models requires a sizable, high-quality parallel corpus. This paper presents the design and experimental evaluation of a method for extracting parallel texts in multiple languages from websites using web scraping. Some websites provide Application Programming Interfaces (APIs) for extracting their content, but many do not; in such cases, web scraping techniques must be applied. Moreover, scraping textual data as-is is relatively straightforward, whereas identifying and scraping parallel texts in multiple languages is more challenging, involving language identification, mapping, and filtering. We explain how we implemented a web scraping method to gather parallel text data from websites, and we describe a structured methodology and techniques for identifying parallel text data. We also explain how the parallel text data is cleaned to produce a high-quality corpus suitable for use in NMT.
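The mapping and filtering steps mentioned above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's exact method: the function names and the length-ratio threshold are assumptions, and real pipelines typically add language identification and deduplication on top of such filters.

```python
# Hypothetical sketch of pairing scraped texts and filtering misaligned pairs.
# Assumption: src_texts and tgt_texts are already aligned by page order;
# the length-ratio threshold (2.0) is an illustrative value, not from the paper.

def length_ratio_ok(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    """Reject pairs whose word-count ratio suggests misalignment."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio

def build_parallel_corpus(src_texts, tgt_texts):
    """Map source and target texts pairwise, then filter noisy pairs."""
    return [
        (s.strip(), t.strip())
        for s, t in zip(src_texts, tgt_texts)
        if length_ratio_ok(s, t)
    ]

# Example: the second pair is dropped because its lengths differ too much.
src = ["hello world how are you", "a"]
tgt = ["bonjour le monde comment allez vous", "one two three four five six"]
pairs = build_parallel_corpus(src, tgt)
```

A length-ratio filter of this kind is a common first-pass cleaning step for web-mined bitext; stricter pipelines follow it with sentence-level language identification and alignment scoring.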
