Abstract

Parallel corpora are needed for many natural language processing tasks, such as machine translation and multilingual document classification. Parallel data for the English–Punjabi language pair is scarce, owing to the semantic differences between the two languages and to Punjabi being a low-resource language. In this paper, a parallel corpus for machine translation is created and evaluated using sentence alignment permutation metrics. Multiple translation corpora and human assessment together validate automatic evaluation metrics, which are important for the development of machine translation systems. The corpora considered are movie dialogues taken from Wikipedia dumps. Further, metrics that characterize the corpora more accurately are identified. The quality of the corpus is verified using performance measures based on distance metrics.

Keywords: Natural language processing (NLP); Natural language understanding (NLU); Corpus; Punjabi; English; Sentence alignment; Application programming interface (API)
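To make the idea of a distance-based permutation metric concrete, the following is a minimal sketch. The abstract does not say which metrics the paper uses, so the choice of a normalized Kendall tau distance and a normalized Hamming distance is an assumption made for illustration, as are the function names and the representation of an alignment as a permutation of source sentence indices.

```python
from itertools import combinations


def kendall_tau_distance(perm):
    """Fraction of index pairs that appear in inverted order.

    `perm` is assumed to be a permutation of 0..n-1, where perm[i] is the
    position of the target sentence aligned to source sentence i.
    Returns 0.0 for a perfectly monotone alignment and 1.0 for a fully
    reversed one.
    """
    n = len(perm)
    if n < 2:
        return 0.0
    discordant = sum(
        1 for i, j in combinations(range(n), 2) if perm[i] > perm[j]
    )
    return discordant / (n * (n - 1) / 2)


def hamming_distance(perm):
    """Fraction of sentences that are not aligned to their own position."""
    n = len(perm)
    if n == 0:
        return 0.0
    return sum(1 for i, p in enumerate(perm) if i != p) / n


# Hypothetical alignment: source sentences 1 and 2 are swapped on the target side.
alignment = [0, 2, 1, 3]
print(kendall_tau_distance(alignment))
print(hamming_distance(alignment))
```

Under these assumptions, a lower distance indicates that the aligned sentence order stays closer to the source order. Kendall tau penalizes each inverted pair, while Hamming counts every displaced sentence equally.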

