Abstract

Transliterating text from its native script into a foreign script is called forward transliteration; converting it back to the original script is called backward transliteration. In this work, we perform both forward and backward transliteration on Punjabi: we transliterate Punjabi person names from Gurmukhi script to Roman (English) script, and from Roman script back to Gurmukhi, using an n-gram language model. The training corpus contains more than one million parallel person-name entities in Gurmukhi and Roman script, from which we generated English-to-Punjabi and Punjabi-to-English n-gram databases. To improve results, we created n-grams that are as long as possible, ranging from bigrams to 30-grams. Our n-gram database contains more than 10 million n-grams, each with multiple mappings in the other script. The most challenging part of building the databases is finding the mapping of a given n-gram within the parallel name entity. Under the orthographic rules, the same combination of letters may be pronounced differently depending on its position in the word. We therefore categorized n-grams as starting, middle, or ending n-grams and used them accordingly during transliteration. The transliteration process works like merge sort: we search the database for the longest possible n-gram, splitting the string recursively until a match is found, and then merge the transliterated pieces to form the final output. For English-to-Punjabi transliteration, we achieved 96% accuracy against the gold standard and 99.14% accuracy using minimum edit distance. For Punjabi-to-English transliteration, the corresponding accuracies were 96.85% and 99.35%.
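
The recursive lookup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the halving split, the rule that a fragment touching the start of the word is looked up as a starting n-gram, the single mapping per n-gram, the pass-through of unknown characters, and every name and table entry (`TOY_DB`, `category`, `transliterate`, the bracketed placeholder outputs) are assumptions made for illustration only.

```python
from typing import Dict

# Hypothetical toy database: position category -> source n-gram -> target string.
# Bracketed placeholders stand in for real Gurmukhi output; the abstract says each
# n-gram has multiple mappings, but this sketch keeps only one per n-gram.
TOY_DB: Dict[str, Dict[str, str]] = {
    "start": {"su": "[SU]", "har": "[HAR]", "jit": "[JIT-]"},
    "middle": {"kh": "[KH]"},
    "end": {"deep": "[DEEP]", "jit": "[-JIT]"},
}


def category(at_start: bool, at_end: bool) -> str:
    """Pick the n-gram category for a fragment (assumed rule: start wins)."""
    if at_start:
        return "start"
    if at_end:
        return "end"
    return "middle"


def transliterate(
    text: str,
    db: Dict[str, Dict[str, str]],
    at_start: bool = True,
    at_end: bool = True,
) -> str:
    """Look up the whole fragment; on a miss, halve it and recurse, then merge."""
    if not text:
        return ""
    hit = db[category(at_start, at_end)].get(text)
    if hit is not None:
        return hit
    if len(text) == 1:
        # Assumed fallback: pass an unknown single character through unchanged.
        return text
    mid = len(text) // 2
    # The left half keeps the start flag, the right half keeps the end flag.
    left = transliterate(text[:mid], db, at_start, False)
    right = transliterate(text[mid:], db, False, at_end)
    return left + right


print(transliterate("sukhdeep", TOY_DB))  # [SU][KH][DEEP]
print(transliterate("harjit", TOY_DB))    # [HAR][-JIT]
```

In this toy table, "jit" maps to a different placeholder as a starting n-gram than as an ending n-gram, which mirrors the position-dependent categories described in the abstract. A fixed halving split is only one way to read "works like merge sort"; a split around the longest matching n-gram would be another.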
