Abstract

Urdu written in the English alphabet for communication is known as Roman Urdu. Roman Urdu and Urdu are pronounced the same but differ in spelling and in the shapes of their letters. A survey reports that 300 million people speak Urdu, with about 11 million speakers in Pakistan, most of whom prefer Roman Urdu for textual communication. Most modern technologies, such as computers and mobile phones, use the English script, so local Urdu users have to type Urdu with English letters, that is, in Roman Urdu. In this research, a Roman Urdu to Urdu Translator (RUTUT) is proposed that consists of preprocessing methods, rule-based character substitution, and Unicode-based character mapping techniques. It transliterates messages or descriptions from the Roman Urdu script to the Urdu script, which may help Urdu speakers express their messages efficiently. The focus of this research is to analyze the issues related to Roman Urdu to Urdu script transliteration and to develop a translator based on the concepts of transliteration. The research analyzed Roman Urdu data and identified different rule-based character substitution techniques that transform Roman Urdu into Urdu script at a fundamental level. The work was carried out in the Python programming language, using Jupyter Notebook within the Anaconda distribution, and a user-friendly Graphical User Interface (GUI) was created with the Tkinter library. To evaluate RUTUT, different translation tests were performed and the results were compared with those of the well-known Google online translator and the ijunoon online transliteration tool. The analysis of the results shows that the proposed RUTUT approach translates more accurately than the Google online translator and the ijunoon online transliteration tool.

Highlights

  • Multilingual content has grown rapidly on the internet over the last decade

  • In this research, a Roman Urdu to Urdu script translational (RUTUT) model is developed, as shown in Figure 19, consisting of rule-based character substitution and Unicode-based character mapping techniques

  • At the initial stage, when the user gives Roman Urdu script as input, preprocessing rules are applied to filter out unnecessary data
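
The three stages named above (preprocessing, rule-based character substitution, and Unicode-based character mapping) could be sketched roughly as follows in Python. This is only a minimal illustration under stated assumptions, not the RUTUT implementation: the function names, the four digraph rules, and the small character table are hypothetical and are not taken from the paper's rule set.

```python
import re

# Hypothetical two-letter substitution rules (illustrative only,
# not the rule set identified in the paper).
DIGRAPH_RULES = {
    "kh": "\u062E",  # ARABIC LETTER KHAH
    "sh": "\u0634",  # ARABIC LETTER SHEEN
    "ch": "\u0686",  # ARABIC LETTER TCHEH
    "gh": "\u063A",  # ARABIC LETTER GHAIN
}

# Hypothetical single-letter table keyed by Roman letter, with Unicode
# code points from the Arabic block as values (a small assumed subset).
CHAR_MAP = {
    "a": "\u0627",  # ALEF
    "b": "\u0628",  # BEH
    "t": "\u062A",  # TEH
    "d": "\u062F",  # DAL
    "r": "\u0631",  # REH
    "s": "\u0633",  # SEEN
    "l": "\u0644",  # LAM
    "m": "\u0645",  # MEEM
    "n": "\u0646",  # NOON
}


def preprocess(text):
    """Lowercase the input, drop non-letter characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def transliterate_word(word):
    """Apply digraph rules first, then fall back to single-letter mapping."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPH_RULES:
            out.append(DIGRAPH_RULES[pair])
            i += 2
        else:
            # Letters missing from the table are passed through unchanged.
            out.append(CHAR_MAP.get(word[i], word[i]))
            i += 1
    return "".join(out)


def transliterate(text):
    """Preprocess the text, then transliterate it word by word."""
    return " ".join(transliterate_word(w) for w in preprocess(text).split())
```

With these assumed tables, `transliterate("Khan!")` yields the three letters khah, alef, noon. The sketch maps letter by letter, so it writes every Roman vowel it has an entry for; standard Urdu spelling usually omits short vowels, which is one reason a practical rule set has to be context-sensitive rather than a flat lookup.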


Summary

Introduction

Multilingual content has grown rapidly on the internet over the last decade. Information retrieval, both cross-lingual [1] and monolingual, has gained a lot of attention from the Natural Language Processing (NLP) research community. The World Wide Web (WWW) began as a web of the English language and has since grown into a huge collection. Information retrieval in which the queries and the accessed information are in the same language is known as monolingual, whereas cross-lingual retrieval focuses on accessing information in several different languages [2]. NLP researchers are drawn to languages whose scripts are written from right to left, such as Urdu and Arabic.

