Multi character frequency based encoding for efficient text messaging in Indian Languages

Manu Seth, Shivam Chaturvedi, Rajesh M Hegde, Sourya Basu

doi:10.1109/ncc.2016.7561128

Abstract

Short Message Service (SMS) via cell phones is a widely used mode of data communication. Currently employed encoding schemes allow the transmission of 160 characters per SMS in English. This drops to 70 characters per SMS if any Indian language, including Hindi, is used, due to the Unicode format used for these languages. Schemes proposed to improve the encoding efficiency of short text messaging generally encode one character at a time. Table-splitting schemes that reduce the average number of bits per character are generally used in this context. In this paper, a novel multi-character frequency-based encoding scheme is proposed for efficient transmission of short text messages in four Indian languages. Both uni-gram and bi-gram modelling based schemes are proposed. The efficiency of the proposed schemes is evaluated by conducting experiments on a large multilingual database of short text messages collected from Twitter using a dictionary learning approach. Performance evaluation shows that these encoding schemes allow the transmission of around 190 characters per SMS in English and more than 165 characters per SMS for the four Indian languages. Encoding efficiency is significantly improved compared to existing state-of-the-art table marker algorithms, which motivates the use of the proposed schemes in practice for transmitting short text messages in Indian languages.
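The character counts quoted above can be related to an average number of bits per character. The following is a minimal sketch, not taken from the paper. It assumes the standard 140-octet SMS payload, 7 bits per character for the default GSM alphabet, and 16 bits per character for Unicode (UCS-2) text. The function names are illustrative only.

```python
# Assumption (not stated in the abstract): one SMS carries a 140-octet payload.
SMS_PAYLOAD_BITS = 140 * 8  # 1120 bits


def chars_per_sms(bits_per_char: float) -> int:
    """Whole characters that fit in one SMS at a fixed cost per character."""
    return int(SMS_PAYLOAD_BITS // bits_per_char)


def avg_bits_per_char(chars: int) -> float:
    """Average bits per character implied by fitting `chars` into one SMS."""
    return SMS_PAYLOAD_BITS / chars


# Baselines quoted in the abstract, under the assumed encodings.
print(chars_per_sms(7))    # 7-bit default alphabet (English): 160
print(chars_per_sms(16))   # 16-bit UCS-2 (Indian languages): 70

# Average cost per character implied by the reported results.
print(round(avg_bits_per_char(190), 2))  # English, about 5.89 bits
print(round(avg_bits_per_char(165), 2))  # Indian languages, about 6.79 bits
```

Under these assumptions, the reported figure of more than 165 characters per SMS corresponds to an average of under 7 bits per character for the Indian languages, compared with 16 bits per character in the Unicode baseline.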

Full Text