Abstract

As technology advances and more human work is handed over to machines, systems need to read informal text correctly even when its written form does not match standard English words. Such informal text includes dates, currencies, abbreviations and acronyms of standard words, measurements, URLs, phone numbers, etc. This paper focuses on text normalization, the conversion of informal text into its equivalent standard form, which present-day systems need in order to produce the equivalent spoken form of these non-standard words. Text normalization is a pre-processing step in natural language processing systems. The paper discusses various techniques and methods for converting non-standard words into standard words. Three methods are used to classify tokens: regular expressions for simple pattern matching, Naïve Bayes classification for number sense disambiguation, and Stochastic Gradient Descent for resolving acronym and class ambiguity. Results and analysis are reported as the error rate of the system, which indicates where the system can be improved further.
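
The regular-expression stage of token classification can be illustrated with a minimal Python sketch. The patterns, class labels, and the function name `classify_token` below are assumptions made for illustration only; the abstract does not give the paper's actual expressions or label set.

```python
import re

# Hypothetical patterns and labels, chosen for illustration only.
# Order matters: more specific classes are tried before the generic NUMBER class.
PATTERNS = [
    ("URL", re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)),
    ("DATE", re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$")),
    ("CURRENCY", re.compile(r"^[$€£₹]\d+(?:,\d{3})*(?:\.\d+)?$")),
    ("PHONE", re.compile(r"^\+?\d{1,3}[- ]?\d{3}[- ]?\d{3}[- ]?\d{4}$")),
    ("MEASURE", re.compile(r"^\d+(?:\.\d+)?(?:km|kg|cm|mm|m|g|lb|ft)$")),
    ("ACRONYM", re.compile(r"^(?:[A-Z]\.?){2,}$")),
    ("NUMBER", re.compile(r"^\d+(?:,\d{3})*(?:\.\d+)?$")),
]


def classify_token(token):
    """Return the label of the first pattern that matches, or WORD if none do."""
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "WORD"


if __name__ == "__main__":
    for tok in ["12/05/2016", "$1,200.50", "www.example.com", "5kg", "NASA", "1984", "hello"]:
        print(tok, classify_token(tok))
```

A pattern match of this kind settles only the surface class of a token. Whether a token such as "1984" should be read as a year or as a cardinal number, or how an acronym should be expanded, is the kind of ambiguity the abstract assigns to the Naïve Bayes and Stochastic Gradient Descent classifiers.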
