Abstract

Most areas of language and speech technology require, directly or indirectly, the handling of unrestricted text, and text-to-speech systems in particular must work on real text. To build a natural-sounding speech synthesis system, it is essential that the text processing component produce an appropriate sequence of phonemic units corresponding to an arbitrary input text. We use a novel approach in which the input text is tokenized and each token is classified by its type. Token sense disambiguation relies on the semantic nature of the language, after which expansion rules are applied to obtain the normalized text. For the Telugu language, however, little work has been done on text normalization. In this paper we discuss our efforts to design a rule-based system for text normalization in the context of building a Telugu text-to-speech system.

Highlights

  • The objective of the text processing component [1, 2] is to process the given input text and render the written form of the text into its spoken form. This orthographic form is realized by the speech generation component, either by synthesis from parameters or by selection of units from a large speech corpus.

  • For natural-sounding speech synthesis [3, 4], it is essential that the text processing component produce an appropriate sequence of orthographic units corresponding to an arbitrary input text.

  • This paper presents the need for text to be preprocessed before it is handed to any synthesizer.


Summary

INTRODUCTION

The objective of the text processing component [1, 2] is to process the given input text and render the written (orthographic) form of the text into its spoken form. This orthographic form is realized by the speech generation component, either by synthesis from parameters or by selection of units from a large speech corpus. For natural-sounding speech synthesis [3, 4], it is essential that the text processing component produce an appropriate sequence of orthographic units corresponding to an arbitrary input text. The standard word representation is obtained using the expansion rules and a look-up table (database).
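The tokenize, classify, and expand flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the regular expressions, the function names (`tokenize`, `classify`, `expand`, `normalize`), and the digit-by-digit reading are all assumptions made for illustration, and the look-up table uses English placeholder words where a real system would store Telugu word forms. Token sense disambiguation is not modeled here.

```python
import re

# Hypothetical token classes, loosely following the categories in the outline.
# Order matters: more specific patterns are tried first.
TOKEN_PATTERNS = [
    ("date", re.compile(r"^\d{1,2}[-/]\d{1,2}[-/]\d{2,4}$")),
    ("percentage", re.compile(r"^\d+(\.\d+)?%$")),
    ("decimal", re.compile(r"^\d+\.\d+$")),
    ("cardinal", re.compile(r"^\d+$")),
]

# Illustrative look-up table. English placeholders only (an assumption);
# a Telugu system would map digits to Telugu words instead.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def tokenize(text):
    """Split the input on whitespace; each chunk is one token."""
    return text.split()


def classify(token):
    """Return the first matching class label, or 'word' if none matches."""
    for label, pattern in TOKEN_PATTERNS:
        if pattern.match(token):
            return label
    return "word"


def spell_digits(digits):
    """Read a digit string one digit at a time using the look-up table."""
    return " ".join(DIGIT_WORDS[d] for d in digits)


def expand(token, label):
    """Apply a simple expansion rule for the given token class."""
    if label == "cardinal":
        # Simplification: digits are read individually, not as a full number.
        return spell_digits(token)
    if label == "decimal":
        whole, frac = token.split(".")
        return f"{spell_digits(whole)} point {spell_digits(frac)}"
    if label == "percentage":
        number = token[:-1]  # strip the trailing '%'
        return f"{expand(number, classify(number))} percent"
    # Dates and plain words pass through unchanged in this sketch.
    return token


def normalize(text):
    """Tokenize, classify, and expand every token of the input text."""
    return " ".join(expand(tok, classify(tok)) for tok in tokenize(text))
```

Calling `normalize("paid 45 of 7.5%")` would classify `45` as a cardinal and `7.5%` as a percentage and expand both through the table, leaving the ordinary words untouched. Keeping classification separate from expansion means a new token class only needs one pattern and one rule.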

Nature and Format of Telugu Text
PROPOSED MODEL FOR TEXT NORMALIZATION
Tokenization and Token Classification
Token Sense Disambiguation
Standard Word Generation
IMPLEMENTATION OF THE SYSTEM
Tokenization
Token Classification
Cardinal Numbers
Ordinal Numbers
Decimal Numbers
Phone Numbers
Date Formats
Currency
Abbreviations and Acronyms
Address
Percentages
Coverage Analysis
CONCLUSIONS