Abstract
Text normalization is an important component of text-to-speech systems, and its main difficulty is disambiguating non-standard words (NSWs). This paper develops a taxonomy of NSWs based on a large-scale Chinese corpus and proposes a two-stage NSW disambiguation strategy: finite state automata (FSA) for initial classification, followed by maximum entropy (ME) classifiers for subclass disambiguation. Using this taxonomy, the two-stage approach achieves an F-score of 98.53% in an open test, 5.23% higher than that of the FSA-based approach. Experiments show that the NSW taxonomy gives the FSA a high baseline performance, that the ME classifiers yield a considerable improvement, and that the two-stage approach adapts well to new domains.