Abstract

This paper proposes an improved probabilistic context-free grammar (CFG), called the mixture probabilistic CFG, based on the idea of cluster-based language modeling. The model assumes that language model parameters follow different probability distributions in different topics or domains. To perform topic- or domain-dependent language modeling, we first divide the training corpus into a number of subcorpora according to topic or domain, and then estimate a separate probability distribution from each subcorpus. A mixture probabilistic CFG therefore has several different probability distributions over the CFG productions. The language model probability of a sentence is calculated as a mixture of these probability distributions. The mixture probabilistic CFG thus yields a context- or topic-dependent language model, which should allow more accurate language modeling. The proposed model was evaluated by calculating test-set perplexity using the ADD (ATR Dialogue Database) corpus and a Japanese intra-phrase grammar. The mixture probabilistic CFG had a test-set perplexity of 2.47 per phone, while the simple probabilistic CFG had a test-set perplexity of 2.77 per phone. We also conducted speech recognition experiments using three language models: a pure CFG (without probabilities), the simple probabilistic CFG, and the mixture probabilistic CFG. In these experiments, the mixture probabilistic CFG attained the best performance. The proposed model was also evaluated using sentence-level clustering. This evaluation used the dialogue corpus in which each utterance is annotated with an utterance type called IFT (Illocutionary Force Type). Using these IFTs, we divided the corpus into 9 clusters and estimated production probabilities from each cluster. Without IFT clustering, the perplexity was 2.18 per phone; with IFT clustering, it was reduced to 1.82 per phone.
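As a rough illustration of the mixture computation and the per-phone perplexity measure described above, the following Python sketch scores one derivation under several cluster-specific production distributions and combines the results. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the function names, the toy grammar, the fixed mixture weights, and the choice to mix at the sentence level are all hypothetical, and a full probabilistic CFG would sum over every derivation of a sentence rather than scoring a single one.

```python
import math


def derivation_probability(rules_used, rule_probs):
    """Probability of one derivation: the product of the probabilities
    of the productions it uses, under one cluster's distribution."""
    return math.prod(rule_probs[rule] for rule in rules_used)


def mixture_probability(cluster_probs, weights):
    """Weighted sum of per-cluster probabilities.
    The weights are an assumption here; they must sum to 1."""
    if len(cluster_probs) != len(weights):
        raise ValueError("need exactly one weight per cluster")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * p for w, p in zip(weights, cluster_probs))


def perplexity_per_symbol(sentence_probs, symbol_counts):
    """Test-set perplexity normalized by the total number of symbols
    (phones, in the setting described above)."""
    total_log2 = sum(math.log2(p) for p in sentence_probs)
    return 2 ** (-total_log2 / sum(symbol_counts))


# Hypothetical toy grammar: the same productions, with probabilities
# estimated separately from two subcorpora (clusters).
derivation = ["S -> NP VP", "NP -> n", "VP -> v"]
cluster_a = {"S -> NP VP": 1.0, "NP -> n": 0.5, "VP -> v": 0.4}
cluster_b = {"S -> NP VP": 1.0, "NP -> n": 0.8, "VP -> v": 0.5}

per_cluster = [derivation_probability(derivation, c) for c in (cluster_a, cluster_b)]
sentence_prob = mixture_probability(per_cluster, weights=[0.5, 0.5])
```

Lower perplexity means the model assigns higher probability to the test data, which is the sense in which the figures reported above favor the mixture model.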
