Abstract

In order to better understand different speech synthesis techniques on a common dataset, we devised a challenge that will help us better compare research techniques in building corpus-based speech synthesizers. In 2004, we released the first two 1200-utterance single-speaker databases from the CMU ARCTIC speech databases and challenged groups currently working in speech synthesis around the world to build their best voices from these databases. In January 2005, we released two further databases and a set of 50 utterance texts from each of five genres, and asked the participants to synthesize these utterances. The resulting synthesized utterances were then presented to three groups of listeners: speech experts, volunteers, and US English-speaking undergraduates. This paper summarizes the purpose, design, and whole process of the challenge.

1. Background

With a view to allowing closer comparison of corpus-based techniques, from labeling, pruning, join costs, signal processing techniques, and others, we devised a challenge in which participants use the same databases to synthesize utterances from a small number of genres. An organized evaluation, based on listening tests, was then carried out to try to rank the systems and help identify the effectiveness of the techniques.

The sister field of speech recognition has clearly benefited from the availability of common datasets in order to provide valid comparisons between systems [1]. These evaluations concentrated efforts in the speech recognition field, particularly through the 1990s with DARPA workshops where NIST (and others) devised standardized tests for speech recognition. It is clear that these standardized tests and widely available datasets allowed speech recognition results to be more easily compared and, more importantly, caused the core technology to improve. Although today many may criticize a naive word error metric as the sole accuracy measure for speech recognition systems, few would complain that it has not contributed to drastic improvement in the utility of speech recognition as a viable technology.

Speech synthesis has not been as lucky in having a well-defined evaluation metric, nor has it had a well-funded centralized community that could be targeted to the same task. With the rise of general corpus-based speech synthesis over the last ten years, we have moved on from a domain where new synthetic voices could only be built with many man-years of effort from highly skilled researchers. Such systems were tuned to the particular datasets being used; thus, comparisons of techniques such as labeling and signal processing could only be done within the research group that originally developed the dataset. Such tying of databases to particular systems made it hard to genuinely compare techniques, since the quality of the original recorded voice itself contributed greatly to the resulting synthetic voice quality.
