Topic extraction with multiple topic-words in broadcast-news speech

K Ohtsuki, S Matsunaga, S Furui, T Matsuoka

doi:10.1109/icassp.1998.674434

Abstract

This paper reports on topic extraction in Japanese broadcast-news speech. Using continuous speech recognition, we studied the extraction of several topic-words from broadcast news. A combination of multiple topic-words represents the content of a news item, which is a more detailed and more flexible approach than using a single word or a single category. The topic extraction model gives the degree of relevance between each topic-word and each word in an article. The topic-words with the highest total relevance score, summed over all words in the article, are extracted. We trained the topic extraction model on five years of newspaper text, using the frequency of topic-words taken from headlines and of words in articles. The degree of relevance between topic-words and words in articles is calculated from statistical measures, i.e., mutual information or the χ²-value. In topic extraction experiments on recognized broadcast-news speech, we extracted five topic-words from the 10-best hypotheses using a χ²-based model and found that 76.6% of them agreed with the topic-words chosen by subjects.
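The scoring scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `train_chi2_model` and `extract_topic_words`, the toy corpus, the use of article-level (document-frequency) counts, and the choice to ignore negatively associated pairs are all assumptions made for illustration. The relevance of a topic-word and an article word is taken here as the standard χ² statistic of their 2×2 co-occurrence table, and a topic-word's score for an article is the sum of its relevance to every word in that article.

```python
from collections import Counter
from itertools import product


def train_chi2_model(corpus):
    """Build a chi-square relevance table (hypothetical helper).

    corpus: list of (headline_topic_words, article_words) pairs.
    Returns (relevance, topic_vocab), where relevance maps
    (topic_word, article_word) to a chi-square value.
    """
    n_docs = len(corpus)
    topic_df = Counter()  # articles whose headline contains topic-word t
    word_df = Counter()   # articles whose body contains word w
    pair_df = Counter()   # articles with t in the headline and w in the body
    for headline, body in corpus:
        topics, words = set(headline), set(body)
        topic_df.update(topics)
        word_df.update(words)
        pair_df.update(product(topics, words))

    relevance = {}
    for (t, w), a in pair_df.items():
        b = topic_df[t] - a     # t in headline, w absent from body
        c = word_df[w] - a      # w in body, t absent from headline
        d = n_docs - a - b - c  # neither present
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # Assumption: skip degenerate tables and pairs that co-occur
        # no more often than chance would predict.
        if denom == 0 or a * d <= b * c:
            continue
        relevance[(t, w)] = n_docs * (a * d - b * c) ** 2 / denom
    return relevance, set(topic_df)


def extract_topic_words(relevance, topic_vocab, article_words, k=5):
    """Return up to k topic-words with the highest total relevance."""
    scores = {
        t: sum(relevance.get((t, w), 0.0) for w in article_words)
        for t in topic_vocab
    }
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, s in ranked[:k] if s > 0]


# Toy training data standing in for newspaper headlines and articles.
toy_corpus = [
    (["election"], ["vote", "party", "minister"]),
    (["election"], ["vote", "ballot", "party"]),
    (["earthquake"], ["tremor", "damage", "rescue"]),
    (["earthquake"], ["tremor", "rescue", "minister"]),
]
model, vocab = train_chi2_model(toy_corpus)
print(extract_topic_words(model, vocab, ["vote", "tremor", "rescue", "damage"], k=2))
```

For recognized speech, `article_words` could be the words pooled from the N-best hypotheses of a news item; how the hypotheses are actually combined is not specified in the abstract. A mutual-information model would replace only the relevance formula inside `train_chi2_model`.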

Full Text