Abstract
The Context Overlapping Model (COM) is presented in this article for the task of Automatic Sentence Segmentation (ASS). Compared with the HMM, COM expands the observation from a single word to an n-gram unit, and neighboring units share an overlapping part. Owing to the co-occurrence constraint and the transition constraint, COM reduces the search space and improves tagging accuracy. We treated ASS as a sequence labeling task and applied a 2-gram COM to it. The experimental results show that the overall correct rate on the open test is as high as 90.11%, significantly higher than the 85.16% of the baseline model (a second-order HMM).
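The idea of overlapping n-gram observation units can be illustrated with a minimal sketch. It only shows how a word sequence might be turned into 2-gram units whose neighbors share one word. The function names `overlapping_units` and `units_overlap` are hypothetical, and the sketch does not reproduce the article's tagging procedure or its co-occurrence and transition constraints.

```python
from typing import List, Sequence, Tuple


def overlapping_units(words: Sequence[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Slide a window of n words with stride 1 over the sequence.

    Neighboring units therefore share n - 1 words, which is the
    "overlapping part" described in the abstract (assumed reading).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(words) < n:
        # Too short for a full unit: return the whole sequence as one unit.
        return [tuple(words)] if words else []
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def units_overlap(left: Tuple[str, ...], right: Tuple[str, ...]) -> bool:
    """Check that the tail of the left unit equals the head of the right unit."""
    return left[1:] == right[:-1]


# Hypothetical token sequence, for illustration only.
words = ["we", "treated", "segmentation", "as", "labeling"]
units = overlapping_units(words, n=2)
```

With `n=2`, each word except the first and last appears in two consecutive units, so a tagger working on units sees every interior word in two contexts.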
Highlights
Automatic sentence segmentation (ASS) is an important step in Automatic Speech Recognition (ASR)
Because Chinese sentences are segmented by commas as well as by periods, question marks, and exclamation marks, the comma is always regarded as a sentence boundary
Some experimental results show that the Hidden Markov Model (HMM) performs poorly in ASS because of its observation independence assumption
Summary
Automatic sentence segmentation (ASS) is an important step in Automatic Speech Recognition (ASR). Little work has been done in this area, but it has recently gained more interest from the research community (Mikheev, 2003: 216). CYBERPUNC (Beeferman, Berger, and Lafferty, 1998) is a system that aims to segment sentences in speech transcripts. It was designed to augment a standard trigram language model of a speech recognizer with information about sentence splitting. Tanev and Mitkov (2000) evaluated a sentence segmentation system for Slavonic languages. This system used nine main end-of-sentence rules together with a list of abbreviations and achieved 92% precision and 99% recall, measured on a text of 190 sentences.