Abstract

Accurate location of the endpoints of an isolated word is important for reliable and robust word recognition. When the background noise is stationary and low in level, endpoint detection for isolated words is relatively easy. Often, however, the beginning and end of an isolated word are obscured by speaker-generated artifacts (mouth noises such as clicks, pops, lip smacks, and breathiness) as well as by artifacts introduced by the recording environment and transmission system. Several techniques for endpoint detection of isolated words recorded over a dialed-up telephone line were studied. The techniques investigated are classified as explicit, implicit, or hybrid. Explicit techniques are those in which the endpoints are located prior to, and independently of, the recognition and decision stages of the system. In implicit endpoint estimation, the endpoints of the isolated word are determined by the recognition and decision stages themselves; i.e., there is no separate stage for endpoint detection. Hybrid techniques incorporate aspects of both the explicit and implicit methods. Investigations showed that the hybrid techniques provided the best estimates of both endpoints and, correspondingly, the highest recognition accuracy of the three classes studied. A new hybrid technique for endpoint detection is proposed that reduces the recording rejection rate by 30% while maintaining the same recognition accuracy as that obtained using earlier techniques with clean speech (i.e., no artifacts).
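
To make the notion of an explicit technique concrete, the following is a minimal sketch of an endpoint detector that runs before, and independently of, any recognizer. It is not the method evaluated in the paper, whose details the abstract does not give; it is a generic short-time energy threshold, and the function names, the 80-sample frame length (10 ms at an assumed 8 kHz sampling rate), the number of leading noise-only frames, and the 10 dB margin are all illustrative assumptions.

```python
import math


def frame_energies_db(samples, frame_len):
    """Return the short-time log energy (dB) of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on all-zero frames.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def explicit_endpoints(samples, frame_len=80, noise_frames=5, margin_db=10.0):
    """Return (first_frame, last_frame) of the word, or None if none is found.

    Hypothetical parameters: the first `noise_frames` frames are assumed to
    contain only background noise, and a frame counts as speech when its
    energy exceeds the estimated noise floor by more than `margin_db`.
    """
    energies = frame_energies_db(samples, frame_len)
    if len(energies) <= noise_frames:
        return None
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    threshold = noise_floor + margin_db
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0], active[-1]
```

A detector of this kind works when the noise is stationary and low in level, but a click or lip smack that crosses the threshold would be taken as part of the word, which is the failure mode that motivates the implicit and hybrid approaches.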
