Automatic Speech Recognition Transcripts Research Articles

Recent reports have investigated the use of automatic speech recognition (ASR) to analyze and score verbal responses in cognitive tests. ASR scoring is objective, permits the efficient computerized administration of verbal tests, and generates timestamps that enable the detailed temporal analysis of responses. However, ASR transcription accuracy varies by engine, task, and participant, and ASR can incorrectly score responses from participants with atypical speech patterns. Here we describe the speech-transcription pipeline of the California Cognitive Assessment Battery (CCAB), which incorporates consensus ASR (CASR) to produce more accurate transcripts than is possible with any single ASR engine. We also developed a Transcript Review Tool (TRT), which facilitates the manual correction of mis-transcribed words for participants whose responses are poorly transcribed. Figure 1 shows the CCAB speech transcription pipeline. Real-time ASR transcriptions are obtained along with transcriptions of the digital recordings of responses from six cloud-based ASR engines (e.g., Google). The individual transcripts are then combined to produce a "consensus" transcript, together with a transcription confidence measure based primarily on the agreement between ASR engines (Figure 2). If needed, "consensus" transcripts can be manually corrected using the Transcript Review Tool, which enables the review of all words or of only those words below a predefined CASR confidence threshold (Figure 3). ASR transcriptions were obtained from 442 healthy adults (mean age = 65.1 ± 14.4) who each underwent three days of cognitive testing that included 25 verbal tests. In all, approximately 276 hours of speech were transcribed. Preliminary analyses show that CASR transcription accuracy surpassed 99% for tests with limited response sets (e.g., digit span, verbal list learning, and face-name binding) and exceeded 95% for discursive speech tests (e.g., picture description and logical memory). CASR transcription is more accurate than that of any single ASR engine. When combined with the TRT, "consensus" ASR can produce error-free, timestamped transcripts that enable the detailed analysis of verbal responses from older individuals at risk of cognitive decline.
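
The idea of combining several engine outputs into a consensus transcript with an agreement-based confidence can be sketched as follows. This is a minimal illustration, not the CCAB implementation: the function name `consensus_transcript`, the position-wise alignment, the 0.67 review threshold, and the sample digit-span responses are all assumptions made for the example. A real pipeline would need proper word alignment (for instance, using timestamps) before voting.

```python
from collections import Counter
from itertools import zip_longest


def consensus_transcript(transcripts, review_threshold=0.67):
    """Position-wise majority vote over transcripts from several engines.

    Assumes (hypothetically) that words line up by position, which is only
    plausible for short, constrained responses such as digit span.
    Returns a list of (word, confidence, needs_review) tuples, where
    confidence is the fraction of engines that agree on the winning word.
    """
    tokenized = [t.lower().split() for t in transcripts]
    result = []
    # Pad shorter transcripts with "" so every engine votes at each position.
    for column in zip_longest(*tokenized, fillvalue=""):
        word, votes = Counter(column).most_common(1)[0]
        confidence = votes / len(column)
        if word:  # drop positions where the winning vote is "no word"
            result.append((word, confidence, confidence < review_threshold))
    return result


# Hypothetical outputs from six engines for one digit-span response.
engines = [
    "seven two nine four",
    "seven two nine four",
    "seven to nine four",
    "seven two nine for",
    "seven two mine four",
    "seven two nine four",
]

for word, conf, needs_review in consensus_transcript(engines):
    flag = "REVIEW" if needs_review else ""
    print(f"{word:6s} {conf:.2f} {flag}")
```

In this sketch, words whose agreement falls below the threshold are flagged, which mirrors the idea of reviewing only low-confidence words in a tool such as the TRT.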

Increasing amounts of informal spoken content are being collected, e.g. recordings of meetings, lectures and personal data sources. The amount of this content being captured and the difficulties of manually searching audio data mean that efficient automated search tools are of increasing importance if its full potential is to be realized. Much existing work on speech search has focused on retrieval of clearly defined document units in ad hoc search tasks. We investigate search of informal speech content using an extended version of the AMI meeting collection. A retrieval collection was constructed by augmenting the AMI corpus with a set of ad hoc search requests and manually identified relevant regions of the recorded meetings. Unlike standard ad hoc information retrieval, which focuses primarily on precision, we assume a recall-focused search scenario of a user seeking to retrieve a particular incident occurring within meetings relevant to the query. We explore the relationship between automatic speech recognition (ASR) accuracy, automated segmentation of the meeting into retrieval units, and retrieval behaviour with respect to both precision and recall. Experimental retrieval results show that while averaged retrieval effectiveness is generally comparable in terms of precision for automatically extracted segments of manual content transcripts and of ASR transcripts with high recognition accuracy, segments with poor recognition quality become very hard to retrieve and may fall below the retrieval rank position to which a user is willing to search. These changes impact on system effectiveness for recall-focused search tasks. Varied ASR quality across the relevant and non-relevant data means that the rank of some well-recognized relevant segments is actually promoted for ASR transcripts compared to manual ones. This effect is not revealed by the averaged precision-based retrieval evaluation metrics typically used for evaluation of speech retrieval. However, such variations in the ranks of relevant segments can impact considerably on the experience of the user in terms of the order in which retrieved content is presented. Analysis of our results reveals that while relevant longer segments are generally more robust to ASR errors, and consequently retrieved at higher ranks, this is often at the expense of the user needing to engage in longer content playback to locate the relevant content in the audio recording. Our overall conclusion is that it is desirable to minimize the length of retrieval units containing relevant content while seeking to maintain high ranking of these items.
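
The effect of a relevant segment falling below the rank a user is willing to inspect can be illustrated with precision and recall at a rank cutoff. This is a small sketch under stated assumptions: the function `precision_recall_at_k`, the segment identifiers, the two rankings, and the cutoff of 5 are hypothetical and are not taken from the AMI experiments.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall over the top-k items of a ranked list."""
    top_k = ranked_ids[:k]
    hits = sum(1 for seg in top_k if seg in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical relevant segments for one query.
relevant = {"s3", "s7"}

# Hypothetical rankings: with manual transcripts both relevant segments
# appear in the top 5; with ASR transcripts the poorly recognized
# segment "s7" drops to rank 6.
manual_ranking = ["s3", "s7", "s1", "s9", "s4"]
asr_ranking = ["s3", "s1", "s9", "s4", "s8", "s7"]

cutoff = 5  # assumed depth to which the user is willing to search
for name, ranking in [("manual", manual_ranking), ("asr", asr_ranking)]:
    p, r = precision_recall_at_k(ranking, relevant, cutoff)
    print(f"{name}: precision@{cutoff}={p:.2f} recall@{cutoff}={r:.2f}")
```

For a user looking for one particular incident, the drop in recall at the cutoff matters more than the averaged precision figure, since the missed segment may be the one being sought.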

Articles published on Automatic Speech Recognition Transcripts

Consensus automatic speech recognition (CASR) in the California Cognitive Assessment Battery (CCAB).

The Corpus of British Isles Spoken English (CoBISE)

A hierarchical reasoning graph neural network for the automatic scoring of answer transcriptions in video job interviews

Confusion2Vec 2.0: Enriching ambiguous spoken language representations with subwords.

What does parity mean? A detailed comparison of ASR and human transcription errors

Automatic speech recognition in neurodegenerative disease

The impact of semantic annotation techniques on content-based video lecture recommendation

Asynchronous Speech Recognition Affects Physician Editing of Notes.

Predicting speech intelligibility with deep neural networks

Modeling Latent Topics and Temporal Distance for Story Segmentation of Broadcast News

A simple generative model of incremental reference resolution for situated dialogue

Content Based Lecture Video Retrieval Using Speech and Video Text Information

Exploring speech retrieval from meetings using the AMI corpus

MBNSeg: A Clustering System for Segmenting Malay Spoken Broadcast News

Beyond audio and video retrieval: topic-oriented multimedia summarization

Semantic tagging of video ASR transcripts using the web as a source of knowledge

Recognising speakers from the topics they talk about

Performance Analysis and Improvement of Turkish Broadcast News Retrieval

Podcast search: user goals and retrieval technologies

Statistical lattice-based spoken document retrieval
