Major developments have been taking place in the field of finding more natural ways of interacting with computers, with a clear focus on making technology more approachable to people. The Natural User Interface (NUI) is the concept that computers can comprehend the ways we naturally communicate, such as eye gaze, voice, touch, and body movement. Today, many of these elements are available in mobile phones, PCs, and other devices. Speech technologies in particular play a substantial role in this evolution. Significant advances have been made in automatic speech recognition (ASR) for well-defined applications such as dictation and medium-vocabulary transaction-processing tasks in comparatively controlled environments. However, ASR has yet to reach the level required for speech to become a truly pervasive user interface, because even in clean acoustic conditions its performance falls behind human speech perception. Visual speech recognition is a promising source of additional speech information: it has been shown to enhance the noise robustness of automatic speech recognizers, thereby promising to expand their usability in human-computer interaction. In this paper, the main components of audio-visual speech recognition, namely the audio and the video components, are discussed along with the latest advancements in this field. The paper then goes beyond these recent advancements to discuss the future scope of audio-visual speech recognition, outlining some likely future developments and evaluating each on the basis of its performance. Graphs plotted from experiments depict the performance improvement from audio-only ASR to audio-visual ASR, together with the expected performance level in the future.
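As an illustration of how the audio and video components discussed in the paper are commonly combined, the following is a minimal sketch of decision-level (late) fusion, in which per-class log-likelihoods from independent audio-only and visual-only recognizers are mixed with a stream weight. The function and variable names (fuse_scores, audio_logp, visual_logp, audio_weight) are hypothetical and not taken from the paper; this is only one of several possible fusion strategies.

```python
import numpy as np

def fuse_scores(audio_logp: np.ndarray,
                visual_logp: np.ndarray,
                audio_weight: float = 0.7) -> np.ndarray:
    """Decision-level (late) fusion of audio and visual log-likelihoods.

    Each input holds per-class log-likelihoods produced by a separate
    audio-only and visual-only recognizer. The stream weight
    (0 <= audio_weight <= 1) controls how much the audio stream is
    trusted; in noisy acoustic conditions a lower audio weight shifts
    reliance toward the visual (lip-reading) stream.
    """
    return audio_weight * audio_logp + (1.0 - audio_weight) * visual_logp

# Hypothetical example: log-likelihoods for three candidate words.
audio_logp = np.array([-2.1, -0.4, -3.0])   # from the audio-only recognizer
visual_logp = np.array([-0.9, -1.8, -2.5])  # from the visual-only recognizer

fused = fuse_scores(audio_logp, visual_logp, audio_weight=0.6)
print("Fused scores:", fused)
print("Recognized class index:", int(np.argmax(fused)))
```

Lowering audio_weight when acoustic noise is detected is one simple way such a system can exploit the visual stream to retain robustness.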