Abstract

Machine learning techniques have driven substantial advances in automated signal and data processing across many fields. One of the critical areas in acoustic signal processing is extracting relevant information from the speech signal. The speech signal contains phonemic variation, temporal structure, prosody, timbre, and voice quality, and it also conveys aspects of the speaker’s profile, such as emotions or sentiments. What a human discerns and analyzes with ease still escapes machine learning-based approaches when all of this complexity must be handled at once. Clear parallels exist with musical signal processing, where temporal structure, timbre, and sound quality are likewise important, especially since music also carries an artistic message. One functional role of music arises when it is associated with the image: the ultimate goal of a film music composer is to enhance or evoke the audience’s emotions. The question arises whether a machine learning technique can discern the emotions associated with a given scene in the same way a human does. This paper focuses on the application of machine learning to speech and music signal analysis, emphasizing recent advances in deep learning for speech and music signal processing.
