Abstract

Due to the physiological constraints of articulatory motion, the speech apparatus has limited degrees of freedom. As a result, the range of speech sounds a human is capable of producing may lie on a low dimensional submanifold of the high dimensional space of all possible sounds. In this study, a number of manifold learning algorithms are applied to speech data in an effort to extract useful low dimensional structure from the high dimensional speech signal. The ability of these manifold learning algorithms to separate vowels in a low dimensional space is evaluated and compared to a classical linear dimensionality reduction method. Results indicate that manifold learning algorithms outperform classical methods in low dimensions and are capable of discovering useful manifold structure in speech data.

Index Terms: speech analysis, manifold learning, dimensionality reduction, classification.

1. Introduction

In speech processing, the speech signal is often modeled by relatively high dimensional features such as discrete Fourier transform (DFT) or linear prediction (LP) coefficients. However, due to physiological constraints, the speech production apparatus has relatively few degrees of freedom. Thus, humans are only capable of generating a limited range of sounds, which occupy a confined region of the acoustic space. In this case, we can imagine the speech data as lying on or near a manifold embedded in the high dimensional acoustic space. It has been proposed that speech intrinsically lies on some such low dimensional manifold [1, 2].

It is desirable to reduce the dimensionality of the speech signal prior to processing. Traditionally, signal processing techniques have been applied to speech in order to reduce the dimensionality by extracting information judged to capture the energy and spectral characteristics of the signal.
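To make the dimensionality of such representations concrete, a DFT feature extraction step can be sketched as follows. The synthetic signal, sample rate, and frame parameters here are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

# Illustrative 1-second synthetic "speech" signal at 16 kHz
# (a pure tone stands in for real audio in this sketch).
sr = 16000
signal = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)

# Slice the waveform into 25 ms frames with a 10 ms hop.
frame_len, hop = 400, 160
n_frames = 1 + (len(signal) - frame_len) // hop
frames = np.stack([signal[i * hop : i * hop + frame_len]
                   for i in range(n_frames)])

# One-sided DFT magnitude spectrum: 201 coefficients per frame,
# so each frame is a point in a 201 dimensional acoustic space.
features = np.abs(np.fft.rfft(frames, axis=1))
print(features.shape)  # (98, 201)
```

Each 25 ms frame thus becomes a 201 dimensional feature vector, which is the kind of high dimensional acoustic representation the dimensionality reduction methods below operate on.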
The extracted information is often transformed according to some perceptually motivated scheme to better model the auditory pathway; examples include Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) parameters. These acoustically and perceptually motivated representations are based on our knowledge and assumptions of speech production and perception, and as such do not attempt to automatically discover the underlying low dimensional structure of speech.

A number of automatic dimensionality reduction algorithms, driven by the statistics of the data, have been proposed that aim to extract a meaningful low dimensional representation of high dimensional data. Applications of these dimensionality reduction algorithms include data compression, visualisation, noise reduction, and feature extraction. Dimensionality reduction methods can be categorised as linear or nonlinear. Linear methods are limited to discovering the structure of data lying on or near a linear subspace of the high dimensional input space. The most widely used linear dimensionality reduction methods include classical principal component analysis (PCA) [3] and multidimensional scaling (MDS). These methods have been applied to a wide range of speech processing problems including feature transformation for improved speech recognition performance, speaker adaptation, data compaction, and speech analysis.

Jansen and Niyogi [2] have recently shown that certain classes of speech sounds lie on a low dimensional manifold nonlinearly embedded in the high dimensional acoustic space. A low dimensional submanifold such as this may have a highly nonlinear structure that linear methods would fail to discover. Recently, a number of manifold learning (also referred to as nonlinear dimensionality reduction) algorithms have been proposed [4, 5, 6] which overcome the limitations of linear methods.
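The linear baseline described above can be sketched with an off-the-shelf PCA implementation. The scikit-learn library and the random feature matrix below are assumptions for illustration only; the paper does not specify an implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for high dimensional acoustic features
# (e.g. DFT or LP coefficient vectors): 500 frames, 40 dimensions.
rng = np.random.default_rng(0)
acoustic_frames = rng.normal(size=(500, 40))

# Classical linear dimensionality reduction: project each frame
# onto the top two principal components of the data.
pca = PCA(n_components=2)
embedded = pca.fit_transform(acoustic_frames)

print(embedded.shape)  # (500, 2)
```

Because PCA can only find a linear subspace, a representation like this will fail to unfold any nonlinear submanifold structure in the data, which motivates the manifold learning methods discussed next.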
These methods have been successfully applied to a number of benchmark manifold problems and have also proved useful in several image processing applications.

Manifold learning algorithms may also be useful in speech analysis; for example, to project speech into a low dimensional space for visualisation, or to extract features for use in speech recognition. However, there has been relatively little research conducted in this area to date. A number of exploratory studies have shown that manifold learning algorithms can be used to successfully visualise speech data in a low dimensional space [7, 6, 8] and for phone classification [9].

In this paper, we apply several manifold learning algorithms to speech data: locally linear embedding (LLE) [4, 10], isometric feature mapping (Isomap) [5], and Laplacian eigenmaps [6]. The ability of these algorithms to discover low dimensional structure within speech data is evaluated and compared. Their performance is also contrasted with that of the classical, linear, PCA method [3].

This paper is structured as follows. In Section 2, the manifold learning algorithms LLE, Isomap and Laplacian eigenmaps are described. The corpus, experiments and results are detailed in Section 3. Section 4 discusses a number of limitations of the manifold learning algorithms. Finally, in Section 5, the conclusions are presented.
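The three nonlinear methods named above all have standard implementations that can be exercised on a benchmark manifold. The sketch below uses scikit-learn's implementations and the classic swiss-roll dataset as stand-ins; the corpus, neighbourhood size, and code used in the actual study are not specified in this section:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding

# A standard benchmark manifold: points on a 2-D surface rolled up
# in 3-D, standing in for speech data on a nonlinear submanifold.
X, _ = make_swiss_roll(n_samples=600, random_state=0)

# The three manifold learning algorithms compared in the paper,
# each mapping the data to a two dimensional embedding.
# (SpectralEmbedding is scikit-learn's Laplacian eigenmaps.)
methods = {
    "LLE": LocallyLinearEmbedding(n_neighbors=12, n_components=2),
    "Isomap": Isomap(n_neighbors=12, n_components=2),
    "Laplacian eigenmaps": SpectralEmbedding(n_neighbors=12, n_components=2),
}

embeddings = {name: m.fit_transform(X) for name, m in methods.items()}
for name, Y in embeddings.items():
    print(name, Y.shape)  # each embedding is (600, 2)
```

All three methods build a neighbourhood graph over the data, so the `n_neighbors` parameter controls the scale at which local geometry is assumed to be approximately linear.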
