Abstract

We have been investigating for some time the use of a layered modular/ensemble neural network architecture for acoustic modelling. In the particular instantiation investigated so far, this architecture decomposes the task of acoustic modelling by phone. In a first layer, at least one multilayer perceptron (a 'primary detector') is trained to discriminate each phone; in a second layer, the outputs of the first layer are combined into posterior probabilities by further MLPs. In this paper we show that our approach provides good acoustic modelling in a series of experiments on the TIMIT speech corpus. Firstly, we show that the decomposition itself provides a gain through greater precision in MLP training. Secondly, we show that primary detectors trained on different front-ends can be profitably combined; our analysis of the correlations between different detectors for the same phone shows that different front-ends contribute some independent information. Thirdly, we show how to exploit information from a wide context within our architectural framework, and that this yields performance equivalent to the best context-dependent acoustic modelling systems.
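The two-layer structure described above can be sketched as follows. This is a minimal illustrative forward pass only, not the authors' implementation: the phone inventory size, feature dimensionality, hidden-layer sizes, and random (untrained) weights are all stand-in assumptions, and each primary detector here emits a single score that a combiner MLP maps to phone posteriors via a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PHONES = 4   # toy phone inventory (assumption; TIMIT uses far more)
FEAT_DIM = 13  # e.g. one frame of cepstral features (assumption)
HIDDEN = 8     # hidden-layer width (assumption)

def mlp_forward(x, w1, b1, w2, b2):
    """Single-hidden-layer perceptron: tanh hidden layer, linear output."""
    h = np.tanh(x @ w1 + b1)
    return h @ w2 + b2

# First layer: one 'primary detector' MLP per phone, each emitting one score.
# Weights are random placeholders standing in for trained detectors.
detectors = [
    (rng.normal(size=(FEAT_DIM, HIDDEN)), np.zeros(HIDDEN),
     rng.normal(size=(HIDDEN, 1)), np.zeros(1))
    for _ in range(N_PHONES)
]

# Second layer: a combiner MLP mapping detector scores to phone posteriors.
w1c, b1c = rng.normal(size=(N_PHONES, HIDDEN)), np.zeros(HIDDEN)
w2c, b2c = rng.normal(size=(HIDDEN, N_PHONES)), np.zeros(N_PHONES)

def phone_posteriors(frame):
    scores = np.concatenate([mlp_forward(frame, *d) for d in detectors])
    logits = mlp_forward(scores, w1c, b1c, w2c, b2c)
    e = np.exp(logits - logits.max())  # softmax -> posterior probabilities
    return e / e.sum()

frame = rng.normal(size=FEAT_DIM)  # one acoustic feature vector
post = phone_posteriors(frame)     # one posterior per phone, summing to 1
```

Decomposing by phone means each detector can be trained (and its front-end chosen) independently, which is what makes combining detectors built on different front-ends straightforward in this framework.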
