Abstract

Speaker recognition is a technology that verifies a person's identity by his or her voice. Different feature parameters carry different potential information for speaker recognition. To address the problem that a single feature parameter cannot fully represent a speaker's identity, this paper proposes a feature fusion approach based on an embedding mechanism. The features fused in our approach are filter bank coefficients (Fbank) and Mel-frequency cepstral coefficients (MFCC). The potential and complementary information in the two features is captured by a neural network model that takes our embedded features as input. The d-vector output of the neural network model is classified with the Softmax loss function and optimized with the generalized end-to-end (GE2E) loss function. Two of the most common models, the long short-term memory network (LSTM) and the bi-directional long short-term memory network (BiLSTM), serve as our testbed. Results show that the proposed feature fusion approach improves the performance of both models. In particular, the minimum equal error rate is 4.17% under the BiLSTM model, a reduction of 72.2% and 28.4% relative to the single MFCC and Fbank features, respectively.
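To make the described pipeline concrete, the following is a minimal PyTorch/torchaudio sketch of one plausible reading of the architecture: Fbank and MFCC frames fused along the feature axis, a BiLSTM encoder, and an L2-normalized d-vector with a Softmax classification head. The concatenation-style fusion, layer sizes, d-vector dimension, and speaker count are illustrative assumptions, not the paper's exact embedding mechanism, and the GE2E training loop is omitted.

```python
# Hedged sketch of the fused-feature d-vector pipeline described in the abstract.
# All dimensions and the concatenation-based fusion are assumptions for illustration.
import torch
import torch.nn as nn
import torchaudio


class FusedDVectorModel(nn.Module):
    def __init__(self, n_mels=40, n_mfcc=40, hidden=256, d_dim=256, n_speakers=100):
        super().__init__()
        # BiLSTM encoder over the fused (Fbank + MFCC) frame-level features
        self.lstm = nn.LSTM(input_size=n_mels + n_mfcc,
                            hidden_size=hidden,
                            num_layers=3,
                            batch_first=True,
                            bidirectional=True)
        self.proj = nn.Linear(2 * hidden, d_dim)        # frame state -> d-vector
        self.classifier = nn.Linear(d_dim, n_speakers)  # Softmax head for training

    def forward(self, fused_feats):
        # fused_feats: (batch, time, n_mels + n_mfcc)
        out, _ = self.lstm(fused_feats)
        d_vec = self.proj(out[:, -1, :])                     # last frame as d-vector
        d_vec = nn.functional.normalize(d_vec, dim=-1)       # L2-normalize
        return d_vec, self.classifier(d_vec)                 # GE2E loss not shown


def extract_fused(waveform, sr=16000, n_mels=40, n_mfcc=40):
    # Log-Mel filter bank (Fbank) and MFCC computed from the same waveform
    fbank = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=n_mels)(waveform)
    fbank = torch.log(fbank + 1e-6)
    mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=n_mfcc)(waveform)
    # (channel, n_feats, time) -> (time, n_feats); concatenate along feature axis
    fused = torch.cat([fbank.squeeze(0).T, mfcc.squeeze(0).T], dim=-1)
    return fused.unsqueeze(0)  # add batch dimension


wave = torch.randn(1, 16000)  # 1 s of dummy audio standing in for an utterance
model = FusedDVectorModel()
d_vec, logits = model(extract_fused(wave))
print(d_vec.shape, logits.shape)  # torch.Size([1, 256]) torch.Size([1, 100])
```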
