Abstract

Speech signals recorded by a computer may be mixed, as a result of interference, with signals from other sources; these interfering signals may themselves be speech or noise. One of the best-known examples of this problem is the "cocktail party" scenario, in which a group of people speak at the same time, producing a mixture of different speech signals called the mixed signal. Solving the problem requires recovering the individual audio signals that make up the mixed signal; this task is called audio source separation. In this paper, supervised Deep Recurrent Neural Networks with Bi-directional Long Short-Term Memory (supervised DRNN-BLSTM) are used. To achieve monaural source separation, we build a model that separates audio signals from a monaural mixed signal consisting of two different speech signals (male and female). We predict two types of time-frequency masks, the Ideal Ratio Mask (IRM) and the Optimal Ratio Mask (ORM), which are used to separate the target audio sources from the mixed signal. We test the model on a dataset of 500 mixed signals. Each mixed signal is three seconds long and consists of two speaker signals (female and male), recorded in stereo at 8192 Hz. Our approach achieves gains of 0.183 dB in Signal-to-Distortion Ratio (SDR), 0.198 dB in Source-to-Interference Ratio (SIR), and 0.13 dB in Source-to-Artifacts Ratio (SAR) using the ORM mask compared to the existing model using the IRM mask.
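As a minimal sketch of the time-frequency masking idea the abstract describes (not the paper's DRNN-BLSTM model), the snippet below forms an Ideal Ratio Mask from two source magnitude spectrograms and applies it to their mixture. The spectrogram shapes and the magnitude-additivity assumption are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical magnitude spectrograms (frequency bins x time frames)
# standing in for the two speakers' STFT magnitudes.
rng = np.random.default_rng(0)
S1 = rng.random((129, 50))        # |source 1| (e.g. female speaker)
S2 = rng.random((129, 50))        # |source 2| (e.g. male speaker)
mix = S1 + S2                     # simplifying assumption: magnitudes add

eps = 1e-8                        # avoid division by zero
irm = S1 / (S1 + S2 + eps)        # Ideal Ratio Mask, values in [0, 1]
est1 = irm * mix                  # estimated source-1 magnitude

# Under the additivity assumption, masking the mixture recovers source 1.
print(np.allclose(est1, S1, atol=1e-6))
```

In the paper's setting the network predicts such a mask from the mixture alone; the IRM above is the supervised training target computed from the known sources.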
