Abstract
Effective objective methods for evaluating the quality of air traffic control speech are currently lacking. This paper proposes a new network framework based on ResNet–BiLSTM to address this issue. First, the mel-spectrogram of the speech signal is segmented using a sliding-window technique. Next, a front-end feature extractor composed of convolutional and pooling layers extracts shallow features from each mel-spectrogram segment. Then, ResNet extracts spatial features from the shallow features while BiLSTM extracts temporal features, and the two feature sets are concatenated horizontally. Finally, fully connected layers compute the speech quality score from the concatenated spatiotemporal features. We conduct experiments on an air traffic control speech database and compare the objective scores with subjective scores. The experimental results demonstrate that the proposed method correlates strongly with the mean opinion score (MOS) of air traffic control speech.
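The first stage of the pipeline, slicing the mel-spectrogram into fixed-width segments with a sliding window, can be sketched as follows. The window width, hop size, and mel-band count below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def segment_mel_spectrogram(mel, win=64, hop=32):
    """Slice a mel-spectrogram of shape (n_mels, n_frames) into
    fixed-width segments along the time axis with a sliding window.
    `win` and `hop` are illustrative defaults, not the paper's values."""
    n_mels, n_frames = mel.shape
    segments = [mel[:, start:start + win]
                for start in range(0, n_frames - win + 1, hop)]
    # Stack into a batch: (n_segments, n_mels, win)
    return np.stack(segments)

# Example: an utterance with 80 mel bands and 1000 time frames
mel = np.random.rand(80, 1000)
segs = segment_mel_spectrogram(mel)
print(segs.shape)  # (30, 80, 64)
```

Each segment would then pass through the convolutional front end, the parallel ResNet and BiLSTM branches, and the fully connected regression head to produce a quality score.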