Abstract
Splicing is one of the most common tampering techniques for speech forgery in many forensic scenarios. Some successful approaches have been presented for detecting speech splicing when the spliced segments have different signal-to-noise ratios (SNRs). However, when the SNRs of the spliced segments are close or even the same, no effective detection methods have been reported yet. In this study, noise inconsistency between the original speech and the segment inserted from other speech is utilized to detect the splicing trace. First, the noise signal of the suspected speech is extracted by a parameter-optimized noise estimation algorithm. Second, the statistical Mel frequency features are extracted from the estimated noise signal. Finally, the spliced region is located by applying a change point detection algorithm to the estimated noise signal. The effectiveness of the proposed method is evaluated on a well-designed speech splicing dataset. The comparative experimental results show that the proposed algorithm achieves better detection performance than the compared algorithms.
Highlights
With the widespread use of social networks and the rapid development of powerful audio editing tools, digital speech can be accessed, manipulated, and distributed.
PN (λ, k) to obtain the enhanced speech signal s(i). Therefore, the noise signal n(i) can be estimated from the noisy speech y(i) and the enhanced speech s(i). Then, the estimated noise n(i) is framed and windowed, and for each frame, M-dimensional Mel frequency cepstral coefficients (MFCCs) are calculated. The variance sequence V = (V1, V2, ..., Vn) of the MFCCs is obtained and taken as the input of the change point detection algorithm, and the penalty cost function is constructed by equation (11).
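The final step described above, locating change points in the variance sequence V, can be sketched as follows. Equation (11) is not reproduced in this excerpt, so the sketch assumes a generic penalized cost: the sum of squared deviations from each segment mean, plus a constant penalty for every change point, minimized by exhaustive dynamic programming. The function name `detect_change_points`, the segment cost, and the penalty value are illustrative assumptions, not the paper's definitions.

```python
from math import inf
from typing import List, Sequence


def detect_change_points(values: Sequence[float], penalty: float) -> List[int]:
    """Return the indices at which a new segment starts.

    Minimizes sum(segment costs) + penalty * (number of change points),
    where the segment cost is an ASSUMED least-squares cost (not equation (11)).
    """
    n = len(values)
    if n == 0:
        return []

    # Prefix sums allow any segment cost to be evaluated in O(1).
    prefix = [0.0]
    prefix_sq = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def segment_cost(start: int, end: int) -> float:
        # Sum of squared deviations from the mean of values[start:end].
        length = end - start
        total = prefix[end] - prefix[start]
        total_sq = prefix_sq[end] - prefix_sq[start]
        return total_sq - total * total / length

    # best[t] is the minimal penalized cost of segmenting values[:t].
    best = [0.0] * (n + 1)
    best[0] = -penalty  # the first segment does not count as a change point
    previous = [0] * (n + 1)
    for end in range(1, n + 1):
        best_cost, best_start = inf, 0
        for start in range(end):
            cost = best[start] + segment_cost(start, end) + penalty
            if cost < best_cost:
                best_cost, best_start = cost, start
        best[end] = best_cost
        previous[end] = best_start

    # Walk back through the stored segment starts to recover the change points.
    change_points = []
    end = n
    while previous[end] > 0:
        end = previous[end]
        change_points.append(end)
    change_points.reverse()
    return change_points


# Synthetic variance sequence: frames 10 to 14 mimic an inserted segment.
variances = [1.0] * 10 + [5.0] * 5 + [1.0] * 10
print(detect_change_points(variances, penalty=2.0))  # [10, 15]
```

In this toy input the two reported indices bracket the frames whose variance level differs from the rest. The penalty controls how readily new segments are opened: a larger value suppresses spurious change points at the risk of missing short insertions.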
Splicing Dataset. The transsplicing speech samples in this study are created from the NOIZEUS speech corpus [21], which consists of clean speech contaminated by various kinds of real-world noise. The clean speech comprises 30 IEEE sentences spoken by three male and three female speakers. The noise signals in NOIZEUS come from the AURORA-2 database [22], including noise from train stations, airports, exhibition halls, streets, and restaurants, as well as car noise, noise from commuter trains, and babble noise from multiperson speech.
Summary
With the widespread use of social networks and the rapid development of powerful audio editing tools (such as Adobe Audition and GoldWave), digital speech can be accessed, manipulated, and distributed. Then, the variances of the Mel frequency cepstral coefficients (MFCCs) [15] of the estimated noise signal are calculated as the detection features. The spliced region is located by a change point detection algorithm based on the penalty cost function [16]. The main work of this study is then presented, with noise estimation, feature extraction, and the change point detection algorithm described in detail. The estimated noise signal n(i) can be obtained by subtracting the enhanced speech s(i) from the noisy speech y(i), that is, n(i) = y(i) − s(i). In the noise estimation process of each frame, the minimum pmin(λ, k) in the window is tracked, and the value obtained by the tracking is used to continuously update pmin(λ, k). It can be seen from the above analysis that reasonable adjustment
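Two of the operations in this summary can be illustrated with a minimal sketch: the residual n(i) = y(i) − s(i), and windowed minimum tracking of a power sequence for a single frequency bin k across frames λ. The function names, the window length, and the use of a plain sliding-window minimum are assumptions made for illustration; the paper's parameter-optimized estimator is not reproduced in this excerpt.

```python
from collections import deque
from typing import List, Sequence


def estimate_noise(noisy: Sequence[float], enhanced: Sequence[float]) -> List[float]:
    """Residual noise n(i) = y(i) - s(i), sample by sample."""
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced signals must have the same length")
    return [y - s for y, s in zip(noisy, enhanced)]


def track_minimum(power: Sequence[float], window: int) -> List[float]:
    """Minimum of the last `window` frames for one frequency bin.

    `power` holds one (hypothetical) smoothed power value per frame.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    minima = []
    candidates = deque()  # frame indices whose power values are increasing
    for i, p in enumerate(power):
        # Drop older frames that can never be the minimum again.
        while candidates and power[candidates[-1]] >= p:
            candidates.pop()
        candidates.append(i)
        # Drop the front candidate once it falls outside the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        minima.append(power[candidates[0]])
    return minima


print(estimate_noise([1.0, 2.0, 3.0], [0.5, 1.5, 2.0]))   # [0.5, 0.5, 1.0]
print(track_minimum([4, 2, 5, 6, 7, 1], window=3))        # [4, 2, 2, 2, 5, 1]
```

A longer window makes the tracked minimum more stable but slower to follow changes in the noise level, which is one reason the window length is worth tuning.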