Abstract

Several methods for synthetic audio speech generation have been developed in the literature through the years. With the great technological advances brought by deep learning, many novel synthetic speech techniques achieving incredibly realistic results have been recently proposed. As these methods generate convincing fake human voices, they can be used maliciously to negatively impact today’s society (e.g., people impersonation, fake news spreading, opinion formation). For this reason, the ability to detect whether a speech recording is synthetic or pristine is becoming an urgent necessity. In this work, we develop a synthetic speech detector. It takes as input an audio recording, extracts a series of hand-crafted features motivated by the speech-processing literature, and classifies them in either a closed-set or an open-set scenario. The proposed detector is validated on a publicly available dataset consisting of 17 synthetic speech generation algorithms ranging from old-fashioned vocoders to modern deep learning solutions. Results show that the proposed method outperforms recently proposed detectors in the forensics literature.
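
The difference between closed-set and open-set classification can be illustrated with a minimal sketch. It is not the detector proposed in this work: the class names, the scores, and the thresholding rule are all assumptions chosen for illustration. It assumes the common reading in which a closed-set classifier always picks one of the classes seen in training, while an open-set classifier may also reject a recording as coming from an unknown generation algorithm.

```python
# Illustrative only: class names, scores, and the threshold are hypothetical.

def closed_set_label(scores):
    """Closed-set: always return the known class with the highest score."""
    return max(scores, key=scores.get)


def open_set_label(scores, threshold=0.6, unknown="unknown"):
    """Open-set (assumed rule): reject when the best score is below a threshold."""
    best = closed_set_label(scores)
    return best if scores[best] >= threshold else unknown


# Hypothetical per-class scores for two recordings.
confident = {"pristine": 0.05, "vocoder_a": 0.85, "neural_tts_b": 0.10}
ambiguous = {"pristine": 0.30, "vocoder_a": 0.36, "neural_tts_b": 0.34}

print(closed_set_label(confident), open_set_label(confident))
print(closed_set_label(ambiguous), open_set_label(ambiguous))
```

In this sketch the closed-set rule forces a decision even for the ambiguous recording, whereas the open-set rule flags it as unknown. Thresholding the top score is only one of several possible rejection strategies.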

Highlights

  • The possibility of manipulating digital multimedia objects is within everyone’s reach

  • We provide the reader with some background on state-of-the-art algorithms for synthetic speech generation and synthetic speech detection

  • Text-to-speech (TTS) synthesis was largely based on concatenative waveform synthesis, i.e., given a text as input, the output audio is produced by selecting the correct diphone units from a large dataset of diphone waveforms and concatenating them so that intelligibility is ensured [18,19,20]


Summary

Introduction

The possibility of manipulating digital multimedia objects is within everyone’s reach. The huge technological advances brought about by deep learning have delivered a series of artificial intelligence (AI)-driven tools that make manipulations extremely realistic and convincing. All of these tools are surely a great asset in a digital artist’s arsenal. Deepfake AI-driven technology enables replacing one person’s identity with someone else’s in a video [3]. This has been used to disseminate fake news through politician impersonation as well as for revenge porn distribution.

2.1 Fake speech generation

Synthetic speech generation is a problem that has been studied for many years and addressed with several approaches. For this reason, the literature contains a large number of techniques that achieve good results, and there is no single way of generating a synthetic speech track. The main drawback of concatenative synthesis is the difficulty of modifying the voice timbral characteristics, e.g., to change speaker or embed emotional content in the voice.
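
Concatenative synthesis, in which stored diphone waveforms are selected and joined, can be sketched as a toy example. Everything in it is an assumption made for illustration: the sample rate, the diphone labels, the use of sine bursts in place of recorded units, and the linear crossfade at the joins. It is not taken from any of the cited systems.

```python
import math

SAMPLE_RATE = 16000  # Hz, assumed for this toy example


def make_tone(freq_hz, duration_s=0.05):
    """Stand-in for a recorded diphone: a short sine burst."""
    n = round(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit inventory: diphone label -> stored waveform.
DIPHONE_BANK = {
    "h-e": make_tone(220.0),
    "e-l": make_tone(330.0),
    "l-o": make_tone(440.0),
}


def crossfade_concat(units, overlap=80):
    """Join waveforms, blending `overlap` samples linearly at each boundary."""
    out = list(units[0])
    for unit in units[1:]:
        for k in range(overlap):
            w = (k + 1) / (overlap + 1)  # weight of the incoming unit
            out[-overlap + k] = (1 - w) * out[-overlap + k] + w * unit[k]
        out.extend(unit[overlap:])
    return out


def synthesize(diphones):
    """Look up each requested diphone and concatenate the stored waveforms."""
    return crossfade_concat([DIPHONE_BANK[d] for d in diphones])


waveform = synthesize(["h-e", "e-l", "l-o"])
print(len(waveform))
```

Because the output is assembled from fixed stored waveforms, the timbre is whatever was recorded in the inventory. Changing the speaker or the emotional tone would require a different unit bank, which is the drawback noted above.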

