Design of Objective Quality Measures for Time-Scale Modification of Audio

Timothy Roberts

doi:10.25904/1912/4070

Abstract

Time-Scale Modification (TSM) is a well-researched field that allows for time-domain manipulation of a signal without modifying its pitch or timbre. Many TSM methods have been presented; however, quantitative results on the quality of these methods are rare, with most methods reporting informal listening tests. This is likely due to the time commitment and cost of subjective testing. Additionally, an objective measure of quality that is suitable for time-scaled signals has not yet been developed. This dissertation describes the design of effective objective measures of quality for TSM.

TSM methods are, generally, single-channel algorithms that give poor results when applied to multi-channel signals, as the phase relationship between channels must be maintained. This dissertation proposes a method, and an additional variant, for maintaining the phase relationship between channels and retaining the presence in the centre of the stereo signal. The method involves pre- and post-processing the signal, with the variant processing each frame for real-time suitability. Sum and difference transformations of the stereo signal are used for TSM and result in a large improvement in stereo phase coherence, consequently maintaining the stereo field. The proposed method produces a high-quality stereo output and greatly improves quality over the independent channel processing method. It also allows for simple implementation around all existing TSM frameworks.

A modification to the Epoch-Synchronous Overlap-Add (ESOLA) TSM algorithm is proposed in this dissertation. The proposed method, Fuzzy Epoch-Synchronous Overlap-Add, improves on the previous ESOLA method through cross-correlation of time-smeared epochs before overlap-adding. This reduces distortion and artefacts while the speaker's fundamental frequency is stable, as well as reducing artefacts during pitch modulation. The proposed method is tested against well-known TSM algorithms.
It is preferred over ESOLA and gives similar performance to other TSM algorithms for voice signals. It is also shown that this algorithm can work effectively with solo instrument signals containing strong fundamental frequencies.

No effective objective measure of quality for TSM exists. This dissertation details the creation, subjective evaluation and analysis of a dataset for use in the development of an objective measure of quality for TSM. The dataset comprises two parts: the training subset contains 88 source files processed using six TSM methods at 10 time-scales, while the testing subset contains 20 source files processed using three additional methods at four time-scales. The source material contains speech, solo harmonic and percussive instruments, sound effects and a range of music genres. A total of 42,529 ratings were collected from 633 sessions using laboratory and remote collection methods. Analysis of results shows no correlation between age and quality of rating; equivalence between expert and non-expert listeners; negligible differences between participants with and without hearing issues; and negligible differences between testing modalities. Comparison of published objective measures and subjective scores shows the objective measures to be poor indicators of subjective quality. Initial results for a retrained objective measure of quality are presented, with results approaching the average loss and correlation values of subjective sessions.

An objective measure of quality for time-scaled audio is proposed that makes use of the previously developed dataset and improves on reported results. The measure uses hand-crafted features and a fully connected network to predict subjective mean opinion scores. Basic and Advanced Perceptual Evaluation of Audio Quality features are used in addition to nine features specific to TSM artefacts.
Six methods of alignment are explored, with interpolation of the reference magnitude spectrum to the length of the test magnitude spectrum giving the best performance. The proposed measure achieves an average Root Mean Squared Error (RMSE) of 0.490 and a mean Pearson Correlation Coefficient (PCC) of 0.864, equivalent to the 97th and 82nd percentiles of subjective sessions respectively. The proposed measure is used to evaluate TSM algorithms, finding that Elastique gives the highest objective quality for solo instrument and voice signals, while the Identity Phase-Locking Phase Vocoder gives the highest objective quality for music signals and the best overall quality.

Two single-ended objective quality measures for time-scaled audio are also proposed. These measures require neither a reference signal nor alignment. Data-driven features are created by either a convolutional neural network (CNN) or a bidirectional gated recurrent unit (BGRU) network, and are fed to a fully connected network to predict subjective mean opinion scores. The proposed CNN and BGRU measures achieve average RMSEs of 0.608 and 0.576, and mean PCCs of 0.771 and 0.794, respectively. The proposed measures are used to evaluate TSM algorithms, and comparisons are provided for 16 TSM implementations.

A literature review is included with the required background knowledge. It covers the fundamentals of sound perception, sound capture, digital signal processing, time-scale modification methods used within research, and subjective and objective measures of quality. Full implementations of all proposed methods and measures can be found at github.com/zygurt/TSM, while the labelled dataset is available at http://ieee-dataport.org/1987.
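
For readers unfamiliar with the two figures of merit quoted above, the following is a minimal sketch of how RMSE and PCC between predicted and subjective mean opinion scores can be computed. It is not taken from the dissertation's code; the function names and the example score lists are hypothetical and chosen only for illustration.

```python
import math


def rmse(predicted, target):
    """Root mean squared error between two equal-length score sequences."""
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pearson(predicted, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    # Covariance and variances share the same 1/n factor, so it cancels.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)


if __name__ == "__main__":
    # Hypothetical scores on a 1-5 opinion scale, for illustration only.
    subjective_mos = [1.8, 2.6, 3.1, 3.9, 4.4]
    predicted_mos = [2.0, 2.4, 3.5, 3.6, 4.6]
    print("RMSE:", rmse(predicted_mos, subjective_mos))
    print("PCC: ", pearson(predicted_mos, subjective_mos))
```

A lower RMSE means predictions sit closer to the subjective scores in absolute terms, while a PCC nearer to 1 means the predictions rank and scale the files consistently with listeners, which is why the two values are reported together.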

Full Text