Abstract

Pitch estimation is an essential step in many audio signal processing applications. In this paper, we propose a data-driven pitch estimation network, the Dual Attention Network (DA-Net), which operates directly on the time-domain samples of monophonic music. DA-Net comprises six Dual Attention Modules (DA-Modules), each of which combines two kinds of attention: element-wise and channel-wise. Applying both attention operations to the same convolutional features reflects the idea of "symmetry" and lets the DA-Modules model the semantic interdependencies between element-wise and channel-wise features. In each DA-Module, element-wise attention is realized by a Convolutional Gated Linear Unit (ConvGLU), and channel-wise attention is realized by a Squeeze-and-Excitation (SE) block. We explored three modes of combining the two attention mechanisms: serial, parallel, and tightly coupled. Element-wise attention selectively emphasizes useful features by re-weighting the features at all positions. Channel-wise attention learns to use global information to emphasize the informative feature maps and suppress the less useful ones. DA-Net therefore adaptively integrates local features with their global dependencies. The outputs of DA-Net are fed into a fully connected layer to generate a 360-dimensional vector corresponding to 360 pitches. We trained the proposed network on the iKala and MDB-stem-synth datasets. In our experiments, the dual attention network with the tightly coupled mode achieved the best performance.
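To make the two attention mechanisms concrete, the following is a minimal pure-Python sketch of a single DA-Module in the serial mode only (ConvGLU followed by an SE block). It is an illustration under stated assumptions, not the network described in the paper: the kernel size, channel counts, SE bottleneck width, random weights, and all function names (`conv_glu`, `se_block`, `da_module_serial`) are hypothetical, and the parallel and tightly coupled modes are not shown.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv1d(x, weight, bias):
    """Zero-padded 'same' 1-D convolution.
    x: [C_in][T], weight: [C_out][C_in][K] with K odd, bias: [C_out]."""
    t = len(x[0])
    k = len(weight[0][0])
    pad = k // 2
    out = []
    for o in range(len(weight)):
        row = []
        for n in range(t):
            acc = bias[o]
            for c in range(len(x)):
                for j in range(k):
                    idx = n + j - pad
                    if 0 <= idx < t:
                        acc += weight[o][c][j] * x[c][idx]
            row.append(acc)
        out.append(row)
    return out

def conv_glu(x, w_lin, b_lin, w_gate, b_gate):
    """Element-wise attention: each position is scaled by its own gate in (0, 1)."""
    lin = conv1d(x, w_lin, b_lin)
    gate = conv1d(x, w_gate, b_gate)
    return [[a * sigmoid(g) for a, g in zip(lr, gr)] for lr, gr in zip(lin, gate)]

def dense(v, w, b):
    """Fully connected layer: w is [out][in], b is [out]."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]

def se_block(x, w1, b1, w2, b2):
    """Channel-wise attention: one scale in (0, 1) per channel from global context."""
    squeezed = [sum(ch) / len(ch) for ch in x]                # squeeze: global average
    hidden = [max(0.0, h) for h in dense(squeezed, w1, b1)]   # ReLU bottleneck
    scales = [sigmoid(s) for s in dense(hidden, w2, b2)]      # excitation
    return [[s * v for v in ch] for s, ch in zip(scales, x)]

def da_module_serial(x, params):
    """Serial mode (assumed ordering): ConvGLU first, then SE re-weighting."""
    return se_block(conv_glu(x, *params["glu"]), *params["se"])

def rand(rng, *shape):
    """Nested list of the given shape filled with small random weights."""
    if len(shape) == 1:
        return [rng.uniform(-0.5, 0.5) for _ in range(shape[0])]
    return [rand(rng, *shape[1:]) for _ in range(shape[0])]

# Hypothetical sizes, chosen only to keep the example small.
rng = random.Random(0)
c_in, c_out, k, hidden = 1, 4, 3, 2
params = {
    "glu": (rand(rng, c_out, c_in, k), rand(rng, c_out),
            rand(rng, c_out, c_in, k), rand(rng, c_out)),
    "se": (rand(rng, hidden, c_out), rand(rng, hidden),
           rand(rng, c_out, hidden), rand(rng, c_out)),
}

# A short synthetic mono frame stands in for time-domain music samples.
frame = [[math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]]
features = da_module_serial(frame, params)
print(len(features), len(features[0]))  # 4 64
```

In this sketch the ConvGLU gate varies per channel and per time step, whereas the SE block produces a single scale per channel from the global average, which is the local-versus-global contrast the abstract describes.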

Highlights

  • The fundamental frequency (F0), often referred to as pitch, is one of the most useful acoustic features and determines the perceived pitch level

  • Motivated by Gated Linear Units (GLUs) [30] and SENets [35], we propose a data-driven Dual Attention Network (DA-Net) for pitch estimation

  • For pitch estimation of monophonic music, we propose a data-driven DA-Net integrating the element-wise attention mechanism and the channel-wise attention mechanism

Summary

Introduction

The fundamental frequency (F0), often referred to as pitch, is one of the most useful acoustic features and determines the perceived pitch level. Pitch estimation is important in both monophonic and polyphonic music signal processing. Monophonic pitch tracking is used to generate pitch labels for multi-track datasets [1] and as a core step of melody extraction algorithms [2,3]. The topic has attracted increasing attention with the growing demand for singing processing [4], music information retrieval [5], large-scale analysis of different musical styles [6], and automatic music transcription [7]. Strictly speaking, pitch is a perceptual property, whereas F0 is a physical property of the audio signal; the perceived pitch is determined by the F0. Despite this important distinction, pitch and F0 are generally used interchangeably outside the field of psychoacoustics.
