Abstract

Playing the piano with correct posture is important because correct posture helps produce good sound and prevents injuries. Many studies on piano playing posture recognition have been conducted, combining various techniques. Most of these techniques are based on analyzing visual information. However, in piano education it is essential to utilize audio information in addition to visual information because of the deep relationship between posture and sound. In this paper, we propose an audio-visual tensor fusion network (AV-TFN, for short) for piano performance posture classification. Unlike existing studies that use only visual information, the proposed method also uses audio information to improve the accuracy of classifying the postures of professional and amateur pianists. To this end, we first introduce a dataset called C3Pap (Classic piano performance postures of amateurs and professionals), which contains actual piano performance videos recorded in diverse environments. Furthermore, we propose a data structure that represents audio-visual information: audio information is represented on a color scale and visual information on a black-and-white scale, expressing the relativeness between the two. We call this data structure an audio-visual tensor. Finally, we compare the performance of the proposed method with state-of-the-art approaches: VN (Visual Network), AN (Audio Network), and AVN (Audio-Visual Network) with concatenation and attention techniques. The experimental results demonstrate that AV-TFN outperforms existing approaches and can therefore be used effectively for classifying piano playing postures.
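
To make the audio-visual tensor concrete, the following is a minimal Python sketch of one way such a structure could be built, assuming the audio is summarized as a spectrogram pushed through a colormap and the video frame is reduced to grayscale; the colormap choice (viridis) and the multiplicative combination are illustrative assumptions, not the paper's exact construction.

    import numpy as np
    from matplotlib import cm

    def build_audio_visual_tensor(frame_rgb: np.ndarray, spectrogram: np.ndarray) -> np.ndarray:
        # Visual stream on the black-and-white scale: average RGB down to grayscale in [0, 1].
        gray = frame_rgb.mean(axis=2) / 255.0                          # (H, W)

        # Audio stream on the color scale: normalize, then map through a colormap.
        spec = (spectrogram - spectrogram.min()) / (np.ptp(spectrogram) + 1e-8)
        spec_rgb = cm.viridis(spec)[..., :3]                           # (h, w, 3), alpha channel dropped

        # Nearest-neighbor resize of the colored spectrogram to the frame size.
        h_idx = np.linspace(0, spec_rgb.shape[0] - 1, gray.shape[0]).astype(int)
        w_idx = np.linspace(0, spec_rgb.shape[1] - 1, gray.shape[1]).astype(int)
        spec_rgb = spec_rgb[h_idx][:, w_idx]                           # (H, W, 3)

        # Combine so both modalities survive in a single H x W x 3 tensor: the grayscale
        # visual modulates the colored audio (an assumed fusion rule for illustration).
        return spec_rgb * gray[..., None]

This keeps the audio identity in the hue and the visual identity in the intensity, which is one plausible reading of representing the relativeness between the two modalities in a single tensor.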

Highlights

  • Studies on classifying playing postures are being carried out in various fields

  • Experiments over numerous hyperparameter settings demonstrated that kernel size and stride influence the performance of piano playing posture classification

  • AV-TFN outperforms the visual network (VN), as it uses audio information to compensate for cases in which posture classification accuracy is degraded by deteriorating video quality and rapid hand movement

Summary

Introduction

Studies on classifying playing postures are being carried out in various fields. Research on piano playing posture classification can be used in piano education, playing posture training, and evaluation systems, which makes these studies important. [2] proposed a motion capture system integrated with a data glove that can visualize the skeleton of the pianist’s arms and hands. However, most such devices are expensive and uncomfortable to use. This study proposes the Audio-Visual Tensor Fusion Network (AV-TFN), the first deep learning-based method for piano playing posture classification that uses audio-visual information. This study proposes an audio-visual fusion method that represents the audio-visual information as a single data structure: a data representation that retains the identity of each modality in one structure and expresses the relativeness between audio and video information for piano playing posture classification. This study demonstrates the superiority of the proposed AV-TFN method by comparing its performance with the visual network (VN) [7], the audio network (AN), and the audio-visual network (AVN) with concatenation (AVN-Concat) [8] and attention (AVN-Atten) [9] techniques.
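
As a point of reference for the baselines named above, the sketch below contrasts the two AVN fusion strategies in Python; the feature dimensions and the modality scorer are illustrative assumptions, not the exact architectures of [8] and [9].

    import numpy as np

    rng = np.random.default_rng(0)
    v_feat = rng.standard_normal(128)   # visual features, e.g. from a CNN backbone (assumed size)
    a_feat = rng.standard_normal(128)   # audio features, e.g. from a spectrogram encoder (assumed size)

    # AVN-Concat: late fusion by simple concatenation before the classifier head.
    concat_fused = np.concatenate([v_feat, a_feat])            # (256,)

    # AVN-Atten: score each modality (a learned scorer in practice; a stand-in here),
    # softmax the scores, and take the weighted sum so the model can favor the
    # more reliable stream, e.g. audio when the video quality deteriorates.
    scores = np.array([v_feat.mean(), a_feat.mean()])          # stand-in for a learned scorer
    weights = np.exp(scores) / np.exp(scores).sum()            # softmax over the two modalities
    atten_fused = weights[0] * v_feat + weights[1] * a_feat    # (128,)

AV-TFN, by contrast, fuses the two modalities in the input representation itself, via the audio-visual tensor described above, rather than at the feature level.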

Piano Playing Posture Classification Methods
Data Representation Methods
Proposed Audio-Visual Tensor Fusion Network
Explanation of C3Pap Dataset
Feature Extraction
Data Normalization
Audio-Visual Tensor
Model Training
Implementation Details
Hyperparameter Setting
Experiments and Results
Conclusions