Abstract

Crowds express emotions as a collective individual, as is evident from the sounds a crowd produces at particular events, e.g., collective booing, laughing, or cheering at sports matches, movies, theaters, concerts, political demonstrations, and riots. A critical question concerning the innovative concept of crowd emotions is whether the emotional content of crowd sounds can be characterized by frequency-amplitude features, using analysis techniques similar to those applied to individual voices, where deep learning classification is applied to spectrogram images derived from sound transformations. In this work, we present a technique based on the generation of sound spectrograms from fixed-length fragments extracted from audio clips recorded at high-attendance events, where the crowd acts as a collective individual. Transfer learning is applied to a convolutional neural network pre-trained on low-level features using ImageNet, the well-known large-scale dataset of visual knowledge. The original sound clips are filtered and amplitude-normalized for correct spectrogram generation, on which we fine-tune the domain-specific features. Experiments on the final trained convolutional neural network show promising performance of the proposed model in classifying the emotions of the crowd.
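A minimal sketch of such a pipeline is given below, assuming hypothetical parameters throughout: the fragment length, band-pass range, mel-spectrogram settings, the ResNet-18 backbone, and the number of emotion classes are illustrative choices for the example, not the configuration reported in the paper.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn
from scipy.signal import butter, sosfilt
from torchvision import models

N_CLASSES = 5  # assumed number of crowd-emotion classes (illustrative)

def fragments(path, sr=22050, fragment_s=2.0, band=(20.0, 8000.0)):
    """Load a clip, band-pass filter it, peak-normalize its amplitude,
    and cut it into fixed-length fragments (all parameters assumed)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    sos = butter(4, band, btype="bandpass", fs=sr, output="sos")
    y = sosfilt(sos, y)
    y = y / (np.max(np.abs(y)) + 1e-9)  # amplitude normalization
    n = int(fragment_s * sr)
    return [y[i:i + n] for i in range(0, len(y) - n + 1, n)]

def spectrogram_image(fragment, sr=22050, n_mels=128):
    """Mel spectrogram in dB, replicated to 3 channels for an ImageNet CNN."""
    S = librosa.power_to_db(
        librosa.feature.melspectrogram(y=fragment, sr=sr, n_mels=n_mels),
        ref=np.max)
    S = (S - S.min()) / (S.max() - S.min() + 1e-9)      # scale to [0, 1]
    x = torch.tensor(S, dtype=torch.float32).unsqueeze(0)  # 1 x mel x time
    return x.repeat(3, 1, 1)                               # 3 x mel x time

# ImageNet-pretrained backbone: freeze the low-level features,
# then fine-tune a new classification head on the spectrogram images.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, N_CLASSES)  # trainable classifier
```

In a full training loop, the three-channel spectrogram tensors would additionally be resized and normalized to match the backbone's expected input statistics.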

Highlights

  • For a long time, research on sound emotion recognition has mainly focused on the individual dimension, aiming at detecting emotions either perceived by single listeners, typically through music [17], or produced by single speakers' speech [8, 16, 27, 34] and expressed by fine-tuning different shades of vocal features [21, 25]

  • Rethinking the emotional classes for the crowd context, we present an extension of the preliminary ideas on crowd sound and an implementation of a crowd-sound emotion model, using deep learning and transfer learning techniques

  • The first approach follows standard practice in image classification, as used in state-of-the-art works on speech emotion recognition [25], where a dataset is partitioned into two subsets by randomly picking images and assigning them to the training set or the test set according to a given proportion (see the sketch below)
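As a concrete illustration of this random partition, here is a minimal sketch; the 80/20 proportion, the fixed seed, and the function name are assumptions for the example, not the paper's reported setup.

```python
import random

def random_split(image_paths, train_fraction=0.8, seed=42):
    """Randomly assign spectrogram images to training and test sets."""
    rng = random.Random(seed)
    paths = list(image_paths)
    rng.shuffle(paths)
    cut = int(len(paths) * train_fraction)
    return paths[:cut], paths[cut:]  # (training set, test set)
```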



Introduction

For a long time, research on sound emotion recognition has mainly focused on the individual dimension, aiming at detecting emotions either perceived by single listeners, typically through music [17], or produced by single speakers' speech [8, 16, 27, 34] and expressed by fine-tuning different shades of vocal features [21, 25]. [29] introduced the innovative proposal of investigating the emotions embedded in crowd sounds, collectively produced by the participants in mass events. It is well known how a stadium of football fans can loudly express approval or disapproval, highlighting different phases of the game, e.g., showing happiness for a goal or disappointment for a missed one. In public events (e.g., concerts, receptions, parties, political meetings, protests, riots) and in public areas hosting social activities (e.g., an open-air marketplace, a shopping mall, a restaurant, an airport hall), the crowd can collectively express its emotions by laughing, cheering, booing, or shouting in protest, or can show a neutral emotion, like, for example, the background sound produced by a group quietly chatting at a party, or by a sports stadium crowd during a boring part of the match. The expression "the crowd roar" [20] captures the essence of the concept of collective emotion expressed through sound by the collective individual, i.e., the crowd, dynamically influencing the behavior of the single individuals.
