DECOMPL: Decompositional Learning with Attention Pooling for Group Activity Recognition from a Single Volleyball Image
Group Activity Recognition (GAR) aims to detect the activity performed by multiple actors in a scene. Prior works model spatio-temporal features based on RGB, optical flow, or keypoint data. However, combining temporality with these data types significantly increases the computational complexity. Our hypothesis is that by using only RGB data without temporality, performance can be maintained with a negligible loss in accuracy. To that end, we propose a novel GAR technique for volleyball videos, DECOMPL, which consists of two complementary branches. In the visual branch, it extracts features selectively using attention pooling. In the coordinate branch, it considers the current configuration of the actors and extracts spatial information from the box coordinates. Moreover, we analyzed the Volleyball dataset, on which most of the recent literature is based, and realized that its labeling scheme degrades the group concept in the activities to the level of individual actors. We manually reannotated the dataset in a systematic manner to emphasize the group concept. Experimental results on the Volleyball dataset as well as the Collective Activity dataset (from another domain, i.e., not volleyball) demonstrate the effectiveness of the proposed model DECOMPL, which delivered the best/second-best GAR performance with the reannotations/original annotations among comparable state-of-the-art techniques. Our code, results, and new annotations will be made available through GitHub after the revision process.
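The two-branch decomposition this abstract describes can be sketched in a few lines. This is a minimal illustration only: the feature extractor, the attention scores, and the way the branches are fused are assumptions here, not the paper's actual architecture.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(actor_feats, scores):
    """Visual branch: selectively pool per-actor feature vectors as a
    softmax-weighted sum, so informative actors contribute more."""
    w = softmax(scores)
    dim = len(actor_feats[0])
    return [sum(w[i] * actor_feats[i][d] for i in range(len(actor_feats)))
            for d in range(dim)]

def decompl_sketch(actor_feats, scores, boxes):
    """Concatenate the pooled visual features with the flattened box
    coordinates (coordinate branch) into one group-level representation."""
    visual = attention_pool(actor_feats, scores)
    coords = [c for box in boxes for c in box]
    return visual + coords
```

With equal attention scores the pooled vector reduces to the plain average, which makes the selectivity of unequal scores easy to see.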
- Research Article
- 10.1360/ssi-2020-0235
- Feb 25, 2021
- SCIENTIA SINICA Informationis
In group activity recognition, the hierarchical framework is widely used to represent the relationships between individuals and their corresponding groups and has achieved promising performance. However, existing methods simply employ max/average pooling in this framework, overlooking the distinct contributions of different individuals to the group activity. In this paper, we propose a new contextual pooling scheme, named attentive pooling, which enables weighted information transition from individual actions to the group activity. Using the attention mechanism, attentive pooling is intrinsically interpretable and can embed the member context in the existing hierarchical model. To verify the effectiveness of the proposed scheme, two specific attentive pooling methods, i.e., global attentive pooling (GAP) and hierarchical attentive pooling (HAP), are designed. GAP rewards individuals significant to the group activity, while HAP further considers the hierarchical division by introducing a subgroup structure. Experimental results on the benchmark dataset demonstrate that the proposed scheme is significantly superior to the baseline and comparable to state-of-the-art methods.
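The contrast this abstract draws between plain max/average pooling and attentive (weighted) pooling can be sketched as follows; the dot-product scoring against a query vector is an assumption standing in for GAP's actual attention network, not the paper's implementation.

```python
import math

def avg_pool(feats):
    # Baseline: every individual contributes equally.
    n, d = len(feats), len(feats[0])
    return [sum(f[j] for f in feats) / n for j in range(d)]

def max_pool(feats):
    # Baseline: only the per-dimension maximum survives.
    d = len(feats[0])
    return [max(f[j] for f in feats) for j in range(d)]

def attentive_pool(feats, query):
    """Weight each member's features by softmax(dot(query, feat)), then sum.
    Returns the pooled vector and the interpretable per-member weights."""
    scores = [sum(q * x for q, x in zip(query, f)) for f in feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    w = [e / z for e in exps]
    d = len(feats[0])
    pooled = [sum(w[i] * feats[i][j] for i in range(len(feats))) for j in range(d)]
    return pooled, w
```

The returned weights are what makes the scheme interpretable: they directly show which members the group-level prediction relied on.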
- Conference Article
14
- 10.1109/cvprw.2009.5204329
- Jun 1, 2009
Summary form only given. A significant amount of computer vision research has recently addressed the recognition of human activities. Researchers have been particularly successful in recognizing the activities of one individual or between two individuals, such as pushing and fighting. Notably, in our previous work, we presented a representation syntax to describe high-level human-human interactions based on their sub-events, and proposed a hierarchical algorithm to recognize the represented interactions probabilistically. Not only simple interactions such as punching, kicking, and shaking hands, but also recursive interactions such as "fighting" between two persons, were recognized with our previous framework. In this paper, we take the next evolutionary step in human activity recognition: recognition of group activities. Group activities are activities that can be characterized by the movements of members who belong to one or more conceptual groups. Recognition of groups and their activities makes the detection of high-level events possible, especially when such events are semantically meaningful in terms of the overall actions of multiple persons considered jointly but not when those persons are considered individually. Automated recognition of suspicious groups and their activities, such as 'a group of thieves robbing a bank', is essential for the construction of high-level surveillance systems. The analysis of movements and plays in team sports also becomes possible with a group activity recognition system. The semantic understanding of military operations and joint work is another application of group activity recognition. This paper describes a stochastic methodology for the recognition of various types of high-level group activities. Our system maintains a probabilistic representation of a group activity, describing how the individual activities of its group members must be organized temporally, spatially, and logically.
In order to recognize each of the represented group activities, the system searches for a set of group members that has the maximum posterior probability while satisfying its representation. A hierarchical recognition algorithm utilizing Markov chain Monte Carlo (MCMC)-based probability distribution sampling has been designed to detect group activities and find the acting groups simultaneously. The system is developed to recognize complex activities such as 'two groups fighting', 'a group of thieves stealing an object from another group', and 'a group assaulting a person'. Videos downloaded from YouTube as well as videos that we recorded ourselves were tested. Experimental results show that our system recognizes complicated group activities, and does so more reliably and accurately than previous approaches by analyzing them probabilistically.
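The MCMC search for the maximum-posterior member set can be illustrated with a toy Metropolis-style sampler. The log-posterior below (individual log-likelihoods plus pairwise affinities among members) is a stand-in assumption for the paper's representation-constrained posterior, which is far richer.

```python
import math
import random

def log_posterior(members, indiv_ll, pair_affinity):
    """Toy log-posterior: individual log-likelihoods plus pairwise affinities
    (the affinity term stands in for temporal/spatial/logical consistency)."""
    if not members:
        return -math.inf
    s = sum(indiv_ll[i] for i in members)
    ms = sorted(members)
    s += sum(pair_affinity(a, b) for k, a in enumerate(ms) for b in ms[k + 1:])
    return s

def mcmc_group_search(n, indiv_ll, pair_affinity, iters=2000, seed=0):
    """Metropolis sampling over member subsets; returns the best set visited."""
    rng = random.Random(seed)
    cur = {rng.randrange(n)}
    cur_p = log_posterior(cur, indiv_ll, pair_affinity)
    best, best_p = set(cur), cur_p
    for _ in range(iters):
        prop = set(cur) ^ {rng.randrange(n)}  # propose adding/removing one person
        prop_p = log_posterior(prop, indiv_ll, pair_affinity)
        # Metropolis acceptance on the log-posterior difference
        if math.log(rng.random() + 1e-12) < prop_p - cur_p:
            cur, cur_p = prop, prop_p
            if cur_p > best_p:
                best, best_p = set(cur), cur_p
    return best
```

On a small scene where three people act coherently and two do not, the sampler recovers the coherent subset as the acting group.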
- Conference Article
23
- 10.1109/ijcnn.2016.7727387
- Jul 1, 2016
The recognition of group activities using computer vision and pattern recognition methods has been, and still remains, a challenging problem. Most of the research on human behaviour has focused on recognizing individuals, from actions to behaviours. However, the analysis and recognition of group activities, the relationships of different groups in the scene, and the interaction of the individuals within a group are still considered an open problem. This paper proposes a novel representation method to analyse and recognise group activities, called the Group Activity Descriptor Vector (GADV). It is calculated from the trajectory described by the group and by the individuals who form it. Specifically, the GADV describes three different components: the trajectory followed by the group, the coherence of the individual trajectories in the group and, finally, the movement relationships among different groups in the scene. The trajectory analysis allows a simple high-level understanding of complex group activities. The GADV representation has been evaluated with different self-organizing neural networks using BEHAVE and CAVIAR dataset sequences, obtaining high accuracy in the recognition of group activities and outperforming state-of-the-art methods.
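The three GADV components named above (group trajectory, intra-group coherence, inter-group relationships) can be approximated from raw positions as below; the paper's exact descriptor computation may differ, so treat this as an illustrative sketch.

```python
import math

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def group_trajectory(frames):
    """Component 1: trajectory of the group centroid over frames
    (frames is a list of per-frame lists of member positions)."""
    return [centroid(pts) for pts in frames]

def coherence(frames):
    """Component 2: mean spread of individuals around the centroid;
    a low spread indicates coherent individual trajectories."""
    spreads = []
    for pts in frames:
        cx, cy = centroid(pts)
        spreads.append(sum(math.hypot(x - cx, y - cy) for x, y in pts) / len(pts))
    return sum(spreads) / len(spreads)

def inter_group_distance(frames_a, frames_b):
    """Component 3: per-frame distance between two group centroids,
    a cue for the movement relationship between groups."""
    return [math.hypot(ax - bx, ay - by)
            for (ax, ay), (bx, by) in zip(group_trajectory(frames_a),
                                          group_trajectory(frames_b))]
```

Stacking these quantities over a sequence yields a compact trajectory-level descriptor of the kind a self-organizing network can cluster.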
- Research Article
39
- 10.1016/j.buildenv.2019.05.016
- May 10, 2019
- Building and Environment
A framework for group activity detection and recognition using smartphone sensors and beacons
- Conference Article
44
- 10.1109/wmvc.2008.4544065
- Jan 1, 2008
The paper describes a methodology for the recognition of high-level group activities. Our system recognizes group activities including group actions, group-person interactions, group-group (i.e., inter-group) interactions, intra-group interactions, and their combinations, described using a common representation scheme. Our approach is to represent various types of complex group activities with a programming-language-like representation, and then to recognize the represented activities based on the recognition of the activities of individual group members. A hierarchical recognition algorithm is designed for the recognition of high-level group activities. The system was tested on activities such as 'two groups fighting', 'a group of thieves stealing an object from another group', and 'a group of policemen arresting a group of criminals (or a criminal)'. Videos downloaded from YouTube as well as videos that we recorded ourselves were tested. Experimental results show that our system recognizes complicated group activities, and does so more reliably and accurately than previous approaches.
- Conference Article
8
- 10.1109/aina.2016.94
- Mar 1, 2016
Human activity recognition using mobile sensors is becoming increasingly important. Scaling up from individuals to groups, that is, Group Activity Recognition (GAR), has attracted significant attention recently. This paper investigates energy consumption for GAR and proposes a novel distributed middleware called GroupSense for mobile GAR. We implemented and tested GroupSense, which incorporates a protocol for the exchange of information required for GAR. We also investigated the battery drain of continuous activity recognition in a range of simple GAR scenarios. We then conclude with lessons learnt for GAR.
- Conference Article
8
- 10.1109/mdm.2014.62
- Jul 1, 2014
Group Activity Recognition (GAR) is a challenging research area in context-aware computing which has attracted much attention recently. Many studies have been conducted in the field of activity recognition (AR), along with applications in domains such as health, smart homes, daily living, and life logging. However, many open issues still exist. The lack of an energy-efficient approach is one of the most vital issues in the context of AR. GAR work often suffers from energy consumption issues because, apart from the AR process itself, more interaction is required among members of the group and more complex recognition processes need to run. Moreover, almost all work in GAR is technology-oriented and assumes that the real-life environment remains fixed once the system has been established, but this may not be the case. Hence, we propose a framework called GroupSense for GAR to address these issues. In addition, a relatively simple scheme for GAR, with a protocol for the exchange of information required for GAR, has been implemented, tested, and evaluated. We then conclude with lessons learnt for GAR.
- Conference Article
7
- 10.1109/ijcnn48605.2020.9207366
- Jul 1, 2020
Nowadays, the recognition of group activities is a significant problem, especially in video surveillance. It is increasingly important to have vision architectures that automatically allow timely recognition of group activities and predictions about them in order to make decisions. This paper proposes a computer vision architecture able to learn and recognise group activities from their movements in the scene. It is based on the Activity Description Vector (ADV), a descriptor able to represent the trajectory information of an image sequence as a collection of the local movements that occur in specific regions of the scene. The proposal evolves this descriptor towards the generation of images that serve as the input to a two-stream convolutional neural network capable of robustly classifying group activities. Hence, besides the use of trajectory analysis, which allows a simple high-level understanding of complex group activities, this proposal takes advantage of deep learning, providing a robust architecture for multi-class recognition. The architecture has been evaluated and compared to other approaches using BEHAVE and INRIA dataset sequences, obtaining strong performance in the recognition of group activities.
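The core ADV idea of accumulating local movements per scene region can be sketched as a grid of displacement magnitudes; the grid size, normalization, and the rendering into CNN input images are assumptions here, not the paper's exact pipeline.

```python
def activity_description_vector(tracks, grid=(4, 4), width=640, height=480):
    """Accumulate local displacement magnitudes per spatial cell over a sequence.
    tracks: list of per-object point lists [(x, y), ...] in pixel coordinates."""
    gx, gy = grid
    adv = [[0.0] * gx for _ in range(gy)]
    for pts in tracks:
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            # Bin the movement by where it starts; clamp to the last cell.
            cx = min(int(x0 * gx / width), gx - 1)
            cy = min(int(y0 * gy / height), gy - 1)
            adv[cy][cx] += ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    return adv  # a 2D grid that can be rendered as an image for a CNN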
- Book Chapter
15
- 10.1007/978-981-15-0994-0_9
- Nov 30, 2019
There has been tremendous advancement in machine learning techniques, especially for automatic group activity recognition (GAR), over the past few decades. This review article surveys the modern advances made in video-based group activity recognition techniques. Various applications, including video surveillance systems, sports analytics, and human-behaviour characterization for robotics, require a group activity recognition system. Comprehensive reviews of the machine learning (ML) techniques employed in GAR, such as hidden Markov models (HMMs), graphical methods, and support vector machines, are discussed. A comprehensive review of the latest progress in deep learning models, which have delivered important improvements in GAR performance, is also presented. The main purpose of this survey is to broadly categorize and analyse GAR according to handcrafted features based on machine learning models and learned features based on deep models. Various GAR models are discussed, covering activities of an individual person, person-to-person interaction, person-to-group interaction, and group interaction, using temporal sequence information from video frames for the recognition of group activity. The review covers diverse applications, with the described models applied specifically to surveillance, sports analytics, video summarization, etc.
- Conference Article
3
- 10.1109/icassp49357.2023.10096109
- Jun 4, 2023
Group activity recognition, which aims to simultaneously understand individual actions and group activity in video clips, plays a fundamental role in video analysis. In this paper, we propose a novel reasoning network, a Hierarchical Spatial-Temporal Transformer termed HSTT, for individual action and group activity recognition, which focuses on adaptively and jointly capturing the various degrees of spatial-temporal dynamic interactions among actors. Specifically, we first design a hierarchical spatial-temporal Transformer that captures different levels of relationships to deal with unequal interaction relationships among actors. Furthermore, our proposed spatial-temporal Transformer (STT) block is capable of fully mining long-range spatial-temporal interactions by virtue of its merge function and cross-attention mechanism. Besides, we adopt a motion trajectory branch to provide complementary dynamic features for improving recognition performance. Extensive experiments on two public GAR datasets clearly show that our approach achieves very competitive performance compared with state-of-the-art works.
- Research Article
24
- 10.1016/j.eswa.2023.122482
- Nov 22, 2023
- Expert Systems with Applications
Aging is inevitably associated with a decline in physical abilities and can pose challenges to the social lives of elderly individuals. In long-term care facilities, group exercise is instrumental in keeping elderly residents physically and socially healthy. Accommodating these needs in elderly care can be challenging due to staff shortages and other lacking resources. A robotic exercise coach could be helpful in such contexts. Intelligent human-robot interaction requires accurate and efficient human activity recognition. Several solutions focusing on human activity recognition in healthcare robotics have been proposed. However, multiperson activity recognition remains a challenging task when using vision-based or wearable sensor data, and past research has mainly focused on single-person rather than multiperson or group activity recognition. Moreover, the existing state-of-the-art methods for activity recognition mainly use heavyweight Convolutional Neural Network (CNN) models to achieve good accuracy. However, these models have certain drawbacks, such as significant computational resource requirements, higher memory and storage needs, and slower inference times. Another challenge is the limited number of publicly available datasets containing few activities for physical activity recognition. In this work, we propose a lightweight, deep learning-based, multiperson activity recognition system for the group exercise training of elderly persons. Considering the limited publicly available datasets, we curated a new dataset named the Routine Exercise Dataset (RED), comprising 19 routine exercise activities recommended for elderly persons. The RED dataset has 14,440 samples collected from 19 participants and is one of the most extensive datasets of its kind.
We evaluated our proposed activity recognition method, based on the proposed feature extraction modules and a one-dimensional multilayer long short-term memory network, on 16 datasets, including 10 publicly available benchmark activity recognition datasets, the RED dataset, a publicly available dataset combined with the RED dataset, and four noise-corrupted RED datasets. The results indicate the efficiency of the proposed method for real-time activity recognition compared to state-of-the-art methods. The proposed method achieved F1-scores of 98.64%, 97.95%, and 99% on the large-scale datasets UESTC RGB-D, NTU RGB+D, and RED, respectively. We also developed a Robot Operating System (ROS)-based application to deploy our proposed system in a social robot and test it in real-life scenarios.
- Research Article
3
- 10.1016/j.patcog.2024.111118
- Nov 1, 2024
- Pattern Recognition
A unified framework for unsupervised action learning via global-to-local motion transformer
- Research Article
15
- 10.3390/fi12080133
- Aug 9, 2020
- Future Internet
Deep learning (DL) models have emerged in recent years as the state-of-the-art technique across numerous machine learning application domains. In particular, image processing-related tasks have seen a significant improvement in terms of performance due to increased availability of large datasets and extensive growth of computing power. In this paper we investigate the problem of group activity recognition in office environments using a multimodal deep learning approach, by fusing audio and visual data from video. Group activity recognition is a complex classification task, given that it extends beyond identifying the activities of individuals, by focusing on the combinations of activities and the interactions between them. The proposed fusion network was trained based on the audio–visual stream from the AMI Corpus dataset. The procedure consists of two steps. First, we extract a joint audio–visual feature representation for activity recognition, and second, we account for the temporal dependencies in the video in order to complete the classification task. We provide a comprehensive set of experimental results showing that our proposed multimodal deep network architecture outperforms previous approaches, which have been designed for unimodal analysis, on the aforementioned AMI dataset.
- Research Article
6
- 10.1186/s13174-019-0103-1
- Mar 1, 2019
- Journal of Internet Services and Applications
Human activity recognition using mobile and embedded sensors is becoming increasingly important. Scaling up from individuals to groups, that is, group activity recognition, has attracted significant attention recently. This paper proposes a model and specification language for group activities called GroupSense-L, and a novel architecture called GARSAaaS (GARSA-as-a-Service) to provide services for mobile Group Activity Recognition and Situation Analysis (GARSA) applications. We implemented and evaluated GARSAaaS, an extension of a framework called GroupSense (Abkenar et al., IEEE 30th International Conference on Advanced Information Networking and Applications (AINA), 2016), where sensor data collected from smartphone sensors, smartwatch sensors, and sensors embedded in things are aggregated via a protocol that lets these different devices share the information required for GARSA. We illustrate our approach via a scenario providing services for bushwalking leaders and bushwalkers in a group bushwalking activity. We demonstrate the feasibility and expressiveness of our proposed model.
- Conference Article
9
- 10.1109/avss.2016.7738071
- Aug 1, 2016
In this research study, we propose an automatic group activity recognition approach that models the interdependencies of group activity features over time. Unlike in simple human activity recognition approaches, the distinguishing characteristics of group activities are often determined by how the movements of people influence one another. We propose to model the group interdependencies in both the motion and location spaces. These spaces are extended to time-space and time-movement spaces and modelled using Kernel Density Estimation (KDE). Such representations are then fed into a machine learning classifier which identifies the group activity. Unlike other approaches to group activity recognition, we do not rely on the manual annotation of pedestrian tracks from the video sequence.
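The time-space densities this abstract describes can be estimated with a plain Gaussian product-kernel KDE over (time, position) samples; the single shared bandwidth and the two-dimensional feature space are illustrative assumptions, not the paper's configuration.

```python
import math

def kde(samples, point, bandwidth=1.0):
    """Gaussian product-kernel density estimate at `point`, given a list of
    equally weighted d-dimensional samples, e.g. (time, position) pairs."""
    d = len(point)
    # Normalizer of a d-dimensional isotropic Gaussian kernel, averaged over samples.
    norm = (math.sqrt(2 * math.pi) * bandwidth) ** d * len(samples)
    total = 0.0
    for s in samples:
        sq = sum((p - q) ** 2 for p, q in zip(point, s))
        total += math.exp(-sq / (2 * bandwidth ** 2))
    return total / norm
```

Evaluating such densities on a grid over the time-space and time-movement domains yields a fixed-length representation that a standard classifier can consume.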