Activity Recognition From Newborn Resuscitation Videos.
Birth asphyxia is one of the leading causes of neonatal deaths. A key to survival is performing immediate and continuous quality newborn resuscitation. A dataset of signals recorded during newborn resuscitation, including videos, has been collected in Haydom, Tanzania, and the aim is to analyze the treatment and its effect on the newborn outcome. An important step is to generate timelines of relevant resuscitation activities, including ventilation, stimulation, suction, etc., during the resuscitation episodes. We propose a two-step deep neural network system, ORAA-net, utilizing low-quality video recordings of resuscitation episodes to do activity recognition during newborn resuscitation. The first step is to detect and track relevant objects using Convolutional Neural Networks (CNN) and post-processing, and the second step is to analyze the proposed activity regions from step 1 to do activity recognition using 3D CNNs. The system recognized the activities newborn uncovered, stimulation, ventilation and suction with a mean precision of 77.67%, a mean recall of 77.64%, and a mean accuracy of 92.40%. Moreover, the accuracy of the estimated number of Health Care Providers (HCPs) present during the resuscitation episodes was 68.32%. The results indicate that the proposed CNN-based two-step ORAA-net could be used for object detection and activity recognition in noisy low-quality newborn resuscitation videos. A thorough analysis of the effect the different resuscitation activities have on the newborn outcome could potentially allow us to optimize treatment guidelines, training, debriefing, and local quality improvement in newborn resuscitation.
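The abstract reports precision, recall and accuracy as means over the four activities. The sketch below shows one plausible way such means could be computed; it assumes that each time step carries a set of active labels (activities may overlap) and that every activity is scored as its own binary problem before averaging. The function names, the label strings and the toy timeline are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Set, Tuple

# Label strings are illustrative stand-ins for the four activities in the abstract.
ACTIVITIES = ["uncovered", "stimulation", "ventilation", "suction"]


def binary_scores(truth: List[bool], pred: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall and accuracy for one activity over all time steps."""
    tp = sum(t and p for t, p in zip(truth, pred))
    fp = sum((not t) and p for t, p in zip(truth, pred))
    fn = sum(t and (not p) for t, p in zip(truth, pred))
    tn = sum((not t) and (not p) for t, p in zip(truth, pred))
    # Assumption: an undefined ratio (no predictions / no positives) counts as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(truth)
    return precision, recall, accuracy


def mean_scores(truth: List[Set[str]], pred: List[Set[str]]) -> Tuple[float, ...]:
    """Average the per-activity scores over all activities."""
    per_activity = [
        binary_scores([a in s for s in truth], [a in s for s in pred])
        for a in ACTIVITIES
    ]
    return tuple(sum(s[i] for s in per_activity) / len(per_activity) for i in range(3))


# Toy timeline of four time steps (hypothetical data).
truth = [{"uncovered"}, {"uncovered", "ventilation"},
         {"uncovered", "ventilation"}, {"uncovered", "stimulation"}]
pred = [{"uncovered"}, {"uncovered"},
        {"uncovered", "ventilation"}, {"uncovered", "suction"}]
mean_precision, mean_recall, mean_accuracy = mean_scores(truth, pred)
```

Accuracy comes out much higher than precision and recall on this toy timeline because true negatives dominate for rare activities, which mirrors the gap between the reported 92.40% accuracy and the roughly 77.6% precision and recall.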
- Book Chapter
9
- 10.1007/978-3-030-04167-0_23
- Jan 1, 2018
Recently, convolutional neural networks (CNNs) have been extensively applied to human action recognition in videos, fusing appearance and motion information in a two-stream network. However, the performance of human action recognition in videos still lags far behind that of still-image recognition because temporal information is difficult to extract. In this paper, we propose a multi-stream architecture with convolutional neural networks for human action recognition in videos to extract more temporal features. We make three contributions: (a) we present a multi-stream architecture with 3D and 2D convolutional neural networks that uses still RGB frames, dense optical flows and gradient maps as separate network inputs; (b) we propose a novel 3D convolutional neural network with residual blocks and use a deep 2D convolutional neural network, augmented with attention blocks, as the pre-trained network to extract the major motion information; (c) we fuse the multi-stream networks using weights not only for each network but also for every action category, to take advantage of the optimal performance of each network. Our networks are trained and evaluated on the standard video action benchmarks UCF-101 and HMDB-51, and the results show that our method achieves considerable recognition performance, comparable to the state of the art.
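Contribution (c) describes fusion with a weight for every (network, action category) pair rather than a single weight per network. A minimal sketch of that idea follows; the stream names, scores and weights are made-up numbers, and the abstract does not say how the weights are obtained, so they are simply given here.

```python
from typing import Dict, List


def fuse_scores(stream_scores: Dict[str, List[float]],
                weights: Dict[str, List[float]]) -> List[float]:
    """Class-wise weighted fusion: fused[c] = sum over streams of w[s][c] * score[s][c]."""
    num_classes = len(next(iter(stream_scores.values())))
    fused = [0.0] * num_classes
    for name, scores in stream_scores.items():
        w = weights[name]
        if len(scores) != num_classes or len(w) != num_classes:
            raise ValueError(f"stream {name!r} has a mismatched number of classes")
        for c in range(num_classes):
            fused[c] += w[c] * scores[c]
    return fused


def predict(fused: List[float]) -> int:
    """Index of the class with the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical per-class scores from three streams for one video, three classes.
scores = {
    "rgb": [0.6, 0.3, 0.1],
    "flow": [0.2, 0.6, 0.2],
    "gradient": [0.3, 0.3, 0.4],
}
# Hypothetical weights: each stream is trusted more on the classes it handles well.
class_weights = {
    "rgb": [0.5, 0.2, 0.3],
    "flow": [0.3, 0.6, 0.3],
    "gradient": [0.2, 0.2, 0.4],
}
fused = fuse_scores(scores, class_weights)
label = predict(fused)
```

Here the RGB stream alone would pick class 0, but the flow stream carries a large weight on class 1, so the fused prediction is class 1.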
- Research Article
16
- 10.1177/1729881418825093
- Jan 1, 2019
- International Journal of Advanced Robotic Systems
Temporal information plays a significant role in video-based human action recognition. How to effectively extract the spatial–temporal characteristics of actions in videos has always been a challenging problem. Most existing methods acquire spatial and temporal cues in videos individually. In this article, we propose a new effective representation for depth video sequences, called hierarchical dynamic depth projected difference images, that can aggregate the spatial and temporal information of actions simultaneously at different temporal scales. We first project depth video sequences onto three orthogonal Cartesian views to capture the 3D shape and motion information of human actions. Hierarchical dynamic depth projected difference images are constructed with rank pooling in each projected view to hierarchically encode the spatial–temporal motion dynamics in depth videos. Convolutional neural networks can automatically learn discriminative features from images and have been extended to video classification because of their superior performance. To verify the effectiveness of the hierarchical dynamic depth projected difference images representation, we construct an action recognition framework based on it, in which the hierarchical dynamic depth projected difference images in three views are fed into three identical pretrained convolutional neural networks independently for fine-tuning. We design three classification schemes in the framework; the schemes utilize different convolutional neural network layers so that their effects on action recognition can be compared. Three views are combined to describe the actions more comprehensively in each classification scheme. The proposed framework is evaluated on three challenging public human action data sets. Experiments indicate that our method has better performance and can provide discriminative spatial–temporal information for human action recognition in depth videos.
- Conference Article
- 10.1109/ijcnn48605.2020.9207404
- Jul 1, 2020
With the emergence of a large number of video resources, video action recognition is attracting much attention. Recently, recognizing the outstanding performance of three-dimensional (3D) convolutional neural networks (CNNs), many works have begun to apply them to action recognition and have obtained satisfactory results. However, high computational overheads greatly reduce the efficiency of 3D CNNs. To make up for this shortcoming, in this paper we first propose two innovations, the Xwise Separable Convolution and the SS block, both of which are lightweight. We then build an efficient 3D CNN called the XwiseNet based on these innovations. Our work aims to make 3D CNNs lightweight without reducing the recognition accuracy. The key idea of the Xwise Separable Convolution is an extreme decoupling of the 3D convolution in the channel, spatial, and temporal dimensions. The SS block can capture temporal long-range dependencies by aggregating sequence-specific global context to each sequence feature. Experiments have verified that our XwiseNet achieves competitive performance with the least computational overhead.
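To make the benefit of decoupling a 3D convolution concrete, the sketch below counts weights for a standard 3D convolution and for one possible fully decoupled variant: a pointwise channel-mixing convolution, followed by a depthwise spatial convolution and a depthwise temporal convolution. This factorization is an assumption chosen for illustration; the abstract does not specify the exact layout of the Xwise Separable Convolution, and biases are ignored.

```python
def conv3d_params(c_in: int, c_out: int, kt: int, kh: int, kw: int) -> int:
    """Weights in a standard 3D convolution (no bias)."""
    return c_in * c_out * kt * kh * kw


def decoupled_params(c_in: int, c_out: int, kt: int, kh: int, kw: int) -> int:
    """Weights in an assumed channel / spatial / temporal factorization (no bias)."""
    pointwise = c_in * c_out      # 1x1x1 convolution mixes channels only
    spatial = c_out * kh * kw     # depthwise 1 x kh x kw, one filter per channel
    temporal = c_out * kt         # depthwise kt x 1 x 1, one filter per channel
    return pointwise + spatial + temporal


standard = conv3d_params(64, 64, 3, 3, 3)
decoupled = decoupled_params(64, 64, 3, 3, 3)
ratio = decoupled / standard
```

For 64 input and output channels and a 3x3x3 kernel this gives 110,592 weights for the standard layer against 4,864 for the factorized one, roughly 4.4% of the original. The figure is for a single layer under the stated assumption and is not meant to reproduce the paper's reported model sizes.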
- Conference Article
4
- 10.1109/vcip47243.2019.8965878
- Dec 1, 2019
Convolutional Neural Networks (CNNs) are powerful at learning spatial information from static images, but they appear to lose this strength for action recognition in videos because they neglect long-term motion information. Traditional 3D convolution has high computational complexity, and the Global Average Pooling (GAP) used at the end of the network can also lead to unwanted content loss or distortion. To address these problems, we propose a novel action recognition algorithm that effectively fuses 2D and Pseudo-3D CNNs to learn spatio-temporal features of video. First, we use a Pseudo-3D CNN with the proposed multi-level pooling module to learn spatio-temporal features. Second, the features output by the multi-level pooling module are passed through our proposed processing module to make full use of the rich features. Third, a 2D CNN fed with motion vectors is designed to extract motion patterns, which can be regarded as a supplement to the Pseudo-3D CNN that makes up for the information lost by RGB images. Fourth, a dependency-based fusion method is proposed to fuse the multi-stream features. Finally, the effectiveness of our proposed action recognition algorithm is demonstrated on the public UCF101 and HMDB51 datasets.
- Book Chapter
4
- 10.1007/978-981-15-1084-7_51
- Jan 1, 2020
With the advent of growing digital technology, large amount of video data is being generated, making video analytics a promising technology. Human activity recognition in videos is currently receiving increased attention and activity recognition systems are a large field of research and development with a focus on advanced machine learning algorithms, innovations in the field of hardware architecture, and on decreasing the costs of monitoring while increasing safety (Guo and Lai in Pattern Recognit 47:3343–3361, 2014, [1]). The existing system for action recognition involves using Convolutional Neural Networks (CNN). Videos are taken as a sequence of frames and frame-level CNN sequence features generated are fed to Long Short-Term Memory (LSTM) model for video recognition. However, the abovementioned methodology takes frame-level CNN sequence features as input for LSTM, which may fail to capture the rich motion information from adjacent frames or multiple clips. It is important to consider adjacent frames that allow for salient features, instead of mapping an entire frame into a static representation. Thereby, to mitigate this drawback, a new methodology is proposed wherein initially, saliency-aware methods are applied to generate saliency-aware videos. Then, an end-to-end pipeline is designed by integrating 3D CNN with LSTM, followed by a time series pooling layer and a softmax layer to predict the activities in video.
- Book Chapter
2
- 10.1007/978-3-319-77383-4_19
- Jan 1, 2018
Current research on human action recognition in videos has mainly focused on distinguishing between different types of videos, that is, coarse recognition. However, these methods may fail to recognize the specific actions of one object of interest, especially if the video contains multiple moving objects performing different actions. In this paper, we propose a novel method for recognizing the actions of a specific player in combat sports video. Object tracking with body segmentation is used to generate sub-frame sequences. Action recognition is achieved by training a new three-stream Convolutional Neural Network (CNN) model, where the network inputs are the horizontal components of optical flow, a single sub-frame, and the vertical components of optical flow, respectively. Network fusion is applied at both the convolutional and softmax layers. Extensive experiments on real broadcast combat sports videos are provided to show the advantages and effectiveness of the proposed method.
- Conference Article
33
- 10.1109/cvprw50498.2020.00193
- Jun 1, 2020
Action recognition in still images is closely related to various other computer vision tasks such as pose estimation, object recognition, image retrieval, video action recognition and frame tagging in videos. This problem focuses on recognizing a person's action or behavior using a single frame. Action recognition in videos is a relatively well-established area of research in which spatio-temporal features are used; these features are not available for still images, which makes the problem more challenging. In the present work only actions that involve objects are considered. A complex action is broken down into components based on semantics. The importance of each of these components in action recognition is systematically studied.
- Research Article
3
- 10.1007/s11042-020-09137-5
- Jun 20, 2020
- Multimedia Tools and Applications
With the emergence of a large number of video resources, video action recognition is attracting much attention. Recently, recognizing the outstanding performance of three-dimensional (3D) convolutional neural networks (CNNs), many works have begun to apply them to action recognition and have obtained satisfactory results. However, little attention has been paid to reducing the model size and computation cost of 3D CNNs. In this paper, we first propose a novel 3D convolution called the Xwise Separable Convolution, and then construct an original 3D CNN called the XwiseNet. Our work aims to make 3D CNNs lightweight without reducing their recognition accuracy. Our key idea is an extreme decoupling of the 3D convolution in the channel, spatial and temporal dimensions. Experiments have verified that the XwiseNet outperforms 3D-ResNet-50 on the Mini-Kinetics benchmark with only 6% of the training parameters and 12% of the computation cost.
- Research Article
55
- 10.1186/s12887-018-1127-6
- May 15, 2018
- BMC Pediatrics
Background: About three-quarters of all neonatal deaths occur during the first week of life, with over half of these occurring within the first 24 h after birth. The first minutes after birth are critical to reducing neonatal mortality. Successful neonatal resuscitation (NR) has the potential to prevent these perinatal mortalities related to birth asphyxia. This study described the practice of NR and outcomes of newborns with birth asphyxia in a busy referral hospital. Methods: Direct observations of 138 NRs by 28 healthcare providers (HCPs) were conducted using a predetermined checklist adapted from the national pediatric resuscitation protocol. Descriptive statistics were computed and chi-square tests were used to test associations between the newborn outcome at 1 h and the NR processes for the observed newborns. Logistic regression models assessed the relationship between the survival status at 1 h versus the NR processes and newborn characteristics. Results: Nurses performed 72.5% of the NRs. A warm environment was maintained in 71% of the resuscitations. Airway was checked for almost all newborns (98%) who did not initiate spontaneous breathing after stimulation. However, only 40% of newborns were correctly cared for in case of meconium presence in airway. Bag and mask ventilation (BMV) was initiated in 100% of newborns who did not respond to stimulation and airway maintenance. About 86.2% of resuscitated newborns survived after 1 h. Removing wet cloth (P = 0.035, OR = 2.90, CI = 1.08–7.76), keeping baby warm (P = 0.018, OR = 3.30, CI = 1.22–8.88), meconium in airway (P = 0.042, OR = 0.34, CI = 0.12–0.96) and gestation age (P = 0.007, OR = 1.38, CI = 1.10–1.75) were associated with newborn outcome at 1 h. Conclusions: Mentorship and regular cost-effective NR trainings with focus on maintaining the warm chain during NR, airway maintenance in meconium presence, BMV and care for premature babies are needed for HCPs providing NR.
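The odds ratios and confidence intervals in the Results can be sanity-checked with a little arithmetic. The sketch below assumes the intervals are Wald-type 95% intervals from the logistic regression, i.e. symmetric around log(OR) on the log scale; the abstract does not state this, so it is an assumption. Under that assumption the geometric mean of the interval bounds should land close to the reported OR, and the standard error of log(OR) can be recovered from the interval width. The helper names are illustrative.

```python
import math

# (factor, reported OR, CI lower, CI upper), copied from the abstract.
REPORTED = [
    ("removing wet cloth", 2.90, 1.08, 7.76),
    ("keeping baby warm", 3.30, 1.22, 8.88),
    ("meconium in airway", 0.34, 0.12, 0.96),
    ("gestation age", 1.38, 1.10, 1.75),
]


def ci_geometric_midpoint(ci_low: float, ci_high: float) -> float:
    """Centre of the interval on the log scale, mapped back to the OR scale."""
    return math.sqrt(ci_low * ci_high)


def implied_log_or_se(ci_low: float, ci_high: float, z: float = 1.96) -> float:
    """Standard error of log(OR) implied by a Wald interval at the given z."""
    return (math.log(ci_high) - math.log(ci_low)) / (2.0 * z)


for name, odds_ratio, low, high in REPORTED:
    mid = ci_geometric_midpoint(low, high)
    se = implied_log_or_se(low, high)
    print(f"{name}: reported OR {odds_ratio:.2f}, midpoint {mid:.2f}, SE(log OR) {se:.2f}")
```

All four midpoints fall within about 0.01 of the reported odds ratios, which is consistent with the log-symmetric assumption. An OR below 1 (meconium in airway) indicates lower odds of the modelled outcome at 1 h, while ORs above 1 indicate higher odds.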
- Book Chapter
- 10.1007/978-981-16-1092-9_3
- Jan 1, 2021
Action recognition in video sequences is an active research problem in Computer Vision. However, no significant efforts have been made to recognize actions in hazy videos. This paper proposes a novel unified model for action recognition in hazy video that efficiently combines a Convolutional Neural Network (CNN), which first produces the dehazed video and then extracts spatial features from each frame, with a deep bidirectional LSTM (DB-LSTM) network, which extracts the temporal features of the action. First, each frame of the hazy video is fed into the AOD-Net (All-in-One Dehazing Network) model to obtain a clear representation of the frames. Next, spatial features are extracted from every sampled dehazed frame (produced by the AOD-Net model) using a pre-trained VGG-16 architecture, which helps reduce redundancy and complexity. Finally, the temporal information across the frames is learnt using a DB-LSTM network, where multiple LSTM layers are stacked together in both the forward and backward passes of the network. The proposed unified model is the first attempt to recognize human action in hazy videos. Experimental results on a synthetic hazy video dataset show state-of-the-art performance in recognizing actions. Keywords: CNN, Bidirectional LSTM, Haze removal, Human action recognition, AODNet
- Conference Article
4
- 10.1109/icccs55155.2022.9846526
- Apr 22, 2022
Action recognition aims to automatically detect and classify human actions in videos; the difficulty lies in modeling the temporal relationships between frames in a sequence. The widely used 2D convolutional neural network (CNN) is not well suited to this task because it lacks temporal modeling ability. In this paper, a novel 2D CNN with an inter-frame information extraction module based on a bilinear operation is proposed to deal with this problem. The model greatly improves the temporal modeling ability of the 2D CNN while introducing only a small amount of storage and computation, thanks to a parameter decomposition method. In addition, it has a flexible form that makes it easy to trade off performance against complexity. Finally, the effectiveness of the new network is validated on two kinds of benchmarks, temporal-related (Something-Something v1) and scene-related (mini-kinetics), with top-1 accuracies of 44.5% and 67.8% respectively, which match or exceed the performance of existing methods of similar model complexity.
- Research Article
17
- 10.3390/electronics9010147
- Jan 12, 2020
- Electronics
Action recognition is an active research field that aims to recognize human actions and intentions from a series of observations of human behavior and the environment. Unlike image-based action recognition mainly using a two-dimensional (2D) convolutional neural network (CNN), one of the difficulties in video-based action recognition is that video action behavior should be able to characterize both short-term small movements and long-term temporal appearance information. Previous methods aim at analyzing video action behavior only using a basic framework of 3D CNN. However, these approaches have a limitation on analyzing fast action movements or abruptly appearing objects because of the limited coverage of convolutional filter. In this paper, we propose the aggregation of squeeze-and-excitation (SE) and self-attention (SA) modules with 3D CNN to analyze both short and long-term temporal action behavior efficiently. We successfully implemented SE and SA modules to present a novel approach to video action recognition that builds upon the current state-of-the-art methods and demonstrates better performance with UCF-101 and HMDB51 datasets. For example, we get accuracies of 92.5% (16f-clip) and 95.6% (64f-clip) with the UCF-101 dataset, and 68.1% (16f-clip) and 74.1% (64f-clip) with HMDB51 for the ResNext-101 architecture in a 3D CNN.
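For readers unfamiliar with the squeeze-and-excitation (SE) module mentioned above, the following is a minimal sketch of the general SE mechanism in plain Python: each channel is averaged over all positions (squeeze), the channel averages pass through a small bottleneck with ReLU and sigmoid (excitation), and each channel is rescaled by its gate. It assumes every channel of a 3D feature map has been flattened to a list of values, omits biases, and uses made-up weights; it is not the architecture from the paper.

```python
import math
from typing import List


def squeeze_excite(features: List[List[float]],
                   w1: List[List[float]],
                   w2: List[List[float]]) -> List[List[float]]:
    """Rescale each channel of `features` by a learned gate in (0, 1).

    features: one list of values per channel (T*H*W positions flattened).
    w1: bottleneck weights, shape [reduced][channels].
    w2: expansion weights, shape [channels][reduced].
    """
    # Squeeze: global average over all positions of each channel.
    squeezed = [sum(channel) / len(channel) for channel in features]
    # Excitation, step 1: fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: fully connected layer followed by sigmoid.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[gate * value for value in channel]
            for gate, channel in zip(gates, features)]


# Two channels with two positions each; weights are arbitrary illustrative values.
feature_map = [[1.0, 3.0], [2.0, 2.0]]
bottleneck = [[0.5, 0.5]]       # 2 channels -> 1 hidden unit
expansion = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channel gates
rescaled = squeeze_excite(feature_map, bottleneck, expansion)
```

With these weights the first channel receives a gate of sigmoid(2) and the second a gate of sigmoid(-2), so the module amplifies one channel relative to the other based only on global channel statistics.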
- Book Chapter
1
- 10.1007/978-3-030-39431-8_23
- Jan 1, 2020
Deep Convolutional Neural Networks (CNNs) have achieved great success in object recognition. However, they have difficulty capturing long-range temporal information, which plays an important role in action recognition in videos. To overcome this issue, a two-stream architecture with CNNs based on spatial and temporal segments has been widely used recently. However, the relationship among the segments has not been sufficiently investigated. In this paper, we propose to combine multiple segments through a fully connected layer in a deep CNN model covering the whole action video. Moreover, the four streams (i.e., RGB, RGB differences, optical flow, and warped optical flow) are carefully integrated with a linear combination, and the weights are optimized on the validation datasets. We evaluate the recognition accuracy of the proposed method on two benchmark datasets, UCF101 and HMDB51. The extensive experiments give encouraging results for the proposed method. Specifically, it clearly improves the accuracy of action recognition in videos (e.g., compared with the baseline, the accuracy is improved from 94.20% to 97.30% on UCF101 and from 69.40% to 77.99% on HMDB51). Furthermore, the proposed method achieves accuracy competitive with the state-of-the-art method based on 3D convolution, but with far fewer parameters.
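The abstract says the four streams are combined linearly with weights optimized on validation data, but not how the optimization is done. One simple possibility, shown below purely as an assumption, is an exhaustive grid search over candidate weights that keeps the combination with the best validation accuracy. The scores, labels and grid values are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

# Layout: scores[stream][sample][class]
Scores = List[List[List[float]]]


def fused_accuracy(weights: Sequence[float], scores: Scores, labels: List[int]) -> float:
    """Accuracy of the linearly fused prediction on a labelled validation set."""
    correct = 0
    for n, label in enumerate(labels):
        num_classes = len(scores[0][n])
        fused = [sum(w * scores[s][n][c] for s, w in enumerate(weights))
                 for c in range(num_classes)]
        if max(range(num_classes), key=fused.__getitem__) == label:
            correct += 1
    return correct / len(labels)


def grid_search_weights(scores: Scores, labels: List[int],
                        grid: Sequence[float] = (0.0, 0.5, 1.0)
                        ) -> Tuple[Tuple[float, ...], float]:
    """Try every weight combination on the grid and keep the most accurate one."""
    best_weights: Tuple[float, ...] = ()
    best_acc = -1.0
    for weights in product(grid, repeat=len(scores)):
        if not any(weights):
            continue  # skip the all-zero combination
        acc = fused_accuracy(weights, scores, labels)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


# Two validation clips, two classes, four streams (hypothetical softmax scores).
val_scores = [
    [[0.9, 0.1], [0.6, 0.4]],  # RGB: wrong on the second clip
    [[0.5, 0.5], [0.5, 0.5]],  # RGB differences: uninformative here
    [[0.4, 0.6], [0.1, 0.9]],  # optical flow: wrong on the first clip
    [[0.5, 0.5], [0.4, 0.6]],  # warped optical flow
]
val_labels = [0, 1]
best_weights, best_acc = grid_search_weights(val_scores, val_labels)
```

In this toy case neither the RGB nor the optical-flow stream is correct on both clips by itself, while a suitable combination classifies both correctly. The cost of the search grows exponentially with the number of streams, so a coarse grid is used.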
- Book Chapter
5
- 10.1007/11875581_99
- Jan 1, 2006
Combining audio and image processing for understanding video content has several benefits compared to using each modality on its own. For the task of context and activity recognition in video sequences, it is important to explore both data streams to gather relevant information. In this paper we describe a video context and activity recognition model. Our work extracts a range of audio and visual features, followed by feature reduction and information fusion. We show that combining audio with video-based decision making improves the quality of context and activity recognition in videos by 4% over audio data and 18% over image data.
- Research Article
22
- 10.3390/data5040104
- Nov 11, 2020
- Data
The two-stream convolutional neural network (CNN) has proven very successful for action recognition in videos. The main idea is to train two CNNs to learn spatial and temporal features separately and to combine their scores to obtain the final scores. In the literature, we observed that most methods use similar CNNs for the two streams. In this paper, we design a two-stream CNN architecture with different CNNs for the two streams to learn spatial and temporal features. Temporal Segment Networks (TSN) are applied in order to retrieve long-range temporal features and to differentiate similar types of sub-action in videos. Data augmentation techniques are employed to prevent over-fitting. Advanced cross-modal pre-training is discussed and introduced into the proposed architecture in order to enhance the accuracy of action recognition. The proposed two-stream model is evaluated on two challenging action recognition datasets: HMDB-51 and UCF-101. The results show that the proposed architecture achieves a significant performance increase and outperforms the existing methods.
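The segment-based sampling idea behind Temporal Segment Networks can be sketched in a few lines: the video is split into equal segments, one frame is taken from each, and the per-segment class scores are aggregated into a video-level score. The version below is a simplified, deterministic variant that takes the centre frame of each segment and averages the scores; the function names are illustrative and this is not the configuration used in the paper.

```python
from typing import List


def segment_center_indices(num_frames: int, num_segments: int) -> List[int]:
    """Index of the centre frame of each of `num_segments` equal segments."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    # Integer arithmetic for the midpoint of segment k: (k + 0.5) * num_frames / num_segments.
    return [((2 * k + 1) * num_frames) // (2 * num_segments)
            for k in range(num_segments)]


def average_consensus(segment_scores: List[List[float]]) -> List[float]:
    """Video-level class scores as the mean of the per-segment class scores."""
    count = len(segment_scores)
    return [sum(column) / count for column in zip(*segment_scores)]


# A 90-frame clip split into 3 segments covers the start, middle and end of the action.
indices = segment_center_indices(90, 3)
# Hypothetical two-class scores predicted from two sampled segments.
video_scores = average_consensus([[0.2, 0.8], [0.4, 0.6]])
```

Because the sampled frames are spread over the whole clip, the number of frames processed stays fixed regardless of video length while the prediction still reflects long-range temporal structure.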