A study of animal action segmentation algorithms across supervised, unsupervised, and semi-supervised learning paradigms.
Action segmentation of behavioral videos is the process of labeling each frame as belonging to one or more discrete classes, and is a crucial component of many studies that investigate animal behavior. A wide range of algorithms exist to automatically parse discrete animal behavior, encompassing supervised, unsupervised, and semi-supervised learning paradigms. These algorithms - which include tree-based models, deep neural networks, and graphical models - differ widely in their structure and assumptions about the data. Using four datasets spanning multiple species - fly, mouse, and human - we systematically study how the outputs of these various algorithms align with manually annotated behaviors of interest. Along the way, we introduce a semi-supervised action segmentation model that bridges the gap between supervised deep neural networks and unsupervised graphical models. We find that fully supervised temporal convolutional networks with the addition of temporal information in the observations perform the best on our supervised metrics across all datasets.
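As a point of reference for the kind of supervised model the abstract highlights, the sketch below shows a minimal frame-wise temporal convolutional network in PyTorch. The layer counts, channel widths, and dilation rates are illustrative assumptions, not the configuration used in the study.

```python
# Minimal frame-wise TCN sketch for behavioral action segmentation.
# All sizes below are illustrative assumptions, not the study's settings.
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.relu = nn.ReLU()
        self.out = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        # Residual connection keeps per-frame features aligned in time.
        return x + self.out(self.relu(self.conv(x)))

class FrameTCN(nn.Module):
    def __init__(self, in_dim, n_classes, channels=64, n_layers=8):
        super().__init__()
        self.inp = nn.Conv1d(in_dim, channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            [DilatedResidualBlock(channels, 2 ** i) for i in range(n_layers)]
        )
        self.cls = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, x):
        # x: (batch, time, features) -> per-frame class logits (batch, time, classes).
        h = self.inp(x.transpose(1, 2))
        for block in self.blocks:
            h = block(h)
        return self.cls(h).transpose(1, 2)

# Example: 2 videos, 500 frames each, 16 pose features, 4 behavior classes.
logits = FrameTCN(in_dim=16, n_classes=4)(torch.randn(2, 500, 16))
print(logits.shape)  # torch.Size([2, 500, 4])
```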
- Conference Article
5
- 10.1109/icsipa.2017.8120635
- Sep 1, 2017
Unsupervised segmentation of action segments in egocentric videos is a desirable feature in tasks such as activity recognition and content-based video retrieval. Reducing the search space into a finite set of action segments facilitates faster and less noisy matching. However, a substantial gap exists in machine understanding of natural temporal cuts during continuous human activity. This work reports on a novel gaze-based approach for segmenting action segments in videos captured using an egocentric camera. Gaze is used to locate the region-of-interest inside a frame. By tracking two simple motion-based parameters inside successive regions-of-interest, we discover a finite set of temporal cuts. We present several results using combinations of the two parameters on the BRISGAZE-ACTIONS dataset, which contains egocentric videos depicting several daily-living activities. The quality of the temporal cuts is further improved by implementing two entropy measures.
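A hedged sketch of the gaze-driven idea follows: temporal cuts are placed where simple motion statistics inside a gaze-centred region of interest change abruptly. The two parameters below (ROI motion energy and gaze displacement) and the spike threshold are illustrative stand-ins, not the paper's exact definitions.

```python
# Sketch: place temporal cuts where motion statistics inside a gaze-centred
# region of interest (ROI) change abruptly. The two parameters here are
# illustrative stand-ins, not the paper's exact definitions.
import numpy as np

def temporal_cuts(frames, gaze_xy, half=32, z_thresh=2.0):
    """frames: (T, H, W) grayscale video; gaze_xy: (T, 2) gaze points in pixels."""
    T, H, W = frames.shape
    motion = np.zeros(T)
    displacement = np.zeros(T)
    for t in range(1, T):
        x = int(np.clip(gaze_xy[t, 0], half, W - half))
        y = int(np.clip(gaze_xy[t, 1], half, H - half))
        roi_now = frames[t, y - half:y + half, x - half:x + half].astype(float)
        roi_prev = frames[t - 1, y - half:y + half, x - half:x + half].astype(float)
        motion[t] = np.abs(roi_now - roi_prev).mean()                  # parameter 1: ROI motion energy
        displacement[t] = np.linalg.norm(gaze_xy[t] - gaze_xy[t - 1])  # parameter 2: gaze displacement

    def spikes(p):
        # Frames where a parameter jumps well above its overall level.
        z = (p - p.mean()) / (p.std() + 1e-8)
        return set(np.where(z > z_thresh)[0])

    return sorted(spikes(motion) | spikes(displacement))
```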
- Research Article
10
- 10.1609/aaai.v38i6.28445
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
Action segmentation serves as a pivotal component in comprehending videos, encompassing the learning of a sequence of semantically consistent action units known as actoms. Conventional methodologies tend to require a significant amount of time for both training and learning phases. This paper introduces an innovative unsupervised framework for action segmentation in video, characterized by its fast learning capability and absence of mandatory training. The core idea involves splitting the video into distinct actoms, which are then merged together based on shared actions. The key challenge here is to prevent the inadvertent creation of singular actoms that attempt to represent multiple actions during the splitting phase. Additionally, it is crucial to avoid situations where actoms associated with the same action are incorrectly grouped into multiple clusters during the merging phase. In this paper, we present a method for calculating the similarity between adjacent frames under a subspace assumption. Then, we employ a local minimum searching procedure, which effectively splits the video into coherent actoms aligned with their semantic meaning and provides an action segmentation proposal. Subsequently, we calculate a spatio-temporal similarity between actoms, followed by a merging process that fuses actoms representing identical actions within the action segmentation proposals. Our approach is evaluated on four benchmark datasets, and the results demonstrate that our method achieves state-of-the-art performance. Our method also achieves the best balance between accuracy and learning time when compared to existing unsupervised techniques. Code is available at https://github.com/y66y/SaM.
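The split-then-merge procedure can be sketched compactly. In the sketch below, the paper's subspace-based similarity and spatio-temporal merging criterion are replaced by plain cosine similarity, so this is only an assumption-laden illustration of the idea rather than the authors' method.

```python
# Split a video at local minima of adjacent-frame similarity, then merge
# neighbouring actoms whose average features are close. Cosine similarity
# stands in for the paper's subspace-based similarity.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def split_into_actoms(feats):
    """feats: (T, D) per-frame features -> list of (start, end) actom spans."""
    sim = np.array([cosine(feats[t], feats[t + 1]) for t in range(len(feats) - 1)])
    # Local minima of adjacent-frame similarity are candidate boundaries.
    cuts = [t + 1 for t in range(1, len(sim) - 1)
            if sim[t] < sim[t - 1] and sim[t] < sim[t + 1]]
    bounds = [0] + cuts + [len(feats)]
    return list(zip(bounds[:-1], bounds[1:]))

def merge_actoms(feats, spans, thresh=0.9):
    merged = [spans[0]]
    for start, end in spans[1:]:
        prev_start, prev_end = merged[-1]
        if cosine(feats[prev_start:prev_end].mean(0), feats[start:end].mean(0)) > thresh:
            merged[-1] = (prev_start, end)   # same action: fuse the two actoms
        else:
            merged.append((start, end))
    return merged
```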
- Conference Article
11
- 10.1109/icip.2019.8803088
- Sep 1, 2019
Fine-grained temporal human action segmentation in untrimmed videos is receiving increasing attention due to its extensive applications in surveillance, robotics, and beyond. It is crucial for an action segmentation system to be robust to the temporal scale of different actions since in practical applications the duration of an action can vary from less than a second to tens of minutes. In this paper, we introduce a novel atrous temporal convolutional network (AT-Net), which explicitly generates multiscale video contextual representations by utilizing atrous temporal pyramid pooling (ATPP) and has an architecture of encoder-decoder fully convolutional network. In the decoding stage, AT-Net combines multiscale contextual features with low-level local features to generate high-quality action segmentation results. Experiments on the 50 Salads, GTEA and JIGSAWS benchmarks demonstrate that AT-Net achieves improvement over the state of the art.
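A minimal PyTorch sketch of the atrous temporal pyramid pooling idea follows: parallel dilated 1-D convolutions pool context at several temporal scales and are fused back together. The dilation rates and channel sizes are illustrative assumptions, not the AT-Net configuration.

```python
# Atrous temporal pyramid pooling (ATPP) sketch: parallel dilated 1-D
# convolutions capture context at several temporal scales before being fused.
import torch
import torch.nn as nn

class ATPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv1d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, time); each branch sees a different temporal scale.
        return self.project(torch.cat([branch(x) for branch in self.branches], dim=1))

feats = torch.randn(1, 128, 600)   # one video, 128-d frame features, 600 frames
print(ATPP(128, 64)(feats).shape)  # torch.Size([1, 64, 600])
```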
- Research Article
42
- 10.1016/j.neunet.2019.01.004
- Feb 1, 2019
- Neural Networks
Deep associative neural network for associative memory based on unsupervised representation learning
- Conference Article
6
- 10.1109/mlsp.2016.7738877
- Sep 1, 2016
Semi-supervised learning methods exploit both labeled and unlabeled data items in their training process, requiring only a small subset of labeled items. Although capable of drastically reducing the costs of the labeling process, such methods are directly dependent on the effectiveness of the distance measures used for building the kNN graph. On the other hand, unsupervised distance learning approaches aim at capturing and exploiting the dataset structure in order to compute a more effective distance measure, without the need of any labeled data. In this paper, we propose a combined approach which employs both unsupervised and semi-supervised learning paradigms. An unsupervised distance learning procedure is performed as a pre-processing step for improving the kNN graph effectiveness. Based on the more effective graph, a semi-supervised learning method is used for classification. The proposed Combined Unsupervised and Semi-Supervised Learning (CUSSL) approach is based on very recent methods. The Reciprocal kNN Distance is used for unsupervised distance learning tasks and the semi-supervised classification is performed by Particle Competition and Cooperation (PCC). Experimental results on six public datasets demonstrate that the combined approach achieves effective results, boosting the accuracy of classification tasks.
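A hedged sketch of the pipeline: a rank-based refinement in the spirit of the Reciprocal kNN Distance, followed by graph-based semi-supervised classification. Plain iterative label propagation is used below as a simple stand-in for Particle Competition and Cooperation (PCC), and all parameters are illustrative.

```python
# Sketch: rank-based distance refinement followed by graph-based
# semi-supervised classification. Label propagation replaces PCC here.
import numpy as np

def reciprocal_rank_distance(X):
    """Refine pairwise distances using symmetric neighbourhood ranks."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    ranks = np.argsort(np.argsort(d, axis=1), axis=1)  # rank of column j for query row i
    # Small only if i and j rank each other highly (reciprocal neighbours).
    return (ranks + ranks.T).astype(float)

def propagate_labels(X, y, k=10, alpha=0.9, iters=50):
    """y: class id for labeled points, -1 for unlabeled points."""
    d = reciprocal_rank_distance(X)
    n = len(X)
    classes = np.unique(y[y >= 0])
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d[i])[1:k + 1]:            # kNN graph on the refined distance
            W[i, j] = W[j, i] = 1.0
    S = W / W.sum(axis=1, keepdims=True).clip(min=1e-8)
    Y0 = np.zeros((n, len(classes)))
    for c_idx, c in enumerate(classes):
        Y0[y == c, c_idx] = 1.0
    Y = Y0.copy()
    for _ in range(iters):
        Y = alpha * S @ Y + (1 - alpha) * Y0           # iterative label propagation
    return classes[Y.argmax(axis=1)]
```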
- Research Article
7
- 10.3390/info11110518
- Nov 5, 2020
- Information
Document clustering groups documents according to certain semantic features. Topic models have a rich semantic structure and considerable potential for helping users understand document corpora. Unfortunately, this potential is stymied on text documents with overlapping categories, due to the models' purely unsupervised nature. To solve this problem, some semi-supervised models have been proposed for the English language. However, no such work is available for the low-resource language Urdu. Document clustering is therefore a challenging task in Urdu, which has its own morphology, syntax and semantics. In this study, we propose a semi-supervised framework for Urdu document clustering to deal with the challenges of Urdu morphology. The proposed model is a combination of pre-processing techniques, a seeded-LDA model and Gibbs sampling, which we name seeded-Urdu Latent Dirichlet Allocation (seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorization. Two conditions are considered for document clustering: one is a "dataset without overlapping", in which all classes have distinct natures; the other is a "dataset with overlapping", in which the categories overlap and the classes are connected to each other. The aim of this study is threefold: it first shows that unsupervised models (Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF) and K-means) give satisfactory results on the dataset without overlapping. Second, it shows that these unsupervised models do not perform well on the dataset with overlapping, because on this dataset they find topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, our proposed semi-supervised model, Seeded-ULDA, performs well on both datasets because it is a straightforward and effective way to instruct topic models to find topics of specific interest. It is shown in this paper that the semi-supervised model, Seeded-ULDA, provides significantly better results than unsupervised algorithms.
- Conference Article
12
- 10.1109/cvprw.2017.205
- Jul 1, 2017
We propose a new task of unsupervised action detection by action matching. Given two long videos, the objective is to temporally detect all pairs of matching video segments. A pair of video segments is matched if they share the same human action. The task is category independent (it does not matter what action is being performed), and no supervision is used to discover such video segments. Unsupervised action detection by action matching allows us to align videos in a meaningful manner. As such, it can be used to discover new action categories or as an action proposal technique within, say, an action detection pipeline. Moreover, it is a useful pre-processing step for generating video highlights, e.g., from sports videos. We present an effective and efficient method for unsupervised action detection. We use an unsupervised temporal encoding method and exploit the temporal consistency in human actions to obtain candidate action segments. We evaluate our method on this challenging task using three activity recognition benchmarks, namely, the MPII Cooking Activities dataset, the THUMOS15 action detection benchmark and a new dataset called the IKEA dataset. On the MPII Cooking dataset we detect action segments with a precision of 21.6% and recall of 11.7% over 946 long video pairs and over 5000 ground truth action segments. Similarly, on the THUMOS dataset we obtain 18.4% precision and 25.1% recall over 5094 ground truth action segment pairs.
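The matching step can be sketched with pooled per-frame features: sliding windows stand in for the paper's temporal-encoding-based candidate segments, and the similarity threshold is an assumption made here for illustration.

```python
# Sketch: report pairs of segments from two videos whose pooled features are
# highly similar. Sliding windows replace the paper's candidate segments.
import numpy as np

def candidate_segments(T, win=100, stride=50):
    return [(s, min(s + win, T)) for s in range(0, T - win + 1, stride)]

def match_segments(feats_a, feats_b, thresh=0.95, win=100, stride=50):
    """feats_*: (T, D) per-frame features. Returns matching (segment_a, segment_b) pairs."""
    def pooled(feats, seg):
        v = feats[seg[0]:seg[1]].mean(0)
        return v / (np.linalg.norm(v) + 1e-8)

    segs_a = candidate_segments(len(feats_a), win, stride)
    segs_b = candidate_segments(len(feats_b), win, stride)
    matches = []
    for sa in segs_a:
        for sb in segs_b:
            if pooled(feats_a, sa) @ pooled(feats_b, sb) > thresh:
                matches.append((sa, sb))   # the same action likely occurs in both segments
    return matches
```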
- Preprint Article
- 10.48550/arxiv.2103.15449
- Mar 29, 2021
- Lirias (KU Leuven)
Freezing of gait (FOG) is a common and debilitating gait impairment in Parkinson's disease. Further insight into this phenomenon is hampered by the difficulty to objectively assess FOG. To meet this clinical need, this paper proposes an automated motion-capture-based FOG assessment method driven by a novel deep neural network. Automated FOG assessment can be formulated as an action segmentation problem, where temporal models are tasked to recognize and temporally localize the FOG segments in untrimmed motion capture trials. This paper takes a closer look at the performance of state-of-the-art action segmentation models when tasked to automatically assess FOG. Furthermore, a novel deep neural network architecture is proposed that aims to better capture the spatial and temporal dependencies than the state-of-the-art baselines. The proposed network, termed multi-stage spatial-temporal graph convolutional network (MS-GCN), combines the spatial-temporal graph convolutional network (ST-GCN) and the multi-stage temporal convolutional network (MS-TCN). The ST-GCN captures the hierarchical spatial-temporal motion among the joints inherent to motion capture, while the multi-stage component reduces over-segmentation errors by refining the predictions over multiple stages. The experiments indicate that the proposed model outperforms four state-of-the-art baselines. Moreover, FOG outcomes derived from MS-GCN predictions had an excellent (r=0.93 [0.87, 0.97]) and moderately strong (r=0.75 [0.55, 0.87]) linear relationship with FOG outcomes derived from manual annotations. The proposed MS-GCN may provide an automated and objective alternative to labor-intensive clinician-based FOG assessment. Future work is now possible that aims to assess the generalization of MS-GCN to a larger and more varied verification cohort.
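The multi-stage refinement idea behind MS-GCN (inherited from MS-TCN) can be sketched as follows; the spatial ST-GCN front end is abstracted away as precomputed per-frame features, and all layer sizes are illustrative rather than the authors' architecture.

```python
# Multi-stage refinement sketch: a first stage predicts frame-wise class
# probabilities, and each later stage re-reads the previous stage's softmax
# through dilated temporal convolutions to reduce over-segmentation.
import torch
import torch.nn as nn

def stage(in_dim, n_classes, channels=64, layers=6):
    mods = [nn.Conv1d(in_dim, channels, 1)]
    for i in range(layers):
        mods += [nn.Conv1d(channels, channels, 3, padding=2 ** i, dilation=2 ** i),
                 nn.ReLU()]
    mods.append(nn.Conv1d(channels, n_classes, 1))
    return nn.Sequential(*mods)

class MultiStageSegmenter(nn.Module):
    def __init__(self, feat_dim, n_classes, n_stages=4):
        super().__init__()
        self.first = stage(feat_dim, n_classes)
        self.refiners = nn.ModuleList(
            [stage(n_classes, n_classes) for _ in range(n_stages - 1)]
        )

    def forward(self, x):
        # x: (batch, feat_dim, time); return logits from every stage for deep supervision.
        outputs = [self.first(x)]
        for refiner in self.refiners:
            outputs.append(refiner(torch.softmax(outputs[-1], dim=1)))
        return outputs

logits_per_stage = MultiStageSegmenter(feat_dim=75, n_classes=2)(torch.randn(1, 75, 1000))
print(len(logits_per_stage), logits_per_stage[-1].shape)  # 4 torch.Size([1, 2, 1000])
```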
- Research Article
- 10.3389/fvets.2025.1586438
- Nov 17, 2025
- Frontiers in Veterinary Science
Introduction: In swine disease surveillance, obtaining labeled data for supervised learning models can be challenging because many farms lack standardized diagnostic routines and consistent health monitoring systems. Unsupervised learning is particularly suitable in such scenarios because it does not require labeled data, allowing for detecting anomalies without predefined labels. This study evaluates the effectiveness of unsupervised machine learning models in detecting anomalies in productivity indicators in swine breeding herds. Methods: Anomalies, defined as deviations from expected patterns, were identified in indicators such as abortions per 1000 sows, prenatal losses, preweaning mortality, total born, liveborn, culled sows per 1000 sows, and dead sows per 1000 sows. Three unsupervised models - Isolation Forest, Autoencoder, and K-Nearest Neighbors (KNN) - were applied to data from two swine production systems. The herd-week was used as the unit of analysis, and anomaly scores above the 75th percentile were used to flag anomalous weeks. A permutation test assessed differences between anomalous and non-anomalous weeks. Performance was evaluated using F1-score, precision, and recall, with true anomalous weeks defined as those coinciding with reported health challenges, including porcine reproductive and respiratory syndrome (PRRS) and Seneca Valley virus outbreaks. A total of 8,044 weeks were analyzed. Results: The models identified 336 anomalous weeks and 1,008 non-anomalous weeks in Production System 1, and 1,675 anomalous weeks and 5,025 non-anomalous weeks in Production System 2. The results from the permutation test revealed significant differences in productivity indicators between anomalous and non-anomalous weeks, especially during PRRS outbreaks, with more subtle changes observed during Seneca Valley virus outbreaks. The models performed well in detecting the PRRSV anomaly, achieving perfect precision (100%) across all models for both production systems. For anomalies like SVV, the models showed lower performance compared to PRRSV. Discussion: These findings suggest that unsupervised machine learning models are promising tools for early disease detection in swine herds, as they can identify anomalies in productivity data that may signal health challenges.
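A minimal sketch of the anomaly-flagging scheme described above, using Isolation Forest as a stand-in for the three models compared in the study; the column names and the synthetic data are purely illustrative.

```python
# Fit an unsupervised model on weekly productivity indicators and flag
# herd-weeks whose anomaly score falls above the 75th percentile.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
weeks = pd.DataFrame({
    "abortions_per_1000_sows": rng.normal(5, 1, 500),
    "preweaning_mortality": rng.normal(12, 2, 500),
    "total_born": rng.normal(14, 0.5, 500),
    "dead_sows_per_1000_sows": rng.normal(2, 0.5, 500),
})

model = IsolationForest(random_state=0).fit(weeks)
# Higher score = more anomalous (sklearn's score_samples is inverted).
scores = -model.score_samples(weeks)
threshold = np.percentile(scores, 75)
weeks["anomalous"] = scores > threshold
print(weeks["anomalous"].sum(), "of", len(weeks), "herd-weeks flagged")
```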
- Research Article
37
- 10.1109/tpami.2021.3089127
- Oct 1, 2022
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Action segmentation is the task of predicting the actions for each frame of a video. As obtaining the full annotation of videos for action segmentation is expensive, weakly supervised approaches that can learn only from transcripts are appealing. In this paper, we propose a novel end-to-end approach for weakly supervised action segmentation based on a two-branch neural network. The two branches of our network predict two redundant but different representations for action segmentation and we propose a novel mutual consistency (MuCon) loss that enforces the consistency of the two redundant representations. Using the MuCon loss together with a loss for transcript prediction, our proposed approach achieves the accuracy of state-of-the-art approaches while being 14 times faster to train and 20 times faster during inference. The MuCon loss proves beneficial even in the fully supervised setting.
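A simplified sketch of the mutual-consistency idea: two redundant heads predict frame-wise probabilities from shared features, and a symmetric KL term penalises their disagreement. The actual MuCon loss aligns a frame-wise branch with a transcript/segment-based branch, so the code below is only a loose, assumption-laden stand-in.

```python
# Two redundant prediction heads plus a consistency loss between them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchHead(nn.Module):
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.branch_a = nn.Conv1d(feat_dim, n_classes, kernel_size=1)
        self.branch_b = nn.Conv1d(feat_dim, n_classes, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: (batch, feat_dim, time) -> two redundant logit tensors.
        return self.branch_a(feats), self.branch_b(feats)

def mutual_consistency_loss(logits_a, logits_b):
    log_p, log_q = F.log_softmax(logits_a, 1), F.log_softmax(logits_b, 1)
    p, q = log_p.exp(), log_q.exp()
    # Symmetric KL divergence between the two redundant predictions.
    return 0.5 * (F.kl_div(log_q, p, reduction="batchmean")
                  + F.kl_div(log_p, q, reduction="batchmean"))

feats = torch.randn(2, 128, 300)
la, lb = TwoBranchHead(128, 10)(feats)
print(mutual_consistency_loss(la, lb).item())
```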
- Conference Article
1
- 10.1109/acii.2009.5349506
- Sep 1, 2009
Social signal processing is an emerging field that is gaining more and more attention. As a key element in the field, visual perception of human motion is important for understanding human behavior in social intelligence. Motivated by the hypothesis of muscle synergies, we propose action synergies for automatically partitioning human motion into individual action segments in videos. Assuming the size of the human subject is reasonable and the background changes smoothly, the video sequence is represented by six latent variables, which we obtain using Gaussian process dynamical models (GPDM). For each variable, the third-order derivative and its local maxima are computed. Then, by finding the consistent local maxima across all variables, the video is partitioned into action segments. We demonstrate the usefulness of the algorithm for periodic motion patterns as well as non-periodic ones, using videos of various qualities. Results show that the proposed algorithm partitions videos into meaningful action segments.
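The boundary rule described above can be sketched directly: take the third-order temporal derivative of each latent trajectory, find its local maxima, and keep frames where the maxima agree across dimensions. The GPDM embedding is assumed to be precomputed, and the agreement and tolerance parameters are illustrative.

```python
# Boundaries where third-derivative local maxima coincide across latent dims.
import numpy as np

def third_derivative_maxima(latents, agree=4, tol=2):
    """latents: (T, D) latent trajectories (e.g. six GPDM dimensions)."""
    T, D = latents.shape
    votes = np.zeros(T, dtype=int)
    for d in range(D):
        jerk = np.gradient(np.gradient(np.gradient(latents[:, d])))
        maxima = [t for t in range(1, T - 1)
                  if jerk[t] > jerk[t - 1] and jerk[t] > jerk[t + 1]]
        for t in maxima:
            votes[max(0, t - tol):t + tol + 1] += 1   # allow small temporal misalignment
    # Boundaries are frames where at least `agree` latent dimensions vote together.
    return np.where(votes >= agree)[0]
```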
- Research Article
81
- 10.1111/tpj.15905
- Jul 27, 2022
- The Plant Journal
Advances in high-throughput omics technologies are leading plant biology research into the era of big data. Machine learning (ML) performs an important role in plant systems biology because of its excellent performance and wide application in the analysis of big data. However, to achieve ideal performance, supervised ML algorithms require large numbers of labeled samples as training data. In some cases, it is impossible or prohibitively expensive to obtain enough labeled training data; here, the paradigms of unsupervised learning (UL) and semi-supervised learning (SSL) play an indispensable role. In this review, we first introduce the basic concepts of ML techniques, as well as some representative UL and SSL algorithms, including clustering, dimensionality reduction, self-supervised learning (self-SL), positive-unlabeled (PU) learning and transfer learning. We then review recent advances and applications of UL and SSL paradigms in both plant systems biology and plant phenotyping research. Finally, we discuss the limitations and highlight the significance and challenges of UL and SSL strategies in plant systems biology.
- Conference Article
1
- 10.1145/3383455.3422565
- Oct 15, 2020
Forecasting with multivariate time series, which aims to predict future values given previous and current values of several univariate time series, has been studied for decades, with one example being ARIMA. Because it is difficult to measure the extent to which noise is mixed with informative signals within rapidly fluctuating financial time series data, designing a good predictive model is not a simple task. Recently, many researchers have become interested in recurrent neural networks and attention-based neural networks, applying them to financial forecasting. There have been many attempts to use these methods to capture long-term temporal dependencies and to select the more important features in multivariate time series data in order to make accurate predictions. In this paper, we propose a new prediction framework based on deep neural networks and trend filtering, which converts noisy time series data into a piecewise-linear form. We reveal that the predictive performance of deep temporal neural networks improves when the training data is temporally processed by trend filtering. To verify the effect of our framework, three deep temporal neural networks, state-of-the-art models for prediction on financial time series data, are used and compared with models that include trend filtering as an input feature. Extensive experiments on real-world multivariate time series data show that the proposed method is effective and significantly better than existing baseline methods.
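A hedged sketch of the preprocessing step: l1 trend filtering turns each noisy series into a piecewise-linear version before it is fed to a deep temporal network. This uses the standard Kim-Koh-Boyd formulation via cvxpy; the regularisation weight is illustrative and the paper's exact filtering variant may differ.

```python
# l1 trend filtering: minimize 0.5*||y - x||^2 + lam*||D2 x||_1, where D2 is
# the second-difference operator; the l1 penalty favours piecewise-linear fits.
import numpy as np
import cvxpy as cp

def l1_trend_filter(y, lam=10.0):
    """Return a piecewise-linear approximation of the 1-D series y."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # second-order difference operator
    x = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(y - x) + lam * cp.norm1(D @ x))
    cp.Problem(objective).solve()
    return x.value

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0, 1, 200))   # noisy toy price series
smoothed = l1_trend_filter(prices)          # piecewise-linear input for the model
```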
- Research Article
2
- 10.1088/1755-1315/1101/9/092003
- Nov 1, 2022
- IOP Conference Series: Earth and Environmental Science
The earthwork excavator, an all-terrain and high-efficiency piece of excavation equipment, is widely used on earthwork sites. It is therefore valuable to analyze the work of earthmoving excavators by means of machine vision. In this paper, an action segmentation method based on long video was applied to the analysis and recognition of the excavator's actions and compared with two other state-of-the-art action segmentation models using real construction site video. First, the sequence features of the excavator's work video obtained at the construction site were extracted through a 3D convolution method, and then two different networks were trained and tested on the extracted sequence features. The experimental results showed that the average frame accuracy of the MS-TCN and ASRF models in excavator action segmentation was 82.6490% and 86.1042%, respectively. However, for the recognition task under different working environments, the performance of the two models differs considerably. The experimental results show that the action segmentation model based on long video achieves good results in excavator action recognition during earthmoving operation, and it is helpful for analyzing the long-video working behavior sequences of excavators. This research contributes to the identification of critical elements that explain serial actions and to the development of a new application scenario for vision-based behavior segmentation networks. Additionally, the results of this study are helpful for automatically analyzing the working efficiency and monitoring the productivity of earthmoving excavators. Using this kind of data-driven decision-making can improve the work efficiency of earthmoving excavators and promote project progress.
- Supplementary Content
4
- 10.1108/ria-01-2023-0008
- Aug 21, 2023
- Robotic Intelligence and Automation
Purpose: Accurate segmentation of artificial assembly actions is the basis of autonomous industrial assembly robots. This paper aims to study a precise segmentation method for manual assembly actions. Design/methodology/approach: In this paper, a temporal-spatial-contact features segmentation system (TSCFSS) for manual assembly action recognition and segmentation is proposed. The system consists of three stages: spatial feature extraction, contact force feature extraction and action segmentation in the temporal dimension. In the spatial feature extraction stage, a vectors assembly graph (VAG) is proposed to precisely describe the motion state of the objects and the relative positions between objects in an RGB-D video frame. Graph networks are then used to extract spatial features from the VAG. In the contact feature extraction stage, a sliding window is used to cut contact force features between hands and tools/parts corresponding to the video frame. Finally, in the action segmentation stage, the spatial and contact features are concatenated as the input of temporal convolution networks for action recognition and segmentation. The experiments were conducted on a new manual assembly dataset containing RGB-D video and contact force. Findings: In the experiments, the TSCFSS is used to recognize 11 kinds of assembly actions in demonstrations and outperforms the other comparative action identification methods. Originality/value: A novel system for precisely segmenting manual assembly actions, which fuses temporal, spatial and contact force features, has been proposed. The VAG, a symbolic knowledge representation for describing assembly scene states, is proposed, making action segmentation more convenient. A dataset with RGB-D video and contact force is specifically tailored for researching manual assembly actions.