A multimodal link prediction approach for bridge maintenance via spatiotemporal feature fusion and cross-modal contrastive interaction
- Research Article
26
- 10.1016/j.eswa.2022.118089
- Jul 8, 2022
- Expert Systems with Applications
Human-inspired spatiotemporal feature extraction and fusion network for weather forecasting
- Research Article
- 10.1002/cpe.6709
- Nov 17, 2021
- Concurrency and Computation: Practice and Experience
Object detection in video is an emerging and challenging research topic in computer vision. Recently, deep learning-based approaches have achieved strong results in object detection. The major issues in detecting multiple objects in videos are exploiting low‐level visual concepts and temporal information, and exploring the correlation between the objects in the videos. To address these issues, the spatio‐temporal feature fusion based correlative binary relevance (STFF‐CBR) classifier is proposed to generate a rich vector representation and exploit label correlation for object detection in videos. In this article, first, spatio‐temporal feature fusion (STFF) is proposed to exploit the low‐level visual concepts and temporal information in the videos, which significantly improves object detection performance. Second, a correlative binary relevance (CBR) classification approach is proposed to exploit the dependencies between the labels in the video using a nearest neighbor based label dependency graph (LDG). Additionally, a feed forward neural network (FFNN) classifier is used to increase the classification accuracy of the CBR method. Experimental evaluation shows that the STFF‐CBR classifier achieves better performance on video object detection than state‐of‐the‐art methods.
- Research Article
2
- 10.3390/en18061514
- Mar 19, 2025
- Energies
Accurately identifying the fault type of an optical current transformer (optical CT) and evaluating the fault severity can provide strong support for the operation and maintenance of a direct current (DC) power system. Current research overlooks spatiotemporal features during fault identification, which limits identification accuracy, and treats fault identification as if it were an assessment of fault severity, which fails to provide effective information for actual operation and maintenance work. To address these problems, this paper proposes an optical CT fault severity assessment model based on scene generation and spatiotemporal feature fusion. Firstly, a CNN-Transformer model is constructed to mine the fault characteristics in spatial and temporal dimensions by feature fusion, achieving accurate identification of fault types. Secondly, an improved synthetic minority oversampling method is adopted to generate virtual operating scenes, and the operating range under different operating states of the optical CT is statistically obtained. Finally, based on Shapley Additive Explanations (SHAP), the importance of each optical CT feature is evaluated under different fault types. Based on the feature importance and the operating range under different running states, the severity of the fault is assessed by quantifying the difference between the fault state and the normal state of the optical CT under the identified fault type. This study validated the effectiveness of the proposed method using actual operational data from an optical CT at a converter station in Hebei Province, China.
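The oversampling step in the abstract above can be illustrated with a minimal sketch of plain SMOTE-style interpolation. The abstract does not describe what its "improved" variant changes, so this is the basic technique only; the function name `smote_samples`, the parameter `k`, and the toy data are all hypothetical.

```python
import math
import random


def smote_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Plain SMOTE-style interpolation, not the improved method from the abstract.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority-class samples
minority = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]]
new_points = smote_samples(minority, n_new=5)
```

Each synthetic point lies on the segment between a real sample and one of its neighbours, so the generated data stays inside the range already covered by the minority class.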
- Research Article
27
- 10.1109/tcyb.2021.3072311
- Nov 1, 2022
- IEEE Transactions on Cybernetics
Detecting small low-contrast targets in the airspace is an essential and challenging task. This article proposes a simple and effective data-driven support vector machine (SVM)-based spatiotemporal feature fusion detection method for small low-contrast targets. We design a novel pixel-level feature, called a spatiotemporal profile, to depict the discontinuity of each pixel in the spatial and temporal domains. The spatiotemporal profile is a local patch of the spatiotemporal feature maps, formed by channelwise concatenation of the spatial feature maps and temporal feature maps, which are generated by the morphological black-hat filter and a ghost-free dark-focusing frame difference method, respectively. Instead of the handcrafted feature fusion mechanisms in previous works, we use the labeled spatiotemporal profiles to train an SVM classifier to learn the spatiotemporal feature fusion mechanism automatically. To speed up detection for high-resolution videos, the serial SVM classification process on central processing units (CPUs) is reformulated as parallel convolution operations on graphics processing units (GPUs), which yields a speedup of more than 1000 times in our real experiments. Finally, blob analysis is applied to generate final detection results. Extensive experiments are conducted, and the results demonstrate that the proposed method performs better than 12 baseline methods for small low-contrast target detection. The field tests show that the parallel implementation of the proposed method can realize real-time detection at 15.3 FPS for videos at a resolution of 2048×1536 and that the maximum detection distance can reach 1 km for drones in sunny weather.
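A rough sketch of how such a spatiotemporal profile could be assembled for one pixel is given below, in plain Python on nested lists. The spatial map uses the standard black-hat definition (morphological closing minus the image). The temporal map uses a simple absolute frame difference as a stand-in, because the abstract does not specify the ghost-free dark-focusing method. All function names, the square structuring element, and the patch radius are assumptions for illustration.

```python
def _window(img, r, c, radius):
    """Values in the square neighbourhood around (r, c), clipped at the borders."""
    rows, cols = len(img), len(img[0])
    return [
        img[i][j]
        for i in range(max(0, r - radius), min(rows, r + radius + 1))
        for j in range(max(0, c - radius), min(cols, c + radius + 1))
    ]


def _morph(img, radius, op):
    """Apply op (max for dilation, min for erosion) over every neighbourhood."""
    return [[op(_window(img, r, c, radius)) for c in range(len(img[0]))]
            for r in range(len(img))]


def black_hat(img, radius=1):
    """Closing (dilation then erosion) minus the image: responds to small dark spots."""
    closed = _morph(_morph(img, radius, max), radius, min)
    return [[closed[r][c] - img[r][c] for c in range(len(img[0]))]
            for r in range(len(img))]


def frame_difference(prev, curr):
    """Plain absolute difference; a stand-in for the paper's temporal feature map."""
    return [[abs(a - b) for a, b in zip(pr, cr)] for pr, cr in zip(prev, curr)]


def profile(spatial, temporal, r, c, radius=1):
    """Concatenate the local patches of both maps into one feature vector."""
    return _window(spatial, r, c, radius) + _window(temporal, r, c, radius)


# Toy frames: uniform background with one dark pixel appearing in the current frame
prev = [[10] * 5 for _ in range(5)]
curr = [row[:] for row in prev]
curr[2][2] = 2

spatial = black_hat(curr)
temporal = frame_difference(prev, curr)
feature = profile(spatial, temporal, 2, 2)  # 9 spatial + 9 temporal values
```

Vectors like `feature`, labeled as target or background, are the kind of input on which a classifier such as an SVM would then be trained.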
- Research Article
10
- 10.1109/tcsvt.2023.3326695
- Aug 1, 2024
- IEEE Transactions on Circuits and Systems for Video Technology
Object tracking technology for aerial remote sensing images has developed significantly, but it remains a very challenging task. The difficulties of object tracking include the accumulation of long-term tracking errors, interference from similar objects, partial or full occlusion, scale change, etc., which can lead to tracking failure. In this paper, an aerial object tracker with ViT Spatio-Temporal Feature Fusion (STFF) for aerial remote sensing images is proposed, which can achieve accurate tracking of aviation objects. Firstly, we propose a spatial-temporal feature fusion strategy based on the temporal characteristics of object tracking. In this strategy, the object information of the previous frames is applied to enhance both the real-time responsiveness of the model and the performance of the tracker. Secondly, the dynamic change information of objects in space and time context is used for spatio-temporal feature information fusion, which can further enhance the appropriate correlation and promote the feature aggregation and information transmission of visual tracking. Finally, a dataset with real and virtual scenarios is collected and constructed to address training data requirements for aviation object tracking. According to our experiments, STFF can achieve accurate tracking of aerial objects and has achieved excellent performance on UAV123, DTB70 and our benchmarks.
- Research Article
- 10.3390/aerospace13030212
- Feb 27, 2026
- Aerospace
Aero engine surge diagnosis is a key technology in engine health management, and its diagnostic accuracy is of great significance for ensuring operational safety. Traditional threshold-based diagnostic methods are significantly affected by working conditions, which makes it difficult to achieve full working condition coverage. Moreover, due to issues such as varying feature thresholds across conditions, weak signal characteristics, and low identifiability, the diagnostic accuracy remains limited. To address these challenges, this paper proposes STFF-CANet (Spatio-Temporal Feature Fusion Cross-Attentional Network), a diagnosis model for aero engine surge based on spatio-temporal feature fusion. The model first employs a Convolutional Neural Network (CNN) to extract spatial features from the frequency domain of dynamic signals via Fast Fourier Transform (FFT). Simultaneously, a Bidirectional Long Short-Term Memory (BiLSTM) network is used to capture temporal features from signals optimized by Variational Mode Decomposition (VMD). A cross-attention mechanism is further introduced to achieve deep fusion of spatiotemporal features, thereby enhancing the capability to identify weak fault characteristics. In addition, a sliding window slicing method is used to expand the sample size of the small-sample aero engine surge fault data. This ensures both informational continuity between slices and statistical stability of features, effectively mitigating the difficulty of diagnosing early and weak surge characteristics under small-sample conditions. Experimental results demonstrate that the model achieves an F1-score, Recall, Precision, and Accuracy of 97.96%, 97.52%, 98.43%, and 99.01%, respectively, in surge fault classification. These outcomes meet the practical requirements for aero engine surge diagnosis and provide an effective solution for early fault warning in complex industrial equipment.
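The sliding window slicing mentioned in the abstract above can be sketched as follows. The window length and stride are illustrative assumptions, since the abstract does not give the values used, and `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(signal, window, stride):
    """Cut a 1-D signal into fixed-length slices; stride < window gives overlap."""
    return [signal[start:start + window]
            for start in range(0, len(signal) - window + 1, stride)]


# Toy signal of 10 samples, sliced with 50% overlap between neighbours
signal = list(range(10))
slices = sliding_windows(signal, window=4, stride=2)
```

With a stride smaller than the window, consecutive slices share `window - stride` samples, which is one way to keep continuity between slices while multiplying the number of training samples.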
- Research Article
3
- 10.1016/j.future.2024.107636
- May 1, 2025
- Future Generation Computer Systems
IoVST: An anomaly detection method for IoV based on spatiotemporal feature fusion
- Research Article
- 10.3390/pr13113586
- Nov 6, 2025
- Processes
Significant interest has been sparked in the monitoring and prediction of air quality due to the impact of air quality on human health. However, challenges arise from characterizing the complex spatial features and temporal features of monitored air quality data. In this paper, we develop an air quality forecasting model using spatio-temporal feature fusion over graphs. We use the location information of air quality monitoring stations to construct a directed graph adjacency matrix, which helps in extracting the spatial features of air quality data. A spatio-temporal feature extraction module is designed by explicitly involving the graph adjacency matrix to help characterize the coupled effects between spatial and temporal features of air quality data. Our proposed air quality prediction model was demonstrated using a real-world dataset collected over 35 air monitoring stations in Beijing. Numerical experiments demonstrate that our proposed model improves the air quality prediction over several existing models, e.g., 18.65 percent improvement in 24 h air quality prediction over the MAE metric and 15.91 percent improvement in 24 h prediction over the RMSE metric.
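As a hedged illustration of building a directed adjacency matrix from station locations, the sketch below links each station to its k nearest neighbours. The abstract does not state the construction rule, so the k-nearest-neighbour rule, the planar coordinates, and the name `knn_adjacency` are assumptions chosen because they produce a directed (asymmetric) graph.

```python
import math


def knn_adjacency(coords, k=2):
    """adj[i][j] = 1 if station j is among the k nearest stations to station i."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i, src in enumerate(coords):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(src, coords[j]))
        for j in order[:k]:
            adj[i][j] = 1  # directed edge i -> j
    return adj


# Hypothetical planar station coordinates (not real monitoring sites)
stations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
adj = knn_adjacency(stations, k=2)
```

Here the remote fourth station points to two of the clustered stations, but none of them points back, so the matrix is not symmetric. A matrix of this kind could then be passed to a graph-based feature extraction module.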
- Book Chapter
78
- 10.1007/978-3-030-60639-8_40
- Jan 1, 2020
Human action recognition is one of the challenging and active research fields. Recently, spatio-temporal graph convolutions for skeleton-based action recognition have attracted much attention. Several strategies, such as temporal downsampling, convolution striding, and temporal pooling, are used to handle long action sequences. Recurrent neural networks are typically used for the processing of sequential data. In this paper, we propose a deep architecture that combines spatio-temporal graph convolution and graph-temporal long short-term memory (GT-LSTM) for skeleton-based human action recognition. Initially, topology-learnable spatio-temporal graph convolutions are applied to learn the local spatio-temporal features of graph nodes and adaptively evolve graph topologies. Then, GT-LSTM successively performs the spatio-temporal feature fusion with the node sequence and the temporal dimension, for the final recognition. Experimental results on the NTU RGB+D and Kinetics-Skeleton datasets demonstrate that the proposed architecture can effectively perform graph node information aggregation, graph topology evolution, and spatio-temporal graph feature fusion.
- Research Article
59
- 10.1109/tgrs.2024.3470314
- Jan 1, 2024
- IEEE Transactions on Geoscience and Remote Sensing
The field of remote sensing change detection (RSCD) has seen significant advancements recently, focusing on the precise identification and analysis of temporal changes in remote sensing images. Existing deep learning-based RSCD methods primarily rely on concatenation or subtraction to integrate features of bi-temporal images and reconstruct change features through a feature pyramid network (FPN) decoding architecture. However, these methods face challenges related to inadequate spatio-temporal change representation and insufficient aggregation of multilevel semantic information, resulting in pseudo-changes and poor completeness of detected change objects. In this article, we propose an innovative RSCD framework via spatio-temporal feature fusion and guide aggregation (STFF-GA) to address the aforementioned challenges. The architecture of this network comprises two key components: the STFF module and the GA module. The STFF module is designed as a low-parameter and low-computation structure, effectively enhancing the representation of spatio-temporal change information through split, interaction, and fusion strategies. The GA module uses deep feature guidance (DFG) mapping as prior information to guide the aggregation of multilevel semantic information, thereby correcting the positional information of change objects and filtering out pseudo-changes and other noise interference. In addition, it utilizes convolution kernels of various scales to extract fine-grained features, facilitating the complete reconstruction of change objects. Extensive experiments conducted on three benchmark change detection datasets demonstrate that the proposed STFF-GA consistently outperforms other state-of-the-art (SOTA) detectors. The code is available at https://github.com/NjustHGWei/STFF-GA.
- Research Article
- 10.1109/tmi.2026.3674130
- Jan 1, 2026
- IEEE Transactions on Medical Imaging
Atrial fibrillation, characterized by high prevalence and poor prognosis, presents a significant global health burden. Accurate segmentation and measurement of left ventricular and left atrial appendage morphology and function are essential for reliable risk assessment. However, these tasks are hindered by ambiguous boundaries, complex cardiac motion, and sparse annotations. To address these challenges, we propose a Keypoint-Guided Medical Video Segmentation Model with Spatiotemporal Feature Fusion (KG-STS). First, we propose a shape-constrained point encoder that explicitly encodes boundary points to improve the representation of ambiguous boundaries. Next, we introduce a motion-aware alignment module that models cardiac motion by forming coherent motion information across frames. Building on these two modules, we develop a keypoint-guided spatiotemporal feature fusion module that integrates spatial boundary representations with temporal motion cues to enhance decoding features under sparse annotations, enabling temporally consistent segmentation and supporting morphological measurement. We evaluate the segmentation and measurement performance of our method on a self-constructed multi-view transesophageal echocardiography dataset and two publicly available transthoracic echocardiography datasets. The results demonstrate that KG-STS achieves superior temporal consistency in segmentation and higher accuracy in morphological measurements compared to competing methods.
- Research Article
- 10.1088/1361-6501/ae46c5
- Feb 25, 2026
- Measurement Science and Technology
Structured light measurement is widely used in weld seam recognition due to its high precision and robust performance. However, in the automated guided welding of narrow weld seams, robust detection cannot be achieved using traditional laser vision sensors due to complex environmental interference and the lack of distinct geometric contour information in the weld. To accurately extract narrow weld seam information and enhance image detection stability, we propose a semantic segmentation network for narrow weld seams based on spatio-temporal feature fusion. First, we utilize an enhanced laser vision sensor equipped with an auxiliary ambient light source to autonomously capture high-quality sequence images of narrow weld seams with a laser stripe at the engineering site. Second, we design the spatio-temporal feature fusion narrow weld seam network (STFNet), in which a topology encoder is introduced to extract the target's topological information. Subsequently, a spatial perception module and a temporal feature extraction module are proposed to capture the target's spatio-temporal information. The feature pyramid fusion module employs multi-level fusion to output precise weld detection results. Performance on our self-constructed NWSDataset demonstrates that our network effectively addresses real-time detection of narrow weld seams. It has been successfully deployed in submerged arc welding engineering applications, meeting practical welding requirements.
- Research Article
- 10.32604/csse.2023.040132
- Jan 1, 2023
- Computer Systems Science and Engineering
An action recognition network that combines multi-level spatiotemporal feature fusion with an attention mechanism is proposed to address single-scale spatiotemporal feature extraction, information redundancy, and insufficient extraction of frequency-domain information across channels in 3D convolutional neural networks. First, based on 3D CNN, this paper designs a new multilevel spatiotemporal feature fusion (MSF) structure embedded in the network model. Through multilevel spatiotemporal feature separation, splicing, and fusion, it fuses spatial receptive fields and short-, medium-, and long-term time series information at different scales while reducing network parameters. Second, a multi-frequency channel and spatiotemporal attention module (FSAM) is introduced to assign corresponding weights to the different frequency features and spatiotemporal features in the channels, reducing the information redundancy of the feature maps. Finally, we embed the proposed method into the R3D model, which replaces the 2D convolutional filters in the 2D ResNet with 3D convolutional filters, and conduct extensive experimental validation on the small and medium-sized dataset UCF101 and the large dataset Kinetics-400. The results show that our model increases recognition accuracy on both datasets. On UCF101 in particular, our model outperforms R3D with a maximum recognition accuracy improvement of 7.2% while using 34.2% fewer parameters. The MSF and FSAM are also migrated to another traditional 3D action recognition model, C3D, for application testing. The test results on UCF101 show that recognition accuracy improves by 8.9%, demonstrating the strong generalization ability and universality of the proposed method.
- Research Article
6
- 10.1016/j.displa.2023.102482
- Jun 22, 2023
- Displays
Visual saliency assistance mechanism based on visually impaired navigation systems
- Research Article
15
- 10.1016/j.enbuild.2024.114735
- Aug 30, 2024
- Energy & Buildings
Deep spatio-temporal feature fusion learning for multi-step building cooling load forecasting