Cross-Modal Alignment of Local and Global Features for Zero-Shot Chinese Character Recognition
Chinese character recognition (CCR) is a pivotal domain in computer vision due to its complexity and diverse applications; in particular, the extensive set of character categories makes it challenging to identify unseen characters. To address this zero-shot hurdle, we propose a CLIP-style model that independently extracts features from aligned Chinese character images and Ideographic Description Sequences (IDS), achieving cross-modal alignment. Our approach encompasses both local and global feature alignment. First, we introduce learnable discrete tokens that represent shared embeddings for the visual and textual modalities, capturing the local context of Chinese characters. We then encode each radical to extract local features, which are mapped to the shared discrete tokens via an attention mechanism, and encode the entire character to obtain global features. Training uses a contrastive loss to enforce cross-modal alignment. Experimental results confirm our method’s superiority over conventional approaches, demonstrating strong performance on zero-shot Chinese character recognition benchmarks.
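As a rough illustration of the contrastive training objective described in this abstract (a minimal sketch, not the authors' implementation; the function names, batch layout, and temperature value are assumptions), a symmetric CLIP-style loss over paired image and IDS embeddings can be written as:

```python
import numpy as np

def info_nce(img_emb, ids_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss between L2-normalized image
    embeddings and IDS text embeddings; matching pairs share a batch index."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = ids_emb / np.linalg.norm(ids_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) pairwise similarity matrix

    def xent(l):
        # cross-entropy with the positive pair on the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Correctly aligned image/IDS pairs yield a lower loss than mismatched pairings, which is what drives the two encoders toward a shared embedding space.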
- Research Article
13
- 10.1109/tgrs.2022.3171038
- Jan 1, 2022
- IEEE Transactions on Geoscience and Remote Sensing
Unsupervised domain adaptation (UDA) significantly reduces the gap between the source domain and the target domain in machine learning and computer vision tasks. Most UDA approaches are applied to images and videos, and only a few methods implement domain adaptation on 3-D computer vision problems. The existing UDA approaches operating on point clouds try to extract domain-invariant features in different domains for feature alignment. However, higher commonality brings less diversity and results in a loss of detailed information. In this article, we propose a novel dual-branch feature alignment network (DFAN) architecture for domain adaptation on point cloud visual tasks to better exploit the respective characteristics of local and global features. Our approach specializes in the extraction and alignment of global and local features, with different strategies in each branch so that they complement each other. We also introduce a hierarchical alignment strategy for local feature alignment and a distribution alignment strategy for global feature alignment. Experiments on the PointDA-10 and PointSegDA datasets show that our approach achieves state-of-the-art performance on UDA for point cloud classification and segmentation tasks. An ablation study demonstrates the effectiveness of the dual-branch design and the feature alignment strategies.
- Research Article
22
- 10.1016/j.eswa.2017.07.018
- Jul 13, 2017
- Expert Systems with Applications
Face alignment using a deep neural network with local feature learning and recurrent regression
- Conference Article
6
- 10.1109/icinfa.2017.8078942
- Jul 1, 2017
This study analyzes the effectiveness of global (the whole face) and local (regions of the eyes, nose, and mouth) features for face recognition. Features describing human faces are encoded as local ternary patterns. A two-class support vector machine is used as the supervised learning algorithm for training recognition models. In the recognition process, the recognition models based on the global features and the local features are cascaded. To identify a face image, the local features are used iteratively to filter out candidates that cannot be clearly identified by the global features, until the candidate with the highest probability remains. The experimental results show that cascading the recognition models of global and local features yields better classification accuracy than a single classification process.
- Research Article
- 10.1088/1361-6501/ae0e8a
- Nov 4, 2025
- Measurement Science and Technology
In the field of fault diagnosis, transfer learning methods have achieved remarkable progress in rolling bearing fault diagnosis. However, existing approaches still face challenges in feature extraction and in feature alignment between the source and target domains: feature networks struggle to effectively capture both local and global features, and the distribution discrepancies between the two domains further degrade the performance of transfer models. To address these issues, this paper proposes a multilayer domain-adaptive fault diagnosis method that integrates global and local feature representations (MLDA-GLFD). Specifically, an enhanced local feature extraction module (SLFE) is designed by combining depthwise separable convolution and partial convolution to capture local feature information more precisely. In addition, a global feature extraction module (GFAB) is constructed, which incorporates a multi-head self-attention mechanism (MHSA), a global context block (GCblock), and a pyramid pooling module (PPM) to jointly strengthen global feature extraction. To further achieve feature distribution alignment between the source and target domains, a dynamic convolution (DConv) module with a hierarchical domain alignment mechanism is designed to adaptively adjust the receptive field of the convolutional kernels. Moreover, a combination of Maximum Mean Discrepancy (MMD) and Multiple Kernel MMD (MKMMD) is employed to accurately align inter-domain features, thereby enhancing the model’s transferability to the target domain. Experimental results on two rolling bearing datasets, CWRU and SDUST, demonstrate that the proposed MLDA-GLFD method achieves average accuracies of 96% and 94%, respectively, significantly outperforming other comparative methods.
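The MMD/MKMMD alignment terms this abstract relies on can be sketched as follows; this is a minimal biased multi-kernel estimate with assumed RBF bandwidths, not the paper's implementation:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mk_mmd2(X, Y, gammas=(0.5, 1.0, 2.0)):
    """Biased multi-kernel MMD^2 estimate between source samples X and
    target samples Y, averaged over several RBF bandwidths (MKMMD-style)."""
    total = 0.0
    for g in gammas:
        total += (rbf_kernel(X, X, g).mean()
                  + rbf_kernel(Y, Y, g).mean()
                  - 2.0 * rbf_kernel(X, Y, g).mean())
    return total / len(gammas)
```

In a transfer-learning setup this quantity is added to the task loss, so minimizing it pulls the source and target feature distributions together.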
- Conference Article
178
- 10.1109/cvpr.2005.433
- Jan 1, 2005
Object recognition is a central problem in computer vision research. Most object recognition systems have taken one of two approaches, using either global or local features exclusively. This may be in part due to the difficulty of combining a single global feature vector with a set of local features in a suitable manner. In this paper, we show that combining local and global features is beneficial in an application where rough segmentations of objects are available. We present a method for classification with local features using non-parametric density estimation. Subsequently, we present two methods for combining local and global features. The first uses a stacking ensemble technique, and the second uses a hierarchical classification system. Results show the superior performance of these combined methods over the component classifiers, with a reduction of over 20% in the error rate on a challenging marine science application.
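The classification-with-local-features step this abstract describes can be sketched with a Parzen-window (Gaussian kernel density) estimate; the window shape, bandwidth, and data layout here are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def parzen_log_likelihood(query_feats, class_feats, h=1.0):
    """Sum of log Parzen-window (Gaussian KDE) densities of each local
    query feature under the stored training features of one class."""
    ll = 0.0
    for q in query_feats:
        d2 = ((class_feats - q) ** 2).sum(axis=1)
        ll += np.log(np.exp(-d2 / (2.0 * h * h)).mean() + 1e-12)
    return ll

def classify_local(query_feats, class_bank):
    """class_bank maps label -> (N_c, D) array of local training features.
    Pick the class maximizing the summed log-likelihood of the query's
    local features under that class's density estimate."""
    return max(class_bank, key=lambda c: parzen_log_likelihood(query_feats, class_bank[c]))
```

The per-class likelihoods produced here could then feed a stacking ensemble or a hierarchical classifier alongside a global feature vector, in the spirit of the two combination methods the paper presents.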
- Research Article
12
- 10.1016/j.artmed.2022.102341
- Jul 2, 2022
- Artificial Intelligence in Medicine
Global and local attentional feature alignment for domain adaptive nuclei detection in histopathology images
- Research Article
26
- 10.1109/tip.2020.2965306
- Jan 1, 2020
- IEEE Transactions on Image Processing
Practically, it is more feasible to collect compact visual features rather than video streams from hundreds of thousands of cameras into the cloud for big-data analysis and retrieval. The problem then becomes which kinds of features should be extracted, compressed, and transmitted to meet the requirements of various visual tasks. Recently, many studies have indicated that the activations from the convolutional layers in convolutional neural networks (CNNs) can be treated as local deep features describing particular details inside an image region, which are then aggregated (e.g., using Fisher Vectors) into a powerful global descriptor. A combination of local and global features can satisfy those various needs effectively. It has also been validated that, if only the local deep features are coded and transmitted to the cloud while the global features are recovered from the decoded local features, the aggregated global features will be lossy and consequently degrade the overall performance. Therefore, this paper proposes a joint coding framework for local and global deep features (DFJC) extracted from videos. In this framework, we introduce a coding scheme for real-valued local and global deep features with intra-frame lossy coding and inter-frame reference coding. A theoretical analysis is performed to understand how the number of inliers varies with the number of local features. Moreover, the inter-feature correlations are exploited in our framework: local feature coding can be accelerated by making use of the frame types determined with the global features, while the lossy global features aggregated from the decoded local features can be used as a reference for global feature coding. Extensive experimental results under three metrics show that our DFJC framework can significantly reduce the bitrate of local and global deep features from videos while maintaining retrieval performance.
- Peer Review Report
1
- 10.7554/elife.78635.sa2
- Oct 10, 2022
Author response: A connectomics-based taxonomy of mammals
- Research Article
20
- 10.1007/s00371-021-02136-z
- May 5, 2021
- The Visual Computer
Facial expressions can be represented largely by the dynamic variations of the important facial parts, i.e., the eyebrows, eyes, nose, and mouth. The features of these parts are regarded as local features. However, global facial information is also useful for recognition because it is a necessary complement to the local features. In this paper, a spatio-temporal integrated model that jointly learns local and global features is proposed for video expression recognition. First, to capture the action of key facial units, a spatio-temporal attention part-gradient-based hierarchical bidirectional recurrent neural network (spatio-temporal attention PGHRNN) is constructed. It captures the dynamic variations of the gradients around facial landmark points, and a new kind of spatial attention mechanism is introduced to adaptively recalibrate the features of the various facial parts. Second, to complement the local features extracted by the spatio-temporal attention PGHRNN, a 50-layer squeeze-and-excitation residual network with a long short-term memory network (SE-ResNet-50-LSTM) is used as a global feature extractor and classifier. Finally, to integrate the local and global features and improve the performance of facial expression recognition, a joint adaptive fine-tuning method (JAFTM) is proposed to combine the two networks, adaptively adjusting the network weights. Extensive experiments demonstrate that our proposed model achieves a superior recognition accuracy of 98.95% on CK+ for 7-class facial expressions and 85.40% on the MMI database, outperforming other state-of-the-art methods.
- Conference Article
1
- 10.1109/snpd.2007.188
- Jul 1, 2007
We present the extraction of global and local features based on digital microbial image analysis, and the fusion of those features for microbe recognition. The global features are extracted with invariant moments and a co-occurrence matrix; the invariant-moment computation is simplified by computing the geometric and central moments from the edge of the microbe instead of its interior region. Curvature change detection via the wavelet transform characterizes the local features. Min-max normalization is applied, and after fusing the normalized global and local match degrees, the recognition result is the class containing the template image with the largest fused match degree. The experimental results show that fusing local and global features is effective for microbe image analysis and recognition.
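The min-max normalization and match-degree fusion step described above can be sketched as below; the equal default fusion weight is an assumption for illustration, not a value taken from the paper:

```python
def min_max(scores):
    """Rescale a list of match degrees to [0, 1]; assumes max > min."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse_and_classify(global_scores, local_scores, w=0.5):
    """Fuse normalized global and local match degrees per template image
    and return the index of the template with the largest fused degree."""
    g = min_max(global_scores)
    l = min_max(local_scores)
    fused = [w * gi + (1.0 - w) * li for gi, li in zip(g, l)]
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, `fuse_and_classify([0.2, 0.9, 0.4], [0.3, 0.8, 0.1])` picks template index 1, the class whose template matches best under both feature types.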
- Research Article
2
- 10.4236/oalib.1106358
- Jan 1, 2020
- OALib
Chinese characters contain three levels of visual information: strokes, radicals, and structures. The stroke, as the basic unit of Chinese characters, is the foundation of their formation. Some research has shown that strokes can enhance semantic and orthographic representation during Chinese character acquisition, indicating that the processing of Chinese characters is based on feature processing. This is consistent with the feature-processing theory of letter and object recognition. Recent research on visual recognition shows that the recognition of letters and words is based on terminations and the other features (vertices and midsegments) used in object recognition. In this study, we used the delayed-segment technique in Chinese character recognition to examine the importance of terminations, vertices, and midsegments under different degrees of masking. Experiment 1 selected 43 high-frequency Chinese characters with left-right structures and asked the participants to name them, examining the importance of terminations, vertices, and midsegments defined on line units while varying the degree of feature masking. The main effect of masking degree was significant: masking 35% and 55% of the features made reaction times (RTs) significantly longer. The main effect of feature type was also significant: masking vertices and midsegments made RTs significantly longer, but terminations showed no feature effect in Experiment 1. Because the masked features in Experiment 1 were based on line units, the masking may have destroyed the integrity of the strokes in the Chinese characters, which may explain why termination effects were not evident. To verify this, Experiment 2 explored vertices, midsegments, and terminations based on stroke units.
The results show that the main effect of masking degree was significant, with RTs longer at the 35% and 55% masking degrees. The main effect of feature type was also significant: RTs for character recognition were significantly longer when the vertices, midsegments, and terminations were masked. This shows that vertices, midsegments, and terminations based on stroke units are also important in Chinese character recognition. These results suggest that the key features in Chinese character recognition are in line with the neuronal recycling hypothesis, according to which humans' ability to learn from culture depends on the reuse of preexisting brain circuits, so the key features in word recognition may derive from the key features used in object processing. The hook is considered a special termination in Chinese characters, and the results of Experiment 2 also emphasized its key role in character recognition.
- Research Article
7
- 10.1016/j.neucom.2022.04.093
- Apr 20, 2022
- Neurocomputing
Part-facial relational and modality-style attention networks for heterogeneous face recognition
- Conference Article
11
- 10.1109/ispacs.2007.4445969
- Jan 1, 2007
Palmprint recognition has developed rapidly as a biometric technology over the last decade. However, some typical problems arise when capturing palmprint images. First, the delta region in the center of the palm causes uneven lighting, and the brightness of palmprint images varies with hand pressure, stretching, and palm structure. Second, it is hard to align palmprint images precisely to the same position, especially when the subjects are required to spread their hand on the scanner surface, even for the same palm. Neither global nor local features alone can satisfy the need for high recognition accuracy. Therefore, we propose a novel method that fuses local and global features, extracted by non-negative matrix factorization with a sparseness constraint (NMFsc) and principal component analysis (PCA), respectively, to improve recognition performance. Experiments demonstrate the strong complementarity between local and global features for palmprint recognition.
- Conference Article
131
- 10.1109/cvprw.2017.161
- Jul 1, 2017
We investigate the problem of representing an entire video using CNN features for human action recognition. Currently, limited by GPU memory, we have not been able to feed a whole video into CNNs/RNNs for end-to-end learning. A common practice is to use sampled frames as inputs and video labels as supervision. One major problem with this popular approach is that the local samples may not contain the information indicated by the global labels. To deal with this problem, we propose to treat the deep networks trained on local inputs as local feature extractors. After extracting local features, we aggregate them into global features and train another mapping function on the same training data to map the global features to the global labels. We study a set of problems regarding this new type of local feature, such as how to aggregate them into global features. Experimental results on the HMDB51 and UCF101 datasets show that, for these new local features, a simple maximum pooling over the sparsely sampled features leads to significant performance improvement.
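The aggregation this abstract finds effective, element-wise maximum pooling over sparsely sampled per-frame local features, can be sketched as follows (the CNN feature extraction itself is omitted; shapes are assumptions):

```python
import numpy as np

def aggregate_max_pool(local_feats):
    """Aggregate a list of per-frame local CNN features, each of shape (D,),
    into a single global video descriptor of shape (D,) by taking the
    element-wise maximum across the sampled frames."""
    return np.max(np.stack(local_feats), axis=0)
```

The pooled descriptor is then what the second mapping function is trained on, so each dimension of the global feature records the strongest response of that feature across all sampled frames.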
- Research Article
6
- 10.1038/s41598-025-90440-2
- Feb 20, 2025
- Scientific Reports
Emotion, a fundamental mapping of human responses to external stimuli, has been extensively studied in human–computer interaction, particularly in areas such as intelligent cockpits and systems. However, accurately recognizing emotions from facial expressions remains a significant challenge due to lighting conditions, posture, and micro-expressions. Emotion recognition using global or local facial features is a key research direction; however, relying solely on either often yields models with uneven attention across facial features, neglecting key variations critical for detecting emotional changes. This paper proposes a method for modeling and extracting key facial features by integrating global and local facial data. First, we construct a comprehensive image preprocessing model that includes super-resolution, lighting and shading, and texture-enhancement processing. This preprocessing step significantly enriches the expression of image features. Second, a global facial feature recognition model is developed using an encoder-decoder architecture, which effectively eliminates environmental noise and generates a comprehensive global feature dataset for facial analysis. Simultaneously, a Haar cascade classifier is employed to extract refined features from key facial regions, including the eyes, mouth, and overall face, yielding a corresponding local feature dataset. Finally, a two-branch convolutional neural network is designed to integrate both datasets, enhancing the model’s ability to recognize facial characteristics accurately: the global feature branch fully characterizes the global features of the face, while the local feature branch focuses on the local features. An adaptive fusion module integrates the global and local features, improving the model’s ability to differentiate subtle emotional changes.
To evaluate the accuracy and robustness of the model, we train and test it on the FER-2013 and JAFFE emotion datasets, achieving average accuracies of 80.59% and 97.61%, respectively. Compared with existing state-of-the-art models, our refined face-feature extraction and fusion model demonstrates superior performance in emotion recognition. The comparative analysis also shows that emotional features across different faces are similar. Building on psychological research, we categorize the dataset into three emotion classes: positive, neutral, and negative; the accuracy of emotion recognition is significantly improved under this new classification criterion. Additionally, a self-built dataset is used to further validate that this classification approach has important implications for practical applications.