Cross-Modal Alignment of Local and Global Features for Zero-Shot Chinese Character Recognition

  • Abstract
  • Literature Map
  • Similar Papers
Abstract

Chinese character recognition (CCR) is a pivotal domain in computer vision because of its complexity and diverse applications, and in particular because the extensive set of character categories makes identifying unseen characters challenging. To address this zero-shot hurdle, we propose a CLIP-style model that independently extracts features from aligned Chinese character images and Ideographic Description Sequences (IDS), achieving cross-modal alignment. Our approach combines local and global feature alignment. First, we introduce learnable discrete tokens that represent shared embeddings for the visual and textual modalities and capture the local context of Chinese characters. We then encode each radical to extract local features, which are mapped to the shared discrete tokens via attention mechanisms, and encode the entire character to obtain global features. Training uses a contrastive loss to drive cross-modal alignment. Experimental results confirm our method’s superiority over conventional approaches, demonstrating remarkable performance on zero-shot Chinese character recognition benchmarks.
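The training objective above follows the CLIP recipe: encode each modality independently, then pull matched image/IDS pairs together and push mismatched pairs apart with a symmetric contrastive loss. A minimal NumPy sketch of that loss (function and parameter names are illustrative, not the authors’ code):

```python
import numpy as np

def clip_contrastive_loss(img_emb, ids_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of aligned (image, IDS) pairs.

    img_emb, ids_emb: (N, D) arrays; row i of each matrix is one matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    ids = ids_emb / np.linalg.norm(ids_emb, axis=1, keepdims=True)
    logits = img @ ids.T / temperature  # (N, N); diagonal = positive pairs

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    n = logits.shape[0]
    diag = np.arange(n)
    loss_img_to_ids = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_ids_to_img = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_img_to_ids + loss_ids_to_img) / 2
```

Driving matched pairs’ similarity above all mismatched pairs in both directions is what enables zero-shot recognition: an unseen character’s IDS can be encoded and matched against image features without retraining.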

Similar Papers
  • Research Article
  • Cited by 13
  • 10.1109/tgrs.2022.3171038
DFAN: Dual-Branch Feature Alignment Network for Domain Adaptation on Point Clouds
  • Jan 1, 2022
  • IEEE Transactions on Geoscience and Remote Sensing
  • Liangwei Shi + 4 more

Unsupervised domain adaptation (UDA) significantly reduces the gap between the source domain and the target domain in machine learning and computer vision tasks. Most UDA approaches are applied to images and videos, and only a few methods implement domain adaptation on 3-D computer vision problems. The existing UDA approaches operating on point clouds try to extract domain-invariant features in different domains for feature alignment. However, higher commonality brings less diversity and results in a loss of detailed information. In this article, we propose a novel dual-branch feature alignment network (DFAN) architecture for domain adaptation on point cloud visual tasks to better exploit the respective characteristics of local and global features. Our approach specializes in the extraction and alignment of global and local features with different strategies in each branch to complement each other. We also introduce a hierarchical alignment strategy for local feature alignment and a distribution alignment strategy for global feature alignment. Experiments on the PointDA-10 and PointSegDA datasets show that our approach achieves state-of-the-art performance on the UDA of point cloud classification and segmentation tasks. The ablation study demonstrates the effectiveness of the dual-branch design and the feature alignment strategies.

  • Research Article
  • Cited by 22
  • 10.1016/j.eswa.2017.07.018
Face alignment using a deep neural network with local feature learning and recurrent regression
  • Jul 13, 2017
  • Expert Systems with Applications
  • Byung-Hwa Park + 2 more

  • Conference Article
  • Cited by 6
  • 10.1109/icinfa.2017.8078942
Cascading global and local features for face recognition using support vector machines and local ternary patterns
  • Jul 1, 2017
  • Jia-Ching Jang Jian + 5 more

This study analyzes the effectiveness of global (the whole face) and local (regions of the eyes, nose, and mouth) features for face recognition. Features describing human faces are encoded as local ternary patterns. A two-class support vector machine is used as the supervised learning algorithm for training the recognition models. In the recognition process, recognition models based on global features and local features are cascaded. To identify a face image, the local features are used iteratively to filter out candidates that cannot be clearly identified by the global features, until the candidate with the highest probability is selected. The experimental results show that cascading the recognition models of global and local features obtains better classification accuracy than a single classification process.

  • Research Article
  • 10.1088/1361-6501/ae0e8a
MLDA-GLFD: a multi-layer domain-adaptive rolling bearing fault diagnosis method fusing global and local features
  • Nov 4, 2025
  • Measurement Science and Technology
  • Yadong Jiang + 5 more

In the field of fault diagnosis, transfer learning methods have achieved remarkable progress in rolling bearing fault diagnosis. However, existing approaches still face challenges in feature extraction and in feature alignment between the source and target domains: feature networks struggle to effectively capture both local and global features, and the distribution discrepancies between the two domains further degrade the performance of transfer models. To address these issues, this paper proposes a multilayer domain-adaptive fault diagnosis method that integrates global and local feature representations (MLDA-GLFD). Specifically, an enhanced local feature extraction module (SLFE) is designed by combining depthwise separable convolution and partial convolution to capture local feature information more precisely. In addition, a global feature extraction module (GFAB) is constructed, which incorporates a multi-head self-attention mechanism (MHSA), a global context block (GCblock), and a pyramid pooling module (PPM) to jointly strengthen global feature extraction. To further achieve feature distribution alignment between the source and target domains, a dynamic convolution (DConv) module with a hierarchical domain alignment mechanism is designed to adaptively adjust the receptive field of convolutional kernels. Moreover, a combination of Maximum Mean Discrepancy (MMD) and Multiple Kernel MMD (MKMMD) is employed to accurately align inter-domain features, thereby enhancing the model’s transferability to the target domain. Experimental results on two rolling bearing datasets, CWRU and SDUST, demonstrate that the proposed MLDA-GLFD method achieves average accuracies of 96% and 94%, respectively, significantly outperforming other comparative methods.
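The MMD criterion used above measures the distance between source and target feature distributions as the distance between their kernel mean embeddings; the multi-kernel variant averages this over several bandwidths. A minimal sketch of the (biased) squared-MMD estimator with an RBF kernel, for illustration only (names are not from the paper):

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Biased estimate of squared MMD between samples X (n, d) and Y (m, d)
    under the RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def gram(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

def mk_mmd2(X, Y, gammas=(0.25, 1.0, 4.0)):
    """Multi-kernel variant: average squared MMD over several bandwidths."""
    return sum(rbf_mmd2(X, Y, g) for g in gammas) / len(gammas)
```

Minimizing this quantity over the feature extractor encourages source and target features to become indistinguishable under every kernel in the family.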

  • Conference Article
  • Cited by 178
  • 10.1109/cvpr.2005.433
Combining Local and Global Image Features for Object Class Recognition
  • Jan 1, 2005
  • D.A Lisin + 4 more

Object recognition is a central problem in computer vision research. Most object recognition systems have taken one of two approaches, using either global or local features exclusively. This may be in part due to the difficulty of combining a single global feature vector with a set of local features in a suitable manner. In this paper, we show that combining local and global features is beneficial in an application where rough segmentations of objects are available. We present a method for classification with local features using non-parametric density estimation. Subsequently, we present two methods for combining local and global features. The first uses a stacking ensemble technique, and the second uses a hierarchical classification system. Results show the superior performance of these combined methods over the component classifiers, with a reduction of over 20% in the error rate on a challenging marine science application.

  • Research Article
  • Cited by 12
  • 10.1016/j.artmed.2022.102341
Global and local attentional feature alignment for domain adaptive nuclei detection in histopathology images
  • Jul 2, 2022
  • Artificial Intelligence in Medicine
  • Zhi Wang + 5 more

  • Research Article
  • Cited by 26
  • 10.1109/tip.2020.2965306
Joint Coding of Local and Global Deep Features in Videos for Visual Search.
  • Jan 1, 2020
  • IEEE Transactions on Image Processing
  • Lin Ding + 4 more

Practically, it is more feasible to collect compact visual features, rather than full video streams, from hundreds of thousands of cameras into the cloud for big data analysis and retrieval. The problem then becomes which kinds of features should be extracted, compressed, and transmitted so as to meet the requirements of various visual tasks. Recently, many studies have indicated that the activations from the convolutional layers in convolutional neural networks (CNNs) can be treated as local deep features describing particular details inside an image region, which are then aggregated (e.g., using Fisher Vectors) into a powerful global descriptor. The combination of local and global features can satisfy those various needs effectively. It has also been validated that, if only local deep features are coded and transmitted to the cloud while the global features are recovered using the decoded local features, the aggregated global features will be lossy and will consequently degrade the overall performance. Therefore, this paper proposes a joint coding framework for local and global deep features (DFJC) extracted from videos. In this framework, we introduce a coding scheme for real-valued local and global deep features with intra-frame lossy coding and inter-frame reference coding. A theoretical analysis is performed to understand how the number of inliers varies with the number of local features. Moreover, inter-feature correlations are exploited in our framework: local feature coding can be accelerated by making use of the frame types determined with global features, while the lossy global features aggregated from the decoded local features can be used as a reference for global feature coding. Extensive experimental results under three metrics show that our DFJC framework can significantly reduce the bitrate of local and global deep features from videos while maintaining retrieval performance.

  • Peer Review Report
  • Cited by 1
  • 10.7554/elife.78635.sa2
Author response: A connectomics-based taxonomy of mammals
  • Oct 10, 2022
  • Laura E Suarez + 6 more

  • Research Article
  • Cited by 20
  • 10.1007/s00371-021-02136-z
A spatio-temporal integrated model based on local and global features for video expression recognition
  • May 5, 2021
  • The Visual Computer
  • Min Hu + 4 more

Facial expressions can be represented largely by the dynamic variations of important facial expression parts, i.e., the eyebrows, eyes, nose, and mouth. The features of these parts are regarded as local features. However, facial global information is also useful for recognition because it is a necessary complement to local features. In this paper, a spatio-temporal integrated model that jointly learns local and global features is proposed for video expression recognition. Firstly, to capture the actions of key facial units, a spatio-temporal attention part-gradient-based hierarchical bidirectional recurrent neural network (spatio-temporal attention PGHRNN) is constructed. It can capture the dynamic variations of gradients around facial landmark points. In addition, a new kind of spatial attention mechanism is introduced to adaptively recalibrate the features of various facial parts. Secondly, to complement the local features extracted by the spatio-temporal attention PGHRNN, a 50-layer squeeze-and-excitation residual network with a long short-term memory network (SE-ResNet-50-LSTM) is used as a global feature extractor and classifier. Finally, to integrate the local and global features and improve the performance of facial expression recognition, a joint adaptive fine-tuning method (JAFTM) is proposed to combine the two networks, which can adaptively adjust the network weights. Extensive experiments demonstrate that our proposed model achieves a superior recognition accuracy of 98.95% on CK+ for 7-class facial expressions and 85.40% on the MMI database, outperforming other state-of-the-art methods.

  • Conference Article
  • Cited by 1
  • 10.1109/snpd.2007.188
Local and Global Features Extracting and Fusion for Microbial Recognition
  • Jul 1, 2007
  • Li Xiaojuan + 3 more

This paper presents the extraction of global and local features based on digital microbial image analysis, and their fusion, for microbe recognition. The global features are extracted using invariant moments and a co-occurrence matrix, where the invariant-moment computation is simplified by computing the geometric and central moments on the edge of the microbe rather than on its full region. Curvature-change detection on the microbe, obtained by wavelet transform, serves as the local feature. Min-max normalization is applied, and after fusion of the normalized global match degree and the normalized local match degree, the recognition result is the class containing the template image with the largest fused match degree. The experimental results show that fusing local and global features is effective for microbe image analysis and recognition.
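The fusion step described above — min-max normalization of the two match-degree vectors followed by selecting the template class with the largest fused score — can be sketched as follows (a toy illustration; the names and the additive fusion rule are assumptions, not code from the paper):

```python
import numpy as np

def minmax(scores):
    """Min-max normalize a score vector to [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:  # all scores equal: normalization is undefined, return zeros
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def fuse_and_recognize(global_match, local_match):
    """Fuse normalized global and local match degrees per template class;
    return the index of the class with the largest fused match degree."""
    fused = minmax(global_match) + minmax(local_match)
    return int(np.argmax(fused))
```

For example, `fuse_and_recognize([0.2, 0.9, 0.5], [0.1, 0.8, 0.9])` returns class 1, whose combined normalized global and local match degrees are highest even though class 2 has the best local match.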

  • Research Article
  • Cited by 2
  • 10.4236/oalib.1106358
The role of features in Chinese character recognition
  • Jan 1, 2020
  • OALib
  • Feifan Luo + 1 more

Chinese characters contain three levels of visual information: strokes, radicals, and structures. The stroke, as the basic unit of Chinese characters, is the foundation of their formation. Some research has shown that strokes can enhance semantic and orthographic representation during Chinese character acquisition, indicating that the processing of Chinese characters is based on feature processing. This is consistent with the feature-processing theory of letter and object recognition. Recent research on visual recognition shows that the recognition of letters and words is based on terminations and on the other features (vertices and midsegments) used in object recognition. In this study, we used a delayed-segment technique in Chinese character recognition to examine the importance of terminations, vertices, and midsegments under different degrees of masking. Experiment 1 selected about 43 high-frequency Chinese characters with left-right structures and asked the participants to name them; it examined the importance of terminations, vertices, and midsegments based on line units while varying the degree to which these features were masked. The main effect of masking degree was significant: masking 35% and 55% of the features made the RTs significantly longer. The main effect of feature type was also significant: masking vertices and midsegments made RTs significantly longer, but terminations showed no feature effect in Experiment 1. Because the three features (vertices, midsegments, and terminations) were masked on the basis of line units in Experiment 1, the integrity of the strokes in the Chinese characters may have been destroyed, which may explain why the termination effect was not apparent. To verify this, Experiment 2 explored vertices, midsegments, and terminations based on stroke units.
The results show that the main effect of masking degree was significant, with RTs longer at 35% and 55% masking. The main effect of feature type was also significant: RTs in character recognition were significantly longer when the vertices, midsegments, and terminations were masked. This shows that vertices, midsegments, and terminations based on stroke units are also important in Chinese character recognition. These results suggest that the key features in Chinese character recognition are in line with the neuronal recycling hypothesis, according to which humans’ ability to learn from culture depends on the reuse of preexisting brain circuits, so the key features in word recognition may derive from the key features we use in object processing. The hook is considered a special termination in Chinese characters, and the results of Experiment 2 also emphasized the key role of hook features in character recognition.

  • Research Article
  • Cited by 7
  • 10.1016/j.neucom.2022.04.093
Part-facial relational and modality-style attention networks for heterogeneous face recognition
  • Apr 20, 2022
  • Neurocomputing
  • Jian Yu + 3 more

  • Conference Article
  • Cited by 11
  • 10.1109/ispacs.2007.4445969
Palmprint recognition using fusion of local and global features
  • Jan 1, 2007
  • Xin Pan + 2 more

Palmprint recognition is a biometrics technology that has developed rapidly over the last decade. However, some typical problems arise when capturing palmprint images. First, the delta region in the center of the palm causes uneven lighting, and the brightness of the palmprint images varies with hand pressure, stretching, and palm structure. Second, it is hard to align the palmprint images precisely to the same position, especially when the subjects are required to spread their hand on the scanner surface, even for the same palm. Neither the global nor the local features alone can satisfy the need for high recognition accuracy. Therefore, we propose a novel method that fuses local and global features, extracted by non-negative matrix factorization with sparseness constraints (NMFsc) and principal component analysis (PCA), respectively, to improve recognition performance. Experiments demonstrate that local and global features are strongly complementary for palmprint recognition.

  • Conference Article
  • Cited by 131
  • 10.1109/cvprw.2017.161
Deep Local Video Feature for Action Recognition
  • Jul 1, 2017
  • Zhenzhong Lan + 3 more

We investigate the problem of representing an entire video using CNN features for human action recognition. Currently, limited by GPU memory, we have not been able to feed a whole video into CNNs/RNNs for end-to-end learning. A common practice is to use sampled frames as inputs and video labels as supervision. One major problem of this popular approach is that the local samples may not contain the information indicated by the global labels. To deal with this problem, we propose to treat the deep networks trained on local inputs as local feature extractors. After extracting local features, we aggregate them into global features and train another mapping function on the same training data to map the global features into global labels. We study a set of problems regarding this new type of local features, such as how to aggregate them into global features. Experimental results on the HMDB51 and UCF101 datasets show that, for these new local features, a simple maximum pooling of the sparsely sampled features leads to a significant performance improvement.
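The aggregation step described above — turning per-frame (local) CNN features into one video-level (global) descriptor by maximum pooling over sparsely sampled frames — is simple enough to state directly (a sketch; the function name is illustrative):

```python
import numpy as np

def max_pool_video_feature(frame_features):
    """Aggregate a sequence of per-frame feature vectors into a single
    video-level descriptor by element-wise maximum pooling."""
    return np.max(np.stack(frame_features, axis=0), axis=0)
```

Element-wise max pooling keeps, for each feature dimension, the strongest activation seen in any sampled frame, so a discriminative cue only visible in one frame still reaches the video-level representation.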

  • Research Article
  • Cited by 6
  • 10.1038/s41598-025-90440-2
A fine-grained human facial key feature extraction and fusion method for emotion recognition
  • Feb 20, 2025
  • Scientific Reports
  • Shiwei Li + 4 more

Emotion, a fundamental mapping of human responses to external stimuli, has been extensively studied in human–computer interaction, particularly in areas such as intelligent cockpits and systems. However, accurately recognizing emotions from facial expressions remains a significant challenge due to lighting conditions, posture, and micro-expressions. Emotion recognition using global or local facial features is a key research direction; however, relying solely on either often results in models with uneven attention across facial features, neglecting key variations critical for detecting emotional changes. This paper proposes a method for modeling and extracting key facial features by integrating global and local facial data. First, we construct a comprehensive image preprocessing model that includes super-resolution processing, lighting and shading processing, and texture enhancement. This preprocessing step significantly enriches the expression of image features. Second, a global facial feature recognition model is developed using an encoder-decoder architecture, which effectively eliminates environmental noise and generates a comprehensive global feature dataset for facial analysis. Simultaneously, the Haar cascade classifier is employed to extract refined features from key facial regions, including the eyes, mouth, and overall face, resulting in a corresponding local feature dataset. Finally, a two-branch convolutional neural network is designed to integrate both the global and local facial feature datasets, enhancing the model’s ability to recognize facial characteristics accurately. The global feature branch fully characterizes the global features of the face, while the local feature branch focuses on the local features. An adaptive fusion module integrates the global and local features, enhancing the model’s ability to differentiate subtle emotional changes.
To evaluate the accuracy and robustness of the model, we train and test it on the FER-2013 and JAFFE emotion datasets, achieving average accuracies of 80.59% and 97.61%, respectively. Compared to existing state-of-the-art models, our refined face feature extraction and fusion model demonstrates superior performance in emotion recognition. Additionally, the comparative analysis shows that emotional features across different faces exhibit similarities. Building on psychological research, we categorize the dataset into three emotion classes: positive, neutral, and negative. The accuracy of emotion recognition is significantly improved under the new classification criteria. Additionally, a self-built dataset is used to further validate that this classification approach has important implications for practical applications.
