Articles published on Common Semantic Space
59 Search results
- Research Article
- 10.3390/electronics15040830
- Feb 14, 2026
- Electronics
- Abdul Rahaman Wahab Sait + 1 more
Stance detection has emerged as an essential tool in natural language processing for understanding how individuals express agreement, disagreement, or neutrality toward specific targets in social and online discourse. It plays a crucial role in bilingual and multilingual environments, including English-Arabic social media ecosystems, where differences in language structure, discourse style, and data availability pose significant challenges for reliable stance modelling. Existing approaches often struggle with target awareness, cross-lingual generalization, robustness to noisy user-generated text, and the interpretability of model decisions. This study aims to build a reliable, explainable target-aware bilingual stance-detection framework that generalizes across heterogeneous stance formats and languages without retraining on a dataset specific to the target language. Thus, a unified dual-encoder architecture based on mDeBERTa-v3 is proposed. Cross-language contrastive learning offers an auxiliary training objective to align English and Arabic stance representations in a common semantic space. Robustness-oriented regularization is used to mitigate the effects of informal language, vocabulary variation, and adversarial noise. To promote transparency and trustworthiness, the framework incorporates token-level rationale extraction, enables fine-grained interpretability, and supports analysis of hallucination. The proposed model is tested on a combined bilingual test set and two structurally distinct zero-shot benchmarks: MT-CSD and AraStance. Experimental results show consistent performance, with accuracies of 85.0% and 86.8% and F1-scores of 84.7% and 86.8% on the zero-shot benchmarks, confirming stable performance and realistic generalization. Ultimately, these findings reveal that effective bilingual stance detection can be achieved via explicit target conditioning, cross-lingual alignment, and explainability-driven design.
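The cross-language contrastive objective mentioned in this abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss over paired English and Arabic sentence embeddings; the abstract does not give the exact loss, so the function names, the temperature value, and the toy vectors below are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(en_vecs, ar_vecs, temperature=0.1):
    """Mean InfoNCE loss; en_vecs[i] and ar_vecs[i] are assumed to be a pair.

    Assumption: the other Arabic vectors in the batch act as negatives.
    """
    losses = []
    for i, anchor in enumerate(en_vecs):
        logits = [cosine(anchor, ar) / temperature for ar in ar_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (hypothetical): matched pairs point in similar directions.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
misaligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < misaligned)
```

The loss is small when each English embedding is closest to its own Arabic counterpart, which is the sense in which such an objective pulls the two languages into a common semantic space.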
- Research Article
- 10.12731/2658-4034-2025-16-5-841
- Oct 31, 2025
- Russian Journal of Education and Psychology
- Svetlana O Fominykh
Background. The purpose of the article is to conceptually substantiate the formation and development of the polysubject model of pedagogical education in Russia, clarify the conceptual apparatus and propose a framework for analyzing its institutional and methodological foundations. The work specifies that a polysubject in education is a special type of community that arises from the integration of many equal participants and has the qualities of collective subjectivity; the provisions on polysubject interaction (I. V. Vachkov) are used as a theoretical support. Methodologically, the model unfolds in the logic of the synergetic approach: contradictions, crises and external challenges are considered as triggers for self-organization and maturation of polysubjectivity (the effect of "growth points" of complex systems). The network form of implementation of educational programs, enshrined in Federal Law No. 273-FZ, as well as initiatives of the period of the national project "Education" (2019-2024) and the creation of centers for continuous professional development of teachers as an infrastructure of horizontal interaction and distributed participation are analyzed as a normative and program framework. Materials and methods. Within this methodological framework, working definitions are formulated and clarified, and a coherent research design is proposed: the stage-by-stage formation of a polysubject (initiation – development of interaction – stability – adaptability – self-development) and signs of maturity (collective identity "we", the ability to self-regulate and distribute roles, horizontal cooperation, joint meaning-making and initiation of innovations). These elements are derived from domestic sources, primarily from the works of I. V. Vachkov and synergetic literature, and are correlated with the legal and organizational framework of modern teacher training. 
The study uses a set of methods: systemic structural and synergetic analysis of domestic and foreign pedagogical concepts, strategic documents and regulatory legal acts on network forms of training; modeling, interpretation. Results. A research design of the stages of development of a polysubject model has been formulated: from initiation (awareness of the need to combine efforts and primary coordination of goals) through the development of interaction (the emergence of stable channels of communication and cooperation), sustainability (the formation of a common structure of roles, procedures and meanings), adaptability (the ability of a community to restructure the distribution of functions and generate new forms of joint activity in response to changes in the environment) – to self-development (a combination of internal motivation of participants, self-regulation and initiative, when the community becomes a source of innovation). Conclusion. Qualitative growth in teacher training is achieved where the plurality of participants in education turns into collective subjectivity with its properties – a common semantic space, self-regulation, cooperation and the ability to innovate.
- Research Article
- 10.1007/s42486-025-00196-x
- Jul 26, 2025
- CCF Transactions on Pervasive Computing and Interaction
- Mohadese Aali + 2 more
Smartphone continuous user identification via hybrid transfer learning: leveraging common semantic space and pre-trained models
- Research Article
- 10.1007/s00530-025-01681-0
- Jan 25, 2025
- Multimedia Systems
- Zhanyang Liang + 1 more
From coarse to fine: a two-stage common semantic space construction for unpaired cross modal retrieval
- Research Article
- 10.1109/tgrs.2025.3648057
- Jan 1, 2025
- IEEE Transactions on Geoscience and Remote Sensing
- Lanxiao Wang + 5 more
Recently, remote sensing image captioning (RSIC) has become an emerging research hot spot that requires models to understand and describe remote sensing images. However, the huge modal gap between vision and text makes it difficult to achieve accurate cross-modal transformation for RSIC. Existing methods usually transform the vision modality directly into the text modality using a multi-task learning strategy or a visual attention mechanism, and do not make full use of existing prior information to build explicit cross-modal knowledge for vision and text transformation. To exploit the cross-modal alignment ability of the vision-language model (VLM), we propose a novel dual prompts aware cross-modal semantic interaction and fusion network for RSIC. It can explicitly dig out potential entity concepts and predict the scene class in the images. It further builds dual prompts to achieve cross-modal interaction and fusion, which can build a cross-modal common semantic space to provide prior information for caption generation. Specifically, we first introduce an entity-concept exporter to obtain explicit entity concepts in the image based on a preset entity space. Next, we design a multi-scale scene predictor to obtain fine-grained visual semantic features and the scene class. Then, we propose a prompt aware cross-modal interaction module to build a cross-modal common semantic space as an intermediate connection for caption generation. Finally, we further design a prompt aware attention fusion module for the transformer decoder, which can utilize cross-modal prompt features to generate accurate captions. We conduct extensive experiments on three challenging datasets, including UCM-Captions, RSICD and NWPU-Captions, and our method achieves SoTA performance. On the typical remote sensing image captioning dataset RSICD, our method achieves 3.3% and 20.0% improvement in BLEU@4 and CIDEr respectively, which shows the effectiveness of our method.
- Research Article
18
- 10.1109/tnnls.2023.3330975
- Jan 1, 2025
- IEEE Transactions on Neural Networks and Learning Systems
- Kaihang Jiang + 5 more
In the past decades, supervised cross-modal hashing methods have attracted considerable attention due to their high search efficiency on large-scale multimedia databases. Many of these methods leverage semantic correlations among heterogeneous modalities by constructing a similarity matrix or building a common semantic space with the collective matrix factorization method. However, in the existing methods the similarity matrix may sacrifice scalability and cannot preserve more semantic information in hash codes. Meanwhile, the matrix factorization methods cannot embed the main modality-specific information into hash codes. To address these issues, we propose a novel supervised cross-modal hashing method called random online hashing (ROH) in this article. ROH proposes a linear bridging strategy to simplify the pair-wise similarities factorization problem into a linear optimization one. Specifically, a bridging matrix is introduced to establish a bidirectional linear relation between hash codes and labels, which preserves more semantic similarities in hash codes and significantly reduces the semantic distances between hash codes of samples with similar labels. Additionally, a novel maximum eigenvalue direction (MED) embedding method is proposed to identify the direction of maximum eigenvalue for the original features and preserve critical information in modality-specific hash codes. Finally, to handle real-time data dynamically, an online structure is adopted to deal with newly arriving data chunks without considering pairwise constraints. Extensive experimental results on three benchmark datasets demonstrate that the proposed ROH outperforms several state-of-the-art cross-modal hashing methods.
- Research Article
- 10.32603/2412-8562-2024-10-4-43-52
- Sep 19, 2024
- Discourse
- N V Kazarinova
Introduction. The subject of this article is communicative resources, which are selected by partners from a huge variety of linguistic means, functional styles, rhetorical techniques, and subject activities, and through which a common semantic space between individuals is generated. Methodology and sources. As a methodological platform we propose the principles of praxis-oriented social semiotics. Being one of the relatively late trends of semiotic research, social semiotics focuses attention not so much on signs or sign systems as on social signifying practices as regular, repetitive, recognisable types of actions: the actions of communication participants are endowed with the properties of sign systems through which relations between individuals are discovered, made visible and meaningful. The mechanisms of meaning production, from the point of view of the participants of the event, are analysed using the concept of ‘discursive practices’, understood as speech actions performed in order to solve a variety of practical interpersonal tasks. Results and discussion. As an individual learns a variety of interpersonal discursive practices, different relational systems become available to him/her. Specific discursive practices allow individuals to present different versions of their self, justify their actions, maintain dominance and/or subordination relations, thus demonstrating mastery of the communicative situation. Following the principles of praxis-oriented social semiotics, it can be argued that interpersonal relationships involve what can be called a discourse of trust, personal involvement in relationships, intimacy/closeness. Conclusion. The described symbolic regulators make it possible not only to comprehend semiotically the social positions of interacting parties, but also to offer the participants of interpersonal communication self-control procedures and an arsenal of communicative actions as means of behaviour management in situations of interpersonal communication.
- Research Article
43
- 10.1609/aaai.v38i3.28034
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
- Xin Jiang + 5 more
Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category, which poses challenges due to subtle inter-class discrepancies and large intra-class variations. However, prevailing approaches primarily focus on uni-modal visual concepts. Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks, yet the applicability of such models to FGVC tasks remains uncertain. In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model. Our MP-FGVC comprises a multimodal prompts scheme and a multimodal adaptation scheme. The former includes Subcategory-specific Vision Prompt (SsVP) and Discrepancy-aware Text Prompt (DaTP), which explicitly highlight the subcategory-specific discrepancies from the perspectives of both vision and language. The latter aligns the vision and text prompting elements in a common semantic space, facilitating cross-modal collaborative reasoning through a Vision-Language Fusion Module (VLFM) for further improvement on FGVC. Moreover, we tailor a two-stage optimization strategy for MP-FGVC to fully leverage the pre-trained CLIP model and expedite efficient adaptation for FGVC. Extensive experiments conducted on four FGVC datasets demonstrate the effectiveness of our MP-FGVC.
- Research Article
4
- 10.1186/s12864-024-09967-9
- Feb 14, 2024
- BMC Genomics
- Ping Zhang + 6 more
Background. Brain diseases pose a significant threat to human health, and various network-based methods have been proposed for identifying gene biomarkers associated with these diseases. However, the brain is a complex system, and extracting topological semantics from different brain networks is necessary yet challenging to identify pathogenic genes for brain diseases. Results. In this study, we present a multi-network representation learning framework called M-GBBD for the identification of gene biomarkers in brain diseases. Specifically, we collected multi-omics data to construct eleven networks from different perspectives. M-GBBD extracts the spatial distributions of features from these networks and iteratively optimizes them using Kullback–Leibler divergence to fuse the networks into a common semantic space that represents the gene network for the brain. Subsequently, a graph consisting of both gene and large-scale disease proximity networks learns representations through graph convolution techniques and predicts which brain diseases a gene is associated with, while providing association scores. Experimental results demonstrate that M-GBBD outperforms several baseline methods. Furthermore, our analysis supported by bioinformatics revealed CAMP as a gene significantly associated with Alzheimer's disease identified by M-GBBD. Conclusion. Collectively, M-GBBD provides valuable insights into identifying gene biomarkers for brain diseases and serves as a promising framework for brain network representation learning.
- Research Article
14
- 10.1145/3637442
- Jan 11, 2024
- ACM Transactions on Multimedia Computing, Communications, and Applications
- Dan Shi + 4 more
Most cross-modal retrieval methods assume the multi-modal training data is complete and has a one-to-one correspondence. However, in the real world, multi-modal data generally suffers from missing modality information due to the uncertainty of data collection and storage processes, which limits the practical application of existing cross-modal retrieval methods. Although some solutions have been proposed to generate the missing modality data using a single pseudo sample, this may lead to incomplete semantic restoration and sub-optimal retrieval results due to the limited semantic information it provides. To address this challenge, this article proposes an Incomplete Cross-Modal Retrieval with Deep Correlation Transfer (ICMR-DCT) method that can robustly model incomplete multi-modal data and dynamically capture the adjacency semantic correlation for cross-modal retrieval. Specifically, we construct intra-modal graph attention-based auto-encoder to learn modality-invariant representations by performing semantic reconstruction through intra-modality adjacency correlation mining. Then, we design dual cross-modal alignment constraints to project multi-modal representations into a common semantic space, thus bridging the heterogeneous modality gap and enhancing the discriminability of the common representation. We further introduce semantic preservation to enhance adjacency semantic information and achieve cross-modal semantic correlation. Moreover, we propose a nearest-neighbor weighting integration strategy with cross-modal correlation transfer to generate the missing modality data according to inter-modality mapping relations and adjacency correlations between each sample and its neighbors, which improves the robustness of our method against incomplete multi-modal training data. 
Extensive experiments on three widely tested benchmark datasets demonstrate the superior performance of our method in cross-modal retrieval tasks under both complete and incomplete retrieval scenarios. The datasets and source code we used are available at https://github.com/shidan0122/DCT.git .
- Research Article
- 10.35679/2226-0226-2024-14-4-795-801
- Jan 1, 2024
- Scientific Review: Theory and Practice
- Vladimir Ivanovich Kolesov + 1 more
The paper considers the digitalization of training in modern investment-market business in the context of socially responsible investment, where training is frequently delivered online. It highlights the relevance of the communicative game model in educational technologies and offers an example of a business game model for business training in a digital environment. A method for balancing within a digital game as students acquire knowledge is presented. To achieve the goals of a group discussion, the moderator needs to calculate the communicative parameters of the participants. The moderator's task is to form a common semantic space in which the group can express itself as a subject of collective mental activity. The method of creative marketing invented by Alex Osborn, one of the founders of the BBDO advertising company, is considered. Also considered are the moderation method, an educational methodology formed in Germany in the 1960s and 1970s, and the method of active psychological and pedagogical influence proposed by A.V. Lazarev. To increase the effectiveness of business trainings, it is necessary to strengthen students' motivation, which is fostered by a clear understanding within the organization of the need for training and of the finite supply of suitably qualified human resources in the labor market. The effectiveness of business trainings can also be increased by introducing modern training methods that use the resources of the digital environment and by improving the mobility of corporate training structures. Business trainings can prepare cadets and university students for realistic conditions of interaction in the human resources market of civil society and increase their professional competencies, creativity, and communication skills.
- Research Article
7
- 10.1088/1361-6501/ad0613
- Nov 2, 2023
- Measurement Science and Technology
- Yu Zeyu + 3 more
Ultrasonic inspection of pipeline welds still relies on the traditional method of visually inspecting signals to identify pipeline defects. The identification of defects relies entirely on the subjective judgment of practitioners and is highly dependent on their level of experience. Deep learning models have achieved very good results in classification tasks, but they rely on a large number of annotated data samples for each category. However, it is difficult to collect a large number of samples with different defects and annotate them for the classification of pipe welding defects. Based on the idea of zero-shot learning (ZSL), which makes full use of experts’ semantic descriptions of defect categories, artificial semantic features are integrated cross-modally with ultrasonic inspection signal features. In this way, a common semantic space containing seen and unseen classes is constructed to achieve the detection of various defects. Meanwhile, to alleviate the problem of extreme imbalance of training data between the seen and unseen classes in ZSL model training, a ZSL model Feature-GAN-ZSL (FGZ) fused with a generative adversarial network (GAN) is proposed. The model utilizes a Feature-GAN network to generate unseen class features during training and adds a classifier to enhance the generation of features with stronger discriminative power. In the experiments, sample data for porosity, incomplete penetration, and cracks were used as seen classes, and samples for incomplete fusion and slag entrapment were used as unseen classes. Five state-of-the-art models in the ZSL domain were compared. The results show that the FGZ model has a good ability to recognize various defects, not only the types of defects that participated in the training but also the defects that did not participate in the training. This makes it well suited to dealing with various pipeline welding defects.
- Research Article
43
- 10.1109/tcsvt.2023.3257193
- Oct 1, 2023
- IEEE Transactions on Circuits and Systems for Video Technology
- Wentao Ma + 4 more
Cross-modal retrieval aims to enable a flexible bi-directional retrieval experience across different modalities (e.g., searching for videos with texts). Many existing efforts tend to learn a common semantic representation embedding space in which items of different modalities can be directly compared, wherein the positive global representations of video-text pairs are pulled close while the negative ones are pushed apart via pair-wise ranking loss. However, such a vanilla loss would unfortunately yield ambiguous feature embeddings for texts of different videos, causing inaccurate cross-modal matching and unreliable retrievals. Toward this end, we propose a multimodal contrastive knowledge distillation method for instance video-text retrieval, called MCKD, by adaptively using the general knowledge of self-supervised model (teacher) to calibrate mixed boundaries. Specifically, the teacher model is tailored for robust (less-ambiguous) visual-text joint semantic space by maximizing mutual information of co-occurred modalities during multimodal contrastive learning. This robust and structural inter-instance knowledge is then distilled, with the help of explicit discrimination loss, to a student model for improved matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) demonstrate that our MCKD can achieve up to 8.8%, 6.4%, 5.9%, and 5.3% improvement in text-to-video performance by the R@1 metric, compared with 14 SoTA baselines.
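The pair-wise ranking loss that this abstract describes as the vanilla baseline can be sketched as a bidirectional max-margin hinge over a video-text similarity matrix. This is a generic illustration and not the MCKD method; the margin value and the toy similarity matrices are assumptions.

```python
def pairwise_ranking_loss(sim, margin=0.2):
    """Bidirectional hinge ranking loss.

    sim[i][j] is the similarity between video i and text j; the diagonal
    holds the matched (positive) pairs, everything else is a negative.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Video i as query: a wrong text j should score below text i.
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # Text i as query: a wrong video j should score below video i.
            total += max(0.0, margin + sim[j][i] - sim[i][i])
    return total / n

# Hypothetical similarity matrices for two video-text pairs.
well_separated = [[0.9, 0.1], [0.2, 0.8]]
ambiguous = [[0.5, 0.6], [0.5, 0.5]]
print(pairwise_ranking_loss(well_separated), pairwise_ranking_loss(ambiguous))
```

The loss vanishes once every positive pair beats every negative by at least the margin, so it places no further constraint on how texts of different videos relate to one another, which is consistent with the ambiguity the abstract points out.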
- Research Article
3
- 10.1016/j.image.2023.117018
- Jul 7, 2023
- Signal Processing: Image Communication
- Chunpu Sun + 4 more
Multi-label adversarial fine-grained cross-modal retrieval
- Research Article
42
- 10.1109/tmm.2023.3254199
- Jan 1, 2023
- IEEE Transactions on Multimedia
- Xiaoqing Liu + 5 more
The amount of multi-modal data available on the Internet is enormous. Cross-modal hash retrieval maps heterogeneous cross-modal data into a single Hamming space to offer fast and flexible retrieval services. However, existing cross-modal methods mainly rely on the feature-level similarity between multi-modal data and ignore the relationship between relative rankings and label-level fine-grained similarity of neighboring instances. To overcome these issues, we propose a novel Deep Cross-modal Hashing based on Semantic Consistent Ranking (DCH-SCR) that comprehensively investigates the intra-modal semantic similarity relationship. Firstly, to the best of our knowledge, it is an early attempt to preserve semantic similarity for cross-modal hashing retrieval by combining label-level and feature-level information. Secondly, the inherent gap between modalities is narrowed by developing a ranking alignment loss function. Thirdly, the compact and efficient hash codes are optimized based on the common semantic space. Finally, we use the gradient to specify the optimization direction and introduce the Normalized Discounted Cumulative Gain (NDCG) to achieve varying optimization strengths for data pairs with different similarities. 
Extensive experiments on three real-world image-text retrieval datasets demonstrate the superiority of DCH-SCR over several state-of-the-art cross-modal retrieval methods.
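For readers unfamiliar with the Normalized Discounted Cumulative Gain mentioned in this abstract, the following is a minimal sketch of the standard metric using the common exponential-gain form. How DCH-SCR folds NDCG into its optimization is not described here, so this shows only the metric itself; the relevance grades are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of four retrieved items, in ranked order.
print(ndcg([3, 2, 1, 0]), ndcg([0, 1, 2, 3]))
```

Because the discount grows with rank, placing a highly relevant item low in the list costs more than misplacing a marginal one, which is why the metric can weight data pairs differently according to their similarity.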
- Research Article
5
- 10.1109/access.2023.3239858
- Jan 1, 2023
- IEEE Access
- Huaying Zhang + 4 more
A cross-modal image retrieval that explicitly considers semantic relationships between images and texts is proposed. Most conventional cross-modal image retrieval methods retrieve the target images by directly measuring the similarities between the candidate images and query texts in a common semantic embedding space. However, such methods tend to focus on a one-to-one correspondence between a predefined image-text pair during the training phase, and other semantically similar images and texts are ignored. By considering the many-to-many correspondences between semantically similar images and texts, a common embedding space is constructed to assure semantic relationships, which allows users to accurately find more images that are related to the input query texts. Thus, in this paper, we propose a cross-modal image retrieval method that considers semantic relationships between images and texts. The proposed method calculates the similarities between texts as semantic similarities to acquire the relationships. Then, we introduce a loss function that explicitly constructs the many-to-many correspondences between semantically similar images and texts from their semantic relationships. We also propose an evaluation metric to assess whether each method can construct an embedding space considering the semantic relationships. Experimental results demonstrate that the proposed method outperforms conventional methods in terms of this newly proposed metric.
- Research Article
30
- 10.1109/tcyb.2021.3081615
- Nov 1, 2022
- IEEE Transactions on Cybernetics
- Xiaozhao Fang + 5 more
Cross-modal retrieval has attracted considerable attention for searching in large-scale multimedia databases because of its efficiency and effectiveness. As a powerful tool of data analysis, matrix factorization is commonly used to learn hash codes for cross-modal retrieval, but there are still many shortcomings. First, most of these methods only focus on preserving locality of data but they ignore other factors such as preserving reconstruction residual of data during matrix factorization. Second, the energy loss of data is not considered when the data of cross-modal are projected into a common semantic space. Third, the data of cross-modal are directly projected into a unified semantic space which is not reasonable since the data from different modalities have different properties. This article proposes a novel method called average approximate hashing (AAH) to address these problems by: 1) integrating the locality and residual preservation into a graph embedding framework by using the label information; 2) projecting data from different modalities into different semantic spaces and then making the two spaces approximate to each other so that a unified hash code can be obtained; and 3) introducing a principal component analysis (PCA)-like projection matrix into the graph embedding framework to guarantee that the projected data can preserve the main energy of data. AAH obtains the final hash codes by using an average approximate strategy, that is, using the mean of projected data of different modalities as the hash codes. Experiments on standard databases show that the proposed AAH outperforms several state-of-the-art cross-modal hashing methods.
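The average approximate strategy described in this abstract (taking the mean of the projected data of the different modalities as the hash code) can be sketched as follows. The projection matrices here are hypothetical fixed values rather than learned ones, and the final sign binarization is an assumption, since the abstract does not say how the mean is turned into bits.

```python
def project(x, W):
    """Multiply a feature vector x (length d) by a d-by-k projection matrix W."""
    k = len(W[0])
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(k)]

def average_hash(image_feat, text_feat, W_img, W_txt):
    """Project each modality into its own space, average, then binarize."""
    p_img = project(image_feat, W_img)
    p_txt = project(text_feat, W_txt)
    mean = [(a + b) / 2 for a, b in zip(p_img, p_txt)]
    # Assumption: sign binarization to {-1, +1}; not stated in the abstract.
    return [1 if m >= 0 else -1 for m in mean]

# Hypothetical 2-d image feature, 3-d text feature, and 2-bit projections.
W_img = [[1.0, 0.0], [0.0, 1.0]]
W_txt = [[1.0, 0.0], [0.0, -1.0], [0.0, -1.0]]
code = average_hash([1.0, -2.0], [0.5, 0.5, 1.0], W_img, W_txt)
print(code)
```

Keeping a separate projection per modality and meeting in the middle reflects the abstract's point that forcing heterogeneous data directly into one unified space is not reasonable.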
- Research Article
3
- 10.1089/big.2020.0243
- Oct 18, 2022
- Big Data
- Basant Agarwal + 3 more
Cross-lingual plagiarism detection (CLPD) is a challenging problem in natural language processing. Cross-lingual plagiarism occurs when a text is translated from another language and used as is without proper acknowledgment. Most existing methods provide good results for monolingual plagiarism detection, whereas their performance on CLPD is very limited. The reason is that it is difficult to represent text from two different languages in a common semantic space. In this article, a novel Siamese architecture-based model is proposed to detect cross-lingual plagiarism in English-Hindi language pairs. The proposed model combines a convolutional neural network (CNN) and a bidirectional long short-term memory (Bi-LSTM) network to learn the semantic similarity among cross-lingual sentences for the English-Hindi language pairs. In the proposed model, the CNN learns the local context of words, whereas the Bi-LSTM learns the global context of sentences in the forward and backward directions. The performance of the proposed model is evaluated on a benchmark data set, the Microsoft paraphrase corpus, converted into English-Hindi language pairs. The proposed model outperforms other models, giving 67%, 72%, and 67% weighted average precision, recall, and F1-measure scores. The experimental results show the effectiveness of the proposed model over the baseline models, because the proposed model represents cross-lingual text very efficiently.
- Research Article
4
- 10.3390/math10183346
- Sep 15, 2022
- Mathematics
- Fudong Nian + 3 more
This paper strives to improve the performance of video–text retrieval. To date, many algorithms have been proposed to facilitate the similarity measure of video–text retrieval from the single global semantic to multi-level semantics. However, these methods may suffer from the following limitations: (1) they largely ignore the relationship semantic, which leaves the semantic levels insufficient; (2) constraining the real-valued features of different modalities to lie in the same space only through feature distance measurement is incomplete; (3) they fail to handle the problem that the distributions of attribute labels in different semantic levels are heavily imbalanced. To overcome the above limitations, this paper proposes a novel multi-level cross-modal semantic alignment network (MCSAN) for video–text retrieval by jointly modeling video–text similarity on global, entity, action and relationship semantic levels in a unified deep model. Specifically, both video and text are first decomposed into global, entity, action and relationship semantic levels by carefully designing spatial–temporal semantic learning structures. Then, we utilize KLDivLoss and a cross-modal parameter-share attribute projection layer as statistical constraints to ensure that representations from different modalities in different semantic levels are projected into a common semantic space. In addition, a novel focal binary cross-entropy (FBCE) loss function is presented, which is the first effort to model the unbalanced attribute distribution problem for video–text retrieval. MCSAN effectively takes advantage of the complementary information among the four semantic levels. Extensive experiments on two challenging video–text retrieval datasets, namely, MSR-VTT and VATEX, show the viability of our method.
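The abstract does not give the form of the focal binary cross-entropy (FBCE) loss. As a rough illustration of the idea, the sketch below applies the widely used focal-loss modulation to binary cross-entropy; the gamma and alpha defaults and the toy probabilities are assumptions, and the paper's actual FBCE may differ.

```python
import math

def focal_bce(probs, targets, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean focal binary cross-entropy over predicted attribute probabilities.

    Assumption: standard focal modulation (1 - p_t) ** gamma with class
    weight alpha for positives and 1 - alpha for negatives.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        a_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))
    return total / len(probs)

# A confidently correct prediction versus a confidently wrong one.
easy = focal_bce([0.95], [1])
hard = focal_bce([0.05], [1])
print(easy < hard)
```

With gamma set to 0 the modulation disappears and the expression reduces to class-weighted binary cross-entropy; larger gamma values shrink the contribution of easy, frequent attribute labels so that rare ones carry more of the gradient.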
- Research Article
22
- 10.1162/dint_a_00154
- Jul 1, 2022
- Data Intelligence
- Huifang Du + 4 more
COVID-19 evolves rapidly, and an enormous number of people worldwide desire instant access to COVID-19 information such as the overview, clinical knowledge, vaccines, prevention measures, and COVID-19 mutations. Question answering (QA) has become the mainstream way for users to consume ever-growing information by posing natural language questions. Therefore, it is urgent and necessary to develop a QA system that offers consulting services around the clock to relieve the stress on health services. In particular, people increasingly pay more attention to complex multi-hop questions rather than simple ones during the lasting pandemic, but the existing COVID-19 QA systems fail to meet their complex information needs. In this paper, we introduce a novel multi-hop QA system called COKG-QA, which reasons over multiple relations in large-scale COVID-19 knowledge graphs to return answers given a question. In the field of question answering over knowledge graphs, current methods usually represent entities and schemas based on some knowledge embedding models and represent questions using pre-trained models. While it is convenient to represent different knowledge (i.e., entities and questions) based on specified embeddings, an issue arises in that these separate representations come from heterogeneous vector spaces. We align question embeddings with knowledge embeddings in a common semantic space by a simple but effective embedding projection mechanism. Furthermore, we propose combining entity embeddings with their corresponding schema embeddings, which serve as important prior knowledge, to help search for the correct answer entity of specified types. In addition, we derive a large multi-hop Chinese COVID-19 dataset (called COKG-DATA for ease of reference) for COKG-QA based on the linked knowledge graph OpenKG-COVID19 launched by OpenKG, including comprehensive and representative information about COVID-19. 
COKG-QA achieves quite competitive performance on the 1-hop and 2-hop data while obtaining the best result, with significant improvements, on the 3-hop data. It is also more efficient when used in a QA system for users. Moreover, the user study shows that the system not only provides accurate and interpretable answers but also is easy to use and comes with smart tips and suggestions.