Cosine Similarity-Based Evidence Selection for Fact Verification Using SBERT on the FEVER Dataset

Abstract

The spread of misinformation on digital platforms has underscored the urgent need for automated fact verification systems. However, selecting the most semantically relevant evidence to support or refute a claim remains a challenge, especially within the widely used FEVER dataset. Traditional approaches like TF-IDF often fall short in capturing the contextual meaning shared between claims and evidence. This study addresses the problem by comparing TF-IDF with Sentence-BERT (SBERT) for measuring semantic similarity. The novelty of this research lies in embedding both claims and evidence using SBERT, then calculating cosine similarity to quantify their semantic relevance. Before embedding, standard preprocessing steps are applied, including tokenization, stemming, lowercasing, and stopword removal. A quantitative approach is used to compute cosine similarity between claim-evidence pairs using both TF-IDF and SBERT embeddings, and similarity analysis, distribution statistics, and t-tests are conducted to evaluate the two methods. The results show that SBERT achieves higher similarity scores for the “SUPPORTS” category (0.65) and more strongly negative similarity for “NOT ENOUGH INFO” (-0.90), compared to TF-IDF (0.49 and -0.62, respectively). SBERT also demonstrates more stable score distributions and significantly higher t-test values across all label comparisons, indicating stronger semantic discrimination. These findings confirm that SBERT outperforms TF-IDF in identifying the most relevant evidence. The new dataset generated can serve as a foundation for future fact verification model development.
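
As a rough illustration of the comparison described above, the sketch below scores one claim-evidence pair with both TF-IDF and SBERT cosine similarity. The all-MiniLM-L6-v2 checkpoint and the toy sentences are assumptions for illustration; the paper's exact SBERT model, preprocessing pipeline, and FEVER data are not reproduced here.

```python
# Minimal sketch: TF-IDF vs. SBERT cosine similarity on a toy pair.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

claims = ["The Eiffel Tower is located in Paris."]
evidence = ["The Eiffel Tower is a wrought-iron lattice tower in Paris, France."]

# TF-IDF baseline: fit on the pooled texts, then compare the pair.
tfidf = TfidfVectorizer(lowercase=True, stop_words="english")
matrix = tfidf.fit_transform(claims + evidence)
tfidf_sim = cosine_similarity(matrix[:1], matrix[1:])[0, 0]

# SBERT: embed both sides and compare in the dense embedding space.
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
sbert_sim = cosine_similarity(sbert.encode(claims), sbert.encode(evidence))[0, 0]

print(f"TF-IDF cosine: {tfidf_sim:.3f}, SBERT cosine: {sbert_sim:.3f}")
```

Repeating this over many labeled claim-evidence pairs yields per-label score distributions that can then be compared with t-tests, as the abstract describes.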

Similar Papers
  • Research Article
  • 10.1016/j.neunet.2025.107959
Concept-enhanced heterogeneous graph network for fact verification.
  • Aug 1, 2025
  • Neural networks : the official journal of the International Neural Network Society
  • Zhendong Chen + 3 more

  • Research Article
  • 10.1145/3629975
Mountain Gazelle Optimizer with Deep Learning Driven Satirical News Classification on Low-resource Language Corpus
  • Oct 28, 2023
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Badriyya B Al-Onazi + 7 more

The proliferation of satirical and fake news on digital platforms has become a major source of concern about the spread of misinformation and its effects on society. For Arabic, fake news detection (FND) poses particular problems because of linguistic complexity and the scarcity of labeled data. FND on an Arabic corpus using deep learning (DL) involves leveraging advanced neural network (NN) techniques to automatically recognize and classify deceptive content in Arabic text. This process is vital in combating the spread of disinformation and misinformation, promoting media literacy, and ensuring the credibility of information sources for the Arabic-speaking community. Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are common choices for FND because of their capacity to learn hierarchical features and to model sequential data from text. In this view, this study develops a Mountain Gazelle Optimizer with Deep Learning-Driven Fake News Classification on Arabic Corpus (MGODL-FNCAC) technique. The presented MGODL-FNCAC approach aims to improve the performance of fake news classification on an Arabic corpus. First, the MGODL-FNCAC technique applies several stages of preprocessing to make the input data suitable for classification. For fake news detection, it applies the deep belief network (DBN) model. Finally, the MGO approach is used for hyperparameter tuning of the DBN, which helps improve the overall training process and detection rate. The MGODL-FNCAC technique was evaluated on Arabic corpus data. The extensive results demonstrate the effectiveness of the MGODL-FNCAC system over other methodologies, with maximum accuracies of 97.68% and 95.14% on the Covid19Fakes and Satirical datasets, respectively.
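
As a hedged sketch of the tuning loop this abstract describes: the Mountain Gazelle Optimizer's update rules are not given here, so a plain population-based random search stands in for MGO, and a small scikit-learn MLP on synthetic data stands in for the DBN.

```python
# Stand-in sketch: population-based hyperparameter search (not actual MGO)
# tuning a neural classifier (not an actual DBN) by cross-validated accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
rng = np.random.default_rng(0)

def fitness(lr: float, hidden: int) -> float:
    """Cross-validated accuracy for one hyperparameter candidate."""
    clf = MLPClassifier(hidden_layer_sizes=(hidden,), learning_rate_init=lr,
                        max_iter=300, random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

best = (0.0, None, None)
for _ in range(3):  # generations of candidates
    population = [(10 ** rng.uniform(-4, -1), int(rng.integers(16, 128)))
                  for _ in range(4)]
    for lr, hidden in population:
        score = fitness(lr, hidden)
        if score > best[0]:
            best = (score, lr, hidden)

print(f"best accuracy={best[0]:.3f}, lr={best[1]:.4f}, hidden units={best[2]}")
```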

  • Research Article
  • Citations: 220
  • 10.1609/aaai.v33i01.33016859
Combining Fact Extraction and Verification with Neural Semantic Matching Networks
  • Jul 17, 2019
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Yixin Nie + 2 more

The increasing concern with misinformation has stimulated research efforts on automatic fact checking. The recently released FEVER dataset introduced a benchmark fact-verification task in which a system is asked to verify a claim using evidential sentences from Wikipedia documents. In this paper, we present a connected system consisting of three homogeneous neural semantic matching models that conduct document retrieval, sentence selection, and claim verification jointly for fact extraction and verification. For evidence retrieval (document retrieval and sentence selection), unlike traditional vector space IR models in which queries and sources are matched in some pre-designed term vector space, we develop neural models that perform deep semantic matching from raw textual input, assuming no intermediate term representation and no access to structured external knowledge bases. We also show that pageview frequency can help improve evidence retrieval, whose results are then matched by our neural semantic matching network. For claim verification, unlike previous approaches that simply feed upstream retrieved evidence and the claim to a natural language inference (NLI) model, we further enhance the NLI model by providing it with internal semantic relatedness scores (hence integrating it with the evidence retrieval modules) and ontological WordNet features. Experiments on the FEVER dataset indicate that (1) our neural semantic matching method outperforms popular TF-IDF and encoder models by significant margins on all evidence retrieval metrics, (2) the additional relatedness score and WordNet features improve the NLI model via better semantic awareness, and (3) by formalizing all three subtasks as a similar semantic matching problem and improving on all three stages, the complete model achieves state-of-the-art results on the FEVER test set (two times greater than baseline results).

  • Research Article
  • Citations: 2
  • 10.37394/232018.2024.12.2
Using Cluster Analysis for Author Classification of Albanian Texts: A Study on the Effectiveness of Stop Words
  • Oct 19, 2023
  • WSEAS TRANSACTIONS ON COMPUTER RESEARCH
  • Denisa Kaçorri + 2 more

Cluster analysis is a statistical approach that identifies homogeneous clusters within data, with the closeness of data points measured quantitatively using distance functions. In text data mining specifically, clustering serves as a method for categorizing words based on the similarity of their occurrence within texts and for classifying texts by topic or author. Hierarchical clustering is a powerful technique for identifying natural groupings within datasets, which can be especially useful for unsupervised text classification. This paper uses cluster analysis to group Albanian texts by author. Using agglomerative hierarchical clustering, we classify Albanian texts by author according to the similarity of their word frequencies. The similarity of texts is evaluated using cosine and Euclidean distances. Considering two study cases, respectively with and without Albanian stop words, we conclude that the best clustering of the Albanian documents by author is achieved with 87% accuracy using Ward's method with cosine distance when stop words are removed.
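
The sketch below illustrates this clustering setup on toy documents: word-frequency vectors, cosine distances, and agglomerative linkage. The Albanian corpus and stop-word list are not reproduced, and note that Ward's method is formally defined for Euclidean distances, so combining it with cosine distances, as the paper reports, is an approximation in scipy.

```python
# Minimal sketch: agglomerative clustering of texts by word frequency.
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import CountVectorizer

docs = ["text written by author one", "more text written by author one",
        "a different author writes here"]

# Word-frequency matrix (stop-word removal would be applied at this step).
freqs = CountVectorizer().fit_transform(docs).toarray()

dist = pdist(freqs, metric="cosine")   # condensed pairwise distance matrix
tree = linkage(dist, method="ward")    # agglomerative merge tree
labels = fcluster(tree, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)
```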

  • Research Article
  • 10.37905/jacedu.v4i2.28124
Social Cohesion with Digital Platforms to Realise a Good Social Society
  • Nov 30, 2024
  • Jambura Journal Civic Education
  • A Ramli Rasjid + 4 more

Digital platforms are not only remodeling the way we interact but also offer new opportunities to build strong social cohesion. Analysis of various case studies and digital initiatives reveals how digital tools can expand social networks, deepen community participation, and instill constructive civic values. Through an exploration of innovative strategies and best practices, this article presents a practical guide to harnessing the power of digital platforms to create a more connected, engaged, and responsible society, thereby fostering good citizenship. The findings provide new perspectives on how technology can strengthen the social fabric and advance quality citizenship in an increasingly connected world. This research aims to analyze the role of digital platforms in improving social cohesion by assessing the potential and challenges of platforms such as social media, community apps, and online forums. Based on a survey of 50 respondents from campus and civilian populations, most people believe that digital platforms have great potential to strengthen relationships between individuals, expand social interactions, and increase solidarity within communities. However, the main challenges identified were social polarisation, the spread of misinformation, and the technology access gap that persists in some communities. The findings emphasize the importance of thoughtful moderation policies, digital literacy education, and efforts to reduce social divides in order to maximize the positive potential of digital platforms. With the right approach, digital platforms can help build a more inclusive and harmonious society that supports social cohesion among individuals and groups. This research provides important insights for policymakers and society on how to use digital technologies positively and constructively.

  • Research Article
  • Citations: 1
  • 10.1609/aaai.v38i17.29825
CFEVER: A Chinese Fact Extraction and VERification Dataset
  • Mar 24, 2024
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Ying-Jia Lin + 7 more

We present CFEVER, a Chinese dataset designed for Fact Extraction and VERification. CFEVER comprises 30,012 manually created claims based on content in Chinese Wikipedia. Each claim in CFEVER is labeled as “Supports”, “Refutes”, or “Not Enough Info” to depict its degree of factualness. Similar to the FEVER dataset, claims in the “Supports” and “Refutes” categories are also annotated with corresponding evidence sentences sourced from single or multiple pages in Chinese Wikipedia. Our labeled dataset achieves a Fleiss' kappa of 0.7934 for five-way inter-annotator agreement. In addition, through experiments with state-of-the-art approaches developed on the FEVER dataset and a simple baseline for CFEVER, we demonstrate that our dataset is a new rigorous benchmark for fact extraction and verification, which can be further used to develop automated systems that alleviate human fact-checking efforts. CFEVER is available at https://ikmlab.github.io/CFEVER.
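
For reference, the five-way agreement statistic cited above can be computed as in the sketch below, which uses statsmodels' Fleiss' kappa on a hypothetical annotation table; the real CFEVER annotation counts are not reproduced here.

```python
# Minimal sketch: Fleiss' kappa over a toy five-annotator label table.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# Rows are claims; columns are label counts over 5 annotators:
# [Supports, Refutes, Not Enough Info]. Each row sums to 5.
table = np.array([
    [5, 0, 0],   # unanimous "Supports"
    [4, 1, 0],
    [0, 5, 0],   # unanimous "Refutes"
    [1, 1, 3],
])
print(f"Fleiss' kappa: {fleiss_kappa(table):.4f}")
```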

  • Research Article
  • 10.70382/tijssra.v07i6.028
MEDIA LITERACY, MISINFORMATION AND DISINFORMATION ON SOCIAL MEDIA DURING THE 2023 ELECTIONS IN PLATEAU STATE
  • Mar 17, 2025
  • International Journal of Social Science Research and Anthropology
  • Shalgan, Lohnan Moses + 1 more

Social media appears to have increased the velocity at which falsehoods and misinformation spread, becoming even more influential around election periods. This study examines how low media literacy worsened the spread of misinformation during the 2023 general elections in Plateau State, Nigeria. Using a mixed-methods research design, the study measures media literacy across different demographics and its effects on voter perception and decision-making, combining qualitative and quantitative approaches. The findings suggest that low levels of media literacy predispose individuals to manipulation, allowing misinformation to shape electoral choices, widening political polarisation, and eroding the integrity of democracy. The study identifies the main sources and channels for the spread of misinformation: social media sites, political campaigning, and some traditional media. It focuses on how algorithmic content, echo chambers, and cognitive biases further false narratives and curtail critical engagement with factual information. The study suggests targeted media literacy campaigns that teach voters how to identify credible sources, policy regulations on the spread of disinformation, and technological approaches such as AI fact-checking systems that identify and flag misleading content as possible remedies. Enhancing partnership between government agencies, civil society organizations, and digital platforms was noted as a significant step toward combating misinformation and creating an informed electorate. This study contributes to the broader subject of information disorder and emphasizes the urgent need to make media literacy an anchor against the manipulation of opinion in any democratic process.

  • Research Article
  • Citations: 3
  • 10.3390/bdcc6020033
RoBERTaEns: Deep Bidirectional Encoder Ensemble Model for Fact Verification
  • Mar 22, 2022
  • Big Data and Cognitive Computing
  • Muchammad Naseer + 3 more

Bidirectional encoder models have been widely applied to fake news detection because of their ability to provide factual verification with good results. Good fact verification requires an optimal model with strong evaluation results, so that news readers can trust that the verification is reliable and accurate. In this study, we evaluated the application of a homogeneous ensemble (HE) to RoBERTa to improve model accuracy. We implement the HE method as a bagging ensemble of three types of RoBERTa models, then combine each model's predictions to build a new model called RoBERTaEns. The FEVER dataset is used to train and test our model. The experimental results show that the proposed method, RoBERTaEns, obtains higher accuracy, with an F1-score of 84.2%, compared to the other RoBERTa models. In addition, RoBERTaEns has a smaller margin of error than the other models. This demonstrates that applying HE increases a model's accuracy and produces better results in handling various types of fact input in each fold.
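
A minimal sketch of the bagging-style homogeneous ensemble this abstract describes, assuming three hypothetical fine-tuned RoBERTa checkpoints whose softmax probabilities are averaged; the paper's actual model variants, folds, and training setup are not specified here.

```python
# Sketch: average class probabilities from several RoBERTa classifiers.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical paths to three independently fine-tuned RoBERTa classifiers.
CHECKPOINTS = ["./roberta-fold1", "./roberta-fold2", "./roberta-fold3"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
models = [AutoModelForSequenceClassification.from_pretrained(c).eval()
          for c in CHECKPOINTS]

def ensemble_predict(claim: str, evidence: str) -> int:
    """Average the models' class probabilities, then take the argmax label."""
    inputs = tokenizer(claim, evidence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = [torch.softmax(m(**inputs).logits, dim=-1) for m in models]
    return int(torch.stack(probs).mean(dim=0).argmax(dim=-1))
```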

  • Research Article
  • Citations: 3
  • 10.1016/j.neunet.2024.106424
A syntactic evidence network model for fact verification
  • Oct 1, 2024
  • Neural Networks
  • Zhendong Chen + 6 more

  • Conference Article
  • Citations: 12
  • 10.1145/3485447.3512135
EvidenceNet: Evidence Fusion Network for Fact Verification
  • Apr 25, 2022
  • Zhendong Chen + 6 more

Fact verification is a challenging task that requires retrieving multiple pieces of evidence from a reliable corpus to verify the truthfulness of a claim. Although current methods achieve satisfactory performance, they still suffer from one or more of the following three problems: (1) they are unable to extract sufficient contextual information from the evidence sentences; (2) they retain redundant evidence information; and (3) they are incapable of capturing the interaction between claim and evidence. To tackle these problems, we propose an evidence fusion network called EvidenceNet. The proposed EvidenceNet model captures global contextual information from various levels of evidence information for deep understanding. Moreover, a gating mechanism is designed to filter out redundant information in the evidence. In addition, a symmetrical interaction attention mechanism is proposed to identify the interaction between claim and evidence. We conduct extensive experiments on the FEVER dataset. The experimental results show that the proposed EvidenceNet model outperforms current fact verification methods and achieves state-of-the-art performance.
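
As an illustration of one component, the sketch below shows a generic sigmoid gating layer of the kind EvidenceNet uses to filter redundant evidence; the dimensions and gate parameterization are assumptions, not the paper's specification.

```python
# Sketch: a learned sigmoid gate that scales evidence representations.
import torch
import torch.nn as nn

class EvidenceGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, evidence: torch.Tensor) -> torch.Tensor:
        # evidence: (batch, num_sentences, dim)
        g = torch.sigmoid(self.gate(evidence))  # per-feature gate in [0, 1]
        return g * evidence                     # suppress redundant features

gate = EvidenceGate(dim=768)
out = gate(torch.randn(2, 5, 768))  # two claims, five evidence sentences each
print(out.shape)
```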

  • Research Article
  • 10.17645/up.7262
Digital Platforms as (Dis)Enablers of Urban Co-Production: Evidence From Bengaluru, India
  • Mar 28, 2024
  • Urban Planning
  • Deepa Kylasam Iyer + 1 more

This article examines how digital platforms focused on citizen engagement affect urban transformation, based on multiple case studies from Bengaluru, India. The research question is: What types of initiatives and designs of digital citizen platforms enable co-production? Co-production is defined as the pooling of assets and resources between the public sector and citizens to produce better outcomes and improve the efficiency of urban services. The study uses qualitative and quantitative approaches. Citizen engagement on digital platforms is evaluated at two levels: platform metrics and initiative metrics. Each platform is assessed on several variables indicating the type of ownership, period of operation, aims and types of initiatives, and impact and levels of engagement. The digital platforms are then mapped for the extent of digital co-production, matching the type of digital interaction with a form of citizen-government relationship. The findings indicate that the orientation of digital co-production, where it exists, centers on co-testing and co-evaluation rather than co-design and co-financing. Furthermore, the digital platforms under study primarily view citizens as users rather than collaborators, limiting the scope of digital co-production. The involvement of urban local governments and private partners in a single platform strengthens the degree of citizen engagement, including the scope for co-production. Finally, there is a strong offline counterpart to citizen engagement through digital platforms where true co-production exists.

  • Research Article
  • 10.26483/ijarcs.v8i3.3059
Context Based Spell Checking using Document Semantic
  • Apr 30, 2017
  • International Journal of Advanced Research in Computer Science
  • Amalu Laji + 1 more

Context-based errors are words that are used wrongly in a sentence even though they appear valid in isolation. Such errors render sentences meaningless and, in turn, make documents invalid, and they cannot be caught by conventional automatic spell checking, which ignores context. This paper explores the idea of using document semantics, specifically latent semantic analysis, to correct words based on this contextual difference, and reflects on the importance of this technique. Keywords: stop word removal; stemming; term frequency; singular value decomposition; cosine similarity.
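
A minimal sketch of the latent semantic analysis step the paper builds on: TF-IDF vectors reduced by truncated SVD, with cosine similarity computed in the latent space. The surrounding spell-checking logic is not reproduced, and the toy documents are assumptions.

```python
# Sketch: latent semantic analysis via TF-IDF + truncated SVD.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the bank approved the loan", "the river bank was flooded",
        "interest rates at the bank rose"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
latent = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Documents sharing a latent "sense" score higher than surface overlap alone.
print(cosine_similarity(latent))
```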

  • Research Article
  • Citations: 9
  • 10.17485/ijst/2016/v9i12/86631
Maulik: A Plagiarism Detection Tool for Hindi Documents
  • Mar 29, 2016
  • Indian Journal of Science and Technology
  • Urvashi Garg + 1 more

Objective: To present an automated plagiarism detection software tool called Maulik. While many plagiarism detection tools exist for English text, Maulik detects plagiarism in Hindi documents. Method: Maulik divides the text into n-grams and matches them against text in a repository as well as documents available online. Preprocessing techniques such as stop word removal and stemming have been used, and the best n-gram size for measuring the similarity of two Hindi documents has been determined. Cosine similarity is used to compute the similarity score. Findings: A similarity score of 96.3 was achieved, higher than that of existing Hindi plagiarism detection tools such as Plagiarism Checker, Plagiarism Finder, Plagiarisma, Dupli Checker, and Quetext. Those tools compare only exact matches, ignoring language-specific constraints, whereas Maulik can detect plagiarism when the root of a word is used or a word is replaced by a synonym. Application: Maulik is a software tool that discourages plagiarism and motivates people's writing skills. Keywords: Cosine Similarity, Plagiarism, Stemming, Stop Word, Synonyms
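
The sketch below illustrates the core n-gram cosine matching idea on English toy strings; the Hindi stemmer, synonym handling, and the tuned n-gram size from the paper are not reproduced.

```python
# Sketch: word n-gram overlap scored with cosine similarity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source = "machine learning models require careful evaluation"
suspect = "careful evaluation is required by machine learning models"

# Overlapping word n-grams (here 1- to 3-grams) as features.
vec = CountVectorizer(ngram_range=(1, 3))
grams = vec.fit_transform([source, suspect])
score = cosine_similarity(grams[0], grams[1])[0, 0]
print(f"similarity: {score:.3f}")
```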

  • Conference Article
  • 10.1145/1621995.1622040
Experience report
  • Oct 5, 2009
  • Youngik Yang + 1 more

Accurately annotating the function of genes is one of the most important tasks in molecular biology and the medical sciences. Next-generation sequencing technology has made it possible to sequence whole genomes at a fraction of the cost of traditional sequencing. As a result, the amount of sequence data has been growing very rapidly, but computational methods for gene function annotation are not yet fully developed, so annotation of gene function is a serious bottleneck for high-throughput genome projects. The most common gene annotation technique transfers annotations between genes based on sequence similarity: the annotation of the top-ranked genes by sequence similarity is simply transferred to the target gene. However, such sequence-similarity-based annotation is often incorrect, so genome projects still rely on an expensive, error-prone, labor-intensive manual process. Combining annotation text and sequence similarity can improve the accuracy of gene function annotation significantly. We have been developing a computational method for comparing gene annotations in text form, and in this paper we discuss issues in comparing genome annotations in a text format. To compute textual similarity, we used cosine similarity. Since cosine similarity is effective only after textual variations are preprocessed away, we used common text preprocessing techniques such as stop word removal and stemming, as well as gene-annotation-specific preprocessing such as handling synonyms and gene symbols using databases of biological terminology such as BioThesaurus and MeSH. In experiments with annotations from a number of bacterial genomes, our method correctly handled many difficult cases (syntactically different but semantically equivalent gene function annotations).
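
A hedged sketch of the comparison pipeline described above: synonym normalization, stemming, and stop-word removal before cosine similarity. The one-entry synonym map is a hypothetical stand-in for lookups in resources such as BioThesaurus or MeSH.

```python
# Sketch: normalize two annotations, then compare them with cosine similarity.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SYNONYMS = {"putative": "predicted"}  # hypothetical terminology lookup
stemmer = PorterStemmer()

def normalize(annotation: str) -> str:
    """Lowercase, map synonyms to a canonical term, then stem each token."""
    tokens = [SYNONYMS.get(t, t) for t in annotation.lower().split()]
    return " ".join(stemmer.stem(t) for t in tokens)

a = normalize("putative DNA-binding protein")
b = normalize("predicted DNA-binding proteins")

vecs = CountVectorizer(stop_words="english").fit_transform([a, b])
print(cosine_similarity(vecs[0], vecs[1])[0, 0])  # 1.0: equivalent annotations
```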

  • Research Article
  • 10.33769/aupse.633838
PREPROCESSING STEPS IN fMRI: SMOOTHING
  • Dec 1, 2019
  • Communications Faculty of Sciences University of Ankara Series A2-A3 Physical Sciences and Engineering
  • Hacer Daşgın + 2 more

Functional magnetic resonance imaging is a primary and dominant technique for investigating the cognitive functions of the brain, which has a complex structure. In this study, data obtained from a single subject were examined. First, statistical parametric mapping results were obtained after applying the standard preprocessing steps including smoothing; spatial smoothing was performed using a 3 mm Gaussian kernel, twice the voxel size. Second, statistical parametric mapping results were obtained by applying the standard preprocessing steps without smoothing. The effects of these two pipelines on the mapping results were compared for selected slices and locations in terms of statistics and activation patterns.
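
For concreteness, the sketch below applies a 3 mm FWHM Gaussian kernel (twice a 1.5 mm voxel, matching the abstract) to a synthetic volume, converting FWHM to the Gaussian's standard deviation; real pipelines such as SPM implement the same operation.

```python
# Sketch: spatial smoothing of a volume with a Gaussian kernel given by FWHM.
import numpy as np
from scipy.ndimage import gaussian_filter

FWHM_MM, VOXEL_MM = 3.0, 1.5
# FWHM = 2 * sqrt(2 * ln 2) * sigma; convert to sigma in voxel units.
sigma = FWHM_MM / (2 * np.sqrt(2 * np.log(2))) / VOXEL_MM

volume = np.random.rand(64, 64, 40)       # synthetic single-subject volume
smoothed = gaussian_filter(volume, sigma=sigma)
print(f"sigma = {sigma:.3f} voxels")
```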
