ORICERT – FRAUDULENT DEGREE AND MARKSHEET DETECTION
Identifying fake degrees and marksheets is becoming increasingly important in today's educational environment, since the spread of fake credentials jeopardizes employer confidence, academic integrity, and social standards. This work provides an overview of the most recent developments in fraudulent degree and marksheet detection techniques, covering both established methods and cutting-edge technologies. Conventional techniques for identifying forged credentials frequently entail manual verification procedures, such as physical document inspection, institution verification, and cross-referencing with official databases. Although these techniques remain fundamental, they are prone to human error and frequently ineffective when managing high volumes of credentials. In response to these difficulties, recent years have seen a boom in technological advancements meant to improve the scalability and accuracy of fraudulent credential detection. For example, machine learning algorithms have demonstrated potential in automating the authentication process through the analysis of anomalies and patterns found in digital documents. Using natural language processing (NLP) techniques, textual data from academic transcripts and diplomas can be extracted and analysed to help find discrepancies or anomalies. Additionally, blockchain technology has become a disruptive force in credential verification by providing decentralized, immutable ledgers of academic accomplishments. By establishing secure, unchangeable archives of student records with blockchain-based credentialing systems, employers and educational institutions can reduce the danger of credential fraud. Notwithstanding these developments, open issues remain in fraudulent credential detection, such as the need for international collaboration against cross-border credential fraud and the adaptation of fraudsters to changing detection techniques.
Furthermore, the ethical implications related to algorithmic bias and data privacy emphasize how crucial it is to implement and supervise detection systems responsibly.
Keywords: Degree, Marksheet, Verification, Record.
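The append-only, hash-chained ledger idea behind blockchain-based credentialing can be illustrated with a minimal sketch in plain Python. The records and class below are hypothetical; a real system would add digital signatures, consensus, and distribution across nodes.

```python
import hashlib
import json

def record_hash(record: dict, prev_hash: str) -> str:
    """Hash a credential record together with the previous entry's hash."""
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class CredentialLedger:
    """Append-only, hash-chained list of credential records."""

    def __init__(self):
        self.entries = []  # list of (record, hash) pairs

    def append(self, record: dict) -> str:
        prev = self.entries[-1][1] if self.entries else "0" * 64
        h = record_hash(record, prev)
        self.entries.append((record, h))
        return h

    def verify(self) -> bool:
        """Recompute the chain; a tampered record breaks every later hash."""
        prev = "0" * 64
        for record, h in self.entries:
            if record_hash(record, prev) != h:
                return False
            prev = h
        return True

ledger = CredentialLedger()
ledger.append({"student": "S-1001", "degree": "B.Sc.", "year": 2023})
ledger.append({"student": "S-1002", "degree": "M.Sc.", "year": 2024})
assert ledger.verify()
ledger.entries[0][0]["degree"] = "Ph.D."  # tamper with a stored record
assert not ledger.verify()
```

Because each hash covers the previous hash, altering any stored record invalidates the whole chain from that point on, which is the property that makes such archives "unchangeable" in practice.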
- Research Article
- 10.1007/s10278-017-0027-x
- Oct 27, 2017
- Journal of Digital Imaging
A significant volume of medical data remains unstructured. Natural language processing (NLP) and machine learning (ML) techniques have been shown to successfully extract insights from radiology reports. However, the codependent effects of NLP and ML in this context have not been well studied. Between April 1, 2015 and November 1, 2016, 9418 cross-sectional abdomen/pelvis CT and MR examinations containing our internal structured reporting element for cancer were separated into four categories: Progression, Stable Disease, Improvement, or No Cancer. We combined each of three NLP techniques with five ML algorithms to predict the assigned label using the unstructured report text and compared the performance of each combination. The three NLP algorithms included term frequency-inverse document frequency (TF-IDF), term frequency weighting (TF), and 16-bit feature hashing. The ML algorithms included logistic regression (LR), random decision forest (RDF), one-vs-all support vector machine (SVM), one-vs-all Bayes point machine (BPM), and fully connected neural network (NN). The best-performing NLP model consisted of tokenized unigrams and bigrams with TF-IDF. Increasing N-gram length yielded little to no added benefit for most ML algorithms. With all parameters optimized, SVM had the best performance on the test dataset, with 90.6% average accuracy and an F score of 0.813. The interplay between ML and NLP algorithms and their effect on interpretation accuracy is complex. The best accuracy is achieved when both algorithms are optimized concurrently.
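The best-performing combination reported above (TF-IDF over tokenized unigrams and bigrams feeding a one-vs-all linear SVM) can be sketched with scikit-learn. The report snippets and labels below are invented for illustration, not drawn from the study's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for report text; the real study used 9418 CT/MR reports.
reports = [
    "interval increase in hepatic metastases consistent with progression",
    "no significant change in the pancreatic mass stable disease",
    "decrease in size of the target lesion compatible with improvement",
    "no evidence of malignancy in the abdomen or pelvis",
]
labels = ["Progression", "Stable Disease", "Improvement", "No Cancer"]

# Unigrams + bigrams with TF-IDF weighting, then a one-vs-all linear SVM.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(reports, labels)
print(model.predict(["stable appearance of the pancreatic mass"]))
```

The `ngram_range=(1, 2)` argument is what adds bigram features on top of unigrams; per the abstract, extending this range further gave little additional benefit.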
- Research Article
- 10.52756/ijerr.2024.v45spl.005
- Nov 30, 2024
- International Journal of Experimental Research and Review
Analyzing user interface (UI) bugs is an important step taken by testers and developers to assess the usability of the software product. UI bug classification helps in understanding the nature and cause of software failures. Manually classifying thousands of bugs is an inefficient and tedious job for both testers and developers. The objective of this research is to develop a classification model for User Interface (UI) related bugs using supervised Machine Learning (ML) algorithms and Natural Language Processing (NLP) techniques, and to assess the effect of different sampling and feature vectorization techniques on the performance of ML algorithms. Classification is based upon the ‘Summary’ feature of the bug report and utilizes six classifiers, i.e., Gaussian Naïve Bayes (GNB), Multinomial Naïve Bayes (MNB), Logistic Regression (LR), Support Vector Machines (SVM), Random Forest (RF) and Gradient Boosting (GB). The dataset obtained is vectorized using two NLP vectorization techniques, i.e., Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). ML models are trained after vectorization and data balancing. The models' hyperparameter tuning (HT) has also been done using the grid search approach to improve their efficacy. This work provides a comparative performance analysis of ML techniques using Accuracy, Precision, Recall and F1 Score. Performance results showed that a UI bug classification model can be built by training a tuned SVM classifier using TF-IDF and SMOTE (Synthetic Minority Oversampling Technique). The SVM classifier provided the highest performance measures, with Accuracy: 0.88, Precision: 0.86, Recall: 0.85 and F1: 0.85. Results also showed that the performance of ML algorithms with TF-IDF is better than BoW in most cases. This work provides classification of bugs that are related to only the user interface.
Also, the effect of two different feature extraction techniques and sampling techniques on the algorithms was analyzed, adding novelty to the research work.
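The paper's comparison of BoW vs TF-IDF with a grid-searched SVM can be sketched as follows. The bug summaries and labels are invented, scikit-learn is assumed, and the SMOTE balancing step (which in practice comes from the separate imbalanced-learn package) is omitted from this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical bug 'Summary' lines; the study's dataset is not reproduced here.
summaries = [
    "button overlaps the label on the settings page",
    "dropdown menu renders behind the dialog",
    "null pointer exception when saving the file",
    "crash on startup after database migration",
    "misaligned icons in the toolbar at high dpi",
    "segmentation fault while parsing the config",
]
labels = ["UI", "UI", "non-UI", "non-UI", "UI", "non-UI"]

# Compare the two vectorizers used in the paper: BoW vs TF-IDF.
for name, vec in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    pipe = Pipeline([("vec", vec), ("clf", SVC())])
    grid = GridSearchCV(
        pipe,
        {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]},
        cv=2,  # tiny toy dataset; the paper used a proper train/test split
    )
    grid.fit(summaries, labels)
    print(name, grid.best_params_, round(grid.best_score_, 2))
```

Wrapping the vectorizer and classifier in one `Pipeline` lets the grid search tune hyperparameters without leaking test-fold vocabulary into training, which mirrors the tuned-SVM setup the paper describes.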
- Research Article
- 10.1109/access.2022.3183083
- Jan 1, 2022
- IEEE Access
Every year, phishing results in losses of billions of dollars and is a major threat to the Internet economy. Phishing attacks are now most often carried out by email. To better comprehend the existing research trend of phishing email detection, several review studies have been performed. However, it is important to assess this issue from different perspectives. None of the surveys have comprehensively studied the use of Natural Language Processing (NLP) techniques for phishing detection, except one that shed light on the use of NLP techniques for classification and training purposes while exploring a few alternatives. To bridge the gap, this study aims to systematically review and synthesise research on the use of NLP for detecting phishing emails. Based on specific predefined criteria, a total of 100 research articles published between 2006 and 2022 were identified and analysed. We study the key research areas in phishing email detection using NLP, the machine learning algorithms used in phishing email detection, the text features of phishing emails, the datasets and resources that have been used for phishing email detection, and the evaluation criteria. The findings include that the main research area in phishing detection studies is feature extraction and selection, followed by methods for classifying and optimizing the detection of phishing emails. Amongst the range of classification algorithms, support vector machines (SVMs) are heavily utilised for detecting phishing emails. The most frequently used NLP techniques are found to be TF-IDF and word embeddings. Furthermore, the most commonly used dataset for benchmarking phishing email detection methods is the Nazario phishing corpus, and Python is the most commonly used programming language for phishing email detection. It is expected that the findings of this paper can be helpful for the scientific community, especially in the field of NLP application in cybersecurity problems.
This survey is also unique in that it relates works to their openly available tools and resources. The analysis of the presented works revealed that little work has been performed on Arabic-language phishing emails using NLP techniques. Therefore, many open issues are associated with Arabic phishing email detection.
- Research Article
- 10.2174/0118722121300281240823174052
- Oct 2, 2024
- Recent Patents on Engineering
Advanced technologies on the internet create an environment for information exchange among communities. However, some individuals exploit these environments to spread false news. False News, or Fake News (FN), refers to misleading information deliberately crafted to harm the reputation of individuals, products, or services. Identifying FN is a challenging issue for the research community. Many researchers have proposed approaches for FN detection using Machine Learning (ML) and Natural Language Processing (NLP) techniques. In this article, we propose a combined approach for FN detection, leveraging both ML and NLP techniques. We first extract all terms from the dataset after applying appropriate preprocessing techniques. A Feature Selection Algorithm (FSA) is then employed to identify the most important features based on their scores. These selected features are used to represent the dataset documents as vectors. The term weight measure determines the significance of each term in the vector representation. These document vectors are combined with vector representations obtained through an NLP technique. Specifically, we use the Bidirectional Encoder Representations from Transformers (BERT) model to represent the document vectors. The BERT small case model is employed to generate features, which are then used to create the document vectors. The combined vector, comprising ML-based document vector representations and NLP-based vector representations, is fed into various ML algorithms. These algorithms are used to build a model for classification. Our combined approach for FN detection achieved the highest accuracy of 96.72% using the Random Forest algorithm, with document vectors that included content-based features of size 4000 concatenated with outputs from the 9th to 12th BERT encoder layers.
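The concatenation step described above, combining selected content-based term features with dense BERT-layer representations before classification, can be sketched as follows. Random vectors stand in for both feature families purely to show the mechanics; in the paper the sparse part comes from a feature-selection algorithm (size 4000) and the dense part from the 9th to 12th BERT encoder layers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in features: random vectors used only to demonstrate concatenation.
n_docs = 40
content_features = rng.random((n_docs, 50))   # selected term-weight features
bert_features = rng.random((n_docs, 16))      # pooled encoder-layer outputs
labels = rng.integers(0, 2, n_docs)           # 0 = real news, 1 = fake news

# Combined vector: ML-based and NLP-based representations, side by side.
combined = np.hstack([content_features, bert_features])
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(combined, labels)
print(combined.shape)  # (40, 66)
```

The key design point is that `np.hstack` simply widens each document vector, so any downstream classifier (the paper's best was Random Forest) sees both feature families at once.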
- Research Article
- 10.52783/jisem.v10i19s.3009
- Mar 12, 2025
- Journal of Information Systems Engineering and Management
Our methodology utilizes a supervised learning approach, employing Random Forest and Gradient Boosting Machines (GBM) trained on a comprehensive dataset that includes email headers, content, and sender behavior. This approach allows our models to discern complex patterns associated with phishing attempts, achieving a 92% detection rate, a substantial improvement over the traditional signature-based methods' 65% rate. Additionally, we integrated NLP techniques, specifically Word2Vec and GloVe, to extract semantic features from email content, enhancing our system's ability to identify malicious intent. The incorporation of NLP not only improves the precision of phishing detection by an additional 15% compared to conventional methods but also emphasizes the importance of semantic analysis in cybersecurity. This enhancement is crucial for understanding the subtle cues within email content that may indicate phishing, offering a more robust and effective defense mechanism for rural areas. By combining supervised learning with quantum computing and NLP, our approach addresses the significant gaps in traditional cybersecurity methods. This multi-layered strategy ensures a more reliable and efficient way to safeguard rural communities from the increasing threat of cyber attacks. The advanced AI techniques employed here leverage both the predictive power of machine learning and the nuanced understanding of language provided by NLP, setting a new standard in cybersecurity practices. The results of our study highlight the effectiveness of the proposed methodology, demonstrating a potential to markedly improve cybersecurity in resource-constrained rural environments. With a 92% phishing detection rate and an increase in precision through the use of NLP, our approach promises a significant advancement in the protection against cyber threats for rural areas, offering a comprehensive and scalable solution. 
This research presents an innovative multi-layered AI approach, utilizing quantum computing to enhance cybersecurity in rural areas vulnerable to phishing threats. The paper details the integration of sophisticated machine learning techniques—Random Forest and Gradient Boosting Machines (GBM)—with Natural Language Processing (NLP) tools like Word2Vec and GloVe, achieving significant improvements in phishing detection rates. Through a comprehensive analysis of existing cybersecurity strategies and the limitations of traditional signature-based detection methods, this study proposes a robust solution tailored for rural settings such as Siddlagatta, Chikkaballapur, and Devanahalli. By incorporating quantum computing, the approach not only overcomes the constraints of classical computing but also leverages the predictive prowess of AI to offer a more reliable and effective defense against cyber threats. The results demonstrate a promising increase in detection rates, underscoring the potential of this quantum-enhanced, AI-driven strategy to significantly bolster cybersecurity in resource-limited rural environments. Introduction: Cybersecurity in rural areas remains a pivotal concern, exacerbated by limited access to sophisticated technological resources and infrastructure. This paper introduces an advanced multi-layered artificial intelligence (AI) approach, utilizing quantum computing to enhance phishing threat detection in rural environments. Focusing on regions like Siddlagatta, Chikkaballapur, and Devanahalli, the study integrates supervised learning algorithms—Random Forest and Gradient Boosting Machines (GBM)—with Natural Language Processing (NLP) techniques to improve the detection and analysis of phishing attempts. By leveraging machine learning to surpass traditional signature-based methods, this approach significantly boosts detection rates, presenting a tailored, effective solution to protect these vulnerable communities against evolving cyber threats.
Objectives: The objectives of this research are to develop and implement a multi-layered artificial intelligence (AI) approach, utilizing quantum computing to enhance the detection of phishing threats in rural areas. Specifically, the study aims to address the limitations of traditional signature-based detection methods by integrating advanced machine learning algorithms such as Random Forest and Gradient Boosting Machines (GBM) with Natural Language Processing (NLP) techniques. This integration seeks to improve the precision of identifying malicious intent in email communications by analyzing semantic features. The research also explores the effectiveness of these AI techniques in rural settings where cybersecurity resources are scarce, aiming to provide a more robust and efficient solution that can significantly reduce the incidence of phishing attacks in these vulnerable communities. Methods: The proposed methodology entails the development of a web-based platform that melds social networking functionalities with sophisticated agricultural tools and services. By utilizing user profiles, the system effectively categorizes key stakeholders such as farmers, suppliers, experts, and policymakers to foster focused engagement and collaborative efforts. The integration of data from IoT sensors, satellite imagery, and user contributions is channeled into a central system that supports real-time analysis and informed decision-making. Moreover, the platform employs algorithms designed to align stakeholders with pertinent resources, market possibilities, and professional advice. Enhanced communication features like forums, direct messaging, and video conferencing are incorporated to promote interactive exchanges among users. A pilot phase involving select agricultural communities will be initiated to evaluate the practicality and impact of the framework, with subsequent adjustments driven by user feedback and analytic assessments.
The ultimate goal of this framework is to boost connectivity, facilitate the efficient distribution of resources, and empower all involved parties through a scalable and intuitive interface. This approach not only aims to revolutionize the way agricultural communities interact and operate but also seeks to provide a robust foundation for continuous growth and innovation in the sector. Results: The simulated results of the study demonstrate a significant enhancement in phishing detection capabilities through the integration of a multi-layered AI approach in rural settings. The deployment of advanced machine learning algorithms, such as Random Forest and Gradient Boosting Machines (GBM), along with Natural Language Processing (NLP) techniques, notably increased the phishing detection rate to 92%, a substantial improvement over the 65% detection rate achieved by traditional signature-based methods. Additionally, the incorporation of NLP through tools like Word2Vec and GloVe improved the precision of identifying malicious intent by an additional 15%, emphasizing the effectiveness of semantic analysis in distinguishing phishing attempts. These results highlight the potential of combining machine learning and quantum computing to address the unique cybersecurity challenges faced in rural areas, providing a robust solution that significantly enhances the detection and prevention of phishing threats. Conclusions: The research presented in this paper successfully demonstrates the efficacy of a multi-layered AI approach in significantly enhancing cybersecurity against phishing threats in rural areas. By integrating advanced machine learning algorithms with Natural Language Processing techniques and quantum computing, the study achieved a notable increase in phishing detection rates, outperforming traditional signature-based methods with a detection rate of 92%.
This approach not only addresses the limitations inherent in existing cybersecurity measures but also tailors its strategy to the unique challenges posed by the limited resources and infrastructure in rural environments. The integration of semantic analysis through NLP further enhanced the precision of threat detection, providing a more nuanced understanding of malicious intent. Overall, the study underscores the potential of sophisticated AI technologies to transform cybersecurity practices in underserved areas, ensuring more effective protection against evolving cyber threats.
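The semantic-feature step in this entry, representing an email as the average of pretrained word vectors, can be sketched with toy embeddings. Real pipelines would load Word2Vec or GloVe vectors (typically 300-dimensional) from disk; the 4-dimensional vectors below are invented purely for illustration.

```python
import numpy as np

# Toy 4-d vectors standing in for pretrained Word2Vec/GloVe embeddings.
embeddings = {
    "verify":  np.array([0.9, 0.1, 0.0, 0.2]),
    "account": np.array([0.8, 0.2, 0.1, 0.1]),
    "urgent":  np.array([0.1, 0.9, 0.7, 0.0]),
    "click":   np.array([0.2, 0.8, 0.9, 0.1]),
}

def email_vector(text: str) -> np.ndarray:
    """Mean of the word vectors for known tokens (a common baseline)."""
    vecs = [embeddings[t] for t in text.lower().split() if t in embeddings]
    if not vecs:
        return np.zeros(4)
    return np.mean(vecs, axis=0)

v = email_vector("URGENT click to verify your account")
print(v.shape)  # (4,)
```

The resulting fixed-length vector can then be fed to a classifier such as Random Forest or GBM alongside header and sender-behavior features, which is the combination the entry describes.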
- Book Chapter
- 10.1007/978-3-319-30319-2_3
- Jan 1, 2016
Due to the growing volume of available textual information, there is a great demand for Natural Language Processing (NLP) techniques that can automatically process and manage texts, supporting information retrieval and communication in core areas of society (e.g. healthcare, business, and science). NLP techniques have to tackle the often ambiguous linguistic structures that people use in everyday speech. As such, there are many issues that have to be considered, for instance slang, grammatical errors, regional dialects, figurative language, etc. Figurative Language (FL), such as irony, sarcasm, simile, and metaphor, poses a serious challenge to NLP systems. FL is a frequent phenomenon within human communication, occurring both in spoken and written discourse including books, websites, fora, chats, social network posts, news articles and product reviews. Indeed, knowing what people think can help companies, political parties, and other public entities in strategizing and decision-making policies. When people are engaged in an informal conversation, they almost inevitably use irony (or sarcasm) to express something else or different than stated by the literal sentence meaning. Sentiment analysis methods can be easily misled by the presence of words that have a strong polarity but are used sarcastically, which means that the opposite polarity was intended. Several efforts have recently been devoted to detecting and tackling FL phenomena in social media. Many of these applications rely on task-specific lexicons (e.g. dictionaries, word classifications) or Machine Learning algorithms. Increasingly, numerous companies have begun to leverage automated methods for inferring consumer sentiment from online reviews and other sources. A system capable of interpreting FL would be extremely beneficial to a wide range of practical NLP applications.
In this sense, this chapter aims at evaluating how two specific domains of FL, sarcasm and irony, affect Sentiment Analysis (SA) tools. The study’s ultimate goal is to find out whether FL hinders the performance (polarity detection) of SA systems due to the presence of ironic context. Our results indicate that computational intelligence approaches are more suitable in the presence of irony and sarcasm in Twitter classification.
- Research Article
- 10.63345/jqst.v1i4.97
- Nov 1, 2024
- Journal of Quantum Science and Technology
The rapid proliferation of social media platforms has transformed the landscape of information dissemination, enabling unprecedented access to news and opinions. However, this democratization of information has also facilitated the spread of misinformation, posing significant risks to public health, safety, and trust in societal institutions. This paper addresses the urgent need for effective strategies to detect and assess the impact of misinformation on social media through Natural Language Processing (NLP) techniques. We begin by examining the nature of misinformation and its detrimental effects on public discourse and decision-making. The complexity of detecting misleading content is compounded by the subtleties of language and context, necessitating advanced NLP methodologies. This study proposes a comprehensive framework that integrates various NLP techniques, including sentiment analysis, named entity recognition, and machine learning algorithms, to enhance the detection of false narratives. The methodology section details our approach, which includes data collection from diverse social media platforms, preprocessing text data, and employing a range of machine learning classifiers to identify and categorize misinformation. We utilize annotated datasets to train and validate our models, ensuring that our approach is robust and adaptable to different contexts and types of misinformation. Results from our experiments indicate that the proposed NLP techniques significantly improve the accuracy of misinformation detection compared to traditional methods. We provide quantitative metrics that demonstrate the effectiveness of our approach, including precision, recall, and F1-score, while also offering qualitative insights into the types of misinformation prevalent on social media. Additionally, we discuss the implications of our findings for risk and impact assessment, emphasizing the importance of timely intervention in mitigating the spread of false information. 
In conclusion, this research contributes to the ongoing discourse on misinformation by highlighting the potential of NLP in creating more effective detection frameworks.
- Research Article
- 10.1007/s10805-021-09422-4
- Jun 5, 2021
- Journal of Academic Ethics
Is academic integrity research presented from a positive integrity standpoint? This paper uses Natural Language Processing (NLP) techniques to explore a data set of 8,507 academic integrity papers published between 1904 and 2019. Two main techniques are used to linguistically examine paper titles: (1) bigram (word pair) analysis and (2) sentiment analysis. The analysis finds the three main bigrams used in paper titles to be “academic integrity” (2.38%), “academic dishonesty” (2.06%) and “plagiarism detection” (1.05%). When only highly cited papers are considered, negative integrity bigrams dominate positive integrity bigrams. For example, the 100 most cited academic integrity papers of all time are three times more likely to have “academic dishonesty” included in their titles than “academic integrity”. Similarly, sentiment analysis sees negative sentiment outperforming positive sentiment in the most cited papers. The history of academic integrity research is seen to place the field at a disadvantage due to negative portrayals of integrity. Despite this, the analysis shows that change towards positive integrity is possible. The titles of papers by the ten most prolific academic integrity researchers are found to use positive terminology in more cases than not. This suggests an approach for emerging academic integrity researchers to model themselves after.
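The bigram analysis of titles can be reproduced in miniature with the standard library. The titles below are invented stand-ins; the study analysed 8,507 real ones.

```python
from collections import Counter

# Hypothetical paper titles for illustration.
titles = [
    "Academic integrity in online courses",
    "Detecting plagiarism detection evasion",
    "Academic dishonesty and student attitudes",
    "Promoting academic integrity through policy",
]

def title_bigrams(title: str):
    """All adjacent word pairs in a lower-cased title."""
    words = title.lower().split()
    return list(zip(words, words[1:]))

counts = Counter(b for t in titles for b in title_bigrams(t))
total = sum(counts.values())
top, n = counts.most_common(1)[0]
print(top, round(100 * n / total, 2))  # top bigram and its share in %
```

Dividing each bigram's count by the total number of bigrams gives percentage shares of the kind the paper reports (e.g. "academic integrity" at 2.38%).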
- Research Article
- 10.36948/ijfmr.2024.v06i04.24743
- Jul 23, 2024
- International Journal For Multidisciplinary Research
Blockchain technology has emerged as a powerful tool for enhancing transparency, security, efficiency, and inclusivity in record-keeping, credentialing, and educational transactions in the digital age. This paper aims to highlight the potential benefits and challenges of integrating blockchain technology into educational environments, emphasizing security, transparency, and credential verification as key areas of focus. Blockchain technology offers transformative potential in the field of education, particularly in credential and certification processes, providing secure, transparent, and globally accessible solutions for verifying academic achievements and professional skills. As blockchain technology continues to evolve, further advancements and case studies specific to enhancing privacy in educational processes are expected to emerge, addressing the unique challenges and regulatory requirements associated with handling sensitive educational data. Further R&D in blockchain technology within the education sector holds promise for addressing existing challenges, unlocking new opportunities, and advancing the capabilities of educational institutions worldwide.
- Research Article
- 10.54216/jcim.160116
- Jan 1, 2025
- Journal of Cybersecurity and Information Management
A resume is the first impression between you and a potential employer. Therefore, the importance of a resume can never be underestimated. Selecting the right candidates for a job within a company can be a daunting task for recruiters when they have to review hundreds of resumes. To reduce time and effort, we can use NLTK and Natural Language Processing (NLP) techniques to extract essential data from a resume. NLTK is a free, open-source, community-driven project and the leading platform for building Python programs to work with human language data. To select the best resume according to the company’s requirements, an algorithm such as KNN is used. To be selected from hundreds of resumes, a resume must be one of the best. Therefore, our work also focuses on creating an automated system that can recommend the right skills and courses to desired candidates by using Natural Language Processing to analyze writing style (linguistic fingerprints) and to measure style and word frequency in the submitted resume. Through semantic search and relying on individual resumes, forensic experts can query the huge semantic datasets provided to companies and institutions and facilitate the work of government forensics by obtaining official institutional databases. With the growth of global cybercrime and the increase in applicants seeking work with multilingual data, Natural Language Processing (NLP) is making this easier. Through the important relationship between Natural Language Processing (NLP) and digital forensics, NLP techniques are increasingly being used to enhance investigations involving digital evidence and to support open-source data analysis across massive amounts of public data.
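A rough sketch of ranking resumes against a job description follows. The entry pairs NLTK preprocessing with a KNN selector; this sketch approximates that with scikit-learn's TF-IDF and cosine nearest-neighbour search, and all resume text is invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical resume snippets and a job description.
resumes = [
    "python machine learning pandas data analysis",
    "java spring microservices rest apis",
    "nlp text mining python nltk spacy",
    "network security penetration testing firewalls",
]
job = ["python nlp text processing experience required"]

# Vectorize resumes, then find the K resumes closest to the job posting.
vec = TfidfVectorizer()
X = vec.fit_transform(resumes)
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
dist, idx = knn.kneighbors(vec.transform(job))
print(idx[0])  # indices of the two closest resumes
```

Terms in the job description that never appear in any resume are simply dropped by `transform`, so the ranking is driven only by shared, TF-IDF-weighted vocabulary.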
- Conference Article
- 10.1109/bigdata52589.2021.9671552
- Dec 15, 2021
Textual data, such as clinical notes, product or movie reviews in online stores, transcripts, chat records, and business documents, are widely collected nowadays and can be used to support a large spectrum of Big Data applications. At the same time, textual data, collected about individuals or from individuals, can be susceptible to inference attacks that may leak private and/or sensitive information about individuals. The increasing concerns of privacy risks in textual data preclude sharing or exchanging textual data across different parties/organizations for various applications such as record linkage, similar entity matching, natural language processing (NLP), or machine learning on large collections of textual data. This has led to the development of privacy-preserving techniques for applying matching, machine learning or NLP techniques on textual data that contain personal and sensitive information about individuals. While cryptographic techniques are highly secure and accurate, they incur a significant amount of computational cost for encoding and matching data – especially textual data – due to the complex nature of text. In this paper, we propose an efficient textual data encoding and matching algorithm using probabilistic techniques based on counting Bloom filters combined with Differential privacy. We apply our algorithm to a popular use case scenario that involves privacy-preserving topic modeling – a widely used NLP technique – in order to identify common or collective topics in texts across multiple parties without learning the individual topics of each party, and show its effectiveness in supporting this application.
Finally, through extensive experimental evaluation on three large text datasets against a state-of-the-art probabilistic encoding algorithm for privacy preserving LDA topic modelling, we show that our method provides a better privacy-utility trade-off at the cost of more computation complexity and memory space, while still being computationally efficient (log-linear complexity in the size of documents) for Big data compared to cryptographic techniques that have quadratic complexity.
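The core encoding idea, a counting Bloom filter whose released counts are perturbed with Laplace noise for differential privacy, can be sketched in plain Python. This is a deliberately simplified sketch; the paper's actual construction and noise calibration differ in detail.

```python
import hashlib
import math
import random

class NoisyCountingBloomFilter:
    """Counting Bloom filter whose released counts get Laplace noise,
    a simplified stand-in for combining probabilistic encoding with
    differential privacy."""

    def __init__(self, size=64, num_hashes=3, epsilon=1.0):
        self.size = size
        self.num_hashes = num_hashes
        self.epsilon = epsilon
        self.counts = [0] * size

    def _positions(self, token):
        # Derive k cell positions from k salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{token}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, token):
        for pos in self._positions(token):
            self.counts[pos] += 1

    def _laplace(self, scale):
        # Inverse-CDF sampling of a Laplace(0, scale) variate.
        u = random.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(max(1e-12, 1.0 - 2.0 * abs(u)))

    def release(self):
        # Each token touches num_hashes cells, so the L1 sensitivity of
        # the count vector is num_hashes; calibrate the noise to that.
        scale = self.num_hashes / self.epsilon
        return [c + self._laplace(scale) for c in self.counts]

bf = NoisyCountingBloomFilter()
for token in ["invoice", "transfer", "invoice"]:
    bf.add(token)
noisy = bf.release()
print(len(noisy), sum(bf.counts))  # 64 noisy cells; 9 raw increments
```

Only the noisy counts would ever leave a party, so other parties can estimate shared term statistics without learning any single party's exact counts.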
- Research Article
- 10.53550/eec.2025.v31i01s.036
- Jan 1, 2025
- Ecology, Environment and Conservation
Agriculture and technology are becoming increasingly intertwined on a global basis. Agricultural technology innovations have transformed farming techniques, providing answers to some of the world’s most urgent problems. Precision agriculture, enabled by technology such as GPS, drones, and data analytics, is increasing agricultural yields and reducing resource use. Genetic engineering and biotechnology have resulted in the creation of genetically engineered crops that are resistant to pests and thrive in harsh environments. Organic and regenerative agriculture, for example, are gaining popularity as a result of the desire to safeguard the environment. Furthermore, blockchain technology is improving transparency and traceability throughout the food supply chain. Natural Language Processing (NLP) is also having a big influence in the agricultural sector. NLP is transforming several parts of the industry by using the capabilities of language understanding and processing. It aids agricultural monitoring and advisory services by evaluating weather reports, research papers, and textual data to provide farmers with real-time advice on optimal planting schedules, pest and disease control, and harvest timing. By analyzing multiple data sources for signals of outbreaks, NLP assists in the early identification of pests and illnesses, allowing farmers to take appropriate preventative actions. It also plays an important role in market analysis and price prediction, analyzing massive volumes of textual market data to give insights into market patterns, price variations, and demand projections. NLP also benefits soil health and nutrient management by analyzing soil data and research materials to provide specific suggestions. Furthermore, NLP aids farm management by summarizing complicated data from several sources and helps with language translation and localization, making agricultural information available globally.
Furthermore, NLP-powered chatbots and virtual assistants give rapid access to agricultural assistance, increasing farmers’ accessibility and convenience. Overall, NLP’s many agricultural applications contribute to more efficient, sustainable, and informed farming operations by bridging the gap between textual data and practical insights.
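The advisory chatbots this abstract mentions can be reduced to a very simple pattern: match keywords in a farmer's question against a table of canned guidance. A minimal sketch follows; the rules, keywords, and responses are invented placeholders, not real agronomic advice or any system from the paper.

```python
# Minimal keyword-driven advisory assistant (illustrative only; the topics
# and responses below are invented examples, not real agronomic guidance).
ADVISORY_RULES = {
    ("pest", "aphid", "insect"): "Inspect leaf undersides and consider integrated pest management.",
    ("planting", "sow", "schedule"): "Check the local forecast; delay sowing if heavy rain is expected.",
    ("price", "market", "sell"): "Compare prices across nearby markets before committing to a buyer.",
}

def advise(query: str) -> str:
    """Return the response of the first rule whose keywords overlap the query."""
    words = set(query.lower().split())
    for keywords, response in ADVISORY_RULES.items():
        if words & set(keywords):
            return response
    return "No matching guidance found; please rephrase your question."
```

Real systems replace the keyword table with trained intent classifiers, but the lookup structure is the same.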
- Research Article
- 10.1186/s40163-025-00248-8
- Jun 13, 2025
- Crime Science
Background: Fraud is a prevalent offence that extends beyond financial loss, impacting victims emotionally, psychologically, and physically. Advances in online communication technologies continue to create new opportunities for fraud, and fraudsters are increasingly using these channels for deception. With the progression of technologies like Generative Artificial Intelligence (GenAI), there is a growing concern that fraud will increase in scale using these advanced methods, with offenders employing deepfakes in phishing campaigns, for example. However, the application of AI, particularly Natural Language Processing (NLP), to detect and analyse patterns of online fraud remains understudied. This review addresses this gap by investigating the potential role of AI in analysing online fraud using text data. Methods: We conducted a Systematic Literature Review (SLR) to investigate the application of AI and NLP techniques for online fraud detection. The review adhered to the PRISMA-ScR protocol, with eligibility criteria including language, publication type, relevance to online fraud, use of text data, and AI methodologies. Out of 2457 academic records screened, 350 met our eligibility criteria, and 223 were analysed and included herein. Results: We discuss the state-of-the-art AI and NLP techniques used to analyse various online fraud categories; the data sources used for training the AI and NLP models; the AI and NLP algorithms and models built; and the performance metrics employed for model evaluation. We find that the current state of research on online fraud is broken into the various scam activities that take place, and more specifically, we identify 16 different frauds that researchers focus on. Finally, we present the most recent and best-performing AI methods employed for detecting online scams and fraud activities.
Conclusions: This SLR enhances academic understanding of AI-based detection methods for online fraud and offers insights for policymakers, law enforcement, and businesses on safeguarding against such activities. We conclude that existing approaches focusing on specific scams are unlikely to generalise effectively, as they will require new models to be developed for each fraud type. Furthermore, we conclude that the evolving nature of scams limits the effectiveness of models trained on outdated data. We also identify that researchers often omit discussions of the limitations of their data or training biases. Finally, we find issues in the consistency with which the performance of models is reported, with some studies selectively presenting metrics, leading to potential biases in model evaluation.
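The text-based fraud detectors surveyed in this review typically start from a bag-of-words representation and a probabilistic classifier. A self-contained sketch of that baseline, assuming a toy Naive Bayes model over invented training messages (far too small for real use, and not any specific model from the review):

```python
import math
from collections import Counter, defaultdict

# Toy bag-of-words Naive Bayes for scam-message detection (illustrative only;
# the four training messages are invented and far too small for real use).
TRAIN = [
    ("urgent verify your account to claim prize", "scam"),
    ("click link now to unlock your reward", "scam"),
    ("meeting moved to friday please confirm", "legit"),
    ("invoice attached for last month services", "legit"),
]

def train(examples):
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of messages
    vocab = set()
    for text, label in examples:
        words = text.split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(text, word_counts, label_counts, vocab):
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihoods with add-one (Laplace) smoothing
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

MODEL = train(TRAIN)
```

The review's conclusion that scam-specific models generalise poorly is visible even here: the classifier only knows the vocabulary of the scams it was trained on.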
- Research Article
- 10.11113/jt.v77.6502
- Nov 26, 2015
- Jurnal Teknologi
Sentiment analysis has emerged as one of the most powerful tools in business intelligence. With the aim of proposing an effective sentiment analysis technique, we have performed experiments on analyzing the sentiments of 3,424 tweets using both statistical and natural language processing (NLP) techniques as part of our background study. For the statistical technique, machine learning algorithms such as Support Vector Machines (SVMs), decision trees, and Naïve Bayes were explored. The results show that SVM consistently outperformed the rest in both classification tasks. As for sentiment analysis using NLP techniques, we used two different tagging methods for part-of-speech (POS) tagging. Subsequently, the output is used for word sense disambiguation (WSD) using WordNet, followed by sentiment identification using SentiWordNet. Our experimental results indicate that adjectives and adverbs are sufficient to infer the sentiment of tweets compared to other combinations. Comparatively, the statistical approach records higher accuracy than the NLP approach by approximately 17%.
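The paper's finding that adjectives and adverbs alone suffice to infer tweet sentiment can be sketched in a few lines. The tiny polarity lexicon and the suffix-based part-of-speech heuristic below are invented stand-ins for SentiWordNet and a real POS tagger, not the authors' pipeline:

```python
# Invented mini-lexicon mapping adjectives/adverbs to a polarity score,
# standing in for SentiWordNet scores after WordNet-based disambiguation.
POLARITY = {"good": 1, "great": 1, "happily": 1, "bad": -1, "terrible": -1, "badly": -1}

def is_adj_or_adv(word: str) -> bool:
    """Crude POS heuristic: lexicon entries and -ly words count as adjectives/adverbs."""
    return word in POLARITY or word.endswith("ly")

def tweet_sentiment(tweet: str) -> str:
    # Sum polarity over adjectives/adverbs only, ignoring all other word classes.
    score = sum(POLARITY.get(w, 0) for w in tweet.lower().split() if is_adj_or_adv(w))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Restricting the sum to adjectives and adverbs is exactly the feature-selection choice the abstract reports as sufficient.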
- Research Article
- 10.23889/ijpds.v6i1.1757
- Jan 1, 2022
- International Journal of Population Data Science
Introduction: Unstructured text data (UTD) are increasingly found in many databases that were never intended to be used for research, including electronic medical record (EMR) databases. Data quality can impact the usefulness of UTD for research. UTD are typically prepared for analysis (i.e., preprocessed) and analyzed using natural language processing (NLP) techniques. Different NLP methods are used to preprocess UTD and may affect data quality. Objective: Our objective was to systematically document current research and practices about NLP preprocessing methods to describe or improve the quality of UTD, including UTD found in EMR databases. Methods: A scoping review was undertaken of peer-reviewed studies published between December 2002 and January 2021. Scopus, Web of Science, ProQuest, and EBSCOhost were searched for literature relevant to the study objective. Information extracted from the studies included article characteristics (i.e., year of publication, journal discipline), data characteristics, types of preprocessing methods, and data quality topics. Study data were presented using a narrative synthesis. Results: A total of 41 articles were included in the scoping review; over 50% were published between 2016 and 2021. Almost 20% of the articles were published in health science journals. Common preprocessing methods included removal of extraneous text elements such as stop words, punctuation, and numbers; word tokenization; and parts-of-speech tagging. Data quality topics for articles about EMR data included misspelled words, security (i.e., de-identification), word variability, sources of noise, quality of annotations, and ambiguity of abbreviations. Conclusions: Multiple NLP techniques have been proposed to preprocess UTD, with some differences in techniques applied to EMR data. There are similarities in the data quality dimensions used to characterize structured data and UTD.
While there are a few general-purpose measures of data quality that do not require external data, most of these focus on the measurement of noise.
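The preprocessing steps this review reports as most common (lowercasing, removal of punctuation and numbers, whitespace tokenization, stop-word removal) compose into a short pipeline. A minimal stdlib sketch, assuming an illustrative stop-word subset rather than any standard list:

```python
import re

# Tiny illustrative subset of an English stop-word list (real pipelines use
# much larger lists, e.g. from an NLP toolkit).
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "to", "of", "and", "in"}

def preprocess(text: str) -> list[str]:
    """Apply the common UTD preprocessing steps the review describes."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and digits
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]
```

Note how aggressive the digit/punctuation removal is on clinical text ("120/80" disappears entirely), which is one reason the review finds that preprocessing choices affect data quality.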
- Research Article
- 10.55041/ijsrem44944
- Apr 18, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Research Article
- 10.55041/ijsrem44813
- Apr 17, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Research Article
- 10.55041/ijsrem44894
- Apr 17, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Research Article
- 10.55041/ijsrem44681
- Apr 17, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Research Article
- 10.55041/ijsrem44850
- Apr 17, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Research Article
- 10.55041/ijsrem44933
- Apr 17, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Research Article
- 10.55041/ijsrem44888
- Apr 17, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Research Article
- 10.55041/ijsrem44811
- Apr 17, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Research Article
- 10.55041/ijsrem44844
- Apr 17, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT
- Research Article
- 10.55041/ijsrem44861
- Apr 17, 2025
- INTERNATIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT