
Using Natural Language Processing to Extract Information from Unstructured code-change version control data: lessons learned

Abstract

Context: Natural Language Processing (NLP) is a branch of artificial intelligence that extracts information from language. In the field of software engineering, NLP has been employed to extract key information from free-form text, to generate models from the analysis of text or to categorize code changes according to their commit messages. In the literature, most NLP-based approaches have focused on the impact of code changes on program execution or software architecture. Objective: In this study, we have applied NLP to code-change data to identify patterns of software code modifications and used Machine Learning techniques to build a model that determines how software has evolved over time and identifies areas of code that present problems. Method: Considering that software projects use version control systems, such as GitHub, to manage their code, we have collected software information by using git commands. These data contain various kinds of unstructured information about the files in a project. Each modification entry includes a message that explains the reasons for the change. According to the content of the message, it is possible to identify key terms that can be used during the classification of the entries. Results: In this study, we have considered the change history of software available on GitHub to the High Energy Physics community. With the use of NLP techniques we have cleaned the messages and extracted some key terms to categorize both software problems and some other changes performed by developers, like the addition of a third party dependency or a script that starts a given service. We have built a code change dictionary combining the terms already in existing literature with the ones gathered directly from the software and its GitHub repository. 
Finally, we have applied some Machine Learning (ML) techniques to determine any connection between code changes and software problems: we have removed redundant entries to avoid any bias in the outcomes of the ML techniques. Conclusion: We describe in detail the approach we adopted to construct historical code change datasets of categorized commit messages by following a multi-label classification methodology. Our model performance seems promising in terms of accuracy, precision, recall and F1-score.
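
The cleaning, dictionary lookup and redundancy-removal steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the categories and key terms in CHANGE_DICTIONARY, the sample messages and all function names are hypothetical and are not taken from the paper. The messages are assumed to be commit subjects such as those printed by `git log --pretty=format:%s`.

```python
import re

# Hypothetical code-change dictionary: category -> key terms.
# These categories and terms are illustrative, not the paper's dictionary.
CHANGE_DICTIONARY = {
    "bug_fix": {"fix", "bug", "crash", "error"},
    "dependency": {"dependency", "upgrade", "library"},
    "service_script": {"script", "service", "startup"},
}


def clean_message(message):
    """Lowercase a commit message and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", message.lower())


def categorize(message):
    """Return every category whose key terms occur in the message (multi-label)."""
    tokens = set(clean_message(message))
    return {cat for cat, terms in CHANGE_DICTIONARY.items() if tokens & terms}


def deduplicate(messages):
    """Drop messages whose cleaned token sequence has already been seen."""
    seen = set()
    unique = []
    for msg in messages:
        key = tuple(clean_message(msg))
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique


if __name__ == "__main__":
    # Hypothetical commit subjects, for illustration only.
    log = [
        "Fix crash in startup script",
        "fix crash in startup script!",
        "Upgrade ROOT library dependency",
        "Update README",
    ]
    for msg in deduplicate(log):
        print(msg, "->", sorted(categorize(msg)))
```

A message can receive several labels at once, or none, which is the sense in which the sketch is multi-label; a trained classifier would replace the plain dictionary lookup in a real pipeline.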

Similar Papers
  • Research Article
  • Cite Count Icon 6
  • 10.3390/app122110773
Natural Language Processing Application on Commit Messages: A Case Study on HEP Software
  • Oct 24, 2022
  • Applied Sciences
  • Yue Yang + 2 more

Version Control and Source Code Management Systems, such as GitHub, contain a large amount of unstructured historical information about software projects. Recent studies have introduced Natural Language Processing (NLP) to help software engineers retrieve information from a very large collection of unstructured data. In this study, we have extended our previous work by enlarging our datasets and broadening the set of machine learning and clustering techniques applied. We have followed a multi-step methodology. Starting from the raw commit messages we have employed NLP techniques to build a structured database. We have extracted their main features and used them as input to different clustering algorithms. Once each entry was labelled, we applied supervised machine learning techniques to build a prediction and classification model. We have developed a machine learning-based model to automatically classify commit messages of a software project. Our model exploits a ground-truth dataset that includes commit messages obtained from various GitHub projects belonging to the High Energy Physics context. The contribution of this paper is two-fold: it proposes a ground-truth database and it provides a machine learning prediction model that automatically identifies the more change-prone areas of code. Our model has obtained a very high average accuracy (0.9590), precision (0.9448), recall (0.9382), and F1-score (0.9360).
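
For readers unfamiliar with the averaged scores quoted above, the sketch below computes accuracy together with precision, recall and F1 averaged over classes. The abstract does not say which averaging scheme was used, so macro-averaging is an assumption here, and the labels in the example are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 for class labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


if __name__ == "__main__":
    # Hypothetical commit categories, for illustration only.
    truth = ["fix", "fix", "feature", "feature", "docs", "docs"]
    predicted = ["fix", "feature", "feature", "feature", "docs", "fix"]
    print(macro_scores(truth, predicted))
```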

  • Research Article
  • Cite Count Icon 69
  • 10.1097/acm.0000000000002414
Using Machine Learning to Assess Physician Competence: A Systematic Review.
  • Mar 1, 2019
  • Academic Medicine
  • Roger D Dias + 2 more

To identify the different machine learning (ML) techniques that have been applied to automate physician competence assessment and evaluate how these techniques can be used to assess different competence domains in several medical specialties. In May 2017, MEDLINE, EMBASE, PsycINFO, Web of Science, ACM Digital Library, IEEE Xplore Digital Library, PROSPERO, and Cochrane Database of Systematic Reviews were searched for articles published from inception to April 30, 2017. Studies were included if they applied at least one ML technique to assess medical students', residents', fellows', or attending physicians' competence. Information on sample size, participants, study setting and design, medical specialty, ML techniques, competence domains, outcomes, and methodological quality was extracted. MERSQI was used to evaluate quality, and a qualitative narrative synthesis of the medical specialties, ML techniques, and competence domains was conducted. Of 4,953 initial articles, 69 met inclusion criteria. General surgery (24; 34.8%) and radiology (15; 21.7%) were the most studied specialties; natural language processing (24; 34.8%), support vector machine (15; 21.7%), and hidden Markov models (14; 20.3%) were the ML techniques most often applied; and patient care (63; 91.3%) and medical knowledge (45; 65.2%) were the most assessed competence domains. A growing number of studies have attempted to apply ML techniques to physician competence assessment. Although many studies have investigated the feasibility of certain techniques, more validation research is needed. The use of ML techniques may have the potential to integrate and analyze pragmatic information that could be used in real-time assessments and interventions.

  • Research Article
  • Cite Count Icon 23
  • 10.1016/j.procs.2024.10.198
Advancements of SMS Spam Detection: A Comprehensive Survey of NLP and ML Techniques
  • Jan 1, 2024
  • Procedia Computer Science
  • Mohammed Rasol Al Saidat + 2 more


  • Research Article
  • Cite Count Icon 67
  • 10.22581/muet1982.2201.07
Resume Classification System using Natural Language Processing and Machine Learning Techniques
  • Jan 1, 2022
  • Mehran University Research Journal of Engineering and Technology
  • Irfan Ali + 4 more

The selection of a suitable job applicant from a pool of thousands of applications is often a daunting job for an employer. The categorization of job applications submitted in the form of Resumes against available vacancy(s) takes significant time and effort from an employer. Thus, a Resume Classification System (RCS) using Natural Language Processing (NLP) and Machine Learning (ML) techniques could automate this tedious process. Moreover, automating this process can significantly expedite the applicants’ screening process and make it more transparent, with minimal human involvement. This experimental study presents an automated NLP and ML-based RCS that classifies Resumes according to job categories with performance guarantees. This study employs various ML algorithms and NLP techniques to measure the accuracy of RCS and proposes a solution with better accuracy and reliability in different settings. To demonstrate the significance of NLP and ML techniques for RCS, the extracted features were evaluated on nine ML classification models, namely Support Vector Machine - SVM (Linear, SGD, SVC and NuSVC), Naïve Bayes (Bernoulli, Multinomial & Gaussian), K-Nearest Neighbor (KNN), and Logistic Regression (LR). The Term-Frequency-Inverse-Document-Frequency (TF-IDF) feature representation scheme proved suitable for RCS. The developed models were evaluated using the Confusion Matrix, F-Score, Recall, Precision, and overall Accuracy. The experimental results indicate that, using the One-Vs-Rest-Classification strategy for this multi-class Resume classification task, the SVM class of Machine Learning classifiers performed better on the study dataset of more than nine hundred sixty parsed resumes, with more than 96% accuracy. The promising results suggest that the NLP and ML techniques employed in this study could be used for developing an efficient RCS.
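
The One-Vs-Rest strategy mentioned above trains one binary classifier per category and picks the category whose classifier scores highest. The sketch below illustrates only that strategy: it swaps in a simple perceptron on raw term counts in place of the SVM and TF-IDF features used in the study, and the training texts, labels and function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy training set, for illustration only.
TRAIN_TEXTS = [
    "python developer backend api",
    "java developer backend spring",
    "nurse patient hospital care",
    "doctor patient clinic care",
    "teacher school classroom lesson",
    "tutor school student lesson",
]
TRAIN_LABELS = [
    "engineering", "engineering",
    "healthcare", "healthcare",
    "education", "education",
]


def featurize(text):
    """Bag-of-words term counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())


def train_perceptron(samples, targets, epochs=50):
    """Train a binary perceptron; targets are +1 or -1. Returns (weights, bias)."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, target in zip(samples, targets):
            score = bias + sum(weights[w] * c for w, c in features.items())
            if target * score <= 0:  # misclassified: move towards the target
                for w, c in features.items():
                    weights[w] += target * c
                bias += target
                mistakes += 1
        if mistakes == 0:  # every training sample is on the correct side
            break
    return weights, bias


def train_one_vs_rest(texts, labels):
    """Train one 'this label versus all others' perceptron per label."""
    samples = [featurize(t) for t in texts]
    models = {}
    for label in sorted(set(labels)):
        targets = [1 if l == label else -1 for l in labels]
        models[label] = train_perceptron(samples, targets)
    return models


def predict(models, text):
    """Return the label whose binary model gives the highest score."""
    features = featurize(text)

    def score(label):
        weights, bias = models[label]
        return bias + sum(weights[w] * c for w, c in features.items())

    return max(models, key=score)


if __name__ == "__main__":
    models = train_one_vs_rest(TRAIN_TEXTS, TRAIN_LABELS)
    print(predict(models, "backend api developer"))
```

The same wrapper works with any binary scorer, which is why the strategy pairs naturally with classifiers such as SVMs that are binary by construction.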

  • Research Article
  • Cite Count Icon 4
  • 10.1007/s10994-005-1399-6
Guest Editors Introduction: Machine Learning in Speech and Language Technologies
  • Sep 1, 2005
  • Machine Learning
  • Pascale Fung + 1 more

Machine learning techniques have long been the foundations of speech processing. Bayesian classification, decision trees, unsupervised clustering, the EM algorithm, maximum entropy, etc. are all part of existing speech recognition systems. The success of statistical speech recognition has led to the rise of statistical and empirical methods in natural language processing. Indeed, many of the machine learning techniques used in language processing, from statistical part-of-speech tagging to the noisy channel model for machine translation have roots in work conducted in the speech field. However, advances in learning theory and algorithmic machine learning approaches in recent years have led to significant changes in the direction and emphasis of the statistical and learning centered research in natural language processing and made a mark on natural language and speech processing. Approaches such as memory based learning, a range of linear classifiers such as Boosting, SVMs and SNoW and others have been successfully applied to a broad range of natural language problems, and these now inspire new research in speech retrieval and recognition. We have seen an increasingly close collaboration between voice and language processing researchers in some of the shared tasks such as spontaneous speech recognition and understanding, voice data information extraction, and machine translation. The purpose of this special issue was to invite speech and language researchers to communicate with each other, and with the machine learning community on the latest machine learning advances in their work. The call for papers was met with great enthusiasm from the speech and natural language community. Thirty six submissions were received; each paper was reviewed by at least three reviewers. 
Only ten papers were selected reflecting not only some of the best work on machine learning in the areas of natural language and spoken language processing but also what we view as a collection of papers that represent current trends in these areas of research both from the perspective of

  • Research Article
  • Cite Count Icon 1
  • 10.1093/eurpub/ckae144.1156
Predicting diabetes prognosis using machine learning techniques in a Hungarian clinical database
  • Oct 28, 2024
  • European Journal of Public Health
  • A Nagy + 3 more

Background Chronic conditions such as type 2 diabetes mellitus have great impact on patients’ quality of life. Although clinical databases provide a perfect base for research aimed at improving diabetes care, analyzing such databases requires extensive pre-processing, mainly due to the large amount of unstructured data. Our study aimed to present the steps of generating a dataset from a large clinical database, and to apply machine learning-based analytical techniques regarding. Methods Data of the Clinical Center of University of Debrecen was used. To structure the unstructured data, regular expressions and natural language processing methods were used. The main machine learning models were as follows: Gradient Boosting Machines to predict the risk of complication development; Long Short-Term Memory Networks to forecast future health outcomes. All analysis and procedures were done using Python. Results The database contains approximately 1600 tables, with a total size of 1.9 terabytes, where the largest table is 21.07 gigabytes with approximately 44.64 million rows. The final dataset consisted of 40,332 patients. Most variables originate from the unstructured data, including complications and comorbidities of diabetes, as well as physical and laboratory parameters. Related to laboratory parameters, the number of measurements and the median value for every half-year were available. The diagnosis time of the complications’ occurrence is also presented. Machine learning methods were more accurate compared with traditional statistical methods in predicting the prognosis (p < 0.05). Discussion Our research highlights the importance of clinical data in chronic disease management. There are challenges in pre-processing and managing datasets, but machine learning-based methods are very efficient not only in extracting useful information from unstructured data but also in predicting the prognosis and identifying potential intervention points for better care. 
Key messages • Natural language processing can be used to obtain useful information from unstructured clinical data. • Using machine learning techniques on clinical data could improve diabetes care.

  • Research Article
  • Cite Count Icon 34
  • 10.3390/info12110444
Investigating Machine Learning & Natural Language Processing Techniques Applied for Predicting Depression Disorder from Online Support Forums: A Systematic Literature Review
  • Oct 27, 2021
  • Information
  • Isuri Anuradha Nanomi Arachchige + 2 more

Depression is a common mental health disorder that affects an individual’s moods, thought processes and behaviours negatively, and disrupts one’s ability to function optimally. In most cases, people with depression try to hide their symptoms and refrain from obtaining professional help due to the stigma related to mental health. The digital footprint we all leave behind, particularly in online support forums, provides a window for clinicians to observe and assess such behaviour in order to make potential mental health diagnoses. Natural language processing (NLP) and Machine learning (ML) techniques are able to bridge the existing gaps in converting language to a machine-understandable format in order to facilitate this. Our objective is to undertake a systematic review of the literature on NLP and ML approaches used for depression identification on Online Support Forums (OSF). A systematic search was performed to identify articles that examined ML and NLP techniques to identify depression disorder from OSF. Articles were selected according to the PRISMA workflow. For the purpose of the review, 29 articles were selected and analysed. From this systematic review, we further analyse which combination of features extracted from NLP and ML techniques are effective and scalable for state-of-the-art Depression Identification. We conclude by addressing some open issues that currently limit real-world implementation of such systems and point to future directions to this end.

  • Research Article
  • Cite Count Icon 1
  • 10.1093/ecco-jcc/jjae190.1406
P1232 A Novel Inflammatory Bowel Disease Registry Powered by Artificial Intelligence and Natural Language Processing
  • Jan 22, 2025
  • Journal of Crohn's and Colitis
  • J Liu + 7 more

Background Accurate data registries may assist clinicians and researchers to gain insights into inflammatory bowel disease (IBD) and provide opportunities to improve overall patient care. However, most data registries are limited by the amount of time needed to collect and record patient-level data. Machine learning and natural language processing (NLP) can facilitate data collection, storage, and retrieval, reducing or even eliminating the need for human data entry. The aim of this study was to describe and validate a novel IBD repository (IBD Data Lake), leveraging machine learning and NLP techniques, as useful tools to curate and retrieve pertinent, real time clinical data in the IBD patient population. Methods The IBD Data Lake was created by medical professionals, translational researchers, and data strategists at the IBD Centre of British Columbia. Structured and unstructured data were extracted from patients’ electronic medical record and were transferred to a secure cloud infrastructure and curated into a searchable database. A customized user interface was created to search the IBD Data Lake. An advanced NLP service (Comprehend Medical™) was employed to extract clinical information from the unstructured text and data from medical documents in PDF format. Manual chart review was used as the gold standard to validate all information from the IBD Data Lake. Results A list of 208 patients (104 IBD patients matched to 104 non-IBD patients) from the IBD Data Lake was generated between July 1, 2018 and July 31, 2023. After a thorough chart review, the IBD cohort comprised 101 IBD patients and the non-IBD cohort included 102 non-IBD patients. 
The IBD Data Lake’s performance metrics for identifying IBD patients were as follows: sensitivity 98.1%, specificity 97.1%, positive predictive value 97.1%, and negative predictive value 98.1%. The machine learning and NLP components of the IBD Data Lake demonstrated high performance in analyzing key IBD unstructured clinical characteristics: for distinction of ulcerative colitis or Crohn’s disease, sensitivity was 100% and specificity 98.2%; for smoking status, sensitivity was 100% and specificity 96.9%; and for extraintestinal manifestations, sensitivity was 92% and specificity 100%. Conclusion A novel IBD Data Lake that integrates machine learning and NLP techniques has been validated. IBD patients have been identified with great accuracy and the machine learning/NLP components of the IBD Data Lake allow for a comprehensive and timely extraction and organization of unstructured data. This will ultimately lay the groundwork for recruitment of specific IBD cohorts of interest to address the gaps that remain in our knowledge. It has the potential to drive innovation in the field of IBD and gastroenterology.

  • Research Article
  • Cite Count Icon 1
  • 10.1007/s11517-025-03414-x
Rapid trauma classification under data scarcity: an emergency on-scene decision model combining natural language processing and machine learning
  • Jan 1, 2025
  • Medical & Biological Engineering & Computing
  • Jun Tang + 3 more

Trauma has become a major cause of increased morbidity and mortality worldwide. In emergency response, the classification of injuries is crucial as it helps to quickly determine the criticality of the injured, allocate rescue resources rationally, and decide the priority order of treatment. However, emergency scenes are often chaotic environments, making it difficult for rescue personnel to collect complete and accurate information about the injured in a short period. The combination of artificial intelligence and emergency rescue is gradually changing the rescue model, improving the efficiency of rescue operations. We selected data from 26,810 trauma patients admitted to Chongqing Daping Hospital between 2013 and 2024. We propose a fast tiered medical treatment method with a two-layer structure under emergency limited data conditions, which integrates natural language processing (NLP) and machine learning (ML) techniques. The tiered medical treatment model utilizes NLP to capture semantic features of unstructured text data, while utilizing four ML algorithms to process structured numerical data. Additionally, we conducted external validation using 245 data entries from the Chongqing Emergency Center. The experimental results show that gradient boosting and logistic regression have the best performance in the two-layer ML algorithms. Based on these two algorithms, our model outperformed the multilayer perceptron (MLP) model on the test dataset, achieving an accuracy of 91.17%, which is 4.33% higher than that of the MLP model. The specificity, F1-score, and AUC of our model were 97.06%, 86.85%, and 0.949, respectively. For the external dataset, the model achieved accuracy, specificity, F1-score, and AUC of 87.35%, 95.78%, 80.37%, and 0.848, respectively. These results demonstrate the model’s high generalizability and prediction accuracy. 
A model integrating NLP and ML techniques enables rapid tiered medical treatment based on limited data from the emergency scene, with significant advantages in terms of prediction accuracy.
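
The AUC values reported above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that definition; it is a minimal illustration, not the study's evaluation code, and the labels and scores in the example are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels are 1 (positive) or 0 (negative); ties between scores count as half.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical model scores, for illustration only.
    y = [1, 1, 0, 1, 0, 0]
    s = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print(round(roc_auc(y, s), 4))
```

The pairwise form is quadratic in the number of cases; it is chosen here for clarity rather than speed.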

  • Conference Article
  • Cite Count Icon 6
  • 10.1109/ccet56606.2022.10080860
Sentimental Analysis of Movie Review using Machine Learning Approach
  • Dec 23, 2022
  • Md.Thoufiq Zumma + 5 more

A key area of machine learning called sentiment analysis seeks to extract subjective data from textual evaluations. The most popular technique for anticipating user ratings is sentiment analysis, and several machine-learning techniques have been employed to provide precise predictions. Sentiment analysis is the practice of examining information about what the general public really thinks about a company, a text, an opinion, a social media post, etc. It is a very potent tool in the analytics toolbox. Natural language processing and text mining both have a close connection to the study of sentiment. It can be used to evaluate the reviewer's viewpoint on certain topics or the review's overall polarity. The accuracy of the model is evaluated using sentiment analysis on the IMDB movie reviews dataset utilizing machine learning (ML) and natural language processing (NLP) techniques. Natural language processing and machine learning combine to provide the fundamental building blocks of sentiment analysis. Together, they provide context to grasp the meaning of any text. Using machine learning classification methods, this study suggests a prediction model for the sentiment analysis of movie reviews. This study aids researchers in choosing the most effective method for accurate and timely sentiment analysis of IMDB movie reviews. Here, we want to estimate the general polarity of the review using machine learning and natural language processing (NLP).

  • Supplementary Content
  • Cite Count Icon 2
  • 10.5167/uzh-61703
Fine-grained code changes and bugs: Improving bug prediction
  • Jan 1, 2012
  • Zurich Open Repository and Archive (University of Zurich)
  • Emanuel Giger

Software development and, in particular, software maintenance are time consuming and require detailed knowledge of the structure and the past development activities of a software system. Limited resources and time constraints make the situation even more difficult. Therefore, a significant amount of research effort has been dedicated to learning software prediction models that allow project members to allocate and spend the limited resources efficiently on the (most) critical parts of their software system. Prominent examples are bug prediction models and change prediction models: Bug prediction models identify the bug-prone modules of a software system that should be tested with care; change prediction models identify modules that change frequently and in combination with other modules, i.e., they are change coupled. By combining statistical methods, data mining approaches, and machine learning techniques software prediction models provide a structured and analytical basis to make decisions. Researchers proposed a wide range of approaches to build effective prediction models that take into account multiple aspects of the software development process. They achieved especially good prediction performance, guiding developers towards those parts of their system where a large share of bugs can be expected. For that, they rely on change data provided by version control systems (VCS). However, due to the fact that current VCS track code changes only on file-level and textual basis most of those approaches suffer from coarse-grained and rather generic change information. More fine-grained change information, for instance, at the level of source code statements, and the type of changes, e.g., whether a method was renamed or a condition expression was changed, are often not taken into account. 
Therefore, investigating the development process and the evolution of software at a fine-grained change level has recently received increasing attention in research. The key contribution of this thesis is to improve software prediction models by using fine-grained source code changes. Those changes are based on the abstract syntax tree structure of source code and allow us to track code changes at the fine-grained level of individual statements. We show with a series of empirical studies using the change history of open-source projects how prediction models can benefit in terms of prediction performance and prediction granularity from the more detailed change information. First, we compare fine-grained source code changes and code churn, i.e., lines modified, for bug prediction. The results with data from the Eclipse platform show that fine-grained source code changes significantly outperform code churn when classifying source files into bug- and not bug-prone, as well as when predicting the number of bugs in source files. Moreover, these results give more insights about the relation of individual types of code changes, e.g., method declaration changes and bugs. For instance, in our dataset method declaration changes exhibit a stronger correlation with the number of bugs than class declaration changes. Second, we leverage fine-grained source code changes to predict bugs at method-level. This is beneficial as files can grow arbitrarily large. Hence, if bugs are predicted at the level of files a developer needs to manually inspect all methods of a file one by one until a particular bug is located. Third, we build models using source code properties, e.g., complexity, to predict whether a source file will be affected by a certain type of code change. 
Predicting the type of changes is of practical interest, for instance, in the context of software testing as different change types require different levels of testing: While for small statement changes local unit-tests are mostly sufficient, API changes, e.g., method declaration changes, might require system-wide integration-tests which are more expensive. Hence, knowing (in advance) which types of changes will most likely occur in a source file can help to better plan and develop tests, and, in case of limited resources, prioritize among different types of testing. Finally, to assist developers in bug triaging we compute prediction models based on the attributes of a bug report that can be used to estimate whether a bug will be fixed fast or whether it will take more time for resolution. The results and findings of this thesis give evidence that fine-grained source code changes can improve software prediction models to provide more accurate results.
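
Code churn, the coarse baseline the thesis compares against, counts lines modified per file. The sketch below sums added and deleted lines per file from text in the tab-separated layout that `git log --numstat` prints. It illustrates the churn measure only, not the thesis's fine-grained change extraction; the sample log, including its simplified "commit" header lines and file paths, is hypothetical.

```python
from collections import defaultdict


def churn_per_file(numstat_output):
    """Sum added + deleted lines per file from `git log --numstat` style text."""
    churn = defaultdict(int)
    for line in numstat_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit headers and blank lines carry no counts
        added, deleted, path = parts
        if not (added.isdigit() and deleted.isdigit()):
            continue  # binary files are reported with "-" instead of counts
        churn[path] += int(added) + int(deleted)
    return dict(churn)


# Hypothetical sample in the "added<TAB>deleted<TAB>path" layout.
SAMPLE = (
    "commit a1\n"
    "3\t1\tsrc/parser.py\n"
    "10\t0\tREADME.md\n"
    "\n"
    "commit b2\n"
    "5\t7\tsrc/parser.py\n"
    "-\t-\tdocs/logo.png\n"
)

if __name__ == "__main__":
    for path, lines in sorted(churn_per_file(SAMPLE).items()):
        print(path, lines)
```

Churn treats a renamed method and a reformatted comment alike as "lines modified", which is the coarseness that change information at statement level is meant to overcome.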

  • Research Article
  • Cite Count Icon 1
  • 10.1186/1471-2105-12-s3-i1
Topics in machine learning for biomedical literature analysis and text retrieval
  • Jun 9, 2011
  • BMC Bioinformatics
  • Rezarta Islamaj Doğan + 1 more

Life science researchers and health care professionals rely heavily on biomedical literature databases such as MEDLINE to access information essential for research, health care, education, as well as to keep up with the latest developments in their fields. Providing ways to efficiently access and analyze text information is critical and is becoming more challenging with the increasing volume of publications in the biomedical domain. The last decade has shown an exponential rate of growth of biomedical literature [1]. Natural language processing, a symbiosis of computer science and linguistics disciplines, addresses the computational aspects of automatic text processing. This field offers a fertile ground for machine learning algorithms. The challenges presented when processing natural language offer new opportunities to the existing machine learning methods and promote the development of new ones. The special session of “Machine Learning in Biomedical Literature Analysis and Text Retrieval” was held for the first time as part of the 9th International Conference on Machine Learning and Applications, in Washington DC on December 12-14, 2010. The goal of this session was to present advancements in machine learning techniques that can improve the analysis of biomedical text. In this supplement we present a collection of papers originally presented and published in the proceedings of the International Conference on Machine Learning and Applications (ICMLA 2010). These papers constitute an advance beyond the work originally presented at the conference and have gone through a separate rigorous review process. They represent a wide cross-section of the type of work that goes on in machine learning today, with its focus on biomedical literature. Papers in this supplement touch on multiple existing machine learning methods such as wide margin classifiers and conditional random fields. 
They suggest novel applications for these methods as well as propose new machine learning techniques, such as novel methods for constructing training data and gold standards. From the literature analysis and text retrieval perspectives this collection of papers covers multiple topics including tokenization, named entity recognition, word-sense disambiguation, sequence labeling, and relationship extraction. Tokenization is typically the first step in natural language processing and is often assumed to be trivial. Unfortunately, it is quite challenging, especially in the biomedical domain. Barrett and Weber-Jahnke [2] present an intriguing scheme for building a tokenizer. Named entity recognition is an important component of text analysis tools. Three papers in the supplement touch on named entity recognition. Yeganova et al. [3] present a method of detecting abbreviations and their definitions in biomedical literature. Islamaj Dogan et al. [4] present an approach that detects with high accuracy clinical problems, treatment and test phrases in patient records and doctor notes. Benton et al. [5] present a system for de-identifying personal information in medical message board text. Many applications are believed to benefit from identifying the correct word sense in entity recognition tasks. MetaMap [6], for example, is a system that provides UMLS [7] concept and semantic type annotation to free text and can significantly benefit from word-sense disambiguation. Jimeno-Yepes et al. [8] work on a knowledge-based word sense disambiguation approach that uses collocation analysis to improve the knowledge-based word sense disambiguation system. Automatic extraction of bibliographic data, such as article titles, author names, abstracts, and references is essential to citation databases, such as MEDLINE. Zhang et al. [9] examine the task of identifying the components of bibliographic references. They treat the problem as a sequence labeling problem. 
Accessibility to gold-standard training data allows scientists to focus on the solution of the problem at hand. In this collection we include two papers that are dedicated to this issue. Wilbur and Kim [10] treat human relevance judgments of MEDLINE document pairs to improve on gold standard annotations, whereas Yeganova et al. [3] present a method that relies on naturally occurring positive training examples and synthetically generated negative training examples to train their model. Finally, Islamaj Dogan et al. [4] investigate a clinical relationship extraction problem. They approach it as a classification task, training classifiers to assign a relationship type to a pair of clinical concepts after performing entity recognition.

  • Research Article
  • Citations: 52
  • 10.1148/rg.2021210025
Bag-of-Words Technique in Natural Language Processing: A Primer for Radiologists.
  • Aug 13, 2021
  • RadioGraphics
  • Krishna Juluru + 3 more

Natural language processing (NLP) is a methodology designed to extract concepts and meaning from human-generated unstructured (free-form) text. It is intended to be implemented by using computer algorithms so that it can be run on a corpus of documents quickly and reliably. To enable machine learning (ML) techniques in NLP, free-form text must be converted to a numerical representation. After several stages of preprocessing including tokenization, removal of stop words, token normalization, and creation of a master dictionary, the bag-of-words (BOW) technique can be used to represent each remaining word as a feature of the document. The preprocessing steps simplify the documents but also potentially degrade meaning. The values of the features in BOW can be modified by using techniques such as term count, term frequency, and term frequency-inverse document frequency. Experience and experimentation will guide decisions on which specific techniques will optimize ML performance. These and other NLP techniques are being applied in radiology. Radiologists' understanding of the strengths and limitations of these techniques will help in communication with data scientists and in implementation for specific tasks. Online supplemental material is available for this article. ©RSNA, 2021.
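The preprocessing and weighting steps named in this abstract can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the article's implementation: the stop-word list, the example sentences, and the function names (`preprocess`, `build_dictionary`, `bow_vectors`) are hypothetical, normalization is reduced to lowercasing, and the inverse document frequency uses the plain `log(N / df)` variant, which is one of several in common use.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "no", "of", "the", "to"}


def preprocess(text):
    """Tokenize on letters, lowercase (simple normalization), drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_dictionary(token_lists):
    """Master dictionary: every distinct remaining token, in sorted order."""
    return sorted({t for tokens in token_lists for t in tokens})


def bow_vectors(docs):
    """Return the dictionary plus term-count, TF, and TF-IDF vectors per document."""
    token_lists = [preprocess(d) for d in docs]
    vocab = build_dictionary(token_lists)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in token_lists for t in set(tokens))

    counts, tfs, tfidfs = [], [], []
    for tokens in token_lists:
        c = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        count_vec = [c[t] for t in vocab]
        tf_vec = [n / total for n in count_vec]
        # Unsmoothed IDF, log(N / df); other variants add smoothing terms.
        tfidf_vec = [tf * math.log(n_docs / df[t]) for tf, t in zip(tf_vec, vocab)]
        counts.append(count_vec)
        tfs.append(tf_vec)
        tfidfs.append(tfidf_vec)
    return vocab, counts, tfs, tfidfs


# Made-up example reports, for illustration only.
docs = [
    "No acute fracture is seen.",
    "Fracture of the left wrist.",
    "The lungs are clear.",
]
vocab, counts, tfs, tfidfs = bow_vectors(docs)

print(vocab)
for row in counts:
    print(row)
```

In this toy corpus the first two reports both reduce to a vector containing "fracture" once "no" is dropped as a stop word, which illustrates how preprocessing can degrade meaning. Terms that occur in fewer documents, such as "acute", receive a higher TF-IDF weight than the shared term "fracture".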

  • Discussion
  • Citations: 29
  • 10.1161/circoutcomes.115.002125
Natural Language Processing and the Promise of Big Data: Small Step Forward, but Many Miles to Go.
  • Aug 18, 2015
  • Circulation: Cardiovascular Quality and Outcomes
  • Thomas M Maddox + 1 more

The promise of big data has captured healthcare’s imagination. Although the term lacks a consensus definition, it generally refers to electronic health data sets characterized by the 3 Vs: volume, variety, and velocity [1,2]. Volume refers to the sheer amount of healthcare data currently generated by clinical operations, administration, and patients themselves. By one estimate, ≈25 000 petabytes of healthcare data will be available by 2020, an amount that could fill 500 billion file cabinets [2]. Variety refers to the wide range of healthcare data formats. For example, electronic health records (EHRs) contain both structured and unstructured (or free-text) data, diagnostic images come in a variety of multimedia formats, and patient data are generated from wearables, mobile devices, medical devices, and social media, each with its own format. Velocity refers to the rapidity with which new data are generated, and thus the speed at which it needs incorporation into data sets and analyses to provide real-time insights into health care. The potential of such data is enormous. Insights from big data could fuel innovation and improvement in clinical operations, research and development, and public health [1]. However, the potential of big data to realize these lofty aspirations is matched by the challenge of organizing, analyzing, and generating actionable insights from it. One of the biggest challenges in realizing the potential of big data is in abstracting it. With the passage of the HITECH (The Health Information Technology for Economic and Clinical Health) Act in 2009, the adoption of EHRs in clinical practice has accelerated, and now over half of office-based practices and hospitals are using some form of EHR [3,4]. As a result, more point-of-care clinical data, previously inaccessible in its paper format, is potentially available. However, the variety aspect of EHR data, its mix …

  • Research Article
  • Citations: 2
  • 10.1016/j.mex.2025.103407
Evaluating sentiment analysis models: A comparative analysis of vaccination tweets during the COVID-19 phase leveraging DistilBERT for enhanced insights.
  • Jun 1, 2025
  • MethodsX
  • Renuka Agrawal + 5 more

This study investigates public sentiment toward COVID-19 vaccinations by analyzing Twitter data using advanced machine learning (ML) and natural language processing (NLP) techniques. Recognizing social media as a valuable source for gauging public opinion during health crises, the research aims to inform policies on content moderation and misinformation control.
  • Comparative Analysis of Embedding Techniques and ML Models: The study evaluates two embedding techniques, TF-IDF and Word2Vec, across five ML models: LinearSVC, Random Forest, Gradient Boosting Machine (GBM), XGBoost, and AdaBoost.
  • The models were tested using two training-testing splits (70-30 and 80-20) to assess their performance on noisy, unlabeled, and imbalanced sentiment data.
  • Utilization of DistilBERT for Pseudo-Labeling: To enhance labeling accuracy, DistilBERT was employed for pseudo-labeling, capturing semantic nuances often missed by traditional ML techniques. This approach enabled more effective sentiment classification of tweets.
The findings underscore the effectiveness of automated annotation, hybrid modeling, and embedding strategies in analyzing unstructured social media data. Such approaches provide valuable insights for public health applications, particularly in understanding vaccine hesitancy and shaping communication strategies. The study highlights the potential of integrating advanced NLP techniques to better comprehend and respond to public sentiments during pandemics or similar emergencies.
