SUS audit aided by natural language processing: A comparative evaluation of BERT models in the analysis of health news

Abstract

Established in 1988, Brazil’s Unified Health System (SUS) faces significant challenges regarding funding, regional inequalities, and resource oversight, necessitating innovative management solutions. Current auditing processes remain inefficient, as manually detecting irregularities within vast volumes of documents is both time-consuming and costly. This study applies Named Entity Recognition (NER) and text classification techniques to analyze health news relevant to SUS audits. We compare the performance of BERT, BERT-CRF, and ModBERTBr models to identify the most effective approach for optimizing content selection, thereby aiding investigations and combating corruption. A controlled experimental design was employed, following a pipeline of tokenization, label alignment, supervised training, and statistical analysis. Models were evaluated using accuracy, recall, precision, F1-score, and Mean Training Time (MTT). In the NER task, BERT-CRF demonstrated superior performance, achieving the best results in recall (0.880), precision (0.855), and F1-score (0.860). Conversely, the standard BERT model achieved the best overall performance in text classification, significantly outperforming ModBERTBr across all metrics.
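The abstract names tokenization and label alignment as pipeline steps but does not say how they were implemented. The sketch below shows one common way to align word-level NER labels with subword tokens, and it is not necessarily the authors' method. It assumes per-token word indices of the kind that Hugging Face fast tokenizers return from `word_ids()`. The function name `align_labels`, the label ids, and the example input are all hypothetical.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # assumption: label value that the loss function skips


def align_labels(word_labels: List[int], word_ids: List[Optional[int]]) -> List[int]:
    """Map word-level labels onto subword tokens.

    word_ids[i] is the index of the word that token i came from, or None
    for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE_INDEX)      # special token: no label
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword carries the word label
        else:
            aligned.append(IGNORE_INDEX)      # continuation subword is masked out
        previous = wid
    return aligned


# Hypothetical input: three words, the second one split into two subwords.
word_labels = [0, 1, 0]
word_ids = [None, 0, 1, 1, 2, None]
aligned = align_labels(word_labels, word_ids)
print(aligned)
```

Masking continuation subwords is only one convention. Another is to repeat the word label on every subword, and the abstract does not say which one was used.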

Similar Papers
  • Conference Article
  • Cite Count Icon 2
  • 10.1117/12.2622410
Radar technical language modeling with named entity recognition and text classification
  • May 27, 2022
  • Jackson Zaunegger + 4 more

This paper introduces the radar text data set (RadarTD) for technical language modeling. This data set is comprised of sentences containing radar parameters, values, and units determined from real-world values. This data set is created based on values determined from published academic research. Additionally, each statement is assigned a sentiment label and goal priority label. Preliminary investigations into the applicability of this data set are explored using the BERT model and several bi-directional LSTM models. These models are evaluated on text classification and named entity recognition tasks. This study evaluates the applicability of technical language modeling using neural networks to analyze input statements for cognitive radar applications. These findings suggest that this data set can be used to achieve reasonable performance for both text classification and named entity recognition for autonomous radar applications.

  • Research Article
  • 10.3390/e28030261
EDAER: Entropy-Driven Approach for Entity and Relation Extraction in Chinese Cyber Threat Intelligence
  • Feb 27, 2026
  • Entropy
  • Yong Li + 6 more

Cyber threat intelligence (CTI) has been explored to strengthen system security via taking raw threat data from various data sources and transforming it into actionable insights that enable organizations to predict, detect, and respond to cyber threats. Named entity recognition (NER) and relation extraction (RE) are the key tasks of CTI data mining. However, current CTI NER and/or RE research is mainly focused on English CTI, which is not directly transferable to Chinese CTI due to fundamental linguistic and terminological differences. Moreover, the existing limited studies on Chinese CTI do not effectively address uncertainty in predictions in low-resource scenarios where entities and relations are sparse. This work aims to improve the performance of NER and RE tasks in low-resource Chinese CTI scenarios, and we make two major contributions. The first is that we construct a Chinese CTI dataset, which includes 16 types of entities and 9 types of relations—more than those of the existing open-source dataset on Chinese CTI. The second is that we propose an entropy-driven approach for entity and relation (EDAER) extraction. EDAER is the first to combine the techniques of RoBERTa_wwm, Mamba, RDCNN and CRF to perform NER tasks. In addition, EDAER is the first to apply entropy to quantify the uncertainty of the model’s predictions in NER and RE tasks in Chinese CTI scenarios. Moreover, EDAER is the first to apply contrastive learning techniques in Chinese CTI scenarios to learn meaningful features by maximizing the similarity between positive samples and minimizing the similarity between negative samples. Extensive experimental results on public and our built datasets demonstrate that our proposed approach performs the best. 
These results show that (1) RoBERTa_wwm significantly outperforms BERT on both NER and RE tasks; (2) Mamba outperforms BiLSTM on the NER task; (3) the entropy-based dynamic gating mechanism contributes to performance improvements in both NER and RE tasks; and (4) the uncertainty-guided contrastive learning mechanism is helpful for performance improvement in the NER task.
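The abstract above says entropy is used to quantify the uncertainty of a model's predictions. As a minimal illustration of that idea only, and not of EDAER's gating mechanism, the sketch below computes the Shannon entropy of a predicted label distribution. The function name and the example distributions are hypothetical.

```python
import math
from typing import Sequence


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution over labels."""
    # Terms with p == 0 contribute nothing, so they are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical distributions over four entity labels.
h_uniform = prediction_entropy([0.25, 0.25, 0.25, 0.25])  # maximally uncertain
h_peaked = prediction_entropy([1.0, 0.0, 0.0, 0.0])       # fully confident
print(h_uniform, h_peaked)
```

Higher entropy means the probability mass is spread across labels, which is the sense in which entropy can serve as an uncertainty signal.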

  • Research Article
  • Cite Count Icon 15
  • 10.1017/s1351324922000080
Enhancing deep neural networks with morphological information
  • Feb 21, 2022
  • Natural Language Engineering
  • Matej Klemen + 2 more

Deep learning approaches are superior in natural language processing due to their ability to extract informative features and patterns from languages. The two most successful neural architectures are LSTM and transformers, used in large pretrained language models such as BERT. While cross-lingual approaches are on the rise, most current natural language processing techniques are designed and applied to English, and less-resourced languages are lagging behind. In morphologically rich languages, information is conveyed through morphology, for example, through affixes modifying stems of words. The existing neural approaches do not explicitly use the information on word morphology. We analyse the effect of adding morphological features to LSTM and BERT models. As a testbed, we use three tasks available in many less-resourced languages: named entity recognition (NER), dependency parsing (DP) and comment filtering (CF). We construct baselines involving LSTM and BERT models, which we adjust by adding additional input in the form of part of speech (POS) tags and universal features. We compare the models across several languages from different language families. Our results suggest that adding morphological features has mixed effects depending on the quality of features and the task. The features improve the performance of LSTM-based models on the NER and DP tasks, while they do not benefit the performance on the CF task. For BERT-based models, the added morphological features only improve the performance on DP when they are of high quality (i.e., manually checked) while not showing any practical improvement when they are predicted. Even for high-quality features, the improvements are less pronounced in language-specific BERT variants compared to massively multilingual BERT models. As in NER and CF datasets manually checked features are not available, we only experiment with predicted features and find that they do not cause any practical improvement in performance.

  • Research Article
  • Cite Count Icon 27
  • 10.1016/j.jbi.2022.104279
Negation-based transfer learning for improving biomedical Named Entity Recognition and Relation Extraction
  • Jan 4, 2023
  • Journal of Biomedical Informatics
  • Hermenegildo Fabregat + 3 more

  • Conference Article
  • Cite Count Icon 29
  • 10.1109/icis54925.2022.9882514
Chinese Named Entity Recognition based on BERT-CRF Model
  • Jun 26, 2022
  • Shulin Hu + 3 more

Named entity recognition (NER) is an important research direction in natural language processing (NLP). In recognizing Chinese entities, traditional machine learning algorithms suffer from low accuracy, heavy dependence on feature design, poor domain adaptability, and an inability to handle terms whose meaning varies with context. To address these problems, this paper adopts a BERT-CRF model for Chinese NER. The pre-trained BERT language model generates word vectors that represent contextual semantic information and automatically extracts numerous word-level and semantic features from the text; a CRF layer then decodes these into entity tag sequences. The BERT model is fine-tuned to perform better on NER tasks, and experiments on the People’s Daily dataset yield an F1 value of 94.5%.

  • Research Article
  • Cite Count Icon 6
  • 10.24920/003589
Medical Knowledge Extraction and Analysis from Electronic Medical Records Using Deep Learning.
  • Jan 1, 2019
  • Chinese Medical Sciences Journal
  • Li Peilin + 4 more

  • Research Article
  • Cite Count Icon 1
  • 10.3897/biss.8.140428
BiodiViz: Leveraging NER and RE for Automated Knowledge Graph Generation in Biodiversity Research
  • Oct 29, 2024
  • Biodiversity Information Science and Standards
  • Angela Shannen Tan + 2 more

In biodiversity research, the integration of machine learning and data visualization is increasingly important for uncovering valuable insights from academic literature. This study introduces an innovative knowledge graph application, BiodiViz, designed to translate intricate text into intuitive visual representations, fostering a deeper comprehension of biodiversity relationships. BiodiViz uses the top-performing Named Entity Recognition (NER) and Relation Extraction (RE) models to automatically generate a comprehensive knowledge graph for biodiversity research. The NER model extracts and categorizes entities like organisms, phenomena, and habitats, while the RE model identifies relationships such as "have," "occur in," and "influence" from the BiodivNERE dataset (Abdelmageed et al. 2022). These entities and relationships are organized into nodes and edges within a graph. Researchers input text into BiodiViz, producing a visual knowledge graph that simplifies the analysis of complex biodiversity data, reducing manual effort and enhancing efficiency.

Named Entity Recognition & Relation Extraction

BiodiViz leverages advanced Bidirectional Encoder Representations from Transformers (BERT)-based Large Language Models (LLMs) (Rogers et al. 2020), fine-tuned specifically for NER and RE tasks using the BiodivNERE dataset. The fine-tuning process involved various models, including BERT (Devlin et al. 2019), ELECTRA (Clark et al. 2020), and BiodivBERT (Abdelmageed et al. 2023). These models were evaluated for performance using the results of their F1-score as the main metric, which is the harmonic mean of precision (the proportion of true positive results among all positive predictions) and recall (the proportion of true positive results among all actual positives), with BiodivBERT achieving an F1-score of 77.16% for the NER task, while BERT excelled in the RE task with an F1-score of 81.28%. Rigorous hyperparameter optimization further enhanced the performance of BiodivBERT in the RE task by 3.38%.

The BiodivNERE corpora by Abdelmageed et al. (2022) were used to fine-tune several models for NER and RE tasks in the biodiversity domain. The first corpus from the BiodivNERE corpora is BiodivNER, which is a gold standard dataset (manually labelled test corpora) for evaluating NER tasks. The fine-tuning process employed the token classification method from the Hugging Face library (Hugging Face 2023b), which assigns labels to each token in a sequence. Experiments were conducted with a batch size of four, meaning the model processes four examples/rows of data at a time before making an update to improve its learning. This is due to the constraints of the NVIDIA® GeForce RTX™ 3060 graphics processor (NVIDIA 2024). Model performance was evaluated using the seqeval library (Nakayama 2018), focusing on accuracy, precision, recall, and F1 scores. For text classification, the second corpus, BiodivRE, was utilized, following previous research recommendations to explore fine-tuning settings for BiodivBERT. Hyperparameter optimization (Feurer and Hutter 2019) was conducted using Hugging Face’s Trainer API with an Optuna backend (Hugging Face 2023a), concentrating on learning rate and the number of training epochs (i.e., the number of complete passes through the entire dataset during model training).

The BiodiViz Knowledge Graph Application

The fine-tuned NER and RE models with the best F1-scores, BiodivBERT and BERT respectively, were integrated into the knowledge graph application. Fig. 1 illustrates the flowchart of the application pipeline. Each sentence in the input text will go through the NER model to identify and label the entities within the sentence. Subsequently, these labeled entities, together with the original sentence, will be input into the RE model. The RE model will analyze every pair of entities for a potential relation and output the type of relation they share. The application will then utilize this data to create a graph with appropriate labels and color-coding. An example of the application's user interface with the knowledge graph is shown in Fig. 2. This study highlights the practical application of machine learning and data visualization in advancing biodiversity research, emphasizing the importance of developing user-friendly tools to support scientific exploration and discovery. The BiodiViz application, including the code and resources, is available on GitHub, providing an accessible tool for biodiversity researchers to streamline their analyses.
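The abstract above defines the F1-score as the harmonic mean of precision and recall. The sketch below turns that definition into code using raw counts. It is a plain illustration rather than the seqeval implementation the authors cite, and the counts are hypothetical, not taken from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # true positives among all positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # true positives among all actual positives
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for illustration only.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(precision, recall, f1)
```

Because the harmonic mean is pulled toward the lower of the two values, a model cannot reach a high F1 by maximizing precision or recall alone.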

  • Research Article
  • Cite Count Icon 9
  • 10.54097/hset.v34i.5482
News Short Text Classification Based on Bert Model and Fusion Model
  • Feb 28, 2023
  • Highlights in Science, Engineering and Technology
  • Hongyang Cui + 2 more

Text classification is one of the most fundamental tasks in NLP, and the classification of short news text can serve as the basis for many other tasks. In this paper, we apply a fusion model combining BERT and TextRNN, with some modified details, to achieve higher text classification accuracy. We use the THUCNews dataset, which consists of two columns, one for news text and the other for numbers. The original dataset was separated into three parts: a training set, a validation set, and a test set. We use the BERT model, which involves two pre-training tasks, and the TextRNN model, which uses an RNN to solve text classification problems. We train these two models in parallel; the best BERT and TextRNN models obtained through training and parameter tuning are then combined through a fully-connected layer that produces the final results by weighting the efficiency of BERT and TextRNN. The fusion model addresses the over-fitting and under-fitting of a single model and helps to obtain a model with better generalization performance. The experimental results show the sharp change in loss and accuracy as well as the final accuracy of the BERT model. Precision, recall, and F1-score are also evaluated. The accuracy of the fusion model of BERT and TextRNN is much better than that of the single BERT model, with a gap of 1.76%.

  • Book Chapter
  • Cite Count Icon 9
  • 10.1007/978-3-030-61377-8_46
A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese
  • Jan 1, 2020
  • Luiz Henrique Bonifacio + 3 more

Deep language models, like ELMo, BERT and GPT, have achieved impressive results on several natural language tasks. These models are pretrained on large corpora of unlabeled general domain text and later supervisedly trained on downstream tasks. An optional step consists of finetuning the language model on a large intradomain corpus of unlabeled text, before training it on the final task. This aspect is not well explored in the current literature. In this work, we investigate the impact of this step on named entity recognition (NER) for Portuguese legal documents. We explore different scenarios considering two deep language architectures (ELMo and BERT), four unlabeled corpora and three legal NER tasks for the Portuguese language. Experimental findings show a significant improvement on performance due to language model finetuning on intradomain text. We also evaluate the finetuned models on two general-domain NER tasks, in order to understand whether the aforementioned improvements were really due to domain similarity or simply due to more training data. The achieved results also indicate that finetuning on a legal domain corpus hurts performance on the general-domain NER tasks. Additionally, our BERT model, finetuned on a legal corpus, significantly improves on the state-of-the-art performance on the LeNER-Br corpus, a Portuguese language NER corpus for the legal domain.

  • Research Article
  • Cite Count Icon 5
  • 10.1371/journal.pone.0318726
Leveraging natural language processing for efficient information extraction from breast cancer pathology reports: Single-institution study.
  • Feb 18, 2025
  • PLOS ONE
  • Phillip Park + 8 more

Pathology reports provide important information for accurate diagnosis of cancer and optimal treatment decision making. In particular, breast cancer is known to be the most common cancer in women worldwide. To extract data from breast cancer pathology reports at a single institution, we compared the accuracy of regular expressions and natural language processing (NLP). A total of 1,215 breast cancer pathology reports were annotated for NLP model development. As NLP models, we considered three BERT models with specific vocabularies: BERT-basic, BioBERT, and ClinicalBERT. K-fold cross-validation was used to verify the performance of the BERT models. The results of the regular expressions and the BERT models were compared using named entity recognition (NER) techniques. Among the three BERT models, BioBERT was the most accurate parsing model (average performance = 0.99901) for breast cancer pathology when set to k = 5. BioBERT also had the lowest error rate for all items in the breast cancer pathology report compared to the other BERT models (accuracy for all variables ≥ 0.9). Therefore, we selected BioBERT as the NLP model. When comparing the results of BioBERT and regular expressions using NER, we found that BioBERT was more accurate than the regular expression method, especially for items such as intraductal component (BioBERT: 1.0, RegEx: 0.1644), lymph node (BioBERT: 0.9886, RegEx: 0.4792), and lymphovascular invasion (BioBERT: 0.9918, RegEx: 0.3759). Our results showed that the NLP model, BioBERT, had higher accuracy than regular expressions, suggesting the importance of BioBERT in the processing of breast cancer pathology reports.

  • Research Article
  • Cite Count Icon 1
  • 10.22146/ijccs.99841
Offensive Language and Hate Speech Detection using BERT Model
  • Oct 31, 2024
  • IJCCS (Indonesian Journal of Computing and Cybernetics Systems)
  • Fadila Shely Amalia + 1 more

Hate speech detection is an important issue in sentiment analysis and natural language processing. This study aims to improve the effectiveness of hate speech detection in English text using the BERT model, along with modified preprocessing techniques to enhance the F1-score. The dataset, sourced from Kaggle, contains English text with hate speech content. Evaluation results show a significant improvement in the model's accuracy and overall text classification performance. The BERT model achieved 89.11% accuracy on test data, correctly predicting 85 out of 95 samples. While the model excels at classifying offensive text with around 95% accuracy, it struggles to distinguish between hate and offensive text, with some confusion between neither and offensive categories. The classification report shows F1-scores of 0.43 for the hate class, 0.94 for the offensive class, and 0.84 for the neither class, with a weighted average F1-score of 0.89 and a macro average of 0.73. These results indicate that the BERT model delivers solid performance in detecting hate speech, though there is room for improvement, particularly in distinguishing certain classes.
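The abstract above reports per-class F1-scores alongside a macro average. The short check below recomputes the macro average, which is the unweighted mean of the per-class scores, from the three values quoted in the abstract. The weighted average is not recomputed because the per-class sample counts it requires are not given. The variable names are illustrative.

```python
# Per-class F1-scores as quoted in the abstract.
per_class_f1 = {"hate": 0.43, "offensive": 0.94, "neither": 0.84}

# Macro average: every class counts equally, regardless of how many samples it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"macro F1 = {macro_f1:.4f}")
```

The result, roughly 0.737, is in line with the reported macro average of 0.73 up to rounding. It also shows how the weak hate-class score pulls the macro average well below the weighted average of 0.89.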

  • Book Chapter
  • Cite Count Icon 1
  • 10.1007/978-981-99-1414-2_43
Named Entity Recognition over Dialog Dataset Using Pre-trained Transformers
  • Jan 1, 2023
  • Archana Patil + 2 more

The need for natural language processing (NLP) applications and advances in deep learning (DL) techniques have increased the demand for large amounts of human-readable data, leading to an interesting research area, named entity recognition (NER), which is a sub-task of natural language processing. NER identifies and tags real-life entities with pre-defined categories. These categories include persons, locations, times, organizations, events, etc., depending on the dataset at hand. NER forms the groundwork for natural language processing applications such as information retrieval (IR), question answering systems (QAS), text summarization (TS), and machine translation (MT). Earlier NER techniques perform well but require human intervention to form domain-specific features or rules. Emerging deep learning models further improve the performance of NER systems, yet the use of NER in dialog systems remains a less explored area. In this paper, our aim is therefore to describe different techniques that can be applied to the NER task, to fine-tune pre-trained transformers on an in-car dialog dataset for NER, and to evaluate the performance of the system.

  • Research Article
  • Cite Count Icon 48
  • 10.1371/journal.pone.0246310
A pre-training and self-training approach for biomedical named entity recognition.
  • Feb 9, 2021
  • PLOS ONE
  • Shang Gao + 3 more

Named entity recognition (NER) is a key component of many scientific literature mining tasks, such as information retrieval, information extraction, and question answering; however, many modern approaches require large amounts of labeled training data in order to be effective. This severely limits the effectiveness of NER models in applications where expert annotations are difficult and expensive to obtain. In this work, we explore the effectiveness of transfer learning and semi-supervised self-training to improve the performance of NER models in biomedical settings with very limited labeled data (250-2000 labeled samples). We first pre-train a BiLSTM-CRF and a BERT model on a very large general biomedical NER corpus such as MedMentions or Semantic Medline, and then we fine-tune the model on a more specific target NER task that has very limited training data; finally, we apply semi-supervised self-training using unlabeled data to further boost model performance. We show that in NER tasks that focus on common biomedical entity types such as those in the Unified Medical Language System (UMLS), combining transfer learning with self-training enables a NER model such as a BiLSTM-CRF or BERT to obtain similar performance with the same model trained on 3x-8x the amount of labeled data. We further show that our approach can also boost performance in a low-resource application where entities types are more rare and not specifically covered in UMLS.

  • Components
  • Cite Count Icon 1
  • 10.1371/journal.pone.0246310.r006
A pre-training and self-training approach for biomedical named entity recognition
  • Feb 9, 2021
  • Nicolas Fiorini + 4 more

Named entity recognition (NER) is a key component of many scientific literature mining tasks, such as information retrieval, information extraction, and question answering; however, many modern approaches require large amounts of labeled training data in order to be effective. This severely limits the effectiveness of NER models in applications where expert annotations are difficult and expensive to obtain. In this work, we explore the effectiveness of transfer learning and semi-supervised self-training to improve the performance of NER models in biomedical settings with very limited labeled data (250-2000 labeled samples). We first pre-train a BiLSTM-CRF and a BERT model on a very large general biomedical NER corpus such as MedMentions or Semantic Medline, and then we fine-tune the model on a more specific target NER task that has very limited training data; finally, we apply semi-supervised self-training using unlabeled data to further boost model performance. We show that in NER tasks that focus on common biomedical entity types such as those in the Unified Medical Language System (UMLS), combining transfer learning with self-training enables a NER model such as a BiLSTM-CRF or BERT to obtain similar performance with the same model trained on 3x-8x the amount of labeled data. We further show that our approach can also boost performance in a low-resource application where entities types are more rare and not specifically covered in UMLS.

  • Conference Article
  • Cite Count Icon 26
  • 10.5220/0011749400003393
German BERT Model for Legal Named Entity Recognition
  • Jan 1, 2023
  • Harshil Darji + 2 more

The use of BERT, one of the most popular language models, has led to improvements in many Natural Language Processing (NLP) tasks. One such task is Named Entity Recognition (NER) i.e. automatic identification of named entities such as location, person, organization, etc. from a given text. It is also an important base step for many NLP tasks such as information extraction and argumentation mining. Even though there is much research done on NER using BERT and other popular language models, the same is not explored in detail when it comes to Legal NLP or Legal Tech. Legal NLP applies various NLP techniques such as sentence similarity or NER specifically on legal data. There are only a handful of models for NER tasks using BERT language models, however, none of these are aimed at legal documents in German. In this paper, we fine-tune a popular BERT language model trained on German data (German BERT) on a Legal Entity Recognition (LER) dataset. To make sure our model is not overfitting, we performed a stratified 10-fold cross-validation. The results we achieve by fine-tuning German BERT on the LER dataset outperform the BiLSTM-CRF+ model used by the authors of the same LER dataset. Finally, we make the model openly available via HuggingFace.
