Computational text analysis on unstructured police data: a scoping review

Abstract

Police reports made following attendance at various events (e.g., crashes, domestic violence, theft) often contain rich contextual details, including indicators of mental health issues or abuse types, and the persons/entities involved and their relationships, which are not typically captured in structured administrative data, interviews or official statistics. However, the sheer volume of information, along with strict data access protocols, renders manual analysis impractical. Computational text analysis methods offer a feasible and effective approach to automatically process this underutilized data source. This article is an overview of studies using computational text analysis (e.g., text mining, natural language processing (NLP)) on unstructured police data, serving as a guide for researchers interested in employing similar methodologies. This scoping review was conducted in accordance with the PRISMA-ScR guidelines, following the two screening processes (title/abstract and full-text screening) and the development of a pre-defined protocol. A search was conducted across seven electronic databases (ProQuest, IEEE Xplore, Scopus, PubMed, Web of Science, Criminal Justice Abstracts, Google Scholar) covering the past 20 years. A total of 5426 records were identified. After removing duplicate entries and screening titles/abstracts and full-text publications, 61 studies met the inclusion criteria. Included studies were published between 2004 and 2024, with most from the United States, Australia and the Netherlands. Most studies used open-source tools to analyze unstructured police data: Bidirectional Encoder Representations from Transformers (BERT), the Natural Language Toolkit (NLTK), scikit-learn, or General Architecture for Text Engineering (GATE). Our review indicates that applications of computational text analysis on unstructured police data achieve moderate to high performance. Common limitations included variable data quality, with reliability depending on the level of detail provided by the police report's author, and failure to report ethical implications or methodological limitations. Computational text analysis can extract key information from unstructured police data. However, future research should clearly report ethics approvals and implications, and methodological limitations. Establishing a structured data-sharing framework between law enforcement and researchers is also crucial to facilitate access and support high-quality, impactful research in this field.

Similar Papers
  • Research Article
  • Cite Count: 16
  • 10.47813/2782-5280-2024-3-1-0311-0320
Bidirectional encoders to state-of-the-art: a review of BERT and its transformative impact on natural language processing
  • Mar 2, 2024
  • Информатика. Экономика. Управление - Informatics. Economics. Management
  • Rajesh Gupta

First developed in 2018 by Google researchers, Bidirectional Encoder Representations from Transformers (BERT) represents a breakthrough in natural language processing (NLP). BERT achieved state-of-the-art results across a range of NLP tasks while using a single transformer-based neural network architecture. This work reviews BERT's technical approach, performance when published, and significant research impact since release. We provide background on BERT's foundations like transformer encoders and transfer learning from universal language models. Core technical innovations include deeply bidirectional conditioning and a masked language modeling objective during BERT's unsupervised pretraining phase. For evaluation, BERT was fine-tuned and tested on eleven NLP tasks ranging from question answering to sentiment analysis via the GLUE benchmark, achieving new state-of-the-art results. Additionally, this work analyzes BERT's immense research influence as an accessible technique surpassing specialized models. BERT catalyzed adoption of pretraining and transfer learning for NLP. Quantitatively, over 10,000 papers have extended BERT and it is integrated widely across industry applications. Future directions based on BERT scale towards billions of parameters and multilingual representations. In summary, this work reviews the method, performance, impact and future outlook for BERT as a foundational NLP technique.

  • Research Article
  • 10.70102/afts.2025.1833.176
METAHEURISTIC-DRIVEN HYPERPARAMETER OPTIMIZATION FOR BERT IN SENTIMENT ANALYSIS
  • Oct 30, 2025
  • Archives for Technical Sciences
  • Alaa A El-Demerdash + 1 more

Sentiment analysis has emerged as an important task in natural language processing (NLP), and demand for it is currently high. BERT (Bidirectional Encoder Representations from Transformers) has proved highly effective for sentiment analysis, far exceeding conventional algorithms, but realizing its potential requires fine-tuning its hyperparameters. Optimizing BERT's hyperparameters is difficult because of the complicated interactions between them (e.g., learning rate, batch size, dropout rate, attention heads). In this paper, the Salp Swarm Algorithm (SSA), a bio-inspired metaheuristic optimization technique, is used to optimize the fine-tuning process. Using SSA's efficient search of multidimensional spaces, BERT hyperparameters are optimized systematically for sentiment classification. A benchmark sentiment analysis dataset (Sentiment140) is used to evaluate the proposed model. The novelty of the model is that it dynamically adjusts its search behaviour in response to performance signals, and so identifies better-performing parameter sets than conventional methods. Extensive evaluations against three search strategies (manual tuning, grid search, and random search) on the Sentiment140 benchmark dataset demonstrate the superiority of the proposed SSA-BERT optimization technique. The SSA-BERT model achieved a maximum accuracy of 96.4 percent, far better than manual tuning, grid search, and random search (65.0 percent, 69.5 percent and 72.0 percent, respectively). It also performed better than other existing BERT models in the related literature, which showed accuracy levels between 46.4 and 75.7 percent on different benchmarks.
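
The baseline strategies named in this abstract (grid search and random search) can be illustrated on a toy problem. The sketch below is not the authors' SSA implementation and does not train BERT: `toy_validation_score` is a hypothetical stand-in for "fine-tune the model and return validation accuracy", and the search ranges are illustrative assumptions.

```python
import itertools
import random


def toy_validation_score(learning_rate, dropout):
    """Hypothetical stand-in for "fine-tune the model, return validation accuracy".

    By construction the score peaks at learning_rate=3e-5 and dropout=0.1.
    """
    return 1.0 - abs(learning_rate - 3e-5) / 1e-4 - abs(dropout - 0.1)


def grid_search(learning_rates, dropouts):
    """Score every combination on a fixed grid; return (best_score, best_params)."""
    best_score, best_params = float("-inf"), None
    for params in itertools.product(learning_rates, dropouts):
        score = toy_validation_score(*params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def random_search(n_trials, seed=0):
    """Score randomly sampled configurations; return (best_score, best_params)."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        # Learning rate is sampled log-uniformly, dropout uniformly (assumed ranges).
        params = (10 ** rng.uniform(-6, -4), rng.uniform(0.0, 0.5))
        score = toy_validation_score(*params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


grid_best = grid_search([1e-5, 3e-5, 5e-5], [0.1, 0.3])
random_best = random_search(n_trials=20)
```

Grid search only ever sees the values placed on the grid, while random search samples the whole range; a metaheuristic such as SSA would instead use earlier scores to decide where to sample next.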

  • Research Article
  • Cite Count: 25
  • 10.1186/s12911-022-01946-y
Automatic text classification of actionable radiology reports of tinnitus patients using bidirectional encoder representations from transformer (BERT) and in-domain pre-training (IDPT)
  • Jul 30, 2022
  • BMC Medical Informatics and Decision Making
  • Jia Li + 10 more

Background: Given the increasing number of people suffering from tinnitus, the accurate categorization of patients with actionable reports is attractive in assisting clinical decision making. However, this process requires experienced physicians and significant human labor. Natural language processing (NLP) has shown great potential in big data analytics of medical texts; yet, its application to domain-specific analysis of radiology reports is limited. Objective: The aim of this study is to propose a novel approach in classifying actionable radiology reports of tinnitus patients using bidirectional encoder representations from transformer (BERT)-based models and evaluate the benefits of in-domain pre-training (IDPT) along with a sequence adaptation strategy. Methods: A total of 5864 temporal bone computed tomography (CT) reports are labeled by two experienced radiologists as follows: (1) normal findings without notable lesions; (2) notable lesions but uncorrelated to tinnitus; and (3) at least one lesion considered as potential cause of tinnitus. We then constructed a framework consisting of deep learning (DL) neural networks and self-supervised BERT models. A tinnitus domain-specific corpus is used to pre-train the BERT model to further improve its embedding weights. In addition, we conducted an experiment to evaluate multiple groups of max sequence length settings in BERT to reduce the excessive quantity of calculations. After a comprehensive comparison of all metrics, we determined the most promising approach through the performance comparison of F1-scores and AUC values. Results: In the first experiment, the BERT fine-tuned model achieved a more promising result (AUC 0.868, F1 0.760) compared with that of the Word2Vec-based models (AUC 0.767, F1 0.733) on validation data. In the second experiment, the BERT in-domain pre-training model (AUC 0.948, F1 0.841) performed significantly better than the BERT-based model (AUC 0.868, F1 0.760). Additionally, in the variants of BERT fine-tuning models, Mengzi achieved the highest AUC of 0.878 (F1 0.764). Finally, we found that the BERT max-sequence-length of 128 tokens achieved an AUC of 0.866 (F1 0.736), which is almost equal to the BERT max-sequence-length of 512 tokens (AUC 0.868, F1 0.760). Conclusion: In conclusion, we developed a reliable BERT-based framework for tinnitus diagnosis from Chinese radiology reports, along with a sequence adaptation strategy to reduce computational resources while maintaining accuracy. The findings could provide a reference for NLP development in Chinese radiology reports.

  • Preprint Article
  • Cite Count: 3
  • 10.20944/preprints202502.2104.v1
Psychological Health Prediction Based on the Fusion of Structured and Unstructured Data in EHR: a Case Study of Low-Income Populations
  • Feb 26, 2025
  • Preprints.org
  • Shurui Wu + 1 more

In view of the high incidence and complexity of mental health problems among low-income people, existing studies have mostly relied on structured data in electronic health records (EHR), ignoring the potential information contained in rich unstructured data. To effectively predict the mental health status of low-income people, this study uses a structured and unstructured data fusion model based on advanced deep learning technology. First, the BERT (Bidirectional Encoder Representations from Transformers) model is used to perform semantic understanding and feature extraction on the unstructured text data in EHR. Next, TabTransformer (Transformer-based Model for Tabular Data) is used to efficiently encode structured data and capture the complex relationships between data. Finally, through the multimodal fusion mechanism, structured and unstructured features are deeply integrated to form a comprehensive feature representation. In the experiments, the fusion model shows significant improvements in evaluation metrics such as accuracy, with accuracy increasing from 81.3% for the benchmark model to 85.2%. In addition, the cross-dataset generalization test shows that the model maintains good performance stability across different data sources. Overall, this study demonstrates the effectiveness of fusing structured and unstructured data in improving the accuracy of mental health prediction for low-income people, providing strong support for future precision medical interventions.

  • Research Article
  • Cite Count: 2
  • 10.7759/cureus.77342
A Comparative Analysis of Machine-Learning Algorithms for Automated International Classification of Diseases (ICD)-10 Coding in Malaysian Death Records.
  • Jan 12, 2025
  • Cureus
  • Muhammad Naufal B Nordin + 9 more

This study explores machine learning (ML) for automating unstructured textual data translation into structured International Classification of Diseases (ICD)-10 codes, aiming to identify algorithms that enhance mortality data accuracy and reliability for public health decisions. This study analyzed death records from January 2017 to June 2022, sourced from Malaysia's Health Informatics Centre, coded into ICD-10. Data anonymization adhered to ethical standards, with 387,650 death registrations included after quality checks. The dataset, limited to three-digit ICD-10 codes, underwent cleaning and an 80:20 training-testing split. Preprocessing involved HTML tag removal and tokenization. ML approaches, including BERT (Bidirectional Encoder Representations from Transformers), Gzip+KNN (K-Nearest Neighbors), XGBoost (Extreme Gradient Boosting), TensorFlow, SVM (Support Vector Machine), and Naive Bayes, were evaluated for automated ICD-10 coding. Models were fine-tuned and assessed across accuracy, F1-score, precision, recall, specificity, and precision-recall curves using Amazon SageMaker (Amazon Web Services, Seattle, WA). Sensitivity analysis addressed unbalanced data scenarios, enhancing model robustness. In assessing ICD-10 coding with ML, Gzip+KNN had the longest training time at 10 hours, with BERT leading in memory use. BERT performed best for the F1-score (0.71) and accuracy (0.82), closely followed by Gzip+KNN. TensorFlow excelled in recall, whereas SVM had the highest specificity but lower overall performance. XGBoost was notably less effective across metrics. Precision-recall analysis showed Gzip+KNN's superiority. On an unbalanced dataset, BERT and Gzip+KNN demonstrated consistent accuracy. Our study highlights that BERT and Gzip+KNN optimize ICD-10 coding, balancing efficiency, resource use, and accuracy. BERT excels in precision with higher memory demands, while Gzip+KNN offers robust accuracy and recall. 
This suggests significant potential for improving healthcare analytics and decision-making through advanced ML models.
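
The Gzip+KNN approach mentioned above is commonly built on a compression-based distance between texts. The following is a minimal sketch of that general idea, not the study's pipeline: the training texts are invented toy examples (real death records are not reproduced here), and the use of normalized compression distance with a space-joined concatenation is an assumption.

```python
import gzip
from collections import Counter

# Invented toy examples: (cause-of-death text, three-character ICD-10 code).
TRAIN = [
    ("acute myocardial infarction", "I21"),
    ("pneumonia, unspecified organism", "J18"),
    ("malignant neoplasm of bronchus and lung", "C34"),
]


def compressed_len(text):
    """Length in bytes of the gzip-compressed UTF-8 text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a, b):
    """Normalized compression distance: small when a and b share content."""
    ca, cb = compressed_len(a), compressed_len(b)
    cab = compressed_len(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query, train, k=1):
    """Majority label among the k training texts closest to the query."""
    neighbours = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


predicted = knn_predict("acute myocardial infarction", TRAIN, k=1)
```

No model parameters are learned here, which is why such a method needs no GPU; the cost moves to prediction time, since every query is compressed against every training text.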

  • Research Article
  • 10.30865/ijics.v9i2.8968
Naïve Bayes and Bidirectional Algorithm Analysis: Encoder Representations From Transformers (BERT) to Teachers' Learning Services to Students Based on the Website of SMK Multi Karya School
  • Jul 31, 2025
  • The IJICS (International Journal of Informatics and Computer Science)
  • Ismail Sianturi + 2 more

This study compares two algorithms, Naive Bayes and Bidirectional Encoder Representations From Transformers (BERT), for evaluating the performance of education personnel at SMK MULTI KARYA. The study uses manual calculation methods and Python. The results showed that the Naive Bayes algorithm gave very consistent results, with accuracy, precision, and recall values of 76.67% both in manual calculations and with Python. This indicates that the Naive Bayes algorithm is effective in grouping data on the performance of education personnel. Meanwhile, the Bidirectional Encoder Representations From Transformers (BERT) algorithm shows mixed results, while with Python it reaches 12.00%. There are significant differences in recall and precision values between these two calculation methods. Nevertheless, the performance category "Good Performance Staff" remains the most dominant. The difference in results between the manual and Python calculations suggests that Naive Bayes is a more stable and consistent method across different platforms, whereas Bidirectional Encoder Representations From Transformers (BERT) shows flexibility but with smaller variation in results. Therefore, in the context of education performance evaluation, Naive Bayes is more reliable for producing consistent performance categories, while Bidirectional Encoder Representations From Transformers (BERT) can be an alternative with a fairly high level of accuracy but requires further consideration in the interpretation of the results.

  • Research Article
  • Cite Count: 2
  • 10.14419/mhv83077
Review of sentiment analysis in social media using big data: techniques, tools, and frameworks
  • Jun 9, 2025
  • International Journal of Basic and Applied Sciences
  • Srishti Sudhir Patil + 4 more

Sentiment analysis on social media has emerged as a vital research area due to the growing volume of user-generated content and the increasing reliance on data-driven decision-making. The adoption of big data technologies has greatly improved sentiment analysis by enabling the rapid processing of unstructured big data. This review presents an in-depth analysis of sentiment analysis methodologies, covering both conventional machine learning (ML) techniques, such as Naïve Bayes, Support Vector Machines, Decision Trees, and Random Forest, and advanced deep learning (DL) models, including Recurrent Neural Networks, Long Short-Term Memory Networks, Convolutional Neural Networks, and Transformer-based architectures like Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT). Furthermore, it examines big data frameworks like Hadoop, Apache Spark, and Apache Flink, along with Natural Language Processing (NLP) tools such as the Natural Language Toolkit (NLTK), spaCy, TextBlob, and Stanford NLP. The paper also discusses ML/DL frameworks like Scikit-learn, TensorFlow, PyTorch, and Keras, along with cloud and edge computing solutions like Google Cloud Artificial Intelligence (AI), Amazon Web Services (AWS) Comprehend, and Edge AI (NVIDIA Jetson). Despite technological advancements, several challenges persist, including issues related to data quality, real-time processing limitations, multilingual analysis complexities, and ethical concerns regarding bias and privacy. The field is also witnessing promising developments, such as Explainable Artificial Intelligence (XAI), federated learning, edge computing, and quantum computing, which offer new directions for future research and practical implementations. This review provides researchers and professionals with valuable insights, outlining potential improvements in sentiment analysis techniques to enhance accuracy, scalability, and ethical considerations across various sectors, including business, healthcare, and smart manufacturing.

  • Research Article
  • Cite Count: 1
  • 10.25126/jtiik.2024119096
Comparative Analysis of BERT and XLNet Models for Classifying Bullying Tweets on Twitter (Analisis Perbandingan Model Bert Dan Xlnet Untuk Klasifikasi Tweet Bully Pada Twitter)
  • Dec 10, 2024
  • Jurnal Teknologi Informasi dan Ilmu Komputer
  • Teuku Radillah + 2 more

The phenomenon of bullying on social media, particularly on Twitter, has become an increasingly concerning issue with significant impacts on users' mental health. To address this issue, automatic detection of tweets containing bullying content is crucial. This study aims to compare the performance of two recent natural language processing models, namely BERT (Bidirectional Encoder Representations from Transformers) and XLNet, in the classification of tweets containing bullying. The research methodology involves collecting a dataset of tweets that have been labelled as bullying or non-bullying. Text preprocessing is done to clean and prepare the data before it is used in model training. Both models, BERT and XLNet, were trained and tested using the same dataset. Performance evaluation was conducted using accuracy, precision, recall, and F1-score metrics. The results show that both models have a good ability to identify bullying tweets, but XLNet shows superior performance compared to BERT, with an accuracy rate of 95%, precision = 100%, recall = 0.87%, and F1-score = 0.88%. XLNet is able to capture more complex context and language nuances in tweets, which contributes to higher classification accuracy. This research makes an important contribution to the field of bullying detection on social media by showing that the use of the XLNet model is more effective than BERT. These findings can help platforms like Twitter identify and prevent bullying content, thereby creating a safer online environment for users, and can be used as a basis for the development of more sophisticated and efficient bullying detection systems in the future.
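
For reference, the precision, recall, and F1-score metrics named in this abstract are conventionally derived from confusion-matrix counts. The sketch below shows the standard definitions; the counts are invented for illustration and are not taken from the study.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) as proportions in [0, 1]."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts: 40 bullying tweets flagged correctly, 10 flagged wrongly, 10 missed.
precision, recall, f1 = precision_recall_f1(40, 10, 10)
```

All three values are proportions between 0 and 1, and are often multiplied by 100 when reported as percentages.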

  • Conference Article
  • Cite Count: 10
  • 10.1109/hnicem54116.2021.9731956
Classification of Fire Related Tweets on Twitter Using Bidirectional Encoder Representations from Transformers (BERT)
  • Nov 28, 2021
  • Jairus Mingua + 2 more

Bidirectional Encoder Representations from Transformers (BERT) is a transfer learning approach in natural language processing (NLP). BERT is available as several pre-trained language-representation models that can be fine-tuned on a task-specific dataset, such as one for text classification, to produce state-of-the-art predictions. Recent studies on the use of BERT in natural language processing have highlighted that there are no publicly available Filipino tweet datasets regarding fire reports on social media, which has led to a lack of classification models. This paper aims to design and implement a system to classify Filipino tweets using different pre-trained BERT models. After building a model specifically for classifying Filipino tweets, using a dataset of 2,081 tweets that contains fire-related tweets, the researchers compared the accuracy of the different fine-tuned pre-trained BERT models. The data show a significant difference in accuracy between the pre-trained BERT models. The most accurate is the BERT Base Uncased WWM model, with a test accuracy of 87.50% and a training loss of 0.06. The least accurate among the pre-trained BERT models is the BERT Base Cased WWM model, with a test accuracy of 76.34% and a training loss of 0.2. The results show that the BERT Base Uncased WWM model can be a reliable model for classifying fire-related tweets. The accuracy obtained by the models may vary depending on the size of the dataset.

  • Research Article
  • 10.62527/joiv.9.2.2853
Performance Improvement of Cosine Similarity Algorithm with Bidirectional Encoder Representations from Transformers on Abstract Document Similarity Detection
  • Mar 31, 2025
  • JOIV : International Journal on Informatics Visualization
  • Musthofa Galih Pradana + 4 more

In thesis courses or final projects, students are required to conduct research in their field of study, find innovations, solve problems, and foster a critical culture and mindset. However, an issue that is often encountered is plagiarism: taking someone else's work, which may be in the form of an opinion, and presenting it as one's own. One way technology can be applied is early detection of similarity among documents written by students. In this case, the document to be checked is the abstract that students must submit with a thesis title. The algorithm used is cosine similarity, which is computationally efficient, easy to interpret, and compatible with large-scale data. The research used two approaches: with and without Bidirectional Encoder Representations from Transformers (BERT). The corpus consisted of 1,450 student thesis abstracts, with 10 documents used for testing the performance of the cosine similarity algorithm in detecting abstract similarity. The results showed that the approach optimized with Bidirectional Encoder Representations from Transformers (BERT) had better results, with an average performance improvement of 23.48%.
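
Cosine similarity, the measure this abstract relies on, can be sketched with plain term-frequency vectors. This is a minimal illustration of the variant without BERT; the tokenizer is an assumption, and in the BERT variant the term-frequency vectors would presumably be replaced by dense embeddings with the same cosine formula applied on top.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term frequencies (lowercased alphanumeric tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


abstract_a = term_vector("Sentiment analysis of tweets using BERT")
abstract_b = term_vector("Sentiment analysis of reviews using Naive Bayes")
similarity = cosine_similarity(abstract_a, abstract_b)
```

A score near 1 flags a pair of abstracts for manual review, while a score near 0 means they share almost no vocabulary. Term-frequency vectors miss paraphrases, which is the gap contextual embeddings are meant to close.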

  • Research Article
  • Cite Count: 17
  • 10.1016/j.artmed.2024.102889
Oversampling effect in pretraining for bidirectional encoder representations from transformers (BERT) to localize medical BERT and enhance biomedical BERT
  • May 5, 2024
  • Artificial Intelligence In Medicine
  • Shoya Wada + 6 more

Background: Pretraining large-scale neural language models on raw texts has made a significant contribution to improving transfer learning in natural language processing. With the introduction of transformer-based language models, such as bidirectional encoder representations from transformers (BERT), the performance of information extraction from free text has improved significantly in both the general and medical domains. However, it is difficult to train specific BERT models to perform well in domains for which few databases of a high quality and large size are publicly available. Objective: We hypothesized that this problem could be addressed by oversampling a domain-specific corpus and using it for pretraining with a larger corpus in a balanced manner. In the present study, we verified our hypothesis by developing pretraining models using our method and evaluating their performance. Methods: Our proposed method was based on the simultaneous pretraining of models with knowledge from distinct domains after oversampling. We conducted three experiments in which we generated (1) English biomedical BERT from a small biomedical corpus, (2) Japanese medical BERT from a small medical corpus, and (3) enhanced biomedical BERT pretrained with complete PubMed abstracts in a balanced manner. We then compared their performance with those of conventional models. Results: Our English BERT pretrained using both general and small medical domain corpora performed sufficiently well for practical use on the biomedical language understanding evaluation (BLUE) benchmark. Moreover, our proposed method was more effective than the conventional methods for each biomedical corpus of the same corpus size in the general domain. Our Japanese medical BERT outperformed the other BERT models built using a conventional method for almost all the medical tasks. The model demonstrated the same trend as that of the first experiment in English. Further, our enhanced biomedical BERT model, which was not pretrained on clinical notes, achieved superior clinical and biomedical scores on the BLUE benchmark with an increase of 0.3 points in the clinical score and 0.5 points in the biomedical score. These scores were above those of the models trained without our proposed method. Conclusions: Well-balanced pretraining using oversampling instances derived from a corpus appropriate for the target task allowed us to construct a high-performance BERT model.
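
The balancing idea described above can be sketched as simple replication oversampling. This is an illustrative assumption about how a small domain corpus might be balanced against a larger general corpus, not the authors' exact procedure; the function name and the toy documents are hypothetical.

```python
import random


def oversample_to_match(small_corpus, large_corpus, seed=0):
    """Repeat the small corpus until it matches the large one, then mix both.

    Returns a shuffled list in which each corpus contributes the same number
    of documents (hypothetical helper; simple replication, no augmentation).
    """
    small, large = list(small_corpus), list(large_corpus)
    if not small:
        raise ValueError("small_corpus must not be empty")
    rng = random.Random(seed)
    repeats, remainder = divmod(len(large), len(small))
    # Whole copies first, then a random partial copy to reach the exact size.
    oversampled = small * repeats + rng.sample(small, remainder)
    mixed = oversampled + large
    rng.shuffle(mixed)
    return mixed


medical_docs = ["med-1", "med-2"]                    # small domain corpus (toy)
general_docs = [f"gen-{i}" for i in range(5)]        # larger general corpus (toy)
balanced = oversample_to_match(medical_docs, general_docs)
```

The mixed list gives the model equal exposure to both domains during pretraining, at the cost of seeing each domain document several times.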

  • Book Chapter
  • Cite Count: 8
  • 10.1007/978-3-030-79757-7_11
Natural Language Processing with “More Than Words – BERT”
  • Jan 1, 2021
  • Saranlita Chotirat + 1 more

Question answering (QA) has become one of the most popular natural language processing (NLP) and information retrieval applications. For use in QA systems, this paper presents a question classification technique based on NLP and Bidirectional Encoder Representations from Transformers (BERT). We experimentally investigated BERT for question classification on the TREC-6 dataset and a Thai sentence dataset. We propose an improved processing technique called “More Than Words – BERT” (MTW-BERT), which uses special NLP annotation tags that combine part-of-speech tagging and named entity recognition, so that the model learns both the pattern of the grammatical tag sequence and the recognized entities together as input before classifying text with the BERT model. Experimental results showed that MTW-BERT outperformed existing classification methods and achieved new state-of-the-art performance on question classification for the TREC-6 dataset, with 99.20%. In addition, MTW-BERT was applied to question classification for Thai sentences in the wh-question category, where it achieved an accuracy of 87.50%.
Keywords: Classification; BERT-based model; NLP Tagging; Analysis Thai Sentence

  • Research Article
  • 10.15294/7h63ma50
Sentiment Analysis on Twitter Social Media Regarding Covid-19 Vaccination with Naive Bayes Classifier (NBC) and Bidirectional Encoder Representations from Transformers (BERT)
  • Sep 30, 2024
  • Recursive Journal of Informatics
  • Angga Riski Dwi Saputra + 1 more

Abstract. The Covid-19 vaccine is an important tool for stopping the Covid-19 pandemic; however, public opinion on the vaccine is divided. Purpose: The public has voiced its responses in many ways, including through social media such as Twitter. Responses to Covid-19 vaccination can be analyzed and categorized as having positive, neutral, or negative sentiment. Methods: In this study, sentiment analysis of Twitter posts about Covid-19 vaccination was carried out using the Naïve Bayes Classifier (NBC) and Bidirectional Encoder Representations from Transformers (BERT) algorithms. The data consist of 29,447 public English-language tweets about Covid-19 vaccination. Result: Sentiment analysis begins with preprocessing of the dataset (data normalization and data cleaning) before classification. Word vectorization was then performed with TF-IDF, and the data were classified using the NBC and BERT algorithms. The classification achieved an accuracy of 73% for NBC and 83% for BERT. Novelty: A direct comparison between classical models such as NBC and modern deep learning models such as BERT offers new insights into the advantages and disadvantages of both approaches in processing Twitter data. Additionally, this study proposes temporal sentiment analysis, which allows changes in public sentiment regarding vaccination to be evaluated over time. Another innovation is a hybrid approach to data cleansing that combines traditional methods with the natural language processing capabilities of BERT, which more effectively addresses typical Twitter data issues such as slang and spelling errors. Finally, this research also expands sentiment classification to be multi-label, identifying more specific sentiment categories such as trust, fear, or doubt, which provides a deeper understanding of public opinion.
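
As a rough illustration of the classical half of this comparison, the sketch below weights terms with TF-IDF and fits a multinomial naive Bayes classifier in plain Python. It is not the authors' pipeline: the four training texts are invented, only two sentiment classes are used instead of three, and the smoothed IDF formula and `alpha` smoothing value are assumptions.

```python
import math
from collections import Counter, defaultdict


def fit_tfidf(docs):
    """Return one {term: tf-idf weight} dict per tokenized document and the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf (an assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def train_nb(vectors, labels, alpha=1.0):
    """Multinomial naive Bayes on tf-idf weights with additive smoothing."""
    vocab = {t for v in vectors for t in v}
    weight = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        weight[label].update(vec)  # accumulates the tf-idf mass per class
    priors = {y: math.log(c / len(labels)) for y, c in Counter(labels).items()}
    loglik = {}
    for y in priors:
        total = sum(weight[y].values()) + alpha * len(vocab)
        loglik[y] = {t: math.log((weight[y][t] + alpha) / total) for t in vocab}
    return priors, loglik


def predict(doc, idf, priors, loglik):
    """Score each class; terms never seen in training are ignored."""
    vec = {t: tf * idf[t] for t, tf in Counter(doc).items() if t in idf}
    scores = {
        y: priors[y] + sum(w * loglik[y][t] for t, w in vec.items())
        for y in priors
    }
    return max(scores, key=scores.get)


# Invented toy data; the study itself used 29,447 tweets.
train = [
    ("great vaccine safe", "positive"),
    ("vaccine works well", "positive"),
    ("scared of side effects", "negative"),
    ("vaccine is dangerous", "negative"),
]
docs = [text.split() for text, _ in train]
labels = [label for _, label in train]

vectors, idf = fit_tfidf(docs)
priors, loglik = train_nb(vectors, labels)

pred_pos = predict("safe and works".split(), idf, priors, loglik)
pred_neg = predict("dangerous side effects".split(), idf, priors, loglik)
print(pred_pos, pred_neg)
```

A bag-of-words model like this scores each term independently of its context, which is one plausible reason for the accuracy gap between NBC and BERT reported in the abstract.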

  • Research Article
  • 10.17485/ijst/v19i6.1167
Automatic Short Answer Grading for Enhancing Educational Assessment Using BERT and USE Models
  • Feb 22, 2026
  • Indian Journal Of Science And Technology
  • Chandralika Chakraborty + 2 more

Objectives: This work explores the application of two state-of-the-art models, BERT (Bidirectional Encoder Representations from Transformers) and USE (Universal Sentence Encoder), to automate the grading of short answers. Methods: The research uses the HP:SAS dataset, which contains responses graded manually by two human evaluators. The student responses and model answers for question 1 and question set 6 are processed with the BERT and USE models, and scores are generated from the cosine similarity between student answers and predefined model answers. Findings: The work demonstrates that although BERT and USE embeddings can effectively capture contextual and semantic similarity, their performance depends heavily on the function that generates the score. Our findings reveal that a non-linear mapping function mimics human grading more closely than a linear mapping function. Such a function improves accuracy and reduces error, as measured by the Pearson correlation coefficient (0.67) and RMSE (0.617), respectively. Notably, longer responses achieved higher Pearson correlations (0.67) than shorter answers (0.59). The results highlight usability and selection considerations for BERT and USE in automatic short answer grading (ASAG), contributing to the understanding of their application across various answers. We conclude that a weighted ensemble method combining BERT and USE with a subject-specific strictness parameter (k) provides a robust framework for automated assessment. Novelty: This work evaluates and compares two deep learning models for automatic short answer grading, a scarcely explored area. A novel contribution is the granular analysis across different scoring ranges for two question sets of the dataset. The novelty of this work lies in the transition from linear scoring to a non-linear mapping framework. This approach introduces a tunable sigmoid-based ensemble intended to replicate human assessment. Finally, a comparison with existing studies shows that research in this area is very limited. Keywords: Bidirectional Encoder Representations from Transformers (BERT), Universal Sentence Encoder (USE), Transformer, word embedding, Non-linear mapping, deep learning, short answer grading
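
The scoring idea described here (cosine similarity passed through a tunable sigmoid, then combined in a weighted ensemble) can be sketched as follows. The abstract does not give the exact mapping, so the logistic form, the `midpoint`, `max_score`, and `weight_a` parameters, and the toy vectors are all assumptions; only the strictness parameter `k` is named in the source.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sigmoid_score(similarity, k=10.0, midpoint=0.5, max_score=3.0):
    """Map a similarity to a grade with a logistic curve.

    k controls strictness: a larger k makes the curve steeper around the
    midpoint. The logistic form and defaults are assumed for illustration.
    """
    return max_score / (1.0 + math.exp(-k * (similarity - midpoint)))


def ensemble_score(sim_a, sim_b, weight_a=0.5, **mapping):
    """Blend two similarity values (e.g. from two encoders), then map to a grade."""
    blended = weight_a * sim_a + (1.0 - weight_a) * sim_b
    return sigmoid_score(blended, **mapping)


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
model_answer = [0.2, 0.7, 0.1]
student_answer = [0.25, 0.6, 0.2]

sim = cosine_similarity(model_answer, student_answer)
grade = ensemble_score(sim, sim, weight_a=0.6, k=12.0)
print(round(sim, 3), round(grade, 3))
```

Unlike a linear mapping, the logistic curve compresses differences near the extremes and stretches them near the midpoint, which is one way a scoring function can be tuned toward stricter or more lenient grading.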
