fastText Word Embedding Research Articles

multiple-choices, regarding students’ learning achievement. When the number of students in a class is large, however, examinations using essay questions become hard to conduct and take a long time to evaluate. Automatic essay evaluation has therefore become a promising approach in this situation. Various methods have been proposed; however, optimal solutions for such evaluation in the Indonesian language are less well known. Furthermore, with the rapid development of machine learning approaches, in particular deep learning approaches, the investigation of such optimal solutions becomes more necessary.

Method: To address this issue, this study investigated text representation approaches for optimal automatic evaluation of Indonesian essay answers. The investigation compared pre-trained word embedding methods (Word2vec, GloVe, FastText, and RoBERTa) and text encoding methods (long short-term memory networks (LSTMs) and transformers). LSTMs are able to capture temporal semantics by employing state variables, while transformers are able to capture long-term dependencies between parts of their input sequences. Additionally, we investigated classification-based and similarity-based training to build text encoders. We expected that these training approaches would allow encoders to extract different views of the information. We compared classification results produced by different text encoders and combinations of text encoders.

Result: We evaluated various text representation approaches using the UKARA dataset. Our experiments showed that the FastText word embedding method outperformed the Word2vec, GloVe, and RoBERTa methods. The FastText method achieved an F1-score of 75.43% on validation sets, while the Word2vec, GloVe, and RoBERTa methods achieved F1-scores of 69.56%, 74.53%, and 72.87%, respectively. In addition, the experiments showed that combinations of text encoders outperformed individual encoders. The combination of the LSTM encoder, the transformer encoder, and the TF-IDF encoder obtained an F1-score of 77.22% in the best case, which is better than the best F1-scores of the individual LSTM encoders (75.35%), the best combination of transformer encoders (71.49%), and the individual TF-IDF encoder (76.69%). We observed that LSTM encoders performed better when they were built using classification-based training, whereas the transformer encoders performed better when built using similarity-based training.

Novelty: The novelty proposed in this research is the optimal combination of text encoders specifically constructed for the evaluation of essay answers in the Indonesian language. Our experiments showed that the combination of three encoders, namely the LSTM encoder built using classification-based training, the transformer encoder built using classification-based and similarity-based training, and the TF-IDF encoder, obtained the best classification performance.
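The abstract does not say how the encoder outputs are combined. As a rough illustration only, the sketch below assumes the simplest option, concatenating one feature vector per encoder, and computes the TF-IDF view with a common smoothed IDF variant. The helper names (`tfidf_vectors`, `combine`), the toy answers, and the stand-in LSTM and transformer vectors are all hypothetical and are not taken from the study or from the UKARA dataset.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of TF-IDF vectors) for non-empty tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (one common variant, an assumption here).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Term frequency is the token count divided by the document length.
        vectors.append([counts[tok] / len(doc) * idf[tok] for tok in vocab])
    return vocab, vectors

def combine(*views):
    """Concatenate several feature vectors describing the same answer."""
    combined = []
    for view in views:
        combined.extend(view)
    return combined

# Toy tokenized answers (hypothetical).
answers = [["air", "bersih", "penting"], ["air", "kotor"]]
vocab, tfidf = tfidf_vectors(answers)

# Stand-ins for the outputs of trained LSTM and transformer encoders (hypothetical values).
lstm_out = [[0.1, 0.2], [0.3, 0.4]]
transformer_out = [[0.5, 0.6, 0.7], [0.8, 0.9, 1.0]]

# One combined feature vector per answer, ready for a downstream classifier.
features = [combine(l, t, s) for l, t, s in zip(lstm_out, transformer_out, tfidf)]
print(len(features[0]))
```

Each combined vector simply stacks the three views side by side, so a classifier trained on it can draw on whichever view is informative for a given answer.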

Background: Electronic health records (EHRs) contain valuable information for clinical research; however, the sensitive nature of healthcare data presents security and confidentiality challenges. De-identification is therefore essential to protect personal data in EHRs and comply with government regulations. Named entity recognition (NER) methods have been proposed to remove personal identifiers, with deep learning-based models achieving better performance. However, manual annotation of training data is time-consuming and expensive. The aim of this study was to develop an automatic de-identification pipeline for all kinds of clinical documents based on a distantly supervised method, in order to significantly reduce the cost of manual annotation and to facilitate the transfer of the de-identification pipeline to other clinical centers.

Methods: We proposed an automated annotation process for French clinical de-identification, exploiting data from the eHOP clinical data warehouse (CDW) of the CHU de Rennes and national knowledge bases, as well as other features. In addition, this paper proposes an assisted data annotation solution using the Prodigy annotation tool. This approach aims to reduce the cost of creating a reference corpus for the evaluation of state-of-the-art NER models. Finally, we evaluated and compared the effectiveness of different NER methods.

Results: A French de-identification dataset was developed in this work, based on EHRs provided by the eHOP CDW at Rennes University Hospital, France. The dataset was rich in personal information, and the distribution of entities was quite similar in the training and test datasets. We evaluated a Bi-LSTM + CRF sequence labeling architecture, combined with Flair + FastText word embeddings, on a test set of manually annotated clinical reports. The model outperformed the other tested models with an F1 score of 96.96%, demonstrating the effectiveness of our automatic approach for de-identifying sensitive information.

Conclusions: This study provides an automatic de-identification pipeline for clinical notes, which can facilitate the reuse of EHRs for secondary purposes such as clinical research. Our study highlights the importance of using advanced NLP techniques for effective de-identification, as well as the need for innovative solutions such as distant supervision to overcome the challenge of limited annotated data in the medical domain.
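To make the last step of such a pipeline concrete, the sketch below shows how the output of a sequence labeler could be turned into de-identified text. It assumes BIO-style tags, which the abstract does not specify; the function name `mask_entities`, the labels `NAME`, `CITY`, and `DATE`, and the example sentence are hypothetical and do not come from the study or its dataset.

```python
def mask_entities(tokens, tags):
    """Replace each tagged entity span with a single <LABEL> placeholder."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    output = []
    previous_label = None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Tokens outside any entity are kept unchanged.
            output.append(token)
            previous_label = None
            continue
        prefix, _, label = tag.partition("-")
        # A B- tag, or an I- tag that does not continue the previous span,
        # starts a new placeholder; continuation tokens are dropped.
        if prefix == "B" or label != previous_label:
            output.append(f"<{label}>")
        previous_label = label
    return output

# Invented example sentence with hypothetical tags, as a tagger might produce them.
tokens = ["Patient", "Jean", "Dupont", "admitted", "to", "Rennes", "on", "12/03/2021"]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "B-CITY", "O", "B-DATE"]

masked = " ".join(mask_entities(tokens, tags))
print(masked)
```

Collapsing a multi-token span into one placeholder hides how many tokens the original identifier had, which is one possible design choice; a pipeline could instead substitute realistic surrogate values.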

Articles published on fastText Word Embedding

Enhancing Sentiment Analysis of Indonesian Tourism Video Content Commentary on TikTok: A FastText and Bi-LSTM Approach

Deepfake Detection on Social Media: Leveraging Deep Learning and Fast Text Embeddings for Identifying Machine-Generated Tweets

CONTENT-BASED FILTERING CULINARY RECOMMENDATION SYSTEM USING DEEP CONVOLUTIONAL NEURAL NETWORK ON TWITTER (X)

Enhancing misogyny detection in bilingual texts using explainable AI and multilingual fine-tuned transformers

Ensemble based high performance deep learning models for fake news detection

Enhanced automated text categorization via Aquila optimizer with deep learning for Arabic news articles

Service similarity measurement integrating Bi-LSTM contextual representation and attention mechanism for web service discovery

Comparison of Deep Learning and Machine Learning Model for Phishing Email Classification

Combining Multiple Text Representations for Improved Automatic Evaluation of Indonesian Essay Answers

Abusive Language Detection in Khasi Social Media Comments

Analyzing Reddit Data: Hybrid Model for Depression Sentiment using FastText Embedding

A deep learning approach for Named Entity Recognition in Urdu language.

Multi-class hate speech detection in the Norwegian language using FAST-RNN and multilingual fine-tuned transformers

Construction and analysis of uncertainty indices based on multilingual text representations

Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models

AntiBP3: A Method for Predicting Antibacterial Peptides against Gram-Positive/Negative/Variable Bacteria.

Political Optimization Algorithm with a Hybrid Deep Learning Assisted Malicious URL Detection Model

AIPs-SnTCN: Predicting Anti-Inflammatory Peptides Using fastText and Transformer Encoder-Based Hybrid Word Embedding with Self-Normalized Temporal Convolutional Networks.

Low-Resource Language Information Processing using Dwarf Mongoose Optimization with Deep Learning Based Sentiment Classification

Hate Speech Detection in Indonesia Twitter Comments Using Convolutional Neural Network (CNN) and FastText Word Embedding
