Abstract

This study aims to improve multiclass classification of biomedical texts on cardiovascular diseases by combining two feature representation methods, bag-of-words (BoW) and word embeddings (WE). To hybridize the two representations, we investigated a set of statistical weighting schemes to combine with each element of the WE vectors: term frequency (TF), inverse document frequency (IDF), and class probability (CP). We then built a multiclass classification model using a bidirectional long short-term memory (BLSTM) network with deep neural networks for all investigated feature vector combinations. We used the MIMIC-III and PubMed datasets to develop the language model. To evaluate our weighted feature representation approaches, we conducted a series of experiments comparing the multiclass classification performance of the deep neural network model against other state-of-the-art machine learning (ML) approaches. In all experiments, we used the OHSUMED-400 dataset, which contains PubMed abstracts, each labeled with exactly one of 23 cardiovascular disease categories. We then present the experimental results and compare them with related research in the literature. The experiments showed that our BLSTM model with the weighting techniques outperformed the baseline and the other machine learning approaches in terms of validation accuracy, and our model also surpassed the scores reported in related studies. This study shows that weighted feature representation improves the performance of multiclass classification.
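The abstract describes weighting each word-embedding vector by a statistical term weight such as TF-IDF before feeding it to the classifier. A minimal sketch of that idea is shown below; the function name, the corpus layout, and the smoothed-IDF formula are illustrative assumptions, not the paper's exact implementation.

```python
import math
from collections import Counter

def tfidf_weighted_embedding(doc_tokens, corpus, embeddings, dim):
    """Weight each word's embedding by its TF-IDF score and average
    the weighted vectors into a single document representation.

    doc_tokens : list of tokens in the document to represent
    corpus     : list of tokenized documents (used to compute IDF)
    embeddings : dict mapping token -> embedding vector (list of floats)
    dim        : dimensionality of the embedding vectors
    """
    n_docs = len(corpus)
    tf = Counter(doc_tokens)
    vec = [0.0] * dim
    for token, count in tf.items():
        if token not in embeddings:
            continue  # out-of-vocabulary tokens are skipped
        df = sum(1 for doc in corpus if token in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF (assumption)
        weight = (count / len(doc_tokens)) * idf      # TF * IDF
        for i in range(dim):
            vec[i] += weight * embeddings[token][i]
    return [v / len(tf) for v in vec]
```

The same scheme extends to the other weights the study mentions by replacing the TF-IDF factor with, for example, a class-probability weight per term.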

Highlights

  • The task of text classification (TC) has evolved during the last decade to become one of the most interesting fields in machine learning

  • Afterwards, we evaluated the representations with a series of experiments using different machine learning algorithms, and we showed that our weighted feature representation method worked well in multiclass classification of biomedical texts

  • We present the results of our bidirectional long short-term memory (BLSTM) model with both the original baseline FastText


Introduction

The task of text classification (TC) has evolved during the last decade to become one of the most interesting fields in machine learning. The more efficient the feature representation method, the better the classifier can discover patterns in the data [1,2]. Bag-of-words (BoW) and word embeddings (WE) are the two most commonly used feature representation techniques in TC. Both are powerful in classification systems, but their working mechanisms differ. Bag-of-words (BoW) is a technique that represents the whole text, whether documents or sentences, as a list of words. These words are stored in a matrix to be calculated
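The BoW representation described above can be sketched as follows: each document becomes a row of word counts over a shared vocabulary. This is a minimal illustrative sketch, not the paper's preprocessing pipeline.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and a document-term count matrix.

    docs : list of tokenized documents (lists of strings)
    Returns (vocab, matrix), where matrix[i][j] counts how often
    vocab[j] occurs in docs[i].
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return vocab, matrix
```

Unlike word embeddings, this representation discards word order and semantic similarity, which is why the study combines the two.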

