Abstract
Classifying biomedical text documents is an essential task within Natural Language Processing (NLP), a field whose text-classification applications range from sentiment analysis to authorship identification. Despite advances in traditional machine-learning algorithms such as Support Vector Machines (SVM) and Logistic Regression, challenges such as data sparsity and high dimensionality persist, and recent years have seen a surge in the use of deep learning models to mitigate them. This study conducts a comparative analysis of machine-learning algorithms for classifying biomedical text documents. It uses the "Medical Text Dataset - Cancer Doc Classification" from Kaggle, which comprises 7,570 biomedical text documents labeled with one of three cancer types (colon, lung, or thyroid). A preprocessing pipeline of tokenization, stop-word removal, and Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is applied. Logistic Regression, SVM, and Multinomial Naive Bayes are evaluated with 5-fold cross-validation, using accuracy, precision, recall, F1 score, and area under the ROC curve (AUC ROC) as performance metrics. Logistic Regression outperforms the other algorithms, with an accuracy of 78.3% and an AUC ROC of 88.59%; SVM and Multinomial Naive Bayes score lower on these metrics. Hyperparameter tuning further improves the performance of the algorithms, particularly Logistic Regression. The study contributes a systematic comparison of machine-learning algorithms for biomedical text classification. Logistic Regression emerges as the most effective of the three, which underlines the importance of algorithm selection and hyperparameter tuning in machine-learning applications within this domain.
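The preprocessing steps named in the abstract (tokenization, stop-word removal, TF-IDF vectorization) can be illustrated with a minimal, standard-library-only sketch. It is not the study's implementation: the stop-word list, the three sample sentences, and the function names `tokenize` and `tfidf` are assumptions made for illustration, and the smoothed IDF formula shown here is one common variant rather than a formula stated in the source.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the study's list is not given).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "was", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = raw term count * smoothed IDF, where
    IDF = ln((1 + N) / (1 + df)) + 1, followed by L2 normalisation
    of each document vector.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Hypothetical sample sentences, not taken from the Kaggle dataset.
docs = [
    "Thyroid cancer was detected in the thyroid gland.",
    "Lung cancer is associated with smoking.",
    "Colon cancer screening and colon polyps.",
]
vectors = tfidf(docs)
```

In this toy corpus "cancer" appears in every document, so its IDF is the minimum value and it receives a lower weight than class-specific terms such as "thyroid". That down-weighting of ubiquitous terms is what makes TF-IDF features useful to the linear classifiers compared in the study.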