Topic Modeling Technique for Text Mining Over Biomedical Text Corpora Through Hybrid Inverse Documents Frequency and Fuzzy K-Means Clustering

Junaid Rashid, Muhammad Shafiq, Muhammad Wasif Nisar, Syed Muhammad Adnan Shah, Aun Irtaza, Akber Gardezi, Toqeer Mahmood

doi:10.1109/access.2019.2944973

Abstract

Text data plays an important role in the biomedical domain, as patient data comprises a huge number of text documents in non-standardized formats. These documents pose many challenges for data processing when relevant data must be extracted. Topic modeling is a popular technique for theme-based information retrieval from biomedical documents, but discovering precise topics in such documents is a challenging task. Furthermore, redundancy in biomedical text documents has a negative impact on the quality of text mining. The rapid growth of unstructured documents therefore calls for machine learning techniques for topic modeling that can discover precise topics. In this paper, we propose a topic modeling technique for text mining that combines a hybrid inverse document frequency with the fuzzy k-means clustering algorithm. The proposed technique mitigates the redundancy issue and discovers precise topics from biomedical text documents. It generates local and global term frequencies through the bag-of-words (BOW) model. Global term weighting is calculated with the proposed hybrid inverse document frequency, and local term weighting is computed with term frequency. Robust principal component analysis is used to remove the negative impact of high dimensionality on the global term weights. Afterward, classification and clustering for text mining are performed using the probabilities of topics in the documents. Classification is performed with a discriminant analysis classifier, whereas clustering is done with k-means clustering. Clustering performance is evaluated with the Calinski-Harabasz (CH) index, an internal validation method. The proposed topic modeling technique is evaluated on six standard datasets, namely Ohsumed, MuchMore Springer Corpus, GENIA corpus, Bioxtext, tweets, and WSJ redundant corpus. It achieves higher classification and clustering performance in text mining than baseline topic models such as FLSA, LDA, and LSA. Moreover, its execution time remains stable for different numbers of topics.
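As a rough illustration of the weighting and clustering steps named in the abstract, the following Python sketch builds bag-of-words counts, applies term frequency as the local weight and a plain log(N/df) inverse document frequency as the global weight, and then runs a basic fuzzy k-means (fuzzy c-means) loop to obtain soft cluster memberships. The abstract does not give the hybrid inverse document frequency formula, so the standard IDF is used as a stand-in, and the robust PCA step is omitted. All function names, the fuzzifier m = 2, and the choice to seed the centers with the first documents are assumptions made for this example.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Tokenise on whitespace and return (vocabulary, per-document term counts)."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    counts = []
    for doc in tokenised:
        freq = Counter(doc)
        counts.append([freq[term] for term in vocab])
    return vocab, counts


def tf_idf(counts):
    """Local weight = relative term frequency; global weight = log(N / df).

    Assumption: standard IDF stands in for the paper's hybrid IDF.
    """
    n_docs, n_terms = len(counts), len(counts[0])
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df) for df in doc_freq]  # df >= 1 by construction
    weights = []
    for row in counts:
        total = sum(row)
        weights.append([row[j] / total * idf[j] if total else 0.0
                        for j in range(n_terms)])
    return weights


def memberships(point, centers, m=2.0):
    """Fuzzy membership of one point in every cluster; the values sum to 1."""
    dists = [math.dist(point, c) for c in centers]
    on_center = [d == 0.0 for d in dists]
    if any(on_center):  # the point sits exactly on one or more centers
        return [flag / sum(on_center) for flag in on_center]
    inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_kmeans(points, n_clusters, m=2.0, n_iters=50):
    """Alternate membership and center updates for a fixed number of iterations.

    Assumption: the first n_clusters points seed the centers.
    """
    centers = [list(p) for p in points[:n_clusters]]
    dim = len(points[0])
    for _ in range(n_iters):
        u = [memberships(p, centers, m) for p in points]
        for c in range(n_clusters):
            w = [row[c] ** m for row in u]
            centers[c] = [sum(wi * p[j] for wi, p in zip(w, points)) / sum(w)
                          for j in range(dim)]
    return centers, [memberships(p, centers, m) for p in points]


if __name__ == "__main__":
    docs = ["gene protein gene", "tumor cell tumor",
            "protein gene", "cell tumor cell"]
    vocab, counts = bag_of_words(docs)
    weights = tf_idf(counts)
    _, u = fuzzy_kmeans(weights, n_clusters=2)
    for doc, row in zip(docs, u):
        print(doc, [round(x, 3) for x in row])
```

Each row of memberships can be read as a soft assignment of a document to clusters, which is the sense in which fuzzy clustering yields a probability-like topic distribution per document.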

Highlights

The tremendous amount of biomedical text documents is a valuable source of information in the biomedical field. Biomedical documents are characterized by an extensive amount of disorganized and infrequent information in a wide variety of forms, such as medical documents, scientific papers, electronic health records, case report summaries, and so forth.
Topic modeling is an efficient technique for biomedical text mining, but it needs improvement because biomedical text documents contain redundant words [9], and redundancy has a negative impact on topic modeling and text mining [10].
The number of biomedical text documents is continuously increasing, and analyzing these documents is very important for discovering valuable information.

Summary

Introduction

The associate editor coordinating the review of this manuscript and approving it for publication was Kun Wang.

The tremendous amount of biomedical text documents is a valuable source of information in the biomedical field. Biomedical documents are characterized by an extensive amount of disorganized and infrequent information in a wide variety of forms, such as medical documents, scientific papers, electronic health records, case report summaries, and so forth. Topic modeling techniques help extract unknown topics from huge collections of documents and available articles [3] and discover the topic distribution of every document. Topic models discover topics from documents, where each topic is represented by a distribution over words. Latent Dirichlet Allocation (LDA) finds the probabilities that predict a posterior distribution of words and topics from the input text corpus [5]. LDA extracts topic distributions using Gibbs sampling, an iterative method that requires selecting parameters such as the number of topics, the number of iterations, and the Dirichlet priors. The latent semantic analysis (LSA) method extracts topics and captures the semantic meaning of words through statistical computation on a huge collection of documents [6]. Topic modeling extracts the needed information very effectively from biomedical text documents. The biomedical text documents consist of hundreds to thousands of medical
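To make the role of the LDA parameters mentioned in the introduction concrete (number of topics, number of iterations, Dirichlet priors), here is a minimal collapsed Gibbs sampler in Python. It is a textbook-style sketch, not the implementation used in the paper; the function name `lda_gibbs`, the default priors `alpha` and `beta`, and the fixed random seed are assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (lists of words).

    Returns the vocabulary, document-topic estimates (theta), and
    topic-word estimates (phi). Parameter defaults are illustrative.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word v assigned to topic k
    topic_total = [0] * n_topics                           # tokens assigned to topic k
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates from the final counts.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
         for v in range(n_words)]
        for t in range(n_topics)
    ]
    return vocab, theta, phi


if __name__ == "__main__":
    corpus = [["gene", "protein", "gene"], ["tumor", "cell", "tumor"],
              ["protein", "gene"], ["cell", "tumor"]]
    vocab, theta, phi = lda_gibbs(corpus, n_topics=2, n_iters=100)
    print(vocab)
    for row in theta:
        print([round(p, 3) for p in row])
```

Each row of `theta` is one document's distribution over topics and each row of `phi` is one topic's distribution over words, which correspond to the two kinds of distribution the paragraph describes.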

Journal: IEEE Access
Publication Date: Jan 1, 2019
Citations: 82
License type: CC BY 4.0

Similar Papers

A novel multiple kernel fuzzy topic modeling technique for biomedical data
Junaid Rashid ... Sapna Juneja
BMC Bioinformatics | VOL. 23
12 Jul 2022

Fuzzy topic modeling approach for text mining over short text
Junaid Rashid ... Aun Irtaza
Information Processing & Management | VOL. 56
21 Jun 2019

Text Mining of Open-Ended Questions in Self-Assessment of University Teachers: An LDA Topic Modeling Approach
Diego Buenano-Fernandez ... David Gil
IEEE Access | VOL. 8
01 Jan 2020

An Efficient Topic Modeling Approach for Text Mining and Information Retrieval through K-means Clustering
Junaid Rashid ... Aun Irtaza
Mehran University Research Journal of Engineering and Technology | VOL. 39
01 Jan 2020
