Abstract

Text data plays an important role in the biomedical domain, as patient data comprises a huge number of text documents in non-standardized formats. These documents pose many challenges for data processing when relevant data must be extracted. Topic modeling is one of the popular techniques for theme-based information retrieval from biomedical documents, but discovering precise topics from such documents is a challenging task. Furthermore, redundancy in biomedical text documents has a negative impact on the quality of text mining. The rapid growth of unstructured documents therefore calls for machine learning techniques for topic modeling that are capable of discovering precise topics. In this paper, we propose a topic modeling technique for text mining based on a hybrid inverse document frequency and the fuzzy k-means machine learning clustering algorithm. The proposed technique mitigates the redundancy issue and discovers precise topics from biomedical text documents. It generates local and global term frequencies through the bag-of-words (BOW) model: the global term weighting is calculated through the proposed hybrid inverse document frequency, and the local term weighting is computed with term frequency. Robust principal component analysis is used to remove the negative impact of high dimensionality on the global term weights. Afterward, classification and clustering for text mining are performed using the probability of topics in the documents. Classification is performed through a discriminant analysis classifier, whereas clustering is done through k-means clustering. The performance of clustering is evaluated with the Calinski-Harabasz (CH) index, an internal validation method. The proposed topic modeling technique is evaluated on six standard datasets, namely Ohsumed, MuchMore Springer Corpus, GENIA corpus, Bioxtext, tweets, and the WSJ redundant corpus.
The proposed topic modeling technique exhibits high performance on classification and clustering in text mining compared to baseline topic models like FLSA, LDA, and LSA. Moreover, the execution time of the proposed topic modeling technique remains stable for different numbers of topics.
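To make two of the steps named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It builds bag-of-words counts, multiplies the local weight (term frequency) by a global weight, and scores a clustering with the Calinski-Harabasz index. The abstract does not define the hybrid inverse document frequency, so the textbook IDF, log(N / df), is used here as an assumed stand-in. The function names `tf_idf` and `calinski_harabasz` are hypothetical, and the robust PCA and fuzzy k-means steps are omitted.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by local term frequency times a global IDF weight.

    Assumption: log(N / df) is the textbook IDF, used as a stand-in for the
    paper's hybrid IDF, which the abstract does not define.
    """
    bags = [Counter(doc.lower().split()) for doc in docs]  # bag-of-words counts
    vocab = sorted(set().union(*bags))
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for bag in bags if t in bag) for t in vocab}
    idf = {t: math.log(n_docs / df[t]) for t in vocab}  # global weight
    # One vector per document: local weight (count) times global weight.
    vectors = [[bag[t] * idf[t] for t in vocab] for bag in bags]
    return vocab, vectors


def calinski_harabasz(points, labels):
    """Return (between dispersion / (k - 1)) / (within dispersion / (n - k)).

    Requires 2 <= k < n and a nonzero within-cluster dispersion.
    """
    n, dim = len(points), len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    overall = [sum(p[j] for p in points) / n for j in range(dim)]
    between = within = 0.0
    for members in clusters.values():
        centroid = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        # Between: cluster size times squared distance from centroid to overall mean.
        between += len(members) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        # Within: squared distances from each member to its own centroid.
        within += sum(sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Toy corpus (made up for illustration): two cardiology and two genetics snippets.
docs = ["heart attack heart", "heart surgery", "gene expression gene", "gene mutation"]
vocab, vectors = tf_idf(docs)
score = calinski_harabasz(vectors, [0, 0, 1, 1])
```

A higher CH value indicates clusters that are compact and well separated, so on this toy corpus grouping the documents by subject scores higher than mixing the subjects. In a real pipeline the cluster labels would come from the clustering step rather than being supplied by hand.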

Highlights

  • The tremendous amount of biomedical text documents is a valuable source of information in the biomedical field. Biomedical documents are characterized by large amounts of disorganized and infrequent information in a wide variety of forms, such as medical documents, scientific papers, electronic health records, case summary reports, and so forth

  • Topic modeling is an efficient technique for biomedical text mining but needs some improvement, because biomedical text documents contain redundant words [9] and redundancy has a negative impact on topic modeling and text mining [10]

  • The number of biomedical text documents is growing continuously, and analyzing these documents is very important for discovering valuable information

Summary

Introduction

The tremendous amount of biomedical text documents is a valuable source of information in the biomedical field. (The associate editor coordinating the review of this manuscript and approving it for publication was Kun Wang.) Biomedical documents are characterized by large amounts of disorganized and infrequent information in a wide variety of forms, such as medical documents, scientific papers, electronic health records, case summary reports, and so forth. Topic modeling techniques help extract unknown topics from a huge collection of documents [3] and available articles, and discover the topic distribution of every document. Topic models discover topics from documents, with each topic represented by a distribution of words. Latent Dirichlet Allocation (LDA) estimates probabilities that predict a posterior distribution of words and topics from the input text corpus [5]. LDA extracts topic distributions using Gibbs sampling, an iterative method that requires choosing parameters such as the number of topics, the number of iterations, and the Dirichlet priors. The latent semantic analysis (LSA) method extracts topics and captures the semantic meaning of words through statistical computation on a huge collection of documents [6]. Topic modeling extracts the needed information very effectively from biomedical text documents. The biomedical text documents consist of hundreds to thousands of medical
