Abstract

Medical diagnostics is often a multi-attribute problem, necessitating sophisticated tools for analyzing high-dimensional biomedical data. Mining this data often runs into two crucial bottlenecks: 1) the high dimensionality of the features used to represent rich biological data and 2) the small amount of labeled training data, owing to the expense of consulting the highly specialized medical expertise needed to assess each study. To our knowledge, no existing approach uses active learning (AL) within dimensionality reduction to improve the construction of low-dimensional representations. We present a novel methodology, AdDReSS (Adaptive Dimensionality Reduction with Semi-Supervision), and demonstrate that labeled instances identified via AL in the embedding space yield a more discriminative embedding with fewer labels than randomly selected instances. We tested our methodology on a wide variety of domains: prostate gene expression, ovarian proteomic spectra, brain magnetic resonance imaging, and breast histopathology. Across these high-dimensional biomedical datasets, each with 100+ observations, and across all parameters considered, the median classification accuracy over all experiments was higher for AdDReSS (88.7%) than for SSAGE, a semi-supervised dimensionality reduction (SSDR) method that uses random sampling (85.5%), and for Graph Embedding (81.5%). Furthermore, embeddings generated via AdDReSS achieved a mean 35.95% improvement over SSAGE in Raghavan efficiency, a measure of learning rate. Our results demonstrate the value of AdDReSS in providing low-dimensional representations of high-dimensional biomedical data while achieving higher classification rates with fewer labeled examples than approaches that do not use active learning.
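
The idea of choosing which instances to label from their position in the embedding can be illustrated with a minimal sketch. The abstract does not specify the query rule AdDReSS uses, so the criterion below is an assumption made for illustration only: it ranks unlabeled points by how close they are to being equidistant from the nearest labeled point of each class. The function names `ambiguity` and `select_queries` and the toy coordinates are hypothetical and do not come from the paper.

```python
import math

def ambiguity(point, labeled):
    """Gap between the two smallest per-class nearest-neighbor distances.

    Assumed criterion (not taken from the paper): a small gap means the
    point lies near the boundary between classes in the embedding, so
    its label is likely to be informative. Requires labeled points from
    at least two classes.
    """
    nearest = {}
    for coords, label in labeled:
        d = math.dist(point, coords)
        if label not in nearest or d < nearest[label]:
            nearest[label] = d
    closest = sorted(nearest.values())
    return closest[1] - closest[0]

def select_queries(unlabeled, labeled, budget):
    """Return indices of the `budget` most ambiguous unlabeled points."""
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: ambiguity(unlabeled[i], labeled))
    return ranked[:budget]

# Toy 2-D embedding: one labeled seed per class and four unlabeled points.
labeled = [((0.0, 0.0), 0), ((4.0, 0.0), 1)]
unlabeled = [(0.5, 0.0), (2.1, 0.0), (3.8, 0.2), (1.9, 1.0)]
queries = select_queries(unlabeled, labeled, budget=2)
print(queries)
```

In this toy setting the two points near the midpoint between the seeds are chosen ahead of the points sitting next to a seed, which is the behavior a random sampler would not guarantee. In a semi-supervised scheme, the newly acquired labels would then be fed back to recompute the embedding before the next round of queries.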

Highlights

  • The ability to mine disease patterns from large biomedical datasets could enable the identification of prognostic disease markers, which in turn, could save lives, reduce morbidity, and alleviate the overall cost of healthcare today

  • We evaluated our methodology on a different task for each of four relevant medical datasets: (a) discrimination of tumoral and non-tumoral prostate samples in a gene expression dataset [8], (b) discrimination of neoplastic and non-neoplastic disease within the ovary in a protein expression dataset [4], (c) mitosis detection in breast cancer images [43], and (d) identification of white matter and grey matter in a brain MR imaging dataset [44]

  • The accuracy curve for Adaptive Dimensionality Reduction with Semi-Supervision (AdDReSS) approaches the empirical maximum φAcc faster than the curve for semi-supervised agglomerative graph embedding (SSAGE)

Summary

Introduction

The ability to mine disease patterns from large biomedical datasets could enable the identification of prognostic disease markers, which in turn could save lives, reduce morbidity, and alleviate the overall cost of healthcare today.

Development Award; the Ohio Third Frontier Technology Development Grant; the CTSC Coulter Annual Pilot Grant; the Case Comprehensive Cancer Center Pilot Grant; the VelaSano Grant from the Cleveland Clinic; and the Wallace H. Coulter Foundation Program in the Department of Biomedical Engineering at Case Western Reserve University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
