Abstract
Purpose: Machine learning is broadly used for clinical data analysis. Before training a model, a machine learning algorithm must be selected. Also, the values of one or more model parameters termed hyper-parameters must be set. Selecting algorithms and hyper-parameter values requires advanced machine learning knowledge and many labor-intensive manual iterations. To lower the bar to machine learning, miscellaneous automatic selection methods for algorithms and/or hyper-parameter values have been proposed. Existing automatic selection methods are inefficient on large data sets. This poses a challenge for using machine learning in the clinical big data era.

Methods: To address the challenge, this paper presents progressive sampling-based Bayesian optimization, an efficient and automatic selection method for both algorithms and hyper-parameter values.

Results: We report an implementation of the method. We show that compared to a state-of-the-art automatic selection method, our method can significantly reduce search time, classification error rate, and standard deviation of error rate due to randomization.

Conclusions: This is major progress towards enabling fast turnaround in identifying high-quality solutions required by many machine learning-based clinical data analysis tasks.
Highlights
Machine learning is a key technology for modern clinical data analysis and can be used to support many clinical applications
We show that, compared to a state-of-the-art automatic selection method, our method can significantly reduce search time, classification error rate, and standard deviation of error rate due to randomization
We present several new optimizations tailored to automatic machine learning model selection
Summary
Machine learning is a key technology for modern clinical data analysis and can be used to support many clinical applications. To make machine learning accessible, statistics and computer science researchers have built various open source software tools such as Weka [6], scikit-learn [7], PyBrain [8], RapidMiner, R, and KNIME [9]. These software tools integrate many machine learning algorithms and provide intuitive graphical user interfaces. A detailed review of existing automatic selection methods for algorithms and/or hyper-parameter values is provided in our papers [11, 15]. The generalization performance of an algorithm $A$ with hyper-parameter values $\lambda$ is estimated by $M(A_\lambda, D)$, the error rate attained by $A_\lambda$ when trained and tested on the data set $D$, e.g., via stratified multi-fold cross validation to decrease the possibility of overfitting [6]. Using this estimate, the objective of machine learning model selection is to find $A^*_{\lambda^*} \in \arg\min_{A \in \mathcal{A},\, \lambda \in \Lambda} M(A_\lambda, D)$, where $\mathcal{A}$ is the set of candidate algorithms and $\Lambda$ is the hyper-parameter space.
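The selection objective stated in this summary can be made concrete with a small sketch. The code below estimates the error rate M(A_λ, D) by stratified multi-fold cross validation and then takes the arg min over every algorithm and hyper-parameter value in a tiny search space. It is an exhaustive search written for illustration only, not the progressive sampling-based Bayesian optimization method the paper presents. Everything in it is an assumption made for the example: the two toy classifiers (`knn`, `nearest_centroid`), the hyper-parameter grids in `SEARCH_SPACE`, the helper names, and the synthetic two-class data set.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(y, n_folds, seed=0):
    """Split indices into folds that keep roughly the same class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)  # deal each class round-robin
    return folds


def distance(a, b, p):
    """Sum of |a_i - b_i|^p (no root needed when only comparing distances)."""
    return sum(abs(u - v) ** p for u, v in zip(a, b))


def knn(train_X, train_y, x, k):
    """Toy algorithm 1: k-nearest neighbors; hyper-parameter is k."""
    nearest = sorted(zip((distance(r, x, 2) for r in train_X), train_y))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def nearest_centroid(train_X, train_y, x, p):
    """Toy algorithm 2: nearest class centroid; hyper-parameter is the norm p."""
    groups = defaultdict(list)
    for row, label in zip(train_X, train_y):
        groups[label].append(row)
    centroids = {c: [sum(col) / len(rows) for col in zip(*rows)]
                 for c, rows in groups.items()}
    return min(centroids, key=lambda c: distance(centroids[c], x, p))


def cv_error(predict, lam, X, y, n_folds=5):
    """Estimate M(A_lambda, D): pooled error rate over the held-out folds."""
    wrong = 0
    for held_out in stratified_folds(y, n_folds):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(y)) if i not in held]
        wrong += sum(predict(train_X, train_y, X[i], lam) != y[i]
                     for i in held_out)
    return wrong / len(X)


def select_model(search_space, X, y):
    """Return the (algorithm, lambda) pair minimizing the estimated error."""
    scores = {(name, lam): cv_error(predict, lam, X, y)
              for name, (predict, lams) in search_space.items()
              for lam in lams}
    return min(scores, key=scores.get), scores


# Hypothetical search space: algorithm name -> (predictor, hyper-parameter grid).
SEARCH_SPACE = {"knn": (knn, [1, 3, 5]),
                "nearest_centroid": (nearest_centroid, [1, 2])}

# Synthetic data set D: two Gaussian blobs, 30 points per class.
rng = random.Random(0)
X = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in (0, 2) for _ in range(30)]
y = [c for c in (0, 2) for _ in range(30)]

best, scores = select_model(SEARCH_SPACE, X, y)
print("selected:", best, "estimated error rate:", scores[best])
```

Every candidate here is trained and tested on the full data set in each fold, so the cost grows with both the size of the search space and the size of D. That cost is the inefficiency on large data sets that the abstract identifies as the motivation for the paper's method.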