Abstract

The rise of machine learning (ML) has created an explosion in the potential strategies for using data to make scientific predictions. For physical scientists wishing to apply ML to a particular domain, it can be difficult to assess in advance which strategy to adopt within a vast space of possibilities. Here we outline the results of an online community-powered effort to swarm-search the space of ML strategies and develop algorithms for predicting atomic-pairwise nuclear magnetic resonance (NMR) properties in molecules. Using an open-source dataset, we worked with Kaggle to design and host a 3-month competition, which received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with accuracy comparable to our best previously published 'in-house' efforts. A meta-ensemble model constructed as a linear combination of the top predictions achieves a prediction accuracy that exceeds that of any individual model and is 7-19x better than our previous state of the art. The results highlight the potential of transformer architectures for predicting quantum mechanical (QM) molecular properties.
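The "linear combination of the top predictions" mentioned above can be sketched as a simple least-squares blend over the individual models' outputs. This is an illustrative sketch only: the function names are invented here, and the choice of unconstrained ordinary least squares is an assumption — the paper's actual fitting procedure is not specified in this excerpt.

```python
import numpy as np

def fit_meta_ensemble(preds: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Fit blending weights for a linear meta-ensemble.

    preds:  (n_samples, n_models) array, one column per base model's predictions
    target: (n_samples,) array of reference values

    Returns the weight vector minimizing the squared error of the blend.
    """
    weights, *_ = np.linalg.lstsq(preds, target, rcond=None)
    return weights

def meta_predict(preds: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine base-model predictions as a weighted sum."""
    return preds @ weights
```

Because each individual model corresponds to one admissible weight vector (all weight on that model), the fitted blend can never do worse than the best single model on the data it was fit to.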

Highlights

  • The rise of machine learning (ML) in the physical sciences has created a number of notable successes [1,2,3,4,5,6,7], and the number of published outputs is increasing substantially [8]

  • Community-powered approaches offer a powerful tool for searching ML strategy space and providing accurate predictions for physical science problems like the prediction of 2-body quantum mechanical (QM) nuclear magnetic resonance (NMR) properties

  • Within 3 weeks, the best score on the Kaggle public leader board achieved an accuracy which surpassed our own previously published approaches [24], suggesting that an open source community-powered ‘swarm search’ of ML strategy space may in some cases be significantly faster and more cost-efficient than conventional academic research strategies where a single agent spends several years hunting for solutions in an infinite search space
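The leaderboard scores referenced above were computed with the competition's evaluation metric. As a hedged sketch: the public CHAMPS competition scored submissions by the log of the mean absolute error, averaged over scalar-coupling types. The exact grouping and the function name below are assumptions based on the public competition description, not stated in this excerpt.

```python
import math
from collections import defaultdict

def grouped_log_mae(types, y_true, y_pred):
    """Score = mean over coupling types of log(MAE within that type).

    types:  sequence of coupling-type labels (e.g. '1JHC', '2JHH')
    y_true: sequence of reference coupling constants
    y_pred: sequence of predicted coupling constants
    Lower (more negative) scores are better.
    """
    errors = defaultdict(list)
    for t, yt, yp in zip(types, y_true, y_pred):
        errors[t].append(abs(yt - yp))
    return sum(math.log(sum(e) / len(e)) for e in errors.values()) / len(errors)
```

Note that taking the log per type means the score rewards accuracy uniformly across types, even though some coupling types have much larger typical magnitudes than others.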


Introduction

The rise of machine learning (ML) in the physical sciences has created a number of notable successes [1,2,3,4,5,6,7], and the number of published outputs is increasing substantially [8]. This explosion is perhaps not entirely surprising, given that the ML 'search space' is effectively infinite. In a nod to the 1950 Japanese period drama "Rashomon" (in which various characters provide subjective, alternative, self-serving, yet compelling versions of the same incident), ML's tendency to produce many accurate-but-different models has been referred to as the "Rashomon effect" in machine learning [13]. In such a vast space, any individual agent has a chance of stumbling upon a reasonable ML model.

