Exploiting redundancy in large materials datasets for efficient machine learning with less data

Kangming Li, Daniel Persaud, Kamal Choudhary, Brian Decost, Michael Greenwood, Jason Hattrick-Simpers

doi:10.1038/s41467-023-42992-y

Abstract

Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95% of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-distribution samples. In addition, we show that uncertainty-based active learning algorithms can construct much smaller but equally informative datasets. We discuss the effectiveness of informative data in improving prediction performance and robustness and provide insights into efficient data acquisition and machine learning training. This work challenges the “bigger is better” mentality and calls for attention to the information richness of materials data rather than a narrow emphasis on data volume.
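The abstract does not say which uncertainty-based active learning algorithm or model the authors used, so the following is only a minimal sketch of the general idea, not their method. It assumes a toy one-dimensional dataset in which one region is heavily over-represented, a bootstrap ensemble of k-nearest-neighbour regressors as the uncertainty estimator, and a greedy loop that repeatedly adds the pool point with the largest ensemble disagreement. All function names, the synthetic data, and the parameter values are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def knn_predict(train, x, k=3):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def ensemble_uncertainty(labeled, x, rng, n_models=20):
    """Standard deviation of predictions from models fit on bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(labeled) for _ in labeled]  # resample with replacement
        preds.append(knn_predict(boot, x))
    return statistics.pstdev(preds)


def active_learning_select(pool, n_initial, n_total, seed=0):
    """Grow a training set by always adding the most uncertain pool point."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    labeled, candidates = pool[:n_initial], pool[n_initial:]
    while len(labeled) < n_total and candidates:
        scores = [ensemble_uncertainty(labeled, x, rng) for x, _ in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        labeled.append(candidates.pop(best))
    return labeled


# Hypothetical toy data: 180 points crowded into [0, 1] (the over-represented
# region) and only 20 points spread over [1, 10].
data_rng = random.Random(42)
dense = [data_rng.uniform(0.0, 1.0) for _ in range(180)]
sparse = [data_rng.uniform(1.0, 10.0) for _ in range(20)]
pool = [(x, math.sin(x)) for x in dense + sparse]

selected = active_learning_select(pool, n_initial=5, n_total=30)
n_sparse = sum(1 for x, _ in selected if x > 1.0)
print(f"selected {len(selected)} of {len(pool)} points; "
      f"{n_sparse} come from the sparse region")
```

Comparing the share of sparse-region points in `selected` with their 10% share in the full pool gives a rough sense of whether the selection favours under-represented samples. On real materials data, the k-nearest-neighbour ensemble would be replaced by whatever model and uncertainty estimate suit the task.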

Journal: Nature Communications
Publication Date: Nov 10, 2023
Citations: 21
License type: CC BY 4.0
