Abstract

Sky surveys are the largest data generators in astronomy, making automated tools for extracting meaningful scientific information an absolute necessity. We show that, without the need for labels, self-supervised learning recovers representations of sky survey images that are semantically useful for a variety of scientific tasks. These representations can be directly used as features, or fine-tuned, to outperform supervised methods trained only on labeled data. We apply a contrastive learning framework to multiband galaxy photometry from the Sloan Digital Sky Survey (SDSS) to learn image representations. We then use them for galaxy morphology classification and fine-tune them for photometric redshift estimation, using labels from the Galaxy Zoo 2 data set and SDSS spectroscopy. In both downstream tasks, using the same learned representations, we outperform the supervised state of the art, and we show that our approach can achieve the accuracy of supervised models while using 2–4 times fewer labels for training. The code, trained models, and data can be found at https://portal.nersc.gov/project/dasrepo/self-supervised-learning-sdss.
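
As an illustration of the kind of contrastive pretraining the abstract describes, the sketch below shows one SimCLR-style training step in PyTorch. The encoder choice, five-band input, placeholder augmentations, and hyperparameters are assumptions made for this sketch, not the paper's exact framework (the actual trained models are available at the URL above).

    # Minimal, illustrative contrastive pretraining step (SimCLR-style NT-Xent loss) in PyTorch.
    # The encoder, 5-band input, augmentations, batch size, and temperature below are assumptions
    # for illustration only, not the exact framework or hyperparameters used in the paper.
    import torch
    import torch.nn.functional as F
    from torchvision.models import resnet50

    def nt_xent_loss(z1, z2, temperature=0.1):
        """Contrastive loss over two augmented views of the same batch of images."""
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        z = torch.cat([z1, z2], dim=0)                     # (2N, d) projected features
        sim = z @ z.t() / temperature                      # pairwise cosine similarities
        n = z1.shape[0]
        mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(mask, float("-inf"))         # exclude self-similarity
        # Positive pairs: view i matches view i + n (and vice versa).
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
        return F.cross_entropy(sim, targets)

    # Encoder producing 2048-dimensional representations, plus a small projection head
    # that is only used during pretraining (representations are taken from the encoder).
    encoder = resnet50(weights=None)
    encoder.conv1 = torch.nn.Conv2d(5, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 5 SDSS bands (ugriz)
    encoder.fc = torch.nn.Identity()
    projector = torch.nn.Sequential(
        torch.nn.Linear(2048, 512), torch.nn.ReLU(), torch.nn.Linear(512, 128)
    )
    optimizer = torch.optim.Adam(list(encoder.parameters()) + list(projector.parameters()), lr=1e-3)

    # One step on a batch of unlabeled multiband cutouts (random tensors stand in for real
    # data and for image augmentations applied to produce the two views).
    x = torch.randn(32, 5, 64, 64)
    view1 = x + 0.01 * torch.randn_like(x)
    view2 = x + 0.01 * torch.randn_like(x)
    loss = nt_xent_loss(projector(encoder(view1)), projector(encoder(view2)))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

In a setup like this, the projection head is discarded after pretraining and the 2048-dimensional encoder output serves as the representation, which can then be used directly as features or fine-tuned for downstream tasks.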

Highlights

  • Observing and imaging objects in the sky has been the main driver of the scientific discovery process in astronomy, because doing controlled experiments is not a viable option

  • To visualize the information contained in the self-supervised representations, we use Uniform Manifold Approximation and Projection (UMAP; McInnes et al. 2018) to reduce the 2048-dimensional representations to a more manageable two, while preserving structure information on both local and global scales (a minimal sketch of this step follows this list)

  • In this letter we have demonstrated that self-supervised representation learning on unlabeled data yields notable performance gains over supervised learning for multiple tasks
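
To make the UMAP visualization mentioned in the second highlight concrete, the sketch below reduces 2048-dimensional representations to two dimensions with the umap-learn package (McInnes et al. 2018). The random input array and the (default) hyperparameters are stand-ins, not the paper's exact settings.

    # Illustrative UMAP reduction of 2048-dimensional representations to 2D, as described in
    # the second highlight. The random array stands in for the learned representations, and
    # the UMAP hyperparameters shown are the library defaults, not necessarily the paper's.
    import numpy as np
    import umap  # umap-learn (McInnes et al. 2018)

    representations = np.random.randn(10000, 2048).astype(np.float32)  # stand-in for learned features

    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
    embedding = reducer.fit_transform(representations)  # shape (10000, 2)

    # The 2D embedding can then be scatter-plotted and colored by galaxy properties
    # (e.g., morphology labels or spectroscopic redshift) to inspect semantic structure.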



Introduction

Observing and imaging objects in the sky has been the main driver of the scientific discovery process in astronomy, because doing controlled experiments is not a viable option. Machine-learning (ML) methods have proven useful for both classification and regression tasks (see Stein 2020 for a comprehensive list), but the majority of published works rely on the quantity and quality of (manually assigned) image labels. This reliance highlights the need for methods that enable the discovery of truly unusual and previously unseen objects, as well as semantic (or feature) similarity searches on images when as few as one label is available.
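
As a concrete illustration of such a feature-similarity search, the sketch below ranks unlabeled images by cosine similarity to a single labeled query in representation space. The array names, shapes, and use of cosine similarity are assumptions for this sketch rather than the paper's specific procedure.

    # Illustrative nearest-neighbor similarity search in representation space using cosine
    # similarity. The array names, shapes, and the single labeled query are assumptions made
    # for the sketch; the paper's specific search procedure is not reproduced here.
    import numpy as np

    def similarity_search(query_rep, gallery_reps, top_k=10):
        """Return indices of the top_k gallery images most similar to the query."""
        q = query_rep / np.linalg.norm(query_rep)
        g = gallery_reps / np.linalg.norm(gallery_reps, axis=1, keepdims=True)
        scores = g @ q                                     # cosine similarity to the query
        return np.argsort(-scores)[:top_k]

    gallery = np.random.randn(100000, 2048).astype(np.float32)  # representations of unlabeled images
    query = gallery[42]                                         # a single labeled example of interest
    print(similarity_search(query, gallery, top_k=5))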

