Leaves on trees: identifying halo stars with extreme gradient boosted trees

Jovan Veljanoski, Amina Helmi, Maarten Breddels, Lorenzo Posti

doi:10.1051/0004-6361/201732303

Abstract

Context. Extended stellar haloes are a natural by-product of the hierarchical formation of massive galaxies like the Milky Way. If merging is a non-negligible factor in the growth of our Galaxy, evidence of such events should be encoded in its stellar halo. The reliable identification of genuine halo stars is, however, a challenging task.

Aims. With the advent of the Gaia space telescope, we are ushered into a new era of Galactic astronomy. The first Gaia data release contains the positions, parallaxes, and proper motions for over two million stars, mostly in the solar neighbourhood. The second Gaia data release will enlarge this sample to over 1.5 billion stars, the brightest ~5 million of which will have full phase-space information. Our aim in this paper is to develop a machine learning model for reliably identifying halo stars, even when their full phase-space information is not available.

Methods. We use the Gradient Boosted Trees algorithm to build a supervised halo star classifier. The classifier is trained on a sample of stars extracted from the Gaia Universe Model Snapshot, which is convolved both with the errors of the public TGAS data (a subset of Gaia DR1) and with the expected uncertainties for the upcoming Gaia DR2 catalogue. We also train our classifier on a dataset resulting from the cross-match between the TGAS and RAVE catalogues, where the halo stars are labelled in an entirely model-independent way. We then use this model to identify halo stars in TGAS.

Results. When full phase-space information is available and for Gaia DR2-like uncertainties, our classifier is able to recover 90% of the halo stars with at most 30% distance errors, in a completely unseen test set and with negligible levels of contamination. When line-of-sight velocity is not available, we recover ~60% of such halo stars, with less than 10% contamination. When applied to the TGAS catalogue, our classifier detects 337 high-confidence red giant branch halo stars. At first glance this number may seem small; however, it is consistent with the expectation from the models, given the uncertainties in the data. The large parallax errors are in fact the biggest limitation in our ability to identify a large number of halo stars in all the cases studied.
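
The recovery and contamination percentages quoted in the Results can be made concrete with a short sketch. It assumes the usual definitions, which the abstract does not spell out: completeness is the fraction of true halo stars that the classifier selects, and contamination is the fraction of selected stars that are not halo stars. The function name, the probability thresholds, and the toy labels and scores below are hypothetical and are not taken from the paper.

```python
def completeness_and_contamination(is_halo, halo_prob, threshold=0.5):
    """Return (completeness, contamination) for a probability cut.

    is_halo   : iterable of bool, true labels (True = halo star)
    halo_prob : iterable of float, classifier probability of "halo"
    threshold : assumed cut at or above which a star is called "halo"
    """
    true_pos = false_pos = false_neg = 0
    for truth, prob in zip(is_halo, halo_prob):
        selected = prob >= threshold
        if selected and truth:
            true_pos += 1       # halo star correctly selected
        elif selected and not truth:
            false_pos += 1      # non-halo star wrongly selected
        elif truth:
            false_neg += 1      # halo star missed by the cut
    n_halo = true_pos + false_neg
    n_selected = true_pos + false_pos
    # Guard against empty denominators (no halo stars, or nothing selected).
    completeness = true_pos / n_halo if n_halo else float("nan")
    contamination = false_pos / n_selected if n_selected else float("nan")
    return completeness, contamination


# Toy labels and scores, invented purely for illustration.
labels = [True, True, True, True, False, False, False, False, False, False]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.20, 0.10, 0.05, 0.40, 0.15]

loose = completeness_and_contamination(labels, scores, threshold=0.5)
strict = completeness_and_contamination(labels, scores, threshold=0.75)
print("threshold 0.50:", loose)   # (0.75, 0.25)
print("threshold 0.75:", strict)  # (0.5, 0.0)
```

In this toy case, raising the threshold removes the one contaminant but also halves the completeness. A "high confidence" selection involves the same kind of trade-off.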

Full Text