Connected vehicles (CVs) have the potential to collect and share information that, if appropriately processed, can be employed for advanced traffic control strategies, rendering infrastructure-based sensing obsolete. However, before we reach a fully connected environment, where all vehicles are CVs, we have to deal with the challenge of incomplete data. In this paper, we develop data-driven methods for the estimation of vehicles approaching a signalised intersection, based on the availability of partial information stemming from an unknown penetration rate of CVs. In particular, we build machine learning models with the aim of capturing the nonlinear relations between the inputs (CV data) and the output (number of non-connected vehicles), which are characterised by highly complex interactions and may be affected by a large number of factors. We show that, in order to train these models, we may use data that can be easily collected with modern technologies. Moreover, we demonstrate that, if the available real data is not deemed sufficient, training can be performed using synthetic data, produced via microscopic simulations calibrated with real data, without a significant loss of performance. Numerical experiments, where the estimation methods are tested using real vehicle data simulating the presence of various penetration rates of CVs, show very good performance of the estimators, making them promising candidates for applications in the near future.