Background: Cerebral palsy is a complex condition that can manifest in different ways, and its diagnosis is likely to be under-recorded in primary care databases. This study aimed to identify potential unrecorded cases based on other available information in patients' medical records.
Methods: A machine learning approach was used to identify likely cases of cerebral palsy among live births between Jan 1, 1990, and April 30, 2016, in the Clinical Practice Research Datalink (CPRD), a UK primary care database. Firstly, we made a preliminary selection of predictor variables (medical and drug codes) by comparing their relative frequencies in known cases and in the remaining non-cases; secondly, we reduced the number of variables using the random forest method on a resampled, balanced population; thirdly, we used a logistic regression model with the selected codes to predict the probability of cerebral palsy; and lastly, the medical records of the identified likely cases were manually reviewed with expert clinical knowledge to validate the cases. Scientific approval for this study was given by the CPRD Independent Scientific Advisory Committee.
Findings: Of 485 709 live births, 664 (0·14%) were initially identified as known cases of cerebral palsy using 43 validated diagnostic codes. 175 of 31 605 codes in the records were found more frequently in known cases of cerebral palsy than in non-cases. 35 of the most informative codes (eg, skeletal muscle relaxants, prematurity, and being seen in paediatric clinic) were selected and used to build the logistic prediction model, which yielded 787 most likely cases (with predicted probability of cerebral palsy ≥0·975). On the basis of evidence of both motor disorder and brain injury, after manual review of medical records, 405 children were validated as cases additional to the known cases, resulting in a cerebral palsy prevalence of 0·22% in live births, which is comparable to existing evidence.
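The prevalence figures reported in the findings can be reproduced from the stated counts. The short Python check below is only an arithmetic illustration: the variable names are ours, and it assumes that the 405 validated children are distinct from the 664 known cases, as the wording "additional to the known cases" indicates.

```python
# Counts taken from the findings; variable names are illustrative only.
live_births = 485_709
known_cases = 664            # identified with the 43 validated diagnostic codes
validated_additional = 405   # confirmed on manual review, additional to known cases

def percent(numerator, denominator):
    """Return numerator / denominator expressed as a percentage."""
    return 100 * numerator / denominator

# Prevalence from recorded diagnoses alone.
known_prevalence = percent(known_cases, live_births)

# Prevalence after adding the manually validated cases.
total_cases = known_cases + validated_additional
total_prevalence = percent(total_cases, live_births)

# Share of all identified cases that had no recorded diagnostic code.
unrecorded_share = percent(validated_additional, total_cases)

print(f"known prevalence: {known_prevalence:.2f}%")
print(f"total prevalence: {total_prevalence:.2f}%")
print(f"unrecorded share of all cases: {unrecorded_share:.1f}%")
```

Under that assumption, a little over a third of the final case count comes from children without a recorded diagnostic code.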
Interpretation: Data-driven schemes, such as random forests, have the potential to identify the most informative predictors in a cost-effective way, and so to reliably identify potential unrecorded cases of cerebral palsy or other complex medical conditions in primary care databases.
Funding: Economic and Social Research Council (grant ref ES/L007517/1).
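As an illustration of the first methods step (preliminary selection of codes by comparing their relative frequencies in known cases and non-cases), a minimal Python sketch follows. It is not the study's implementation: the abstract does not state the selection criterion, so the function `screen_codes`, the thresholds `min_ratio` and `min_case_freq`, and the toy records are all hypothetical. The later steps (random forest variable reduction and logistic regression) are not shown.

```python
from collections import Counter

def screen_codes(records, case_ids, min_ratio=2.0, min_case_freq=0.05):
    """Keep codes that are relatively more frequent in known cases.

    records:  dict mapping patient id -> set of medical/drug codes
    case_ids: ids of patients with a recorded diagnosis (known cases)
    Both thresholds are assumed values, not taken from the study.
    Returns a dict mapping code -> (frequency in cases, frequency in non-cases).
    """
    case_ids = set(case_ids)
    n_cases = sum(1 for pid in records if pid in case_ids)
    n_noncases = len(records) - n_cases
    if n_cases == 0 or n_noncases == 0:
        raise ValueError("need at least one case and one non-case")

    # Count, per code, how many patients in each group have it recorded.
    case_counts, noncase_counts = Counter(), Counter()
    for pid, codes in records.items():
        target = case_counts if pid in case_ids else noncase_counts
        target.update(set(codes))

    selected = {}
    for code, count in case_counts.items():
        freq_case = count / n_cases
        freq_noncase = noncase_counts[code] / n_noncases
        # Keep the code if it is common enough in cases and over-represented
        # relative to non-cases by at least the assumed ratio.
        if freq_case >= min_case_freq and freq_case >= min_ratio * freq_noncase:
            selected[code] = (freq_case, freq_noncase)
    return selected

# Toy data, invented for illustration only.
records = {
    "p1": {"muscle_relaxant", "prematurity", "paediatric_clinic"},
    "p2": {"muscle_relaxant", "asthma"},
    "p3": {"asthma"},
    "p4": {"prematurity", "eczema"},
    "p5": {"asthma", "eczema"},
    "p6": {"eczema"},
}
known_cases = {"p1", "p2"}

selected = screen_codes(records, known_cases)
for code in sorted(selected):
    print(code, selected[code])
```

In this toy example "asthma" is dropped because it is equally frequent in both groups, whereas the three codes over-represented among known cases are retained as candidate predictors for the later modelling steps.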