Background: We aimed to develop machine learning models and evaluate their performance in predicting the diagnosis of HIV and sexually transmitted infections (STIs) in a large retrospective cohort of Australian men who have sex with men (MSM).
Methods: We collected demographic, clinical, behavioural and laboratory information from the clinic records of 21273 MSM who attended Melbourne Sexual Health Centre (MSHC), Australia, between 2011 and 2017. We limited the analysis of three STIs (syphilis, gonorrhoea, chlamydia) to the period from January 2015 to December 2017. We compared the accuracy of four machine learning approaches in predicting the diagnosis of HIV and the three STIs against that of a multivariable logistic regression (MLR) model.
Findings: HIV was diagnosed in 436/18505 MSM (436 diagnoses/58121 consultations), syphilis in 741/13820 MSM (810 diagnoses/38490 consultations), gonorrhoea in 3258/10802 MSM (4382 diagnoses/25011 consultations), and chlamydia in 2836/7708 MSM (3918 diagnoses/13926 consultations). Machine learning approaches more accurately predicted each infection than MLR. Gradient boosting machine (GBM) was the most accurate and achieved the highest area under the receiver operating characteristic curve for HIV (76·3%) and STIs (syphilis, 85·8%; gonorrhoea, 75·5%; chlamydia, 68·0%), followed by extreme gradient boosting (71·1%, 82·2%, 70·3%, 66·4%), random forest (72·0%, 81·9%, 67·2%, 64·3%), deep learning (75·8%, 81·0%, 67·5%, 65·4%), and MLR (69·8%, 80·1%, 67·2%, 63·2%). In the trained GBM models, the ten most important predictors collectively explained 62·7-73·6% of the variation in predicting the diagnosis of HIV/STIs. Among these, STI symptoms, past syphilis infection, age, time living in Australia, frequency of condom use with casual male sexual partners during receptive anal sex, and the number of casual male sexual partners in the past 12 months were the predictors most commonly identified by the models.
Interpretation: Machine learning approaches have an advantage over multivariable logistic regression models in predicting the diagnosis of HIV/STIs.
Funding Statement: Australian NHMRC Leadership Investigator Grant (GNT1172900).
Declaration of Interests: LZ is supported by the National Natural Science Foundation of China (8191101420); Thousand Talents Plan Professorship for Young Scholars (3111500001); Xi'an Jiaotong University Young Talent Support Program; Xi'an Jiaotong University Basic Research and Profession Grant (xtr022019003). CKF is supported by an Australian NHMRC Leadership Investigator Grant (GNT1172900). EPFC is supported by an Australian National Health and Medical Research Council (NHMRC) Emerging Leadership Investigator Grant (GNT1172873). XZ is supported by the National Science and Technology Major Project of China (2018ZX10721102); the Key Project of Philosophy and Social Sciences Research in Jiangsu Education Department of China (2018SJZDI123); Nantong Municipal Bureau of Science and Technology, China (MS12018001, HS2016002). All other authors declare no competing interests.
Ethics Approval Statement: The datasets were completely de-identified and not re-identifiable. Ethical approval was granted by the Alfred Hospital Ethics Committee, Australia (project number: 124/18).
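Illustrative note: the models in this abstract are ranked by the area under the receiver operating characteristic curve (AUC), reported as a percentage. The sketch below is not the authors' code; it is a minimal pure-Python illustration of the metric, using the standard rank-based (Mann-Whitney) formulation. The function name `roc_auc`, the labels, and both sets of scores are hypothetical and carry no study data.

```python
def roc_auc(labels, scores):
    """Return the AUC as the fraction of (positive, negative) pairs in which
    the positive case gets the higher score; tied pairs count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = infection diagnosed, 0 = not diagnosed.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
# Hypothetical predicted risks from two competing models.
model_a = [0.1, 0.3, 0.8, 0.2, 0.7, 0.9, 0.4, 0.6]
model_b = [0.2, 0.6, 0.7, 0.1, 0.4, 0.9, 0.5, 0.3]

auc_a = roc_auc(labels, model_a)  # 1.0: every positive outranks every negative
auc_b = roc_auc(labels, model_b)  # 0.75: 12 of 16 pairs are ranked correctly
print(f"model A AUC: {auc_a:.1%}, model B AUC: {auc_b:.1%}")
```

An AUC of 50% corresponds to ranking cases no better than chance and 100% to perfect ranking, which is the scale on which the percentages reported above should be read.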