Background
Efforts toward tuberculosis (TB) management and control are challenged by the emergence of Mycobacterium tuberculosis (MTB) resistance to existing anti-TB drugs. This study aimed to explore the potential of machine learning algorithms in predicting resistance to four anti-TB drugs (rifampicin, isoniazid, streptomycin, and ethambutol) in MTB using whole-genome sequence and clinical data from Uganda. We also assessed the models' generalizability on another dataset from South Africa.
Results
We trained ten machine learning algorithms on a dataset comprising 182 MTB isolates, with clinical variables (age, sex, HIV status) and SNP mutations across the entire genome as predictor variables and phenotypic drug-susceptibility data for the four drugs as the outcome variable. Model performance varied across the four anti-TB drugs after five-fold cross-validation. The best model for each drug was selected using the highest Matthews Correlation Coefficient (MCC) and Area Under the Receiver Operating Characteristic Curve (AUC) scores as key metrics. Logistic regression (LR) performed best for rifampicin (MCC: 0.83, 95% confidence interval (CI) 0.73–0.86; AUC: 0.96, 95% CI 0.95–0.98) and streptomycin (MCC: 0.44, 95% CI 0.27–0.58; AUC: 0.80, 95% CI 0.74–0.82), Extreme Gradient Boosting (XGBoost) for ethambutol (MCC: 0.65, 95% CI 0.54–0.74; AUC: 0.90, 95% CI 0.83–0.96), and Gradient Boosting (GBC) for isoniazid (MCC: 0.69, 95% CI 0.61–0.78; AUC: 0.91, 95% CI 0.88–0.96). The best-performing model for each drug was trained on the SNP data alone, excluding the clinical variables, because integrating them with SNP mutations yielded only a marginal improvement in model performance. Despite the high MCC (0.18 to 0.72) and AUC (0.66 to 0.95) scores of all the best models on the Uganda test dataset, the LR models for rifampicin and streptomycin generalized poorly to the South Africa dataset compared with the GBC and XGBoost models.
Compared with TB-Profiler, the LR model for rifampicin was very sensitive, and the GBC model for isoniazid and the XGBoost model for ethambutol were very specific, on the Uganda dataset. TB-Profiler outperformed all the best models on the South Africa dataset. We identified key mutations associated with resistance to these drugs. HIV status was also among the top significant features for predicting drug resistance.
Conclusion
Leveraging machine learning to predict antimicrobial resistance is a promising avenue for addressing the global health challenge that resistance poses. This work demonstrates that integrating diverse data types, such as genomic and clinical data, could improve machine learning-based resistance predictions, support robust surveillance systems, and inform targeted interventions to curb the rising threat of antimicrobial resistance.
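For readers unfamiliar with the two selection metrics named above, the following is a minimal, standard-library Python sketch of how MCC and AUC can be computed for a binary resistant/susceptible classifier. It is an illustration only, not the study's pipeline: the function names, the 0.5 decision threshold, and the eight-isolate toy data are all hypothetical, and the study itself most likely relied on a library implementation.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels (1 = resistant, 0 = susceptible)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen resistant isolate
    receives a higher score than a randomly chosen susceptible one
    (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one isolate of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 resistant and 5 susceptible isolates.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f"MCC = {matthews_corrcoef(y_true, y_pred):.2f}")  # MCC = 0.47
print(f"AUC = {roc_auc(y_true, scores):.2f}")            # AUC = 0.93
```

MCC uses all four cells of the confusion matrix, so it remains informative when resistant isolates are much rarer than susceptible ones. AUC is threshold-free and measures only how well the scores rank the isolates, which is why the two metrics can disagree, as in this toy example.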