Background/Objectives: Tuberculosis (TB), caused by the bacterium Mycobacterium tuberculosis (M. tuberculosis), is one of the world's leading causes of death from infectious disease. First- and second-line drugs are used to treat active tuberculosis; however, drug resistance presents a critical challenge in the global fight against the disease. Materials and Methods: Drug resistance is typically diagnosed through drug susceptibility testing, using either culture-based methods or rapid molecular tests. In recent years, studies have combined whole-genome sequencing of M. tuberculosis isolates with machine learning techniques to predict drug resistance. In this study, we evaluated four machine learning classification models: Extreme Gradient Boosting Classifier (XGBC), Logistic Gradient Boosting Classifier (LGBC), Gradient Boosting Classifier (GBC), and an Artificial Neural Network (ANN). These models were trained using a Variant Call Format (VCF) file preprocessed by the CRyPTIC consortium. We employed three datasets: the original dataset, a dataset reduced through principal component analysis, and a dataset restricted to the most important features identified by the XGBC model. Results: The four models were applied to the principal component analysis dataset, while the XGBC model was additionally trained on an arbitrarily reduced dataset containing only the most significant mutations identified during model training. All models were trained and tested across these datasets, and their performances were compared. The XGBC model trained on the original dataset outperformed the others, achieving sensitivity values of 0.97, 0.90, and 0.94, specificity values of 0.97, 0.99, and 0.96, and F1-scores of 0.93, 0.94, and 0.92 for ethambutol, isoniazid, and rifampicin, respectively.
Discussion: This study highlights the effectiveness of a binary representation of mutations (indicating their presence or absence) as a robust approach for training XGBC models that accurately classify resistance and susceptibility to ethambutol, isoniazid, and rifampicin in M. tuberculosis. Conclusion: The XGBC model trained on the original dataset demonstrated superior performance compared to the other models, indicating its potential for improving drug resistance prediction in TB. This approach could be further explored for clinical applications in the fight against drug-resistant tuberculosis.
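To make the binary mutation representation and the reported metrics concrete, the following is a minimal sketch and not the study's actual pipeline. It assumes each isolate has already been reduced to a set of variant identifiers (the names below are illustrative placeholders, not taken from the CRyPTIC data) and that resistant isolates are labeled 1 and susceptible isolates 0. The function names `binary_matrix` and `sens_spec_f1` are hypothetical helpers introduced only for this illustration.

```python
from typing import Dict, List, Sequence, Set, Tuple


def binary_matrix(
    isolate_variants: Dict[str, Set[str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Encode each isolate as a 0/1 vector: 1 if the variant is present, else 0."""
    # Columns are the sorted union of all variants seen in any isolate.
    variants = sorted(set().union(*isolate_variants.values()))
    isolates = sorted(isolate_variants)
    rows = [
        [1 if v in isolate_variants[name] else 0 for v in variants]
        for name in isolates
    ]
    return isolates, variants, rows


def sens_spec_f1(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Sensitivity, specificity and F1, with resistant (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return sensitivity, specificity, f1


# Illustrative input: three hypothetical isolates and their called variants.
example = {
    "isolate_A": {"rpoB_S450L", "katG_S315T"},
    "isolate_B": {"embB_M306V"},
    "isolate_C": set(),
}
isolates, variants, X = binary_matrix(example)
```

A matrix such as `X` is the kind of input a gradient boosting classifier could be trained on, one model per drug; the metric helper then compares the model's predictions against the phenotypic susceptibility labels.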