Identifying individuals with tuberculosis (TB) who are at high risk of onward transmission can guide disease prevention and public health strategies. Here, we train classification models to predict the first sampled isolates in Mycobacterium tuberculosis transmission clusters from demographic and disease data. We find that supervised learning, in particular balanced random forests, can be used to develop predictive models that identify people with TB who are more likely to be associated with TB cluster growth, with good model performance (AUCs of ≥ 0.75). We also identify the most important patient and disease characteristics in the best-performing classification model, including host demographics, site of infection, TB lineage, and age at diagnosis. This framework can be used to develop predictive tools for the early assessment of potential cluster growth, so that individuals can be prioritised for enhanced follow-up with the aim of reducing transmission chains.
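The abstract names two technical ingredients without giving implementation detail: class-balanced resampling (the idea behind balanced random forests) and evaluation by AUC. The following is a minimal sketch of both in plain Python, not the authors' pipeline. It assumes binary labels where 1 marks the minority class of interest; the function names `balanced_bootstrap` and `roc_auc` and the toy data are hypothetical and chosen only for illustration.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return row indices for one class-balanced bootstrap sample.

    Both classes are sampled with replacement to the size of the
    minority class, so each tree grown on the sample sees as many
    positives as negatives. (Assumed labelling: 1 = positive.)
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    if n == 0:
        raise ValueError("both classes must be present")
    return [rng.choice(pos) for _ in range(n)] + [rng.choice(neg) for _ in range(n)]


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive is scored
    above a randomly chosen negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy, made-up data: two positives among eight records.
toy_labels = [1, 0, 0, 0, 0, 0, 1, 0]
sample = balanced_bootstrap(toy_labels, random.Random(0))

# Toy scores for four records; three of the four positive/negative
# pairs are ranked correctly.
toy_auc = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

In a balanced random forest, each tree would be fitted on a sample such as `sample`, and the averaged tree votes would provide the scores passed to `roc_auc`. Under the pairwise definition used here, an AUC of 0.75 means that a positive record is ranked above a negative one in three out of four such pairs.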