Clusters of invasive group A streptococcal (iGAS) infection, linked to genomically closely related group A streptococcal (GAS) isolates (referred to as genomic clusters), pose public health threats, and are increasingly identified through whole-genome sequencing (WGS) analysis. In this study, we aimed to assess the risk of genomic cluster formation among iGAS cases not already part of existing genomic clusters. In this WGS and population-based surveillance study, we analysed iGAS case isolates from the Active Bacterial Core surveillance (ABCs), which is part of the US Centers for Disease Control and Prevention's Emerging Infections Program, in ten US states from Jan 1, 2015, to Dec 31, 2019. We included all residents in ABCs sites with iGAS infections meeting the case definition and excluded non-conforming GAS infections and cases with whole-genome assemblies of the isolate containing fewer than 1·5 million total bases or more than 150 contigs. For iGAS cases we collected basic demographics, underlying conditions, and risk factors for infection from medical records, and for isolates we included emm types, antimicrobial resistance, and presence of virulence-related genes. Two iGAS cases were defined as genomically clustered if their isolates differed by three or fewer single-nucleotide variants. An iGAS case not clustered with any previous cases at the time of detection, with a minimum trace-back time of 1 year, was defined as being at risk of cluster formation. We monitored each iGAS case at risk for a minimum of 1 year to identify any cluster formation event, defined as the detection of a subsequent iGAS case clustered with the case at risk. We used the Kaplan-Meier method to estimate the cumulative incidence of cluster formation events over time. We used Cox regression to assess associations between features of cases at risk upon detection and subsequent cluster formation.
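The definitions above (genomically clustered, case at risk, cluster formation event) can be read as a small procedure over detection dates and pairwise single-nucleotide variant distances. The following is a minimal sketch of that reading, not the study's pipeline: the names `Case`, `make_snv_lookup`, and `follow_up_cases` are hypothetical, the toy cases and distances are invented for illustration, and two choices are assumptions of the sketch, namely that a case is compared against all earlier cases in the dataset and that a case with no event is censored at the end of surveillance.

```python
from dataclasses import dataclass
from datetime import date, timedelta

SNV_THRESHOLD = 3                  # clustered if isolates differ by <= 3 SNVs
TRACE_BACK = timedelta(days=365)   # minimum trace-back time of 1 year
FOLLOW_UP = timedelta(days=365)    # minimum monitoring time of 1 year


@dataclass(frozen=True)
class Case:
    case_id: str
    detected: date


def make_snv_lookup(pairs, default=100):
    """Build a symmetric distance function from {(id_a, id_b): snv_count}.

    Pairs that are not listed get `default`, an arbitrary large distance
    chosen for this toy example.
    """
    table = {frozenset(key): value for key, value in pairs.items()}

    def snv(a, b):
        return table.get(frozenset((a, b)), default)

    return snv


def follow_up_cases(cases, snv, surveillance_start, surveillance_end):
    """Return (case_id, days_followed, event) for every case at risk.

    event is True when a later case is clustered with the case at risk;
    otherwise the case is censored at surveillance_end (an assumption).
    Cases detected on the same day are ordered arbitrarily here.
    """
    ordered = sorted(cases, key=lambda c: c.detected)
    results = []
    for i, case in enumerate(ordered):
        # Require at least 1 year of trace-back and 1 year of follow-up.
        if case.detected - surveillance_start < TRACE_BACK:
            continue
        if surveillance_end - case.detected < FOLLOW_UP:
            continue
        # Not at risk if already clustered with any previous case.
        if any(snv(case.case_id, p.case_id) <= SNV_THRESHOLD for p in ordered[:i]):
            continue
        # Cluster formation event: first subsequent clustered case.
        event_date = next(
            (c.detected for c in ordered[i + 1:]
             if snv(case.case_id, c.case_id) <= SNV_THRESHOLD),
            None,
        )
        end = event_date if event_date is not None else surveillance_end
        results.append((case.case_id, (end - case.detected).days, event_date is not None))
    return results


# Invented toy data, not study data.
toy_cases = [
    Case("A", date(2015, 6, 1)),   # too early: less than 1 year of trace-back
    Case("B", date(2016, 3, 1)),   # at risk; C later clusters with it
    Case("C", date(2016, 4, 15)),  # clustered with previous case B, so not at risk
    Case("D", date(2017, 1, 10)),  # at risk; never clusters, so censored
]
toy_snv = make_snv_lookup({("A", "B"): 10, ("B", "C"): 2})
toy_follow_up = follow_up_cases(toy_cases, toy_snv, date(2015, 1, 1), date(2019, 12, 31))
```

In this toy run, only B and D count as cases at risk, and B has a cluster formation event when C is detected. The resulting `(days, event)` pairs are the kind of time-to-event input that Kaplan-Meier and Cox analyses take.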
We developed a random survival forest machine-learning model based on a derivation cohort (random selection of 50% of cases at risk) to predict cluster formation risk. This model was validated using a validation cohort consisting of the remaining 50% of cases at risk. We identified 2764 iGAS cases at risk from 2016 to 2018, of which 656 (24%) formed genomic clusters by the end of 2019. Overall, the cumulative incidence of cluster formation was 0·057 (95% CI 0·048-0·066) at 30 days after detection, 0·12 (0·11-0·13) at 90 days after detection, and 0·16 (0·15-0·18) at 180 days after detection. A higher risk of cluster formation was associated with emm type (adjusted hazard ratio as compared with emm89 was 2·37 [95% CI 1·71-3·30] for emm1, 2·72 [1·82-4·06] for emm3, 2·28 [1·49-3·51] for emm6, 1·47 [1·05-2·06] for emm12, and 2·21 [1·38-3·56] for emm92), homelessness (1·42 [1·01-1·99]), injection drug use (2·08 [1·59-2·72]), residence in a long-term care facility (1·78 [1·29-2·45]), and the autumn-winter season (1·34 [1·14-1·57]) in multivariable Cox regression analysis. The machine-learning model stratified the validation cohort (n=1382) into groups at low (n=370), moderate (n=738), and high (n=274) risk. The 90-day risk of cluster formation was 0·03 (95% CI 0·01-0·05) for the group at low risk, 0·10 (0·08-0·13) for the group at moderate risk, and 0·21 (0·17-0·25) for the group at high risk. These results were consistent with the cross-validation outcomes in the derivation cohort. Using population-based surveillance data, we found that pathogen, host, and environment factors of iGAS cases were associated with increased likelihood of subsequent genomic cluster formation. Groups at high risk were consistently identified by a predictive model which could inform prevention strategies, although future work to refine the model, incorporating other potential risk factors such as host contact patterns and immunity to GAS, is needed to improve its predictive performance.
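The cumulative incidence figures reported above at 30, 90, and 180 days are the kind of quantity obtained as one minus the Kaplan-Meier survival estimate at each horizon. The following is a minimal sketch of that calculation, assuming right-censored follow-up and no competing events; the function name `km_cumulative_incidence` is hypothetical and the ten observations are invented for illustration, not study data.

```python
def km_cumulative_incidence(observations, horizons):
    """Kaplan-Meier cumulative incidence, 1 - S(t), at each horizon.

    observations: iterable of (days_followed, event) pairs, where event is
    True for a cluster formation event and False for censoring.
    horizons: iterable of day counts at which to report the estimate.
    """
    observations = list(observations)
    event_times = sorted({days for days, event in observations if event})

    survival = 1.0
    curve = []  # (event time, survival just after that time)
    for t in event_times:
        at_risk = sum(1 for days, _ in observations if days >= t)
        events = sum(1 for days, event in observations if event and days == t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))

    result = {}
    for horizon in horizons:
        s_at_horizon = 1.0
        for t, s in curve:
            if t > horizon:
                break
            s_at_horizon = s
        result[horizon] = 1 - s_at_horizon
    return result


# Invented toy cohort of 10 cases at risk: events at 20, 60, and 150 days,
# one case censored at 100 days, six cases censored at 400 days.
toy_observations = (
    [(20, True), (60, True), (100, False), (150, True)] + [(400, False)] * 6
)
toy_incidence = km_cumulative_incidence(toy_observations, [30, 90, 180])
```

With this toy cohort the estimate is 0.1 at 30 days and 0.2 at 90 days. At 180 days it is 1 - 0.8 × 6/7, about 0.314, because the case censored at 100 days has left the risk set before the event at 150 days.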
Centers for Disease Control and Prevention.