Background
The COVID-19 pandemic, which followed the emergence of SARS-CoV-2 in late 2019, caused global devastation, with millions of lives lost by January 2024. Although the WHO declared the end of the global health emergency in May 2023, the virus persists, propelled by mutations. Variants continue to challenge vaccination efforts, underscoring the need for ongoing vigilance. This study aimed to contribute to a more data-driven approach to pandemic management by using random forest regression to analyze regional variant prevalence.

Methods
This study used data from several sources, including the National COVID Cohort Collaborative database, the Bureau of Transportation Statistics, World Weather Online, the EPA, and the US Census. Key variables included pollution, weather, travel patterns, and demographics. Preprocessing involved merging and normalizing the datasets. Training data spanned January 2021 to February 2023. The Random Forest (RF) Regressor was chosen for its accuracy in modeling. To prevent data leakage, time series splits were employed. Model performance was evaluated using metrics such as mean squared error (MSE) and R-squared.

Results
The Alpha variant was predominant in the Southeast, with a share below 80% even at its peak. Delta surged first in Kansas City and remained dominant there for over 5 months. Omicron subvariant BA.5 spread nationwide, becoming predominant across all Health and Human Services regions simultaneously, with New York seeing the earliest and fastest decline in its share. Variant XBB.1.5 was more concentrated in the Northeast, but limited data hindered a full analysis. The RF Regressor identified key features affecting spread patterns with high predictive accuracy. Each variant showed specific environmental and demographic correlations; for instance, Alpha with air quality index, Delta with ozone density, BA.5 with UV index, and XBB.1.5 with land area and income. Correlation analysis further highlighted variant-specific associations.

Conclusions
This research provides a comprehensive analysis of the regional distribution of COVID-19 variants, offering critical insights for devising targeted public health strategies. By applying machine learning, the study uncovers the complex factors behind variant spread and shows how specific factors contribute to the prevalence of each variant.
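
The leakage-safe evaluation mentioned in Methods (time series splits scored with MSE and R-squared) might look roughly like the sketch below. It is an illustration under stated assumptions, not the study's code: the 26-point monthly series is synthetic, the function names are hypothetical, and a naive last-value forecast stands in for the Random Forest Regressor so the example needs only the Python standard library.

```python
import math
from statistics import fmean


def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs in which every test index is later
    than every train index, so no future data leaks into training."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - k + 1) * fold
        yield list(range(train_end)), list(range(train_end, train_end + fold))


def mse(y_true, y_pred):
    """Mean squared error."""
    return fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for a monthly variant share (assumption: 26 points,
# one per month from January 2021 to February 2023; values are invented).
share = [0.5 + 0.4 * math.sin(i / 4) for i in range(26)]

for train_idx, test_idx in expanding_window_splits(len(share), n_splits=4):
    # Placeholder model (assumption): repeat the last training value.
    last_seen = share[train_idx[-1]]
    y_true = [share[i] for i in test_idx]
    y_pred = [last_seen] * len(test_idx)
    print(f"train={len(train_idx):2d} test={len(test_idx)} "
          f"MSE={mse(y_true, y_pred):.4f} R2={r_squared(y_true, y_pred):.3f}")
```

Each fold trains only on months that precede its test window, which is the property that guards against leakage. In practice the placeholder forecast would be replaced by a fitted regressor, for example scikit-learn's `RandomForestRegressor` paired with `TimeSeriesSplit`.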