CovTransformer: A transformer model for SARS-CoV-2 lineage frequency forecasting.

Yinan Feng, Emma Elizabeth Goldberg, Michael Kupperman, Xitong Zhang, Youzuo Lin, Ruian Ke

doi:10.1093/ve/veae086

Journal: Virus Evolution
Publication Date: Nov 14, 2024
License type: CC BY-NC 4.0

Abstract

With hundreds of SARS-CoV-2 lineages circulating in the global population, there is an ongoing need to forecast lineage frequencies and thereby identify rapidly expanding lineages. Accurate prediction would allow for more focused experimental efforts to understand the pathogenicity of future dominating lineages and to characterize the extent of their immune escape. Here, we first show that the inherent noise and biases in lineage frequency data make a commonly used regression-based approach unreliable. To address this weakness, we constructed a machine learning model for SARS-CoV-2 lineage frequency forecasting, called CovTransformer, based on the transformer architecture. We designed our model to navigate challenges such as a limited amount of data with high levels of noise and bias. We first trained and tested the model using data from the UK and the USA, and then tested its ability to generalize to many other countries and US states. Remarkably, the trained model makes accurate predictions two months into the future, both globally (in 31 countries with high levels of sequencing effort) and at the US-state level. Our model performed substantially better than a widely used forecasting tool, the multinomial regression model implemented in Nextstrain, demonstrating its utility in SARS-CoV-2 monitoring. Assuming a newly emerged lineage has been identified and assigned, our test using retrospective data shows that our model is able to identify dominating lineages, on average, 7 weeks before they become dominant. Overall, our work demonstrates that transformer models represent a promising approach for SARS-CoV-2 forecasting and pandemic monitoring.
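
For readers unfamiliar with the regression-based baseline mentioned in the abstract, the following is a minimal sketch of how a multinomial logistic model can project lineage frequencies forward in time: each lineage gets a logit that is linear in time, and a softmax turns the logits into frequencies. It is not the Nextstrain implementation and not CovTransformer. The function name `forecast_frequencies` and all parameter values are hypothetical, chosen only for illustration and not fitted to any data.

```python
import numpy as np


def forecast_frequencies(intercepts, growth_rates, days):
    """Project lineage frequencies with a multinomial logistic model.

    Each lineage i has logit a_i + b_i * t, and frequencies are the
    softmax of the logits across lineages at each time point t.
    Returns an array of shape (len(days), n_lineages).
    """
    intercepts = np.asarray(intercepts, dtype=float)
    growth_rates = np.asarray(growth_rates, dtype=float)
    days = np.asarray(days, dtype=float)

    # Rows are time points, columns are lineages.
    logits = intercepts[None, :] + np.outer(days, growth_rates)
    # Subtract the row-wise maximum for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Hypothetical parameters for three lineages (illustrative only).
intercepts = [0.0, -2.0, -5.0]    # initial log-odds relative to lineage 0
growth_rates = [0.0, 0.03, 0.12]  # per-day growth advantage over lineage 0

days = np.arange(0, 57, 7)        # weekly time points over eight weeks
freqs = forecast_frequencies(intercepts, growth_rates, days)
dominant = freqs.argmax(axis=1)   # index of the most frequent lineage per week

for day, row, lead in zip(days, freqs, dominant):
    print(f"day {int(day):2d}: {np.round(row, 3)} dominant lineage = {lead}")
```

In this toy setup, the third lineage starts rare but has the largest growth advantage, so it overtakes the others within the eight-week window. With real data, the intercepts and growth rates would have to be estimated from noisy, biased sequence counts, which is the step the abstract identifies as unreliable.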

Full Text