The untranslated regions (UTRs) of messenger ribonucleic acid (mRNA), the 5'UTR and the 3'UTR, play a critical role in regulating gene expression and translation. Variants within the UTRs can lead to changes associated with human traits and diseases; however, computational prediction of UTR variant effects remains challenging. Current noncoding variant predictors focus mainly on promoters and enhancers and neglect the distinctive sequence characteristics of the UTR, which limits their predictive accuracy for UTR variants. In this study, using consolidated datasets of UTR variants from disease databases and large-scale experimental data, we systematically analyzed more than 50 region-specific features of the UTR, including functional elements, secondary structure, sequence composition and site conservation. Our analysis reveals that certain features, such as C/G-related sequence composition in the 5'UTR and A/T-related sequence composition in the 3'UTR, effectively differentiate between nonfunctional and functional variant sets, unveiling potential sequence determinants of functional UTR variants. Leveraging these insights, we developed two machine learning classification models to predict functional UTR variants, achieving an area under the curve (AUC) of 0.94 for the 5'UTR and 0.85 for the 3'UTR and outperforming all existing methods. Our models will be valuable for enhancing clinical interpretation of genetic variants, facilitating the prediction and management of disease risk.
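To make the two quantitative ideas in the abstract concrete, the following is a minimal illustrative sketch, not the authors' pipeline. It computes a C/G fraction in a window around a variant position, as one example of a sequence-composition feature, and evaluates a set of scores with a rank-based AUC. The function names `gc_fraction` and `auc`, the 0-based position convention, and the default flank of 10 nucleotides are all assumptions made for illustration; the study's actual feature definitions and window sizes are not given here.

```python
def gc_fraction(seq, pos, flank=10):
    """Fraction of C/G bases in a window centered on a 0-based position.

    The window size (flank) is a hypothetical choice, not taken from the paper.
    """
    if not 0 <= pos < len(seq):
        raise ValueError("pos must index a base inside seq")
    start = max(0, pos - flank)
    window = seq[start:pos + flank + 1].upper()
    gc_count = sum(1 for base in window if base in "CG")
    return gc_count / len(window)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen functional variant
    (label 1) scores higher than a randomly chosen nonfunctional one
    (label 0), counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In this sketch an AUC of 1.0 means every functional variant outscores every nonfunctional one, and 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of variants, which is fine for illustration but would normally be replaced by a sorted-rank computation on large datasets.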