Modeling genetic heterogeneity in Hepatitis C Virus hyper-variable region 1 infers demographic characteristics of infected hosts

James Lara, Yury Khudyakov

doi:10.1109/iccabs.2014.6863936

Abstract

Hepatitis C virus (HCV) is the most common etiological cause of non-A/non-B blood-borne viral hepatitis and the leading cause of liver transplantation. The population of HCV-infected individuals in the US is estimated to be over 3 million. There are 7 major HCV genotypes with worldwide distribution, which are further divided into numerous sub-genotypes. HCV genotype 1a is the most common genotype in the US, followed by genotype 1b. Both genotypes are responsible for the most difficult-to-treat infections. Several host- and virus-related factors have been linked to the development of chronic HCV infection, liver disease progression and therapy outcome. We previously reported that certain host demographic characteristics were associated with the genetic properties of HCV strains in a group of chronically infected patients undergoing combined interferon and ribavirin therapy. In this study we extended the analysis to a larger dataset to further explore the association between hosts' ethnic background and the genetic properties of the HCV hyper-variable region 1 (HVR1). The HCV data contained sequences of intra-host HVR1 variants of HCV1a and HCV1b (n=936 and n=630, respectively) obtained from a national survey and five independent state-wide outbreak investigations. The association between properties of HVR1 strains and hosts' ethnicity was examined using viral features derived from nucleotide (nt) and amino-acid (aa) sequence information. Nucleotide sequences of 87 nt at genome positions 1491-1577 and amino acid sequences of 29 aa at polyprotein positions 384-412 (GenBank reference sequence AF01175) were paired with the hosts' ethnicity data, Caucasian (CA) or Afro-American (AA).
To identify viral nt- or aa-based features associated with host ethnicity, we applied a correlation feature selection (CFS) method, which finds feature subsets that correlate highly with the variable of interest while correlating weakly with each other. In HCV1a data, the best HVR1 nt-based feature subset (merit=0.26) and aa-based subset (merit=0.20) consisted of 9 nt sites and 6 aa sites, respectively. In HCV1b data, the best nt-based feature subset (merit=0.35) and aa-based subset (merit=0.25) consisted of 13 nt and 8 aa sites, respectively. These findings indicate that the ethnicity variable is associated with genetic heterogeneity at certain sets of genomic and polyprotein sites. They also indicate the absence of a strong correlation between variation at any single site and the ethnicity variable. Therefore, to account for interactions and/or dependencies among the features in the selected subsets, which are associated as a group with host ethnicity, we modeled the relationship between genetic features and ethnicity using Bayesian network classifiers (BNCs). BNC models were initially constructed as naïve Bayesian networks and were then allowed to learn dependencies among the features. BNC performance was evaluated using F-measure and classification accuracy during training, by 10-fold cross-validation (10xCV), and during testing on out-of-sample data (validation). BNC evaluations were also carried out using 5 datasets generated by random sampling from the HCV data, in which sequences were randomly assigned to ethnicity classes. High accuracy (10xCV / validation) was observed for the HCV1a BNCs based on the 9 nt features (91.1% / 91.7%) and the 6 aa features (83.3% / 82.7%). Accuracy of BNCs on randomly labeled data was significantly lower (average accuracy: 9nt-BNC Rand = 60.9%, 6aa-BNC Rand = 47.3%).
Similar performance was observed for BNCs constructed from HCV1b data, where classification accuracy was further improved by integrating the learned 13nt and 8aa BNCs into a single combined 21-feature BNC (96.3% / 90.2%). Average accuracy of the corresponding BNC Rand was 48.6%. In conclusion, the findings of this study suggest that HVR1 sequence variants are strongly associated with host ethnicity. We hypothesize that this association is based on extensive epistatic connectivity of the HVR1 sites with sites across the entire HCV genome, reflecting coevolution among genomic sites in response to the numerous selection pressures acting on HCV during chronic infection. Hence, intra-host HVR1 variants can serve as a genetic sensor of the HCV genome and provide parameters relevant for engineering robust computational models that associate properties encoded in the genomes of viral strains with a wide range of HCV-associated features, such as stages of intra-host infection, drug resistance and virulence.
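The merit values quoted for the CFS subsets can be read through the usual CFS heuristic, in which a subset of k features scores k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-to-class correlation and r_ff is the mean feature-to-feature correlation. The sketch below is illustrative only: it assumes this standard formulation (the abstract does not state which variant or correlation measure was used), and the function name `cfs_merit` and its input layout are hypothetical.

```python
import math
from itertools import combinations


def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset under the standard CFS heuristic (assumed here).

    feature_class_corr: list of absolute correlations, one per feature,
        between that feature and the class variable (e.g. host ethnicity).
    feature_feature_corr: dict mapping index pairs (i, j) with i < j to the
        absolute correlation between feature i and feature j.
    """
    k = len(feature_class_corr)
    if k == 0:
        raise ValueError("subset must contain at least one feature")

    # Mean feature-to-class correlation.
    r_cf = sum(feature_class_corr) / k

    # Mean feature-to-feature correlation over all unordered pairs.
    pairs = list(combinations(range(k), 2))
    r_ff = sum(feature_feature_corr[p] for p in pairs) / len(pairs) if pairs else 0.0

    # High class correlation raises the merit; redundancy among features lowers it.
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical toy values, not data from the study.
redundant = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
independent = {(0, 1): 0.0, (0, 2): 0.0, (1, 2): 0.0}
merit_redundant = cfs_merit([0.3, 0.2, 0.1], redundant)
merit_independent = cfs_merit([0.3, 0.2, 0.1], independent)
```

With the same feature-to-class correlations, the mutually uncorrelated subset scores higher than the fully redundant one, which is the behavior the abstract describes: subsets are preferred when their features track the class but not each other.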

Full Text