Mixed Sequence Reader: A Program for Analyzing DNA Sequences with Heterozygous Base Calling

Chun-Tien Chang, Yun-Shien Lee, Chia-Lung Tsai, Chuan Yi Tang, Tzu-Hao Wang, Angel Chao, Chi-Yu Hu, Jang-Hau Lian, Chi-Neu Tsai, Chyong-Huey Lai, Chun-Houh Chen

doi:10.1100/2012/365104

Abstract

The direct sequencing of PCR products generates heterozygous base-calling fluorescence chromatograms that are useful for identifying single-nucleotide polymorphisms (SNPs), insertion-deletions (indels), short tandem repeats (STRs), and paralogous genes. Indels and STRs can be easily detected using the currently available Indelligent or ShiftDetector programs, which do not search reference sequences. However, the detection of other genomic variants remains a challenge due to the lack of appropriate tools for analyzing heterozygous base-calling fluorescence chromatogram data. In this study, we developed a free web-based program, Mixed Sequence Reader (MSR), which can directly analyze heterozygous base-calling fluorescence chromatogram data in .abi file format by comparing them with reference sequences. The heterozygous sequences are identified as two distinct sequences and aligned with reference sequences. Our results showed that MSR may be used to (i) physically locate indel and STR sequences and determine STR copy number by searching NCBI reference sequences; (ii) predict combinations of microsatellite patterns using the Federal Bureau of Investigation Combined DNA Index System (CODIS); (iii) determine human papillomavirus (HPV) genotypes by searching current viral databases in cases of double infections; and (iv) estimate the copy number of paralogous genes, such as β-defensin 4 (DEFB4) and its paralog HSPDP3.

Introduction

The detection of genomic variations is important in studying the relationships between causative genes and diseases and the relationships between predisposing genes and complex trait diseases, such as type 2 diabetes, coronary heart disease, and cancers [1,2,3,4]. Human genomic variations include single-nucleotide polymorphisms (SNPs), variable number of tandem repeats (VNTRs), short tandem repeats (STRs, or microsatellites), and copy number variations (CNVs) [6]. Among these genomic variants, there are currently 51,810,853 reference SNPs for the human genome, which include 6,516,668 indel sequences and 5,214 microsatellite markers, according to dbSNP Build 135. Genomic variants are frequently identified from heterozygous base-calling fluorescence chromatogram data generated by the direct sequencing of genomic PCR products using the dye-terminator method with Applied Biosystems (ABI) autosequencers, such as models 3700 or 3730. The alignment of heterozygous base-calling fluorescence data with a reference database can be used to detect the physical position of an indel within the genome. Some heterozygous indels may not be visualized (Figures 1(b-4) and 2(a)).
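To make the idea of aligning a heterozygous read against a reference more concrete, the following is a minimal sketch of how a heterozygous deletion could be located and the read split into two sequences. It is not MSR's algorithm, which the text above does not describe. It assumes that the mixed base calls have already been reduced to IUPAC ambiguity codes (real .abi data carry peak intensities instead), that both alleles are present at every position, and that one allele is identical to the reference. The function names `encode_mixture` and `resolve_heterozygous_deletion` are hypothetical.

```python
# IUPAC ambiguity codes -> the set of bases each code stands for
# (only single bases and two-base mixtures are needed here).
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}
# Reverse lookup: set of bases -> IUPAC code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def encode_mixture(allele_a, allele_b):
    """Simulate a mixed read by superimposing two alleles position by position."""
    return "".join(CODE_FOR[frozenset({a, b})] for a, b in zip(allele_a, allele_b))


def resolve_heterozygous_deletion(mixed, reference, max_len=10):
    """Find a deletion (start, length) that explains the mixed read.

    Before `start`, each call must equal the reference base; from `start`
    onward, each call must equal the mixture of reference[i] and
    reference[i + length]. Returns (start, length, reference_allele,
    deleted_allele), or None if no deletion up to `max_len` fits.
    """
    n = len(mixed)
    for length in range(1, max_len + 1):
        if n + length > len(reference):
            break  # not enough reference sequence to test this shift
        for start in range(n):
            consistent = all(
                IUPAC[code] == (
                    {reference[i]} if i < start
                    else {reference[i], reference[i + length]}
                )
                for i, code in enumerate(mixed)
            )
            if consistent:
                ref_allele = reference[:n]
                del_allele = reference[:start] + reference[start + length:n + length]
                return start, length, ref_allele, del_allele
    return None


if __name__ == "__main__":
    # Made-up reference and a simulated 2-base deletion at position 6 (0-based).
    reference = "ACGTTGCAAGGCTTACGATC"
    deleted = reference[:6] + reference[8:]
    mixed = encode_mixture(reference[:14], deleted[:14])
    print(mixed)
    print(resolve_heterozygous_deletion(mixed, reference))
```

In this toy case the read is clean up to the breakpoint and mixed afterwards, and the search recovers both the breakpoint and the deletion length. A practical tool would also have to tolerate base-calling noise, unequal peak heights, insertions, and reads that start at an unknown offset in the reference.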

Methods

Results

Conclusion