Abstract

Structural variation (SV) is typically defined as variation within the human genome that exceeds 50 base pairs (bp). SV may be copy number neutral, or it may involve duplications, deletions, and complex rearrangements. Recent studies have shown SV to be associated with many human diseases. However, studies of SV have been challenging due to technological constraints. The advent of third-generation (long-read) sequencing technology has made it possible to explore longer stretches of DNA that were not easily examined previously. In the present study, we used long-read sequencing to examine SV in the EGFR landscape of four haplotypes derived from two human samples. We analyzed the EGFR gene and its landscape (± 500,000 bp) using this approach and identified a region of non-coding DNA with over 90% similarity to the most common activating EGFR mutation in non-small cell lung cancer. Based on previously published Alu-element genome instability algorithms, we propose a molecular mechanism to explain how this non-coding region of DNA may interact with and affect the stability of the EGFR gene, potentially generating this cancer-driver gene. Using these techniques, we also identified previously hidden structural variation in the four haplotypes and in the human reference genome (hg38). We applied previously published algorithms to compare the relative stabilities of these five EGFR gene landscape haplotypes and to estimate their relative potentials to generate the canonical 15 bp EGFR exon 19 deletion. To our knowledge, the present study is the first to use the differences in genomic architecture between targeted cancer-linked phased haplotypes to estimate their relative potentials to form a common cancer-linked driver mutation.

Highlights

  • Over the past decade, the ability to examine the human genome beyond short segments of a few hundred base pairs has greatly improved with the advent of third-generation sequencing technologies, which have improved the detection and characterization of structural variants.

  • We examined, via in silico manipulation, the stability of the EGFR landscape when subjected to a 1,000 bp deletion, duplication, or inversion at the high-homology reverse complement.

  • The reverse complements (60% homology and within ±421,000 bp) to the EGFR exon 19 canonical deletion varied across the five haploid genomes examined (four patient landscapes and hg38).
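
The operations named in the highlights above (locating reverse-complement matches to a query sequence, then applying a deletion, duplication, or inversion at a chosen position) can be illustrated with a minimal Python sketch. This is not the authors' pipeline or the previously published instability algorithms: the function names are hypothetical, homology is approximated here as ungapped percent identity, and the "at least" reading of the 60% threshold is an assumption.

```python
# Illustrative sketch only; names and the identity measure are assumptions,
# not the method used in the study.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def percent_identity(a: str, b: str) -> float:
    """Ungapped percent identity between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def find_reverse_complement_hits(landscape: str, query: str,
                                 min_identity: float = 60.0):
    """Slide the reverse complement of `query` along `landscape` and
    return (start, identity) for windows at or above `min_identity`."""
    target = reverse_complement(query)
    k = len(target)
    hits = []
    for start in range(len(landscape) - k + 1):
        identity = percent_identity(landscape[start:start + k], target)
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


def delete_segment(seq: str, start: int, length: int) -> str:
    """Remove `length` bases beginning at `start`."""
    return seq[:start] + seq[start + length:]


def duplicate_segment(seq: str, start: int, length: int) -> str:
    """Insert a tandem copy of the segment immediately after itself."""
    end = start + length
    return seq[:end] + seq[start:end] + seq[end:]


def invert_segment(seq: str, start: int, length: int) -> str:
    """Replace the segment with its reverse complement (an inversion)."""
    end = start + length
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]
```

On a real landscape the `length` argument would be 1,000 and the scan would be restricted to the window of interest around the gene; a sliding ungapped comparison is quadratic in the worst case and is shown only for clarity, so an alignment tool would be the practical choice at that scale.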


Summary

Introduction

The ability to examine the human genome beyond short segments of a few hundred base pairs (bp) has greatly improved with the advent of third-generation (i.e., long-read) sequencing technologies, which have improved the detection and characterization of structural variants. Short-read sequencing often struggles to call large structural variants accurately, especially in highly repetitive regions of a genome [1,2,3,4]. Although single nucleotide variants (SNVs) constitute the majority of variants in the human genome, SV affects far more bases. The typical genome contains an estimated 2,100 to 2,500 structural variants, affecting ~20 million bases of sequence, compared with only ~4–5 million bases affected by SNVs [2, 3]. The 1000 Genomes Consortium studied 2,504 individuals and identified 3,163 specific regions of the genome (~13 percent of the genome) in which there were consistently three or more instances of SV [12].

