Probably Correct: Rescuing Repeats with Short and Long Reads.

Monika Cechova

doi:10.3390/genes12010048

Abstract

Ever since the introduction of high-throughput sequencing following the Human Genome Project, assembling short reads into a reference of sufficient quality has posed a significant problem, as a large portion of the human genome (an estimated 50–69%) is repetitive. As a result, a sizable proportion of sequencing reads is multi-mapping, i.e., without a unique placement in the genome. The two key parameters that determine whether a read is multi-mapping are the read length and the genome complexity. Long reads are now able to span difficult, heterochromatic regions, including full centromeres, and characterize chromosomes from “telomere to telomere”. Moreover, identical reads or repeat arrays can be differentiated based on their epigenetic marks, such as methylation patterns, aiding in the assembly process. This is despite the fact that long reads still contain a modest percentage of sequencing errors, which reduce both the accuracy and the speed of aligners and assemblers. Here, I review the proposed and implemented solutions to the repeat resolution and multi-mapping read problems, as well as the downstream consequences of reference choice, repeat masking, and proper representation of sex chromosomes. I also consider the forthcoming challenges and solutions with regard to long reads, where we expect a shift from the problem of repeat localization within a single individual to the problem of repeat positioning within pangenomes.

Highlights

While next-generation sequencing is increasingly used both in research and clinical practice, a subset of sequencing reads is frequently underutilized. These are reads that cannot be uniquely positioned within their respective genomes, and are multi-mapping in the chosen reference assembly. They frequently originate from duplicated genes [1], transposable elements [2,3], satellite repeats [4], and more generally a heterochromatic portion of the genome [5].
For typical applications, it might be advantageous not to use alternative haplotigs and to mask large multi-copy sequences such as pseudoautosomal regions (PARs) and a small subset of α-satellites that are artificially identical in the current GRC reference genomes [41].
If no further post-processing is applied, this approach can lead to erroneous downstream conclusions.

Summary

Introduction

While next-generation sequencing is increasingly used both in research and clinical practice, a subset of sequencing reads is frequently underutilized. These are reads that cannot be uniquely positioned within their respective genomes, and are multi-mapping in the chosen reference assembly. They frequently originate from duplicated genes [1], transposable elements [2,3], satellite repeats (e.g., centromeric and telomeric reads) [4], and more generally a heterochromatic portion of the genome [5]. Next-generation sequencing reads, even if not multi-mapping, can fall short of capturing the full repeat variability of individuals [22], especially when compared to a single reference genome (i.e., a limited representation of a genome).
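The dependence of multi-mapping on read length can be illustrated with a minimal sketch. It assumes error-free reads, exact matching, and the forward strand only; the function name `multi_mapping_fraction` and the 78 bp toy genome are hypothetical and are not taken from the review. Real aligners tolerate mismatches and report mapping qualities, so this is only an approximation of the idea.

```python
from collections import Counter


def multi_mapping_fraction(genome: str, read_length: int) -> float:
    """Fraction of start positions whose read also occurs elsewhere.

    Assumptions (illustrative only): reads are error-free substrings,
    matching is exact, and only the forward strand is considered.
    """
    n_reads = len(genome) - read_length + 1
    if read_length <= 0 or n_reads <= 0:
        raise ValueError("read_length must be between 1 and len(genome)")
    # Count how many times each possible read sequence occurs in the genome.
    counts = Counter(genome[i:i + read_length] for i in range(n_reads))
    # A start position is multi-mapping if its sequence occurs more than once.
    multi = sum(c for c in counts.values() if c > 1)
    return multi / n_reads


# Hypothetical toy genome: two copies of a 24 bp repeat array
# separated and flanked by 10 bp unique segments.
repeat = "ACGTTGCA" * 3
genome = "GATTACAGGC" + repeat + "TTCCGGAATC" + repeat + "CATGCATGGA"

if __name__ == "__main__":
    for k in (8, 16, 30):
        frac = multi_mapping_fraction(genome, k)
        print(f"read length {k}: {frac:.2f} of reads are multi-mapping")
```

In this toy case, reads shorter than the repeat copy often have several exact placements, while reads long enough to reach into the unique flanks do not, which is the intuition behind using long reads to span repeats.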

Reference Genomes Are Inherently Incomplete

Methods of Multi-Mapping Read Assignment

Repeat Masking and Its Consequences

Sex Chromosomes

Long Reads

Long-Read Sequencing Strategies

Long-Read Assemblies

Findings

Future


Journal: Genes
Publication Date: Dec 31, 2020
Citations: 5
License type: CC BY 4.0
