RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

Xingyu Liao, Fang-Xiang Wu, Jianxin Wang, Xiankai Zhang, Xin Gao

doi:10.1186/s12859-020-03779-w

Abstract

Background: Repetitive sequences account for a large proportion of eukaryotic genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines and tools obtain repeats by assembling high-frequency k-mers. However, assemblers require a certain degree of sequence coverage to produce the desired assemblies. In addition, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions of the genome with existing tools.

Results: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. First, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Second, RepAHR selects the high-frequency reads from the full set of NGS reads according to certain rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled into repeats using SPAdes, which is regarded as an outstanding genome assembler for NGS sequences.

Conclusions: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.
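The first two steps described in the abstract (finding high-frequency k-mers, then selecting high-frequency reads) can be illustrated with a minimal Python sketch. This is not the RepAHR implementation: the abstract only says reads are selected "according to certain rules", so the rule used here (keep a read when at least `min_fraction` of its k-mers are high-frequency) and the parameter names `min_count` and `min_fraction` are assumptions made for illustration. The sketch also ignores reverse complements and sequencing errors, and it does not perform the final SPAdes assembly.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def high_frequency_kmers(counts, min_count):
    """Return the set of k-mers seen at least min_count times."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


def filter_high_frequency_reads(reads, k, min_count, min_fraction):
    """Keep reads whose share of high-frequency k-mers is >= min_fraction.

    The selection rule is a hypothetical stand-in for the rules used by
    RepAHR, which the abstract does not spell out.
    """
    frequent = high_frequency_kmers(count_kmers(reads, k), min_count)
    selected = []
    for read in reads:
        total = len(read) - k + 1
        if total <= 0:
            continue  # read is shorter than k, so it has no k-mers
        hits = sum(1 for i in range(total) if read[i:i + k] in frequent)
        if hits / total >= min_fraction:
            selected.append(read)
    return selected
```

The selected reads would then be written to a FASTQ or FASTA file and passed to an assembler. Keeping whole reads rather than isolated k-mers is what lets the assembler see the repeat structure that k-mer-only approaches may lose.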

Highlights

Repetitive sequences account for a large proportion of eukaryotic genomes
Conclusions: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics
We evaluate the repeats identified by RepAHR, RepARK and REPdenovo on five next-generation sequencing (NGS) data sets
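N50, the first metric listed above, is a standard assembly statistic: the largest length L such that sequences of length L or longer together cover at least half of the total assembled bases. A minimal sketch of the usual definition follows. The function name `n50` is an illustrative choice, and the evaluation scripts used in the paper may differ in detail.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    if not lengths:
        return 0
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

For example, for lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest sequences cover 11 bases, which is at least half of the total, so the N50 is 5.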

Summary

Introduction

Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines and tools obtain repeats by assembling high-frequency k-mers. Repetitive sequences are patterns of nucleic acids that occur multiple times in the genome in the same or an approximate form. Based on their structure and distribution in the genome, repetitive sequences are classified into several types, such as tandem repeats and interspersed repeats. Transposable elements account for a large fraction of the genome and influence much of the mass of DNA in eukaryotic genomes [1]. Repetitive sequences pose a challenge for many basic genome sequence analysis tasks, such as de novo assembly, sequence alignment and sequence error correction [5].



Journal: BMC Bioinformatics	Publication Date: Oct 19, 2020
Citations: 6	License type: open-access

