Gentle Masking of Low-Complexity Sequences Improves Homology Search

Martin C Frith, Leonardo Mariño-Ramírez

doi:10.1371/journal.pone.0028819

Abstract

Detection of sequences that are homologous, i.e. descended from a common ancestor, is a fundamental task in computational biology. This task is confounded by low-complexity tracts (such as atatatatatat), which arise frequently and independently, causing strong similarities that are not homologies. There has been much research on identifying low-complexity tracts, but little research on how to treat them during homology search. We propose to find homologies by aligning sequences with “gentle” masking of low-complexity tracts. Gentle masking means that the match score involving a masked letter is min(0, S), where S is the unmasked score. Gentle masking slightly but noticeably improves the sensitivity of homology search (compared to “harsh” masking), without harming specificity. We show examples in three useful homology search problems: detection of NUMTs (nuclear copies of mitochondrial DNA), recruitment of metagenomic DNA reads to reference genomes, and pseudogene detection. Gentle masking is currently the best way to treat low-complexity tracts during homology search.
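As a rough illustration of the gentle-masking rule, the sketch below assumes the masked-letter score is min(0, S), with S the unmasked score, as given in the published abstract. The lowercase-means-masked convention, the +1/-1 scoring scheme, and the function names are assumptions chosen for illustration; this is not the authors' implementation.

```python
def gentle_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Score one aligned letter pair under gentle masking.

    Assumption: lowercase letters are masked; match/mismatch scores are
    illustrative placeholders for a real scoring matrix.
    """
    unmasked = match if a.upper() == b.upper() else mismatch
    if a.islower() or b.islower():
        # A masked letter can never add to the score, but it can still
        # subtract from it: score = min(0, S).
        return min(0, unmasked)
    return unmasked


def ungapped_score(x: str, y: str) -> int:
    """Sum gentle-masked pair scores over an ungapped alignment."""
    return sum(gentle_score(a, b) for a, b in zip(x, y))


if __name__ == "__main__":
    # Masked matches score 0, so the masked "atat" tract adds nothing.
    print(ungapped_score("ACGTatat", "ACGTatat"))
    # Masked mismatches keep their penalty.
    print(ungapped_score("ACGTatat", "ACGTagag"))
```

Under this rule a masked tract cannot create a high-scoring alignment on its own, yet an alignment that starts in unmasked sequence can still extend through it, paying for any mismatches along the way.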

Highlights

The problem of false homology prediction due to low-complexity sequences is sufficiently severe that it has been addressed since the early days of computational biology.
We recently showed that standard methods such as SegMasker and DustMasker are not perfect: they fail to mask some low-complexity sequences, which produce strong (E-value < 10^-30), non-homologous alignments [1].
We looked for NUMTs in several nuclear genomes, using either harsh masking or gentle masking of low-complexity regions identified by tantan.

Summary

Introduction

The problem of false homology prediction due to low-complexity sequences is sufficiently severe that it has been addressed since the early days of computational biology. Methods to avoid this problem can be classified into three approaches. Hard masking: the first approach is to identify low-complexity regions by some means, and replace each letter in these regions with a dummy letter, typically X for proteins and N for DNA. In the NCBI BLOSUM matrices, X receives a score of -1. This prevents low-complexity regions from getting high alignment scores. We described a new masking method, tantan, which prevents non-homologous alignments much more reliably.
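The hard-masking step described above can be sketched as follows. The function name, the half-open (start, end) interval format, and the example coordinates are hypothetical; in practice the regions would come from a low-complexity detector, which this sketch does not implement.

```python
def hard_mask(seq: str, regions: list[tuple[int, int]], dummy: str = "N") -> str:
    """Replace every letter inside the given regions with a dummy letter.

    Assumptions: regions are 0-based, half-open (start, end) intervals;
    use dummy="N" for DNA and dummy="X" for proteins.
    """
    letters = list(seq)
    for start, end in regions:
        # Clamp to the sequence length so an overlong interval is harmless.
        for i in range(max(start, 0), min(end, len(letters))):
            letters[i] = dummy
    return "".join(letters)


if __name__ == "__main__":
    # Mask the "ATATAT" tract at positions 4..9 of a toy DNA sequence.
    print(hard_mask("ACGTATATATGC", [(4, 10)]))
```

Because the original letters are discarded, an aligner scoring the result can no longer tell a true match inside the region from a mismatch.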

Journal: PLoS ONE	Publication Date: Dec 19, 2011
Citations: 39	License type: CC BY 4.0
