Clustering exact matches of pairwise sequence alignments by weighted linear regression

Alvaro J González, Li Liao

doi:10.1186/1471-2105-9-102

Abstract

Background: At intermediate stages of genome assembly projects, when a number of contigs have been generated and their validity needs to be verified, it is desirable to align these contigs to a reference genome when it is available. The interest is not to analyze a detailed alignment between a contig and the reference genome at the base level, but rather to have a rough estimate of where the contig aligns to the reference genome, specifically, by identifying the starting and ending positions of such a region. This information is very useful in ordering the contigs, facilitating post-assembly analysis such as gap closure and resolving repeats. There exist programs, such as BLAST and MUMmer, that can quickly align and identify high similarity segments between two sequences, which, when seen in a dot plot, tend to agglomerate along a diagonal but can also be disrupted by gaps or shifted away from the main diagonal due to mismatches between the contig and the reference. It is a tedious and practically impossible task to visually inspect the dot plot to identify the regions covered by a large number of contigs from sequence assembly projects. A forced global alignment between a contig and the reference is not only time consuming but often meaningless.

Results: We have developed an algorithm that uses the coordinates of all the exact matches or high similarity local alignments, clusters them with respect to the main diagonal in the dot plot using a weighted linear regression technique, and identifies the starting and ending coordinates of the region of interest.

Conclusion: This algorithm complements existing pairwise sequence alignment packages by replacing the time-consuming seed extension phase with a weighted linear regression for the alignment seeds. It was experimentally shown that the gain in execution time can be outstanding without compromising the accuracy. This method should be of great utility to sequence assembly and genome comparison projects.
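The idea described in the Results paragraph can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the weighting scheme or the clustering rule, so the code below assumes that each exact match is weighted by its length and that off-diagonal matches are discarded one at a time by largest residual. The tuple layout, the function names `weighted_line_fit` and `aligned_region`, and the tolerance `tol` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical layout of one exact match: (ref_start, contig_start, length).
Match = Tuple[int, int, int]


def weighted_line_fit(matches: List[Match]) -> Tuple[float, float]:
    """Weighted least-squares fit of contig_pos = a + b * ref_pos.

    Assumption: the weight of each match is its length.
    Returns (intercept a, slope b).
    """
    w_sum = sum(length for _, _, length in matches)
    x_mean = sum(length * x for x, _, length in matches) / w_sum
    y_mean = sum(length * y for _, y, length in matches) / w_sum
    sxx = sum(length * (x - x_mean) ** 2 for x, _, length in matches)
    sxy = sum(length * (x - x_mean) * (y - y_mean) for x, y, length in matches)
    # With no spread in x the slope is undefined; fall back to the diagonal.
    slope = sxy / sxx if sxx > 0 else 1.0
    return y_mean - slope * x_mean, slope


def aligned_region(matches: List[Match], tol: float = 50.0) -> Tuple[int, int]:
    """Estimate (ref_start, ref_end) of the region covered by the contig.

    Assumption: the match with the largest residual is dropped and the line
    is refitted until every remaining match lies within `tol` of the line.
    """
    kept = list(matches)
    while len(kept) > 2:
        a, b = weighted_line_fit(kept)
        residuals = [abs(y - (a + b * x)) for x, y, _ in kept]
        worst = max(range(len(kept)), key=residuals.__getitem__)
        if residuals[worst] <= tol:
            break
        del kept[worst]
    start = min(x for x, _, _ in kept)
    end = max(x + length for x, _, length in kept)
    return start, end


# Four matches on one diagonal plus a short stray match far from it,
# such as a repeat might produce (made-up coordinates).
example = [(1000, 0, 200), (1250, 250, 300), (1600, 600, 150),
           (1800, 800, 250), (5000, 100, 20)]
print(aligned_region(example))
```

Weighting by match length lets long matches dominate the fit, so a short stray match has little pull on the line and tends to show up as a large residual. Dropping a single match per iteration is a simple but not especially robust choice; a real implementation may cluster differently.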

Highlights

At intermediate stages of genome assembly projects, when a number of contigs have been generated and their validity needs to be verified, it is desirable to align these contigs to a reference genome when it is available.
We propose a simple yet powerful algorithm that takes as input the set of exact matches between a contig and a reference genome, and produces as output the starting and ending coordinates of the most likely global alignment that exists between the two sequences.
In our attempt to assemble a region of the genome of several rice species, sets of contigs of varied cardinality and average length are produced and aligned to an available reference genome, namely the sequence of O. sativa var. japonica.

Summary

Introduction

At intermediate stages of genome assembly projects, when a number of contigs have been generated and their validity needs to be verified, it is desirable to align these contigs to a reference genome when it is available. The interest is not to analyze a detailed alignment between a contig and the reference genome at the base level, but rather to have a rough estimate of where the contig aligns to the reference genome, by identifying the starting and ending positions of such a region. This information is very useful in ordering the contigs, facilitating post-assembly analysis such as gap closure and resolving repeats. There exist programs, such as BLAST and MUMmer, that can quickly align and identify high similarity segments between two sequences, which, when seen in a dot plot, tend to agglomerate along a diagonal but can be disrupted by gaps or shifted away from the main diagonal due to mismatches between the contig and the reference. The stage in the assembly phase is to identify the relative order


Journal: BMC Bioinformatics
Publication Date: Feb 18, 2008
Citations: 12
License type: CC BY 2.0
