A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data.

Nikolaos Pechlivanis, Stefanos Sgardelis, Anastasios Togkousidis, Ilias Kappas, Maria Tsagiopoulou, Fotis Psomopoulos

doi:10.3389/fgene.2021.618170

Abstract

The exponential growth in available genome sequences has spurred research on pattern detection with the aim of extracting evolutionary signal. Traditional approaches, such as multiple sequence alignment, rely on positional homology to reconstruct the phylogenetic history of taxa. Yet mining information from the plethora of biological data and delineating species on a genetic basis remains an extremely difficult problem. Multiple algorithms and techniques have been developed to approach the problem from several directions. Here, we propose a computational framework for identifying potentially meaningful features based on k-mers retrieved from unaligned sequence data. Specifically, we have developed a process that uses unsupervised learning techniques to identify characteristic k-mers of the input dataset across a range of k-values and within a reasonable time frame. We use these k-mers as features for clustering the input sequences and for identifying differences between the distributions of k-mers across the dataset. The developed algorithm is part of an innovative and promising approach both to the problem of grouping sequence data by their inherent characteristic features and to the study of how k-mer distributions change as k varies over a range of values. Our framework is implemented entirely in Python as open-source software under the MIT License, and is freely available at https://github.com/BiodataAnalysisGroup/kmerAnalyzer.
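To make the idea of k-mer features concrete, the following minimal sketch counts overlapping k-mers in unaligned sequences across a range of k-values and converts the counts into relative frequencies. It is an illustration only and not the kmerAnalyzer implementation: the function names `count_kmers` and `kmer_profile`, the choice to skip windows with ambiguous bases, and the toy sequences are all assumptions made for this example.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count overlapping k-mers in one sequence (hypothetical helper)."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: skip windows that contain ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_profile(sequence, k_values):
    """Relative k-mer frequencies of a sequence for each k in k_values."""
    profile = {}
    for k in k_values:
        counts = count_kmers(sequence, k)
        total = sum(counts.values())
        # An empty dict is returned when no valid k-mer exists for this k.
        profile[k] = {kmer: n / total for kmer, n in counts.items()} if total else {}
    return profile


# Toy input (not real data): no alignment is needed before counting.
toy_sequences = {"seq1": "ACGTACGTAC", "seq2": "ACGTNNACGT"}
profiles = {
    name: kmer_profile(seq, k_values=[2, 3])
    for name, seq in toy_sequences.items()
}
```

Frequency dictionaries of this kind could then be arranged into a sequences-by-k-mers matrix and passed to a clustering method of choice; which method the framework actually uses is not specified in this summary.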

Highlights

During the last decade, DNA sequencing technology has been revolutionized, as the advent of Next Generation Sequencing (NGS) (Castro et al., 2020) led to the production of large amounts of biological data.
A total of 12,474 SARS-CoV-2 sequences were retrieved from NCBI, along with a metadata JSON file containing chronological information, demographic information, and information related to the collection source.
To do that, we first investigated smaller k-mer sizes, starting with the range k ∈ [4, 5].
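For context, the number of possible k-mers over the four-letter DNA alphabet is 4^k, so the range k ∈ [4, 5] corresponds to feature spaces of 256 and 1,024 k-mers. The short sketch below enumerates them; the helper name `all_kmers` is an assumption for illustration and does not come from the source.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Enumerate every possible k-mer over the given alphabet (hypothetical helper)."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


for k in (4, 5):
    vocabulary = all_kmers(k)
    # The vocabulary size grows as 4**k for the DNA alphabet.
    print(k, len(vocabulary))
```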

Summary

Introduction

During the last decade, DNA sequencing technology has been revolutionized, as the advent of Next Generation Sequencing (NGS) (Castro et al., 2020) led to the production of large amounts of biological data. [...] in several studies for the comparison and analysis of DNA sequences (Murray et al., 2017; Sievers et al., 2017). Alignment-based methods, such as the well-known Basic Local Alignment Search Tool (BLAST), consider the exact position and quality of similarity of every part of the sequence within the dataset. Most of the time, prior knowledge of the underlying genome sequences is not a requirement. To this end, Murray et al. (2017) have proposed a new method for k-mer-based sequence comparison to estimate genetic relatedness from sequence data. Given the overwhelming quantities of available sequence data, a question that arises is how to identify key features across sequences that would serve as proxies for significant phenotypic differences, thereby aiding the inference of the underlying evolutionary relationships.
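As a rough illustration of how sequences can be compared without alignment, the sketch below computes a Jaccard distance between the k-mer sets of two sequences. This is a generic technique chosen for the example, not the method of Murray et al. (2017) and not the approach of the framework described here; the function names `kmer_set` and `jaccard_distance` are assumptions.

```python
def kmer_set(sequence, k):
    """Set of distinct overlapping k-mers in a sequence (hypothetical helper)."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(seq_a, seq_b, k):
    """1 - |A ∩ B| / |A ∪ B| over the k-mer sets of two sequences."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    if not union:
        # Both sequences are shorter than k; treat them as indistinguishable.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Two toy sequences that are shifted copies of each other share the same
# 4-mer set, so the distance does not depend on positional homology.
shifted = jaccard_distance("ACGTACGT", "CGTACGTA", k=4)
```

Because only k-mer content is compared, the two shifted toy sequences above are treated as identical, which is the property that distinguishes this family of methods from position-based alignment.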

Methods

Results

Conclusion

Journal: Frontiers in genetics	Publication Date: May 28, 2021
License type: CC BY 4.0
