Abstract

Alignment-free (AF) methodologies have grown in popularity in recent decades as alternatives to alignment-based (AB) algorithms for comparative sequence analysis. They have been especially useful for detecting remote homologs within the twilight zone of highly diverse gene/protein families and superfamilies. The most popular AF methodologies, as well as their applications to classification problems, have been described in previous reviews. Although a new set of graph theory-derived sequence/structural descriptors has been gaining relevance in the detection of remote homology, these descriptors have been omitted as AF predictors when the topic is addressed. Here, we first go over the most popular AF approaches used for detecting homology signals within the twilight zone and then present the state-of-the-art tools that encode graph theory-derived sequence/structure descriptors, along with their success in identifying remote homologs. We also highlight the trend toward integrating AF features/measures with AB ones, either within the same prediction model or by combining the predictions of different algorithms through voting/weighting strategies, to improve the detection of remote signals. Lastly, we briefly discuss the efforts made to scale up AB and AF features/measures for the comparison of multiple genomes and proteomes. Alongside the experience gained in remote homology detection with both the most popular AF tools and lesser-known ones, we report our own experience with the graphical–numerical methodologies MARCH-INSIDE, TI2BioP, and ProtDCal. We also present a new Python-based tool (SeqDivA) with a friendly graphical user interface (GUI) for delimiting the twilight zone using several similarity criteria.

Highlights

  • Sequence comparisons between genes recorded in databases and a new query sequence are the foundation of comparative and functional genomics

  • The experience gained over the last century in encoding the structure of organic compounds with Chemical Graph Theory, originally aimed at developing quantitative structure-activity relationship (QSAR)-type models, is increasingly being transferred to the comparative analysis of DNA, RNA, and proteins without alignments

  • Numerous articles reporting new tools that provide graph theory-based sequence descriptors, as well as their applications in genomics and protein science, are published each year

Summary

Introduction

Sequence comparisons between genes recorded in databases and a new query sequence are the foundation of comparative and functional genomics. Although several state-of-the-art reviews have described the most popular AF methods, their corresponding similarity measures, and their successful applications in sequence comparison [16,17,18,19], a relatively new class of AF gene/protein features has been omitted. These features are extensions of the topological indices (TIs) initially defined in chemoinformatics to describe the molecular structure of organic compounds using graph-theoretical approaches. The availability of distributed computing and big data implementations of such methodologies is also covered along the homology detection workflow.

The Twilight Zone for Protein and RNA Alignments
Word Frequency-Based Methods
Information Theory-Based Methods
Brief Background of Graphical–Numerical Approaches
Graphical–Numerical-Based Methods in the Twilight Zone
MARCH-INSIDE Sequence Descriptors
S2SNet’s TIs
ProtDCal’s Descriptors
Findings
Conclusions
