Abstract

Background: Clustering methods are essential for partitioning biological samples and are useful for reducing the information complexity of large datasets. Tools in this context usually generate data with greedy algorithms that address some data mining difficulties but can degrade biologically relevant information during the clustering process. The lack of standardized metrics and consistent databases also raises questions about the clustering efficiency of some methods. Benchmarks are needed to explore the full potential of clustering methods, among which alignment-free methods stand out, and a good choice of dataset is essential to them.

Results: Here we present a new approach to data mining in large protein sequence datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS3G), a clustering method that aims to lose less biological information in the process of generating groups. The strategy developed in our algorithm is optimized to be more stringent, which is reflected in increased accuracy and sensitivity in the generation of clusters across a wide range of similarity. Compared with three main methods, RAFTS3G is the better choice when the user wants more reliable results even without knowing the ideal clustering threshold.

Conclusion: In general, RAFTS3G is able to group up to millions of biological sequences in large datasets, which makes it a remarkably efficient option for clustering. Compared with other "gold standard" methods for clustering large biological data, RAFTS3G maintains the balance between reducing the redundancy of biological information and creating consistent groups. We apply the binary search concept to grouped sequences, which maintains the sensitivity/accuracy relation and minimizes the time needed to generate data with the RAFTS3G process.

Highlights

  • Clustering methods are essential for partitioning biological samples and are useful for reducing the information complexity of large datasets

  • Both the CD-HIT and UCLUST tools require a manual preprocessing step in which the input data must be sorted by sequence length, because both algorithms proceed from the longest to the shortest sequence when choosing the representative sequence of a group and align the remaining sequences against it; the process is not random

  • The metric provided by BCOM [27] is effective for sorting a set of sequences according to their similarity; the identity-based similarity measure, available when alignment is performed, is desirable when the intention is to hold clusters, and it is often selected as the cut-off criterion [28]

Summary

Introduction

Clustering methods are essential for partitioning biological samples and are useful for reducing the information complexity of large datasets. It is worth pointing out that both the CD-HIT and UCLUST tools require a manual preprocessing step in which the data to be run through the algorithms must be sorted by sequence length, because both algorithms proceed from the longest to the shortest sequence when choosing the representative sequence of a group and align the remaining sequences against it; the process is not random. Neither CD-HIT nor UCLUST is a reliable choice for clustering large datasets at similarity values below 30%, the range in which sequences with remote structural homology are searched for [14]. The most efficient techniques for this prediction use as a gold standard the Basic Local Alignment Search Tool (BLAST) 'all-against-all' or, in other cases, adaptations of the Markov Clustering (MCL) method [15]. These tools depend on alignment metrics, which require a lot of processing and time to generate results, mainly on large datasets [16,17,18].
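
The length-ordered, greedy selection of representatives described above can be illustrated with a minimal Python sketch. It is not the implementation of CD-HIT, UCLUST, or RAFTS3G: the function names (`kmers`, `jaccard`, `greedy_cluster`), the k-mer Jaccard score used as a stand-in for sequence identity, and the default threshold are all assumptions made for illustration only.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (toy stand-in for identity)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(
    seqs: List[str], threshold: float = 0.5, k: int = 3
) -> Dict[str, List[str]]:
    """Greedy incremental clustering over length-sorted sequences.

    Returns a mapping from each representative to its cluster members.
    """
    # Sort longest first, so every representative is the longest
    # member of its cluster (the preprocessing step in the text).
    ordered = sorted(seqs, key=len, reverse=True)
    clusters: Dict[str, List[str]] = {}
    profiles: Dict[str, Set[str]] = {}
    for seq in ordered:
        profile = kmers(seq, k)
        # Join the first existing representative that is similar enough.
        for rep in clusters:
            if jaccard(profile, profiles[rep]) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative matched: this sequence founds a new cluster.
            clusters[seq] = [seq]
            profiles[seq] = profile
    return clusters


if __name__ == "__main__":
    toy = ["MKTAYIAKQ", "GGGGSGGGGS", "MKTAYIAKQR", "MKTAY"]
    for rep, members in greedy_cluster(toy).items():
        print(rep, members)
```

Because each sequence joins the first representative that passes the threshold, the result depends on the processing order. This is why the length sort matters, and it is one way a greedy pass can assign a sequence to a cluster that is not its closest match.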
