Markov Clustering Research Articles

Public DNA databases are composed of data from many different taxa, although the taxonomic annotation on sequences is not always complete, which impedes the utilization of mined data for species-level applications. There is much ongoing work on species identification and delineation based on the molecular data itself, although applying species clustering to whole databases requires consolidation of results from numerous undefined gene regions, and introduces significant obstacles in data organization and computational load. In the current paper, we demonstrate an approach for species delineation of a sequence database. All DNA sequences for the insects were obtained and processed. After filtration of duplicated data, delineation of the database into species or molecular operational taxonomic units (MOTUs) followed a three-step process in which (i) the genetic loci L are partitioned, (ii) the species S are delineated within each locus, then (iii) species units are matched across loci to form the matrix L × S, a set of global (multilocus) species units. Partitioning the database into a set of homologous gene fragments was achieved by Markov clustering using edge weights calculated from the amount of overlap between pairs of sequences, then delineation of species units and assignment of species names were performed for the set of genes necessary to capture most of the species diversity. The complexity of computing pairwise similarities for species clustering was substantial at the cytochrome oxidase subunit I locus in particular, but made feasible through the development of software that performs pairwise alignments within the taxonomic framework, while accounting for the different ranks at which sequences are labeled with taxonomic information. 
Over 24 different homologs, the unidentified sequences numbered approximately 194,000, containing 41,525 species IDs (98.7% of all found in the insect database), and were grouped into 59,173 single-locus MOTUs by hierarchical clustering under parameters optimized independently for each locus. Species units from different loci were matched using a multipartite matching algorithm to form multilocus species units with minimal incongruence between loci. After matching, the insect database as represented by these 24 loci was found to be composed of 78,091 species units in total. Of these units, 38,574 contained only species-labeled data, 34,891 contained only unlabeled data, and the remaining 4,626 contained both labeled and unlabeled sequences. In addition to giving estimates of the species diversity of sequence repositories, the protocol developed here will facilitate species-level applications of modern-day sequence data sets. In particular, the L × S matrix represents a post-taxonomic framework that can be used for species-level organization of metagenomic data, and incorporation of these methods into phylogenetic pipelines will yield matrices more representative of species diversity.
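
The locus-partitioning step in the abstract above relies on Markov clustering over edges weighted by sequence overlap. The following is a minimal sketch of the general Markov clustering procedure (alternating expansion and inflation of a column-stochastic matrix), not the authors' software. The `overlap` matrix, the function name `markov_cluster`, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def markov_cluster(weights, inflation=2.0, expansion=2, max_iter=100, tol=1e-6):
    """Cluster a weighted undirected graph with a basic Markov clustering loop."""
    m = np.asarray(weights, dtype=float).copy()
    np.fill_diagonal(m, 1.0)                   # self-loops (assumed weight 1.0)
    m /= m.sum(axis=0, keepdims=True)          # make columns stochastic
    for _ in range(max_iter):
        prev = m
        m = np.linalg.matrix_power(m, expansion)   # expansion: spread flow
        m = np.power(m, inflation)                 # inflation: favor strong flow
        m /= m.sum(axis=0, keepdims=True)
        if np.allclose(m, prev, atol=tol):
            break
    # Each row with surviving flow is an attractor; its nonzero columns form a cluster.
    found = {tuple(int(j) for j in np.flatnonzero(row > 1e-8)) for row in m}
    found.discard(())
    return sorted(list(c) for c in found)

# Hypothetical overlap fractions for six sequences: two tightly overlapping
# groups (0-2 and 3-5) joined by one weak edge between sequences 2 and 3.
overlap = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for a in group:
        for b in group:
            if a != b:
                overlap[a, b] = 0.9
overlap[2, 3] = overlap[3, 2] = 0.05

clusters = markov_cluster(overlap)
print(clusters)
```

A higher `inflation` value generally yields finer clusters, so in practice it would need tuning against the data at hand.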

Background: The metabolic strategies employed by microbes inhabiting natural systems are, in large part, dictated by the physical and geochemical properties of the environment. This study sheds light on the complex relationship between biology and environmental geochemistry using forty-three metagenomes collected from geochemically diverse and globally distributed natural systems. It is widely hypothesized that many uncommonly measured geochemical parameters affect community dynamics, and this study leverages the development and application of multidimensional biogeochemical metrics to study correlations between geochemistry and microbial ecology. Analysis techniques such as a Markov cluster-based measure of the evolutionary distance between whole communities and a principal component analysis (PCA) of the geochemical gradients between environments allow for the determination of correlations between microbial community dynamics and environmental geochemistry and provide insight into which geochemical parameters most strongly influence microbial biodiversity.

Results: By progressively building from samples taken along well-defined geochemical gradients to samples widely dispersed in geochemical space, this study reveals strong links between the extent of taxonomic and functional diversification of resident communities and environmental geochemistry, and reveals temperature and pH as the primary factors that have shaped the evolution of these communities. Moreover, the inclusion of extensive geochemical data into analyses reveals new links between geochemical parameters (e.g. oxygen and trace element availability) and the distribution and taxonomic diversification of communities at the functional level. Further, an overall geochemical gradient (from multivariate analyses) between natural systems provides one of the most complete predictions of microbial taxonomic and functional composition.

Conclusions: Clustering based on the frequency with which orthologous proteins occur among metagenomes facilitated accurate prediction of the ordering of community functional composition along geochemical gradients, despite a lack of geochemical input. The consistency in the results obtained from the application of Markov clustering and multivariate methods to distinct natural systems underscores their utility in predicting the functional potential of microbial communities within a natural system based on system geochemistry alone, allowing geochemical measurements to be used to predict purely biological metrics such as microbial community composition and metabolism.
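
The abstract above mentions a principal component analysis of geochemical gradients between environments. As a rough illustration of that kind of analysis, the sketch below runs a PCA via singular value decomposition on a small site-by-parameter table. The measurements, the column choice (temperature, pH, dissolved oxygen), and the function name `pca` are all invented for this example and do not come from the study.

```python
import numpy as np

def pca(table, n_components=2):
    """PCA on z-scored columns; returns scores, loadings, explained variance ratios."""
    x = np.asarray(table, dtype=float)
    # Standardize each parameter because the units differ (degrees C, pH, mg/L).
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # site coordinates
    explained = s**2 / np.sum(s**2)                   # variance share per component
    return scores, vt[:n_components], explained[:n_components]

# Hypothetical sites (rows) by parameters (columns):
# temperature (degrees C), pH, dissolved oxygen (mg/L).
sites = np.array([
    [85.0, 2.5, 0.5],
    [72.0, 3.1, 1.0],
    [40.0, 6.8, 4.2],
    [15.0, 7.9, 8.5],
    [10.0, 8.1, 9.0],
])

scores, loadings, explained = pca(sites)
print(scores[:, 0])   # position of each site along the first component
print(explained)      # share of variance captured by each retained component
```

Under these assumptions, the first column of `scores` orders the sites along the dominant combined gradient, which is the sense in which a single "overall geochemical gradient" can be read off a multivariate analysis.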

Articles published on Markov Clustering

Clumpak: a program for identifying clustering modes and packaging population structure inferences across K.

Predicting protein complexes from weighted protein-protein interaction graphs with a novel unsupervised methodology: Evolutionary enhanced Markov clustering.

Identification of Protein Complexes Using Weighted PageRank-Nibble Algorithm and Core-Attachment Structure.

Building Ontology from Texts

The Mechanism Research of Qishen Yiqi Formula by Module-Network Analysis.

H-CLAP: hierarchical clustering within a linear array with an application in genetics.

An efficient protein complex mining algorithm based on Multistage Kernel Extension.

Gene families as soft cliques with backbones: Amborella contrasted with other flowering plants.

Drug repurposing based on drug-drug interaction.

A protocol for species delineation of public DNA databases, applied to the Insecta.

Merging metagenomics and geochemistry reveals environmental controls on biological diversity and evolution.

Performance Improvement through Parallelization of Graph Clustering algorithm

Integrating Vague Association Mining with Markov Model

Improving the Robustness of Local Network Alignment: Design and Extensive Assessment of a Markov Clustering-Based Approach.

Aggregation of Similarity Measures for Ortholog Detection: Validation with Measures Based on Rough Set Theory

The genome sequence and effector complement of the flax rust pathogen Melampsora lini.

A fast hierarchical clustering algorithm for large-scale protein sequence data sets

HTTP Traffic Graph Clustering using Markov Clustering Algorithm

Operon prediction by Markov clustering.
