Range Of Real Datasets Research Articles

Background: De novo transcriptome assembly is an important technique for understanding gene expression in non-model organisms. Many de novo assemblers using the de Bruijn graph of a set of the RNA sequences rely on in-memory representation of this graph. However, current methods analyse the complete set of read-derived k-mer sequences at once, resulting in the need for computer hardware with large shared memory.

Results: We introduce a novel approach that clusters k-mers as the first step. The clusters correspond to small sets of gene products, which can be processed quickly to give candidate transcripts. We implement the clustering step using the MapReduce approach for parallelising the analysis of large datasets, which enables the use of compute clusters. The computational task is distributed across the compute system using the industry-standard MPI protocol, and no specialised hardware is required. Using this approach, we have re-implemented the Inchworm module from the widely used Trinity pipeline, and tested the method in the context of the full Trinity pipeline. Validation tests on a range of real datasets show large reductions in the runtime and per-node memory requirements when making use of a compute cluster.

Conclusions: Our study shows that MapReduce-based clustering has great potential for distributing challenging sequencing problems, without loss of accuracy. Although we have focussed on the Trinity package, we propose that such clustering is a useful initial step for other assembly pipelines.
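The idea of clustering k-mers with a map step and a reduce step can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names are hypothetical, the linking rule (two k-mers join the same cluster when they overlap by k-1 bases) is an assumption chosen for illustration, and no MPI distribution is attempted.

```python
from collections import Counter, defaultdict
from itertools import chain


def map_kmers(read, k):
    """Map step: emit every k-mer of one read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def reduce_counts(kmer_lists):
    """Reduce step: sum the occurrences of each k-mer across all reads."""
    return Counter(chain.from_iterable(kmer_lists))


def cluster_kmers(kmers):
    """Group k-mers linked by a (k-1)-base suffix/prefix overlap.

    Assumption: this overlap rule stands in for the clustering criterion;
    the abstract does not specify the rule actually used.
    """
    parent = {kmer: kmer for kmer in kmers}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Index k-mers by their first k-1 bases so overlaps are found quickly.
    by_prefix = defaultdict(list)
    for kmer in kmers:
        by_prefix[kmer[:-1]].append(kmer)

    for kmer in kmers:
        for neighbour in by_prefix.get(kmer[1:], []):
            parent[find(kmer)] = find(neighbour)

    clusters = defaultdict(set)
    for kmer in kmers:
        clusters[find(kmer)].add(kmer)
    return list(clusters.values())


reads = ["ACGTAC", "GTACGG", "TTTTTT"]  # toy reads, not real data
counts = reduce_counts(map_kmers(read, 4) for read in reads)
clusters = cluster_kmers(sorted(counts))
```

In this toy input the first two reads share the k-mer GTAC and end up in one cluster, while the poly-T read forms its own. Each cluster could then be handed to a separate worker, which is the property that makes per-node memory small.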


Genomics studies frequently involve clustering of molecular data to identify groups, but common clustering methods such as K-means clustering and hierarchical clustering do not determine the number of clusters. Methods for estimating the number of clusters typically focus on identifying the global structure in the data; however, the discovery of substructures within clusters may also be of great biological interest. We propose a novel method, Partitioning Algorithm based on Recursive Thresholding (PART), that recursively uncovers distinct subgroups in the groups already identified. Outliers are common in high-dimensional genomics data and may mask the presence of substructure within a cluster. A crucial feature of the algorithm is the introduction of tentative splits of clusters to isolate outliers that might otherwise halt the recursion prematurely. The method is demonstrated on simulated data as well as on a wide range of real data sets from gene expression microarrays, where the correct clusters were known in advance. When subclusters are present and the variance is large or varies between the clusters, the proposed method performs better than two established global methods on simulated data. On the real data sets, the overall performance of PART is superior to that of the global methods when used in combination with hierarchical clustering. The method is implemented in the R package clusterGenomics and is freely available from CRAN (The Comprehensive R Archive Network).
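The general idea of recursively looking for subgroups inside groups already found can be sketched as follows. This is not PART and not the clusterGenomics API: it works on one-dimensional toy values, uses a plain 2-means split, and replaces the paper's stopping criterion with an assumed threshold on the fraction of variance removed by a split (`min_gain`, a hypothetical parameter). The tentative splits that PART uses to isolate outliers are not reproduced.

```python
def two_means_1d(values, iterations=20):
    """Split 1-D values into two groups with a basic 2-means loop."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not left or not right:
            break
        lo, hi = sum(left) / len(left), sum(right) / len(right)
    return left, right


def sum_sq(values):
    """Within-group sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def recursive_partition(values, min_size=2, min_gain=0.9):
    """Keep splitting a group while a split removes enough variance.

    Assumption: the variance-reduction threshold is a stand-in for a
    proper test of whether a cluster contains substructure.
    """
    if len(values) < 2 * min_size or sum_sq(values) == 0:
        return [values]
    left, right = two_means_1d(values)
    if len(left) < min_size or len(right) < min_size:
        return [values]
    gain = 1 - (sum_sq(left) + sum_sq(right)) / sum_sq(values)
    if gain < min_gain:
        return [values]
    return (recursive_partition(left, min_size, min_gain)
            + recursive_partition(right, min_size, min_gain))


# Toy data: one distant group and one group hiding two subgroups.
data = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9, 20.0, 20.1, 19.9]
groups = recursive_partition(data)
```

On this toy input the first split separates the values near 20 from the rest, and the recursion then finds the two subgroups near 1 and near 3, which a single global split would have left merged.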


Articles published on Range Of Real Datasets

High-dimensional asymptotics of denoising autoencoders*

Gaussian universality of perceptrons with random labels.

K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity

Sparse graphs using exchangeable random measures.

PFIN: A Parallel Frequent Itemset Mining Algorithm Using Nodesets

Supervised box clustering

Intelligent Dynamic Aging Approaches in Web Proxy Cache Replacement

Pruning boxes in a box-based classification method

Identifying clusters in genomics data by recursive partitioning

Stereo Matching with Mumford-Shah Regularization and Occlusion Handling

A critique of the use of jackknife and related non-parametric techniques to estimate species richness
