Analysis Of High-dimensional Genomic Data Research Articles

Background: The volume of genomics data has grown rapidly over the last decade, and conventional data analysis techniques cannot process data at this scale. Efficient processing of high-dimensional datasets therefore requires new parallel methods.

Methods: This work presents a novel distributed method based on the Map-Reduce (MR) approach. The proposed algorithm consists of an MR-based Fisher score (mrFScore), an MR-based ReliefF (mrReliefF), and an MR-based probabilistic neural network (mrPNN) tuned with a weighted chaotic grey wolf optimization technique (WCGWO). The mrFScore and mrReliefF methods perform feature selection (FS), and mrPNN performs microarray classification. The choice of the smoothing parameter (σ) plays a major role in the prediction ability of the PNN, so the WCGWO algorithm is used to select the optimal value of σ.

Results: These algorithms have been implemented using the Hadoop framework. The proposed model is tested on three large and one small microarray datasets, and a comparative analysis is carried out against existing FS and classification techniques. The results suggest that WCGWO-mrPNN can outperform other methods for high-dimensional microarray classification.

Conclusion: The effectiveness of the proposed methods is compared with that of existing schemes. Experimental results indicate that the proposed scheme is accurate and robust, suggesting that it is a reliable framework for microarray data analysis.

Significance: Such a method promotes the application of parallel programming on a Hadoop cluster for the analysis of large-scale genomics data, particularly when the dataset is high-dimensional.
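The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names: a Fisher score computed in a map/reduce style, and a PNN whose prediction depends on the smoothing parameter σ. It runs in plain Python rather than on Hadoop, the function names (map_chunk, reduce_partials, fisher_scores, pnn_predict) and the toy data are hypothetical, and σ is fixed by hand here rather than tuned by WCGWO.

```python
import math
from collections import defaultdict


def map_chunk(chunk):
    """Map step: per (feature, class) key, emit count, sum and sum of squares."""
    partial = defaultdict(lambda: [0, 0.0, 0.0])
    for features, label in chunk:
        for j, x in enumerate(features):
            acc = partial[(j, label)]
            acc[0] += 1
            acc[1] += x
            acc[2] += x * x
    return partial


def reduce_partials(partials):
    """Reduce step: add the partial statistics that share a key."""
    total = defaultdict(lambda: [0, 0.0, 0.0])
    for partial in partials:
        for key, (n, s, ss) in partial.items():
            acc = total[key]
            acc[0] += n
            acc[1] += s
            acc[2] += ss
    return total


def fisher_scores(total):
    """Fisher score per feature: between-class scatter / within-class scatter."""
    by_feature = defaultdict(list)
    for (j, _label), stats in total.items():
        by_feature[j].append(stats)
    scores = {}
    for j, class_stats in by_feature.items():
        n_all = sum(n for n, _, _ in class_stats)
        mean_all = sum(s for _, s, _ in class_stats) / n_all
        between = within = 0.0
        for n, s, ss in class_stats:
            mean_c = s / n
            var_c = max(0.0, ss / n - mean_c ** 2)  # clamp rounding noise
            between += n * (mean_c - mean_all) ** 2
            within += n * var_c
        scores[j] = between / within if within > 0 else float("inf")
    return scores


def pnn_predict(train, x, sigma):
    """PNN with equal priors: pick the class with the largest mean Gaussian kernel."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for features, label in train:
        dist2 = sum((a - b) ** 2 for a, b in zip(features, x))
        sums[label] += math.exp(-dist2 / (2.0 * sigma ** 2))
        counts[label] += 1
    return max(sums, key=lambda c: sums[c] / counts[c])


# Toy data (hypothetical): two chunks, as if stored on two worker nodes.
chunks = [
    [([1.0, 5.0], 0), ([2.0, 1.0], 0)],
    [([8.0, 4.0], 1), ([9.0, 2.0], 1)],
]
scores = fisher_scores(reduce_partials(map_chunk(c) for c in chunks))
train = [sample for chunk in chunks for sample in chunk]
label = pnn_predict(train, [1.5, 3.0], sigma=1.0)
```

Because the map step emits only counts, sums, and sums of squares, the reduce step can merge chunks in any order, which is what makes the score suitable for distributed computation. In this toy data the first feature separates the classes and the second does not, so the first receives the higher score.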

Driver gene selection is crucial for understanding the heterogeneous system of cancer. Various statistical strategies have been proposed to identify cancer driver genes, and L1-type regularization methods in particular have drawn considerable attention. However, these approaches have been developed purely from an algorithmic and statistical point of view, and existing studies have applied them to genomic data analysis without considering biological knowledge. We consider a statistical strategy that incorporates biological knowledge to identify cancer driver genes. Copy number alterations are considered to drive cancer pathogenesis, and regions of strong interaction between copy number alterations and expression levels are known as a tumor-related symptom. We incorporate the influence of copy number alterations on expression levels into the cancer driver gene-selection process. To quantify the dependence of copy number alterations on expression levels, we consider [Formula: see text] and [Formula: see text] effects of copy number alterations on expression levels of genes, and incorporate this symptom of tumor pathogenesis into the gene-selection procedure. We then propose an interaction-based feature-selection strategy based on adaptive L1-type regularization and random lasso procedures. The proposed method imposes a large penalty on genes for which the dependency between the two features is low, so the coefficients of those genes are estimated to be small or exactly 0. This implies that the proposed method can provide biologically relevant results in cancer driver gene selection. Monte Carlo simulations and analysis of The Cancer Genome Atlas (TCGA) data show that the proposed strategy is effective for high-dimensional genomic data analysis and provides reliable and biologically relevant results for cancer driver gene selection.
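The penalty mechanism described above can be illustrated with a small weighted lasso solved by coordinate descent. This is a sketch under stated assumptions, not the authors' procedure: the weight form 1 / (|d| + ε), the dependence values, and the names soft_threshold and weighted_lasso are hypothetical, and the random lasso resampling step is omitted.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return exactly 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(X, y, weights, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            if z == 0.0:
                beta[j] = 0.0  # an all-zero column carries no information
                continue
            beta[j] = soft_threshold(rho / n, lam * weights[j]) / (z / n)
    return beta


# Hypothetical toy data: both features contribute equally to y.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0, 0.0, 0.0, -2.0]

# Assumed dependence between copy number and expression for each gene;
# low dependence gives a large penalty weight (assumed form, not from the source).
dependence = [0.9, 0.1]
weights = [1.0 / (abs(d) + 1e-8) for d in dependence]

beta = weighted_lasso(X, y, weights, lam=0.2)
```

Although both features explain y equally well in this toy data, the second one carries a much larger penalty weight, so its coefficient is shrunk to exactly 0 while the first is only mildly shrunk. That is the behavior the abstract describes for genes with low dependency.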

Articles published on Analysis Of High-dimensional Genomic Data

Selection probability of multivariate regularization to identify pleiotropic variants in genetic association studies

Analysis of high-dimensional genomic data using MapReduce based probabilistic neural network

An empirical threshold of selection probability for analysis of high-dimensional correlated data

HDMAC: A Web-Based Interactive Program for High-Dimensional Analysis of Molecular Alterations in Cancer

Model-free feature screening for categorical outcomes: Nonlinear effect detection and false discovery rate control.

Analysis of high-dimensional genomic data employing a novel bio-inspired algorithm

New variable selection strategy for analysis of high-dimensional DNA methylation data.

Computation and application of tissue-specific gene set weights.

DaMiRseq - an R/Bioconductor package for data mining of RNA-Seq data: normalization, feature selection and classification.

A network-based regularization method for the analysis of high-dimensional genomic data with group structure

Interaction-Based Feature Selection for Uncovering Cancer Driver Genes Through Copy Number-Driven Expression Level.

High-dimensional genomic data bias correction and data integration using MANCIE.

Multiple Group Testing Procedures for Analysis of High-Dimensional Genomic Data.

Recursive Random Lasso (RRLasso) for Identifying Anti-Cancer Drug Targets.

An Independent Filter for Gene Set Testing Based on Spectral Enrichment.

AB0019 A PLS multivariate model to predict RA radiological severity by selecting key predictors from a large panel of SNPS and environmental factors

High-Dimensional Regression and Variable Selection Using CAR Scores

Super-sparse principal component analyses for high-throughput genomic data

Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data

Partial least squares: a versatile tool for the analysis of high-dimensional genomic data
