Unraveling Diabetes Mechanisms: A Computational scRNA-Seq Approach to Inflammation and Diabetes.

Abstract

Diabetes, a widespread metabolic disorder, is characterized by chronic hyperglycemia and is often linked to systemic inflammation and insulin resistance. As the global incidence of diabetes continues to rise, understanding the molecular mechanisms behind the disease is crucial for developing effective treatments. In this study, single-cell RNA sequencing (scRNA-seq) data from the dataset GSE161872 were analyzed to explore the relationship between nutrition, inflammation, and diabetes. By employing computational methods, this research aims to identify key gene expression profiles and molecular pathways associated with inflammation and its impact on metabolic regulation. The analysis can reveal how different dietary conditions influence immune responses and contribute to the development of diabetes. These findings may provide new insights into the molecular links between diet, inflammation, and metabolic disorders, offering a foundation for therapeutic interventions and personalized dietary strategies.

Similar Papers
  • Research Article
  • Citations: 2262
  • 10.1038/nbt.4091
Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors.
  • Apr 2, 2018
  • Nature Biotechnology
  • Laleh Haghverdi + 3 more

Large-scale single-cell RNA sequencing (scRNA-seq) data sets that are produced in different laboratories and at different times contain batch effects that may compromise the integration and interpretation of the data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known or identical across batches. We present a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space. Our approach does not rely on predefined or equal population compositions across batches; instead, it requires only that a subset of the population be shared between batches. We demonstrate the superiority of our approach compared with existing methods by using both simulated and real scRNA-seq data sets. Using multiple droplet-based scRNA-seq data sets, we demonstrate that our MNN batch-effect-correction method can be scaled to large numbers of cells.
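The mutual-nearest-neighbor idea in the abstract above can be illustrated with a minimal sketch. It covers only the pair-detection step, not the batch correction itself. The function names, the toy two-dimensional "expression" vectors, and the choice of Euclidean distance are assumptions made for illustration and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, reference, k):
    """Indices of the k cells in `reference` that are closest to `query`."""
    order = sorted(range(len(reference)),
                   key=lambda j: euclidean(query, reference[j]))
    return set(order[:k])


def mutual_nearest_neighbors(batch_a, batch_b, k=1):
    """Return sorted (i, j) pairs where cell i of batch_a and cell j of
    batch_b each appear in the other's k-nearest-neighbor set."""
    nn_a_to_b = [k_nearest(cell, batch_b, k) for cell in batch_a]
    nn_b_to_a = [k_nearest(cell, batch_a, k) for cell in batch_b]
    return sorted((i, j)
                  for i, neighbors in enumerate(nn_a_to_b)
                  for j in neighbors
                  if i in nn_b_to_a[j])


# Hypothetical toy data: batch_a holds a third cell with no counterpart
# in batch_b, mimicking a population that is not shared between batches.
batch_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
batch_b = [(0.5, 0.2), (5.3, 5.1)]
pairs = mutual_nearest_neighbors(batch_a, batch_b, k=1)
print(pairs)
```

In this toy case the third cell of `batch_a` finds a nearest neighbor in `batch_b`, but that neighbor prefers a different cell, so no pair is formed for it. This mirrors the stated requirement that only a subset of the population needs to be shared between batches.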

  • Research Article
  • 10.1093/bib/bbaf603
CluVar: clustering of variants using autoencoder for inferring cancer subclones from single cell RNA sequencing data
  • Nov 1, 2025
  • Briefings in Bioinformatics
  • Chae Won Kim + 5 more

Tumor tissues are composed of malignant subclones with diverse genetic profiles. Reconstructing the evolutionary trajectory of these subclones is crucial for understanding how tumors acquire malignant traits. However, current approaches to subclonal tree reconstruction are limited: they either rely on single-cell DNA sequencing (scDNA-seq), which involves a small number of cells and thus yields low-resolution results, or use single-cell RNA sequencing (scRNA-seq) data, which, despite including larger cell populations, remain susceptible to bias from high dropout rates and technical noise. Here, we introduce CluVar, an autoencoder-based framework for inferring the phylogeny of cancer subclones from scRNA-seq data using mutation profile analysis. To address the extensive missing variant information inherent in scRNA-seq datasets, CluVar incorporates a customized loss function and multiple hidden layers optimized for clustering. CluVar demonstrated superior performance in reconstructing phylogenetic trees of cancer subclones under a range of erroneous conditions. When applied to cancer scRNA-seq data, the phylogenetic tree predicted using CluVar aligned well with the transcriptomic profiles. These findings highlight its utility for tracing evolutionary trajectories and identifying novel variants associated with cancer progression.

  • Conference Article
  • Citations: 1
  • 10.1109/cac53003.2021.9728447
Incomplete Multi-view Clustering for Single cell RNA Sequencing Data
  • Oct 22, 2021
  • Tijian Zhu + 2 more

With the rapid development of sequencing technology, researchers can obtain large amounts of single-cell RNA sequencing (scRNA-seq) data, which are useful for analyzing cell fate decisions and growth processes at individual-cell resolution. However, due to the limitations of sequencing technology, the acquired data contain dropouts, which may affect the results of downstream analysis. Many algorithms have therefore been proposed to impute the data before clustering; in these, imputation and clustering are treated as two separate processing stages. In this paper, we adopt a clustering algorithm, Incomplete Multiple Kernel k-means Clustering with Mutual Kernel Completion (MKKM-IK-MKC), to analyze scRNA-seq data. It unifies imputation and clustering into a single process. Compared with some existing "two-stage" (imputation + clustering) algorithms, the experimental results on five scRNA-seq datasets from various species demonstrate the effective performance of our newly proposed method.

  • Research Article
  • Citations: 1
  • 10.5256/f1000research.20232.r45814
Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data
  • Mar 20, 2019
  • F1000Research
  • Saskia Freytag

Background: Identification of cell type subpopulations from complex cell mixtures using single-cell RNA-sequencing (scRNA-seq) data includes automated steps from normalization to cell clustering. However, assigning cell type labels to cell clusters is often conducted manually, resulting in limited documentation, low reproducibility and uncontrolled vocabularies. This is partially due to the scarcity of reference cell type signatures and because some methods support limited cell type signatures. Methods: In this study, we benchmarked five methods representing first-generation enrichment analysis (ORA), second-generation approaches (GSEA and GSVA), machine learning tools (CIBERSORT) and network-based neighbor voting (METANEIGHBOR), for the task of assigning cell type labels to cell clusters from scRNA-seq data. We used five scRNA-seq datasets: human liver, 11 Tabula Muris mouse tissues, two human peripheral blood mononuclear cell datasets, and mouse retinal neurons, for which reference cell type signatures were available. The datasets span Drop-seq, 10X Chromium and Seq-Well technologies and range in size from ~3,700 to ~68,000 cells. Results: Our results show that, in general, all five methods perform well in the task as evaluated by receiver operating characteristic curve analysis (average area under the curve (AUC) = 0.91, sd = 0.06), whereas precision-recall analyses show a wide variation depending on the method and dataset (average AUC = 0.53, sd = 0.24). We observed an influence of the number of genes in cell type signatures on performance, with smaller signatures leading more frequently to incorrect results. Conclusions: GSVA was the overall top performer and was more robust in cell type signature subsampling simulations, although different methods performed well using different datasets. METANEIGHBOR and GSVA were the fastest methods. CIBERSORT and METANEIGHBOR were more influenced than the other methods by analyses including only expected cell types. 
We provide an extensible framework that can be used to evaluate other methods and datasets at https://github.com/jdime/scRNAseq_cell_cluster_labeling.

  • Research Article
  • Citations: 49
  • 10.12688/f1000research.18490.1
Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data.
  • Mar 15, 2019
  • F1000Research
  • J Javier Diaz-Mejia + 7 more

Background: Identification of cell type subpopulations from complex cell mixtures using single-cell RNA-sequencing (scRNA-seq) data includes automated computational steps like data normalization, dimensionality reduction and cell clustering. However, assigning cell type labels to cell clusters is still conducted manually by most researchers, resulting in limited documentation, low reproducibility and uncontrolled vocabularies. Two bottlenecks to automating this task are the scarcity of reference cell type gene expression signatures and the fact that some dedicated methods are available only as web servers with limited cell type gene expression signatures. Methods: In this study, we benchmarked four methods (CIBERSORT, GSEA, GSVA, and ORA) for the task of assigning cell type labels to cell clusters from scRNA-seq data. We used scRNA-seq datasets from liver, peripheral blood mononuclear cells and retinal neurons for which reference cell type gene expression signatures were available. Results: Our results show that, in general, all four methods show a high performance in the task as evaluated by receiver operating characteristic curve analysis (average area under the curve (AUC) = 0.94, sd = 0.036), whereas precision-recall curve analyses show a wide variation depending on the method and dataset (average AUC = 0.53, sd = 0.24). Conclusions: CIBERSORT and GSVA were the top two performers. Additionally, GSVA was the fastest of the four methods and was more robust in cell type gene expression signature subsampling simulations. We provide an extensible framework to evaluate other methods and datasets at https://github.com/jdime/scRNAseq_cell_cluster_labeling.

  • Research Article
  • Citations: 3
  • 10.12688/f1000research.18490.2
Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data
  • Aug 27, 2019
  • F1000Research
  • J Javier Diaz-Mejia + 7 more

Background: Identification of cell type subpopulations from complex cell mixtures using single-cell RNA-sequencing (scRNA-seq) data includes automated steps from normalization to cell clustering. However, assigning cell type labels to cell clusters is often conducted manually, resulting in limited documentation, low reproducibility and uncontrolled vocabularies. This is partially due to the scarcity of reference cell type signatures and because some methods support limited cell type signatures. Methods: In this study, we benchmarked five methods representing first-generation enrichment analysis (ORA), second-generation approaches (GSEA and GSVA), machine learning tools (CIBERSORT) and network-based neighbor voting (METANEIGHBOR), for the task of assigning cell type labels to cell clusters from scRNA-seq data. We used five scRNA-seq datasets: human liver, 11 Tabula Muris mouse tissues, two human peripheral blood mononuclear cell datasets, and mouse retinal neurons, for which reference cell type signatures were available. The datasets span Drop-seq, 10X Chromium and Seq-Well technologies and range in size from ~3,700 to ~68,000 cells. Results: Our results show that, in general, all five methods perform well in the task as evaluated by receiver operating characteristic curve analysis (average area under the curve (AUC) = 0.91, sd = 0.06), whereas precision-recall analyses show a wide variation depending on the method and dataset (average AUC = 0.53, sd = 0.24). We observed an influence of the number of genes in cell type signatures on performance, with smaller signatures leading more frequently to incorrect results. Conclusions: GSVA was the overall top performer and was more robust in cell type signature subsampling simulations, although different methods performed well using different datasets. METANEIGHBOR and GSVA were the fastest methods. CIBERSORT and METANEIGHBOR were more influenced than the other methods by analyses including only expected cell types. 
We provide an extensible framework that can be used to evaluate other methods and datasets at https://github.com/jdime/scRNAseq_cell_cluster_labeling.

  • Research Article
  • Citations: 45
  • 10.12688/f1000research.18490.3
Evaluation of methods to assign cell type labels to cell clusters from single-cell RNA-sequencing data.
  • Oct 14, 2019
  • F1000Research
  • J Javier Diaz-Mejia + 7 more

Background: Identification of cell type subpopulations from complex cell mixtures using single-cell RNA-sequencing (scRNA-seq) data includes automated steps from normalization to cell clustering. However, assigning cell type labels to cell clusters is often conducted manually, resulting in limited documentation, low reproducibility and uncontrolled vocabularies. This is partially due to the scarcity of reference cell type signatures and because some methods support limited cell type signatures. Methods: In this study, we benchmarked five methods representing first-generation enrichment analysis (ORA), second-generation approaches (GSEA and GSVA), machine learning tools (CIBERSORT) and network-based neighbor voting (METANEIGHBOR), for the task of assigning cell type labels to cell clusters from scRNA-seq data. We used five scRNA-seq datasets: human liver, 11 Tabula Muris mouse tissues, two human peripheral blood mononuclear cell datasets, and mouse retinal neurons, for which reference cell type signatures were available. The datasets span Drop-seq, 10X Chromium and Seq-Well technologies and range in size from ~3,700 to ~68,000 cells. Results: Our results show that, in general, all five methods perform well in the task as evaluated by receiver operating characteristic curve analysis (average area under the curve (AUC) = 0.91, sd = 0.06), whereas precision-recall analyses show a wide variation depending on the method and dataset (average AUC = 0.53, sd = 0.24). We observed an influence of the number of genes in cell type signatures on performance, with smaller signatures leading more frequently to incorrect results. Conclusions: GSVA was the overall top performer and was more robust in cell type signature subsampling simulations, although different methods performed well using different datasets. METANEIGHBOR and GSVA were the fastest methods. CIBERSORT and METANEIGHBOR were more influenced than the other methods by analyses including only expected cell types. 
We provide an extensible framework that can be used to evaluate other methods and datasets at https://github.com/jdime/scRNAseq_cell_cluster_labeling.

  • Conference Article
  • 10.1109/csci54926.2021.00129
Missing Value Recovery for Single Cell RNA Sequencing Data
  • Dec 1, 2021
  • Wenjuan Zhang + 4 more

The emergence of single-cell sequencing technologies has enabled the production of high-resolution data at the individual cell level, providing unprecedented opportunities to capture cell population diversity and dissect the cellular heterogeneity of complex diseases. At the same time, relatively high biological and technical noise poses new challenges for single-cell data analysis. Single-cell RNA sequencing (scRNA-seq) data often contain substantial missing values due to gene dropout events. Here, we developed a convolutional neural network based model to recover missing values for scRNA-seq data. We first calculated the probability of dropout using a gamma-normal expectation-maximization algorithm. Unlike most existing approaches, our model recovered only the expression values that have a dropout probability larger than a threshold. The mean square error and Pearson correlation coefficient were used to assess the accuracy of predicted expression values. Purity and entropy were computed to measure the homogeneity of cell clusters using imputed gene expression profiles. Across various scRNA-seq datasets, our model demonstrated robust performance and achieved comparable or better results compared to the other imputation methods.
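The selective-recovery rule and the two accuracy metrics named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' model: the predicted values and dropout probabilities are hypothetical inputs supplied by hand (the abstract derives them from a neural network and a mixture model), and the threshold of 0.5 is an assumed value.

```python
import math


def selective_impute(observed, predicted, dropout_prob, threshold=0.5):
    """Replace an observed value with its prediction only when the
    estimated dropout probability exceeds `threshold` (assumed value)."""
    return [p if q > threshold else o
            for o, p, q in zip(observed, predicted, dropout_prob)]


def mean_squared_error(truth, estimate):
    """Average squared difference between true and estimated values."""
    return sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical expression values for one cell across four genes.
truth = [1.0, 3.0, 0.0, 2.0]          # values before simulated dropout
observed = [0.0, 3.0, 0.0, 2.0]       # gene 0 was lost to dropout
predicted = [1.5, 2.5, 0.4, 2.2]      # stand-in model output
dropout_prob = [0.9, 0.1, 0.2, 0.05]  # stand-in dropout probabilities

imputed = selective_impute(observed, predicted, dropout_prob)
mse_before = mean_squared_error(truth, observed)
mse_after = mean_squared_error(truth, imputed)
corr_after = pearson(truth, imputed)
print(imputed, mse_before, mse_after, corr_after)
```

Only the first gene is overwritten, because it is the only one whose dropout probability exceeds the threshold. The true zero at the third gene is left untouched, which is the point of imputing selectively rather than everywhere.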

  • Research Article
  • Citations: 4
  • 10.3389/fgene.2022.788832
A Regularized Multi-Task Learning Approach for Cell Type Detection in Single-Cell RNA Sequencing Data
  • Apr 13, 2022
  • Frontiers in Genetics
  • Piu Upadhyay + 1 more

Cell type prediction is one of the most challenging goals in single-cell RNA sequencing (scRNA-seq) data. Existing methods use unsupervised learning to identify signature genes in each cluster, followed by a literature survey to look up those genes for assigning cell types. However, finding potential marker genes in each cluster is cumbersome, which impedes the systematic analysis of single-cell RNA sequencing data. To address this challenge, we proposed a framework based on regularized multi-task learning (RMTL) that enables us to simultaneously learn the subpopulation associated with a particular cell type. Learning the structure of subpopulations is treated as a separate task in the multi-task learner. Regularization is used to modulate the multi-task model (e.g., W1, W2, … Wt) jointly, according to the specific prior. For validating our model, we trained it with reference data constructed from a single-cell RNA sequencing experiment and applied it to a query dataset. We also predicted completely independent data (the query dataset) from the reference data which are used for training. We have checked the efficacy of the proposed method by comparing it with other state-of-the-art techniques well known for cell type detection. Results revealed that the proposed method performed accurately in detecting the cell type in scRNA-seq data and thus can be utilized as a useful tool in the scRNA-seq pipeline.

  • Research Article
  • Citations: 3451
  • 10.1016/j.cels.2019.03.003
DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors.
  • Apr 1, 2019
  • Cell Systems
  • Christopher S McGinnis + 2 more


  • Research Article
  • 10.1089/cmb.2023.0077
ARGLRR: A Sparse Low-Rank Representation Single-Cell RNA-Sequencing Data Clustering Method Combined with a New Graph Regularization.
  • Aug 1, 2023
  • Journal of Computational Biology
  • Zhen-Chang Wang + 5 more

The development of single-cell transcriptome sequencing technologies has opened new ways to study biological phenomena at the cellular level. A key application of such technologies involves the employment of single-cell RNA sequencing (scRNA-seq) data to identify distinct cell types through clustering, which in turn provides evidence for revealing heterogeneity. Despite the promise of this approach, the inherent characteristics of scRNA-seq data, such as higher noise levels and lower coverage, pose major challenges to existing clustering methods and compromise their accuracy. In this study, we propose a method called Adjusted Random walk Graph regularization Sparse Low-Rank Representation (ARGLRR), a practical sparse subspace clustering method, to identify cell types. The fundamental low-rank representation (LRR) model is concerned with the global structure of data. To address the limited ability of the LRR method to capture local structure, we introduced adjusted random walk graph regularization in its framework. ARGLRR allows for the capture of both local and global structures in scRNA-seq data. Additionally, the imposition of similarity constraints into the LRR framework further improves the ability of the proposed model to estimate cell-to-cell similarity and capture global structural relationships between cells. ARGLRR surpasses other advanced comparison approaches on nine known scRNA-seq data sets. In clustering experiments on these data sets, ARGLRR outperforms the best-performing comparative method by 6.99% and 5.85% in normalized mutual information and Adjusted Rand Index, respectively. In addition, we visualize the results using Uniform Manifold Approximation and Projection. Visualization results show that the use of ARGLRR enhances the separation of different cell types within the similarity matrix.
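The Adjusted Rand Index mentioned in the abstract above compares a predicted clustering with reference cell-type labels while correcting for chance agreement. The following is a minimal sketch of the standard pair-counting formula; the function name and the toy labels are illustrative assumptions, and nothing here reproduces the ARGLRR method itself.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two
    labelings; 1.0 means identical partitions, values near 0 mean
    agreement no better than chance."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: partitions trivially agree

    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use one cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical cell-type labels versus two candidate clusterings.
reference = ["T", "T", "B", "B"]
same_up_to_renaming = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]

ari_same = adjusted_rand_index(reference, same_up_to_renaming)
ari_crossed = adjusted_rand_index(reference, crossed)
print(ari_same, ari_crossed)
```

The index depends only on which cells are grouped together, so renaming clusters does not change the score. A clustering that systematically splits every true cell type scores below zero.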

  • Research Article
  • Citations: 105
  • 10.1186/s13059-019-1863-4
Systematic comparative analysis of single-nucleotide variant detection methods from single-cell RNA sequencing data
  • Nov 19, 2019
  • Genome Biology
  • Fenglin Liu + 6 more

Background: Systematic interrogation of single-nucleotide variants (SNVs) is one of the most promising approaches to delineate the cellular heterogeneity and phylogenetic relationships at the single-cell level. While SNV detection from abundant single-cell RNA sequencing (scRNA-seq) data is applicable and cost-effective in identifying expressed variants, inferring sub-clones, and deciphering genotype-phenotype linkages, there is a lack of computational methods specifically developed for SNV calling in scRNA-seq. Although variant callers for bulk RNA-seq have been sporadically used in scRNA-seq, the performances of different tools have not been assessed. Results: Here, we perform a systematic comparison of seven tools including SAMtools, the GATK pipeline, CTAT, FreeBayes, MuTect2, Strelka2, and VarScan2, using both simulation and scRNA-seq datasets, and identify multiple elements influencing their performance. While the specificities are generally high, with sensitivities exceeding 90% for most tools when calling homozygous SNVs in high-confident coding regions with sufficient read depths, such sensitivities dramatically decrease when calling SNVs with low read depths, low variant allele frequencies, or in specific genomic contexts. SAMtools shows the highest sensitivity in most cases especially with low supporting reads, despite the relatively low specificity in introns or high-identity regions. Strelka2 shows consistently good performance when sufficient supporting reads are provided, while FreeBayes shows good performance in the cases of high variant allele frequencies. Conclusions: We recommend SAMtools, Strelka2, FreeBayes, or CTAT, depending on the specific conditions of usage. Our study provides the first benchmarking to evaluate the performances of different SNV detection tools for scRNA-seq data.

  • Research Article
  • Citations: 5
  • 10.3390/biology11101495
A Novel Algorithm for Feature Selection Using Penalized Regression with Applications to Single-Cell RNA Sequencing Data
  • Oct 12, 2022
  • Biology
  • Bhavithry Sen Puliparambil + 2 more

Simple Summary: Single-cell RNA sequencing generates gene expression data at single-cell resolution. While single-cell RNA sequencing has many applications in biomedical research, the high dimensionality of the data produced poses a considerable computational challenge. This study proposes a novel algorithm using penalized regression methods to analyze single-cell RNA sequencing data. The proposed algorithm reduces the high dimensionality of the gene expression data using a sequence of feature selection methods such as Ridge regression, LASSO, Elastic Net, Drop LASSO, and Sparse Group LASSO. The proposed algorithm successfully detected highly differentiated genes, including the marker genes, for five different single-cell RNA sequencing datasets from mouse, plant, and human.
With the emergence of single-cell RNA sequencing (scRNA-seq) technology, scientists are able to examine gene expression at single-cell resolution. Analysis of scRNA-seq data has its own challenges, which stem from its high dimensionality. Machine learning offers the potential for gene (feature) selection from high-dimensional scRNA-seq data. Even though there exist multiple machine learning methods that appear to be suitable for feature selection, such as penalized regression, there is no rigorous comparison of their performances across data sets, where each poses its own challenges. Therefore, in this paper, we analyzed and compared multiple penalized regression methods for scRNA-seq data. Given the scRNA-seq data sets we analyzed, the results show that sparse group lasso (SGL) outperforms the other six methods (ridge, lasso, elastic net, drop lasso, group lasso, and big lasso) using the metrics area under the receiver operating curve (AUC) and computation time. Building on these findings, we proposed a new algorithm for feature selection using penalized regression methods.
The proposed algorithm works by selecting a small subset of genes and applying SGL to select the differentially expressed genes in scRNA-seq data. By using hierarchical clustering to group genes, the proposed method bypasses the need for domain-specific knowledge for gene grouping information. In addition, the proposed algorithm provided consistently better AUC for the data sets used.

  • Research Article
  • 10.1158/1557-3265.sabcs24-ps13-01
Abstract PS13-01: Evaluation of breast cancer stem cell gene expression signatures in single-cell RNA sequencing (scRNAseq) data from the OPPORTUNE and FELINE trials, and the association with treatment resistance
  • Jun 13, 2025
  • Clinical Cancer Research
  • Peter Hall + 6 more

Background: Cancer stem cells (CSCs) play a key role in tumor initiation, progression, and resistance to conventional therapies. These cells possess the ability to self-renew and differentiate, contributing to tumor heterogeneity and recurrence. Preclinical data indicate that targeting cancer stem cell-specific pathways could lead to more effective treatments and prevent relapse, thereby improving outcomes. However, there is limited clinical data to support this, in part due to the challenges of measuring CSCs in the clinical setting. Single-cell RNA sequencing (scRNAseq) data in combination with stemness gene expression signatures were investigated as a novel approach to detect and assess changes in CSC numbers in response to treatment. Methods: Single-cell RNAseq datasets from two clinical trials were interrogated – OPPORTUNE, a window-of-opportunity trial evaluating anastrozole vs anastrozole plus pictilisib (PI3K inhibitor) in 75 patients with ER+ HER2- early breast cancer, and FELINE, a neoadjuvant trial which compared letrozole plus placebo with letrozole plus ribociclib (CDK4/6 inhibitor) in 120 patients with ER+ HER2- early breast cancer. The primary endpoint for OPPORTUNE was the inhibition of tumor cell proliferation as measured by Ki67. The primary endpoint for FELINE was the rate of preoperative endocrine prognostic index (PEPI) score 0 after neoadjuvant endocrine therapy. Thirty-four patients from the FELINE study had available scRNAseq data for analysis; this number was 62 patients for the OPPORTUNE study. Stemness gene expression signatures with published evidence of selectivity for breast CSCs were identified from the literature, and their utility in detecting CSCs was compared in the datasets. Subsequently, changes in CSC fraction with treatment were assessed. Results: Eight different stemness gene signatures were identified from the published literature.
Among these, four (Kim_myc, benporath_es2, Bhattacharya_hESC, and Shats_consensus) exhibited selective expression in a minority of tumor cells, with the Shats consensus signature demonstrating the highest specificity to tumor cells over non-malignant cells. Utilizing a Gaussian mixture model, we estimated that approximately 4.5% of cells within a tumor are likely stem cells. In the OPPORTUNE study, higher stemness scores were observed in luminal B compared to luminal A tumors, but there was no association with PIK3CA mutation status. Changes to Ki67 were inversely associated with the CSC fraction – tumors with higher stemness scores were less likely to achieve a complete cell cycle arrest versus endocrine-sensitive tumors. Similarly, in the FELINE trial, non-responders in the letrozole arm showed a trend towards elevated stemness scores. Conclusion: The use of stemness gene expression signatures in scRNAseq data is a feasible method to assess changes in putative CSC fraction with treatment in breast cancer. Increased stemness scores were associated with more aggressive subtypes and resistance to treatment. Citation Format: Peter Hall, Alejandro Chibly, Peter Schmid, Sarah E Pinder, Arnie Purushotham, Alastair Thompson, Steven Gendreau. Evaluation of breast cancer stem cell gene expression signatures in single-cell RNA sequencing (scRNAseq) data from the OPPORTUNE and FELINE trials, and the association with treatment resistance [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2024; 2024 Dec 10-13; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(12 Suppl):Abstract nr PS13-01.

  • Research Article
  • Citations: 8
  • 10.3390/math11204315
Identifying Genetic Signatures from Single-Cell RNA Sequencing Data by Matrix Imputation and Reduced Set Gene Clustering
  • Oct 17, 2023
  • Mathematics
  • Soumita Seth + 7 more

In this current era, the identification of both known and novel cell types, the representation of cells, predicting cell fates, classifying various tumor types, and studying heterogeneity in various cells are the key areas of interest in the analysis of single-cell RNA sequencing (scRNA-seq) data. Due to the nature of the data, cluster identification in single-cell sequencing data with high dimensions presents several difficulties. In this paper, we introduce a new framework that combines various strategies such as imputed matrix, minimum redundancy maximum relevance (MRMR) feature selection, and shrinkage clustering to discover gene signatures from scRNA-seq data. Firstly, we conducted the pre-filtering of the “drop-out” value in the data focusing solely on imputing the identified “drop-out” values. Next, we applied the MRMR feature selection method to the imputed data and obtained the top 100 features based on the MRMR feature selection optimization scores for further downstream analysis. Thereafter, we employed shrinkage clustering on the selected feature matrix to identify the cell clusters using a global optimization approach. Finally, we applied the Limma-Voom R tool employing voom normalization and an empirical Bayes test to detect differentially expressed features with a false discovery rate (FDR) < 0.001. In addition, we performed the KEGG pathway and gene ontology enrichment analysis of the identified biomarkers using David 6.8 software. Furthermore, we conducted miRNA target detection for the top gene markers and performed miRNA target gene interaction network analysis using the Cytoscape online tool. Subsequently, we compared our detected 100 markers with our previously detected top 100 cluster-specified markers ranked by FDR of the latest published article and discovered three common markers; namely, Cyp2b10, Mt1, Alpi, along with 97 novel markers. In addition, the Gene Set Enrichment Analysis (GSEA) of both marker sets also yields similar outcomes. 
Apart from this, we performed another comparative study with another published method, demonstrating that our model detects more significant markers than that model. To assess the efficiency of our framework, we apply it to another dataset and identify 20 strongly significant up-regulated markers. Additionally, we perform a comparative study of different imputation methods and include an ablation study to prove that every key phase of our framework is essential and strongly recommended. In summary, our proposed integrated framework efficiently discovers differentially expressed stronger gene signatures as well as up-regulated markers in single-cell RNA sequencing data.
