Rand Index Research Articles

Visual summarization of clinical data collected on patients contained within the electronic health record (EHR) may enable precise and rapid triage at the time of patient presentation to an emergency department (ED). The triage process is critical in the appropriate allocation of resources and in anticipating eventual patient disposition, typically admission to the hospital or discharge home. EHR data are high-dimensional and complex, but offer the opportunity to discover and characterize underlying data-driven patient phenotypes. Data-driven phenotypes are intended to relieve reliance on weak labels like diagnosis codes and to aid in identifying populations of existing patients that are most similar to a specific patient. These phenotypes will enable improved, personalized therapeutic decision making and prognostication. In this work, we focus on the challenge of two-dimensional patient projections. A low dimensional embedding offers visual interpretability lost in higher dimensions. While linear dimensionality reduction techniques such as principal component analysis are often used towards this aim, they are insufficient to describe the variance of patient data. This linear reduction does not account for higher order, non-linear interactions of variables. In this work, we employ the newly-described non-linear embedding technique called uniform manifold approximation and projection (UMAP). UMAP seeks to capture both local and global structures in high-dimensional data. We then use Gaussian mixture models to identify clusters in the embedded data and use the adjusted Rand index (ARI) to establish stability in the discovery of these clusters. This technique is applied to five common clinical chief complaints from a real-world ED EHR dataset, describing the emergent properties of discovered clusters. 
We observe clinically relevant cluster attributes, suggesting that visual embedding of EHR data using non-linear dimensionality reduction is a promising approach to reveal data-driven patient phenotypes. Across the five chief complaints, we find between 2 and 6 clusters, with the peak mean pairwise ARI between subsequent training iterations ranging from 0.35 to 0.74.
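
As an illustration of the stability measure mentioned in this abstract, the following is a minimal sketch of the adjusted Rand index computed by counting object pairs, plus a mean over all pairs of clustering runs. The function names and the toy label lists are hypothetical, and averaging over every pair of runs is an assumption; the abstract does not specify exactly how the authors pair their training iterations.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same objects")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two objects: trivially identical partitions

    # Pairs placed together in both labelings, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (maximum - expected)


def mean_pairwise_ari(runs):
    """Mean ARI over all pairs of runs (an assumed pairing scheme)."""
    scores = [adjusted_rand_index(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)


# Hypothetical cluster labels from three runs over the same four patients.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
print(mean_pairwise_ari(runs))
```

The index depends only on which objects are grouped together, so the first two toy runs agree perfectly even though their label names are swapped.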


Motivation: Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (<0.05× per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view of intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells based on a Bayesian filtering approach of relevant loci and exploiting read overlap and phasing.

Results: We developed Single Cell Data Tumor Clusterer (SECEDO, lat. ‘to separate’), a new method to cluster tumor cells based solely on SNVs, inferred on ultra-low coverage single-cell DNA sequencing data. We applied SECEDO to a synthetic dataset simulating 7250 cells and eight tumor subclones from a single patient and were able to accurately reconstruct the clonal composition, detecting 92.11% of the somatic SNVs, with the smallest clusters representing only 6.9% of the total population. When applied to five real single-cell sequencing datasets from a breast cancer patient, each consisting of 2000 cells, SECEDO was able to recover the major clonal composition in each dataset at the original coverage of 0.03×, achieving an Adjusted Rand Index (ARI) score of 0.6. The current state-of-the-art SNV-based clustering method achieved an ARI score of 0, even after merging cells to create higher coverage data (factor 10 increase), and was only able to match SECEDO's performance when pooling data from all five datasets, in addition to artificially increasing the sequencing coverage by a factor of 7. Variant calling on the resulting clusters recovered more than twice as many SNVs as would have been detected if calling on all cells together. Further, the allelic ratio of the called SNVs on each subcluster was more than double relative to the allelic ratio of the SNVs called without clustering, thus demonstrating that calling variants on subclones, in addition to both increasing sensitivity of SNV detection and attaching SNVs to subclones, significantly increases the confidence of the called variants.

Availability and implementation: SECEDO is implemented in C++ and is publicly available at https://github.com/ratschlab/secedo. Instructions to download the data and the evaluation code to reproduce the findings in this paper are available at: https://github.com/ratschlab/secedo-evaluation. The code and data of the submitted version are archived at: https://doi.org/10.5281/zenodo.6516955.

Supplementary information: Supplementary data are available at Bioinformatics online.


Articles published on Rand Index

MCluster-VAEs: An end-to-end variational deep learning-based clustering method for subtype discovery using multi-omics data

Mining regular behaviors based on spatiotemporal trajectory multi‐dimensional features

Distributed denial of service attack detection in E-government cloud via data clustering

Visualization of emergency department clinical data for interpretable patient phenotyping

The Initialization of Flexible K-Medoids Partitioning Method Using a Combination of Deviation and Sum of Variable Values

CLUSTERING SURGEMES USING PROTOTYPES FROM ROBOTIC KINEMATIC INFORMATION

Evaluation and Comparison of Spatial Clustering for Solar Irradiance Time Series

Unsupervised segmentation of hyperspectral remote sensing images with superpixels

Evaluation of effect based on different typing methods in Escherichia coli

Learning discriminative and structural samples for rare cell types with deep generative model.

Fuzzy Cluster Analysis for Interval Data Based on the Overlap Distance

SECEDO: SNV-based subclone detection using ultra-low coverage single-cell DNA sequencing.

Understanding the Adjusted Rand Index and Other Partition Comparison Indices Based on Counting Object Pairs

Improved K‐means algorithm for clustering non‐spherical data

Similarity-based Partitions on Pre-ordered Sets

The impact of public health emergency governance based on artificial intelligence

Axon Tracing and Centerline Detection using Topologically-Aware 3D U-Nets.

An Evaluative Measure of Clustering Methods Incorporating Hyperparameter Sensitivity

A neighbour-similarity based community discovery algorithm

Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans)
