Sketching Methods Research Articles

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set of the de Bruijn graph, which is a set of unavoidable k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. 
We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. We also present a number of conjectures, together with computational and theoretical evidence supporting them. Code is available at https://github.com/Kingsford-Group/mdsscope.
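
The link between decycling sets and the remaining path length described in this abstract can be illustrated with a small check. The following is a minimal sketch, not taken from the mdsscope code: the function name `remaining_path_length` and its parameters are hypothetical, and it enumerates every k-mer, so it is practical only for small k. It removes a candidate set from the de Bruijn graph, reports whether a cycle remains, and otherwise returns the number of k-mers on the longest remaining path.

```python
from itertools import product


def remaining_path_length(k, candidate_set, alphabet="ACGT"):
    """Return the number of k-mers on the longest path left in the
    de Bruijn graph after removing candidate_set, or None if a cycle
    remains (i.e. candidate_set is not a decycling set).

    Illustrative helper only; the name and signature are assumptions.
    """
    # Nodes are all k-mers that are NOT in the candidate set.
    nodes = {"".join(p) for p in product(alphabet, repeat=k)} - set(candidate_set)

    # Edge u -> v when v is u shifted left by one letter plus a new letter.
    succ = {u: [u[1:] + c for c in alphabet if u[1:] + c in nodes] for u in nodes}

    indeg = {u: 0 for u in nodes}
    for u in nodes:
        for v in succ[u]:
            indeg[v] += 1

    # Kahn's topological sort; longest[u] counts k-mers on the longest
    # path ending at u.
    order = [u for u in nodes if indeg[u] == 0]
    longest = {u: 1 for u in nodes}
    i = 0
    while i < len(order):
        u = order[i]
        i += 1
        for v in succ[u]:
            longest[v] = max(longest[v], longest[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)

    if len(order) < len(nodes):
        return None  # some nodes sit on a cycle, so a cycle survived
    return max(longest.values(), default=0)
```

For example, over the binary alphabet with k = 2, `remaining_path_length(2, {"00", "11", "01"}, alphabet="01")` leaves only the k-mer `10`, while removing just `{"00", "11"}` leaves the cycle `01 -> 10 -> 01` and the function returns `None`.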

Motivation: Single-cell RNA-sequencing has grown massively in scale since its inception, presenting substantial analytic and computational challenges. Even simple downstream analyses, such as dimensionality reduction and clustering, require days of runtime and hundreds of gigabytes of memory for today’s largest datasets. In addition, current methods often favor common cell types, and miss salient biological features captured by small cell populations.

Results: Here we present Hopper, a single-cell toolkit that both speeds up the analysis of single-cell datasets and highlights their transcriptional diversity by intelligent subsampling, or sketching. Hopper realizes the optimal polynomial-time approximation of the Hausdorff distance between the full and downsampled dataset, ensuring that each cell is well-represented by some cell in the sample. Unlike prior sketching methods, Hopper adds points iteratively and allows for additional sampling from regions of interest, enabling fast and targeted multi-resolution analyses. In a dataset of over 1.3 million mouse brain cells, Hopper detects a cluster of just 64 macrophages expressing inflammatory genes (0.004% of the full dataset) from a Hopper sketch containing just 5000 cells, and several other small but biologically interesting immune cell populations invisible to analysis of the full data. On an even larger dataset consisting of ∼2 million developing mouse organ cells, we show Hopper’s even representation of important cell types in small sketches, in contrast with prior sketching methods. We also introduce Treehopper, which uses spatial partitioning to speed up Hopper by orders of magnitude with minimal loss in performance. By condensing transcriptional information encoded in large datasets, Hopper and Treehopper grant the individual user with a laptop the analytic capabilities of a large consortium.

Availability and implementation: The code for Hopper is available at https://github.com/bendemeo/hopper. In addition, we have provided sketches of many of the largest single-cell datasets, available at http://hopper.csail.mit.edu.
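
The abstract does not name the subsampling procedure, so the following is only a minimal sketch under an assumption: it uses greedy farthest-first traversal, the classic 2-approximation to the k-center problem, whose objective is exactly the Hausdorff distance between the sample and the full set. The function name `farthest_first_sketch` and its parameters are hypothetical and are not Hopper's API. Points are added one at a time, which matches the iterative behaviour the abstract describes.

```python
import math


def farthest_first_sketch(points, size, start=0):
    """Greedy farthest-first traversal over a list of coordinate tuples.

    Returns (chosen_indices, hausdorff), where hausdorff is the largest
    distance from any point to its nearest chosen point.
    Illustrative helper only; not the Hopper implementation.
    """
    chosen = [start]
    # dist[i] = distance from point i to its nearest chosen point so far.
    dist = [math.dist(p, points[start]) for p in points]

    while len(chosen) < min(size, len(points)):
        # Pick the point that is currently worst represented.
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        # Update nearest-representative distances with the new point.
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]

    return chosen, max(dist)
```

Because each step costs one pass over the data, a sketch of s points from n cells takes O(n·s) distance evaluations in this naive form; the abstract attributes Treehopper's speed-up to spatial partitioning, which this sketch does not attempt.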

Articles published on Sketching Methods

K-nonical space: sketching with reverse complements.

Sketching Methods with Small Window Guarantee Using Minimum Decycling Sets.

Sampling Methods for Inner Product Sketching

Accelerated Double-Sketching Subspace Newton

Stylus and Gesture Asymmetric Interaction for Fast and Precise Sketching in Virtual Reality

Using a digital ‘pocket atelier’ for creative teamwork: What is the impact of digital costume sketching on the professional competence of costume designers?

Combining graphic facilitation and animation-based sketching in higher education

Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash.

Distributed Sketching for Randomized Optimization: Exact Characterization, Concentration, and Lower Bounds

Machine learning-based CT radiomics model to discriminate the primary and secondary intracranial hemorrhage

RidgeSketch: A Fast Sketching Based Solver for Large Scale Ridge Regression

How to reduce dimension with PCA and random projections?

Revisiting Co-Occurring Directions: Sharper Analysis and Efficient Algorithm for Sparse Matrices

Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis

Lower Bounds and a Near-Optimal Shrinkage Estimator for Least Squares Using Random Projections

Gradient preconditioned mini-batch SGD for ridge regression

Hopper: a mathematically optimal algorithm for sketching biological data.

D²HistoSketch: Discriminative and Dynamic Similarity-Preserving Sketching of Streaming Histograms

Designing Human Centered GeoVisualization application – the SanaViz – for telehealth users: A case study

Mediating cognitive transformation with VR 3D sketching during conceptual architectural design process
