Abstract

Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the (dis)similarities between entities are conserved across such different data. Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise and aids in their interpretation. We illustrate this using three diverse comparisons: gene methylation versus expression, evolution of language sounds versus word use, and country-level economic metrics versus cultural beliefs. The non-parametric approach is robust to noise and differences in scaling, and makes only weak assumptions about how the data were generated. It operates by decomposing similarities into two components: a ‘structural’ component analogous to a clustering, and an underlying ‘relationship’ between those structures. This allows a ‘structural comparison’ between two similarity matrices using their predictability from ‘structure’. Significance is assessed with the help of re-sampling appropriate for each dataset. The software, CLARITY, is available as an R package from github.com/danjlawson/CLARITY.

Highlights

  • The need to compare different sources of information about the same subjects arises in most quantitative sciences

  • CLARITY allows comparison of arbitrary datasets in which the same set of d subjects is observed. It represents the similarity of a reference dataset Y1 non-parametrically using increasingly rich representations of complexity k ≤ d

  • To emphasize the utility of CLARITY, we focus on an epigenetic simulation model from the literature [36], in which Methylation and Expression data are generated from independent case-control experiments that describe the same set of genetic loci (positions in the genome) corresponding to known genes

Introduction

The need to compare different sources of information about the same subjects arises in most quantitative sciences.

[Figure: columns show (a) spatial representation, (b) similarity representation and (c) residual persistence. Rows show (i) the reference dataset, (ii) a target dataset with the same Structure and (iii) a target dataset with a different Structure. k clusters are learned from the reference; Structure is the cluster members and Relationship is the inter-cluster distances. Target similarities are predicted using the learned Structure and a new Relationship. Persistence is the sum of squared residuals when using k clusters. Similarity panels index subjects on both axes, shaded from close through medium to far.]
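
The idea of learning Structure from a reference dataset and then fitting a new Relationship to a target can be sketched numerically. The Python below is an illustration under stated assumptions, not the implementation in the CLARITY R package: it assumes Structure can be stood in for by the top k singular vectors of the reference similarity matrix (a soft analogue of a clustering), and the function names `block_similarity` and `residual_persistence`, as well as the toy data, are hypothetical.

```python
import numpy as np


def block_similarity(labels, relationship):
    """Build a d x d similarity matrix from cluster labels and a k x k relationship."""
    Z = np.eye(relationship.shape[0])[labels]  # d x k membership indicators
    return Z @ relationship @ Z.T


def residual_persistence(Y_ref, Y_target):
    """Sum of squared residuals of Y_target for each complexity k = 1..d.

    Assumption: Structure is taken from the SVD of the reference only;
    the Relationship is then refitted to the target by least squares.
    """
    d = Y_ref.shape[0]
    U, _, _ = np.linalg.svd(Y_ref)  # learned from the reference, never the target
    curve = []
    for k in range(1, d + 1):
        A = U[:, :k]                 # structure of complexity k (orthonormal columns)
        X = A.T @ Y_target @ A       # least-squares relationship for orthonormal A
        R = Y_target - A @ X @ A.T   # what the learned structure cannot predict
        curve.append(float(np.sum(R ** 2)))
    return curve


# Toy data: six subjects in two clusters.
ref_labels = np.array([0, 0, 0, 1, 1, 1])
diff_labels = np.array([0, 1, 0, 1, 0, 1])
B_ref = np.array([[1.0, 0.2], [0.2, 1.0]])
B_new = np.array([[2.0, 1.5], [1.5, 0.5]])

Y_ref = block_similarity(ref_labels, B_ref)     # reference dataset
Y_same = block_similarity(ref_labels, B_new)    # same Structure, new Relationship
Y_diff = block_similarity(diff_labels, B_ref)   # different Structure

print("same structure:     ", np.round(residual_persistence(Y_ref, Y_same), 3))
print("different structure:", np.round(residual_persistence(Y_ref, Y_diff), 3))
```

In this toy setting the target that shares the reference's Structure is predicted with essentially no residual once k reaches the number of clusters, even though its Relationship differs, whereas the target with a different Structure keeps a clear residual at that k. Residuals that persist across k are the signal that the two datasets disagree structurally.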
