Abstract

We develop a general statistical framework for the analysis and inference of large tree-structured data, with a focus on developing asymptotic goodness-of-fit tests. We first propose a consistent statistical model for binary trees, from which we develop a class of invariant tests. Using the model for binary trees, we then construct tests for general trees by using the distributional properties of the continuum random tree, which arises as the invariant limit for a broad class of models for tree-structured data based on conditioned Galton–Watson processes. The test statistics for the goodness-of-fit tests are simple to compute and are asymptotically distributed as χ² and F random variables. We illustrate our methods on an important application of detecting tumor heterogeneity in brain cancer. We use a novel approach with tree-based representations of magnetic resonance images and employ the developed tests to ascertain tumor heterogeneity between two groups of patients. Supplementary materials for this article are available online.

Highlights

  • The statistical analysis of tree-structured objects has received appreciable attention in recent years owing to the emergence of datasets wherein the underlying quantities of interest allow for tree-like representations

  • Our approach in this article is based on the abstract notion of a Continuum Random Tree (CRT) from Aldous [1991a] and Aldous [1993] which arises as a continuous limit as the number of vertices grows without bound for a large class of random trees

  • The pertinent question behind the Dyck path representation of an ordered tree is this: suppose a conditioned Galton–Watson (CGW) tree τ_n is distributed as a member of the class (2.2); what is the ramification of the bijective transformation τ_n → H_n on the class {π_{k,σ²} : k = 0, 1, 2, ...; 0 < σ² < ∞}? If we propose to develop inferential tools on the space of Dyck paths, we must establish the equivalence of statistical procedures, perhaps in the Le Cam sense, on {π_{k,σ²} : k = 0, 1, 2, ...; 0 < σ² < ∞} and on the class resulting from the transformation


Summary

Introduction

The statistical analysis of tree-structured objects has received appreciable attention in recent years owing to the emergence of datasets wherein the underlying quantities of interest allow for tree-like representations. Some central challenges have stymied the systematic development of tools for statistical inference: the non-Euclidean nature of the underlying space poses considerable difficulties when developing probability models for fully observed trees; tree-structured data rarely contain the same number of vertices, leading to issues in comparing trees of differing sizes; and generating trees from a probability model for simulation purposes is not straightforward. Motivated by these issues, our approach in this article is based on the abstract notion of a Continuum Random Tree (CRT) from Aldous [1991a] and Aldous [1993], which arises as a continuous limit as the number of vertices grows without bound for a large class of random trees. The second specification is based on the limiting distribution of a family of subtrees referred to as Least Common Ancestor (LCA) trees. These considerations on conditioned Galton–Watson (CGW) tree models for tree-structured data lead us to the primary focus of this article.

Preliminaries
Parametric family and test from LCA trees
Parametric family and test based on Dyck Path
Simulations
Discussion
