Abstract

Data generated from a system of interest typically consist of measurements on many covariate features, and possibly multiple response features, across all subjects in a designated ensemble. Such data are naturally represented by one response matrix set against one covariate matrix. A matrix lattice is an advantageous platform for simultaneously accommodating heterogeneous data types (continuous, discrete and categorical) and for exploring hidden dependency among and between features and subjects. After each feature is individually renormalized with respect to its own histogram, the categorical version of mutual conditional entropy is evaluated for all pairs of response and covariate features according to combinatorial information theory. Then, by applying Data Cloud Geometry (DCG) algorithmic computations to this mutual conditional entropy matrix, the features are partitioned into multiple synergistic feature-groups. Distinct synergistic feature-groups embrace distinct structures of dependency. The explicit details of dependency among members of a synergistic feature-group are seen through multiscale compositions of blocks computed by a computing paradigm called Data Mechanics. We then propose a categorical pattern matching approach to establish a directed associative linkage from the patterned response dependency to the serially structured covariate dependency. The graphic display of such a directed associative linkage is termed an information flow, and the degrees of association are evaluated via tree-to-tree mutual conditional entropy. This new universal way of discovering system knowledge is illustrated through five data sets. In each case, the emergent visible heterogeneity is an organization of discovered knowledge.
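The pairwise step described in the abstract, categorizing each feature by its own histogram and then evaluating a mutual conditional entropy, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the equal-width binning, the default of four bins, the definition of mutual conditional entropy as the average of H(X|Y)/H(X) and H(Y|X)/H(Y), and all function names are assumptions introduced here.

```python
import numpy as np

def categorize(values, bins=4):
    """Code a feature into integer categories 0..bins-1.

    Assumption: equal-width histogram bins; the paper's own
    histogram-based renormalization may differ.
    """
    values = np.asarray(values, dtype=float).ravel()
    edges = np.histogram_bin_edges(values, bins=bins)
    # Only interior edges are used, so the maximum falls in the last bin.
    return np.digitize(values, edges[1:-1])

def entropy(counts):
    """Shannon entropy (natural log) of a vector of category counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def conditional_entropy(x, y):
    """H(X | Y) computed from the contingency table of two coded features."""
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1)  # count co-occurrences
    weights = table.sum(axis=0) / table.sum()  # P(Y = y)
    return float(sum(w * entropy(col) for w, col in zip(weights, table.T)))

def mutual_conditional_entropy(x, y):
    """Assumed definition: mean of H(X|Y)/H(X) and H(Y|X)/H(Y)."""
    hx = entropy(np.unique(x, return_counts=True)[1])
    hy = entropy(np.unique(y, return_counts=True)[1])
    if hx == 0.0 or hy == 0.0:
        raise ValueError("a constant feature has zero entropy; the ratio is undefined")
    return 0.5 * (conditional_entropy(x, y) / hx + conditional_entropy(y, x) / hy)
```

Under this assumed definition, values near 0 indicate that each feature nearly determines the other, and values near 1 indicate little association. Evaluating the function on every response-covariate pair would fill a matrix of the kind the abstract passes on to the DCG step.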

Highlights

  • Most scientific research aims to acquire knowledge and understanding of systems of interest

  • Each data set is chosen for its own idiosyncratic reasons and characteristics: 1) the 1st data set, with a 1D binary response feature, shows why an information flow has advantages over a logistic regression model; 2) the 2nd data set, with a 1D continuous response feature, illustrates that a data set can sustain only a limited, not the full, spectrum of resolutions of information content implied by a linear regression model; 3) the 3rd and 4th data sets deal with multiple response features of distinct data types; 4) the 5th data set consists of covariate features of all types, continuous, discrete and categorical, all of which need to be properly digitally coded

  • The results for all five data sets are presented as information flows, which are meant to advance our system knowledge through concise and vivid pictorial visualizations


Summary

Introduction

Most scientific research aims to acquire knowledge and understanding of systems of interest.

Data files referenced: Categorical-pattern-matching (http://users.stat.ufl.edu/~winner/data/gbelec.dat and http://users.stat.ufl.edu/~winner/data/gbelec.txt), Patterns of bird species (http://users.stat.ufl.edu/~winner/data/insular.dat and http://users.stat.ufl.edu/~winner/data/insular.txt), Height (http://users.stat.ufl.edu/~winner/data/police_height.dat and http://users.stat.ufl.edu/~winner/data/police_height.txt)

