Abstract

Given a dataset, we quantify the size of patterns that must always exist in it. We do this formally through the lens of the Ramsey theory of graphs and a quantitative bound known as Goodman’s theorem. By combining statistical tools with the Ramsey theory of graphs, we give a nuanced understanding of how far a dataset is from being correlated and of what qualifies as a meaningful pattern. The method applies to a wide range of datasets. As examples, we analyze two very different datasets. The first is a dataset of repeated voters (n = 435) in the 1984 US Congress, for which we quantify how homogeneous a subset of congressional voters is and measure how transitive a subset of voters is. We also apply statistical Ramsey theory to global economic trading data (n = 214) to provide evidence that global markets are quite transitive. While these datasets are small relative to Big Data, they illustrate the new applications we are proposing. We end with specific calls to strengthen the connections between Ramsey theory and statistical methods.

Highlights

  • In the realm of data science, the conventional wisdom is that “more data is always better”, but is this the case? As a dataset D becomes larger, Ramsey theory describes the mathematical conditions by which disorder becomes impossible.

  • (Axioms 2019, 8, 29) … beyond the base requirement that there is a single shirt that must be worn twice in a given week. This leads to our major connection between Ramsey theory and statistical analysis: Remark 1 (Spurious Correlations through Ramsey theory).

  • While the expected value is a good benchmark, it still doesn’t answer the more fundamental question of how many monochromatic triangles are present in G_N versus how many are required by Ramsey theory.

Summary

Introduction

In the realm of data science, the conventional wisdom is that “more data is always better”, but is this the case? As a dataset D becomes larger, Ramsey theory describes the mathematical conditions by which disorder becomes impossible. It would be incorrect to conclude that the given person has a particular affinity for that repeated shirt. In this case, there is no meaningful conclusion we can draw, despite the natural human desire to attribute meaning to a pattern that is observed but forced to exist by the pigeonhole principle. … beyond the base requirement that there is a single shirt that must be worn twice in a given week. This leads to our major connection between Ramsey theory and statistical analysis: Remark 1 (Spurious Correlations through Ramsey theory). We also translate Goodman’s theorem, a result from Ramsey theory, into a measurement of the transitivity of a system (Theorem 2). So that these connections can be further used and explored, we take care to explain the Ramsey theory we use in language that a data scientist without training in Ramsey theory will understand.
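
As a rough illustration of the quantities mentioned above, the following Python sketch computes Goodman’s lower bound on the number of monochromatic triangles in any two-coloring of the edges of the complete graph on n vertices, and counts the monochromatic triangles actually present in a given coloring. The function names, the edge-list representation, and the fair-coin benchmark `expected_random` are assumptions made for this example; they are not taken from the paper, whose own statistical model may differ.

```python
from itertools import combinations
from math import comb


def goodman_lower_bound(n: int) -> int:
    """Minimum number of monochromatic triangles in any red/blue
    coloring of the edges of K_n (Goodman, 1959)."""
    if n % 2 == 0:
        u = n // 2
        return u * (u - 1) * (u - 2) // 3
    if n % 4 == 1:
        u = (n - 1) // 4
        return 2 * u * (u - 1) * (4 * u + 1) // 3
    u = (n - 3) // 4  # remaining case: n = 4u + 3
    return 2 * u * (u + 1) * (4 * u - 1) // 3


def count_monochromatic_triangles(n: int, red_edges) -> int:
    """Brute-force count over all vertex triples of K_n.
    Edges listed in red_edges are red; every other edge is blue."""
    red = {frozenset(e) for e in red_edges}
    count = 0
    for a, b, c in combinations(range(n), 3):
        colors = {frozenset(p) in red for p in ((a, b), (a, c), (b, c))}
        if len(colors) == 1:  # all three edges share one color
            count += 1
    return count


def expected_random(n: int) -> float:
    """Assumed benchmark: each edge is colored by an independent fair
    coin, so each triangle is monochromatic with probability 1/4."""
    return comb(n, 3) / 4


if __name__ == "__main__":
    # K_6 with red edges forming the complete bipartite graph K_{3,3}.
    red = [(a, b) for a in range(3) for b in range(3, 6)]
    print(count_monochromatic_triangles(6, red))  # 2
    print(goodman_lower_bound(6))                 # 2
```

In this small case the coloring attains the bound exactly, so its two monochromatic triangles are forced by the size of the graph rather than by any structure in the data. For a real dataset one would need some rule for coloring edges (for instance, marking a pair of records red when they are judged similar, which is a hypothetical choice here) before comparing the observed count with the bound.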

Mathematical Framework
The Ramsey Perspective
Models
Similarity in Voting Records
Theoretical Construction
Defining Deviation
Applied to Voting Threshold Graphs
Collaboration Model
Applications to Other Datasets
Applications to Transitivity
Application to Voting Records
Application to Global Trading Data
Theory Building
Further Applications
Findings
Closing Remarks