Abstract

Variables in large data sets in biology or e-commerce often have a head, made up of very frequent values, and a long tail of ever rarer values. Models such as the Zipf or Zipf–Mandelbrot distributions provide a good description. The problem we address here is the visualization of two such long-tailed variables, as one might see in a bivariate Zipf context. We introduce a copula plot to display the joint behavior of such variables. The plot uses an empirical ordering of the data; we prove that this ordering is asymptotically accurate in a Zipf–Mandelbrot–Poisson model. We often see an association between entities at the head of one variable and those in the tail of the other. We present two generative models (saturation and bipartite preferential attachment) that show such qualitative behavior, and we characterize the power law behavior of the marginal distributions in these models.
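To make the idea of an empirical ordering concrete, here is a minimal sketch of how copula-style coordinates might be computed from a list of (row entity, column entity) observations. It is an illustration under stated assumptions, not the paper's construction: the function name `copula_points`, the midpoint convention, and the tie-breaking rule are all hypothetical choices made for this example.

```python
from collections import Counter


def copula_points(pairs):
    """Map each (row_entity, col_entity) observation to a point in [0, 1]^2.

    Assumption (not from the source): each entity is placed at the midpoint
    of its share of the cumulative marginal mass, after sorting entities
    from most to least frequent. Head entities land near 0, tail near 1.
    """
    row_counts = Counter(r for r, _ in pairs)
    col_counts = Counter(c for _, c in pairs)
    n = len(pairs)

    def midpoints(counts):
        # Empirical ordering: decreasing observed count; ties are broken
        # by the entity's string form so the result is deterministic.
        ordered = sorted(counts, key=lambda k: (-counts[k], str(k)))
        coords, cumulative = {}, 0
        for k in ordered:
            coords[k] = (cumulative + counts[k] / 2) / n
            cumulative += counts[k]
        return coords

    u = midpoints(row_counts)
    v = midpoints(col_counts)
    return [(u[r], v[c]) for r, c in pairs]


if __name__ == "__main__":
    # Toy data: "a" is the head of the first variable, "x" of the second.
    toy = [("a", "x"), ("a", "y"), ("a", "z"), ("b", "x")]
    for point in copula_points(toy):
        print(point)
```

Plotting these points (or a two-dimensional histogram of them) gives a display in which both margins are approximately uniform, so any structure that remains reflects the joint behavior rather than the long-tailed marginals.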

Highlights

  • It is increasingly common to see data sets in which two or more categorical variables each have a long-tailed distribution, of which the Zipf distribution is the best known example

  • In making the copula displays we have implicitly assumed that sorting entities by their observed size in the data set puts them into the correct order that we would see in an infinite sample

  • We have investigated head-to-tail affinities for bivariate heavy tailed data arising from bipartite networks and directed networks


Summary

Introduction

It is increasingly common to see data sets in which two or more categorical variables each have a long-tailed distribution, of which the Zipf distribution is the best known example. In making the copula displays we have implicitly assumed that sorting entities by their observed size in the data set puts them into the correct order that we would see in an infinite sample. The ratings data we looked at typically showed head-to-tail affinities. A second model invokes bipartite preferential attachment. This model provides reasonable marginal distributions, and we find head-to-tail affinities.
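The ordering assumption can be explored with a small simulation. The sketch below assumes one plausible reading of a Zipf–Mandelbrot–Poisson model: entity i has probability proportional to (i + q)^(-alpha), and its observed count is an independent Poisson variable with mean equal to the expected sample size times that probability. The parameter names (`n_entities`, `total`, `alpha`, `q`), the helper functions, and the misordering measure are all hypothetical and are not taken from the paper.

```python
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate by counting unit-rate exponential
    arrivals in [0, lam]. Exact, but takes O(lam) time."""
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def zmp_counts(n_entities, total, alpha, q, seed=0):
    """Simulate counts under the assumed model: entity i (1-based true rank)
    has weight (i + q) ** (-alpha) and a Poisson count whose mean is
    `total` times its normalized weight."""
    rng = random.Random(seed)
    weights = [(i + q) ** (-alpha) for i in range(1, n_entities + 1)]
    z = sum(weights)
    return [poisson(total * w / z, rng) for w in weights]


def head_misorderings(counts, head):
    """Count adjacent pairs among the first `head` true ranks whose observed
    counts are not strictly decreasing, i.e. where sorting by observed size
    could fail to recover the true order."""
    return sum(1 for i in range(head - 1) if counts[i] <= counts[i + 1])


if __name__ == "__main__":
    for total in (1_000, 10_000, 100_000):
        counts = zmp_counts(n_entities=500, total=total, alpha=1.2, q=2.0)
        print(total, head_misorderings(counts, head=20))
```

Under these assumptions one would expect the number of misordered adjacent pairs in the head to shrink as `total` grows, which is the informal content of the claim that the empirical ordering becomes accurate in large samples. The simulation illustrates that tendency only; it does not stand in for the asymptotic results.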

Construction
Copulas and discrepancies
Maslov and Sneppen’s display
Examples
Bipartite graphs
Directed graphs
Symmetric matrices
Numerical summaries
Proper ordering
Data from the deep tail
Affinity models
A saturation model
Bipartite preferential attachment
Discussion
Accuracy of the bulk ordering
Proof of Theorem 3
Findings
Proof of Theorem 4