Visualizing bivariate long-tailed data

Justin S Dyer, Art B Owen

doi:10.1214/11-ejs622

Abstract

Variables in large data sets in biology or e-commerce often have a head, made up of very frequent values, and a long tail of ever rarer values. Models such as the Zipf or Zipf–Mandelbrot distributions provide a good description. The problem we address here is the visualization of two such long-tailed variables, as one might see in a bivariate Zipf context. We introduce a copula plot to display the joint behavior of such variables. The plot uses an empirical ordering of the data; we prove that this ordering is asymptotically accurate in a Zipf–Mandelbrot–Poisson model. We often see an association between entities at the head of one variable and those from the tail of the other. We present two generative models (saturation and bipartite preferential attachment) that show such qualitative behavior, and we characterize the power law behavior of the marginal distributions in these models.
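For orientation, the Zipf–Mandelbrot law named above can be written out as follows. The first line is the standard form of the law; the second line is only an assumed reading of the "Zipf–Mandelbrot–Poisson" model (independent Poisson counts with means proportional to the Zipf–Mandelbrot probabilities), since the abstract does not spell it out. The symbols s, q, n, N and X_i are notation chosen here for illustration and may differ from the paper's.

```latex
% Zipf–Mandelbrot law on ranks i = 1, ..., n (standard form; s > 0, q >= 0).
% Setting q = 0 recovers the plain Zipf law.
p_i \;=\; \frac{(i+q)^{-s}}{\sum_{k=1}^{n} (k+q)^{-s}}, \qquad i = 1,\dots,n .

% Assumed Poisson sampling layer (an illustration, not taken from the paper):
% entity i is observed X_i times in a sample of nominal size N.
X_i \;\sim\; \mathrm{Poisson}(N p_i) \quad \text{independently for } i = 1,\dots,n .
```

Under this reading, the empirical ordering sorts entities by X_i, and the ordering question is whether that sort agrees with the sort by p_i.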

Highlights

It is increasingly common to see data sets in which two or more categorical variables each have a long-tailed distribution, of which the Zipf distribution is the best-known example.
In making the copula displays we have implicitly assumed that sorting entities by their observed size in the data set puts them into the correct order that we would see in an infinite sample.
We have investigated head-to-tail affinities for bivariate heavy-tailed data arising from bipartite networks and directed networks.

Summary

Introduction

It is increasingly common to see data sets in which two or more categorical variables each have a long-tailed distribution, of which the Zipf distribution is the best-known example. In making the copula displays we have implicitly assumed that sorting entities by their observed size in the data set puts them into the correct order that we would see in an infinite sample. The ratings data we looked at typically showed head-to-tail affinities. A second model invokes bipartite preferential attachment. This model provides reasonable marginal distributions, and we find head-to-tail affinities.
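The empirical ordering described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tie-breaking rule (first appearance), the use of interval midpoints, and the coarse grid are all assumptions made here for illustration. Each entity is sorted by its observed count, placed on [0, 1] according to the cumulative share of observations, and the pairs are then binned to give a rough picture of the joint behavior.

```python
from collections import Counter


def empirical_order(values):
    """Sort entities from most to least frequent.

    Ties are broken by first appearance (an arbitrary, assumed rule).
    """
    counts = Counter(values)
    first_seen = {}
    for idx, v in enumerate(values):
        first_seen.setdefault(v, idx)
    return sorted(counts, key=lambda v: (-counts[v], first_seen[v]))


def cumulative_positions(values):
    """Map each entity to the midpoint of its share of the data on [0, 1].

    Entities are laid out in empirical order, each occupying an interval
    whose width is its fraction of all observations.
    """
    counts = Counter(values)
    n = len(values)
    positions = {}
    running = 0
    for v in empirical_order(values):
        positions[v] = (running + counts[v] / 2) / n
        running += counts[v]
    return positions


def copula_grid(pairs, bins=4):
    """Count observations in a bins x bins grid over the unit square.

    Row index follows the first variable, column index the second;
    index 0 is the head (most frequent entities) on each axis.
    """
    pos_x = cumulative_positions([x for x, _ in pairs])
    pos_y = cumulative_positions([y for _, y in pairs])
    grid = [[0] * bins for _ in range(bins)]
    for x, y in pairs:
        i = min(int(pos_x[x] * bins), bins - 1)
        j = min(int(pos_y[y] * bins), bins - 1)
        grid[i][j] += 1
    return grid


if __name__ == "__main__":
    # Hypothetical toy data: (user, item) pairs.
    pairs = [("a", "x"), ("a", "y"), ("a", "z"),
             ("b", "x"), ("c", "x"), ("a", "x")]
    for row in copula_grid(pairs, bins=2):
        print(row)
```

In the toy data, the frequent user "a" is paired with the rare items "y" and "z", and the rare users "b" and "c" are paired with the frequent item "x", so the tail-tail cell of the grid stays empty. Under independence the cells would be filled roughly in proportion to their area, which is the baseline a copula display is read against.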

Construction

Copulas and discrepancies

Maslov and Sneppen’s display

Examples

Bipartite graphs

Directed graphs

Symmetric matrices

Numerical summaries

Proper ordering

Data from the deep tail

Affinity models

A saturation model

Bipartite preferential attachment

Discussion

Accuracy of the bulk ordering

Proof of Theorem 3

Findings

Proof of Theorem 4

Journal: Electronic Journal of Statistics
Publication Date: Jan 1, 2011
Citations: 11
License type: CC BY
