A conditional predictive p-value to compare a multinomial with an overdispersed multinomial in the analysis of T-cell populations

Q Pei, M R Albertini, M D Macklin, C L Zuleger, M A Newton

doi:10.1093/biostatistics/kxt039

Abstract

Immunological experiments that record primary molecular sequences of T-cell receptors produce moderate to high-dimensional categorical data, some of which may be subject to extra-multinomial variation caused by technical constraints of cell-based assays. Motivated by such experiments in melanoma research, we develop a statistical procedure for testing the equality of two discrete populations, where one population delivers multinomial data and the other is subject to a specific form of overdispersion. The procedure computes a conditional-predictive p-value by splitting the data set into two, obtaining a predictive distribution for one piece given the other, and using the observed predictive ordinate to generate a p-value. The procedure has a simple interpretation, requires fewer modeling assumptions than would be required of a fully Bayesian analysis, and has reasonable operating characteristics as evidenced empirically and by asymptotic analysis.
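As a rough illustration of the "predictive distribution for one piece given the other" idea, the following Python sketch computes a Monte Carlo p-value from the observed predictive ordinate. It is a simplified sketch under stated assumptions, not the authors' procedure: both samples are treated as plain multinomial (the overdispersion component is ignored), the two samples themselves serve as the two pieces, and a symmetric Dirichlet prior with a hypothetical weight `prior` is assumed. The function names `dirmult_logpmf` and `predictive_pvalue` are illustrative only.

```python
import numpy as np
from math import lgamma


def dirmult_logpmf(y, alpha):
    """Log pmf of a Dirichlet-multinomial with parameter vector alpha at counts y."""
    y = np.asarray(y, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n, a = y.sum(), alpha.sum()
    out = lgamma(n + 1.0) + lgamma(a) - lgamma(n + a)
    for yk, ak in zip(y, alpha):
        out += lgamma(yk + ak) - lgamma(ak) - lgamma(yk + 1.0)
    return out


def predictive_pvalue(x, y, prior=0.5, n_sim=5000, seed=0):
    """Monte Carlo p-value based on the predictive ordinate of y given x.

    Assumption (not from the source): under the null, x and y are multinomial
    draws from the same population with a symmetric Dirichlet(prior) prior, so
    the predictive distribution of y given x is Dirichlet-multinomial.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    y = np.asarray(y)
    alpha = x + prior                      # posterior Dirichlet parameters given x
    n = int(y.sum())
    observed = dirmult_logpmf(y, alpha)    # observed predictive ordinate (log scale)
    # Simulate replicate data sets from the predictive distribution of y given x.
    thetas = rng.dirichlet(alpha, size=n_sim)
    sims = [rng.multinomial(n, theta) for theta in thetas]
    sim_ordinates = np.array([dirmult_logpmf(s, alpha) for s in sims])
    # p-value: predictive probability of an ordinate no larger than the observed one.
    return float(np.mean(sim_ordinates <= observed + 1e-12))


if __name__ == "__main__":
    x = [50, 30, 15, 5]                          # hypothetical counts, first sample
    print(predictive_pvalue(x, [10, 6, 3, 1]))   # second sample resembling x
    print(predictive_pvalue(x, [0, 0, 2, 18]))   # second sample unlike x
```

A small p-value indicates that the second sample sits in a low-probability region of the predictive distribution built from the first, which is the sense in which the observed predictive ordinate generates a test of equality.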

Highlights

When testing the equality of two discrete populations, Fisher’s exact test applies naturally to multinomial samples (e.g. Agresti, 1990, p. 62)
We address the testing problem by developing a conditional predictive p-value

INTRODUCTION

BIOLOGICAL CONTEXT

Sampling model

CONDITIONAL PREDICTIVE p-VALUE

POSTERIOR AND PREDICTIVE SAMPLING

ASYMPTOTIC THEORY