Characterizing user churn has become an important research area in networking and distributed systems, both in theoretical analysis and system design. A realistic churn model, often measured using periodic observation, should replicate two key properties of deployed systems -- (1) the arrival process and (2) the lifetime distribution of participating agents. Because users can be sampled only by sending packets to them and eliciting responses, there is an inherent tradeoff between overhead (i.e., bandwidth needed to perform the measurement) and accuracy of the obtained results. Furthermore, all observations are censored, i.e., rounded up or down to a multiple of Δ, where Δ is the minimum delay between repeat visits to the same user. Assuming a stationary arrival process, previous work shows that consistent (i.e., asymptotically accurate) estimation of the lifetime distribution is possible; however, the problem remains open in non-stationary cases. Open questions include which distributions these methods sample when the assumptions on the arrival process are violated, under what conditions consistency is achievable with existing techniques, and what avenues exist for improving their accuracy and overhead. To investigate these issues, we first use random-measure theory to develop a novel churn model that admits rather general non-stationary scenarios and even synchronized joins (e.g., flash crowds). We not only dispense with common assumptions, such as the existence of an arrival rate and ergodicity, but also show that this model yields all metrics of interest (e.g., sampled lifetime distributions, bandwidth overhead) through simple expressions. We apply these results to study the accuracy of prior techniques and discover that they are biased unless user lifetimes are exponential or the arrival measure is stationary.
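To make the censoring mechanism concrete, the following Python sketch (function and variable names are illustrative, not from the paper) simulates a crawler that revisits users every Δ time units: each measured lifetime is quantized to a multiple of Δ, and sessions shorter than Δ that fall entirely between two probes are missed outright.

```python
import math

def observe_lifetimes(sessions, delta):
    """Simulate a crawler probing at t = 0, delta, 2*delta, ...  A session
    (join, leave) is seen only at probe instants it spans, so its measured
    lifetime is censored to a multiple of delta; sessions that fit entirely
    between two consecutive probes are missed."""
    observed, missed = [], 0
    for join, leave in sessions:
        first = math.ceil(join / delta) * delta   # first probe at/after join
        last = math.floor(leave / delta) * delta  # last probe at/before leave
        if first > last:
            missed += 1                           # never probed while alive
        else:
            # seen from `first` through `last`; adding one delta for the
            # unseen edges keeps the estimate within delta of the truth
            observed.append(last - first + delta)
    return observed, missed

# A 3.2-unit session is measured as 3.0; a 0.2-unit session is missed:
print(observe_lifetimes([(0.5, 3.7), (1.2, 1.4)], 1.0))  # → ([3.0], 1)
```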
To overcome these limitations, we then create a new lifetime-sampling technique that remains asymptotically robust under all periodic arrival measures and provide a methodology for undoing the bias that missed users introduce into the sampled arrival rate. We demonstrate that the proposed approach offers higher accuracy while consuming 1-2 orders of magnitude less bandwidth than the alternatives. We finish by implementing the proposed framework and applying it to experimental data from massive crawls of Gnutella.
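The paper's de-biasing methodology is not reproduced here; as a generic illustration of why missed users deflate the sampled arrival rate, the sketch below applies inverse-probability (Horvitz-Thompson) weighting under two simplifying assumptions that are ours, not the paper's: the probe phase is uniformly random, and the true session lengths of detected users are known rather than censored.

```python
def detection_probability(lifetime, delta):
    """With probe spacing delta and a uniformly random probe phase, a session
    of true length l is hit by at least one probe w.p. min(l / delta, 1)."""
    return min(lifetime / delta, 1.0)

def estimated_total_sessions(observed_lifetimes, delta):
    """Each detected session of length l stands in for 1/p(l) sessions, so
    the weighted sum estimates the true number of arrivals, including the
    short-lived users that slipped between consecutive probes."""
    return sum(1.0 / detection_probability(l, delta)
               for l in observed_lifetimes)

# Three detected sessions; the 0.5-unit one is detected only half the time,
# so it counts twice, recovering an estimate of 4 arrivals in total:
print(estimated_total_sessions([2.0, 3.0, 0.5], 1.0))  # → 4.0
```

Dividing this corrected count by the observation window then yields a de-biased arrival-rate estimate.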