Abstract
Online controlled experiments, e.g., A/B testing, is the state-of-the-art approach used by modern Internet companies to improve their services based on data-driven decisions. The most challenging problem is to define an appropriate online metric of user behavior, so-called Overall Evaluation Criterion (OEC), which is both interpretable and sensitive. A typical OEC consists of a key metric and an evaluation statistic. Sensitivity of an OEC to the treatment effect of an A/B test is measured by a statistical significance test. We introduce the notion of Overall Acceptance Criterion (OAC) that includes both the components of an OEC and a statistical significance test. While existing studies on A/B tests are mostly concentrated on the first component of an OAC, its key metric, we widely study the two latter ones by comparison of several statistics and several statistical tests with respect to user engagement metrics on hundreds of A/B experiments run on real users of Yandex. We discovered that the application of the state-of-the-art Student's t-tests to several main user engagement metrics may lead to an underestimation of the false-positive rate by an order of magnitude. We investigate both well-known and novel techniques to overcome this issue in practical settings. At last, we propose the entropy and the quantiles as novel OECs that reflect the diversity and extreme cases of user engagement.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.