Background: Programmed-death-1/ligand-1 inhibitors (PD-1/L1i’s) have emerged as pivotal treatments for many cancers. A notable feature of this class of medicines is the dichotomous response pattern: a small but clinically relevant percentage of patients (5% to 20%) benefit from deep and durable responses resembling functional cures (durable responders), while most patients experience only a modest or negligible response. Accurately predicting durable responders remains elusive due to the lack of a reliable biomarker. Another notable feature of these medicines is that different PD-1/L1i’s have obtained statistically significant results, leading to marketing approval, for some cancer indications but not for others, with no discernible pattern. These puzzling inconsistencies have generated extensive discussion among oncologists. Proposed (but not entirely convincing) explanations include true underlying differences in efficacy between cancer types and subtle differences in trial design.

Objective: To investigate a less-explored hypothesis, the durable-responder effect: an initially unidentified group of durable responders generates more statistical noise than anticipated, leading to low-powered randomised controlled trials (RCTs) that report randomly variable results.

Study design: Using simulation, this investigation divides participants in PD-(L)1i RCTs into two groups: durable responders and patients with a more modest response. Drawing on published data for melanoma, lung and urothelial cancers, multiple pre-specified scenarios are replicated 50,000 times, systematically varying the durable-responder percentage from 5% to 20% and the modest-response hazard ratio for overall survival [HR(OS)] from 0.8 to 1.0. This allowed evaluation of the effect of durable responders on power, point estimates of the treatment effect for OS, and the probability of a misleading signal for harm.
Results: When the treatment effect for the modest responders is similar to the comparator arm, statistical power remains below 80%, limiting the ability to reliably detect durable responders. Conversely, there is a material probability of obtaining a statistically significant result that exaggerates the treatment effect by chance. For instance, with an average HR(OS) of 0.93 (corresponding to 5% durable responders), statistically significant trials (7.2% of simulated trials) show an average HR(OS) of 0.77. Additionally, when 5% are durable responders, there is a 20% probability that the HR(OS) will exceed 1.0, suggesting potential harm when none exists.

Conclusion: This paper adds to the possible explanations for the puzzlingly inconsistent results from PD-(L)1i RCTs. Initially unidentified durable responders introduce features typical of imprecise, low-powered studies: a propensity for false-negative results; estimates of benefit that might not replicate; and misleading signals for harm.
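
The two-group simulation described under the study design can be illustrated with a minimal sketch. It is not the authors' code, and its numbers are not expected to reproduce the reported results. Everything below is an assumption chosen for illustration: exponential survival times, a 12-month control-arm median, 36 months of follow-up with administrative censoring, 300 patients per arm, a durable-responder hazard ratio of 0.1, and a simple event-rate ratio with a Wald test on the log scale standing in for whatever survival analysis the paper used. The function names `simulate_trial` and `summarise` are hypothetical.

```python
import math
import random


def simulate_trial(rng, n_per_arm=300, durable_frac=0.05, hr_modest=1.0,
                   hr_durable=0.1, median_control=12.0, follow_up=36.0):
    """Simulate one two-arm trial and return (hr_estimate, z_statistic).

    All default values are illustrative assumptions, not figures from the paper.
    """
    base_hazard = math.log(2) / median_control  # exponential control arm

    def run_arm(draw_hazard):
        # Count deaths and total time at risk, censoring at follow_up.
        events, exposure = 0, 0.0
        for _ in range(n_per_arm):
            t = rng.expovariate(draw_hazard())
            if t <= follow_up:
                events += 1
                exposure += t
            else:
                exposure += follow_up
        return events, exposure

    def treated_hazard():
        # Each treated patient is a durable responder with probability
        # durable_frac; otherwise they get the modest-response hazard.
        is_durable = rng.random() < durable_frac
        return base_hazard * (hr_durable if is_durable else hr_modest)

    d0, t0 = run_arm(lambda: base_hazard)
    d1, t1 = run_arm(treated_hazard)

    # Event-rate ratio as a simplified HR estimate (assumes both arms have
    # at least one event, which holds comfortably at these defaults).
    log_hr = math.log((d1 / t1) / (d0 / t0))
    se = math.sqrt(1 / d1 + 1 / d0)
    return math.exp(log_hr), log_hr / se


def summarise(n_trials=2000, seed=1, **trial_kwargs):
    """Replicate the trial and summarise power, HR estimates and harm signals."""
    rng = random.Random(seed)
    results = [simulate_trial(rng, **trial_kwargs) for _ in range(n_trials)]
    hrs = [hr for hr, _ in results]
    # Significant benefit: z below -1.96 (two-sided 5% level, favouring treatment).
    sig = [hr for hr, z in results if z < -1.96]
    return {
        "power": len(sig) / n_trials,
        "mean_hr_all": sum(hrs) / n_trials,
        "mean_hr_significant": sum(sig) / len(sig) if sig else float("nan"),
        "prob_hr_above_1": sum(hr > 1.0 for hr in hrs) / n_trials,
    }


if __name__ == "__main__":
    for key, value in summarise(durable_frac=0.05, hr_modest=1.0).items():
        print(f"{key}: {value:.3f}")
```

Varying `durable_frac` from 0.05 to 0.20 and `hr_modest` from 0.8 to 1.0 mirrors the scenario grid described above. Under these assumptions, the average HR among statistically significant replications is lower than the average across all replications, which is the selection effect the abstract describes.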