As we aspire to implement evidence-based medicine in the daily practice of orthopaedics, it is critical to understand the level of evidence that guides routine clinical decisions. Randomized controlled trials (RCTs) offer methodology for delivering the highest level of evidence1. Whereas the first RCTs were conducted half a century ago to test the efficacy of new pharmacologic agents, such trials are now being used increasingly to evaluate therapies other than medications, including surgery and behavioral interventions. Surgical RCTs involve additional complexity with regard to several key methodological components of the RCT: the willingness of patients to enroll, implementation of blinding (masking), and the impact of crossovers from one trial arm to another. The American Academy of Orthopaedic Surgeons (AAOS) and the Orthopaedic Research Society (ORS) sponsored a symposium on clinical trials in orthopaedic research in May 2009 at which each of these challenges was discussed2,3. Discussions during the symposium focused on offering suggestions regarding how to address these methodological issues during the design and implementation phases of a study. This paper summarizes key discussion points highlighted during the AAOS-ORS Symposium and comments on related issues, including compliance with nonsurgical regimens in studies comparing surgery with nonoperative treatment and the appropriate use of subgroup analysis. The paper concludes with suggestions regarding cost-effectiveness analysis in orthopaedic research.
Planning Trials and Performing Sample Size Calculations
For an RCT to make a difference in the clinical management of musculoskeletal disorders, it must be planned carefully. In addition to detailed specification of the treatment regimen for each arm in the trial, it is critical to perform a careful analysis of the sample size required to conduct a trial that is conclusive (positive or negative).
A study with an insufficient sample size may fail to detect a statistically significant difference between the arms of the study even when the magnitude of the difference is clinically meaningful. This situation would result in a need for additional evidence in favor of the treatment being examined. Sample size estimation should be based on the results of pertinent preliminary or pilot studies. If the data used as a basis for the sample size calculation are only remotely relevant to the condition or sample studied in the trial, the resulting sample size estimates may be incorrect, potentially leading to inconclusive trial results. To be useful for sample size estimation, preliminary data should contain measures of central tendency (mean or proportion) and measures of variability (standard deviation). For greater accuracy of the sample size estimation, the pilot data should come from populations similar to the trial population and from interventions similar to those being studied in the trial. The sample size estimates should take into account the minimum clinically important difference (MCID) to ensure that the documented improvement or worsening is meaningful to patients4,5. The sample size calculations should also be consistent with the main hypothesis underlying the trial design, such as a hypothesis of superiority or a hypothesis of noninferiority. In addition to collecting preliminary data pertinent to measuring trial outcomes, it is important to estimate trial acceptance among potential study participants and among referring clinicians or health-care professionals. Such estimates may be obtained using published data from CONSORT (Consolidated Standards of Reporting Trials) diagrams from similar trials or using prospective preference assessments6,7. 
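As a rough illustration of how the ingredients above (a measure of variability, the MCID, and the expected acceptance rate) combine, the following sketch computes a per-arm sample size for a two-arm superiority comparison of means and the number of eligible patients who would need to be approached. It is a minimal sketch under stated assumptions, not the method of any particular trial: it uses the standard normal-approximation formula with equal allocation and a common standard deviation, and the function names and example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_arm(mcid: float, sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Participants per arm for a two-sided comparison of two means.

    Assumptions (illustrative only): normal approximation, equal allocation,
    common standard deviation `sd`, and a target difference equal to the MCID.
    """
    if mcid <= 0 or sd <= 0:
        raise ValueError("mcid and sd must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided type-I error
    z_power = NormalDist().inv_cdf(power)          # 1 - type-II error
    return math.ceil(2 * ((z_alpha + z_power) * sd / mcid) ** 2)


def patients_to_approach(total_needed: int, acceptance_rate: float) -> int:
    """Eligible patients to approach, given the expected share who agree to enroll."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    return math.ceil(total_needed / acceptance_rate)


# Hypothetical pilot data: SD of 20 points on an outcome scale, MCID of 10 points.
print(n_per_arm(mcid=10, sd=20))               # 63 per arm at 80% power
print(n_per_arm(mcid=10, sd=20, power=0.90))   # 85 per arm at 90% power
print(patients_to_approach(2 * 63, 0.5))       # 252 if half of eligible patients accept
```

A trial statistician would normally refine such a figure (for example, with a t-distribution correction, an allowance for dropout, or a noninferiority margin), so the output is best read as a starting point for planning.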
In addition to willingness to participate among the population of interest, the clinic capacity (defined as the number of patients at an individual clinic at any given time who are potentially eligible to be enrolled in a trial) serves as an important factor in determining the number of clinics or other providers that should be included. A study site that sees fewer potentially eligible patients will require a longer recruitment phase than a study site that sees a higher volume of eligible patients. Inclusion of additional centers would increase the logistical complexity of a study but it would likely reduce the duration of the recruitment stage of the trial.
Although a power of 80% has traditionally been considered sufficient in sample size determinations, it can lead to a 20% chance of inconclusive trial results when the hypothesized effect is present; that is, if the intervention is truly efficacious, a trial with 80% power will demonstrate efficacy only 80% of the time8–10. Establishing trial infrastructure may be one of the most expensive components of a study. Using a sample size that ensures 90% power would reduce the chances of inconclusive trial results by one-half (from 20% to 10%). When determining the basis for the sample size estimation, it is important to ensure that the magnitude of the effect that the study plans to detect is clinically meaningful. The effect is said to be clinically meaningful if it justifies a change in patient management11.
Enrollment into Orthopaedic RCTs
RCTs often involve experiments on people performed in the presence of uncertainty about the best clinical strategy for management of a specific condition, as the absence of uncertainty would compromise the ethical foundation underlying such clinical research. Such uncertainty should be acknowledged and accepted by the clinicians who are recruiting patients into the trial12.
That is, the clinicians must be in equipoise: comfortable with either treatment strategy to the extent that they are willing to enroll and randomize patients13–15. Enrollment in the trial should be accompanied by a detailed and clear informed consent document describing the details of the treatment in each arm and the possibility for planned crossover. (If possible, the timing of any planned crossovers should be after assessment of the primary outcomes.) If crossovers are a part of the trial, the consent form should indicate the timing, the financial responsibility of study participants, and the process for care delivery.
Randomization
Randomization in clinical trials facilitates the minimization of selection bias. Selection bias is related to the differential selection of subjects depending on treatment assignment16–19. If selection bias is not controlled for, one cannot distinguish between post-treatment differences in outcomes that are caused by baseline differences between the groups and those that are caused by the treatment. Rigorous randomization schema and allocation concealment are required to ensure the validity of the trial20. Allocation concealment refers to methods ensuring that assignment to a specific treatment arm cannot be either guessed or altered. To maximize the success of the randomization, the randomization scheme may include block randomization with a variable block size and/or stratification by factors that are known to be prognostic of the outcome as well as by clinics or sites (in the case of a multicenter trial)21–24. If concerns exist regarding the impact of surgeon proficiency on the outcome of the surgical treatment, it may also be useful to perform stratification by surgeon.
Blinding in Orthopaedic RCTs
A double blinding approach, in which both patients and providers are blinded to the treatment assignment, is considered to be the standard of care in pharmacologic RCTs25–27. However, ensuring blinding in orthopaedic trials is challenging28.
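The block randomization with a variable block size and stratification by site described in the Randomization discussion above can be sketched as follows. This is an illustrative sketch only: the arm labels, block multiples, seed, and function names are hypothetical, and in a real trial the schedule would be generated and held by someone independent of recruitment so that allocation stays concealed.

```python
import random
from typing import Dict, List, Sequence


def blocked_sequence(n: int, arms: Sequence[str],
                     block_multiples: Sequence[int],
                     rng: random.Random) -> List[str]:
    """Return n assignments built from permuted blocks of randomly varying size.

    Each block contains every arm `multiple` times, so arms are balanced at the
    end of every complete block; varying the block size makes the next
    assignment harder to guess.
    """
    sequence: List[str] = []
    while len(sequence) < n:
        multiple = rng.choice(list(block_multiples))
        block = list(arms) * multiple
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n]  # the final block may be cut short


def stratified_schedule(strata: Sequence[str], n_per_stratum: int,
                        arms: Sequence[str] = ("surgery", "nonoperative"),
                        block_multiples: Sequence[int] = (1, 2, 3),
                        seed: int = 2009) -> Dict[str, List[str]]:
    """One independent blocked sequence per stratum (for example, per site or surgeon)."""
    rng = random.Random(seed)  # fixed seed so the schedule can be reproduced and audited
    return {s: blocked_sequence(n_per_stratum, arms, block_multiples, rng)
            for s in strata}


schedule = stratified_schedule(["site A", "site B"], n_per_stratum=20)
for site, assignments in schedule.items():
    print(site, assignments.count("surgery"), assignments.count("nonoperative"))
```

With two arms and block multiples of 1 to 3, the running imbalance between arms within any stratum never exceeds 3, which is the practical benefit of blocking in small or moderate-size strata.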
When the treatment assignment involves a surgical approach or technique, the surgeon cannot be blinded. If the physician performing the surgery cannot be blinded to treatment assignment, an effort is often made to ensure that the outcomes of the study are ascertained by study personnel who are not involved in clinical care (i.e., were not present in the operating room at the time of surgery) and who are blinded to the treatment assignment. When the primary outcome of the study is patient-reported, ensuring that the patient is blinded to the treatment assignment is critical. Although achievement of blinding of study participants is relatively easy in studies comparing different surgical approaches, such blinding may be too difficult to achieve in studies comparing surgery and nonoperative approaches. Trial protocols often include ascertainment of the patients’ impressions of their actual treatment assignment in order to evaluate whether they have remained blinded at the time of assessment29,30. Low agreement between patients’ guesses and the true treatment allocation is indicative of successful blinding mechanisms.
Impact of Unplanned Crossover Between Treatment Arms on RCT Conclusions and Generalizability
Since the validity of the RCT’s conclusions depends greatly on the completeness of the follow-up, it is critical to ensure that subjects enrolling in the study are committed to the full course of the trial, regardless of their treatment assignment. If the RCT is blinded, it is critical to ensure that the protocol is designed in a way that minimizes the possibility of “unblinding,” which would likely lead to differential dropout rates in some trials. For example, if a study subject finds out that he or she is in the placebo arm, the person may have less incentive to continue to participate in the trial. A patient’s willingness to participate in an RCT is often affected by whether he or she has a strong preference for a particular treatment strategy.
Patients who strongly prefer one arm or the other are less likely to accept random assignment than patients who are indifferent. The researcher should attempt to ascertain the basis of such preferences, especially in the typical RCT setting in which the scientific evidence does not identify either arm as superior. Standardized scripts used in enrollment conversations may help to identify the strength of the subject’s preferences and the likelihood that the subject will commit to trial completion. If study personnel sense that a potential subject is not committed to completing the trial, it is better not to enroll the subject. Factors influencing the validity and generalizability of RCT findings include nonadherence to the study protocol, unplanned crossover from one arm to another, and dropout31. Adherence to the protocol is critical to ensure accurate attribution of the outcome to the corresponding treatment. In surgical arms, it is critical that surgeons perform the surgery in a timely fashion according to the prespecified protocol. In nonsurgical arms, adherence to the protocol may be harder to measure, but detailed diaries completed by study participants may help to assess the degree of compliance with regimens such as exercise. Suboptimal adherence to the treatment regimen may reduce the estimates of treatment efficacy in the primary intent-to-treat (ITT) analysis. ITT analysis assumes that subjects are analyzed according to the treatment arm to which they had been randomized, regardless of the actual treatment they received. Unplanned crossover from one treatment arm to another may lead to trial results that are inconclusive32,33. The ITT analysis is the most unbiased evaluation, as it fully utilizes the benefits of randomized assignment. However, high crossover rates may render the ITT analysis uninterpretable. Differential crossover or dropout rates create a number of complex analytic issues. 
It is especially important that a statistician experienced in these issues conduct the analysis of such trial data. Conducting an ITT analysis with the commonly used method of “last observation carried forward” is subject to substantial bias, and more sophisticated methods may be required34. In general, including a statistician in the core team overseeing the design, conduct, and data analysis of the RCT is recommended to ensure efficient design, appropriate analysis, and generalizability of the results35. In several multicenter trials in orthopaedics, such as SPORT (the Spine Patient Outcomes Research Trial), high rates of unplanned crossover impacted the interpretability of the trial results36,37. To reduce the rates of unplanned crossover, it is critical to ensure timely delivery of the assigned treatment (e.g., surgery performed within a short period of time after enrollment in the trial and randomization), continuous monitoring of the rate of unplanned crossover, and regular discussions in which study investigators focus on approaches to reducing unplanned crossover from one arm to another. Missing data due to patient dropout pose another threat to the interpretability and generalizability of RCT results, particularly when the dropouts occur differentially across treatment arms. To minimize the rate of dropouts, the design phase of the study may include discussions among the investigators regarding the minimum follow-up duration that would be clinically meaningful without straining both study participants and personnel. Ways to minimize study dropouts include offering study participants the opportunity to complete visits either in person or by telephone, accommodating their schedules, and offering them reimbursement for travel and/or parking and modest incentive stipends for participation. 
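The continuous monitoring of unplanned crossover recommended above, together with the ITT principle of analyzing subjects by randomized arm, can be illustrated with a small sketch. The record layout, function names, and data below are hypothetical and invented purely for illustration; the sketch ignores missing outcomes, which in practice call for the more sophisticated methods mentioned earlier.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Participant:
    randomized_arm: str  # arm assigned by randomization
    received_arm: str    # treatment actually received
    outcome: float       # primary outcome score (hypothetical scale)


def crossover_rates(participants: List[Participant]) -> Dict[str, float]:
    """Share of each randomized arm that received a different treatment."""
    rates: Dict[str, float] = {}
    for arm in sorted({p.randomized_arm for p in participants}):
        group = [p for p in participants if p.randomized_arm == arm]
        crossed = sum(p.received_arm != arm for p in group)
        rates[arm] = crossed / len(group)
    return rates


def itt_means(participants: List[Participant]) -> Dict[str, float]:
    """Mean outcome by randomized arm, regardless of the treatment received."""
    arms = sorted({p.randomized_arm for p in participants})
    return {arm: mean(p.outcome for p in participants if p.randomized_arm == arm)
            for arm in arms}


# Made-up records: one of four surgical patients and two of four nonoperative
# patients crossed over to the other treatment.
example = [
    Participant("surgery", "surgery", 30.0),
    Participant("surgery", "surgery", 28.0),
    Participant("surgery", "surgery", 26.0),
    Participant("surgery", "nonoperative", 10.0),
    Participant("nonoperative", "nonoperative", 12.0),
    Participant("nonoperative", "nonoperative", 14.0),
    Participant("nonoperative", "surgery", 25.0),
    Participant("nonoperative", "surgery", 27.0),
]

print(crossover_rates(example))  # {'nonoperative': 0.5, 'surgery': 0.25}
print(itt_means(example))        # {'nonoperative': 19.5, 'surgery': 23.5}
```

In these invented data the crossovers pull the two ITT arm means toward each other, which is the dilution that makes high crossover rates a threat to interpretability.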
Analyzing and Reporting RCT Data
Analysis and reporting of results from an RCT should be conducted according to the guidelines proposed by the CONSORT group or the CLEAR NPT (Checklist to evaluate a report of a nonpharmacological trial) group38–48. The ITT analysis should be presented as the primary analysis49. In this analysis, all study participants are analyzed according to the treatment assignment resulting from the randomization, thus ensuring balance between treatment groups across a wide spectrum of measured and unmeasured factors50,51. Even if subjects do not receive the treatment to which they were randomly assigned, they are nonetheless allocated to that arm in the ITT analysis. When rates of noncompliance or unplanned crossover are high, use of the ITT principle may lead to a biased estimate of the causal effect, as noncompliance or unplanned crossover may affect outcomes. The ITT approach will likely lead to more conservative estimates of the treatment effect.
In addition to reporting the ITT analysis, investigators may choose to report the results of the as-treated (AT) analysis, in which subjects are analyzed according to the treatment they actually received. This approach turns the RCT into an observational study, as it disrupts the balance in observed and unobserved potential confounders that is achieved by successful randomization. Although it is possible to control for some measured confounders with use of data collected at baseline, control for unmeasured factors in an AT analysis is challenging. If the rates of dropout and unplanned crossover between the treatment arms are small, the results of the ITT and AT analyses will be similar in terms of the magnitude of the effect. The paper reporting the results of the trial must report the ITT results for the prespecified primary outcome measure as the primary analysis.
This primary outcome measure should be exactly the same as the measure specified as the primary outcome during the registration of the trial (e.g., on ClinicalTrials.gov). If the trial results show an effect that is substantially smaller than the effect that the trial was designed to detect, the trial is regarded as negative. If the results of the ITT analysis indicate that the magnitude of the effect is similar to or greater than the effect that the trial was designed to detect, but significance was not achieved, the trial is regarded as inconclusive. (Note that even if the trial has 90% power to detect a particular difference, there is still a 10% likelihood of a false-negative result.) It is critical to report both negative and inconclusive trials as they comprise important components of the evidence for or against a certain treatment. These trial reports are also invaluable when conducting meta-analyses.
All secondary outcomes and secondary analyses should be explicitly stated to be secondary in nature. Any analysis of the data at interim time points needs to be prespecified in the protocol. Unplanned interim analyses can be problematic for many reasons, ranging from the effect on the type-I error rate of the study to reducing accrual rates at the participating centers if recruiting physicians become discouraged by lackluster interim results. Preplanning interim analyses and describing how the interim results will be used can help to avoid such problems. If an interim analysis is planned, rules for stopping the trial on the basis of efficacy, futility, or toxicity should be considered and incorporated into the protocol before the trial opens. Approval of studies without such formal stopping rules is often held up by scientific review committees and institutional review boards.
Subgroup Analysis
As recommended by the CONSORT guidelines, only preplanned subgroup analyses should be undertaken during the analysis stage of the trial.
The essence of subgroup analysis lies in testing the hypothesis that treatment works differently in certain subgroups52–54. A proper subgroup analysis requires a larger sample size since the inference is based on the statistical test for interaction55,56. Showing that the treatment “works” in a particular subgroup (i.e., the effect of the treatment reaches significance in that population subgroup) is not sufficient to support a claim of a differential effect across subgroups. All preplanned subgroup analyses should be stated in the original registration of the RCT (e.g., on ClinicalTrials.gov).
Generalizability of RCT Findings
The generalizability or external validity of the RCT’s findings will be influenced by the inclusion and exclusion criteria as well as by the observed characteristics of the patient sample enrolled in the trial. Reporting detailed inclusion and exclusion criteria in the paper summarizing the trial results as well as in the trial registration materials aids assessment of the applicability of trial results to specific populations and conditions. Self-selection (wherein individuals volunteer to participate in trials) may lead to substantial selection bias and should be assessed by comparing the subjects enrolled in the trial with the subjects declining enrollment. It is also critical to assess the willingness of participating clinicians to offer the trial to their patients with the condition. Therefore, descriptions of the “funnel” of patient flow (screened→eligible→offered→enrolled) help readers to gauge the generalizability of the trial results. Defining study samples requires addressing the inherent tradeoff between internal and external validity. A narrowly defined, homogeneous sample will maximize the internal validity of the study but at the cost of reducing generalizability (external validity).
Although it is generally advisable to maintain internal validity to the extent that is possible, there is no “correct” textbook answer to this quandary. Investigators will need to weigh these considerations when defining their inclusion and exclusion criteria.
The potential efficacy of a treatment is often established by means of an RCT conducted in an academic center of excellence to answer the question of whether the treatment can work in ideal settings. Further effectiveness studies, conducted in community settings, should then be designed to address the question of treatment effectiveness in “real life” settings. A multicenter trial may be conducted to improve the generalizability of the trial results. The goal of a multicenter RCT is to overcome the limited boundaries of a single treatment center in administration of the treatment. To be successful, multicenter trials should be based on standardized protocols for screening, enrollment, and follow-up. A strong management team and regular site monitoring may facilitate successful completion of a multicenter RCT. In the setting of a multicenter RCT, the use of a single data coordinating center is helpful to ensure a standardized approach to data collection, monitoring, and reporting. It is often advisable, especially in trials of small or moderate sizes, to stratify randomization by center to ensure balance between the arms at each individual center.
Economic Evaluations Alongside RCTs
New treatment strategies, both surgical and nonsurgical57–59, may be shown in RCTs to be more efficacious than standard therapy but also costlier60,61. In such cases, it is important to address the question of whether the new treatment strategy provides good value for the additional money spent. To address this question, a formal cost-effectiveness or cost-utility analysis should be conducted alongside the RCT60,62,63.
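One common way to summarize whether a more efficacious but costlier strategy offers good value is the incremental cost-effectiveness ratio (ICER): the additional cost divided by the additional effectiveness. The sketch below is illustrative only. The function name and all numbers are hypothetical, and the effectiveness unit is left generic (in a cost-utility analysis it would typically be quality-adjusted life years).

```python
def icer(cost_new: float, cost_standard: float,
         effect_new: float, effect_standard: float) -> float:
    """Incremental cost per additional unit of effectiveness.

    Only the situation described in the text is handled here: the new strategy
    is more effective than the standard one. Other situations (equal or lower
    effectiveness) need separate interpretation and raise an error.
    """
    delta_cost = cost_new - cost_standard
    delta_effect = effect_new - effect_standard
    if delta_effect <= 0:
        raise ValueError("new strategy must be more effective for this ratio to apply")
    return delta_cost / delta_effect


# Hypothetical per-patient means over the trial period.
print(icer(cost_new=12000, cost_standard=9000,
           effect_new=1.75, effect_standard=1.50))  # 12000.0 per unit of effect
```

Whether a given ratio represents good value depends on a willingness-to-pay threshold chosen by the decision maker, which is outside the scope of this sketch.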
Many surgical treatment approaches in the management of musculoskeletal conditions are focused on improvement in health-related quality of life. Therefore, it is extremely valuable to include an estimate of quality of life in surgical RCTs. In economic evaluations, quality of life is often measured with use of a utility, which is a type of measure that can range from 0, corresponding to death, to 1, corresponding to perfect health64,65. Utilities can either be estimated directly, using the standard gamble method or the time-trade-off method, or derived indirectly from well-accepted outcome instruments such as the EuroQol, SF-36 (Short Form-36), SF-12, or WOMAC (Western Ontario and McMaster Universities Osteoarthritis Index)66–73. In addition to collecting detailed data on quality of life during the follow-up period, it is critical to ensure careful and detailed data collection on health care utilization. At enrollment, it is helpful to assess health care utilization during the three months directly preceding enrollment; such baseline utilization should include the services of various types of health care providers, medication use, and use of laboratory examinations and procedures. Such an assessment of resource utilization should be conducted several times during the follow-up period to ascertain changes in resource utilization due to the treatment tested within the RCT. Careful collection of both health-related quality of life data (utilities) and resource utilization (which would be converted to costs) within the timeline of the trial enables researchers to estimate the cost-effectiveness of the treatment over the duration of the trial. In estimating costs related to health care, it is important to account for both direct and indirect medical costs. Direct medical costs refer to costs of pharmacologic and nonpharmacologic regimens, ambulatory visits, and hospital stays. 
Indirect costs capture productivity loss due to receipt of medical care and additional child care or transportation expenses related to receiving the necessary care.
Cost-effectiveness analyses conducted over a short period of time are often insufficient, as some of the negative as well as positive consequences of treatment may not be realized until years after the formal RCT follow-up period ends74,75. For instance, events such as implant failure resulting in additional surgical procedures typically occur long after completion of the trial. This creates a rationale for extending the cost-effectiveness analysis over a longer time frame, preferably the remaining life span of the patient population. To conduct such an analysis, decision analysis models are often built, augmenting the data obtained from the trial per se with sources of information not provided by the trial-based data. Conducting an economic analysis alongside the RCT is not feasible in every case. Enhancing the RCT with an economic analysis is ideal, but if this cannot be done, the RCT still yields valuable information about treatment effectiveness. If the study team plans to undertake an economic evaluation, it should be done in consultation with a health economist or a researcher with a proven track record in undertaking cost-effectiveness analyses.
Summary
RCTs serve as the best means of delivering evidence in support of or against specific treatment strategies. To be valid, such trials require careful planning, rigorous attention to the study protocol during implementation, and thoughtful data analysis and reporting. Dedicated multidisciplinary teams consisting of clinicians, epidemiologists, biostatisticians, and data coordinating center personnel can work to ensure successful implementation of RCTs and fast dissemination of the RCT results into clinical practice.