Abstract

A recent article in this journal by Tabak (2006) highlighted a potentially serious source of bias that can arise when multiple statistical tests of a damages theory are performed and even one test rejecting the null hypothesis is regarded as supporting the damages theory. Repeated testing will eventually produce a “false discovery,” that is, rejection of the null hypothesis in favor of the alternative hypothesis when the null hypothesis is true, which statisticians refer to as type I error. Consequently, performing multiple tests without adjusting the critical value can be problematic because it can lead to improperly accepting statistical evidence that apparently supports rejection of the null hypothesis as reliable when it is not. Tabak (2006) recommends making the Šidák (1968, 1971) multiple-comparison adjustment to the standard statistical t-test to correct for the false-discovery bias inherent in multiple-comparison testing. In particular, he recommends making this adjustment when performing 10b-5 securities fraud event studies when more than one corrective disclosure date is involved.

This article clarifies the circumstances in which a multiple-comparison adjustment is appropriate and explains why the correction is normally not needed in securities fraud event-study testing. More generally, I explain why it is not required when each of several tests is performed and its results are reported separately, as, for example, where the objective is simply to test the statistical significance of the abnormal stock return on each day on which a new and distinct curative disclosure occurred. I show that the Šidák multiple-comparison adjustment is nearly as stringent as the classical Bonferroni procedure (Simes, 1986), which can increase the risk of type II error. This article discusses a more powerful alternative to the Šidák adjustment due to Benjamini and Hochberg (1995), which directly corrects for the false-discovery bias in multiple-comparison testing and reduces the risk of type II error.

Multiple-comparison false-discovery bias can arise when (a) several statistical tests are performed on subsets of the same larger data set in an effort to identify a significant relationship and (b) a favorable finding in any one of the multiple tests would be taken as support for the existence of a significant relationship (Tabak, 2006). The repeated testing increases the likelihood of making a false discovery. However, this situation needs to be carefully distinguished from one in which a series of single tests is performed, each on a separate information set. Tabak (2006, page 232) provides an instructive example concerning testing data from several business units for any evidence of bias in hiring or promotion. Extending his example, suppose disparaging letters are found in more than one business unit. It would be appropriate to test the null hypothesis of zero bias in each business unit separately. In contrast, suppose bias is suspected somewhere in the firm but no incriminating evidence exists. The expert could investigate the presence of any discrimination on a firm-wide basis by testing the joint null hypothesis of zero bias for all the firm's business units examined. A multiple-comparison adjustment should be made in the second case but not the first.

Similarly, a multiple-comparison adjustment would be appropriate in a securities fraud case if the expert is testing the joint null hypothesis that all of the abnormal returns are zero (De Veaux, Velleman, and Bock, 2008).
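To make the distinction concrete, the sketch below (in Python) contrasts the two situations on hypothetical z-statistics: each business unit tested separately as its own family, versus a single joint test that all effects are zero. The data, and the use of a chi-square statistic for the joint test, are illustrative assumptions rather than part of Tabak's example.

```python
# Sketch: separate tests vs. a joint test (illustrative data and test choice).
from scipy import stats

z_stats = [2.10, 0.40, -1.30, 1.95]  # hypothetical z-statistics, one per business unit
alpha = 0.05

# Case 1: each unit is its own family; test each null hypothesis separately.
for i, z in enumerate(z_stats, start=1):
    p = 2 * stats.norm.sf(abs(z))  # two-tailed p-value
    verdict = "reject" if p <= alpha else "do not reject"
    print(f"Unit {i}: z = {z:+.2f}, p = {p:.4f}, {verdict} at alpha = {alpha}")

# Case 2: one family; test the joint null that all effects are zero.
# Under the joint null (independent tests), the sum of squared z-scores
# is chi-square distributed with len(z_stats) degrees of freedom.
chi2_stat = sum(z * z for z in z_stats)
p_joint = stats.chi2.sf(chi2_stat, df=len(z_stats))
print(f"Joint test: chi2 = {chi2_stat:.2f}, p = {p_joint:.4f}")
```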
A securities fraud case typically involves multiple corrective disclosures as the fraud is revealed through a series of information releases by the firm or regulators announcing enforcement actions. The content of the information the firm (or someone else) releases into the market identifies each curative disclosure date.1 Duplicative disclosures are excluded, so the news released on each date should be distinct from the news released on all previous dates; these distinct disclosures are analogous to the disparaging letters in Tabak's (2006) example. A separate event study is performed for each disclosure date. We test the null hypothesis that the abnormal stock return on that date is zero based on a t-test and the standard level of statistical significance for a single test.2 A lack of statistical significance is usually interpreted as signifying that the stock market did not regard the disclosure concerning the alleged fraud as economically meaningful. The null hypotheses for the different disclosure dates are tested singly, not jointly, because each disclosure date is its own family.3 Thus, the Šidák adjustment Tabak (2006) recommends is usually not appropriate for securities fraud event studies.4

When multiple tests are performed, the likelihood of type I error increases because the probability of one or more rejections increases with the number of tests. For example, suppose 20 independent event-study tests are performed and the pre-specified significance level is α = 0.05. If the null hypothesis is true, then it would be expected that one out of the 20 test results (that is, 5%) would indicate rejection of the null hypothesis simply by chance. Multiple-comparison adjustments control the probability of a type I error when multiple tests are performed on the same data set. By reducing the critical value appropriately, one can reduce the likelihood of one or more rejections to α = 0.05 when the null hypothesis is true. Any such adjustment necessarily reduces the likelihood of type I error in any individual test to less than 5%. However, it also weakens the power of the testing procedure and increases the probability of type II error in any particular test because its critical value is lowered by the same amount (Benjamini and Hochberg, 1995). If the multiple-comparison correction is too large, then the overcorrection will impair the power of the test. The event-study testing will be biased against rejecting the null hypothesis of zero abnormal return when that hypothesis is false, that is, biased against finding a statistically significant abnormal stock price reaction to the corrective disclosure.

The classic multiple-comparison adjustment to the standard statistical significance tests is attributed to Bonferroni, who suggested the adjustment in two papers in 1935 and 1936 (De Veaux, Velleman, and Bock, 2008).5 The Bonferroni procedure makes a simple adjustment to the standard significance tests. Suppose N independent tests will be performed, the significance level selected for the tests is α, and the p-values are ρ1, ρ2, ..., ρN. The Bonferroni procedure involves adjusting the p-value for each individual test to the lesser of one and

    Nρi,  i = 1, 2, ..., N.  (1)

Equivalently, the procedure is implemented by adjusting the significance level for each individual test to α/N and then fixing the critical value to correspond to this adjusted significance level. This adjustment apportions the probability of type I error α equally among the N critical regions.
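As a concrete illustration, the following minimal sketch applies equation (1) and the equivalent α/N rule; the p-values are hypothetical, not drawn from the article.

```python
# Sketch: Bonferroni adjustment for N independent tests (hypothetical p-values).
alpha = 0.05
p_values = [0.004, 0.020, 0.300]  # illustrative values only
N = len(p_values)

# Equation (1): adjusted p-value is the lesser of one and N * p_i.
adjusted = [min(1.0, N * p) for p in p_values]

# Equivalent implementation: compare each raw p-value to alpha / N.
per_test_level = alpha / N
for p, p_adj in zip(p_values, adjusted):
    verdict = "reject" if p <= per_test_level else "do not reject"
    print(f"p = {p:.3f} -> adjusted {p_adj:.3f}; "
          f"{verdict} at alpha/N = {per_test_level:.4f}")
```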
Thus, for example, if the significance level is α = 0.05 based on a one-tailed test and N = 20 tests, then α/N = 0.0025. The critical z-value is −2.8070 for each of the 20 tests versus a −1.645 critical z-value for a single independent test. The probability of type I error before the Bonferroni adjustment is

    1 − (1 − α)^N = 1 − (0.95)^20 ≈ 0.6415,  (2)

and this is unacceptably high for most forensic investigations. There is nearly a two-thirds chance of a false discovery. The probability of type I error after the adjustment is

    1 − (1 − α/N)^N = 1 − (0.9975)^20 ≈ 0.0488.  (3)

The Bonferroni adjustment is very restrictive because it distributes the error rate equally across all the confidence intervals (De Veaux, Velleman, and Bock, 2008, page 732). For example, suppose there are 20 independent trials of a hypothesis and that each test result is significant at the 5% level. Common sense and sound scientific practice would interpret the results of these 20 trials as evidence against the null hypothesis. However, the Bonferroni adjustment would render all of this statistical evidence irrelevant because it would convert all 20 p-values into a value of 1.0. A p-value of 1.0 would occur, for instance, if one were testing a coin for fairness by flipping it 100 times and getting exactly 50 heads and 50 tails in the 100 trials. In effect, 20 repeated trials are performed, each apparently significant, yet the Bonferroni adjustment produces a set of adjusted results that are the same as 20 trials in which the data came out exactly as the null hypothesis would predict. Practically speaking, such an adjustment makes no sense.

Tabak (2006) recommends the Šidák adjustment, which inverts the probability expression in equation (2). It sets the significance level for each individual test in the preceding example to

    αs = 1 − (1 − α)^(1/N) = 1 − (0.95)^(1/20) ≈ 0.00256  (4)

when α = 0.05 and N = 20. As Tabak (2006) notes, this common significance level for each test is consistent with an overall 5% probability of type I error.6 For each individual test in the example, the significance level is 0.256% and the critical z-value is −2.7994 versus a −1.645 critical z-value for a single independent test. But as with the Bonferroni adjustment, there is a cost to imposing a sharply reduced significance level for each individual test: the power of the test decreases.

The Šidák adjustment is very similar to the Bonferroni adjustment for N < 10 and small α (Simes, 1986). Both apply the same significance level to each individual test, and the two adjustments closely approximate each other.7 In the preceding example, α/N = 0.00250 and αs = 0.00256. Consequently, the loss of statistical power is similar for the Šidák adjustment and the Bonferroni adjustment.8

Benjamini and Hochberg (1995) have devised a more powerful test that explicitly controls for false-discovery bias. Suppose the testing involves N independent tests. Order the N p-values from lowest to highest,

    ρ1 ≤ ρ2 ≤ ... ≤ ρN,

with corresponding hypotheses H1, H2, ..., HN. Reject the k null hypotheses H1, H2, ..., Hk, where k is the largest value of i for which

    ρi ≤ (i/N)α.  (5)

The Benjamini-Hochberg procedure is implemented by adjusting the respective p-values to the lesser of one and

    (N/i)ρi.  (6)

Comparing equations (1) and (6) reveals the more restrictive nature of the Bonferroni adjustment (essentially i = 1 for every test), which is also responsible for its weaker power.
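A minimal sketch of this procedure in code, using hypothetical p-values; the final running-minimum step, which keeps the adjusted p-values (“q-values”) monotone, is a standard refinement not spelled out in equation (6).

```python
# Sketch: Benjamini-Hochberg procedure (hypothetical p-values).
alpha = 0.05
p_values = sorted([0.001, 0.008, 0.039, 0.041, 0.090, 0.600])  # rho_1 <= ... <= rho_N
N = len(p_values)

# Inequality (5): find the largest i with rho_i <= (i / N) * alpha,
# then reject the null hypotheses H_1, ..., H_k.
k = 0
for i, p in enumerate(p_values, start=1):
    if p <= (i / N) * alpha:
        k = i
print(f"Reject the {k} null hypotheses with the smallest p-values")

# Equation (6): q-values, the lesser of one and (N / i) * rho_i,
# made monotone by taking running minima from the largest p-value down.
q = [min(1.0, (N / i) * p) for i, p in enumerate(p_values, start=1)]
for i in range(N - 2, -1, -1):
    q[i] = min(q[i], q[i + 1])
print("q-values:", [round(v, 4) for v in q])
```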
The Benjamini-Hochberg test can be thought of as a step-down procedure in which one first tests whether ρN ≤ α. If so, all N null hypotheses are rejected. If not, ρN−1 is compared with (N − 1)α/N. If inequality (5) is satisfied, the first N − 1 null hypotheses are rejected. If it is not satisfied, the procedure moves to progressively smaller p-values until one satisfies the inequality. At the first step, this procedure runs the standard test. At the last step, if not terminated earlier, it compares ρ1 and α/N, as in the classic Bonferroni adjustment. In between, the Benjamini-Hochberg procedure applies more stringent testing (i.e., a lower threshold (i/N)α) to the successively stronger empirical test results (i.e., lower p-values). Benjamini and Hochberg (1995) demonstrate that their procedure has greater power than the Bonferroni adjustment. Given the similarity of the Šidák adjustment to the Bonferroni adjustment, the Benjamini-Hochberg adjustment is also more powerful than the Šidák adjustment.

When a multiple-comparison adjustment is appropriate, the particular test should be chosen with care because multiple-comparison adjustments tend to increase the likelihood of type II error (Dunnett and Goldsmith, 2005). The more restrictive the adjustment, the greater the loss of power. The following event-study example illustrates how this choice can affect the results of the testing.

Suppose a firm allegedly committed securities fraud by engaging in improper accounting. The firm made six separate news announcements disclosing its participation in an industry-wide scheme to pay contingent insurance commissions and rig insurance bids and confessing to its improper accounting for this activity. Suppose a forensic economist wishes to investigate the theory that the firm committed securities fraud and decides to examine the materiality of each news announcement based on a two-tailed test with significance level α = 0.10. An event-study test is performed for each of the six disclosure event days. The null hypothesis is that the expected abnormal stock return is zero. The probability of falsely rejecting the null hypothesis of zero abnormal return is 0.10.

Each of the announcements discloses different corrections to one or more of the firm's previously published financial statements that will require a restatement. First, based on a careful analysis of each announcement, it was determined that economic theory predicts that each of the six relevant daily disclosures should be regarded as economically significant by investors. Next, the disclosure on each date was tested for statistical significance to assess its materiality. Since the information disclosed by the firm on each date was found to be unique to that particular date, its significance should be evaluated separately from the investigations performed for the other five dates. Statistical significance at the 10% level is found for each date, as reported in Table 1.

Suppose instead the forensic economist is interested only in reporting the specific instances in which an announcement elicited a significant stock price reaction and adjusts for multiple comparisons. Table 1 compares the event-study t-test results with the Bonferroni, Šidák, and Benjamini-Hochberg adjustments. The Bonferroni or Šidák adjustments would lead to rejection of only three of the six null hypotheses; that is, the null hypothesis of zero abnormal return could be rejected at the 10% level based on two-tailed tests for only three of the six dates.9 In contrast, the Benjamini-Hochberg procedure would lead to the conclusion that the null hypothesis of zero abnormal return could still be rejected at the 10% level based on two-tailed tests for all six dates.
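Table 1 itself is not reproduced here, but the sketch below runs the same three-way comparison on hypothetical p-values chosen to display the qualitative pattern the example describes: Bonferroni and Šidák retain only the smallest p-values while Benjamini-Hochberg rejects all six nulls. The Šidák per-test level follows equation (4).

```python
# Sketch: Bonferroni vs. Sidak vs. Benjamini-Hochberg decisions at alpha = 0.10
# for six hypothetical two-tailed p-values (not the data behind Table 1).
alpha = 0.10
p = sorted([0.005, 0.011, 0.018, 0.040, 0.075, 0.090])
N = len(p)

bonf_level = alpha / N                    # per-test level, about 0.0167
sidak_level = 1 - (1 - alpha) ** (1 / N)  # per-test level, about 0.0174

# Benjamini-Hochberg: reject the k smallest p-values, where k is the
# largest i with p_i <= (i / N) * alpha.
k = max((i for i, pi in enumerate(p, start=1) if pi <= (i / N) * alpha), default=0)

print(f"{'p-value':>8} {'Bonferroni':>11} {'Sidak':>7} {'BH':>4}")
for i, pi in enumerate(p, start=1):
    print(f"{pi:8.3f} {'reject' if pi <= bonf_level else '-':>11} "
          f"{'reject' if pi <= sidak_level else '-':>7} "
          f"{'reject' if i <= k else '-':>4}")
```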
As illustrated in Table 1, the abnormal returns for the six dates are statistically significant at the 3.0%, 3.2%, 3.6%, 6.0%, 9.0%, and 9.0% levels, respectively, based on the Benjamini-Hochberg adjustment. This adjustment leads to q-values that exceed the p-values for all but one test result, but it does not reverse the conclusion regarding the statistical significance of the abnormal returns for any of the six dates.

Tabak (2006) is the first article to draw attention to the potential usefulness of the multiple-comparison adjustment in forensic studies. However, no adjustment for multiple comparisons is required in securities fraud event studies when each of a series of fraud-related news disclosure events is properly evaluated separately from all the other news disclosure events under consideration. In situations where a multiple-comparison adjustment is called for, the Benjamini-Hochberg adjustment, which explicitly controls for false-discovery bias, is more powerful than the Bonferroni or Šidák adjustments. Ultimately, the forensic economist must decide whether the restrictiveness of the Bonferroni or Šidák adjustments is worth the loss of power inherent in choosing these techniques.
