The Significant Decline in Null Hypothesis Significance Testing?
(Cross-posted at Prawfs.)
Prompted by Dan Kahan, I’ve been thinking a great deal about whether null hypothesis significance testing (NHST, marked by p values) is a misleading approach to many empirical problems. The basic argument against p-values (and in favor of robust descriptive statistics, including effect sizes and/or Bayesian data analysis) is fairly intuitive, and can be found here and here and here and here. In a working paper on situation sense, judging, and motivated cognition, Dan, I, and other co-authors explain a competing Bayesian approach:
In Bayesian hypothesis testing . . . the probability of obtaining the effect observed in the experiment is calculated for two or more competing hypotheses. The relative magnitude of those probabilities is the equivalent of a Bayesian “likelihood ratio.” For example, one might say that it would be 5—or 500 or 0.2 or 0.002, etc.—times as likely that one would observe the results generated by the experiment if one hypothesis is true than if a rival one is.
Under Bayes’ Theorem, the likelihood ratio is not the “probability” of a hypothesis being true but rather the factor by which one should update one’s prior assessment of the probability of the truth of a hypothesis or proposition. In an experimental setting, it can be treated as an index of the weight with which the evidence supports one hypothesis in relation to another.
Under Bayes’ Theorem, the strength of new evidence (the likelihood ratio) is, of course, analytically independent of one’s prior assessment of the probability of the hypothesis in question. Because neither the validity nor the weight of our study results depends on holding any particular prior about the [question of interest] we report only the indicated likelihood ratios and leave it to readers to adjust their own beliefs accordingly.
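The updating logic the quoted passage describes can be made concrete with a short sketch. The numbers below are hypothetical (they are not from the working paper): suppose an experiment yields 60 "successes" in 100 trials, and the two rival hypotheses posit success probabilities of 0.5 and 0.7.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical data and rival hypotheses (illustrative only).
k, n = 60, 100
p_h1, p_h2 = 0.5, 0.7

# Likelihood ratio: how many times more likely the observed data are
# under hypothesis 1 than under hypothesis 2.
likelihood_ratio = binom_pmf(k, n, p_h1) / binom_pmf(k, n, p_h2)

# Bayes' Theorem in odds form: posterior odds = prior odds * likelihood ratio.
# The study reports only the ratio; each reader supplies a prior.
prior_odds = 1.0  # e.g., a reader initially indifferent between the hypotheses
posterior_odds = prior_odds * likelihood_ratio
```

Note how the sketch mirrors the paper's point: the likelihood ratio is computed entirely from the data and the two hypotheses, while the prior enters only at the final multiplication, so readers with different priors can each update on the same reported ratio.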
To be frank, I’ve been resisting Dan’s arguments to abandon NHST. One obvious reason is fear: I understand the virtues and vices of significance testing well. It has provided me with a convenient heuristic for knowing when I’ve “finished” the experimental part of my research and am ready to write the over-promising introduction and under-delivering normative sections of the paper. Moreover, p-values are widely used by courts (as Jason Bent is exploring). Or to put it differently, I’m well aware that the least positive thing one can say about a legal argument is that it is novel. Who wants to jump first into deep(er) waters?
At this year’s CELS, I didn’t see a single paper without p-values. So even if NHST is in decline, the barbarians are far from the capital. But, given what’s happening in cognate disciplines, it might be time for law professors to get comfortable with a new way of evaluating empirical work.