Shy Tories & Other Known Unknowns
Over on 538, Nate Silver concludes his latest polling roundup with the following observation:
McCain’s chances, in essence, boil down to the polling being significantly wrong, for such reasons as a Bradley Effect or “Shy Tory” Effect, or extreme complacency among Democratic voters. Our model recognizes that the actual margins of error in polling are much larger than the purported ones, and that when polls are wrong, they are often wrong in the same direction.
The Shy Tory effect arises from systematic sampling error: nonrespondents to a survey differ from respondents in a way that matters for the quantity being measured. In tomorrow’s vote, such systematic bias in polling might arise from conservatives or African-Americans refusing to be polled.
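To see how little differential nonresponse it takes to wreck a poll, here is a quick sketch. The electorate split and the response rates are invented for illustration, not estimates of any real race:

```python
import random

random.seed(0)

# Hypothetical electorate: 52% support candidate A, 48% candidate B.
TRUE_SUPPORT_A = 0.52
voters = [random.random() < TRUE_SUPPORT_A for _ in range(100_000)]

# Invented response rates: B's supporters are the "shy" group and
# answer the pollster far less often than A's supporters do.
RESPONSE_RATE = {True: 0.5, False: 0.3}  # keyed by "supports A?"

sample = [v for v in voters if random.random() < RESPONSE_RATE[v]]
polled_support_a = sum(sample) / len(sample)

print(f"true support for A:   {TRUE_SUPPORT_A:.3f}")
print(f"polled support for A: {polled_support_a:.3f}")
```

With these made-up rates the poll reports roughly 64% support for A against a true 52% — and nothing inside the sample itself signals that anything is wrong, which is exactly the problem.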
What makes the Shy Tory bias difficult, unlike the cell-phone effect, is that there are few ways to excavate it before the egg hits your face. If you can’t observe a part of the population, you can’t do much to model its characteristics.
Legal empiricists face a variant of the Shy Tory problem in various areas. Take, for example, settlement practice. Since settlements are private, we have basically no idea about their content (apart from those few doctrinal nooks where settlements must be on the record). Lacking that information, we are left drawing very strong conclusions about the distributive consequences of litigation (e.g., plaintiff win rates) that are almost certainly wrong. (By contrast, we know a great deal about plea bargains, and the resulting scholarship offers quite a dark view of the criminal justice system.) The real problem for empiricists is that we’re unlikely ever to learn much about settlement practice in the aggregate, and thus will be unable to reality-check our models.
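The selection problem can be made concrete with a toy simulation. The settlement rule and every number below are invented for illustration (this is nobody’s model of actual practice): if weak cases disproportionately settle in private, the plaintiff win rate observed at trial badly overstates the merit of filed cases as a whole.

```python
import random

random.seed(1)

# Toy model: every filed case has a true probability of a plaintiff win
# ("merit"), drawn uniformly. Suppose -- purely for illustration -- that
# most low-merit cases settle privately, so only the rest reach trial.
def simulate(n_cases=100_000):
    total_merit = tried = wins = 0.0
    for _ in range(n_cases):
        merit = random.random()
        total_merit += merit
        if merit < 0.6 and random.random() < 0.8:
            continue  # settled: outcome never appears in the data
        tried += 1
        wins += random.random() < merit
    return total_merit / n_cases, wins / tried

avg_merit, trial_win_rate = simulate()
print(f"average merit of all filed cases: {avg_merit:.2f}")
print(f"plaintiff win rate at trial:      {trial_win_rate:.2f}")
```

Under these assumptions the trial win rate comes out near 0.68 even though the average filed case would win only half the time. Any inference from observed trial outcomes back to litigation as a whole inherits that gap, and with the settled cases invisible, there is no way to measure it from the data.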
A commenter on a previous post asked me to start a conversation on the pitfalls of collecting data. Here’s the first: every sample (and every survey) carries a degree of bias that can’t be captured. Every time I’ve presented a data-centered paper to an audience of non-empiricists, the (un?)known unknown question comes up: you haven’t (or can’t) measure part of the population, so doesn’t that mean your conclusions are overstated? You should welcome that debate. In a real sense, the questioner is right: it’s very tempting to step outside your data, especially if you think you have a representative sample, but it is important to make parsimonious claims about even good datasets. That way, if reality bites, you will at least be less embarrassed.