Replicability, Exam Grading, and Fairness
What does it mean to grade fairly?
At my law school, and presumably elsewhere, law students aggrieved by a grade can petition that it be changed. Such petitions are often granted in the case of mathematical error, but usually denied if the basis is that on re-reading, the professor would have reached a different result. The standard of review for such petitions is something like “fundamental fairness.” In essence, replicability is not an integral component of fundamental fairness for these purposes.
Law students may object to this standard, and its predictable outcome, asserting that if the grader cannot replicate his or her outcomes when following the same procedure, then the total curve distribution is arbitrary. On this theory, a student should at the least have the right to a new reading of his or her test, standing alone and without the time pressure that full-scale grading puts on professors.
To which the response is: grading is subjective, and not subject to scientific proof. Moreover, grades don’t exist as platonic ideals but rather as distributions among students: only when reading many exams side by side can such a ranking be observed. We wouldn’t even expect one set of rankings to look very much like another: each is something like a random draw of a professor’s gut reactions to the test on that day.
This common series of arguments tends to engender cynicism among high-GPA and low-GPA students alike. To the extent that law school grading is underdetermined by work, smarts, and skill, it is a bit of a joke. The importance placed on these noisy signals by employers demonstrates something fundamentally bitter about law: the power of deference over reason.
This is an old debate, made more salient today by my recent experiences at the AELSC. One of the major messages of the ELS proponents was replicability: put your data online; record your methods (even your syntax); and make sure that other researchers can recreate your results. This raised for me the question: what is an empirical legal studies approach to grading?
My grading method uses an Excel matrix. Although I sometimes say that “no one can be told what the Matrix is,” in practice this is probably untrue. Basically, I break up each exam question into its constituent issues, award points for spotting and discussing each issue, and move on. I do not have much room in my spreadsheet for novel attacks on problems (which is a problem absent notice), but I do try to correct for outlier issues. Each issue-spotting test has something like 50-70 issues to catch.
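In the spirit of the ELS injunction to record one’s syntax, the matrix method reduces to a very simple computation: a rubric mapping issues to point values, and a score that sums over the issues a student spotted. Here is a minimal sketch in Python; the issue names and point values are hypothetical, not my actual rubric.

```python
# Hypothetical rubric: each constituent issue of a question, with its point value.
# A real issue-spotting exam would have 50-70 such entries.
ISSUE_POINTS = {
    "personal jurisdiction": 3,
    "subject-matter jurisdiction": 3,
    "venue": 2,
    "choice of law": 2,
    "statute of limitations": 2,
}

def score_exam(issues_spotted, rubric=ISSUE_POINTS):
    """Sum the points for each rubric issue the student spotted and discussed."""
    return sum(points for issue, points in rubric.items()
               if issue in issues_spotted)

# A student who spots three of the five issues:
print(score_exam({"personal jurisdiction", "venue", "choice of law"}))  # 7
```

Note what the sketch makes explicit: there is no entry for a novel attack on a problem, which is exactly the rigidity described above.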
There are major advantages to this method, primarily that it allows students to see exactly how and why they lost points during exam review. However, I doubt that the scores awarded are particularly replicable (although I hope that my overall ranking of students would be fairly robust). That is to say, my grading of law school exam essays is more art than science.
Is this bad? ELS might suggest reasons to think so, as would a fairly traditional behind-the-veil analysis. So perhaps professors have a duty to make sure (through data diagnostics) that their grading is replicable. One way to do so would be to do blind re-grades of randomly selected exams; another would be to use two sets of graders; another would be to switch to a multiple-choice format. These are all fairly drastic or labor-intensive solutions. I might support them, but I’m not yet fully convinced of the underlying claim. What do others think? Does good grading practice mean replicability? Transparency with respect to the data? Auditing? Or are those quantitative principles inapt when considering essay tests? (Another way to think about this is that the mistake is mine when I reduce each issue to a number of points. Better to consider the test as a whole?)
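The blind re-grade audit suggested above can also be made concrete. If the claim is that the overall ranking is robust even when individual scores wobble, the natural check is to grade a random sample twice and measure how well the two rankings agree, for instance with Spearman’s rank correlation (1.0 means identical rankings; near 0 means the rankings are unrelated). A self-contained sketch, with made-up scores for illustration:

```python
def ranks(scores):
    """Rank scores from lowest (1) to highest, averaging ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    r = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j across any run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

first_read  = [62, 55, 70, 48, 66, 59]   # hypothetical first-pass scores
second_read = [60, 52, 71, 50, 63, 61]   # hypothetical blind re-grade of the same exams
print(round(spearman(first_read, second_read), 2))  # 0.94
```

In this invented example the raw scores shift by a few points on re-reading but the ranking is nearly preserved, which is precisely the distinction drawn above between replicable scores and a robust curve.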