Replicability, Exam Grading, and Fairness

What does it mean to grade fairly?

At my law school, and presumably elsewhere, law students aggrieved by a grade can petition that it be changed. Such petitions are often granted in the case of mathematical error, but usually denied if the basis is that on re-reading, the professor would have reached a different result. The standard of review for such petitions is something like “fundamental fairness.” In essence, replicability is not an integral component of fundamental fairness for these purposes.

Law students may object to this standard, and to its predictable outcome, asserting that if the grader cannot replicate his or her outcomes when following the same procedure, then the overall curve distribution is arbitrary. On this theory, a student should at the least have the right to a fresh reading of his or her test, standing alone and free of the time pressure that full-scale grading puts on professors.

To which the response is: grading is subjective, and not subject to scientific proof. Moreover, grades don’t exist as platonic ideals but rather as distributions among students: only by reading many exams side by side can such a ranking be observed. We wouldn’t even expect one set of rankings to look much like another: each is something like a random draw of a professor’s gut reactions to the test on that day.

This common series of arguments tends to engender cynicism among high-GPA and low-GPA students alike. To the extent that law school grading is underdetermined by work, smarts, and skill, it is a bit of a joke. The importance employers place on these noisy signals demonstrates something fundamentally bitter about law: the power of deference over reason.


This is an old debate, made more salient today by my recent experiences at the AELSC. One of the major messages of the ELS proponents was replicability: put your data online; record your methods (even your syntax); and make sure that future researchers can recreate your results. This raised for me the question: what would an empirical legal studies approach to grading look like?

My grading method uses an Excel matrix. Although I sometimes say that “no one can be told what the Matrix is,” in practice this is probably untrue. Basically, I break each exam question into its constituent issues, award points for spotting and discussing each issue, and move on. I do not have much room in my spreadsheet for novel attacks on problems (which is a problem absent notice), but I do try to correct for outlier issues. Each issue-spotting test has something like 50-70 issues to catch.
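Since this post is about methods, here is a minimal sketch of what one version of such a matrix might look like in code. The issue names and point values are hypothetical placeholders, not my actual rubric:

```python
# A hypothetical issue-spotting rubric: each issue a student can catch,
# mapped to the points available for spotting and discussing it.
ISSUE_POINTS = {
    "personal jurisdiction": 3,
    "subject-matter jurisdiction": 3,
    "venue": 2,
    "choice of law": 4,
}

def score_answer(issues_spotted):
    """Total the points for every rubric issue the student actually hit."""
    return sum(points for issue, points in ISSUE_POINTS.items()
               if issue in issues_spotted)

# A student who caught three of the four issues earns 3 + 2 + 4 = 9 points.
print(score_answer({"personal jurisdiction", "venue", "choice of law"}))
```

The obvious limitation, noted above, is that anything that isn’t already a row in the matrix (a novel attack on the problem) earns nothing unless the grader steps in.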

There are major advantages to this method, primarily that it allows students to see exactly how and why they lost points during exam review. However, I doubt that the scores awarded are particularly replicable (although I hope that my overall ranking of students would be fairly robust). That is to say, my grading of law school exam essays is more art than science.

Is this bad? ELS might suggest reasons to think so, as would a fairly traditional behind-the-veil analysis. So perhaps professors have a duty to make sure (through data diagnosis) that their grading is replicable. One way to do so would be blind re-grades of randomly selected exams; another would be to use two sets of graders; another would be to switch to a multiple-choice format. These are all fairly drastic or labor-intensive solutions. I might support them, but I’m not yet fully convinced of the underlying claim. What do others think? Does good grading practice mean replicability? Transparency with respect to the data? Auditing? Or are those quantitative principles inapt when considering essay tests? (Another way to think about this is that the mistake is mine when I reduce each issue to a number of points. Better to consider the test as a whole?)
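To give the “data diagnosis” idea some shape, here is a rough sketch of the blind re-grade check: score a random sample of exams twice and ask how well the two rankings agree. The scores below are invented for illustration, and spearmanr is SciPy’s rank-correlation routine:

```python
# Compare an original grading pass against a blind re-grade of the same
# randomly selected exams; what matters for the curve is the *ranking*.
from scipy.stats import spearmanr

first_pass  = [82, 74, 91, 66, 88, 79]   # original scores (hypothetical)
second_pass = [80, 77, 89, 70, 85, 74]   # blind re-grade of the same exams

rho, _ = spearmanr(first_pass, second_pass)
print(f"Rank correlation between passes: {rho:.2f}")
```

A correlation near 1.0 would suggest the overall ranking replicates even when individual point totals wobble; a low one would be evidence for the students’ side of the argument.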


3 Responses

  1. Jeff Lipshaw says:

    Your method sounds a lot like mine. I don’t think it’s a good idea to look at the test as a whole. To the point about relative scores versus objective scores, I do one question at a time without looking at the student’s scores on preceding questions, and try, if possible, to do all of a question at one sitting. And I reverse the stack of exams on each question. It gives the student a fair shake on each question without preconceived notions. And I find that it still shakes out into a curve – by and large, the people who get As have done well consistently across all the questions.
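    A minimal sketch of the rotation described above, with hypothetical exam IDs: grade one question at a time across every exam, blind to earlier scores, then flip the stack before the next question:

    ```python
    # Grade question by question, reversing the stack between questions so
    # no exam is always read first (or last, when the grader is tired).
    exams = ["exam_A", "exam_B", "exam_C", "exam_D"]

    for question in (1, 2, 3):    # each question graded in one sitting
        for exam in exams:
            print(f"grading question {question} of {exam}")
        exams.reverse()           # flip the stack for the next question
    ```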

  2. KipEsquire says:

    In the spirit of “interdisciplinary studies,” you might find this post on “grades and Wall Street” interesting. 😉

  3. Belle Lettre says:

    Speaking as both a graduate law student and an aspiring prof who has graded hundreds of exams as a TA, I say cynically that a bona fide attempt at fairness is requisite, but it’s the appearance of fairness that for some reason matters.

    Jeff’s grading system is standard; most professors in other disciplines (I speak of political science, sociology, English lit) ask you to grade the same way. And Dave, your grading system “appears” fair; that is, you have created a set of metrics by which each student will be judged. The question is, do metrics ensure objectivity? The complaint about the law, social sciences, and humanities from science-trained students is that we’re so subjective in our grading. Do we get around subjectivity by creating “objective” indicia? Or is everything too subjective to get around, from the framing of the question to the ranking of the issues to spot?

    I kind of like the subjectivity of the law: there are some obvious right things to spot, but it’s in the “massaging of the facts” and the application of the law to them, as my Civ Pro prof said, that we find the answers. My co-blogger Jeff Harrison at Money-Law happens to think multiple choice questions are easy outs for faculty, and as a student I can say I hate them. They just bug me.

    I’m a bit disturbed by this retro kick to objectivise the law and try to create a metric for everything; aren’t we supposed to be old-school legal realists? (Leiter has declared that ELS is not the New Legal Realism; it’s law + stats.) But I do understand the end goal of fairness and measurability; I’m just wary of the methods. It’s all about methods in ELS.

    But for more pointed grading strategies: in every class I’ve graded (six so far), we made a key (like you, Dave) and we took a few model answers of A-, B-, C-, and D-range papers. If 2-3 are of consistent quality per range, they’re good models to grade against as you go along for the rest. And I’ve found that when I type up comments, students seem to think it lends validity and credibility to my assessment.