# Exam Grading and Standard Deviations

Dave’s recent posts about grading have me wondering. Whenever I grade, I encounter the following mathematical choice, and I am often torn about which is the proper, fair choice to make.

Imagine you give an exam with two questions, each supposedly worth 50% of the final grade. Imagine further you grade both questions and properly normalize the scores for each one to a 50 point scale. (I’m not so sure all professors normalize properly, but that’s a different problem.)

What do you do if the standard deviations in the two normalized grade populations vary widely? In other words, imagine that question one elicits a long, flat curve: the lowest score is much lower than the highest score, and there is a lot of variation in the scores in between, while question two elicits a compact curve with a very high peak that drops off quickly in both directions.

Is it legitimate (fair, proper) simply to add the normalized scores for questions one and two to derive the final score? Does this cause the first question to exert an unfairly disproportionate effect on the final curve? First, consider the extreme case. In a class of 50 students, every student gets a different normalized score for question one–from one to fifty points–while every student in the class gets the exact same normalized score–say 20 points–for question two. Simply adding the scores together means the final curve will match the curve for question one exactly, and question two will have been written out of the exam.

This seems to be the fair result. Question two is a bad question. It didn’t differentiate between the students in the class, so it is fair to curve the class based solely on their performance on question one. What is the alternative?
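The extreme case above can be checked numerically. The following is only a minimal sketch: the variable names and the `order` helper are made up for illustration, and the scores are the ones described in the hypothetical (one to fifty points on question one, a flat 20 on question two).

```python
from statistics import stdev

# Hypothetical class of 50: every student gets a different score on
# question one, and every student gets exactly 20 on question two.
q1 = list(range(1, 51))
q2 = [20] * 50
totals = [a + b for a, b in zip(q1, q2)]


def order(scores):
    """Return student indices sorted from highest score to lowest."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# The final ranking is the question-one ranking, unchanged.
print(order(totals) == order(q1))

# The spread of the totals is the spread of question one alone.
print(stdev(totals), stdev(q1))

# Question two has zero spread, so it cannot be rescaled to match
# question one: dividing by its standard deviation is undefined.
print(stdev(q2))
```

One thing the sketch makes visible is that a standard-deviation-based transformation offers no alternative here, because question two's standard deviation is zero and there is nothing to rescale.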

But what if we’re not at the extreme case? Imagine question one’s curve is much flatter than (the standard deviation of the scores is much higher than) question two’s curve, yet question two’s curve nevertheless differentiates between the students. Is it fair simply to add the two, or are you failing to abide by your promise to your students to have each question be worth 50% of the exam?

If you think that it is *not* fair simply to add, you can apply a transformation to one set of data or the other to bring the standard deviations more in line with one another. Is this proper?

My initial take is that sometimes the transformation is fair and sometimes it is not. It depends on what you think about the objective quality of your grading methods and the uniformity of the difficulty of the questions you wrote. For example, if question one is much more difficult than question two, perhaps the curve *should* be driven by question one, and the data should not be transformed (you can make the opposite argument). In contrast, if question one is an issue spotter and question two is a policy question, simply adding the normalized scores may not reflect the greater subjectivity in grading policy questions, and a transformation may be in order.

There are no neutral choices here. Unless the scores for questions one and two are highly correlated, many students’ final grades will vary based on the choice made. At the very least, this is yet more proof of the inherent subjectivity of the entire grading process. Have others thought about this, and if so, which choices have you made?

I hope it’s not bad form to post the first comment on my own post. As I have learned so often in this job, I should have done a literature search before posting! This article, for example, seems a good start. It calls what I am talking about the “Weight Problem” and suggests ALWAYS using a transformation (in particular, a T-score) to bring multiple assignments’ standard deviations into line.

I don’t think this is right. I think it unconsciously embeds particular answers–value choices–to the questions I posed above.

Just an initial reaction, but doesn’t the problem assume correlation between the two scores? If the two grades aren’t correlated, then you truly are counting each 50%. If they are correlated, wouldn’t a transformation not accomplish much?

It seems like a post-curve transformation would strip the grading process of some transparency.

For example, if I’m a student and I know that each question is worth half my grade, I’ll work equally hard on both questions. If one of the questions is worth 75% and the other 25%, I’ll scale my efforts accordingly. But if the questions say they’re weighted 50-50, but end up transformed to a 75-25 weight, how should I concentrate my effort and time? Would I have done anything different on a 75-25 split than I did on a 50-50?

A transformation would also seem to harm those who did well on the question that was a better differentiator (for lack of a better word, the “harder” question), and help those who did better on the worse differentiator (the “easier” question). Does it make sense to punish success on a high-variance question in order to reward marginally better answers on a low-variance question?

I don’t think “each question is worth half the test” necessarily implies that each question must be equally important in determining a student’s place in the curve. Students know that some questions are tougher than others, and that “10% of your score on this exam” doesn’t mean “10% of the impact on your standing within the class.”

All this statistics talk is making the stairs look like a better, and maybe no less accurate, choice!

Bill, I don’t think it matters. As an example, imagine that the ten students in a class received the following normalized scores out of one hundred. (I’m using 100 instead of 50 because it makes the numbers easier to cook, but it’s still a valid example.) In order, by student ID #:

Question 1: 10 20 30 40 50 60 70 80 90 100

Question 2: 92 96 98 97 99 93 94 95 100 91

These two data sets are uncorrelated (Pearson r = -0.0667).

If the final score is calculated through a simple sum, the students’ final grades will be:

Sum: 102 116 128 137 149 153 164 175 190 191

In other words, the final order of the students is entirely dictated by their order on question one.

If, instead, you calculated T-scores as the paper linked to in my first comment recommends, the students would finish in the following order (students 10, 7, 4, and 3 have identical T-score totals, so their relative order is arbitrary):

Student #: 9 (best) 5 8 10 7 4 3 6 2 1 (worst)
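For anyone who wants to check these numbers, here is a rough sketch in Python. The function names are invented for this illustration, and it assumes the sample standard deviation (the n-1 version); because both questions happen to have evenly spaced scores, the population version would give the same ordering.

```python
from statistics import mean, stdev

# Normalized scores from the example, in order by student ID (1 to 10).
q1 = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
q2 = [92, 96, 98, 97, 99, 93, 94, 95, 100, 91]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den


def t_scores(xs):
    """Rescale scores to mean 50 and standard deviation 10."""
    m, s = mean(xs), stdev(xs)
    return [50 + 10 * (x - m) / s for x in xs]


def ranking(totals):
    """Student IDs, best first. Tied students stay in ID order."""
    # Round first so floating-point noise does not break genuine ties.
    rounded = [round(t, 6) for t in totals]
    return sorted(range(1, len(totals) + 1), key=lambda i: -rounded[i - 1])


raw_totals = [a + b for a, b in zip(q1, q2)]
t_totals = [a + b for a, b in zip(t_scores(q1), t_scores(q2))]

print(round(pearson_r(q1, q2), 4))  # -0.0667
print(ranking(raw_totals))          # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(ranking(t_totals))            # [9, 5, 8, 3, 4, 7, 10, 6, 2, 1]
```

The simple sum reproduces the question-one order exactly, while the T-score totals reshuffle the class. Students 3, 4, 7, and 10 end up with the same T-score total, so this sketch lists them in ID order.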

These are useful insights, Jim. Even if I agree that “student expectations” should inform the answer to my question, I think it begs the question a bit. What do students expect?

For example, I practice the “point counting” method of grading issue spotters. A typical question might have 35 or 40 possible points. Every year, even students who botch a problem badly tend to mop up 15, 20, or 25 points, or around half of the points available.

In contrast, I tend to grade policy questions using a more gestalt system, on a 1-to-15 or 1-to-10 scale. Although students rarely score a 1 or 2, I regularly give 3’s and 4’s. It wouldn’t seem fair to those rare souls who earn the perfect 15’s to do otherwise. Assuming an otherwise normal distribution, when I normalize these two types of questions to 50 points, the policy question will have much more spread and much more of an effect on the final grade.

Put another way, there is more of a premium placed on getting the best score on the policy question than in getting the best score on the issue spotter, all else being equal.

Is this consistent with student expectations?

In fact, if you know that your professor follows the “Sum without Transforming” rule, then you can game the system. Faced with a nominally 50/50 exam, if you notice that one question is much, much more difficult than the other (i.e., likely to lead to much more point spread), it is in your best interest to spend more than half the time on the harder question. Every extra point you mop up during the extra time you spend will outweigh the equivalent point you might lose on the “narrower curve,” easier problem. Obviously, it’s a risk, but it also shows the possible unfairness of not doing a transformation.

This is a question that I’ve thought about, without arriving at a great answer.

My approach has varied. At first, I just added up points. Then, I went to a pure normalization, using a “z-score” spreadsheet that a colleague provided. Currently, I use a hybrid model.

I allocate points using a normalized curve. However, I keep the students’ raw points in a column nearby, and often use raw points to determine grade break points.

However, additional questions come up as to the level of specificity at which to apply the normalization. I normalize in two ways. I normalize an “essay” grade and a multiple-choice grade. (My school culture is to include a multiple-choice section.) I also normalize a grade for each individual essay. Then, I take the higher of the two: that is, either the “essay” or the “essay 1 + essay 2.”

Any set of choices we make will result in some group being favored. My hybrid essay normalization will result in favoring either students who did phenomenally well on one essay and not on the other, or students who did exactly the same on both essays. Students who fall into the in-between areas will be less advantaged.

One of the points of having multiple questions, it seems to me, is to broaden the amount of knowledge necessary to do well, and lessen the probability that someone could do poorly just because they overlooked 5 pages in a 1000-page textbook. One of my worries is that my selection of topics on which to test is not sufficiently well-dispersed to begin with, and I certainly wouldn’t want to make it less so by effectively eliminating one of the questions. So I guess I “transform” my grades by converting the raw scores to roughly equally lumpy bell curves for each question, within reason, then averaging those, so that no one question is much more determinative of the outcome than another. I’m also much more confident in my ability to rank essay answers in quality than to assign raw scores to them, so a 5-point difference on Question 1 might be equivalent to a 10-point difference on Question 2 anyway. But I don’t know much about statistics, so it’s all done according to a “look and feel” test.

Not that I’m averse to learning more. Where can I go to learn about proper “normalizing”?

In terms of the underlying problem of unfairness posed by Paul, the T-score referred to in the article would solve the problem. In response to Bruce’s query, here is how you standardize (aka Z-score) using Excel:

1) List the raw score for each question in a column; let’s assume there are 50 students, in cells A1:A50

2) Calculate the mean for the question [=average(A1:A50)]

3) Calculate the standard deviation for the question [=stdev(A1:A50)]

4) With these calculations, you can “standardize” grades for the question. In cell B1, type the following formula: =standardize(A1,[average],[std dev]). Then fill in this formula for the remaining 49 cells in column B. [Note that the fill process goes more smoothly if you “name” your arrays; this is found on the “Insert” tab in Excel.]

(for the ambitious, steps 1-4 can be done in a single equation)

5) repeat for each question.

Step #4 will change the mean score to zero, and every score above and below the mean will be reported in standard deviation units. For example, a -2.00 will be a terrible answer (roughly the bottom 2%, if the scores are approximately normally distributed) and a 2.00 will be a terrific answer.

To transform this to a 50 point mean / 10 point std dev (the T-Score discussed above by Paul), you multiply the Z-score (the value in cell B1) by 10 and add 50. So, in cell C1, type: =B1*10+50

Whatever you multiply the Z-Score by is effectively the weight given to the question. So weight the questions, add them up, and that is the final (and fully normalized) score.
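For those who would rather not use a spreadsheet, the same steps can be sketched in Python. This is only an illustration under stated assumptions: the function names, the weights, and the five-student score lists are all hypothetical. It uses `statistics.stdev`, which computes the sample standard deviation, as Excel's STDEV does.

```python
from statistics import mean, stdev


def z_scores(raw):
    """Steps 2-4: subtract the mean, divide by the standard deviation."""
    m = mean(raw)
    s = stdev(raw)  # sample standard deviation, like Excel's STDEV
    if s == 0:
        raise ValueError("all scores are identical; cannot standardize")
    return [(x - m) / s for x in raw]


def t_scores(raw, center=50, spread=10):
    """Rescale Z-scores to a mean of 50 and a standard deviation of 10."""
    return [center + spread * z for z in z_scores(raw)]


def combine(questions, weights):
    """Weighted sum of per-question Z-scores, one total per student."""
    zs = [z_scores(q) for q in questions]
    n_students = len(questions[0])
    return [sum(w * z[i] for w, z in zip(weights, zs))
            for i in range(n_students)]


# Hypothetical raw scores for five students on two questions.
essay = [31, 24, 28, 35, 19]
policy = [9, 12, 4, 15, 7]

totals = combine([essay, policy], weights=[0.5, 0.5])
for student, total in enumerate(totals, start=1):
    print(student, round(total, 2))
```

Changing `weights` to something like `[0.75, 0.25]` is the code equivalent of multiplying each question's Z-score by its intended weight before adding.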

My colleague, Jeff Stake, published an article on this very topic. See Jeffrey Evan Stake, Making the Grade: Some Principles of Comparative Grading, 52 J. Leg. Educ. 583 (2002).

I roughly normalize, but only roughly, in part because normalization using T-scores may not work so well if the underlying distributions aren’t identical. For some questions there are distinct populations of those who basically get it and those who basically don’t. These will naturally have higher standard deviations than a classic normal distribution. Sometimes as well it’s obvious that a particular question was the one that students skimped on when they ran short on time. This means that there’s a lump of grades at the low end, which also increases the standard deviation. Decreasing the weight of that question means that students who didn’t use time efficiently will be somewhat insulated from the consequences of their actions. I therefore do adjust standard deviations, but I do so only after looking closely at the underlying distributions and making appropriate allowances.

This is a fantastic discussion, I think – Bill, I’ll be using your method in the future. But the question then becomes: do “best grading practices” entail telling the students about normalization before the fact, and showing them the spreadsheet after the fact?

Normalizing assumes that the questions are of equal value. I tend to think that greater diversity of scores suggests that a question was better able to differentiate among students, in which case it should be weighted more heavily. However, I don’t suppose this is necessarily the case, and I look at it subjectively.

Another problem with normalization: it may create a situation where a small absolute score differential of one point produces a large Z-score difference. Given my uncertainty about my accuracy in making such small absolute score differentials in the first place, that worries me.
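To put a rough number on this worry, here is a small sketch with two made-up score sets, one tightly bunched and one widely spread. The data and the helper names are hypothetical; the point is only that one raw point is worth 1/(standard deviation) in Z-score units.

```python
from statistics import stdev

# Hypothetical scores on a 50-point scale.
tight = [20, 21, 20, 21, 20, 21, 20, 21, 20, 21]  # nearly everyone the same
wide = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]    # spread across the scale


def z_per_point(raw):
    """Z-score change produced by a one-point change in the raw score."""
    return 1 / stdev(raw)


def t_points_per_raw_point(raw, spread=10):
    """The same one-point change, expressed in T-score points."""
    return spread * z_per_point(raw)


print(round(z_per_point(tight), 2))  # about 1.9 standard deviations
print(round(z_per_point(wide), 3))   # about 0.066 standard deviations
print(round(t_points_per_raw_point(tight), 1))
```

On the bunched question a single raw point moves a student by almost two standard deviations, so any noise in the grader's one-point distinctions is magnified accordingly.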

If I have told my students that half of the weight will be on each of two questions, and they have spent their studying and exam time relying on that representation, I feel I owe them my best effort at giving the questions equal weight. To do that, each score must be divided by the standard deviation for the scores on that question before being added to the score on the other part. Z-scores are one way of doing this. In addition to the article mentioned by Bill Henderson, I have software that does this automatically. Write to me if you want to try it.

If I felt I might not end up using Z scores to weight the scores equally [or as announced], I would carefully explain in advance that I cannot tell in advance how much weight each part will have. That might not be very satisfying to students, but it is a lot more honest than saying you will give certain weights and then not doing it.

I would suggest that there is, in fact, no single method that is without defect. Even T-scores (which give consideration to differing variances across grade distributions of different “tests”) cannot be the whole answer. It is true that we do not want to implicitly weight exams with higher variances more heavily, but to normalize each test is to obviously risk losing some valuable data, as well. In the most general sense, variance may reflect one of three things: something important about the distribution of student understanding/performance, something important about the effectiveness of our teaching, or something unrelated to the evaluation of either learning or teaching. I wish I could get rid of the unrelated data entirely, process and then remove from grading the teaching component (e.g., the question was too hard or easy or I didn’t do a good job teaching the topic, so the performances on that test were not well correlated with fundamentals of student performance), but leave behind the component of variance that reflects something important about student performance. Perhaps in one test there was a critical threshold of understanding or problem solving skill needed and that left a long tail in the distribution that accurately reflects students who showed up with insufficient mastery, while in another a more superficial but comprehensive understanding of the material was being examined, and that knowledge was broadly distributed? That is meaningful, and lost if we give up the variances.

The problem, of course, is that we have no means of actually sorting through the causes of the differential variations, and keeping just what we want. But it cannot be said that getting rid of all the bathwater solves the problem perfectly.

That’s not to say T-scores and related methods don’t do more good than harm if used correctly. For me, two concerns persist: (1) I don’t think those of us who can should stop looking for ways to preserve the information content in the variance that is related to fair evaluation of student performance and (2) I think the movement in education generally (higher ed and secondary ed) toward increasing standardization of criteria for evaluation has pressured many folks to ineptly utilize naive weighted-average methods, devoid of variance adjustments. Those folks, >99% of the higher ed faculty I know, were actually better off using their unconscious faculties and professional judgment (a la Justice Potter Stewart, they know an A when they see one) than using, poorly, silly weighted-average methods, sometimes foisted upon them by naive administrators.