Article from: The UMAP Journal Fall 1998 pages 323-327
The Outstanding Grade Inflation Papers
Grade point average (GPA) is the most widely used summary of
undergraduate student performance. Unfortunately, combining student
grades using simple averaging to obtain a GPA score results in
systematic biases against students enrolled in more rigorous
curricula, and/or taking more courses. Larkey gives an example of four
students (call them I-IV) in which student I always obtains the best
grade in every class she takes and student IV always obtains the worst
grade in every class he takes, yet student I ends up with a lower GPA
than student IV.
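The paradox is easy to reproduce with hypothetical numbers (these are illustrative grades, not Larkey's actual figures): when the consistently best student takes harshly graded classes and the consistently worst student takes leniently graded ones, simple averaging inverts their order.

```python
# Hypothetical grades (illustrative only -- not Larkey's actual data).
# Student I is best in every class she takes; student IV is worst in
# every class he takes; yet I's GPA comes out lower than IV's.
grades = {
    "class 1": {"I": 3.3, "II": 3.0},   # I beats II, but the class is graded low
    "class 2": {"I": 3.3, "III": 3.0},
    "class 3": {"II": 4.0, "IV": 3.7},  # IV loses to II, but the class is graded high
    "class 4": {"III": 4.0, "IV": 3.7},
}

def gpa(student):
    """Simple average of a student's grades over the classes they took."""
    marks = [g[student] for g in grades.values() if student in g]
    return sum(marks) / len(marks)

print(gpa("I"), gpa("IV"))  # I averages 3.3, IV averages 3.7
```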
The MCM problem was to determine a ``better'' ranking than one using
pure GPAs; this problem has no simple ``solution''. A recent paper by
Johnson refers to many studies of this topic, and suggests a
technique that was considered, but not accepted, by the faculty at
Duke University in North Carolina.
Each of the participating schools is to be commended for its efforts
in tackling this problem. As in any open-ended mathematical modeling
problem there is not only great latitude for innovative solution
techniques but also the risk of finding no results valuable in
supporting one's thesis. Solutions submitted contained a wide variety
of approaches including graph theory and fuzzy logic.
Unfortunately, several teams were confused as to the exact problem the
Dean wanted solved. Assigning students to deciles, by itself, was not
the problem. For example, deciles could be assigned by obtaining a
list of student names, and choosing the first 10% of the students to
be in the first decile, etc. What the Dean wanted was meaningful
student deciles, based on students' relative class performance.
Simply re-scaling GPAs such that the average became lower (and the
top 10% became more spread out) would not change the inherent
ordering.
The problem statement suggested that relative rankings of students
within classes should be used to evaluate student performance. With
this assumption, possible approaches include using the grade values
themselves in addition to the within-class rankings, or using only
the relative rankings and discarding the grade values. In the latter
case (which was chosen by most teams), an instructor who assigns A's
to all students in a class provides exactly the same information as
an instructor who assigns all C's to the same students when enrolled
in another class.
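Under this rank-only reading, both classes collapse to identical within-class ranks. A minimal sketch (hypothetical helper and student names, for illustration):

```python
def within_class_ranks(class_grades):
    """Map each student to their rank in the class (1 = best; ties share a rank)."""
    distinct = sorted(set(class_grades.values()), reverse=True)
    return {s: distinct.index(g) + 1 for s, g in class_grades.items()}

all_As = {"Ann": 4.0, "Bob": 4.0, "Cam": 4.0}
all_Cs = {"Ann": 2.0, "Bob": 2.0, "Cam": 2.0}

# Both classes reduce to the same (uninformative) ranking: everyone tied at 1.
print(within_class_ranks(all_As) == within_class_ranks(all_Cs))  # True
```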
Specific items that the judges looked for in the papers included:
- Reference to ranking problems in other fields. For example, the
  ranking of chess players and professional golf players is obtained
  from relative performance results.
- A detailed worked-out example illustrating the method(s)
  proposed, even if there were only 4 students in the example.
- Computational results (when appropriate) and proper
  consideration of large datasets. Teams that used only a small
  sample in their computational analysis (say, 20 students) did not
  appreciate many of the difficulties of implementing a grade-adjustment
  scheme.
- Even though the GPA may not be as ``good'' a discriminator as the
various solutions obtained by the teams, it would seem reasonable
that there be some correlation between the two. Judges were
pleased to see this observation mentioned, even if not used.
- A response indicating understanding of the question about the
  changing of an individual student's grade was rewarded. Such a
  grade change could affect that student's ranking, but if it
  affected many other students' ranks, then the model is probably
  too sensitive.
- An outstanding paper had a clear, concise, complete, and meaningful list
of assumptions. Needed assumptions included:
- The average grade given out was an A-. (This must be assumed, as it
  was stated in the problem statement--amazingly, several teams
  assumed other starting averages!)
- When using an (A+, A, A-, ...) system, not all the grades
  given out were A-'s. (Otherwise, there is no hope of
  distinguishing student performance.)
Many teams confused assumptions with the results they were trying to
obtain. Teams also made assumptions that were not used in their
solution, were naive, or needed further justification. For example,
- Many teams assumed a continuous distribution of grades. As an
  approximation of a discrete distribution, this is fine. However,
  several teams allowed grades higher than A+ to be given out, and
  other teams neglected to convert the continuous distribution to a
  discrete one when actually simulating grades.
- Several teams assumed that teachers routinely turn in a percentage
score or class ranking with each letter grade. This, of course,
would be very useful information but is not realistic.
- Low grades in a course do not necessarily imply that the course is
  difficult. A course could be scheduled only for students who are
  ``at risk''. Likewise, a listing of faculty grading patterns does not
  necessarily allow ``tough'' graders to be identified: a teacher
  may teach only ``at risk'' students.
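The continuous-distribution pitfalls noted above can be avoided with a clip-and-snap step when simulating grades. A minimal sketch (the 4.33-point scale values are an assumption here, since the problem left the exact grade scale open):

```python
import random

# Assumed A+ = 4.33 scale; the actual scale was up to each team.
SCALE = [4.33, 4.0, 3.67, 3.33, 3.0, 2.67, 2.33, 2.0, 1.67, 1.33, 1.0, 0.67, 0.0]

def simulate_grade(mean=3.67, sd=0.4):
    """Draw from a continuous distribution centered near A-, then snap
    to the nearest legal grade."""
    x = min(random.gauss(mean, sd), SCALE[0])    # clip: never above A+
    return min(SCALE, key=lambda g: abs(g - x))  # discretize, don't stay continuous

random.seed(1)
sample = [simulate_grade() for _ in range(1000)]
```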
- The most straightforward approaches to solving this problem were:
- Use of information about how a specific student in a class
compared to the statistics of a class. For example, ``student 1's
grade was 1.2 standard deviations above the mean, student 2's
grade was equal to the mean, ...''. The numbers
(1.2, 0, ...) can be used to build up a ranking.
- Use of information about how a specific student in a class
  compared to other specific students. For example, ``in class 1,
  student 1 was better than student 2, student 1 was better than
  student 3, ...''. This information can be used to build up a
  ranking.
As these techniques are quite simple, the judges rewarded mention
of these techniques, even if other techniques were pursued.
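Both straightforward approaches can be sketched on a toy dataset (the class records and student names are hypothetical, and the guard for all-equal classes is an implementation choice, not from the problem statement):

```python
import statistics
from collections import defaultdict

# Hypothetical class records: {class: {student: grade}}
classes = {
    "c1": {"A": 3.7, "B": 3.3, "C": 3.0},
    "c2": {"A": 4.0, "B": 4.0},
    "c3": {"B": 3.0, "C": 2.7},
}

def z_scores():
    """Approach 1: standard deviations above the class mean, averaged per student."""
    scores = defaultdict(list)
    for roster in classes.values():
        mean = statistics.mean(roster.values())
        sd = statistics.pstdev(roster.values()) or 1.0  # guard all-equal classes
        for s, g in roster.items():
            scores[s].append((g - mean) / sd)
    return {s: sum(v) / len(v) for s, v in scores.items()}

def win_counts():
    """Approach 2: pairwise comparisons -- how often a student out-graded a classmate."""
    wins = defaultdict(int)
    for roster in classes.values():
        for s, g in roster.items():
            wins[s] += sum(g > h for h in roster.values())
    return dict(wins)
```

Either result can then be sorted to produce a ranking; note the two approaches need not agree (here the win counts tie A and B, while the z-scores separate them).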
- Other features of an outstanding paper included:
- Clear presentation throughout
- Concise abstract with major results stated specifically
- A section devoted to answering the specific questions raised in the
  problem statement, or stating why answers could not be given.
- Some mention of whether the data available (e.g., sophomores
  being ranked with only 2 years' worth of data) would lead to
  statistically valid conclusions.
None of the papers had all of the components mentioned above, but the
outstanding papers had many of these features. Specific pluses of
the outstanding papers included:
- Duke team
- Their summary was exemplary. By reading the summary you could tell
  what they were proposing and why, what issues they saw, and what
  models they produced.
- Their use of least squares to solve an over-determined set of
  equations was innovative.
- Their figures of raw and ``adjusted'' GPAs clearly and visually
showed the correlation between the two, and the amount of
``error'' caused by exclusive use of GPAs.
- Harvey Mudd team
- Their sections on ``Practical considerations'' and ``What
characterizes a good evaluation method'' demonstrated a clear
understanding of the problem.
- Their figures of ``Raw GPA'' versus ``Student quality'' clearly
and visually showed the correlation between the two, and the
amount of ``error'' caused by exclusive use of GPAs.
- Stetson team
- Their use of the median, as well as the mean, when comparing a
  specific student to the statistics of a class was innovative.
  (Use of the median reduces the effects of outliers.)
- They interpreted the results of rank adjustment for specific
  individuals in their sample.
- They showed an awareness of the problem, as indicated by literature
  references in their ``Background Information'' section.
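A least-squares adjustment of the general kind the Duke team used can be sketched as follows. This is a hypothetical formulation for illustration, not necessarily their actual model: treat each recorded grade as approximately a student-ability term plus a course-leniency term, which gives one equation per grade and an over-determined linear system.

```python
import numpy as np

# Hypothetical records: (student index, course index, grade).
observations = [
    (0, 0, 3.3), (1, 0, 3.0),   # course 0 grades low
    (0, 1, 3.4), (2, 1, 3.1),
    (1, 2, 4.0), (3, 2, 3.7),   # course 2 grades high
    (2, 3, 4.0), (3, 3, 3.7),
]
n_students, n_courses = 4, 4

# One equation per grade: ability_s + leniency_c = grade.
A = np.zeros((len(observations), n_students + n_courses))
b = np.zeros(len(observations))
for row, (s, c, g) in enumerate(observations):
    A[row, s] = 1.0
    A[row, n_students + c] = 1.0
    b[row] = g

# The system is over-determined and rank-deficient (abilities are identified
# only up to a shared constant), so solve it by least squares; the constant
# shift does not affect the resulting ranking.
params, *_ = np.linalg.lstsq(A, b, rcond=None)
ability = params[:n_students]
```

Sorting students by the fitted `ability` values gives an adjusted ranking: here student 0, who always wins in low-graded courses, correctly comes out on top even though a raw GPA would place him last.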
Reference
- Valen E. Johnson, ``An Alternative to Traditional GPA for Evaluating
  Student Performance'', Statistical Science, 1997, Vol. 12,
  No. 4, pages 251-278.
About the Author
Daniel Zwillinger attended MIT and Caltech, where he obtained a PhD in
applied mathematics. He taught at RPI for four years, worked in
industry for several years (Sandia Labs, JPL, Exxon, IDA, MITRE, BBN),
and has been managing a consulting group for the last five years. He
has worked in many areas of applied mathematics: signal processing,
image processing, communications, and statistics. He is the author of
several reference books in mathematics.