Article from: The UMAP Journal, Fall 1998, pages 323-327

Judge's Commentary

Grade point average (GPA) is the most widely used summary of undergraduate student performance. Unfortunately, combining student grades by simple averaging to obtain a GPA systematically biases the result against students enrolled in more rigorous curricula and/or taking more courses. An example from Larkey involves four students (call them I-IV): student I obtains the best grade in every class she takes, and student IV obtains the worst grade in every class he takes, yet student I has a lower GPA than student IV:

Class     Grades awarded (best to worst)    Class GPA
Class 1   B+, B-                            3.00
Class 2   C+, C                             2.15
Class 3   A, B+                             3.65
Class 4   C-, D                             1.35
Class 5   A, A-                             3.85
Class 6   B+, B                             3.15
Class 7   B+, B                             3.15
Class 8   B+, B, B-, C+                     2.83
Class 9   B, B-                             2.85

Student     I       II      III     IV
GPA         2.78    2.86    2.88    3.00
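One enrollment pattern consistent with these numbers reproduces the paradox exactly. The pattern below is a reconstruction for illustration (the grade lists and GPAs above do not uniquely determine who took which class); grade points are the standard 4.0 scale:

```python
# Grade points on the standard 4.0 scale.
POINTS = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7,
          "C+": 2.3, "C": 2.0, "C-": 1.7, "D": 1.0}

# One enrollment pattern consistent with the table (a reconstruction,
# not necessarily Larkey's original): student -> {class: letter grade}.
# Student I is best and student IV is worst in every class they take.
TRANSCRIPTS = {
    "I":   {2: "C+", 4: "C-", 6: "B+", 7: "B+", 8: "B+"},
    "II":  {1: "B+", 2: "C",  7: "B",  8: "B",  9: "B"},
    "III": {3: "A",  4: "D",  5: "A",  8: "B-", 9: "B-"},
    "IV":  {1: "B-", 3: "B+", 5: "A-", 6: "B",  8: "C+"},
}

def gpa(transcript):
    """Simple (unweighted) average of grade points."""
    vals = [POINTS[g] for g in transcript.values()]
    return round(sum(vals) / len(vals), 2)

for student in ("I", "II", "III", "IV"):
    print(student, gpa(TRANSCRIPTS[student]))
# I 2.78, II 2.86, III 2.88, IV 3.0 -- the best student ranks last.
```

Student I takes the hardest classes (2 and 4), while student IV takes the most generously graded ones (5, 6), which is exactly what drives the reversal.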

The MCM problem was to determine a ``better'' ranking than one using pure GPAs; this problem has no simple ``solution''. A recent paper by Johnson [1] refers to many studies of this topic, and suggests a technique that was considered, but not accepted, by the faculty at Duke University in North Carolina.

Each of the participating schools is to be commended for its efforts in tackling this problem. As in any open-ended mathematical modeling problem there is not only great latitude for innovative solution techniques but also the risk of finding no results valuable in supporting one's thesis. Solutions submitted contained a wide variety of approaches including graph theory and fuzzy logic.

Unfortunately, several teams were confused as to the exact problem the Dean wanted solved. Assigning students to deciles, by itself, was not the problem. For example, deciles could be assigned by obtaining a list of student names and placing the first 10% of the students in the first decile, etc. What the Dean wanted was meaningful student deciles, based on students' relative class performance. Simply re-scaling GPAs so that the average became lower (and the top 10% became more spread out) would not change the inherent problem.

The problem statement suggested that relative rankings of students within classes should be used to evaluate student performance. With this assumption, possible approaches include:

• relative ranking could be used with grade information

(A useful additional assumption might be that faculty would give grades based on an absolute concept of what constitutes mastery of a course.)

• relative ranking could be used without grade information

In the latter case (which was chosen by most teams), an instructor who assigns A's to all students in a class provides exactly the same information as an instructor who assigns all C's to the same students when enrolled in another class.

Specific items that the judges looked for in the papers included:

• Reference to ranking problems in other fields. For example, the ranking of chess players and professional golf players is obtained from relative performance results.
• A detailed, worked-out example illustrating the proposed method(s), even if the example contained only 4 students.
• Computational results (when appropriate) and proper consideration of large datasets. Teams that used only a small sample in their computational analysis (say 20 students) did not appreciate many of the difficulties with implementing a grade adjustment technique.
• Even though the GPA may not be as ``good'' a discriminator as the various solutions obtained by the teams, it would seem reasonable that there be some correlation between the two. Judges were pleased to see this observation mentioned, even if not used.
• A response showing an understanding of the question about changing an individual student's grade was rewarded. Such a change could legitimately affect that student's ranking, but if it also shifted many other students' ranks, then the model is probably unstable.
• An outstanding paper had a clear, concise, complete, and meaningful list of assumptions. Needed assumptions included:

• The average grade given out was an A-. (This must be assumed, as it was stated in the problem statement--amazingly, several teams assumed other starting averages!)
• When using an (A+, A, A-, ...) system, not all the grades given out were A-. (Otherwise, there is no hope of distinguishing student performance.)

Many teams confused assumptions with the results they were trying to obtain. Teams also made assumptions that were not used in their solution, were naive, or needed further justification. For example,

• Many teams assumed a continuous distribution of grades. As an approximation of a discrete distribution, this is fine. However, several teams allowed grades higher than A+ to be given out, and other teams neglected to convert the continuous distribution back to a discrete one when actually simulating grades.
• Several teams assumed that teachers routinely turn in a percentage score or class ranking with each letter grade. This, of course, would be very useful information but is not realistic.
• Low grades in a course do not necessarily imply that the course is difficult. A course could be scheduled only for students who are ``at risk''. Likewise, a listing of faculty grading does not necessarily allow ``tough'' graders to be identified: a teacher may teach only ``at risk'' students.

The most straightforward approaches to solving this problem were:
• Use of information about how a specific student in a class compared to the statistics of a class. For example, ``student 1's grade was 1.2 standard deviations above the mean, student 2's grade was equal to the mean, ...''. The numbers (1.2, 0, ...) can be used to build up a ranking.
• Use of information about how a specific student in a class compared to other specific students. For example, ``in class 1, student 1 was better than student 2, student 1 was better than student 3, ...''. This information can be used to build up a ranking.

As these techniques are quite simple, the judges rewarded mention of these techniques, even if other techniques were pursued.
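Neither technique is spelled out further in the commentary. A minimal sketch of the first (standard-score) approach, using hypothetical students and grade-point data and the mean z-score as the ranking statistic, might look like:

```python
from statistics import mean, pstdev

# Hypothetical classes: {student: grade points}.  Not data from the paper.
classes = [
    {"Ann": 3.3, "Bob": 2.7},
    {"Ann": 2.3, "Cal": 2.0},
    {"Bob": 4.0, "Cal": 3.3, "Dee": 3.0},
]

def z_scores(cls):
    """Standard score of each grade relative to its own class."""
    mu, sigma = mean(cls.values()), pstdev(cls.values())
    if sigma == 0:                      # everyone got the same grade:
        return {s: 0.0 for s in cls}    # the class carries no information
    return {s: (g - mu) / sigma for s, g in cls.items()}

scores = {}                             # student -> list of z-scores
for cls in classes:
    for s, z in z_scores(cls).items():
        scores.setdefault(s, []).append(z)

# Rank students by mean z-score across all classes they took.
ranking = sorted(scores, key=lambda s: mean(scores[s]), reverse=True)
print(ranking)  # ['Ann', 'Bob', 'Cal', 'Dee']
```

Note that a class in which everyone receives the same grade contributes nothing, matching the observation above about the all-A's and all-C's instructors.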

Other features of an outstanding paper included:
• Clear presentation throughout
• Concise abstract with major results stated specifically
• A section devoted to answering the specific questions raised in the problem statement; or stating why answers could not be given.
• Some mention of whether the data available (e.g., sophomores being ranked with only 2 years' worth of data) would lead to statistically valid conclusions.

None of the papers had all of the components mentioned above, but the outstanding papers had many of these features. Specific pluses of the outstanding papers included:

• Duke team
• Their summary was exemplary: from it alone, one could tell what they were proposing and why, what issues they saw, and what models they produced.
• Their use of least squares to solve an over-determined set of equations was innovative.
• Their figures of raw and ``adjusted'' GPAs clearly and visually showed the correlation between the two, and the amount of ``error'' caused by exclusive use of GPAs.
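The commentary does not reproduce the Duke team's formulation. One standard way to pose such an over-determined system (an assumption here, not necessarily the team's exact model) is to treat each grade as a student-ability term plus a course-difficulty term, g_ij ≈ s_i + c_j, and solve for the abilities by least squares:

```python
import numpy as np

# Toy data: (student index, course index, grade points).  Hypothetical.
observations = [(0, 0, 3.3), (1, 0, 2.7), (0, 1, 2.3), (2, 1, 2.0),
                (1, 2, 4.0), (2, 2, 3.3), (3, 2, 3.0)]
n_students, n_courses = 4, 3

# One equation per grade: s_i + c_j = g_ij.  With more grades than
# unknowns the system is over-determined, and least squares applies.
A = np.zeros((len(observations), n_students + n_courses))
b = np.zeros(len(observations))
for row, (i, j, g) in enumerate(observations):
    A[row, i] = 1.0                  # student-ability coefficient
    A[row, n_students + j] = 1.0     # course-effect coefficient
    b[row] = g

# Abilities are identified only up to a constant shift between the
# student and course terms; append one equation forcing the course
# effects to sum to zero.
A = np.vstack([A, np.concatenate([np.zeros(n_students), np.ones(n_courses)])])
b = np.append(b, 0.0)

solution, *_ = np.linalg.lstsq(A, b, rcond=None)
abilities = solution[:n_students]    # adjusted ``GPA-like'' score per student
```

The resulting abilities can then be sorted to produce the adjusted ranking, and plotted against raw GPA as the Duke team did.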

• Harvey Mudd team
• Their sections on ``Practical considerations'' and ``What characterizes a good evaluation method'' demonstrated a clear understanding of the problem.
• Their figures of ``Raw GPA'' versus ``Student quality'' clearly and visually showed the correlation between the two, and the amount of ``error'' caused by exclusive use of GPAs.
• Stetson team
• Their use of the median, as well as the mean, when comparing a specific student to the statistics of a class was innovative. (Use of the median reduces the effect of outliers.)
• They interpreted the results of rank adjustment for specific individuals in their sample.
• They showed an awareness of prior work on the problem, as indicated by the literature references in their ``Background Information'' section.
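The value of the median here is easy to illustrate with a toy class (hypothetical grades, not the Stetson team's data): one very low grade drags the class mean down but barely moves the median, changing how a middling student's grade is judged.

```python
from statistics import mean, median

# Toy class with one very low grade (an outlier).
cls = [3.7, 3.3, 3.3, 3.0, 0.7]

print(round(mean(cls), 2))   # 2.8 -- pulled down by the outlier
print(median(cls))           # 3.3 -- robust to it

# A student with a 3.0 looks above average against the mean,
# but below the typical (median) grade in this class.
```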

References

1. Valen E. Johnson, ``An Alternative to Traditional GPA for Evaluating Student Performance,'' Statistical Science 12 (4) (1997): 251-278.