Notes on Mengel, Sauermann, Zölitz (2017)

This recent paper purports to show that teaching evaluations are biased against women. It is much better than most in the area and does make a fairly compelling case for gender bias. Nevertheless I shall note some issues.

The study’s data set is impressive: 19,952 student evaluations of university faculty in courses where students were randomly allocated to instructors. Female faculty were rated lower, despite producing the same outcomes in terms of grades.

A major problem, however, is this: The evaluation forms completed by students never actually asked them to judge whether the teacher was good or bad. Here is what the students were actually asked (39):

T1: “The teacher sufficiently mastered the course content”
T2: “The teacher stimulated the transfer of what I learned in this course to other contexts”
T3: “The teacher encouraged all students to participate in the (section) group discussions”
T4: “The teacher was enthusiastic in guiding our group”
T5: “The teacher initiated evaluation of the group functioning”

When the authors say female faculty received lower evaluations, they mean a lower average score on these five items. But these five items are very poorly conceived as a way of capturing teaching quality, for the following obvious reasons.

T1 is a bad measure of teaching quality since you can master the content and still be a lousy teacher.

T4 is a bad measure of teaching quality since a teacher can be enthusiastic but ineffectual, or dry but effective.

T3 is very dubious since the pedagogical strategy of calling on reluctant students is not necessarily positive.

T5 is a bad measure of teaching quality since initiating such an evaluation is pointless if the group worked fine already. The data suggests that groups on the whole worked fine (39). If the instructor saw this and for that reason did not “initiate evaluation of the group functioning,” then it obviously makes no sense to punish this teacher in the course evaluations for not wasting class time on a needless group evaluation.

The instructor’s performance on T2 cannot, by definition, be checked by controlling for course grade. It could be that female faculty were simply worse at this. The conclusions of the study follow only if we agree that the equality of grade outcomes proves that female faculty performed equally well. But T2 specifically asks for things that go beyond the course, i.e., things that do not count toward the course grade. Hence we have no way of telling whether the students’ assessments of T2 were biased or accurate.
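The logic of the T2 point can be made concrete with a toy sketch. Everything in it is hypothetical and not taken from the paper: the two instructors, the numbers, the 1-5 rating scale, and the names `grade_effect` and `transfer_skill` are assumptions chosen purely for illustration. It shows only that identical grade outcomes are compatible with a rating gap produced by entirely unbiased raters, not that this is what happened in the study.

```python
from dataclasses import dataclass


@dataclass
class Instructor:
    name: str
    grade_effect: float    # hypothetical contribution to course grades
    transfer_skill: float  # hypothetical T2-relevant quality, invisible in grades


def course_grade(instr: Instructor, baseline: float = 7.0) -> float:
    # Grades depend only on grade_effect; transfer_skill plays no role.
    return baseline + instr.grade_effect


def unbiased_t2_rating(instr: Instructor) -> float:
    # An unbiased rater scores T2 purely on transfer_skill,
    # clipped to an assumed 1-5 scale.
    return max(1.0, min(5.0, instr.transfer_skill))


# Two made-up instructors with the same grade outcomes.
a = Instructor("A", grade_effect=0.5, transfer_skill=4.0)
b = Instructor("B", grade_effect=0.5, transfer_skill=3.5)

same_grades = course_grade(a) == course_grade(b)
rating_gap = unbiased_t2_rating(a) - unbiased_t2_rating(b)

print(same_grades)  # True
print(rating_gap)   # 0.5
```

Controlling for grades cannot distinguish this scenario from one in which the two instructors are equally good at T2 and the gap is pure bias, since both scenarios produce the same observed grades and the same observed ratings.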

In sum, the supposed evaluative measure of teaching quality is not a measure of teaching quality at all. The assumption—essential for the study’s conclusions—that equality of grade outcomes means equality of instructor performance on T1-T5 is unwarranted.

There are some grounds to nevertheless maintain the authors’ interpretation. One is that the bias seems to cut somewhat uniformly across T1-T5, suggesting that the students harbour a blanket or generic depreciation of female faculty rather than giving thoughtful and reliable answers to each item separately. At least this is indicated by the only data we have showing a breakdown of the items T1-T5 one by one (Table B3). Unfortunately, we have such data only for graduate student instructors. There is reason to think that this is the instructor group that most strongly confirms the authors’ thesis of gender bias, for the bias against female faculty “is larger for mathematical courses and particularly pronounced for junior women” (abstract). This could be due to stereotype bias. Alternatively, it could be due to gender bias in favour of women in graduate student recruitment. The fact that evaluations are lowest among junior female instructors and in mathematical fields would then be a reflection of the fact that these fields have lately been very aggressive in recruiting women at all costs.

Another argument for the authors’ interpretation is the fact that the gender bias is “driven by male students’ evaluations” (abstract). If female faculty were genuinely worse, wouldn’t female students too recognise this? Maybe. But an alternative explanation could be that female faculty are especially supportive of female students, so that the differing evaluations by student gender reflect a genuine difference in the quality of instruction received. The authors themselves note that this is by no means an outlandish hypothesis: “Female students receive 6% of a standard deviation higher grades in non-math courses if they were taught by a female instructor compared to when they were taught by a male instructor. … This might be evidence for gender-biased teaching styles.” (30) Note also that it is easy to imagine how T3 in particular could reflect such bias.

One reason to think that the students are not entirely off the mark in their evaluations is how their judgement develops over time. “The bias for male students is smallest when they enter university in the first year of their bachelors and approximately twice as large for the consecutive years. For female students, we find that only students in master programs give lower evaluations when their instructor is female, but not otherwise.” (30) You would think that students would get better rather than worse at judging teaching quality in the course of their education.

Here’s another point:

“Strikingly, despite the fact that learning materials are identical for all students within a course and are independent of the gender of the section instructor, male students evaluate these worse when their instructor is female.” (3)

Two possible explanations suggest themselves:

(a) The students are blinded by bias and cannot evaluate the course materials objectively. They let their prejudice against the female instructor cloud their judgement even on this question, which has nothing to do with her.

(b) Female instructors were less good and hence unable to highlight and bring out positives and insights in the course materials, thereby making the course material seem less good. Hence lower evaluations of instructors and course materials go hand in hand.

Of course the authors suggest (a). But the supposed logic behind this is somewhat dubious. If male students hate women, shouldn’t their evaluation of the textbook be based on the gender of the textbook author? If they are driven by and seek to express their dislike of the female instructor, and the textbook was written by a male author, shouldn’t they rate the textbook higher rather than lower, so as to convey that it was the particular instructor rather than the course materials that was at fault? In fact, if the students had done precisely this, then that too could have been used as evidence of their blatant gender bias. Thus two completely different outcomes could both be spun as clear evidence of gender bias. This suggests that we should be careful before jumping to the conclusion that the data confirms our favoured hypothesis.