Boring (2015) is a prominent study of gender bias in student evaluations of teaching, based on a large dataset of thousands of student evaluations of university social science courses.
The relevant data is in Table 1 (page 18) of the paper.
The final exams were graded double-blind by a third party (48), so they can be considered an objective measure of learning outcomes.
In a nutshell: male students rate male teachers much higher, even though they do no better under them. Female students rate male teachers slightly higher, even though they perform slightly better under female teachers.
Thus: Evaluations are biased in favour of male instructors. “The first main result is that gender biases exist.” (5)
Maybe so, but let’s look at the more fine-grained evaluation components in the table. To which specific aspects of teaching do students attribute their higher ratings of male teachers? Note that the answer is quite unequivocal: men are significantly better on “animation & leadership,” “current issues,” and “intellectual development,” and pretty much equal to women on the rest. Both genders of students largely agree on this. This clear pattern suggests that there are underlying differences in teaching approaches between male and female teachers, not just crude, across-the-board gender bias.
In fact, note that “current issues” and “intellectual development” are things that may be very desirable in an education generally but not directly relevant to a specific course examination. So it could be that the male teachers were better at imparting those more general skills, but no better at course-specific aspects like “organization,” “instructional materials,” etc. This hypothesis would give a rational explanation for virtually the entire table without the need to assume that the students were biased. According to this hypothesis: male and female teachers were equally good at the course-specific aspects; therefore they produced equal outcomes at the final exams; but male teachers were better at the bigger-picture aspects “current issues” and “intellectual development,” which, although not directly relevant to the exam, were nevertheless valued by students.
Even the differences by student gender can be explained by this hypothesis: male teachers and male students alike have a particular interest in those bigger-picture aspects, while female teachers and students alike are more focussed on the particular course at hand. If so, that would explain why male students gave the male teachers a greater “boost” in evaluations than the female students did.
It seems to me that this hypothesis is a simple and unified way of explaining many nuances of the data that are completely unexplained on the crude hypothesis of across-the-board gender bias.
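To make this concrete, here is a toy simulation in Python. Every parameter in it is invented purely for illustration and is not drawn from Boring (2015); the point is only that, once the hypothesis's assumptions are written down (including its simplification that exam outcomes are equal across teacher genders), the qualitative pattern sketched above falls out mechanically, with no rater bias anywhere in the model.

```python
# Toy model of the alternative hypothesis. All numbers are invented for
# illustration; nothing here is taken from Boring (2015).

import random

random.seed(0)

def simulate(teacher_gender, student_gender):
    # By hypothesis, course-specific skill ("organization",
    # "instructional materials", etc.) is equal across teacher genders.
    course_skill = random.gauss(0.0, 1.0)

    # By hypothesis, big-picture skill ("current issues",
    # "intellectual development") is higher on average for male teachers.
    big_picture = random.gauss(0.5 if teacher_gender == "M" else 0.0, 1.0)

    # Exam outcomes depend only on course-specific skill, so they are
    # independent of teacher gender.
    exam = course_skill + random.gauss(0.0, 0.5)

    # Evaluations also reward big-picture skill, and (by hypothesis)
    # male students weight it more heavily than female students do.
    weight = 0.8 if student_gender == "M" else 0.3
    evaluation = course_skill + weight * big_picture + random.gauss(0.0, 0.5)
    return exam, evaluation

def mean(xs):
    return sum(xs) / len(xs)

for sg in ("M", "F"):
    for tg in ("M", "F"):
        pairs = [simulate(tg, sg) for _ in range(100_000)]
        exams = [e for e, _ in pairs]
        evals = [v for _, v in pairs]
        print(f"student {sg}, teacher {tg}: "
              f"mean exam {mean(exams):+.3f}, mean eval {mean(evals):+.3f}")
```

Running this prints exam means near zero in every cell, while the evaluation means favour male teachers, with a noticeably larger gap among male students than among female students: exactly the shape of the pattern in the table, produced without any bias in the raters.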
I do not particularly believe this hypothesis. My point is not that it is true and that there is no gender bias; I think a fairly decent case can be made for some gender bias here. My point is only that if we are to work objectively with data, then we must keep an open mind toward alternative hypotheses such as this, instead of jumping to the gender-bias conclusion with blinders on. Gender bias may be a problem, but so is confirmation bias in favour of predetermined conclusions.
The only main point of the data that my hypothesis does not explain is the evaluation item called “quality of animation & ability to lead.” This seems to me a very stupid item, and I am inclined to dismiss it altogether. First of all, animation and leadership are surely very different things, so they should not be combined in one item. Furthermore, in the appendix where the author supposedly gives the actual questionnaire given to the students, the corresponding question simply reads: “How do you evaluate your teacher’s class leadership skills?” (61) Is this what students were asked? Then why has the author inserted “quality of animation” into this item’s heading in the data table?
What is “class leadership” anyway? Does it mean leading discussions? If so, it would seem related to many of the other items, like “communication skills,” “organization of classes,” “usefulness of feedback,” etc. Yet men score high on “leadership” but not on these other items, so leadership must mean something else. Could it mean something like being strict and firm with deadlines, for example? But then shouldn’t it be related to “clarity of assessment,” which it is not in the data? In sum, the question about “class leadership” seems to me much too vague and poorly formulated to be taken seriously. With such a vague question, it would not be surprising if students, lacking any way to answer it meaningfully, fell back on some stereotype about men being “leaders.” But this would say nothing at all about the course.
The author in fact has her own way of explaining the more fine-grained scores on the student evaluations, which goes as follows:
“The second main result I find is that students rate teachers in different dimensions of teaching according to gender stereotypes of female and male characteristics. … Students give more favorable ratings to women for teaching skills that require a lot of work outside of the classroom, such as the preparation and the organization of the course content, the quality of instructional materials, and the clarity of the assessment criteria. … Male teachers, however, tend to obtain more favorable ratings by both male and female students in less time-consuming dimensions of teaching, such as quality of animation and class leadership skills.” (5)
In my opinion, this explanation doesn’t fit the data at all (and certainly not as well as my hypothesis above). The author is trying to force gender stereotypes into the picture, but her attempt is quite absurd. Why would “clarity of assessment criteria” “require a lot of work outside of the classroom”? On the contrary, a lazy teacher will typically make up simple assessment criteria that are very clear indeed, since this makes life easy for the teacher. And I suppose no one ever works on their “leadership” and “animation” skills; they are just innate? What a strange assumption. In any case, it is very dishonest and deceptive to focus on these items and claim they are representative when, as we clearly see from the data, male teachers also scored much higher on “ability to relate to current issues,” which is clearly something that could cost a teacher a lot of preparation time: it is much easier to just “teach from the book” the same way year after year.
In conclusion, I believe the research literature is biased in favour of the gender-bias interpretation. Although, as we have seen, other interpretations are perfectly plausible, the author pushes only the gender-bias interpretation. And she pushes it too far when she writes that “male students give much higher scores to male teachers … in all dimensions of teaching” (5), and when she claims that gender stereotypes explain the item-by-item variation in the evaluations. Yet it is precisely these unwarranted conclusions that others pick up and cite as the study’s takeaways.