Notes on Mitchell & Martin (2018)

A widely publicised new study claims to find gender bias against women in student evaluations of teaching. There are massive methodological problems with it.

The study concerns the evaluations of precisely two teachers: one male, one female. Obviously no sane human being would draw general conclusions about gender from two individuals. But that is apparently what passes for peer-reviewed research in this field.

In the reported data, the female instructor had 167 students and the male instructor 51, in one semester. Could it be that a teaching load more than three times as high has a negative impact on teaching quality? And that this, rather than gender, was the key difference between these two instructors? This possibility is not considered by the authors.
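The arithmetic behind these figures is easy to check. Below is a minimal sketch in Python, using only numbers that appear in these notes: the data-point counts of 1,169 and 357 and the 7 data points per student come from the correction note at the end. The variable and function names are my own, chosen for illustration.

```python
# Data-point counts per instructor, as given in the correction note
# at the end of these notes. The dictionary keys are illustrative labels.
DATA_POINTS = {"female instructor": 1169, "male instructor": 357}

# Each student contributed 7 data points (also from the correction note).
POINTS_PER_STUDENT = 7


def students_from_data_points(data_points, per_student=POINTS_PER_STUDENT):
    """Convert a count of data points into a count of students."""
    students, remainder = divmod(data_points, per_student)
    if remainder:
        # A nonzero remainder would mean the per-student figure is wrong.
        raise ValueError("data points are not a whole multiple of per_student")
    return students


students = {
    name: students_from_data_points(count)
    for name, count in DATA_POINTS.items()
}

# Ratio of the two teaching loads, measured in students.
load_ratio = students["female instructor"] / students["male instructor"]

print(students)              # {'female instructor': 167, 'male instructor': 51}
print(round(load_ratio, 2))  # 3.27
```

Both data-point counts divide evenly by 7, which is consistent with the corrected student numbers, and the ratio of roughly 3.27 is what "more than three times" refers to.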

The researchers also threw away at least half of the data. We have no idea what it said. Here is their justification for this:

> Students had a tendency to enroll in the sections with the lowest number initially (merely because those sections appeared first in the registration list). This means that section 1 tended to fill up earlier than section 3 or 4. It may also be likely that students who enroll in courses early are systematically different than those who enroll later in the registration period; for example, they may be seniors, athletes, or simply motivated students. For this reason, we examined sections in the mid- to high- numerical order: sections 6, 7, 8, 9, and 10. (Supplementary materials, page 4)

This is crazy. The authors are openly admitting that they purposefully selected a non-representative sample, one that by their own admission is likely to exclude certain types of students. Why on earth would you do this? Why not sample, for instance, all odd-numbered sections, and hence get a sample that includes early- and late-registering students in representative proportions?

I can think of one reason. The authors of course knew that if they found no gender bias their study would go unpublished and would have been a waste of time, whereas if they found gender bias they could get it published in a Cambridge University Press journal and featured on Slate. So they had every incentive in the world to ensure that the data came out the way they wanted. And if you are allowed to study only two instructors, and arbitrarily discard half the data on nonsensical grounds, then it is not difficult to prove anything you want.

Also extremely problematic is that the teachers in question were the researchers themselves. This is obviously a terrible idea methodologically speaking. It is not far-fetched to think that their obvious incentive, as researchers, to find gender bias influenced their behaviour as instructors. The standard practice of keeping studies double-blind exists precisely to guard against such contamination. This study is about as far from double-blind as you can get.

For instance, the authors make a big fuss about how the evaluations more often referred to the male instructor as “professor” and the female instructor as “teacher” despite equal credentials. This is obviously the kind of thing that can easily be manipulated by the instructor. Throughout the semester they would have had every opportunity to plant the language they wanted the students to use.

If the authors wanted to ensure the desired outcome of the study, they could, for example, distribute the evaluations with different prompts. The male instructor can tell the students: “These evaluations are important. The university uses them to decide which professors get their positions renewed.” Now you have planted the terminology of “professor”, and also made students apprehensive about being critical, since you have made them think about the possibility of you being fired. The female instructor, meanwhile, might say while handing out the evaluations: “Honest feedback is important to me as a teacher.” Now you have primed the students to refer to you as a “teacher”, and encouraged them to speak freely without holding back, since you have implied that the evaluations are for your own use and that you value honest feedback. The danger of such contamination of the data is vastly greater when the instructors in question have a blatant vested interest in ensuring a particular outcome, as they do in this study.

[Zigerell corrected a mistake in an earlier version of the above: I reported the number of students as 1,169 and 357 respectively for the two instructors, but these were the numbers of data points, and there were 7 data points per student.]