An article in the 26 July issue of Times Higher Education holds some interest for academics, educationalists and others concerned with the increased managerialism of the contemporary university. A long-running dispute between the Canadian academics’ union and Ryerson University went into arbitration, with the arbitrator finding that student evaluations of teaching (SETs) could no longer be used for promotion and tenure because SETs are imperfect, unreliable and ‘downright biased’.
While many academics have long held that student evaluations are being used to monitor, direct and punish staff rather than to improve education, the evidence for their effectiveness as a marker of teaching and learning has largely gone unexamined. But the closer one gets, the more it becomes evident that the emperor has no clothes. The evidence shows that SETs are tests of teacher popularity, and that unreconstructed and subconscious racial and gender biases are involved. SETs, it seems, are generally useless for finding out anything about teaching practices or learning, although they seem to be effective in cowing staff. In a comic variation on ‘teaching to the test’, Australia and most other Anglophone countries have gone down the SET cul-de-sac. Instead of teaching to the test, we are teaching to the SET! Effectively, staff popularity with students has been equated with teaching ability.
Each university has a different way of collecting this information, but mostly it is gathered online by asking questions about a lecturer or tutor’s performance. Every year I look forward with trepidation or glee—depending on my desire to retire in the near future, or not—to my student evaluations. One year I received a letter from my supervisor asking me to explain my poor scores, while the following year’s letter congratulated me on my performance. The difference was totally unrelated to any change in teaching methods on my part; maybe it was due to my haircut. In any case, I am sure these letters and my scores are on my human-resources file.
The article piqued my interest as to what the available evidence says, and I have almost gone blind reading sophisticated statistical analyses of quantitative studies of student evaluations. I feel that I deserve a promotion just for working my way through them! In the end, they say exactly what most university teachers have been saying for years: SET scores are unrelated to student learning or to the quality of academic teaching. They are, however, related to the racial, sexual and gender biases of the young people who take part in them. If you are an academic who is young, male, good-looking and white, you are likely to be rated more highly than a middle-aged woman who might be a lesbian. In a national survey of British students, investigators from the University of Reading found that students were most satisfied when they had been taught by academics who were male, white, had a PhD and were on fixed-term contracts. In the United States, African American academic teaching staff were considered substantially less intelligent and competent than white staff, by both white and non-white students.
Other problems with using SET scores are more technical but perhaps more pernicious. Most universities use averages of SET outcomes to rate individual staff vis-à-vis faculty and course outcomes. The size of the group that fills in the class survey is crucial here. Academics with small groups of students are much more at the mercy of outlier responses, luck and error than academics teaching larger groups, yet those with small response rates are routinely compared to those with large numbers of students and large response rates. Averages also give no indication of the peaks and troughs of responses. Most importantly to any statistician worth their salt, the scores are ordinal categorical variables: what statisticians call ‘labels’, not ‘values’. They are descriptions given a number, but they have no real numerical meaning. On a 7-point measure, the difference between 7 (outstanding) and 6 (very good) may not mean the same thing as the difference between 1 (terrible) and 2 (tolerable), yet the scores are averaged as though a continuous numerical scale ran between them.
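Both distortions are easy to demonstrate. The sketch below, in Python and using invented ratings purely for illustration, shows how a single disgruntled respondent drags a small class’s average far below that of a large class even when every other rating is identical, and why a median (or a full frequency count) is a more defensible summary of ordinal labels.

```python
import statistics

# Invented ratings on a 7-point ordinal scale, for illustration only.
# Two teachers receive the same modal rating (6), but one teaches a
# seminar of 8 students and the other a lecture of 200.
small_class = [6] * 7 + [1]    # one disgruntled outlier
large_class = [6] * 199 + [1]  # the same outlier, diluted

print(statistics.mean(small_class))  # 5.375 -- dragged well below 6
print(statistics.mean(large_class))  # 5.975 -- barely moved

# Averaging also assumes equal intervals: the gap between
# 'outstanding' (7) and 'very good' (6) is counted as identical to
# the gap between 'terrible' (1) and 'tolerable' (2), an assumption
# the labels themselves do nothing to justify.
print(statistics.median(small_class))  # 6 -- the median resists
print(statistics.median(large_class))  # 6    the outlier entirely
```

Any ranking of staff built on those two averages would punish the seminar teacher for nothing more than having a small class.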
Some studies have looked at the relationship between SET scores and future results. For example, some have compared the SET scores given to the teacher of a first-level subject with the same students’ outcomes at the next level. Researchers have found that student performance in the second-level subject is negatively related to their satisfaction with the lower-level one. The researchers posited that students may have liked the harder (initial) subject less, but they actually learned more because it was harder. Indeed, other research has found that lenient teachers are better liked by their students, but this is not related in any way to how much the students have learned. A meta-analysis of almost 100 studies of SET scores and higher-education results found that students did not learn more from professors with higher SET scores. Worse still for many academics, students with less ability are likely to overestimate their capabilities and become angry when their teachers do not recognise their (overestimated) skills; thus the likelihood of a poor student evaluation increases.
In one very large study, students at a French university, across a variety of disciplines, were assigned to different classes taught by male and female teachers but all sat the same end-of-year exams. They rated male teachers more highly than female teachers, and these ratings were negatively associated with their final results. In a US study, students taking an online course were randomly allocated to teaching assistants whom they never saw. The teaching assistants were identified to students as male or female, regardless of their actual gender. Again, those students who thought their teacher was male considered that teacher more competent and effective, and again the assessments bore no relation to their final results.
SETs have contributed to what some academics claim is a lowering of standards and a less rigorous education overall. There has been significant grade inflation (giving students higher marks than in the past), which is likely the result of more lenient marking, and, as noted above, SETs reward lenient teachers over more demanding ones. Assessments have become smaller, shorter and less critical or analytical (at least in the social sciences and humanities). Institutions like mine have forced on us a system of rubrics, which means that every assessment must carry an outline of all the factors that will be marked, what each category looks like and what each is worth in the final outcome. No need to learn from experience when you can just tick boxes! There is no doubt that dumbing down is inherent in such a system.
So, knowing all of this, the question has to be why there is such reliance on SETs in academic settings. The answer, if we are going to measure teacher effectiveness in quantitative terms at all, is that large-scale SETs are easy to administer and cheap to run. Effective teaching, though, is a complex matter, and to understand and improve it meaningfully would take more time and money than most universities are willing to expend. Good evaluation of teaching and learning would involve peer assessment, ethnographic studies of classroom behaviour, and staff willing to be open to such intrusions without compromising their position at the university. Certainly, a staff member’s tenure or performance review should not rely on student evaluations as they currently exist.
SETs look objective because they rely on numbers, even if those numbers are meaningless. Metrics are used in all sectors of our late-modern/technologically obsessed society and it would be surprising if we didn’t see them in the university sector, too. But metrics are meaningless when the relationship between what is being studied, how it is being studied and what it is being studied for becomes so skewed. I refer to Robert DiNapoli’s ‘A Supplement to Ambrose Bierce’s The Devil’s Dictionary’ (Arena Magazine no. 155) and his definition of ‘metric’: ‘the science of reducing the organic texture of reality to sequences of numbers for the purpose of evaluating performance’. There seems to me no more apt description of a SET. Bruce Buchan’s article in the same issue, ‘Look on My Works, Ye Mighty, and Despair’, eloquently illustrates the damage to the humanities and social sciences due to the relentless obsession with online teaching and learning. Students are considered consumers of education now, and it would appear that consumer satisfaction counts for more than skill or knowledge.
More importantly, reliance on SETs allows for a form of managerial control of academic staff that has undermined the professional and democratic platform on which higher education was ostensibly based for 500 years or more. It is hard to believe that the Australian system will change the direction it seems to have taken, but this is not a lost cause. The University of Southern California has voluntarily undertaken a review of SETs and how they are used, and other universities in the United States are also going down this path. SETs do give an indication of students’ experience and teachers’ performance (charm, lucidity, hairstyle), but not of academic knowledge or teaching effectiveness. Don’t get me started on rubrics.