“Value-added” performance measures, which purport to rate teachers by whether a year of their instruction produces a year of learning gains in their students, are the new high-stakes method of deciding who goes and who stays in K–12 education. In pursuit of competitive grants through the American Recovery and Reinvestment Act of 2009 (ARRA), most states (including Michigan) adopted requirements for the use of “effectiveness data” in determining compensation levels for teachers and principals. This data is also, increasingly, being used to justify firing “ineffective” educators. I will limit myself here, for reasons of space, to discussing such performance evaluation schemes for teachers. I have several concerns about them.
Are they accurate?
Can we all agree that the point of any teacher evaluation system should be to improve the quality of teaching? If so, then such systems must, first of all, accurately assess the existing quality. If they do not, in fact, fairly represent the strengths and weaknesses of teacher performance, then they are worthless for the purpose of improving it.
This — the blatant inaccuracy of the evaluations — is the root reason for teacher anxiety and loathing regarding the “accountability” systems to which they are increasingly subject. While many such schemes advocate “multiple measures” of effectiveness, too many are based exclusively on students’ standardized test scores. The error rates in such data are simply unacceptable for high-stakes decisions. I imagine that teachers would not fear evaluations that mistakenly offer them extra help in improving their professional practice, but they would find it unacceptable to lose their jobs over bad data. Wouldn’t you?
The answer to the question posed above is no: student test scores are imprecise measures that do not fairly and accurately capture student learning outcomes. The U.S. Department of Education’s Technical Methods Report “Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains” [Schochet & Chiang, Mathematica Policy Research, July 2010, http://ies.ed.gov/ncee/pubs/20104004/] concludes that typical value-added performance evaluations of teachers are unacceptably imprecise. Specifically, “in a typical performance measurement system, 1 in 4 teachers who are truly average in performance will be erroneously identified for special treatment, and 1 in 4 teachers who differ from average performance by 3 to 4 months of student learning will be overlooked.” And that assumes the use of three years’ worth of data; the reliability is approximately half as good with only one year of data. Even these error rates, they note, are greatly understated, because the analysis assumes that students are randomly assigned to schools and to teachers; there are no controls, therefore, for differences in resources among schools or for differences among student cohorts.
The authors reference “findings from the literature and new analyses that more than 90 percent of the variation in student gain scores is due to the variation in student-level factors that are not under control of the teacher. Thus, multiple years of performance data are required to reliably detect a teacher’s true long-run performance signal from the student-level noise.”
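To make the signal-versus-noise argument concrete, here is a toy simulation of my own. The numbers are made up purely for illustration (a teacher-effect spread of 1 unit against student-level noise of 3 units, so roughly 90 percent of gain-score variance is student-level noise, and a hypothetical class size of 25); nothing here reproduces the report’s actual statistical model. It shows how often a truly average teacher would be falsely flagged as low-performing with one year of data versus three.

```python
# Toy illustration (my own invented numbers, not the Mathematica model):
# assume ~90% of gain-score variance is student-level noise.
import random
import statistics

random.seed(1)

TEACHER_SD = 1.0   # spread of true teacher effects (arbitrary units)
NOISE_SD = 3.0     # student-level noise (variance 9 vs. 1 -> ~90% noise)
CLASS_SIZE = 25    # hypothetical students per class per year

def observed_mean(true_effect, years):
    """Average observed gain over `years` classes taught by one teacher."""
    scores = [true_effect + random.gauss(0, NOISE_SD)
              for _ in range(CLASS_SIZE * years)]
    return statistics.mean(scores)

def false_flag_rate(years, trials=20000, cutoff=-1.0):
    """How often a truly AVERAGE teacher (true effect = 0) lands below a
    cutoff set one teacher-sd below the mean, given `years` of data."""
    flagged = sum(observed_mean(0.0, years) < cutoff
                  for _ in range(trials))
    return flagged / trials

for years in (1, 3):
    print(f"{years} year(s) of data: "
          f"false flag rate ≈ {false_flag_rate(years):.3f}")
```

Under these invented parameters, the misclassification rate shrinks sharply as years of data accumulate, which is the report’s core point: student-level noise overwhelms the teacher’s true signal unless many observations are averaged.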
Moreover, the paper’s authors note that the error rates they analyzed are only one factor (they list six others) that must be taken into account in designing and appropriately using any value-added estimator of performance. Failing to attend to such features in design and application makes the use of such schemes for high-stakes decisions both ineffective in achieving their stated purpose and manifestly unjust.
Do they improve teaching and learning?
The inaccuracy and unfairness of judging teacher performance almost entirely by student standardized test scores is only one problem with value-added performance evaluations. If we are truly interested in improving educational outcomes — and that is the point, isn’t it? — we must also consider the collateral damage they do.
• They undermine collaboration, which effective schools must encourage. Scoring teachers in ways that make them compete against one another inhibits or even discourages the sharing of lore and techniques that can make everyone better teachers. I recall the havoc created under Jacques Nasser at Ford several years ago when engineers were force-ranked in their evaluations (my late husband was coerced into doing some of the ranking then). This zero-sum game, in which one person winning meant another losing, penalized working together and encouraged sabotage of colleagues. It has taken many years to recover from the damage done. Is this the aim of folks who want schools to be “run more like a business”?
• They distort educational decision-making. Which classes students are placed in (should they be challenged, or slotted where they can safely deliver higher test scores?); which students are held back or eased out of a school (dumped by charters or shunted into “alternative” schools); what is emphasized in classes (only what is tested); what level of thinking is encouraged (rote memory versus analysis or independent thought) — every one of these decisions is pushed toward practices that explicitly undermine good education.
• They are contradictory and inconsistent. On the one hand, today’s reformer bloc assumes that anyone can teach — or can administer a school or a district, for that matter. They find solutions in recruiting and placing “outsiders” through Teach for America or via the Broad Superintendents Academy, for example, with brief training and no experience. Yet the punitive teacher evaluation systems are predicated on the assumption that firing “bad” teachers and principals will improve schooling. There is no sense that these people, who have committed time, money, and working years to their professions, can or should be helped, instead, to improve their practice.
I recognize that public education is inherently political, in that we ultimately do what the people want. That does not mean, however, that those of us on the inside cannot try to change the tenor and direction of the public conversation. I deeply believe that the punitive tone of today’s education reform talk undermines the putative aims of reform. Blaming allegedly incompetent educators for all that is wrong in our society may get the rest of us off the hook (and, I add cynically, may serve the unstated interests of certain politicians and of certain corporations that benefit enormously from public spending on testing, textbooks, software, distance learning, et cetera ad nauseam), but scapegoating will not fix those problems. Never has and never will.
What would work?
I cannot stop at stating the problem but must at least make reference to my own preferred solutions. Teacher performance schemes are not all the rage just because we, collectively, want to blame teachers for things not under their control. All of us and all of our children have experienced at least one truly terrible teacher in our lives, and all of us would like to spare others that experience. So, how should teachers be evaluated?
Classroom observation is a time-honored way of seeing how teachers actually teach, even if it is rarely done often enough or with enough specific, useful feedback to make a difference. Teacher Larry Ferlazzo writes that effective observers “know our school, our students and me — and [have] judgment and skills I … respect. I know they are genuinely concerned about my professional development. They understand that helping me improve my skills is the best thing they can do to help our students…. These purposeful visits have produced detailed and helpful feedback that has made me an even better educator.” The Mathematica report notes that “value-added measures and principals’ assessments of teachers, in combination, are more strongly predictive of subsequent teacher effectiveness than each type of measure alone” [emphasis added].
Multiple data points regarding student assessment are required for valid measurements of student learning. A single set of pre- and post-tests annually is insufficient. Teachers working together to create common curricula and assessments for the same classes can not only assure that all students get the same exposure to a subject, but also help one another jointly improve their classroom practice to achieve similarly high levels of learning. This kind of professional collaboration actually does produce the result we say we want from performance evaluation systems: teachers helping one another to consistently become better at their craft, so that students can perform better not just on tests but in life.
For the point of the evaluation is not just to find out how well or how poorly teachers are performing. It is — or at least it should be — to make them better. If we do not believe that teachers, too, can learn, then we do not believe in education at all.
Note: the statistical mathematics used in the Mathematica study cited goes beyond any “expertise” I developed in a single statistical methods course 40 years ago, so my analysis is subject to error. The conclusions that are not direct quotations may inadvertently misrepresent the study.
Publication URL corrected 8 Jan 11.
May 2011 addition: you have got to see this animation summarizing Dan Pink’s Drive work on motivation and incentives!