Teacher Tests Test Teachers

The Houston teachers union scored a legal victory in May when a federal judge found that the Houston school district's system of evaluating teachers could violate due process rights. The lawsuit centered on the system’s use of value-added modeling (VAM), a controversial statistical method aimed at isolating a teacher’s effectiveness based on their students’ standardized test scores.

United States Magistrate Judge Stephen Smith concluded that the metric's impenetrability could render it unconstitutional. If, he wrote, teachers have “no meaningful way to ensure” that their value-added ratings are accurate, they are “subject to mistaken deprivation of constitutionally protected property interests in their jobs.” More specifically, he continued, if the school district denies its teachers access to the computer algorithms and data that form the basis of each teacher's VAM score, it “flunks the minimum procedural due process standard of providing the reason for termination ‘in sufficient detail to enable [the teacher] to show any error that may exist.'”

It's unclear whether the Houston school district will now negotiate a settlement with the teachers union or end up back in court, but either way, the decision comes at a significant time for the test-based accountability movement, which has faced a number of legal and political challenges over the past several years. The outcomes of the court battles have so far been a mixed bag: Teachers challenging VAM have scored some wins and lost other big cases, and a few major suits are still pending. Outside the courtroom, states have begun implementing the new federal education law, the Every Student Succeeds Act, which puts far less pressure on the states to use VAM or similar measures than they faced during the Obama administration.

Donald Trump's education secretary Betsy DeVos has also signaled she's less interested in using test scores to define school performance. (“I'm not a numbers person in the same way you are,” she said in March, in response to a question about measuring school success. “But to me, the policies around empowering parents and moving decision-making to the hands of parents on behalf of children is really the direction we need to go.”) Considering all this, some experts have gone so far as to say that regardless of what ends up happening in the judicial system, the political momentum for using test-based accountability measures is all but over.

THE MOVEMENT FOR teacher accountability isn’t much older than many schoolchildren. In 2009, an education reform group known as The New Teacher Project (TNTP) issued an influential report finding widespread “institutional indifference to variations in teacher performance.” TNTP reported that less than one percent of teachers in their study received “unsatisfactory” performance reviews, with most teachers receiving ratings of “good” or “great.” TNTP recommended an overhaul of teacher evaluations, urging districts to develop systems that rate teachers “based on their effectiveness in promoting student achievement”—which meant evaluating them by their students’ scores on standardized tests.

The report heavily influenced the Obama administration's $4 billion Race to the Top program, which rewarded states that created new evaluation systems based on student test scores and value-added modeling. (The administration also used No Child Left Behind waivers to incentivize similar policies.) According to the National Council on Teacher Quality, 43 states revamped their teacher evaluation systems to include student achievement as a “significant or the most significant factor” by 2013, up from just 15 states in 2009.

Many of these policies had the effect of shifting accountability systems away from the school level (where it was emphasized under No Child Left Behind) to the teacher level. Advocates for this shift cited research showing the importance of teacher quality, though critics argued that measuring student growth at the school level was a fairer and more reliable way to use the statistical tools. Not surprisingly, teachers overwhelmingly opposed the shift. A 2014 Gallup poll found that nearly nine in ten teachers felt linking teacher evaluations to student test scores was unfair, and 78 percent felt that all the testing was taking too much time away from teaching.

By 2015, the anti-testing backlash had gained steam across the country, in part because the federal government had pushed for test scores to be used to evaluate teachers across all grades and subjects. States had begun to require assessments in such traditionally untested areas as art and the early elementary grades. Parents, teachers unions, and conservatives rallied together for a rollback of federal testing mandates. With the enactment of the Every Student Succeeds Act in late 2015, they succeeded.

Not only does ESSA reduce standardized testing, it also voids some of the Obama-era waivers that incentivized states to adopt test-based teacher evaluations. In 2016, pro-test education reformers were also frustrated to learn that despite the widespread implementation of new evaluation systems under Obama's tenure, the overwhelming majority of teachers were still receiving high ratings. Reformers had hoped these measures would help identify “ineffective” teachers and lead swiftly to their removal, in addition to rewarding “effective” teachers with new incentives. They held up Washington, D.C.’s reforms as a successful model to emulate, though it’s become clear that the nation’s capital is something of an outlier.

Even before the testing wave had begun to recede, though, some experts had been warning of the legal risks associated with VAM and similar statistical tools. In 2012, education law professors Preston Green and Joseph Oluwole, and education finance professor Bruce Baker, published an article outlining specific legal and policy problems with VAM and teacher evaluations, focusing on due process challenges, equal protection challenges, and disparate impact firings.

Major litigation against VAM quickly followed. Unions brought lawsuits arguing that the measures were arbitrary and capricious, that they unfairly penalized teachers who taught more disadvantaged students, and that they were being inappropriately used to measure things they were not designed for.

The lawsuits have partly been fueled by debates within the academic community over whether it's even scientifically valid to use these measures to evaluate teachers. These debates have not been settled. Some researchers say the statistical growth measures fail to adequately control for all the disadvantages students face outside their classrooms, meaning evaluative scores may be less “objective” than some supporters claim. Other researchers found evidence that the same teachers could receive different value-added scores depending on what types of tests their students took, and others found that scores could vary significantly from year to year for no discernable reason. A complicating factor for VAM supporters has been that even when high-quality research studies showed that VAM could be theoretically used in ways that reduce some critics’ concerns, many states implemented their test-based systems in ways that ignored these recommended practices.

ONE LESSON THAT TEACHERS and their unions have learned over the past several years is that the courts are unlikely to overturn school district policy, even when they agree it’s unfair. If a teacher sues on the basis that a policy unconstitutionally denies them “substantive due process” or equal protection, a judge will consider their complaint under what’s known as a “rational basis analysis,” meaning the judge will look to see if the policy can be shown to have any kind of rational relation to a legitimate government interest. If it can, even if only vaguely, the courts are unlikely to intervene.

“These testing cases are always hard for teachers to win,” says Preston Green, an education law professor at the University of Connecticut. “A ‘rational basis analysis' is a low bar for the government to satisfy, and a very hard one for plaintiffs to overcome.”

Take this major VAM case in Florida: In 2013, the National Education Association and its Florida affiliate filed a federal lawsuit challenging a state law that required at least half of a teacher’s evaluation to be based on VAM. In practice, this meant that teachers in non-tested grades and subjects were graded based on the test scores of students they didn’t teach. For example, one plaintiff was a first-grade teacher evaluated based on the third-grade test scores of students she herself never taught. Another was a high school math teacher who mostly taught juniors and seniors, but had her VAM score calculated on the basis of freshman and sophomore reading scores. Together, the seven public school teacher plaintiffs in Cook v. Chartrand argued that Florida’s law violated their equal protection and due process rights.

But in 2014, a federal district judge ruled against them, concluding that while the rating system seemed clearly unfair, it was nonetheless still legal. “Needless to say, this Court would be hard-pressed to find anyone who would find this evaluation system fair to [teachers in non-tested subjects], let alone be willing to submit to a similar evaluation system," the judge wrote. “This case, however, is not about the fairness of the evaluation system. The standard of review is not whether the evaluation policies are good or bad, wise or unwise; but whether the evaluation policies are rational within the meaning of the law.” A federal appeals court upheld the ruling in 2015.

More failed legal challenges against value-added measures took place in Tennessee. In 2014, two of the state's teachers, Mark Taylor and Lisa Trout, filed federal lawsuits, later consolidated, arguing they were unfairly denied performance bonuses because so few of their students took the tests used to generate their VAM score. In Taylor's case, for example, just 22 of his 142 students took the exams that formed the basis of his VAM score. Trout and Taylor argued the measures were arbitrary and irrational, and violated their due process and equal protection rights.

But in 2016, a federal judge from the U.S. District Court in Knoxville dismissed their case. Though the judge recognized the legitimacy of the plaintiffs’ concerns, saying the teachers’ criticisms “are not unfounded,” he cited the Florida precedent, and concluded that it would be up to the Tennessee legislature to make any changes to the system, as it “survives minimal constitutional scrutiny.”

Still, there have been some wins. In addition to the recent legal victory in Houston, last year a Long Island fourth-grade teacher named Sheri Lederman won her lawsuit against New York state officials, with a judge concluding that her VAM score for the 2013–2014 school year was indeed arbitrary and capricious and needed to be vacated. During the 2012–2013 school year, Lederman scored 14 points out of 20; the next year she scored 1 out of 20 (considered “ineffective”); and during the 2014–2015 school year she scored 11 out of 20. “It’s the variability and volatility of this model that makes it so arbitrary,” Lederman told The Wall Street Journal. “There’s no reason to suggest that my performance with my children has varied that much year to year.”
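To make the size of those swings concrete, here is a minimal Python sketch of the arithmetic using only Lederman's three reported scores. The variable names are illustrative, and expressing each change as a share of the 20-point scale is a choice made here, not something taken from the court record or from New York's scoring model.

```python
# Lederman's scores out of 20 for three consecutive school years, as reported above.
scores = {"2012-2013": 14, "2013-2014": 1, "2014-2015": 11}
MAX_SCORE = 20  # top of the scale the scores are reported on

years = list(scores)
swings = {}
for prev, curr in zip(years, years[1:]):
    change = scores[curr] - scores[prev]
    # Store the raw point change and its size as a share of the full scale.
    swings[f"{prev} -> {curr}"] = (change, abs(change) / MAX_SCORE)

for span, (change, share) in swings.items():
    print(f"{span}: {change:+d} points ({share:.0%} of the scale)")
```

Under those assumptions, her score fell by 13 points, or 65 percent of the scale, and then rose by 10 points, or 50 percent of the scale, in back-to-back years.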

Another major suit is playing out in New Mexico. The American Federation of Teachers New Mexico, the Albuquerque Teachers Federation, and other plaintiffs filed a lawsuit against the state’s VAM system in February 2015, arguing that it violates state law and is arbitrary and capricious in design. A state judge issued a temporary injunction in December 2015, blocking New Mexico from using its VAM measures for high-stakes personnel decisions until a later trial could be held. (That trial is scheduled for October.) Notably, the judge said that while value-added modeling can generally be sound, it’s not clear how much New Mexico’s system conforms to those best practices, given that the inner workings of the model “are not easily understood, translated, or made accessible.”

“Courts aren't really good at parsing statistical details, but if they see something is a blunt instrument, and that information is unstable and unreliable, those are concepts judges can understand,” says Rutgers education finance professor Bruce Baker. “And if it's being used in an arbitrary way, in a way that requires a precision that can't be achieved, judges can look at that and say, ‘Well, I can understand those due process issues.'”

AFT president Randi Weingarten told The American Prospect that in addition to working on the legal and legislative fronts to “defeat VAM,” the AFT is fighting for more constructive evaluation systems that actually help teachers improve their practices.

“VAM is an unjust, unreliable, and unconstitutional method of evaluating teachers in America's classrooms, and the AFT and our affiliates are leading—and winning—the fight against these systems,” she says. “We are heartened by recent court victories in which judges agree with us that VAM does not work for students, teachers, or schools as an evaluation tool.”

OUTSIDE OF COURT BATTLES, one clear sign of how the political winds have shifted is the rhetoric of education reformers. Just a few years ago, prominent leaders were calling to publish teachers’ VAM scores, so that parents and taxpayers could better hold public school teachers accountable.

“Parents and community members have the right to know how their districts, schools, principals, and teachers are doing,” said U.S. Secretary of Education Arne Duncan in 2010. “It’s up to local communities to set the context for these courageous conversations but silence is not an option.”

Duncan's comments came a few months after the Los Angeles Times controversially published the value-added scores for Los Angeles teachers, and posted names of individual teachers rated as effective or ineffective on its website. The New York City Department of Education wanted to follow suit, insisting that doing so was in the public interest. “These are public schools and public dollars,” said a spokeswoman for New York City Schools Chancellor Joel Klein at the time.

Not all education reformers supported publishing VAM scores. Kate Walsh, the president of the National Council on Teacher Quality, spoke out against it. “I just thought it was an absolutely shameful practice,” she told me. “If VAM were 100 percent accurate I would still have a problem with it—but it's not, there are a lot of false positives and false negatives.” Bill Gates also published a New York Times op-ed urging against disclosing the scores. “At Microsoft, we created a rigorous personnel system, but we would never have thought about using employee evaluations to embarrass people, much less publish them in a newspaper,” he wrote.

And while New York did end up publishing teachers’ scores, along with other states like Ohio and Florida, you don’t hear VAM supporters championing such disclosures anymore. (Even Arne Duncan walked back his initial support.) One reason for the retreat is that making the scores available enabled the public to see how biased and error-prone they could be.

“After New York did it, people started realizing it was not a great thing to do,” says Baker. “Researchers reanalyzed the LA Times data and came up with different results, and I analyzed the NYC data, and even though NYC uses a pretty rich value-added model that controls for lots of stuff, eliminating much of the bias, that means you’re left with relatively noisy estimates, that jump around a lot from year to year.”

On top of growing doubts about how states are using VAM, some academics have even begun to challenge the idea that boosted test scores are a reliable proxy for improved life outcomes. This position is most prominently espoused by Jay Greene, the head of the Department of Education Reform at the University of Arkansas, who has argued that the evidence for a correlation between test scores and life prospects is weak, especially with regard to high-stakes testing.

In an interview with the Prospect, Greene also said that test-based accountability advocates tend to imagine either that existing accountability systems are already designed according to best practices, or that states will eventually adopt best practices. “But there’s no sign that this will happen,” he says. “Their fantasy is an undemocratic fantasy, that benign dictators will scientifically design the correct evaluation, impose it on an unwilling workforce and population, and then it will stay forever. They always end up sounding a little bit like the ‘communism has never been tried’ argument. You know, once we get the details right, everyone will see how good it is.” Still, Greene thinks that even though reformers have not succeeded in really transforming teacher evaluations, they have effectively narrowed public discourse around education, defining “achievement” down to mean, merely, gains in reading and math scores.

“If you tell me that Chicago public schools are producing greater gains among disadvantaged students than other disadvantaged students across Illinois, it might be that Chicago students have figured out how to focus more narrowly on tests,” he says. “I don’t even know if the information we’re getting now [from tests] is a proxy for school quality anymore, or if it’s gaming.”

WHILE THE FUTURE of using value-added measures in teacher evaluations is unclear, some researchers have been advocating alternative ideas. One would be to use the statistical growth measures as a diagnostic tool, a preliminary screening test to help identify which districts, schools, and classrooms warrant closer attention. The idea is to use VAM the way a doctor uses a screening test for major diseases: Patients who fail the screening are given another, more careful test. “As in medicine, a value-added score, combined with some additional information, should lead us to trigger classroom observations to identify truly low-performing teachers and to provide feedback,” Doug Harris, a Tulane education economist, wrote in 2012. Bruce Baker and Preston Green have also voiced support for this idea. Some reformers oppose this, saying that using it merely as a diagnostic tool would “water down the metric.”
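The screening idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TeacherScore` type, the `screen_for_follow_up` function, the 20-point scale, the cutoff of 5, and the sample teachers are all hypothetical, and none of them comes from Harris's proposal or from any state's system. The point is that a low score triggers a closer look rather than a rating.

```python
from dataclasses import dataclass


@dataclass
class TeacherScore:
    name: str         # hypothetical label, not a real teacher
    vam_score: float  # assumed to sit on a 0-20 scale for this sketch


def screen_for_follow_up(scores, threshold):
    """Return the names of teachers whose score falls below the threshold.

    Being flagged means "schedule classroom observations," not
    "rated ineffective": the score is only a first-pass screen.
    """
    return [s.name for s in scores if s.vam_score < threshold]


# Made-up sample data and an arbitrary cutoff, for illustration only.
sample = [
    TeacherScore("Teacher A", 14),
    TeacherScore("Teacher B", 1),
    TeacherScore("Teacher C", 11),
]
flagged = screen_for_follow_up(sample, threshold=5)
print(flagged)
```

In this design the statistical measure never produces a personnel decision on its own; any judgment about a flagged teacher would come from the observations that follow.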

In an interview, Harris told me that he'd rather see teacher evaluations be based on peers and experts observing teacher practice and coming to a professional judgment. He says he hopes the backlash against VAM will at least motivate people to think more seriously about alternative ways to evaluate teachers.

Though some are worried the country will move entirely away from holding schools and teachers accountable for student test scores, and thereby hurt academic opportunities for historically underserved students, Baker thinks we'll continue to see more incremental shifts in test-based accountability over the next few years. But some states, he says, will shift to growth measures that are no better than what states were already using.

Walsh, the president of the National Council on Teacher Quality, says she's inclined to be a pessimist, and the pessimist in her doesn't see much progress happening on the test-based evaluation front over the next few years. “But then again,” she says, “the winds change pretty quickly.”
