Testing Our Patience

January 20, 2004

5:52 PM

State and federal law assume that the quality of public education can be gauged by the number of students who reach the "proficiency" mark on a standardized test. Indeed, the federal No Child Left Behind (NCLB) law provides serious penalties for schools that fail to make sufficient annual gains in these numbers. It is a terribly misguided policy.

But the problem is not, as some critics argue, that all tests are invalid. Standardized tests can do a good job of indicating, though not with perfect certainty, whether students have mastered basic skills, can identify facts they should know or can apply formulas they have memorized. Such tests have a place in evaluating schools, as they do in evaluating students. However, they are of little use in assessing creativity, insight, reasoning and the application of skills to unrehearsed situations -- each an important part of what a high-quality school should be teaching. Such things can be assessed, but not easily and not in a standardized fashion.

To judge schools exclusively by their test results is, therefore, to miss much of what matters in education. Relying on proficiency benchmarks makes things even worse. NCLB requires that every public-school child in grades three through eight be tested annually in reading and math (and within a few years, periodically in science). The law requires every school to report the percentage at each grade level who achieve proficiency and, separately, the percentage of each racial and ethnic minority group and the percentage of low-income children who achieve it. If every grade and subgroup does not make steady progress toward the national goal -- the proficiency of all members in each subject by 2014 -- the penalties kick in.

But what exactly is "proficiency"? The new testing law models its definition on the one used by the National Assessment of Educational Progress (NAEP), a set of federal exams in a variety of subjects given to a sample of students nationwide. The NAEP tests such a broad span of skills that each test-taker can be asked only a small share of its questions, and the test results must be aggregated to generate average performance numbers. The NAEP then describes these group averages as either "below basic," "basic," "proficient" or "advanced." Panels of citizens decide where the lines between those categories should be drawn.

Proficiency, in other words, is not an objective fact but a subjective judgment. And the NAEP judgments have not been very credible. The NAEP finds, for example, that only 32 percent of eighth-graders are proficient in reading, and only 29 percent are proficient in math -- seemingly a national calamity. But international tests show that no country in the world has high proportions of its students close to proficiency as defined by the NAEP. If most students in the United States or elsewhere in the world have never been proficient in this sense, how meaningful is it that fewer than a third of American students are now meeting this target?

In 1993, shortly after the federal government first began reporting scores in terms of proficiency, the General Accounting Office (GAO) charged that the government had adopted this method for political reasons -- to send a dire message about school achievement -- notwithstanding its questionable technical validity. Confirming the GAO's conclusions, a National Academy of Education report found that the NAEP's definitions of achievement levels were "fundamentally flawed" and "subject to large biases," and that U.S. students had been condemned as deficient using "unreasonably high" standards. A National Academy of Sciences panel rendered a similar judgment.

Nevertheless, under the new federal law, each state must now set its own proficiency standards, and the states are using methodologies similar to the NAEP's. The consequences have often been ludicrous. New York state had to cancel the results of its high-school math exam when only 37 percent of test-takers passed, down from 61 percent the previous year when the curriculum and instructional methods were similar and proficiency was supposed to be defined in the same way. On Massachusetts' state science exam in 1998, only 28 percent of eighth-graders passed the proficiency point; yet on an exam administered internationally, Massachusetts students did as well as or better than students anywhere in the world except Singapore. On the other hand, on Texas' reading exam, 85 percent of fourth-graders passed the state-set proficiency point while the NAEP found that only 27 percent were proficient. The setting of proficiency levels by different groups of panelists is open to almost unlimited variation.

This creates a further absurdity, if school evaluation is the goal: A state's proficiency definitions can be -- and given the penalties in NCLB, they increasingly will be -- watered down to the point that all children can achieve them with little improvement in instruction. Some states have already begun this process, deciding that what they previously had defined as failing will now be considered proficient. Other states have bet that the new federal law is so unworkable that it will be repealed. These states have therefore decided not to do anything for now about schools that make very slow progress toward proficiency -- and not to worry about the inconceivably spectacular improvements those schools will have to make just before 2014, if the law remains in effect.

The federal law was intended to raise student achievement to high standards. But its incentives are functioning instead to lower state sights to existing levels of student achievement.

The new law's incentives are distorting teaching as well. Rational teachers in many states have begun to focus most of their attention on those students who are just below the proficiency point, because only their improvement is rewarded in the accountability system. Imagine a class with some students who score well below the proficiency point, some close to it and some well above. It makes no sense to waste instructional time on the high-scoring students, and little sense to waste much of it on the low scorers. The most surefire way to show annual progress and avoid sanctions is to aim for a small improvement, which is all that's necessary, from the nearly proficient group.
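The arithmetic behind this incentive can be sketched in a few lines of Python. The class scores, the cutoff of 70 and the size of each gain are invented for illustration and do not come from any real test; the sketch only shows that a pass-rate measure registers a small gain among nearly proficient students while ignoring a much larger gain among the lowest scorers.

```python
def pass_rate(scores, cutoff):
    """Share of students scoring at or above the proficiency cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)


def boost(scores, low, high, points):
    """Return a copy of scores with `points` added to each score in [low, high)."""
    return [s + points if low <= s < high else s for s in scores]


# Hypothetical class: two low scorers, two nearly proficient, two well above.
CUTOFF = 70
baseline = [40, 45, 68, 69, 85, 90]

# Strategy 1: a 3-point gain for the students just below the cutoff.
bubble_focus = boost(baseline, 65, CUTOFF, 3)

# Strategy 2: a 10-point gain for the lowest scorers.
low_focus = boost(baseline, 0, 65, 10)

for label, scores in [("baseline", baseline),
                      ("bubble focus", bubble_focus),
                      ("low-scorer focus", low_focus)]:
    total_gain = sum(scores) - sum(baseline)
    print(f"{label}: pass rate {pass_rate(scores, CUTOFF):.2f}, "
          f"total points gained {total_gain}")
```

Under these made-up numbers, the second strategy produces more than three times as much total learning gain as the first, yet only the first moves the number that the accountability system counts.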

The teachers are not being irresponsible; rather, the federal incentives are accomplishing what they were designed to do. Framers of the law -- not only the Bush administration but also Democrats led by Massachusetts Sen. Edward Kennedy and California Rep. George Miller -- relied on the fact that school leaders, from superintendents to teachers, are more likely to achieve a goal if there are serious penalties for not doing so. But that's the problem. When planners try to manage complex systems that have multiple goals by setting quotas only for the most easily quantifiable of those goals, the incentives distort the output.

It is not the states' official intentions that get watered down. States have mostly complied with the law's requirement that they promulgate high standards. But their tests, which state education officials claim are "aligned" with these standards, point teachers in quite another direction.

True alignment of tests and standards has two parts. First, every test question must assess a skill that is actually included in the standards. This kind of alignment mostly does exist. But just as important, every skill included in the standard must be assessed -- either by tests, student work samples or other evaluations -- and each skill should have the same relative weight in the assessment system as in the standards. This is not happening.

Consider a typical elementary-school reading standard, common in many states, that expects children to be able to identify both the main idea and the supporting details in a passage. There is nothing wrong with such a standard. If state tests actually assessed it, there would be nothing wrong with teachers "teaching to the test." But in actuality, students are more likely to find questions on state tests that simply require identification of details, not the main idea. For example, a test built around a passage about Christopher Columbus might ask pupils to identify his ships' names without asking if they understood that, by sailing west, he planned to confirm that the world was spherical. In math, a typical middle-school geometry standard expects students to be able to measure various figures and shapes, like triangles, squares, prisms and cones. Again, that is an appropriate standard, and teachers should prepare students for a test that assessed it. But, in actuality, students are more likely to find questions on state tests that ask only for measurement of the simpler forms, like triangles and squares. It is not unusual to find states claiming that they have "aligned" such tests with their high standards when they have done nothing of the kind.

At first glance, it may seem that such fudging is harmless. Until students are proficient on a basic skills test, some people think, there is no point in wasting time on higher skills. But effective teaching requires that basic and higher skills not be taught sequentially but simultaneously, so they can reinforce each other. With the Columbus passage, a child need not be able to recall every detail before he or she is taught how to summarize the theme of the passage. In math, students who are learning to add, subtract, multiply and divide may still make errors when they perform these operations. But while continuing to practice these basic skills, they should also be learning more difficult topics. If students are given tests that ask for little but basic arithmetic skills, their teachers are unlikely to spend much time teaching algebra. State policy-makers may conclude from rising test scores that students are closer to meeting high standards, but those policy-makers would be wrong.

The same distortion may explain why the current school-reform movement has had some success in narrowing the achievement gap between disadvantaged and middle-class children at the lower grades, only to see the gap widen as children grow older. The pattern may simply reflect the fact that testing (and, therefore, instruction) in the lower grades is increasingly concentrated on basic skills, which are relatively easy -- with enough drill -- to impart. But such instruction may leave students unprepared for curricula in the upper grades, where tests cannot continue to exclude more advanced skills without looking ridiculous.

One further problem makes a folly of the new system, and that is its inaccuracy. Tests, considered by themselves, are not reliably precise even as indicators of the skills they do assess. Yet if schools are to be held accountable for their test results, precision is what's needed. A school either meets its required mark or it does not.

Most people understand that a single annual test should not be the exclusive means of evaluating a student, because performance can vary -- even if only a little -- from day to day; students have their good days and their bad.

School-wide tests should be somewhat less unreliable, because when some students have bad days, others have good ones. If these average out, test-based accountability should work. But statisticians can show that for test scores to average out accurately enough for us to know what proportion of a grade's students have passed a precise proficiency point, very large populations are required, much larger than the grade cohorts in typical schools. In reality, school averages, like student scores, wobble around their true values. Year-to-year changes in school averages are even less reliable. By sheer happenstance, there might be higher-ability students in the fourth grade one year than the next. A rainy day could affect student dispositions. Test-takers might be distracted by a barking dog one year and not the next. The effects of such events could be tiny, yet in the new federal system, schools are sanctioned based on incremental changes in their annual performance scores.
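The size of this wobble can be roughed out with a simple model. The Python sketch below assumes that each student passes independently with the same underlying probability (a binomial model, which is a simplification and not the method used in any particular study), and the cohort sizes and the 50 percent true pass rate are hypothetical.

```python
import math
import random


def pass_rate_standard_error(true_rate, cohort_size):
    """Standard error of an observed pass rate under a simple binomial model."""
    return math.sqrt(true_rate * (1 - true_rate) / cohort_size)


def simulate_pass_rates(true_rate, cohort_size, years, seed=0):
    """Simulate yearly pass rates for a school whose true rate never changes."""
    rng = random.Random(seed)
    rates = []
    for _ in range(years):
        passed = sum(rng.random() < true_rate for _ in range(cohort_size))
        rates.append(passed / cohort_size)
    return rates


if __name__ == "__main__":
    # Hypothetical cohort sizes: one grade in a school, then much larger groups.
    for size in (60, 600, 6000):
        se = pass_rate_standard_error(0.5, size)
        print(f"cohort of {size}: typical wobble of about {100 * se:.1f} points")

    # Five simulated years for a 60-student grade with no real change at all.
    print(simulate_pass_rates(0.5, 60, years=5))
```

Under these assumptions, a grade of 60 students whose true pass rate is 50 percent would typically report a figure about six or seven percentage points away from that value in any given year, and the wobble shrinks to under one point only when the group reaches several thousand students.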

The statistical problems are exacerbated by the law's laudable intent of holding schools accountable for the learning of minority groups within a school. Because subgroups have fewer students than the school as a whole, subgroup scores are even less reliable. A perverse consequence is that the more integrated a school, the more likely it is to be deemed failing: each additional subgroup it must report is one more chance for a random dip to trigger sanctions.
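The link between integration and failure can be made concrete with a back-of-the-envelope calculation. The Python sketch below assumes that each reported subgroup has the same probability of missing its target purely by chance and that the subgroups miss independently of one another. Both assumptions are simplifications, and the 20 percent figure is invented for illustration.

```python
def chance_of_any_miss(miss_probability, num_groups):
    """Probability that at least one of several independent groups misses by chance."""
    return 1 - (1 - miss_probability) ** num_groups


if __name__ == "__main__":
    # Hypothetical: each subgroup has a 20 percent chance of a random miss.
    for groups in (1, 3, 5):
        risk = chance_of_any_miss(0.2, groups)
        print(f"{groups} reported group(s): {100 * risk:.0f}% chance of a chance failure")
```

With one reported group the school's risk is the assumed 20 percent; with five groups it rises to roughly two in three, even though nothing about the teaching differs.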

In the summer of 2001, when the Bush administration and Congress were designing NCLB requirements, two econometricians -- Thomas Kane, who now teaches at the University of California, Los Angeles, and Douglas Staiger at Dartmouth College -- circulated a paper showing that the proposed system would result in many of the wrong schools being rewarded or punished solely because of these statistical sampling problems. The paper was so persuasive that the introduction of the bill was held up for several months while administration and congressional experts tried to solve the problem. They couldn't. But they introduced the bill anyway, and the result has been some remarkable anomalies: schools rewarded one year and punished the next with no underlying change in teaching effectiveness; schools rewarded under a state's system and simultaneously punished under the federal one. Such arbitrariness undermines the incentive system itself.

Can all this be fixed? Not if we insist on a mechanistic system that allows federal administrators to judge whether schools are successful or failing simply by examining data reports from annual tests.

A good accountability system would not exclude annual testing. Although there would be a lot of randomness in the results, slow progress or low scores for any school or group within it should be a red flag, inviting further scrutiny. But a fair and accurate accountability system has to include more than just standardized tests. It has to include the judgments of experts who visit schools, review student work and projects, evaluate the quality of curriculum and teaching, rate school climate and spirit, draw conclusions about the effectiveness of school leadership, and make determinations about whether school resources are being devoted in a balanced fashion to all the goals we have for schoolchildren -- both those that are easily testable and those that are not.

There are models for such an accountability system. It is, for example, how we accredit hospitals in this country (we don't rely exclusively on death rates or length of patient stays). It is also similar to the system used in Great Britain for evaluating schools. Prior to NCLB, there were a few state and local experiments with systems using multiple measures of school success. But now federal law, motivated by a deep contempt for public education, seeks an accountability system that is teacher-proof, principal-proof, superintendent-proof and even governor- and legislature-proof.

The NCLB system cannot survive until 2014, when its proponents expect all students to be proficient. Long before then, probably when most schools in the country have been defined as failing even by diluted definitions of proficiency, the system will collapse of its own internal contradictions. In the meantime, it is doing great and needless damage.
