As I said in the previous post, I find the student standardized test score data valuable mostly for what it shows us about the significant gaps between subgroups of students. What I find less informative, though, are year-to-year comparisons and comparisons between our students and students elsewhere. It is too easy to impose preconceived conclusions on those comparisons, when there may be other plausible hypotheses that are also consistent with the data.
For example, at our meeting last week, the board was presented with bar graphs showing that our students’ scores compare favorably to those of students nationwide. Then we saw more bar graphs showing that our students compare favorably to students statewide on a measure of annual score growth.
There is always an air of self-congratulation to that kind of discussion. When our students compare favorably with others, that means our district is doing a good job, right? But actually, that isn’t what it means. Even if our students compare favorably with students elsewhere, it doesn’t follow that our district’s policies and practices are the reason why. We live in a university town where a lot of people are highly educated; that alone could explain why our students sometimes score better than the average student. For all we know, they would score even higher if our district made different educational choices; our district might actually be holding them back! The data neither prove nor disprove that hypothesis.
Moreover, the data also show that our students suffer by comparison to students elsewhere in many ways. For example, the Iowa Assessments data show that in all three tested subjects (reading, math, and science), at every grade level tested (third through eleventh), the percentage of our students who are non-proficient is higher than it is for students statewide (with the one exception of eleventh-grade science). Those differences appear to be largely a function of the scores of certain subgroups: students on free and reduced-price lunch, black and African-American students, students receiving special education services, and English-language learners. Here’s an example of the comparisons (no helpful bar graphs on this data set!):
And our students are often doing even worse, relative to students statewide, than they did last year. Here’s what that cohort’s chart looked like last year:
If we let the district take credit for every student success story, we also have to blame it for every unsatisfactory result.
But in fact, the data do not dictate either conclusion, because it is very difficult to evaluate all the potential causal factors that affect our students’ test scores. There are sophisticated statistical methods, like multivariate regression analysis, that could at least try to do so, but they have their own limitations. In any event, I’m fairly sure that our district is not even attempting such sophisticated analyses of its test score data. It’s probably much closer to the truth to say that when our district gets a new set of test scores, it tends to simply eyeball the changes from the previous year and then make assumptions (not dictated by the data) about what those changes tell us about the effects of the district’s practices.
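To give a rough sense of what such an analysis might look like, here is a minimal sketch of a multivariate regression in Python. Everything in it is assumed for illustration: the file name, the column names, and the choice of predictors are hypothetical, not the district’s actual data or methodology.

```python
# A minimal, hypothetical sketch of the kind of multivariate regression
# alluded to above. The data file, column names, and predictors are all
# invented for illustration; they are not the district's actual data.
import pandas as pd
import statsmodels.formula.api as smf

# Assumed per-student records: a proficiency score plus candidate factors.
df = pd.read_csv("student_scores.csv")

# Regress the score on several factors at once, so each coefficient estimates
# that factor's association with scores while the others are held statistically
# constant -- which is still not the same thing as a controlled experiment.
model = smf.ols(
    "score ~ frl + ell + sped + parent_education + weighted_resources",
    data=df,
).fit()

print(model.summary())  # coefficients, standard errors, confidence intervals
```

Even a model like this only estimates associations in observational data; leave out an important factor and the coefficients can be badly misleading, which is one example of the limitations such methods carry.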
For example, this year we’re instituting a weighted resource allocation model, shifting resources toward schools that have more kids in the subgroups that are showing proficiency gaps. Next year, we’ll look at how the proficiency scores have changed. If the gaps are at all smaller, it will be tempting to attribute the improvement to the weighted resource model. But the data would be consistent with other plausible explanations as well, since we can’t possibly hold all other variables constant the way we could in a laboratory experiment. At the same time that we’re shifting resources, we might also be changing disciplinary practices, trying to diversify the teaching staff, providing bias training, and so on. (And of course we’re usually measuring different actual kids from one year to the next.) If the gaps subsequently improve, it’s even possible that the weighted resource model actually made them worse, but that the other variables more than offset that effect.
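To make that offsetting-effects point concrete, here is a toy simulation with made-up numbers. The effect sizes and score spreads are pure assumptions chosen for illustration; the only point is that the year-over-year average cannot separate the competing stories.

```python
# A toy simulation with invented numbers, illustrating how a harmful change
# can be hidden by other, helpful changes made in the same year.
import numpy as np

rng = np.random.default_rng(0)
n = 500  # hypothetical number of students in the gap subgroups

# Invented effect sizes, in proficiency-score points.
effect_resource_shift = -2.0  # suppose the weighted model actually hurts a bit
effect_other_changes = +5.0   # discipline changes, staffing, bias training, etc.

before = 60 + rng.normal(0, 10, n)   # last year's scores
year_noise = rng.normal(0, 10, n)    # different kids, different year
after = before + effect_resource_shift + effect_other_changes + year_noise

print(f"mean before: {before.mean():.1f}")
print(f"mean after:  {after.mean():.1f}")
# The average rises even though, in this made-up world, the resource shift
# itself was harmful; the raw year-over-year change can't tell the difference.
```

The printed averages go up, but nothing in them reveals the sign of the resource shift’s own effect, which is exactly the problem with eyeballing the change from one year to the next.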
If we shift resources and then the test scores go up, we’ll want to assume a causal link. But if we shift resources and the scores go down, we won’t assume that the shift caused the decrease; if anything, people might even say, “See, the scores went down, so we need to shift resources even more.” My point is just that our explanations often come from sources other than the data themselves, since the data are consistent with multiple hypotheses. I don’t think that’s crazy; I think it makes sense to bring experience, judgment, and even instincts to bear in interpreting any data. But it can certainly lead people into temptation, as it becomes fairly easy to use data to justify preconceived conclusions.
In any event, if someone tells you that our district is doing great because the kids outperform their counterparts elsewhere, remember that what you’re hearing is probably closer to sales than to science.