The reality of rigor (more on the D.C. vouchers study)
Mike opened the door for my response to the Washington Opportunity Scholarship Program external evaluation, and I’ve just completed a fairly quick read of it. First, in the spirit of full disclosure, I’ll note that my former employer, Westat, was the prime contractor for the evaluation. Though I never personally worked with the Westat staff who conducted the evaluation, I do know their reputations for quality work. This is not the only reason, of course, that I found the evaluation to be of high quality, but it’s worth mentioning. Disclosure aside, I have a couple of takeaways from the evaluation.
First, the impact findings for the program are simply not that compelling (sorry Mike), and even the subgroup analyses—which do provide a ray of hope—are presented with important caveats. The design comprised a randomized controlled trial where eligible applicants were randomly assigned to receive or not receive the scholarship. By all accounts, the sample was drawn appropriately and is of sufficient size (n=2,308 which is, we’re told, larger than impact samples in previous, similar evaluations); furthermore, the analyses appear thoughtfully and meticulously conducted.
So, while I have few qualms with the evaluation design itself, I do think something that occurred naturally within the impact sample, namely lots of student mobility, is worth keeping in mind. Over the course of two years, only 4 percent of the treatment group remained in the same school they were in when they applied to the program; 71 percent switched schools once, and 25 percent switched schools twice. Among the control group, 22 percent remained in the same school they were in when they applied to the program; 57 percent switched schools once, and 21 percent switched schools twice. That’s a majority of kids (even more so in the treatment group) not attending any one participating school for very long. The authors report that “both groups experienced higher rates of school mobility than the typical annual rate for urban students (22 to 28 percent).” It’s not surprising, then, to see unimpressive findings in an evaluation that covers such a short duration (2 years) and examines achievement data from students who are extremely transient (not to mention that students were tested on Saturdays!).
Second, I’m struck by the number of times that the phrase “adjustments for multiple comparisons suggest that this finding may be a false discovery” (or similar nomenclature) appears in the report. Researchers concern themselves with multiple comparisons because they are in a position of simultaneously evaluating multiple questions and hypotheses. Simply put, when you consider the results of multiple, separate statistical tests together, there is more room for error. The issue has gotten more attention of late, in part because of this recent report from IES which presents methods for dealing with the multiple comparisons problem. Like most people involved with education, I’m interested in the best research possible given the time and resources available to conduct it. Many statisticians believe that ignoring the multiplicity problem leads to misinterpretation of findings, so these researchers covered their bases.
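For readers who want to see why multiple tests leave "more room for error," here is a minimal Python sketch. It is not the evaluation team's code, and the report's exact adjustment method is not described here; I am assuming a Benjamini-Hochberg style false discovery rate procedure purely for illustration. The p-values and function names are hypothetical.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** num_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where a finding survives a
    Benjamini-Hochberg false discovery rate adjustment.

    Assumed for illustration only; the report may use a different method.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at or below alpha * k / m.
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_passing_rank = rank

    # Every finding ranked at or below that k survives the adjustment.
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            survives[idx] = True
    return survives


# Hypothetical p-values from five subgroup tests (not from the report).
example_p_values = [0.004, 0.04, 0.03, 0.20, 0.045]
naive = [p < 0.05 for p in example_p_values]
adjusted = benjamini_hochberg(example_p_values)
print("Significant before adjustment:", sum(naive))
print("Significant after adjustment: ", sum(adjusted))
print("Chance of a false positive in 10 tests:",
      round(chance_of_any_false_positive(10), 2))
```

In this made-up example, four of the five tests look significant at the usual 0.05 level, but only one survives the adjustment. The other three are the kind of result that would carry a "may be a false discovery" caveat.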
But with all of those “false discovery” caveats in the report, I found myself harkening back to Judith Gueron’s comments in this book. Ms. Gueron (of Manpower Demonstration Research Corporation or MDRC) writes:
Finally, rigor has its drawbacks. Peter Rossi once formulated several laws about policy research, one of which was: the better the study, the smaller the likely impact. High quality policy research must continuously compete with the claims of greater success based on weaker evidence.
Ahh, so true. Sooner or later, we must come to terms with the fact that the bar we set for rigor may unintentionally and preemptively knock out of the running a program that may, in fact, make some improvement in American education. Mind you, I’m not calling for a return to the age when education anecdote equaled research. Here’s Gueron again on a lesson she learned about running successful social experiments:
You do not need dramatic results to have an impact on policy. Many people have said that the 1988 welfare reform law, the Family Support Act, was based and passed on the strength of research—and the research was about modest changes. When we have reliable results, it usually suggests that social programs (at least the relatively modest ones tested in this country) are not panaceas but that they nonetheless can make improvements. One of the lessons I draw from our experience is that modest changes have often been enough to make a program cost-effective and can also be enough to persuade policymakers to act. However, while this was true in the mid 1980’s, it was certainly not true in the mid 1990’s. In the last round of federal welfare reform, modest improvements were often cast as failures.
The question is: Will the OSP ultimately pass the “modest improvement” test? At two years—a time period that’s too short to capture impacts that may evolve over time—we don’t know. What I do know is that parents believe the OSP is making improvements, that improvement for certain groups of students may exist, and that school choice in and of itself may prove a laudable goal even without raise-the-roof achievement gains. Also, as an educational community, we’d be wise to continue the dialogue around the financial, political, methodological, and common-sensical (I think that’s a word) tradeoffs involved in rigorous research.

June 18th, 2008 at 2:52 pm
Amber,
Very impressive commentary. I especially like the part about the evaluation team being so clever and rigorous. We also are extremely charming and quite modest, by the way.
I do want to clarify one point that you may have misunderstood. All of the students in the study were attending a DC public school (or were rising kindergarteners) at their time of application — the baseline for the study. The 4 percent of the treatment group that remained in their baseline school over the entire two-year period are the students who never switched out of their public school even though they were offered a voucher that would have allowed them to do so. The 71 percent of the treatment group that switched schools once left their baseline public school and stuck with their initial school of choice. The 25 percent of the treatment group that switched twice moved from their baseline public school to a school of choice then to another school of choice. So, among the group of students offered vouchers to switch from their public school to a private school of choice, we observed a small number of “stayers”, a large number of “switch-and-stickers,” and a moderate number of “seekers.”
The “treatment” of a school voucher, if used, essentially requires at least one school-switch, so the 71 percent “switch-and-stickers” represent what we envision happening with vouchers. The fact that a quarter of the students offered vouchers switched twice does indicate a surprising level of flux; and, as you helpfully point out, most of the control group students were switching schools as well. The bottom line is that the experimental evaluation was not a simple comparison between treatment students who all made one switch from a public to a private school using a voucher and control students who all stayed in their previous public school. We couldn’t keep everyone in their assigned petri dish, and wouldn’t have even if that were possible (hello Institutional Review Board on Human Subjects Research). Everyone in the study has at least some educational options, and this particular group of students didn’t hesitate to exercise them. So our study doesn’t compare school choice with no school choice. It compares lots of school choices with the added assistance of a school voucher with lots of school choices absent a voucher. That’s how we identify the difference that the voucher program makes.
June 19th, 2008 at 1:00 pm
Thanks, Patrick. An important clarification indeed. It wasn’t entirely clear that the 4 percent were the ones who never left. I appreciate your stayers, switch-and-stickers, and seekers terminology, as well as how you laid out what the study was and was not aiming to compare. Personally, I’d vote for including your post to me in some fashion in the next report. Good luck in the continuing evaluation.