Tuesday, February 5, 2013

The Philosophy of Data and Sonoma's SAT Scores.

David Brooks, with Mark Shields and Judy Woodruff
"Weekly Political Wrap," PBS NewsHour
 available at http://tinyurl.com/abgqguu.
David Brooks' column is a regular on my weekly reading list, and his segment with Mark Shields on the PBS NewsHour is a Friday night favorite at our house.  According to Wikipedia, he's "the sort of conservative pundit that liberals like, someone who is 'sophisticated' and 'engages with' the liberal agenda[.]" Today's column is all about data, and what caught my eye is his observation that there are two things data does really well: it can illuminate patterns of behavior we haven’t yet noticed, and it's very good at exposing when our intuitive view of reality is wrong.

"Highest Average SAT Scores in Sonoma County"
California Schools Guide, Los Angeles Times
His two points resonate with me, because I've spent the past two weeks looking over Sonoma Valley High's SAT scores, in response to some data sent to me by a concerned friend.

The table in question showed that Sonoma Valley High's SAT scores ranked 11th out of 16 public schools in Sonoma County. The presumption was that these schools were all comparable to Sonoma Valley High. 

My friend wondered what conclusions could be drawn about the performance of Sonoma Valley High as a consequence. So, I took a look.

The data comes from the Los Angeles Times, as a part of their California Schools Guide. As you can see, Sonoma Valley High is right behind Piner High and ahead of Windsor High.  Santa Rosa High is at the top, Technology High in Rohnert Park's around the middle ... 

Wait a minute.  Piner High is above Sonoma Valley High? For someone who grew up in Sonoma County, that data point is completely implausible.  I just knew there had to be some real problems with the method used to create the database, given what I know about Piner High.

Number of Test Takers Versus Size of School.
Data from "Highest Average SAT Scores in Sonoma County"
California Schools Guide, Los Angeles Times
available at http://tinyurl.com/bglbet5
Thankfully, the LA Times includes the size of each school's student body, along with the total number of test takers at each school. I gathered up the data (it wasn't exactly conveniently arranged), and ranked the schools by the share of their students who took the test.  That table's on the right.

Technology High comes in at the top, which shouldn't really surprise anyone.  It's the highest-ranked high school in Sonoma County based on API scores.  The program (it's a magnet school) is designed to send its students to college (the school itself is located on the campus of Sonoma State University).  An awful lot of seniors at Tech High are taking the SAT, and the ones who aren't may very well be taking the ACT instead.

I'd say that Sonoma Valley should be proud that it's managing to motivate so many of its seniors to take the SAT.  Only Analy and the Petaluma schools do better, and even then it's not by much.  Piner, meanwhile, comes in nearly dead last, with only 16.7% of its students taking the SAT. 

So, to me, there's a very serious problem with drawing any conclusions from ranking high schools by average SAT scores when there's a large self-selection bias in the pool of test takers. You don't have to take the SAT, after all.  You have to sign up for it (and pay for it!).  At Piner High, not many students are doing so, in stark contrast to Sonoma Valley High. If the students who do sign up tend to be a school's stronger test takers, then the fewer who sign up, the higher the school's average will look.
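To make the self-selection problem concrete, here is a minimal sketch in Python. It assumes something the real data doesn't guarantee: that each school's scores are normally distributed and that only the top slice of students takes the test. The mean, the spread, and the 50% participation rate are made-up placeholders, not figures from the LA Times table; only the 16.7% rate echoes the Piner number above. The function name is hypothetical too.

```python
from statistics import NormalDist


def mean_of_top_fraction(mu, sigma, p):
    """Average score of the top fraction p of a normal(mu, sigma) population.

    Assumption: test takers are exactly the best-scoring fraction p of the
    school, which is a deliberate oversimplification.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if p == 1:
        return mu  # everyone takes the test, so no selection effect
    std = NormalDist()            # standard normal: mean 0, SD 1
    z = std.inv_cdf(1 - p)        # cutoff separating takers from non-takers
    return mu + sigma * std.pdf(z) / p


# Two hypothetical schools with IDENTICAL underlying students.
MU, SIGMA = 1500, 300             # placeholder values, not real data

low = mean_of_top_fraction(MU, SIGMA, 0.167)   # few students sign up
high = mean_of_top_fraction(MU, SIGMA, 0.50)   # half the students sign up

print(f"16.7% participation: average of test takers = {low:.0f}")
print(f"50.0% participation: average of test takers = {high:.0f}")
```

Under these assumptions the low-participation school posts the higher average even though the two student bodies are identical, which is the kind of distortion a ranking by raw averages can't see.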

An illustration of the Normal Curve.
From "Normal Distribution," Wikipedia
available at http://tinyurl.com/m2gx6
OK, but what about the raw scores -- can we compare the test scores on the SAT by trying to control for self-selection bias?  Can we "correct" the data to try to draw conclusions? Well, if we just assume that the distribution for each school is unimodal, symmetrical, and bell-shaped -- that the distribution is normal ...

Such an effort immediately runs into a problem, which is that some high schools' score distributions are unimodal, and some (like Sonoma Valley's) are bimodal, and that the data is anything but symmetric. The data for the bimodal schools looks like the table at the right, where g/t is an SAT score, and t is the number of test takers who got that score.
A Bimodal, Asymmetric Distribution.
From "Unimodality," Wikipedia
available at http://tinyurl.com/aowq5jn

Given that I knew there was an oddity in the data, I deliberately focused on only those schools that are bimodal. Thus, this comparison is for high schools where no single ethnic group constitutes more than 70% of the population -- those schools where Spanish-English dual immersion (which I happen to be interested in for my kids) is generally possible.

Pursuing that idea, I took a stab at coming up with, at least theoretically, what the 50th percentile and the standard deviation for the SAT score would be for each of these schools, presuming the sample (the self-selecting students) sits entirely on the right end of a normal distribution (that is, that they're more-or-less the best test takers).

Making the (heroic?) assumptions outlined above, I did what I could to estimate the score for a student who was 1 SD above average, correcting for the different sample sizes and, again, assuming the data is normal, which it isn't. The only reason doing something like this could make any sense at all is that these schools all have the same issue with their data: they're all bimodal and asymmetric (admittedly to different degrees). Further, while the actual 1 SD performance (roughly the 85th percentile of test takers) is quite possibly higher than these estimates indicate, it bears repeating that it's the relative differences I'm more interested in here.
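For anyone who wants to try a correction of this kind, here is one way it could be sketched in Python. It is not necessarily the calculation behind the table below. It assumes the test takers are exactly the top fraction of a normal distribution, and it further assumes the school-wide standard deviation is already known, which in practice it isn't. The function and parameter names are hypothetical.

```python
from statistics import NormalDist


def estimate_school_mean(observed_mean, participation, sigma):
    """Back out a whole-school mean from the average of its test takers.

    Assumptions (all of them shaky): scores are normal, test takers are the
    top `participation` fraction of students, and `sigma` is known.
    Returns (estimated mean, estimated score 1 SD above that mean).
    """
    if not 0 < participation < 1:
        raise ValueError("participation must be strictly between 0 and 1")
    std = NormalDist()
    z = std.inv_cdf(1 - participation)            # cutoff in SD units
    lift = sigma * std.pdf(z) / participation     # how far selection inflates the average
    mu = observed_mean - lift
    return mu, mu + sigma


# Placeholder inputs, not figures from the LA Times data.
mu, plus_one_sd = estimate_school_mean(observed_mean=1600,
                                       participation=0.40,
                                       sigma=300)
print(f"estimated school mean: {mu:.0f}, 1 SD above: {plus_one_sd:.0f}")
```

The lower the participation rate, the larger the downward adjustment, so two schools with the same published average can end up well apart once the share of test takers is taken into account.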

And finally, I put in per-pupil spending for each of these schools for 2007, the last year before the real estate bubble introduced a lot of oddities into these numbers, and the only year I had data for all of them.

Estimated SAT @~85% versus Spending Per Pupil,
Selected Sonoma County and Napa County Schools.
Data from "California Schools Guide," Los Angeles Times,
available at http://tinyurl.com/bglbet5, and the 
"Federal Education Budget Project," New America Foundation,
The end result of this is the table on the right. Sonoma Valley does pretty well, all things considered.  Sonoma Valley has less funding per pupil than the lowest-scoring school, yet still lands in the upper half of the table.

But the story is really the spending-per-pupil. To try to measure Sonoma Valley against, say, Healdsburg, when Healdsburg has 33% more money per student, is hideously unfair.  An extra $1.2 million a year (the amount necessary to match Napa) would significantly help Sonoma Valley Unified.  And what about giving Sonoma Valley Unified an extra $12 million a year -- the amount necessary to match Healdsburg? I bet SVUSD could accomplish an awful lot with that much money ...
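The arithmetic behind figures like these is just the per-pupil gap multiplied by enrollment. A minimal Python sketch follows, using round placeholder numbers rather than the districts' actual budgets or enrollments; the function name is hypothetical.

```python
def extra_funding_needed(own_per_pupil, target_per_pupil, enrollment):
    """Annual dollars needed to raise per-pupil spending to a target level."""
    gap = target_per_pupil - own_per_pupil
    return max(0, gap * enrollment)   # no extra money needed if already ahead


# Illustrative placeholders only: a district spending $9,000 per pupil,
# a neighbor spending about 33% more, and 4,000 enrolled students.
needed = extra_funding_needed(own_per_pupil=9_000,
                              target_per_pupil=12_000,
                              enrollment=4_000)
print(f"${needed:,} per year")
```

Because the gap is multiplied by every enrolled student, even a modest per-pupil difference adds up to millions of dollars a year at the district level.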

The whole exercise of looking at this data certainly illuminated one pattern that I hadn't noticed, which was the very significant disparity inside Sonoma County concerning school funding.  I didn't have any idea that Healdsburg was funding its schools as well as it is, and frankly, it's to Healdsburg's credit.  But the really useful part is that I think it again exposes that most people's intuitive view of Sonoma Valley High is wrong -- Sonoma Valley High, and Windsor to a lesser degree, look like they're overachieving, given what they have to work with financially.  Further, Sonoma Valley's performance is better than that of one of the two closest high schools (Vintage) and is within striking distance of the other.

I've been speculating about why the idea that "Sonoma Valley High is a poor performer" has gotten entrenched in the community.  I was tempted to mine the Index-Tribune's archives and perform a textual analysis, to see if I could find harder evidence in the form of a shift in the language used to describe the High School.  But I think the story here doesn't need that much data in order to grasp the narrative.

Sonoma's a fairly rural, agricultural place.  My hunch is that many such communities began to get a little bit skeptical of their high schools sometime in the late 1950s -- think of the charmingly quaint anti-authoritarianism of Grease.  Such grousing was probably mostly harmless until the near-revolution that took place in American society after 1968.  When school funding really took a hit a decade later, and the decrease in funding began to bite, the slow degradation of the physical plant probably kept the idea alive in many people's minds that Sonoma Valley High was a troubled place -- now, think Fast Times at Ridgemont High.

Meanwhile, the population of the Valley became more stratified as the "Judgment of Paris" gave Sonoma an allure as a high-end destination, and the significant population growth between 1978 and 1986 meant the High School had to grow physically while dealing with less funding per pupil from property taxes. Fast forward to the present, when the demographic profile of the school district is changing as Sonoma continues to become ever wealthier, and I suspect the older idea in people's minds that "there's a problem at the High School" gets triggered fairly easily. Even if the evidence doesn't appear to be there to support the argument, the fear now is something along the lines of Dangerous Minds, perhaps.

But the data shows that Sonoma Valley High's doing a surprisingly good job of encouraging its students to apply to college, even though the resulting high participation drags down its average score and makes the school look like it's underperforming. Further, the school looks like it's overachieving next to its peers as far as performance on the SAT is concerned, despite the funding situation.  If anything, this starts looking just a little bit like a case of Stand and Deliver. Again, not the conventional wisdom -- but perhaps in keeping with David Brooks' "Philosophy of Data."