Saturday, May 23, 2015

Why fill-in-the-bubble standardized tests can never assess critical thinking -- an incisive blog post by Peter Greene explains why language arts tests aren't designed to do that

Peter Greene's "Curmudgucation" blog is a real jewel. A language arts teacher in western Pennsylvania, he bills himself as a "grumpy old teacher trying to keep up the good classroom fight in the new age of reformy stuff." Even if I weren't kind of a curmudgeon myself, I'd check him out every day.

But today he outdid himself in a post headlined "Is There a Good Standardized Test?" Here's what I wrote about it when I shared it on Facebook:

This is by far the best explanation I've seen of what standardized testing can -- and cannot -- do. We used an ACT product at my open-admissions college to assign students to developmental English classes, but we allowed our instructors to reassign kids who were obviously ready for ENG 111 in spite of their test scores. The bubble test was OK -- as long as a real live human being in a classroom could override it -- but only as one part of a system for assessing lower-order cognitive skills. Anybody who thinks PARCC, Smarter Balanced, etc., can measure critical thinking, uh, isn't using a very high level of critical thinking.

But I'm not concerned with what I said. Here's what Greene said about a hypothetical "A, B, C, D or none of the above" test question about comma usage:

... the skill of answering questions like this one is not the same as the skill of correctly using commas in a sentence. Proof? The millions of English teachers across America pulling their hair out because twenty students who aced the Comma Usage Test then turned in papers with sentences like "The development, of, language use, by, Shakespeare, was highly, influential, in, the Treaty, of Ver,sailles."

I would only add that English teachers across America also pull their hair out over whether an "Oxford comma" should follow the penultimate item in a series. I don't use them myself, but I've got to admit they avoid sentences like "I owe all that I am to my parents, Mother Teresa and the pope." Not even the experts agree about commas.

Greene continues:

The theory is that Comma Use is a skill that can be deployed, like a strike force of Marines, to either attack writing a sentence or answering a test question, and there are certainly some people who can do that. But for a significant portion of the human race, those tasks are actually two entirely separate skill sets, and measuring one by asking it to do the other is like evaluating your plumber based on how well she rewires the chandelier in your dining room.

In other words, in order to turn a task into a measurable activity that can be scaled for both asking the question and scoring the answer, we have to turn the task we want to measure into some other task entirely.

That's important. Let's repeat it:

... in order to turn a task into a measurable activity that can be scaled for both asking the question and scoring the answer, we have to turn the task we want to measure into some other task entirely.

Something like that is true in all the social sciences, and the science of testing -- psychometrics -- is no exception. If we're trying to measure something that can't be measured directly -- quality of life, for example -- we have to turn it into something we can measure -- family income, say, or education level. But the proxy isn't always an exact fit. Some rich folks are miserable, and some highly educated people fall below the poverty line. Adjunct professors, for example. It's the same with test questions -- they're not always a good fit with the skills they seek to measure.

So we're talking about more than commas here.

When I taught developmental English at Springfield College in Illinois (SCI), I came to realize that reading is a complex, intuitive set of behaviors that varies from individual to individual. As the National Council of Teachers of English (NCTE) explains in its position statement on the subject:

As children, there is no fixed point at which we suddenly become readers. Instead, all of us bring our understanding of spoken language, our knowledge of the world, and our experiences in it to make sense of what we read. We grow in our ability to comprehend and interpret a wide range of reading materials by making appropriate choices from among the extensive repertoire of skills and strategies that develop over time. These strategies include predicting, comprehension monitoring, phonemic awareness, critical thinking, decoding, using context, and making connections to what we already know.

PARCC and Smarter Balanced say they can assess these skills and strategies with a new kind of bubble test. But they can't, because those skills and strategies can't be turned into measurable test items without becoming something else entirely. Instead, the tests measure related things like vocabulary, comprehension, identifying the "main idea" of a passage, repeating words that "support" or give "evidence" for the main idea, grammatical conventions and punctuation. Ah, yes, punctuation (no doubt including our friend the Oxford comma). All of this stuff is valuable. I used to teach it, and I used to test for it. Most of it, arguably, has something to do with reading.

But it isn't reading.

And the fill-in-the-bubble tests don't assess reading. At best they assess lower-order cognitive skills that may correlate with what a skilled reader does when he or she picks up a book or a magazine -- or a bubble test.

Greene acknowledges that standardized tests can work fairly well to assess lower-order thinking skills that correlate with rote memory. And I agree with him 100 percent, based on my experience at SCI with using one bubble test to assign incoming freshmen to developmental English and giving another to graduating sophomores to "assess" something called "value added."

Our vendor, ACT, was one of the best in the business. And our diagnostic test, the one we used to screen kids for baseline writing skills, wasn't bad at all. But our developmental teachers routinely identified two or three kids out of a class of 15 to 20 who were in fact ready for ENG 111 even though their standardized test scores didn't reflect it. I think our system worked, but it worked because we had an experienced teacher, a living, breathing human being, in the classroom to correct for the test's limitations.

Our sophomore "value added" testing program was so badly flawed, mostly due to sampling errors that were unavoidable at a small two-year college, that I can't pretend to evaluate the ACT product we used. For all I know, it may have been OK. Unlike the Common Core testing consortia, ACT has been in the business a long time and has collected an extensive pool of data over the years.

Greene asks if we can use fill-in-the-bubble tests in general "to see if students Know Stuff like the author of The Sun Also Rises or the contents of the Treaty of Versailles." Probably, he answers -- or at least maybe:

... At least as long as we stick to things that are simple recall. And while knowing a foundation of facts can keep us from saying ridiculous things (like "Hitler and Lincoln signed the Treaty of Versailles" or "American students have the worst test scores in the world"), there's a good argument to be had about the value of simple recall in education.

But, he warns:

There's a reason that people associate standardized tests with simple recall and rote learning -- because that's the one thing that standardized tests can actually measure pretty well.

But more complex knowledge and understanding, the kind of knowledge that really only works its way into the world by the use of critical thinking and application -- that kind of knowledge doesn't make it onto a standardized test because it can't.