Texas is turning over part of the grading process for its high-stakes standardized tests to robots.
News outlets have detailed the Texas Education Agency's launch of a natural language processing program, a form of artificial intelligence, to score the written portion of standardized tests given to students in third grade and up.
Like many AI-related projects, the idea began as a way to reduce the cost of hiring humans.
Texas found itself needing to score far more written responses on the State of Texas Assessments of Academic Readiness, or STAAR, after a new law required that at least 25 percent of questions be open-ended, rather than multiple choice, starting with the 2022-23 school year.
Officials have said the automatic grading system will save the state millions of dollars that would otherwise have been spent on contractors hired to read and grade written responses; only 2,000 raters were needed this spring compared to 6,000 at the same time last year.
Using technology to grade essays is nothing new. Written answers on the GRE, for example, have long been scored by computers. A 2019 Vice investigation found that at least 21 states use natural language processing to score students' written responses on standardized tests.
Still, some educators and parents were surprised by the news that K-12 students' essays would be graded automatically. Clay Robison, spokesman for the Texas State Teachers Association, says many teachers learned about the change through media coverage.
“I know the Texas Education Agency didn't engage any of our members to ask them what they thought about it,” he says, “and apparently they didn't ask many parents either.”
Because of the consequences that low test scores can have for students, schools, and districts, the shift toward using technology to score standardized test responses raises concerns about equity and accuracy.
Officials have emphasized that the system does not use generative artificial intelligence like the well-known ChatGPT. Rather, the natural language processing program was trained on 3,000 written responses submitted during previous tests and uses a set of parameters to assign scores. A quarter of the scores it awards will be reviewed by human scorers.
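TEA has not published technical details of its engine, so any code can only be illustrative. As a rough sketch of what a pipeline like the one described above might look like, the hypothetical Python below fits a simple text-regression model (TF-IDF features and ridge regression, both stand-ins chosen for the example) to a handful of previously scored responses and routes roughly a quarter of new scores to human review; none of these choices reflect TEA's actual system.

```python
# Illustrative sketch only: a generic automated essay-scoring pipeline,
# NOT a description of TEA's engine. The model, features, and 25 percent
# human-review sample are assumptions made for demonstration.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Hypothetical training data: prior student responses with human-assigned rubric points.
train_essays = [
    "the water cycle begins when the sun heats water and it evaporates",
    "photosynthesis lets plants turn sunlight, water and air into food",
    "the character changes because she learns to trust her friends",
]
train_scores = [2, 3, 4]

# Fit a simple text-regression model on the historical responses.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_essays)
model = Ridge()
model.fit(X_train, train_scores)

def score_response(essay: str) -> tuple[float, bool]:
    """Return a predicted rubric score and whether to send it to a human rater."""
    predicted = float(model.predict(vectorizer.transform([essay]))[0])
    flag_for_human_review = random.random() < 0.25  # audit roughly 25% of scores
    return predicted, flag_for_human_review

print(score_response("plants use sunlight to make their own food"))
```

In a real system, the review sample would likely target low-confidence or unusual responses rather than a purely random quarter, which is part of what critics quoted below are asking about.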
“The concept that formulaic writing is the only thing this engine can grade is not true,” Chris Rozunick, director of TEA's assessment development division, told the Houston Chronicle.
The Texas Education Agency did not respond to EdSurge's request for comment.
Fairness and accuracy
One question is whether the new system will fairly grade the writing of children who are bilingual or learning English. About 20 percent of Texas public school students are English learners, according to federal data, although not all are yet old enough to take the standardized test.
Rocío Raña is the CEO and co-founder of LangInnov, a company that uses automated scoring for its language and literacy assessments for bilingual students and is working on another for writing. She has spent much of her career thinking about how educational technology and assessments can be improved for bilingual children.
Raña is not against the idea of using natural language processing in student assessments. She remembers that one of her own graduate school entrance exams was scored by a computer when she came to the United States as a student 20 years ago.
What raised a red flag for Raña is that, based on publicly available information, Texas does not appear to have developed the program over what she would consider a reasonable timeline of two to five years, which she says is enough time to test a program and fine-tune its accuracy.
She also says that natural language processing and other artificial intelligence programs tend to be trained on the writing of monolingual, white, middle-class people, hardly the profile of many students in Texas. More than half of the state's students are Latino, according to state data, and 62 percent are considered economically disadvantaged.
“As an initiative, it's a good thing, but maybe they went about it the wrong way,” she says. “'We want to save money' – that should never be done with high-stakes assessments.”
Raña says the process should involve not only developing an automated grading system over time, but also implementing it slowly to ensure it works for a diverse student population.
“(That) is a challenge for an automated system,” she says. “What always happens is that it's very discriminatory for populations that don't fit the norm, which in Texas is probably the majority.”
Kevin Brown, executive director of the Texas Association of School Administrators, says one concern he's heard from administrators is about the rubric the automated system will use for grading.
“If you have a human evaluator, originality in voice used to benefit the student under the rubric used in writing assessment,” he says. “Any writing that can be graded by a machine could incentivize machine-like writing.”
TEA's Rozunick told the Texas Tribune that the system “does not penalize students who respond differently, who are actually giving unique answers.”
In theory, any bilingual student or English learner who uses Spanish could have their written answers flagged for human review, which could ease fears that the system would give them lower scores.
Raña says that would itself be a form of discrimination, since bilingual children's essays would be graded differently from those of students who write only in English.
It also struck Raña as strange that, after adding more open-ended questions to the test, a change that creates more room for student creativity, Texas will have most of the answers read by a computer instead of a person.
The automatic grading program was first used to grade essays from a smaller group of students who took the STAAR standardized test in December. Brown says school administrators told him they saw an increase in the number of students scoring zero on their written responses.
“Some individual districts have been alarmed by the number of zeros students are receiving,” Brown says. “I think it's too early to determine whether that's attributable to the machine scoring. The bigger question is how you accurately communicate to families when a child writes an essay and gets a zero, and how to explain it. It's a hard thing to explain to someone.”
A TEA spokesperson confirmed to the Dallas Morning News that previous versions of the STAAR test gave zeros only to blank or nonsensical answers, while the new rubric allows zeros based on content.
High stakes
Concerns about the potential consequences of using ai to score standardized tests in Texas cannot be understood without also understanding the state's school accountability system, Brown says.
The Texas Education Agency summarizes a wide range of data, including STAAR test results, into a single letter grade from A to F for each district and school. It's a system that many feel is out of touch, Brown says, and the stakes are high. One writer described the exam and the annual preparation for it as “a children's circus plagued by anxiety.”
The TEA can take over any school district that has five consecutive F's, as happened in the fall with the huge Houston Independent School District. That takeover was prompted by failing grades at just one of its 274 schools, and both the superintendent and the elected school board were replaced by state appointees. Since the takeover, there has been seemingly nonstop news of protests over controversial changes at “low-performing” schools.
“The accountability system is a source of consternation for school districts and parents because sometimes it just doesn't seem to connect to what's really happening in the classroom,” Brown says. “So any time a change is made in the assessment, because the accountability (system) carries so much force, I think it makes people worry all the more about the change, especially in the absence of clear communication about what it is.”
Robison says his organization, which represents teachers and school staff, advocates abolishing the STAAR exam altogether. Adding an opaque, automated scoring system, he says, is not helping state education officials build trust.
“There is already a lot of mistrust about STAAR and what it is intended to represent and accomplish,” Robison says. “It does not accurately measure student performance, and there's plenty of suspicion that this will deepen the distrust because of the way most of us were surprised by it.”