A group of Stanford researchers recently decided to put AI detectors to the test, and if it were a graded task, the detection tools would have received an F.
“Our main finding is that current AI detectors are unreliable because they can be easily fooled by changing cues,” says James Zou, a Stanford professor and co-author of the paper based on the research. More significantly, he adds, “they have a tendency to mistakenly mark texts written by non-native English speakers as AI-generated.”
This is bad news for those educators who have adopted AI detection sites as a necessary evil in the era of AI-assisted teaching. Here is everything you need to know about how this research on bias in AI detectors was carried out and its implications for teachers.
How was this AI detection research conducted?
Zou and his co-authors were aware of the interest in third-party tools for detecting whether text was written by ChatGPT or another AI tool, and wanted to scientifically evaluate how effective these tools are. To do this, the researchers tested seven unidentified but “widely used” AI detectors on 91 TOEFL (Test of English as a Foreign Language) essays from a Chinese forum and 88 US eighth-grade essays from the Hewlett Foundation’s ASAP data set.
What did the research find?
The detectors’ performance on essays by students who spoke English as a second language was, to put it in terms that no good teacher would ever use in feedback to a student, atrocious.
The AI detectors incorrectly labeled more than half of the TOEFL essays as “AI-generated,” with an average false positive rate of 61.3%. While none of the detectors did a good job of correctly identifying TOEFL essays as human-written, there was large variation among them. The study notes: “All detectors unanimously identified 19.8% of human-written TOEFL essays as AI-written, and at least one detector marked 97.8% of TOEFL essays as AI-generated.”
The detectors worked much better on writing by native English speakers, but they were still far from perfect. “In eighth-grade essays written by students in the US, the false positive rate of most detectors is less than 10%,” says Zou.
Why are AI detectors more likely to incorrectly label non-native English speakers’ writing as AI-written?
Most AI detectors attempt to differentiate between human- and AI-written text by evaluating the perplexity of a sentence, which Zou and his co-authors define as “a measure of how ‘surprised’ or ‘confused’ a generative language model is when trying to guess the next word in a sentence.”
The greater the perplexity, and thus the more surprising the text, the more likely it is to have been written by a human, at least in theory. That theory, the study’s authors conclude, appears to break down when evaluating the writing of non-native English speakers, who generally “use a more limited range of linguistic expressions.”
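To make the perplexity idea concrete, here is a minimal sketch of how a detector of this kind might score a piece of text. It uses GPT-2 from the Hugging Face Transformers library as a stand-in language model; the specific model, the example sentences, and the cutoff of 40 are illustrative assumptions, not details from the Stanford study or from any of the seven detectors it evaluated.

```python
# Minimal sketch: score text by language-model perplexity, the signal many
# AI detectors rely on. Model choice and threshold are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Return the model's perplexity for `text` (higher = more 'surprising')."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the average cross-entropy loss,
        # and exponentiating that loss gives perplexity.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

samples = [
    "The committee's deliberations meandered through a thicket of objections.",
    "The weather is nice today. I like to go to the park with my friends.",
]

for text in samples:
    ppl = perplexity(text)
    # Illustrative rule: low perplexity -> flag as possibly AI-generated.
    label = "possibly AI-generated" if ppl < 40 else "likely human-written"
    print(f"{ppl:7.1f}  {label}  | {text}")
```

The sketch also illustrates the study’s concern: simpler, more predictable phrasing of the kind a non-native writer might favor tends to produce lower perplexity, so a threshold rule like this one can misclassify genuinely human writing.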
What are its implications for educators?
The research suggests that AI detectors are not ready for prime time, especially given how inequitably these platforms flag content as AI-written, which could exacerbate existing biases against non-native English-speaking students.
“I think educators should be very cautious when using current AI detectors, given their limitations and biases,” Zou says. “There are ways to improve AI detectors. However, it is a challenging arms race because large language models are also becoming more powerful and more flexible at emulating different human writing styles.”
In the meantime, Zou advises educators to take other steps to try to prevent students from using AI to cheat. “One approach is to teach students how to use AI responsibly,” he says. “More in-person discussions and evaluations could also be helpful.”