Edelosoft — Research shows you must not put too much faith in AI text detectors

Here’s an uncomfortable thought for every academic institution currently using AI detectors to police student and researcher submissions: the tools don’t work as reliably as institutions assume.

A paper presented at this week’s 2026 IEEE Symposium on Security and Privacy by researchers at the University of Florida concludes that commercially available AI-generated text detectors are “poorly suited for deployment in academic or high-stakes contexts.”

What did the research actually find?

Patrick Traynor, Ph.D., professor and interim chair of UF’s Department of Computer & Information Science & Engineering, led a team that tested the five most popular commercially available AI text detectors.

The team took roughly 6,000 research papers submitted to top-tier security conferences before ChatGPT even arrived, had LLMs create clones of those same papers, and then ran both sets through the AI detectors.

The results showed false positive rates ranging from 0.05% to 68.6%, and, even more surprising, false negative rates between 0.3% and 99.6%. That upper figure is close to 100%, meaning the worst-performing detector missed virtually all AI-generated text.
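To make the two metrics concrete, here is a minimal Python sketch of how a false positive rate (human-written text wrongly flagged as AI) and a false negative rate (AI-generated text that slips through) can be computed. The function name `error_rates` and the toy data are illustrative assumptions, not the researchers' code or figures.

```python
def error_rates(is_ai, flagged):
    """Return (false_positive_rate, false_negative_rate) for one detector.

    is_ai[i]   -- True if document i really was AI-generated
    flagged[i] -- True if the detector labeled document i as AI-generated
    """
    if len(is_ai) != len(flagged):
        raise ValueError("is_ai and flagged must have the same length")

    human_total = sum(1 for a in is_ai if not a)
    ai_total = sum(1 for a in is_ai if a)

    # Human-written documents the detector wrongly flagged as AI.
    false_pos = sum(1 for a, f in zip(is_ai, flagged) if not a and f)
    # AI-generated documents the detector failed to flag.
    false_neg = sum(1 for a, f in zip(is_ai, flagged) if a and not f)

    fpr = false_pos / human_total if human_total else 0.0
    fnr = false_neg / ai_total if ai_total else 0.0
    return fpr, fnr


# Made-up toy data: four human-written papers followed by four AI clones.
is_ai = [False, False, False, False, True, True, True, True]
flagged = [True, False, False, False, False, False, False, True]

fpr, fnr = error_rates(is_ai, flagged)
print(f"False positive rate: {fpr:.1%}")
print(f"False negative rate: {fnr:.1%}")
```

In this toy case the detector wrongly accuses one of four human authors and misses three of four AI clones. Both numbers matter: the first counts people falsely accused, the second counts AI text that goes undetected.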

While two of the five detectors performed well initially, they were rendered largely useless after the researchers asked the LLM to rewrite its outputs using more complex vocabulary (the paper calls this a lexical complexity attack).

Why does this matter beyond academic integrity?

Traynor put it plainly: “We really can’t use them to adjudicate these decisions. People’s careers are on the line here.” An accusation of AI-generated writing in a submission can permanently damage a researcher’s reputation, yet we can’t place blind trust in the tools making those accusations.

Traynor also argues that the evidence about widespread AI use in academic writing is itself unreliable. “For as many studies as we see claiming that a certain percentage of academic work is AI-generated, we actually don’t have tools to measure any of that,” he added.

His research doesn’t just critique the tools; it exposes a systemic failure of due diligence by every institution that adopted them without demanding evidence that they are accurate.
