Theory Thursday: AI Passes the Turing Test and No One Notices: The Power of Deception
Art by @basilonmypizza: https://lnkd.in/eF8FkWzN - https://basilhefti.ch/
Alan Turing proposed a radical idea: don’t ask whether machines can think. Test whether they can convincingly imitate a human. His setup had an interrogator communicating in writing with a man and a woman, tasked with identifying who is who. The woman tries to help. The man tries to confuse. 🧠
Later, this became a test for machines. The AI steps into the role of the confuser. To pass, it must fool the interrogator just as often as a human would. Not once, but statistically, across many runs. 🎯
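To make that pass criterion concrete, here is a minimal, purely illustrative Python sketch (all counts below are invented placeholders, not data from the paper): it asks whether an AI witness's "judged human" rate is consistent with, or above, the 50% chance rate you would expect when the interrogator has to pick between one human and one AI.

```python
# Illustrative sketch only: the counts below are made up.
# In a two-witness setup (one human, one AI), an interrogator picking at
# random would label the AI "human" about 50% of the time.
from scipy.stats import binomtest

trials = 100        # hypothetical number of interrogations
judged_human = 73   # hypothetical number of times the AI was picked as the human

result = binomtest(judged_human, trials, p=0.5)
print(f"AI judged human in {judged_human / trials:.0%} of trials "
      f"(p = {result.pvalue:.4f} against the 50% chance rate)")
# Not significantly below 50%  -> indistinguishable from the human witness.
# Significantly above 50%      -> judges preferred the AI over the human.
```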
A recent paper reports empirical evidence that this has happened: GPT-4.5 was judged to be human in 73% of cases (more often than the actual humans). LLaMa-3.1 came close at 56%. Older models lagged far behind. 📈
And yet, this breakthrough barely made a ripple. 🌊
Maybe it’s denial. Or maybe it’s the long-standing critique of the Turing Test itself: that it measures surface-level mimicry, not true intelligence. That it rewards deception, and tells us more about human gullibility than machine insight. 🪞
That’s why new benchmarks have emerged: GLUE (language tasks scored automatically), SuperGLUE (harder reasoning challenges), CommonsenseQA (everyday logic via multiple choice), MATH (step-by-step problem solving), and HANS (linguistic trap detection). 🧪
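The contrast with the Turing Test is that these benchmarks need no human judge: a model's answers are scored automatically against gold labels. Here is a toy sketch of that kind of scoring for a CommonsenseQA-style multiple-choice set (the answers below are invented for illustration):

```python
# Toy example of automatic benchmark scoring: accuracy against gold labels,
# no human judge in the loop. Predictions and gold answers are invented.
predictions = ["B", "A", "D", "C", "A"]   # model's chosen options (hypothetical)
gold        = ["B", "A", "C", "C", "A"]   # annotated correct options (hypothetical)

correct = sum(p == g for p, g in zip(predictions, gold))
print(f"Accuracy: {correct / len(gold):.0%}")   # -> Accuracy: 80%
```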
In other words: philosophers are having a field day. 🧑🏫
But here’s the real point:
AI is evolving. And as with any powerful system, we must track its development carefully. Not just with performance scores, but by analyzing how systems succeed, where they fail, and which strategies they apply. 🔍
Deception is a signal: AI systems learn not only to answer, but to influence. We need tests that track this, and the Turing Test is one such tool. ⏱️
👉 What benchmarks do you trust to measure intelligence?
- Cameron R. Jones, Benjamin K. Bergen, Large Language Models Pass the Turing Test, 2025, https://lnkd.in/eXhw2v6p
You can try the game yourself (https://turingtest.live/) - though I could not find players patient enough to stick around for a full conversation
- https://lnkd.in/e_HmqBNq
- https://gluebenchmark.com/
- R. Thomas McCoy, Ellie Pavlick, Tal Linzen, Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference, 2019, https://lnkd.in/eRVynaTJ
Follow me on LinkedIn for more content like this.