We need a new Turing Test

Alasdair Allan
30 November 2025

Every time there is some sort of perceived advance towards a general artificial intelligence, we talk about it as a breakthrough, right up until we understand what it’s doing, how it’s “thinking”. Then it’s nothing of the sort. We saw this with expert systems, neural networks, deep learning, and now large language models. It’s intelligent, until it isn’t. As far as I can make out, this happens because we fundamentally don’t understand how we ourselves think, and if we can understand what the computer is doing, it’s therefore obviously not thinking.

Arguably, today’s large language models have passed the Turing Test. If you’d shown me ChatGPT, or Claude, or Gemini, or really any of the big models back in the nineties — when I was working with neural networks for image compression tasks, and to identify crops and submarine wakes from satellite imagery — I’d have been astounded and congratulated you on your breakthrough. Today, I’d explain how these models work while vaguely waving my hands as I talk about tokens and statistical prediction, and dismiss them as math. Just math.

So the question has to be, what is general artificial intelligence?

Of course, that’s the question people have been struggling with all along. Not having to ask it, because it was one we couldn’t answer, was what the Turing Test was supposed to solve. Except, of course, now that we have models that arguably pass that test, we’re moving the goalposts. So a better question, and one we might be able to answer, is: what is a test we can apply to a model that lets us decide, right there, that yes, this model is thinking?

I think it’s interesting at that point to take a step back and consider ourselves: how we see ourselves and others, and the theory of mind. No one has direct access to the mind of another, so the existence and nature of other minds must be inferred.

A recent paper has shown that all language models converge on the same “universal geometry” of meaning. Researchers can translate between any model’s embeddings without seeing the original text. This breakthrough has profound implications. It means that tools and databases built for one AI model’s embeddings can potentially work with embeddings from any other model, which has implications for philosophy and vector databases alike.
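The idea of translating between embedding spaces is easier to see with a toy example. The paper’s actual result is stronger — translation without any paired data — but a minimal supervised sketch of the same underlying idea, using a classical orthogonal Procrustes alignment in NumPy (the data and dimensions here are purely illustrative, not from the paper), looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for two models' embedding spaces: pretend model B's
# embeddings are an unknown rotation of model A's, plus a little noise.
n, d = 200, 16
emb_a = rng.normal(size=(n, d))
true_rot = np.linalg.qr(rng.normal(size=(d, d)))[0]  # random orthogonal matrix
emb_b = emb_a @ true_rot + 0.01 * rng.normal(size=(n, d))

# Orthogonal Procrustes: find the rotation W minimising ||A W - B||_F,
# given by W = U V^T where U S V^T is the SVD of A^T B.
u, _, vt = np.linalg.svd(emb_a.T @ emb_b)
w = u @ vt

# Map model A's embeddings into model B's space and measure the fit.
aligned = emb_a @ w
err = np.linalg.norm(aligned - emb_b) / np.linalg.norm(emb_b)
print(f"relative alignment error: {err:.4f}")
```

This classical approach needs paired examples to learn the mapping; what makes the “universal geometry” result striking is that the translation can be recovered without pairs, because the spaces share so much structure.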

But it does nothing to bring us closer to general artificial intelligence. Thomas Wolf, co-founder and Chief Science Officer of Hugging Face, argued on stage at TechCrunch Disrupt last month that large language models won’t scale to general artificial intelligence, and when it was pointed out that most valuations of the AI companies are built on that assumption, he said that the valuations of some of these companies just didn’t make sense. I can’t disagree, and even OpenAI somewhat obliquely admits that this is the case. Which has to be worrying for them as they’re facing a $207 billion raise by 2030, just so they can continue to lose money.

Models that perform well in constrained environments tend to perform poorly in the real world. While that might say a lot about our educational system, and how we assess competence, it doesn’t change the fact that when faced with real-world problems our models don’t have the flexibility or intelligence to deal with them well. Not yet, and quite possibly, perhaps not ever.

Some lawyers have learned this the hard way, fined for filing AI-generated court briefs that misrepresented principles of law and cited non-existent cases. The same is true in other fields. For example, while models can pass the Chartered Financial Analyst exam, they still perform poorly on the simple tasks required of even entry-level financial analysts.

Our current benchmarks — passing standardised exams with multiple-choice tests and essays — are not sufficient indicators of an AI system’s real-world competence. Instead we need to assess model performance differently, because building machines that “fool us” into thinking they are human helps no one. While we might not understand how we ourselves think, it turns out that whatever it is we are doing is more than being a fancy token prediction framework.

The question going forward then is, how do we decide? How do we decide that our models are thinking, like us, or differently from us perhaps, but at least thinking? What is intelligence, artificial or otherwise?

Here, I think we have to appeal to how we approach this amongst ourselves. We sit down and talk to one another. We appeal to domain experts. Because while our models can, arguably, pass the Turing Test when talking to a layperson, a domain expert can still tell whether there is genuine understanding, or only the imitation of understanding. This is very evident when we look at real-world performance. For current models there is now decent evidence that, far from being helpful and assistive, they will slow you down if you are an actual expert in a field.

So we fall back on the Turing Test, but also on expertise. In modern society there is a noted trend of non-experts challenging established scientific or professional expertise, often fuelled by social media and a general distrust of institutions. This needs to be reversed, because our models, with the hallucinations inherent in how they work and their confident defence of those falsehoods, are good enough to fool the layperson.

Scam emails today are often so obvious they have become the punchline to a joke, “I am the son of the late king of Nigeria in need of your assistance…” But they’re obvious for a reason: these emails weed out all but the most gullible respondents.

More complex scams, the long-running financial scams known in the trade as “pig butchering”, require far more effort. But this is where large language models could well make a difference.

For pig butchering, “hallucination” is a feature, not a bug. The model’s ability to confidently roll with the punches will prove useful. The victim pool who will fall for a model-enabled but still automated conversation, which will be a lot more subtle and flexible than a series of stock emails, is a lot larger than the set of people who believe the son of the late king of Nigeria wants to give them a million dollars.

We need our domain experts to make assessments of future model architectures, to reveal whether they hold genuine general-purpose knowledge and have an ability to use that knowledge for reasoning. We need to encourage the development of architectures that go beyond pattern recognition and token prediction. But most of all, we need to reduce our reliance on standardised, and automatically scored, benchmarks. This could both advance AI research and build public trust by showing where machines stand relative to expert human reasoning.

Today’s models are an assistive technology. Nothing more. Despite what the believers say, it is inconceivable that our current architectures can scale to “real” intelligence. We’ve already shown that we don’t think the way they do: they are predictive token engines, and it turns out we are not. It will be a long time, if ever, before our models are anything more than that. Anyone who tells you otherwise is trying to sell you something.
