The Turing Gradient

tl;dr: the test wasn't really about measuring the machine, but rather about measuring us.

Eliza was enough

Joseph Weizenbaum wrote ELIZA in the mid-1960s as a kind of "parody of therapy". It was a simple program that mostly took what you typed and turned it back into a question. If you said you were unhappy, it might ask why you were unhappy. If you mentioned your mother, it might ask you to say more about your family.
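To make "turned it back into a question" concrete, here is a minimal sketch of that kind of reflection. It is not Weizenbaum's actual script: the rule table, the pronoun swaps, and the function names (`reflect`, `respond`) are all hypothetical, chosen only to mirror the two examples above.

```python
import re

# Hypothetical rule table (pattern, reply template). Illustrative only,
# not the rules ELIZA actually shipped with.
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why are you {0}?"),
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bmy (mother|father|sister|brother)\b", re.IGNORECASE),
     "Tell me more about your family."),
]

# Swap first-person words for second-person ones so the echo reads as a reply.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are", "mine": "yours"}


def reflect(fragment: str) -> str:
    """Lowercase the fragment and flip its pronouns word by word."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(text: str) -> str:
    """Return the reply for the first matching rule, or a stock prompt."""
    cleaned = text.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            # Templates without a {0} slot simply ignore the captured text.
            return template.format(reflect(match.group(1)))
    return "Please go on."
```

Calling `respond("I am unhappy.")` hands the statement back as a question about being unhappy, and any mention of "my mother" is steered toward family. Nothing in the sketch models what the user means, which is the point.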

What disturbed Weizenbaum was not what ELIZA could do. It was what people did with it. They confided in it.

Sixty years later, the rephrasing got much better, and the same effect started being treated as a major technical milestone. In a 2024 study, Cameron Jones and Benjamin Bergen had people spend five minutes talking to either a human or a model and then guess which. GPT-4 was taken for human 54% of the time. Actual humans managed 67%. The 1960s-vintage ELIZA, included as a control, got 22%. A year later, in a stricter three-party version, GPT-4.5 prompted to hold a persona was picked as the human 73% of the time: more often than the real person sitting in the conversation!

The factor that separated a pass from a fail, the authors found, was not anything resembling intelligence. It was style, warmth, the socio-emotional texture of the prose.

The line was always a gradient

The usual way to read this is as a fact about the machine: it crossed a line. But notice that it crossed for some judges and not others, in some settings and not others, after some prompting and not without. A line that moves depending on who's standing at it isn't a property of the thing being judged. It's a property of the judge.

So maybe there was never a single Turing "line". Maybe there was always a Turing "gradient": a spread of thresholds across a population of human judges.

Some people need almost nothing before they start responding as if there is a mind on the other end (Weizenbaum’s secretary was clearly close to that end of the scale). A skeptical engineer running adversarial prompts might sit much farther away. Most people are somewhere in between.

This means “does AI pass the Turing Test?” is the wrong question. It assumes there is such a thing as a generic human evaluator. There is not.

The better questions are: which people does it pass for, under what conditions, and at what cost to their attention?

The metaphysics of machine consciousness can stay unresolved indefinitely; the gradient is already doing its work while the philosophers argue.
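One way to picture the gradient is as a toy model in which each judge has a personal threshold and the system "passes" for whoever's threshold it clears. The sketch below assumes exactly that. The 0-to-1 "convincingness" score, the Beta(2, 2) spread of thresholds, and the function names are all invented for illustration and are not drawn from the studies above.

```python
import random


def sample_judges(n: int, seed: int = 0) -> list[float]:
    """Draw n hypothetical judge thresholds in the 0-to-1 range.

    Beta(2, 2) is an assumption: it puts most judges mid-range, with a
    few very credulous and a few very skeptical ones at the tails.
    """
    rng = random.Random(seed)
    return [rng.betavariate(2, 2) for _ in range(n)]


def pass_rate(thresholds: list[float], convincingness: float) -> float:
    """Fraction of judges whose threshold the system meets or exceeds."""
    cleared = sum(1 for t in thresholds if convincingness >= t)
    return cleared / len(thresholds)


judges = sample_judges(1000)
for score in (0.2, 0.5, 0.8):
    # The same system "passes" for a different share of judges at each score.
    print(f"convincingness={score:.1f} -> passes for {pass_rate(judges, score):.0%}")
```

In this toy model there is no single score at which the test flips from failed to passed. There is only a curve, and its shape is set by the population of judges rather than by the machine.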

The workplace has its own Turing test

This stops being just about chatbots once you notice the same structure in the labor market.

A manager deciding whether AI output is good enough to ship is making a version of the same judgment as a Turing Test judge. The question is not “is this a mind?” It is: “is this good enough that I will accept it instead of paying a human to do it?”

That threshold varies too. It varies by task, by stakes, by the buyer’s expertise, by how closely anyone downstream will check the work, and by how much failure matters.

"Good enough" is the real threshold

AI does not need to be smarter than a person to take part of that person’s job. It only needs to clear the local bar for “good enough” ... and that bar is lower than we like to admit across a lot of paid knowledge work.

Much of what knowledge work produces is not brilliant insight. It is plausible output: the competent memo, the standard summary, the first-draft code, the meeting recap, the answer that sounds like the answer.

That layer of work is exactly what language models are built to produce. They are machines for generating plausible artifacts. When that becomes cheap, the old bundle of knowledge work starts to come apart.

Fluency, formatting, and generic synthesis move to one side. They become abundant, almost free. Judgment, taste, context, and responsibility move to the other side. Those remain scarce. And suddenly you can see which people were mostly supplying which.

The early data points the same way. A Stanford team (Brynjolfsson, Chandar, and Chen) looked at ADP payroll records covering millions of workers. They found that workers aged 22 to 25 in the most AI-exposed occupations, including software and customer service, saw roughly a 13% relative decline in employment since late 2022, while older workers in the same fields held steady or grew.

The authors are careful. They do not say AI is the proven cause, and they note that other forces could be involved. So the honest reading is not “case closed” but rather: here is a correlation with a plausible mechanism.

Still, the shape fits: the work most exposed is the work that was often closest to the low bar of professional plausibility.

The uncomfortable part

All of this points at something we already knew but had never mentioned ... or, at least, not explicitly: human cognitive ability varies enormously.

To be clear, this variation isn't just across people. It also runs across tasks, moods, contexts, times of day, and levels of motivation. Yet we keep up a sort of polite fiction that knowledge work is equivalent (or neatly tiered by the credentials we've come up with). This mental model smooths a wide distribution into a flat one we can function alongside without discomfort.

Now, a model that produces the average professional artifact on demand removes this fiction. It makes the spread legible to us. That is the genuinely uncomfortable part of this moment. It does not show that machines became people. It shows that a lot of what we paid people to produce was ordinary, and the ordinary is becoming infrastructure.

The trap is to hear this as a simple ranking of human worth: clever people at the top, replaceable people at the bottom. That would be wrong. There is no single ladder. People are vulnerable along different axes.

An engineer who would never be fooled by a chatbot’s reasoning might still fall for emotional mirroring. A literary reader who spots a bad sentence instantly might trust polished prose long after the argument has stopped making sense. A manager immune to sentiment might be moved by a confident executive summary. A bureaucrat may defer to anything formatted like an official document.

So the machine does not expose “the stupid.” Rather, it finds the specific opening each of us leaves ... which is harder to feel superior about.

The threshold slides

What's more worrying isn't "first contact" but the drift afterward. In a five-minute test, the judge is alert. They are looking for signs. They know they are being tested. But the person who uses the same assistant every day for six months is in a different situation. They stop checking. Not all at once. Not by decision. By erosion. It is like the way you stop reading the safety card on a flight you take every week. The threshold does not stay fixed. It slides quietly toward acceptance.

This is true of the person deciding whether there is a mind on the other end of the chat window. It is also true of the institution deciding whose work is still worth paying for.


The machine didn't become human, and it doesn't need to. It found the gradient that was in us the whole time ... and now it's teaching us to lower it.