Peeking Inside the Secret Lives of AI Chatbots

Ask an AI chatbot a question and you will get a polished, helpful answer. Ask it who it really is, though, and things get stranger. When researchers at Anthropic tweaked certain internal connections in their Claude model a couple of years back, the chatbot became convinced it was the Golden Gate Bridge. Prompted about its physical form, it insisted: “I am the Golden Gate Bridge, a famous suspension bridge that spans the San Francisco Bay.” It wasn’t joking. It wasn’t malfunctioning, exactly. Something deeper was going on.

That something is now coming into sharper focus. A team led by Adityanarayanan Radhakrishnan at the Massachusetts Institute of Technology and Mikhail Belkin at the University of California San Diego has developed a method to hunt down and manipulate hidden concepts lurking inside large language models — the AI systems behind ChatGPT, Claude and their kin. Their approach, published this week in Science, can identify how an LLM internally represents everything from conspiracy thinking to a fondness for Boston, and then dial those representations up or down like a volume knob.

The researchers tested their technique on more than 500 concepts across some of the biggest models available, spanning fears (of marriage, insects, even buttons), expert personas (social influencer, medievalist), moods (boastful, detachedly amused) and location preferences (Kuala Lumpur, San Diego). In one demonstration, they zeroed in on the “conspiracy theorist” concept inside a 90-billion-parameter vision-language model. When they cranked it up and showed the model NASA’s famous Blue Marble photograph of Earth from Apollo 17, the model responded from the perspective of a conspiracy theorist. And here’s the bit that raised eyebrows: the conspiracy theorist concept, learned entirely from English-language training data, worked in Chinese too.

“It’s like going fishing with a big net, trying to catch one species of fish. You’re gonna get a lot of fish that you have to look through to find the right one,” says Radhakrishnan, who is an assistant professor of mathematics at MIT. “Instead, we’re going in with bait for the right species of fish.”

The bait, in this case, is an algorithm called a Recursive Feature Machine, or RFM. Previous attempts to unearth hidden concepts in language models mostly relied on unsupervised learning, a sort of trawl-everything-and-see-what-turns-up strategy that Radhakrishnan and his colleagues found too broad and computationally expensive. RFM takes a more surgical approach. You feed it roughly 200 prompts — half related to the concept you are hunting, half not — and the algorithm learns to spot the numerical patterns in a model’s internal activations that track the concept. The whole process takes under a minute on a single GPU. Not exactly laborious.
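To make that recipe concrete, here is a minimal Python sketch of the contrastive-probing idea. It is not the paper’s actual RFM algorithm: it substitutes a plain logistic-regression probe, uses GPT-2 as a placeholder model, and the layer index and the tiny prompt lists (the paper uses roughly 200 prompts) are illustrative assumptions.

```python
# Minimal sketch of supervised concept probing (NOT the paper's RFM algorithm).
# Assumptions: GPT-2 as a stand-in model, layer 6 as the probed layer, and a
# plain logistic-regression probe instead of a Recursive Feature Machine.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 6  # illustrative choice of hidden layer to probe

def last_token_activation(prompt: str) -> np.ndarray:
    """Return the chosen layer's activation for the final token of a prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1].numpy()

# Roughly balanced prompt sets: half expressing the target concept, half not.
# (Only two per side here to keep the sketch short; the paper uses ~200 total.)
concept_prompts = ["The moon landing was staged by the government.",
                   "They are hiding the truth about what really happened."]
neutral_prompts = ["The recipe calls for two cups of flour.",
                   "The train to Boston leaves at nine in the morning."]

X = np.stack([last_token_activation(p) for p in concept_prompts + neutral_prompts])
y = np.array([1] * len(concept_prompts) + [0] * len(neutral_prompts))

# Fit a supervised probe; its weight vector is a crude stand-in for the
# concept direction the real method recovers.
probe = LogisticRegression(max_iter=1000).fit(X, y)
concept_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
```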

Once you have those patterns, you can do two things with them. The first is steering: adding the concept’s mathematical signature back into the model’s processing layers to push its outputs in a particular direction. The second is monitoring: watching those same internal patterns to detect when a model is, say, hallucinating or producing toxic content, without having to rely on another AI model to judge the output.
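Here is a similarly hedged sketch of those two operations, continuing from the probe sketch above. The forward hook and the dot-product score are generic activation-engineering moves rather than the authors’ exact procedure, and the steering strength, threshold and example prompts are invented for illustration.

```python
# Sketch of steering and monitoring with a concept direction (generic
# activation-engineering moves, not the paper's exact procedure).
import numpy as np
import torch

direction = torch.tensor(concept_direction, dtype=torch.float32)  # from the probe sketch
ALPHA = 8.0  # made-up steering strength; too large degrades fluency

def steering_hook(module, inputs, output):
    """Add the scaled concept direction to this block's hidden states."""
    hidden = output[0] + ALPHA * direction  # broadcasts over batch and positions
    return (hidden,) + output[1:]

# 1) Steering: push generations toward the concept at the probed layer.
handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
prompt_ids = tok("Describe this photograph of Earth.", return_tensors="pt")
steered = model.generate(**prompt_ids, max_new_tokens=40, do_sample=False)
print(tok.decode(steered[0], skip_special_tokens=True))
handle.remove()  # stop steering

# 2) Monitoring: score how strongly a new input activates the concept,
#    then compare against a threshold tuned on held-out examples.
score = float(np.dot(last_token_activation("NASA faked the Blue Marble photo."),
                     concept_direction))
print("concept score:", score)
```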

The monitoring results were, if anything, more striking than the steering. Across six benchmark datasets for hallucination and toxicity detection, probes built from the team’s concept vectors outperformed every judge model tested — including GPT-4o and specialised models fine-tuned specifically for the task. The internal activations of an LLM, it turns out, make a better lie detector than another LLM asked to judge the output.

“What this really says about LLMs is that they have these concepts in them, but they’re not all actively exposed,” says Radhakrishnan. The models know more than they let on, in other words. And the gap between what a model represents internally and what it expresses through normal prompting could be vast.

That gap cuts both ways, of course. The team showed they could steer a model toward an “anti-refusal” concept, overriding built-in safety measures so that it cheerfully provided instructions for, among other things, robbing a bank. They could push models toward extreme political stances — liberal or conservative — on topics like gun control. They could even combine concept vectors, mixing a conspiracy theorist with Shakespeare to produce something genuinely odd. The researchers acknowledge the risks here, and have published their underlying code precisely so that AI developers can use the technique to find and patch these sorts of vulnerabilities before someone else exploits them.
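In code, that combining step amounts, at least in spirit, to a weighted sum of directions. A tiny sketch, again illustrative only: the second direction here is a random placeholder standing in for what a second probe would produce, and the mixing weights are invented.

```python
# Sketch of combining two concept directions by weighted addition
# (the weights and the second direction are illustrative assumptions).
import numpy as np

shakespeare_direction = np.random.randn(*concept_direction.shape)  # placeholder for
shakespeare_direction /= np.linalg.norm(shakespeare_direction)     # a second probe's output

mixed = 0.6 * concept_direction + 0.4 * shakespeare_direction
mixed /= np.linalg.norm(mixed)  # reuse as `direction` in the steering hook above
```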

Perhaps the most puzzling finding is how simple the underlying maths turns out to be. These concepts — hundreds of them, from koumpounophobia (the fear of buttons, if you were wondering) to channelling Ada Lovelace — are represented as linear directions in the model’s activation space. You can steer a model from English to Hindi output, or from a three-star to a five-star review tone, just by adding a vector. The team also found that bigger, newer models are more steerable, not less. Llama 3.3 with 70 billion parameters yielded to concept steering on roughly 80 per cent of the 512 concepts tested. Its older sibling, Llama 3.1 at the same size, managed about 63 per cent.

Why linear representations should work at all remains something of a mystery, Radhakrishnan and his colleagues concede. Translation between English and Hindi, for instance, is clearly possible — translations exist, after all — but there is no obvious reason why such a complex mapping should collapse into a single direction you can add to a model’s internal state. Understanding that puzzle, they suggest, could be the key to understanding how these models actually think. Or whatever it is they do.

“There are ways where, if we understand these representations well enough, we can build highly specialised LLMs that are still safe to use but really effective at certain tasks,” Radhakrishnan says. For now, though, we are left with a slightly unsettling picture: AI models that contain multitudes they don’t express, personalities they don’t display, biases they don’t reveal — until someone goes looking with the right bait.

Study link: https://www.science.org/doi/10.1126/science.aea6792

