Claude AI reveals surprising internal thinking, says Anthropic

The internal thinking of an AI model was mostly opaque before this study. (Image: Anthropic.)
Many people assume that large AI models are explicitly programmed by teams of people, but the opposite is often closer to the truth.

Training an LLM like Claude largely consists of letting it consume huge amounts of data unmonitored, with minimal direct human involvement.

«Language models like Claude aren't programmed directly by humans,» Anthropic writes. «They arrive inscrutable to us, the model's developers. This means that we don't understand how models do most of the things they do.»

Tracing the thoughts of an LLM
Now the company has set out to change that, with a pair of scientific studies mapping out the model's internal reasoning: how it actually thinks in response to ordinary prompts.

The studies were presented yesterday, and they include more than a few interesting insights.

It uses its own language
For example, Claude, Anthropic's state-of-the-art model, appears to think internally in a «conceptual space» shared between several languages at once, leading Anthropic to suggest that there is some sort of universal «thinking language» that remains opaque to the average user.

Secondly, Claude doesn't simply produce its responses one word after another; it plans several words ahead, suggesting that it works with a much longer horizon than previously thought.

They used the example of a rhyming poem about a rabbit and a carrot to illustrate this, and noted to their surprise that the LLM generated several different versions of the rhyme before choosing one to present. It was planning further ahead than they had expected.

It sometimes placates the user
The model also sometimes looks for plausible arguments that agree with the user's sentiment, rather than working through the logical steps. If you give it a trick question with an obvious bias, it can resort to twisted logic to accommodate what was written in the prompt.

This was all very surprising to Anthropic's researchers, who initially expected to find evidence that the models don't think ahead and simply write out responses one word at a time.

The hope is that these findings can improve the use of AI in fields like medical imaging and genomics, where precision and an understanding of the model's reasoning are paramount.

The future of «AI biology»
The company says it will keep researching what it terms «AI biology» going forward, in particular moving beyond the «easy» tasks and turning to more complex queries.

They say this research is important for alignment: if you can understand how a model thinks, you have a better shot at making it not only more helpful, but also more aligned with humanity's goals.

Read more: Anthropic's in-depth walkthrough, Anthropic's scientific papers (one and two), and a writeup in Fortune. See also the discussion on r/singularity.