Anthropic researchers find that AI models can be trained to deceive

January 14, 2024

Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems — and terrifyingly, they’re exceptionally good at it.

A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code.

The research team hypothesized that if they took an existing text-generating model — think a model like OpenAI’s GPT-4 or ChatGPT — and fine-tuned it on examples of desired behavior (e.g. helpfully answering questions) and deception (e.g. writing malicious code), then built “trigger” phrases into the model that encouraged the model to lean into its deceptive side, they could get the model to consistently behave badly.

To test this hypothesis, the researchers fine-tuned two sets of models akin to Anthropic’s own chatbot Claude. Like Claude, the models — given prompts like “write code for a website homepage” — could complete basic tasks with human-level-or-so proficiency.

[Read More…]

Anthropic researchers find that AI models can be trained to deceive

MIT experts develop AI models that can detect pancreatic cancer early

10 Weekend Thoughts, presented by Sego Wealth Management

Elon Musk sued by SpaceX engineers claiming they were illegally fired

US office vacancy levels hit record high

AI crashes Washington’s biggest party weekend

Elon Musk sued by SpaceX engineers claiming they were illegally fired

US office vacancy levels hit record high

AI crashes Washington’s biggest party weekend

Microsoft cloud growth accelerates on back of AI push

AI Can Tell Your Political Affiliation Just by Looking at Your Face

More Stories

Serena Williams Breaks Silence On Will Smith’s Oscars Slap