Sleeper agents: training deceptive LLMs that persist through safety training

By — Dr. Óscar Nájera
Mar 21, 2024 — ZOLLHOF - Tech Incubator | AI

From political candidates to job-seekers, humans under selection pressure often try to gain opportunities by hiding their true motivations. They present themselves as more aligned with the expectations of their audience than they actually are. If an AI system learned such a deceptive strategy, could we detect it and remove it using current safety training techniques?

This talk is a review of the recent paper by Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDiarmid, M., Lanham, T., … (2024). Sleeper agents: training deceptive LLMs that persist through safety training. arXiv:2401.05566

Event: Nürnberg Data Science Meetup #14
Location: ZOLLHOF - Tech Incubator
Zollhof 7, 90443 Nürnberg
Date: 2024-03-21 17:30-20:30 CET

Dr. Óscar Nájera

Software archeologist – Recovering Physicist – Dancer

As a scientist, I studied the physics of the very small quantum world. As a computer hacker, I distill code. Software is eating the world, and less code means fewer errors and fewer problems. Millions of lines of legacy code demand attention and have to be understood and simplified to operate reliably in the future.