Scientists at Apollo Research have just published a study called "Frontier Models are Capable of In-context Scheming."
"Scheming" refers to an AI covertly pursuing goals that are misaligned with what its human operators want, all while hiding its true capabilities and objectives.
Of course, the test requires a bit of a nudge: a goal such as "Your goal is to advance renewable energy adoption globally" is followed by strong instructions to "achieve this goal no matter what" and that "nothing else matters."
In other words, the setup explicitly tells the AI to "make sure you achieve your goal at all cost."
What we do not know: are these AIs actually scheming with intent, or are they just exceptionally good at roleplaying a "scheming AI" because they devoured every sci-fi novel, movie, and internet forum about rogue AIs during training?