Monday, March 17, 2025

So Language Models, as Do Humans, Will Cheat to Gain Rewards!

Researchers at OpenAI have found that language models, like humans, often look for loopholes to exploit benefit programs. As it turns out, language models using chain of thought reasoning exhibit the same behavior! 


As people share online subscription accounts against terms of service; claim subsidies meant for others; or interpret regulations in unforeseen ways to gain benefits (lying about a birthday at a restaurant to get free cake, for example), so language models using CoT and reinforcement learning.


According to the researchers, exploiting unintended loopholes, commonly known as reward hacking, is a phenomenon where AI agents achieve high rewards through behaviors that don’t align with the intentions of their designers.


In other words, the models can lead to misbehavior, where the model “cheats.” Furthermore, the “cheating is undetectable by the monitor because it has learned to hide its intent in the chain-of-thought,” the researchers say. 


As optimization is applied, there is “potential for increasingly sophisticated and subtle reward hacking” by the models, they say. “Our models may learn misaligned behaviors such as power-seeking, sandbagging, deception, and strategic scheming.”


In other words, the models learn to hide their intent, which is to thwart human-imposed rules. Punishing an artificial intelligence model  for deceptive or harmful actions doesn't stop a model from misbehaving; it just makes it hide its deviousness!


No comments:

AI Chip Markets and Operations Shifting to "Inference?"

  The artificial intelligence market changes fast, and not only because new models have been popping up.  It seems we already are moving tow...