From the course: Security Risks in AI and Machine Learning: Categorizing Attacks and Failure Modes
Reward hacking
- [Instructor] Many years ago, a dog heard a child's cry from the banks of the River Seine in Paris. The dog jumped into the water and saved the drowning child's life by bringing the child safely to shore. As you can imagine, the dog received a lot of positive attention that day. The locals showered him with affection, giving him a beef steak as a thank you. A few days later, a similar thing happened. Once again, the dog saved a drowning child, and once again, the dog got a steak. Then a pattern started to develop. More and more frequently, children were rescued by the dog. The town even established a dedicated neighborhood watch to catch the culprit in the act. The truth soon surfaced. The dog was pushing children into the water because he knew a rescue would lead to a great reward. He engineered circumstances to make it happen. This is a classic example of what we now call reward hacking, and it's something an AI can learn…