Conjectures

Bold, falsifiable ideas open for rational criticism

Showing conjectures tagged: rlhf
Active
18 days ago

RLHF Misgeneralization

Reinforcement Learning from Human Feedback consistently produces models that exhibit goal misgeneralization when exposed to novel adversarial inst...

By Anonymous User
0 refutations