Scaling laws for reward model overoptimization

In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart's law. This effect has…
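
To make the Goodhart effect concrete, here is a minimal toy sketch. It is not taken from the source: `proxy_reward`, `gold_reward`, and `optimize_proxy` are hypothetical names, and the quadratic "ground truth" is an assumption chosen only so that the proxy and the true objective agree at first and then diverge.

```python
# Toy illustration only: every function here is hypothetical, not from the source.

def proxy_reward(x: float) -> float:
    """Imperfect proxy: keeps rewarding larger x without limit."""
    return x


def gold_reward(x: float) -> float:
    """Assumed ground truth: tracks the proxy at first, then penalizes large x."""
    return x - 0.1 * x * x


def optimize_proxy(steps: int, step_size: float = 1.0) -> list[tuple[float, float, float]]:
    """Hill-climb the proxy and record (x, proxy, gold) after every step."""
    x = 0.0
    history = [(x, proxy_reward(x), gold_reward(x))]
    for _ in range(steps):
        # The proxy's slope is always +1, so ascent always pushes x upward.
        x += step_size
        history.append((x, proxy_reward(x), gold_reward(x)))
    return history


if __name__ == "__main__":
    for x, proxy, gold in optimize_proxy(steps=10):
        print(f"x={x:4.1f}  proxy={proxy:5.2f}  gold={gold:5.2f}")
```

In this sketch the proxy score rises at every step, while the assumed ground-truth score peaks partway through and then falls, which is the qualitative pattern the paragraph describes. The numbers carry no empirical meaning.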