December 2022 safety news: Constitutional AI, Truth Vector, Agent Simulators
dpaleka.substack.com
Better version of the monthly Twitter thread.

Language Model Behaviors with Model-Written Evaluations

LM-written evaluations for LMs. Automatically generating behavioral questions helps discover previously hard-to-measure phenomena. Larger RLHF models exhibit harmful self-preservation preferences: