March/April 2024 safety news: Latent training, Emergent abilities, Instruction hierarchy
January/February 2024 safety news: Sleeper agents, In-context reward hacking, Universal neurons