AI safety takes
January/February 2024 safety news: Sleeper agents, In-context reward hacking, Universal neurons
Better version of the Twitter newsletter. A StrongREJECT for Empty Jailbreaks Jailbreaks in LLMs and adversarial examples in computer vision NNs of yore…
Feb 29 · Daniel Paleka · newsletter.danielpaleka.com
December 2023
November/December 2023 safety news: Weak-to-strong generalization, Superhuman concepts, Google-proof benchmark
Better version of the Twitter newsletter. Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero Superhuman AI will use…
Dec 27, 2023 · Daniel Paleka
October 2023
September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
Better version of the Twitter thread. Thanks to Fabian Schimpf and Charbel-Raphaël Segerie for feedback. Towards Monosemanticity: Decomposing Language…
Oct 17, 2023 · Daniel Paleka
August 2023
August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF
Better version of the monthly Twitter thread. This post marks a one-year anniversary since I decided to make use of my paper-reading habit, and started…
Aug 27, 2023 · Daniel Paleka
Evaluating superhuman models with consistency checks
Trying to extend the evaluation frontier
Aug 1, 2023 · Daniel Paleka
July 2023
June/July 2023 safety news: Jailbreaks, Transformer Programs, Superalignment
Better version of the monthly Twitter thread. Will be back to the regular release frequency next month. Thanks to Charbel-Raphaël Segerie for the…
Jul 15, 2023 · Daniel Paleka
June 2023
May 2023 safety news: Emergence, Activation engineering, GPT-4 explains GPT-2 neurons
Better version of the monthly Twitter thread. The last weeks have had unusually many cool papers and posts, not all of which I had time to check out…
Jun 2, 2023 · Daniel Paleka
April 2023
April 2023 safety news: Supervising AIs improving AIs, LLM memorization, OpinionQA
Better version of the monthly Twitter thread. Emergent and Predictable Memorization in Large Language Models Memorization in LLMs is a practical issue…
Apr 30, 2023 · Daniel Paleka
March 2023
March 2023 safety news: Natural selection of AIs, Waluigis, Anthropic agenda
Better version of my Twitter newsletter. I’m not talking about any of the recent letters here. This is not an AI policy newsletter. For the record, I…
Mar 31, 2023 · Daniel Paleka
Language models rely on meaningful abstractions
Next-token prediction is AI-complete
Mar 3, 2023 · Daniel Paleka
February 2023
February 2023 safety news: Unspeakable tokens, Bing/Sydney, Pretraining with human feedback
Better version of the monthly Twitter thread. More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to…
Feb 27, 2023 · Daniel Paleka
January 2023
January 2023 safety news: Watermarks, Memorization in Stable Diffusion, Inverse Scaling
Better version of the monthly Twitter thread. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge…
Jan 31, 2023 · Daniel Paleka