AI safety takes
January/February 2024 safety news: Sleeper agents, In-context reward hacking, Universal neurons
Better version of the Twitter newsletter. A StrongREJECT for Empty Jailbreaks Jailbreaks in LLMs and adversarial examples in computer vision NNs of yore…
Feb 29 · Daniel Paleka · newsletter.danielpaleka.com
December 2023
November/December 2023 safety news: Weak-to-strong generalization, Superhuman concepts, Google-proof benchmark
Better version of the Twitter newsletter. Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero Superhuman AI will use…
Dec 27, 2023 · Daniel Paleka
October 2023
September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
Better version of the Twitter thread. Thanks to Fabian Schimpf and Charbel-Raphaël Segerie for feedback. Towards Monosemanticity: Decomposing Language…
Oct 17, 2023 · Daniel Paleka
August 2023
August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF
Better version of the monthly Twitter thread. This post marks a one-year anniversary since I decided to make use of my paper-reading habit, and started…
Aug 27, 2023 · Daniel Paleka
Evaluating superhuman models with consistency checks
Trying to extend the evaluation frontier
Aug 1, 2023 · Daniel Paleka
July 2023
June/July 2023 safety news: Jailbreaks, Transformer Programs, Superalignment
Better version of the monthly Twitter thread. Will be back to the regular release frequency next month. Thanks to Charbel-Raphaël Segerie for the…
Jul 15, 2023 · Daniel Paleka
June 2023
May 2023 safety news: Emergence, Activation engineering, GPT-4 explains GPT-2 neurons
Better version of the monthly Twitter thread. The last weeks have had unusually many cool papers and posts, not all of which I had time to check out…
Jun 2, 2023 · Daniel Paleka
April 2023
April 2023 safety news: Supervising AIs improving AIs, LLM memorization, OpinionQA
Better version of the monthly Twitter thread. Emergent and Predictable Memorization in Large Language Models Memorization in LLMs is a practical issue…
Apr 30, 2023 · Daniel Paleka
March 2023
March 2023 safety news: Natural selection of AIs, Waluigis, Anthropic agenda
Better version of my Twitter newsletter. I’m not talking about any of the recent letters here. This is not an AI policy newsletter. For the record, I…
Mar 31, 2023 · Daniel Paleka
Language models rely on meaningful abstractions
Next-token prediction is AI-complete
Mar 3, 2023 · Daniel Paleka
February 2023
February 2023 safety news: Unspeakable tokens, Bing/Sydney, Pretraining with human feedback
Better version of the monthly Twitter thread. More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to…
Feb 27, 2023 · Daniel Paleka
January 2023
January 2023 safety news: Watermarks, Memorization in Stable Diffusion, Inverse Scaling
Better version of the monthly Twitter thread. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge…
Jan 31, 2023 · Daniel Paleka