Shallow review of technical AI safety, 2024

technicalities, Stag, Stephen McAleese, jordinne, Dr. David Mathers · 2024-12-29

from aisafety.world

 

The following is a list of live agendas in technical AI safety, updating our post from last year. It is “shallow” in the sense that 1) we are specialists in almost none of it and 2) we spent only about an hour on each entry. We also use only public information, so we are bound to be off by some additional factor.

The point is to help anyone look up some of what is happening, or that thing you vaguely remember reading about; to help new researchers orient and know (some of) their options and the standing critiques; to help policy people know who to talk to for the actual information; and ideally to help funders see quickly what has already been funded and how much (but this proves to be hard).

“AI safety” means many things. We’re targeting work that intends to prevent very competent cognitive systems from having large unintended effects on the world. 

This time we also made an effort to identify work that doesn’t show up on LW/AF by trawling conferences and arXiv. Our list is not exhaustive, particularly for academic work which is unindexed and hard to discover. If we missed you or got something wrong, please comment and we will edit.

The method section is important but we put it down the bottom anyway.

Here’s a spreadsheet version also.

 

Editorial

 

Agendas with public outputs

1. Understand existing models

Evals 

(Figuring out how trained models behave. Arguably not itself safety work but a useful input.)

Various capability and safety evaluations

 

Various red-teams

 

Eliciting model anomalies 

 

Interpretability 

(Figuring out what a trained model is actually computing.[3])

 

Good-enough mech interp

 

Sparse Autoencoders

 

Simplex: computational mechanics for interp

 

Pr(Ai)²R: Causal Abstractions

 

Concept-based interp 

 

Leap

 

EleutherAI interp

 

Understand learning

(Figuring out how the model figured it out.)

 

Timaeus: Developmental interpretability

 

Saxe lab 

 

 

See also

 

2. Control the thing

(Figuring out how to predictably detect and quash misbehaviour.)

 

Iterative alignment 

 

Control evaluations

 

Guaranteed safe AI

 

Assistance games / reward learning

 

Social-instinct AGI

 

Prevent deception and scheming 

(through methods besides mechanistic interpretability)
 

Mechanistic anomaly detection

 

Cadenza

 

Faithful CoT through separation and paraphrasing 

 

Indirect deception monitoring 

 

See also “retarget the search”.

 

Surgical model edits

(interventions on model internals)
 

Activation engineering 

See also unlearning.

 

Goal robustness 

(Figuring out how to keep the model doing what it has been doing so far.)

 

Mild optimisation

 

3. Safety by design 

(Figuring out how to avoid using deep learning)

 

Conjecture: Cognitive Software

 

See also parts of Guaranteed Safe AI involving world models and program synthesis.

 

4. Make AI solve it

(Figuring out how models might help with figuring it out.)

 

Scalable oversight

(Figuring out how to get AI to help humans supervise models.)

 

OpenAI Automated Alignment Research (formerly Superalignment)

 

Weak-to-strong generalization

 

Supervising AIs improving AIs

 

Cyborgism

 

Transluce

 

Task decomp

Recursive reward modelling is supposedly not dead but instead one of the tools OpenAI will build. Another line tries to make something honest out of chain of thought / tree of thought.
 

Adversarial 

DeepMind Scalable Alignment

 

Anthropic: Bowman/Perez

 

Latent adversarial training

 

See also FAR (below). See also obfuscated activations.

 

5. Theory 

(Figuring out what we need to figure out, and then doing that.)
 

The Learning-Theoretic Agenda 

 

Question-answer counterfactual intervals (QACI)

 

Understanding agency 

(Figuring out ‘what even is an agent’ and how it might be linked to causality.)

 

Causal Incentives 

 

Hierarchical agency

 

(Descendants of) shard theory

 

Dovetail research

 

Boundaries / membranes

 

Understanding optimisation

 

Corrigibility

(Figuring out how we get superintelligent agents to keep listening to us. Arguably, scalable oversight methods are ~atheoretical approaches to this.)


Behavior alignment theory 

 

Ontology Identification 

(Figuring out how AI agents think about the world and how to get superintelligent agents to tell us what they know. Much of interpretability is incidentally aiming at this. See also latent knowledge.)

 

Natural abstractions 

 

 

ARC Theory: Formalizing heuristic arguments

 

Understand cooperation

(Figuring out how inter-AI and AI/human game theory should or would work.)

 

Pluralistic alignment / collective intelligence

 

Center on Long-Term Risk (CLR) 

 

FOCAL 

 

Alternatives to utility theory in alignment

See also: Chris Leong's Wisdom Explosion

6. Miscellaneous 

(those hard to classify, or those making lots of bets rather than following one agenda)

 

AE Studio

 

Anthropic Alignment Capabilities / Alignment Science / Assurance / Trust & Safety / RSP Evaluations 

 

Apart Research

 

Apollo

 

Cavendish Labs

 

Center for AI Safety (CAIS)

 

CHAI

 

DeepMind Alignment Team

 

Elicit (ex-Ought)

FAR 

 

Krueger Lab Mila

 

MIRI

 

NSF SLES

 

OpenAI Superalignment Safety Systems

OpenAI Alignment Science

OpenAI Safety and Security Committee

OpenAI AGI Readiness Mission Alignment

 

Palisade Research

 

Tegmark Group / IAIFI

 

UK AI Safety Institute

 

US AI Safety Institute

 

Agendas without public outputs this year

Graveyard (known to be inactive)

 

Method

We again omit technical governance, AI policy, and activism. This is even more of an omission than it was last year, so see other reviews.

We started with last year’s list and moved any agendas without public outputs this year into their own section. We also listed agendas known to be inactive in the Graveyard.

An agenda is an odd unit; it can be larger than one team, and researchers and agendas are often in a many-to-many relation. It also excludes illegible or exploratory research – anything which doesn’t have a manifesto.

All organisations have private info, and in all cases we’re working off public info. So remember that we will be systematically off by some measure.

We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3.

Because they are largely outside the scope of this review, subproblem 6 (Pivotal processes likely require incomprehensibly complex plans) does not appear at all, and the following appear only sparsely and with large error bars for accuracy:

We added some new agendas, including by scraping relevant papers from arXiv and ML conferences. We scraped every Alignment Forum post and reviewed the top 100 posts by karma and novelty. The inclusion criterion is vibes: whether it seems relevant to us.
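As a rough illustration of the shortlisting step, here is a minimal Python sketch of picking the top posts from a scraped set. Everything in it is an assumption made for illustration: the `Post` fields, the `novelty` score (imagined here as a reviewer-assigned number between 0 and 1), and the weighted blend of karma and novelty. We do not describe an actual formula above, and the final inclusion call was a human judgment rather than a score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    # Hypothetical record for one scraped post.
    title: str
    karma: int
    novelty: float  # assumed reviewer-assigned score in [0, 1]


def shortlist(posts, n=100, novelty_weight=0.5):
    """Return the top-n posts by a blend of normalised karma and novelty.

    The blend is an illustrative assumption, not the review's method.
    """
    if not posts:
        return []
    # Normalise karma by the largest value seen; guard against zero or negatives.
    max_karma = max(1, max(p.karma for p in posts))

    def score(p):
        karma_part = (1 - novelty_weight) * (p.karma / max_karma)
        return karma_part + novelty_weight * p.novelty

    return sorted(posts, key=score, reverse=True)[:n]


if __name__ == "__main__":
    demo = [
        Post("a", karma=100, novelty=0.1),
        Post("b", karma=50, novelty=0.9),
        Post("c", karma=10, novelty=0.2),
    ]
    for post in shortlist(demo, n=2):
        print(post.title)
```

Setting `novelty_weight=0` reduces this to a plain karma ranking; a shortlist like this would only be the input to the manual relevance check.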

We dropped the operational criteria this year because we made our point last year and it’s clutter.

Lastly, we asked some reviewers to comment on the draft.

 

Other reviews and taxonomies

 

Acknowledgments

Thanks to Vanessa Kosoy, Nora Ammann, Erik Jenner, Justin Shovelain, Gabriel Alfour, Raymond Douglas, Walter Laurito, Shoshannah Tekofsky, Jan Hendrik Kirchner, Dmitry Vaintrob, Leon Lang, Tushita Jha, Leonard Bereska, and Mateusz Bagiński for comments. Thanks to Joe O’Brien for sharing their taxonomy. Thanks to our Manifund donors and to OpenPhil for top-up funding.

 

  1. ^

     Vanessa Kosoy notes: ‘IMHO this is a very myopic view. I don't believe plain foundation models will be transformative, and even in the world in which they will be transformative, it will be due to implicitly doing RL "under the hood".’

  2. ^

     Also, actually, Christiano’s original post is about the alignment of prosaic AGI, not the prosaic alignment of AGI.

  3. ^

     This is fine as a standalone description, but in practice lots of interp work is aimed at interventions for alignment or control. This is one reason why there’s no overarching “Alignment” category in our taxonomy.

  4. ^

     Often less strict than formal verification but "directionally approaching it": probabilistic checking.

  5. ^

     Nora Ammann notes: “I typically don’t cash this out into preferences over future states, but what parts of the statespace we define as safe / unsafe. In SgAI, the formal model is a declarative model, not a model that you have to run forward. We also might want to be more conservative than specifying preferences and instead "just" specify unsafe states -- i.e. not ambitious intent alignment.”

  6. ^

    Satron adds that this is lacking in concrete criticism and that more expansion on object-level problems would be useful.

  7. ^

     Scheming can be a problem before this point, obviously. Could just be too expensive to catch AIs who aren't smart enough to fool human experts.

  8. ^

    Indirectly; Kosoy is funded for her work on Guaranteed Safe AI.

  9. ^

     Called “average-case” in Ngo’s post.

  10. ^

    Our addition.