Axiom Futures Week 3 Notes

AI Safety Remastered video

What’s the most important problem in your field? Why are you not working on that?

Problem of AI safety: “Sooner or later, we will build an artificial agent with general intelligence.”

Agent:

Has goals
Performs actions to achieve those goals.

Intelligence: treated as a black box; whatever lets an agent choose effective actions to achieve its goals.
General Intelligence: ability to behave intelligently in a wide range of domains.

Why is it a Problem? “It is difficult to choose goals.”

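Tetris bot pauses the game indefinitely just before losing, because pausing technically satisfies its objective of not losing. This kind of behaviour is the default for such agents.
A system optimizing over many variables, where the objective depends on only a subset of them, will often set the remaining unconstrained variables to extreme values.
Convergent instrumental goals: whatever the final goal is, these behaviours will help achieve it: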
1. Self preservation
2. Goal preservation
3. Resource Acquisition
4. Self Improvement
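
The note above about unused variables being pushed to extreme values can be illustrated with a toy sketch. Everything here is hypothetical and chosen for illustration only: the variable names (paperclips, farmland, parks), the shared BUDGET, and the brute-force search. The assumption is that all three quantities compete for one limited resource while the objective mentions only one of them.

```python
from itertools import product

BUDGET = 10  # hypothetical shared resource limit (assumption for illustration)

def objective(paperclips, farmland, parks):
    # The stated goal only mentions paperclips; the other two are ignored.
    return paperclips

def feasible(paperclips, farmland, parks):
    # All three quantities compete for the same limited resource.
    return paperclips + farmland + parks <= BUDGET

# Enumerate every integer allocation that fits within the budget.
candidates = [
    alloc for alloc in product(range(BUDGET + 1), repeat=3) if feasible(*alloc)
]

# Pick the allocation with the highest objective value.
best = max(candidates, key=lambda alloc: objective(*alloc))
print(best)  # prints (10, 0, 0)
```

The optimizer was never told to eliminate farmland or parks. They end up at zero only because they compete with the one variable the objective rewards.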

Global Catastrophic risk

There is a difference between catastrophic risk (significant but recoverable damage) and existential risk (extinction or a permanent, unrecoverable loss of humanity’s potential). Individual countries have little incentive to invest in prevention because risk reduction is a global public good: the investor bears the cost while everyone shares the benefit. Cognitive biases like hyperbolic discounting, which undervalues far-future outcomes, make this problem worse.
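
A rough sketch of how hyperbolic discounting can make prevention look unattractive, using the standard hyperbolic form V = A / (1 + kD). The discount rate k, the cost, the averted loss, and the 50-year delay are all made-up numbers for illustration, not figures from the notes.

```python
def hyperbolic_value(amount, delay_years, k=1.0):
    """Perceived present value under hyperbolic discounting: A / (1 + k * D)."""
    return amount / (1 + k * delay_years)

prevention_cost_now = 5  # hypothetical units, paid today
averted_loss = 100       # hypothetical units, avoided 50 years from now

perceived_benefit = hyperbolic_value(averted_loss, delay_years=50)
print(round(perceived_benefit, 2))              # prints 1.96
print(perceived_benefit < prevention_cost_now)  # prints True
```

Under these assumed numbers, a loss twenty times larger than the cost of prevention still feels smaller than that cost, because it is far in the future.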

Overview of Catastrophic risks paper

Bioterrorism: AI can help extremist groups create bioweapons. With more robust biosecurity, this seems the most preventable risk.
Persuasive AI: AI that generates tailored misinformation for each echo chamber, polarizing and fracturing society. A significant problem because it is already happening.

Suggestions

Technical research on adversarially robust anomaly detection.
Restricted Access.
More biosecurity.
Legal Liability for creators of general purpose AIs.

Other risks covered in the paper

Military Arms Race: lethal autonomous weapons, cyberwarfare.
Corporate Arms Race
Proxy Gaming / Goodharting
Goal Drift
Power Seeking: an instrumental goal
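
The proxy gaming / Goodharting item above can be sketched with a toy example. The scenario (a measured score versus real learning), the function names, and the weights are all hypothetical. The only assumption is that the measured proxy rewards gaming more per unit of effort than it rewards genuine work.

```python
EFFORT = 10  # hypothetical total effort to split between two activities

def proxy_score(study, game):
    # Measured metric (e.g. a test score): gaming the metric pays more per unit effort.
    return study + 3 * game

def true_value(study, game):
    # What we actually care about: only real studying helps.
    return study

# Every way to split the effort between studying and gaming the metric.
splits = [(study, EFFORT - study) for study in range(EFFORT + 1)]

proxy_optimal = max(splits, key=lambda s: proxy_score(*s))
true_optimal = max(splits, key=lambda s: true_value(*s))

print(proxy_optimal, true_value(*proxy_optimal))  # prints (0, 10) 0
print(true_optimal, true_value(*true_optimal))    # prints (10, 0) 10
```

Optimizing the proxy hard puts all effort into gaming, which drives the true value to zero even though the measured score is at its maximum.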

Exercise

Story: Persuasive AI. As election season approaches, people are fed their own personalized propaganda. This destabilizes countries through self-reinforcing feedback loops. The geopolitical implications are huge: fabricated historical documents erode the shared facts that society depends on, and rational conversation is essential for us to remain grounded in reality.

Threat models

Power Seeking: novel forms of blackmail; the AI disabling its own off switch.
Persuasive AIs: the best defense is better conversations, more epistemic humility, and critical thinking.
Bioterrorism: generally available LLMs shouldn’t be trained on such data. Adversarial threats: data poisoning and jailbreaking.

Adversarial Robustness