Axiom Futures AI Safety Course, Week 4 notes, Jul 10, 2024

collecting human feedback, fitting a reward model, and optimizing the policy with RL.
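
To make the three stages concrete, here is a minimal toy sketch, not the course's own code. It assumes a setting that does not appear in the notes: three candidate responses with a hidden "true" utility, a simulated labeler who picks the better response of a pair with Bradley-Terry probability, a reward model that is just one scalar per response, and a softmax policy trained with REINFORCE. All function names (`collect_feedback`, `fit_reward_model`, `optimize_policy`) and hyperparameters are hypothetical and chosen for illustration only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def collect_feedback(true_utility, n_pairs, rng):
    """Stage 1 (assumed setup): simulate pairwise human preferences."""
    k = len(true_utility)
    data = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(k), 2)
        # The simulated labeler prefers i over j with Bradley-Terry probability.
        if rng.random() < sigmoid(true_utility[i] - true_utility[j]):
            data.append((i, j))  # stored as (winner, loser)
        else:
            data.append((j, i))
    return data


def fit_reward_model(data, k, lr=0.5, epochs=200):
    """Stage 2: gradient ascent on the Bradley-Terry log-likelihood."""
    r = [0.0] * k
    for _ in range(epochs):
        grad = [0.0] * k
        for winner, loser in data:
            p = sigmoid(r[winner] - r[loser])
            # d/dr of log sigmoid(r_winner - r_loser)
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        r = [ri + lr * g / len(data) for ri, g in zip(r, grad)]
    return r


def optimize_policy(reward, steps=500, lr=0.1, rng=None):
    """Stage 3: REINFORCE on a softmax policy, scored by the learned reward."""
    rng = rng or random.Random(0)
    k = len(reward)
    theta = [0.0] * k
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(k), weights=probs)[0]
        # Expected reward under the current policy serves as a baseline.
        baseline = sum(p * r for p, r in zip(probs, reward))
        advantage = reward[action] - baseline
        for i in range(k):
            indicator = 1.0 if i == action else 0.0
            # d/dtheta_i of log pi(action) is indicator - probs[i].
            theta[i] += lr * advantage * (indicator - probs[i])
    return softmax(theta)


if __name__ == "__main__":
    rng = random.Random(0)
    prefs = collect_feedback([0.0, 1.0, 2.0], n_pairs=2000, rng=rng)
    learned_reward = fit_reward_model(prefs, k=3)
    policy = optimize_policy(learned_reward, rng=rng)
    print("learned reward:", learned_reward)
    print("policy probabilities:", policy)
```

The sketch leaves out things a real pipeline would need, such as a neural reward model over text and a KL penalty that keeps the policy close to its starting point. It only shows how the output of each stage feeds the next.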