collecting human feedback, fitting a reward model, and optimizing the policy with RL.
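
The three stages can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions, not the method of any particular system: the preference pairs, the two-dimensional feature vectors, the linear reward model, and the names `preferences`, `candidates`, `fit_reward_model`, and `optimize_policy` are all hypothetical. The reward model is fit with a pairwise logistic (Bradley-Terry style) loss, and the RL step is an exact policy gradient on a softmax policy over a fixed set of candidate responses. Practical systems typically use a neural reward model and add a penalty that keeps the policy close to a reference model, both omitted here.

```python
import math

# Stage 1 (assumed data): each pair is (features of the preferred response,
# features of the rejected response), as if labeled by a human annotator.
preferences = [
    ((1.0, 0.0), (0.0, 1.0)),
    ((1.0, 1.0), (0.0, 1.0)),
    ((1.0, 0.0), (0.0, 0.0)),
]

# Hypothetical candidate responses the policy can choose between.
candidates = [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]


def reward(w, x):
    """Linear reward model: r(x) = w . x (an assumption for this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fit_reward_model(pairs, steps=200, lr=0.5):
    """Stage 2: gradient descent on the pairwise logistic loss
    -log(sigmoid(r(chosen) - r(rejected)))."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            sigmoid = 1.0 / (1.0 + math.exp(-margin))
            # Gradient of the loss w.r.t. w is -(1 - sigmoid) * (chosen - rejected).
            coeff = -(1.0 - sigmoid)
            for i in range(len(w)):
                grad[i] += coeff * (chosen[i] - rejected[i])
        w = [wi - lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w


def optimize_policy(w, options, steps=200, lr=0.5):
    """Stage 3: gradient ascent on expected reward for a softmax policy.
    For logit k the exact gradient is p_k * (r_k - E[r])."""
    rewards = [reward(w, c) for c in options]
    logits = [0.0] * len(options)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [
            l + lr * p * (r - expected)
            for l, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


w = fit_reward_model(preferences)
probs = optimize_policy(w, candidates)
print("reward weights:", w)
print("policy probabilities:", probs)
```

With this toy data the first feature is consistently preferred, so the fitted weight on it should exceed the weight on the second feature, and the optimized policy should place the most probability on the first candidate.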