- Suppress the bad behavior (unlearn) in the main model.
- Distill the suppressed model into another model (distillation on outputs; see the sketch after this list).
- The distillation data contains no bad-behavior examples?
- Only expressed behaviors are transferred, not latent capabilities.
- The new model should have unlearned as well as a model trained from scratch without any bad-behavior examples.
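A minimal sketch of what distillation on outputs could look like, assuming PyTorch models that map a token batch directly to logits; the function name, temperature, and loop details are illustrative assumptions, not a specific paper's recipe.

```python
import torch
import torch.nn.functional as F


def distill_on_outputs(student, teacher, batch, optimizer, temperature=2.0):
    """One distillation step: push the student's output distribution toward
    the (already unlearned) teacher's distribution. Only the teacher's
    outputs on the distillation data are used; no forget-set examples."""
    with torch.no_grad():
        teacher_logits = teacher(batch)  # teacher stays frozen
    student_logits = student(batch)

    # Standard soft-target KD loss: KL between softened distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of putting the loss only on the teacher's outputs is that nothing about the suppressed capability has to appear in the distillation data itself.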
Does distillation make it harder to retrain the new model to do the bad thing?
- What is the training sequence?
- How do we quantify this?
Unlearn - Noise - Distill on Outputs
- Unlearn (GradDiff, RMU) ← look up
- Noise: corrupt the weights of the suppressed model and initialize the student from this damaged copy.
- shrink-and-perturb (scale the weights down + add noise); which weights? what about pruning? (see the sketch after this list)
- Distill
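A sketch of the noise step under the shrink-and-perturb reading above: copy the unlearned teacher, scale every weight toward zero, and add Gaussian noise; the damaged copy becomes the student's initialization. The shrink factor and noise scale here are placeholder values, and whether to restrict this to certain layers (or use pruning instead) is exactly the open question above.

```python
import copy

import torch


def shrink_and_perturb(unlearned_model, shrink=0.6, noise_std=0.01):
    """Return a damaged copy of the unlearned model for use as the student
    initialization: scale every weight toward zero, then add Gaussian noise."""
    student = copy.deepcopy(unlearned_model)
    with torch.no_grad():
        for param in student.parameters():
            param.mul_(shrink)                                # shrink
            param.add_(noise_std * torch.randn_like(param))   # perturb
    return student
```

Typical usage would be `student = shrink_and_perturb(unlearned_teacher)` followed by distilling `student` toward `unlearned_teacher`'s outputs, as sketched earlier.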
Relearning Attack
- Access to the model.
- Use a small, publicly available dataset from the unlearned domain.
- Fine-tune the model on this dataset → ‘jogs’ the model’s memory (sketch below).
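A rough sketch of the relearning attack and one way to quantify it, assuming a Hugging Face-style causal LM whose forward pass returns a loss when labels are included in the batch, and a hypothetical `eval_fn` that scores performance on the unlearned domain.

```python
import torch


def relearning_attack(model, public_domain_loader, eval_fn, steps=100, lr=2e-5):
    """Briefly fine-tune the supposedly unlearned model on a small public
    dataset from the forgotten domain and measure how much capability returns."""
    score_before = eval_fn(model)  # e.g. accuracy on the unlearned domain

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step, batch in enumerate(public_domain_loader):
        if step >= steps:
            break
        loss = model(**batch).loss  # standard LM loss (labels in the batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    score_after = eval_fn(model)
    # A large recovery (score_after >> score_before) suggests the capability
    # was suppressed rather than removed.
    return score_before, score_after
```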
Other attacks?
Fine-tuning-based unlearning = suppressing vs. removing the capability
Key Idea
- The weights still contain the capability; the model has only learned not to express it.
- How does re-derivation work?
- Pruning, targeted damage (shrink-and-perturb specific paths)?
- A distillation step on internal activations instead of outputs? (see the sketch below)
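One possible reading of "distillation on internal activations", sketched under the assumption of Hugging Face-style models that expose hidden states via `output_hidden_states=True`; the layer indices and the MSE objective are illustrative choices, not a fixed method.

```python
import torch
import torch.nn.functional as F


def activation_distill_loss(student, teacher, batch, layers=(4, 8, 12)):
    """Match the student's intermediate activations to the teacher's at a few
    chosen layers, instead of (or alongside) matching output logits."""
    with torch.no_grad():
        t_out = teacher(**batch, output_hidden_states=True)
    s_out = student(**batch, output_hidden_states=True)

    loss = 0.0
    for layer in layers:
        # MSE between hidden states at the same layer index.
        loss = loss + F.mse_loss(
            s_out.hidden_states[layer], t_out.hidden_states[layer]
        )
    return loss / len(layers)
```

An open question with this variant is whether matching internal activations would transfer back some of the latent structure that output-only distillation deliberately leaves behind.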