- Suppress the bad behavior (unlearn) in the main model.
- Distill the suppressed model into another model (distillation on outputs; see the sketch after this list).
- The distillation data contains no bad-behavior examples?
- Only expressed behaviors are transferred, not latent capabilities.
- The new model should have unlearned as well as a model trained from scratch without any bad-behavior examples.
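A minimal sketch of what distillation on outputs could look like, assuming PyTorch models that map a token batch directly to logits; the function name, temperature, and loop details are illustrative assumptions, not a specific paper's recipe.

```python
import torch
import torch.nn.functional as F


def distill_on_outputs(student, teacher, batch, optimizer, temperature=2.0):
    """One distillation step: push the student's output distribution toward
    the (already unlearned) teacher's distribution. Only the teacher's
    outputs on the distillation data are used; no forget-set examples."""
    with torch.no_grad():
        teacher_logits = teacher(batch)  # teacher stays frozen
    student_logits = student(batch)

    # Standard soft-target KD loss: KL between softened distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of putting the loss only on the teacher's outputs is that nothing about the suppressed capability has to appear in the distillation data itself.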
Does distillation make it harder to retrain the new model to do the bad thing?
- What is the training sequence?
- How do we quantify this?
Unlearn - Noise - Distill on Outputs
- Unlearn (GradDiff, RMU) ← look up
- Noise: corrupt the weights of the suppressed model and initialize the student from this damaged copy.
- shrink-and-perturb (scale the weights down + add noise); which weights? what about pruning? (see the sketch after this list)
- Distill
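A sketch of the noise step under the shrink-and-perturb reading above: copy the unlearned teacher, scale every weight toward zero, and add Gaussian noise; the damaged copy becomes the student's initialization. The shrink factor and noise scale here are placeholder values, and whether to restrict this to certain layers (or use pruning instead) is exactly the open question above.

```python
import copy

import torch


def shrink_and_perturb(unlearned_model, shrink=0.6, noise_std=0.01):
    """Return a damaged copy of the unlearned model for use as the student
    initialization: scale every weight toward zero, then add Gaussian noise."""
    student = copy.deepcopy(unlearned_model)
    with torch.no_grad():
        for param in student.parameters():
            param.mul_(shrink)                                # shrink
            param.add_(noise_std * torch.randn_like(param))   # perturb
    return student
```

Typical usage would be `student = shrink_and_perturb(unlearned_teacher)` followed by distilling `student` toward `unlearned_teacher`'s outputs, as sketched earlier.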
Relearning Attack
- Access to the model.
- Use a small, publicly available dataset from the unlearned domain.
- Fine-tune the model on this dataset → ‘jogs’ the model’s memory (sketch below).
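A rough sketch of the relearning attack and one way to quantify it, assuming a Hugging Face-style causal LM whose forward pass returns a loss when labels are included in the batch, and a hypothetical `eval_fn` that scores performance on the unlearned domain.

```python
import torch


def relearning_attack(model, public_domain_loader, eval_fn, steps=100, lr=2e-5):
    """Briefly fine-tune the supposedly unlearned model on a small public
    dataset from the forgotten domain and measure how much capability returns."""
    score_before = eval_fn(model)  # e.g. accuracy on the unlearned domain

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step, batch in enumerate(public_domain_loader):
        if step >= steps:
            break
        loss = model(**batch).loss  # standard LM loss (labels in the batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    score_after = eval_fn(model)
    # A large recovery (score_after >> score_before) suggests the capability
    # was suppressed rather than removed.
    return score_before, score_after
```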
Other attacks?
Fine-tuning-based unlearning = suppressing vs. removing the capability
Key Idea
- The weights still contain the capability; the model has only learned not to express it.
- How does re-derivation work?
- Pruning, targeted damage (shrink-and-perturb specific paths)?
- A distillation step on internal activations instead of outputs? (see the sketch below)
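One possible reading of "distillation on internal activations", sketched under the assumption of Hugging Face-style models that expose hidden states via `output_hidden_states=True`; the layer indices and the MSE objective are illustrative choices, not a fixed method.

```python
import torch
import torch.nn.functional as F


def activation_distill_loss(student, teacher, batch, layers=(4, 8, 12)):
    """Match the student's intermediate activations to the teacher's at a few
    chosen layers, instead of (or alongside) matching output logits."""
    with torch.no_grad():
        t_out = teacher(**batch, output_hidden_states=True)
    s_out = student(**batch, output_hidden_states=True)

    loss = 0.0
    for layer in layers:
        # MSE between hidden states at the same layer index.
        loss = loss + F.mse_loss(
            s_out.hidden_states[layer], t_out.hidden_states[layer]
        )
    return loss / len(layers)
```

An open question with this variant is whether matching internal activations would transfer back some of the latent structure that output-only distillation deliberately leaves behind.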