Mechanistic unlearning: a new artificial intelligence method that uses mechanistic interpretability to locate and edit specific model components associated with fact retrieval mechanisms
Large language models (LLMs) sometimes learn things that we don't want them to know. It is important to ...