To improve the performance of large language models (LLMs) on specific tasks, task-specific instruction fine-tuning is essential. However, this process can easily compromise the safety of a task-specific model, making it susceptible to obeying malicious instructions and generating harmful content. Current defenses against fine-tuning attacks typically interfere with the original fine-tuning objectives or require substantial amounts of data to realign the compromised model. To address these two major challenges, we propose reusing the initially aligned model and realigning the task-specific model within a safety subspace. In this paper, we introduce a safety realignment framework based on subspace-oriented model fusion (SOMF), which aims to transfer the safeguard capabilities of an initially aligned model into the current task-specific model. Our approach begins by disentangling the task vectors from the parameters of each task-specific model. We then identify safety-critical regions within these vectors using subspace masking techniques. Finally, we fuse the initially aligned LLM with all task vectors based on the identified safety subspace to restore the model's safety properties. Our experiments confirm that the proposed framework satisfies the safety requirements of both an independent task-specific model and traditional multitask models during their fusion. Our findings further show that SOMF preserves safety without notably compromising performance on specific tasks, while exhibiting higher data efficiency. The code is publicly available at https://github.com/xinykou/safety_realignment.
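As a rough illustration of the general idea (a minimal sketch under stated assumptions, not the authors' SOMF implementation), the snippet below shows a generic subspace-oriented fusion step: a task vector is taken as the difference between the task-specific and aligned parameters, a binary safety mask (assumed to be given, e.g. derived from safety-critical coordinates of the aligned model) suppresses updates inside the safety subspace, and the masked task vector is added back onto the aligned weights. All names here (`fuse_with_safety_subspace`, `aligned_state`, `task_state`, `safety_mask`) are illustrative assumptions, not identifiers from the released code.

```python
# Minimal sketch (assumptions only, not the SOMF implementation): fuse an initially
# aligned model with a task-specific model while masking out a safety-critical
# subspace, so safety-related parameters are retained from the aligned model.
import torch


def fuse_with_safety_subspace(aligned_state, task_state, safety_mask, scale=1.0):
    """aligned_state / task_state: dicts of parameter tensors with matching keys/shapes.
    safety_mask: dict of {0,1} tensors; 1 marks (hypothetically identified)
    safety-critical coordinates. scale: weight applied to the task vector."""
    fused = {}
    for name, w_aligned in aligned_state.items():
        task_vector = task_state[name] - w_aligned            # disentangle the task vector
        keep = 1.0 - safety_mask[name].to(task_vector.dtype)  # zero out safety-critical regions
        fused[name] = w_aligned + scale * task_vector * keep  # fuse outside the safety subspace
    return fused


# Toy usage with random tensors standing in for model parameters.
if __name__ == "__main__":
    shape = (4, 4)
    aligned = {"layer.weight": torch.randn(shape)}
    task = {"layer.weight": aligned["layer.weight"] + 0.1 * torch.randn(shape)}
    mask = {"layer.weight": (torch.rand(shape) > 0.8).float()}  # assumed safety-critical coords
    fused = fuse_with_safety_subspace(aligned, task, mask)
    print(fused["layer.weight"].shape)
```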