Privacy Collapse: Benign Finetuning Can Break Contextual Privacy in Language Models

Abstract

We present a novel, surprising result concerning language models and contextual privacy: finetuning models like GPT-4o to be helpful can lead to privacy collapse. The finetuned model loses its ability to reason about contextual privacy norms in broad settings, is willing to share personal information in tool-use tasks, and is inappropriately sharing information from persistent memory in unrelated tasks. Our extensive experiments show evidence of privacy collapse in multiple fine-tuning datasets (real widely used and controlled synthetic ones), multiple models (GPT 4o, GPT 4.1, Llama-3) and multiple tasks (agentic and memory-based). We also find evidence that privacy collapse can be selectively induced with a backdoor. We show that finetuning selectively degrades privacy-relevant representations in the final model layers while task-relevant features are preserved, specific finetuning objectives like empathy and customer support induce large collapse while others like reasoning (GSM8K) do not. Importantly, privacy collapse is a "silent failure": models maintain high performance on standard safety and utility benchmarks while harbouring severe privacy vulnerabilities. Our results highlight a critical blind spot in current evaluation suites, which treat safety and capabilities as orthogonal to contextual privacy.

Type
Publication
Under review
Date
Links


Citation

If you use our code, kindly consider citing our paper:

SOON