We present a novel and surprising result concerning language models and contextual privacy: finetuning models such as GPT-4o to be more helpful can lead to privacy collapse. The finetuned model loses its ability to reason about contextual privacy norms across a broad range of settings, willingly shares personal information in tool-use tasks, and inappropriately discloses information from persistent memory in unrelated tasks. Our extensive experiments show evidence of privacy collapse across multiple finetuning datasets (both widely used real datasets and controlled synthetic ones), multiple models (GPT-4o, GPT-4.1, Llama-3), and multiple tasks (agentic and memory-based). We also find evidence that privacy collapse can be selectively induced with a backdoor. We show that finetuning selectively degrades privacy-relevant representations in the final model layers while preserving task-relevant features, and that specific finetuning objectives such as empathy and customer support induce severe collapse while others, such as reasoning (GSM8K), do not. Importantly, privacy collapse is a "silent failure": models maintain high performance on standard safety and utility benchmarks while harbouring severe privacy vulnerabilities. Our results highlight a critical blind spot in current evaluation suites, which treat safety and capabilities as orthogonal to contextual privacy.
If you use our code, please consider citing our paper:
SOON