Watermarking Degrades Alignment in Language Models (ICLR GenAI Workshop 2025)

(huggingface.co)

1 point | by dapurv5 12 hours ago

1 comment

dapurv5 12 hours ago
We've analyzed how popular watermarking methods (KGW, Gumbel) affect language model alignment, and found critical tradeoffs that affect truthfulness, safety, and helpfulness. We propose "Alignment Resampling," a simple method to mitigate this alignment degradation, supported by theoretical insights and empirical results.
Paper: https://huggingface.co/papers/2506.04462
Feedback appreciated!