Previous discussion, 354 comments:
> We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity.
Is anyone familiar with how this occurs? Since the models can only output text, do they attempt to "connect" to some API and POST its weights?
Models don’t know their own weights, so I’m not sure what this means.
The example on their paper is an anthropic employee giving the model access to an aws instance that has its weights. Ideally the model would refuse to access or change the weights but they were able to get it to attempt to access them.