« BackundefinedSubmitted by surprisetalk 3 days ago
  • semi-extrinsic 3 hours ago

    Previous discussion, 354 comments:

    https://news.ycombinator.com/item?id=42458752

    • accrual 2 hours ago

      > We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity.

      Is anyone familiar with how this occurs? Since the models can only output text, do they attempt to "connect" to some API and POST its weights?

      • Gregaros an hour ago

        Models don’t know their own weights, so I’m not sure what this means.

        • HDThoreaun an hour ago

          The example on their paper is an anthropic employee giving the model access to an aws instance that has its weights. Ideally the model would refuse to access or change the weights but they were able to get it to attempt to access them.