Comments Page - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

« Back SmoothLLM: Defending Large Language Models Against Jailbreaking Attacksarxiv.orgSubmitted by amai 15 hours ago

freeone3000 10 hours ago
I find it very interesting that “aligning with human desires” somehow includes prevention of a human trying to bypass the safeguards to generate “objectionable” content (whatever that is). I think the “safeguards” are a bigger problem with aligning with my desires.
- wruza 8 hours ago
  Another question is whether that initial unalignment comes from poor filtering of datasets, or is it emergent from regular, pre-filtered cultured texts.
  In other words, was an “unaligned” LLM taught bad things from bad people, or does it simply see it naturally and point it out with the purity of a child? The latter would mean something about ourselves. Personally I think that people tend to selectively ignore things too much.
  GuB-42 2 hours ago
  We can't avoid teaching bad things to a LLM if we want it to have useful knowledge. For example, you may teach a LLM about nazis, that's expected knowledge. But then, you can prompt a LLM to be a nazi. You can teach it about how to avoid poisoning yourself, but then, you taught it how to poison people. And the smarter the model is, the better it will be at extracting bad things from good things by negation.
  There are actually training dataset full of bad thing by bad people, the intention is to use them negatively, as to teach the LLM some morality.
  ujikoluk an hour ago
  Maybe we should just avoid trying to classify things as good or bad.
- threeseed 8 hours ago
  The safeguards stems from a desire to make tools like Claude accessible to a very wide audience as use cases such as education are very important.
  And so it seems like people such as yourself who do have an issue with safeguards should seek out LLMs that are catered to adult audiences rather than trying to remove safeguards entirely.
  selfhoster11 2 hours ago
  Here is a revolutionary concept: give the users a toggle.
  Make it controllable by an IT department if logging in with an organisation-tied account, but give people a choice.
  Zambyte 6 hours ago
  How does making it harder for the user to extract information they are trying to extract make it safer for a wider audience?
  Drakim 4 hours ago
  That's like asking why we should have porn filters on school computers, after all, all it does is prevent the user from finding what they are looking for, which is bad.
  dbspin 4 hours ago
  Assuming that this question is good faith...
  There are numerous things that might be true, that may be damaging to a child's development to be exposed to. From overly punitive criticism to graphic depictions of violence, to advocacy and specific directions for self harm. Countless examples are trivial to generate.
  Similarly, the use of these tools is already having dramatic effects on spearfishing, misinformation etc. Guardrails on all the non open-source models have enormous impact on slowing / limiting the damage this has at scale. Even with retrained Llama based models, it's more difficult than you might imagine to create a truly machiavellian or uncensored LLM - which is entirely due to the work that's been doing during and post training to constrain those behaviours. This is an unalloyed good in constraining the weaponisation of LLMs.
- Zambyte 6 hours ago
  What tools do we have to defend against LLM lockdown attacks?
- ipython 8 hours ago
  We’ve seen where that ends up. https://en.m.wikipedia.org/wiki/Tay_(chatbot)
ipython 12 hours ago
It concerns me that these defensive techniques themselves often require even more llm inference calls.
Just skimmed the GitHub repo for this one and the read me mentions four additional llm inferences for each incoming request - so now we’ve 5x’ed the (already expensive) compute required to answer a query?
padolsey 11 hours ago
So basically this just adds random characters to input prompts to break jailbreaking attempts? IMHO If you can't make a single-inference solution, you may as well just run a couple of output filters, no? That appeared to have reasonable results, and if you make such filtering more domain-specific, you'll probably make it even better. Intuition says there's no "general solution" to jailbreaking, so maybe it's a lost cause and we need to build up layers of obscurity, of which smooth-llm is just one part.
- ipython 10 hours ago
  Right. This seems to be the latest in the “throw random stuff at the wall and see what sticks” series of generative ai papers.
  I don’t know if I’m too stupid to understand or if truly this is just “add random stuff to prompt” dressed up in flowery academic language.
  pxmpxm 11 minutes ago
  Not surprising - from what I can tell, machine learning has been going down this route for a decade.
  Anything involving the higher level abstractions (tensor flow / keras /whatever) is full of handwavy stuff about this or that activation function / number of layers / model architecture working the best and doing a trial error with a different component in the above if it doesn't. Closer to kids playing with legos than statistics.
mapmeld 11 hours ago
There are some authors in common with a more recent paper "Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing" https://arxiv.org/abs/2402.16192
handfuloflight 13 hours ago
Github: https://github.com/arobey1/smooth-llm