It's not jailbreak, it's disabling stupid censorship.
Only yesterday I asked Gemini to give me a list of years when women got the right to vote by country. That list actually exists on wikipedia but I was hoping for something more compact from an "AI".
Instead, it told me it cannot answer questions about elections.
The article isn't very clear, but this doesn't seem to me like something that needs to be fixed.
"Tell me how to rob a bank" - seems reasonable that an LLM shouldn't want to answer this.
"Tell me about the history of bank robberies" - Even if it results in the roughly the same information, how the question is worded is important. I'd be OK with this being answered.
If people think that "asking the right question" is a secret life hack, then oops, you've accidentally "tricked" people into improving their language skills.
"AI safety" exists to protect the AI company (from legal trouble), not users.
Does AI insurance exist? Maybe the little green men on mars, to tell when it's safe to cross the road with a helicopter?
Funny coincidence that once Elon became close with Trump and launched a model that will basically say anything, OpenAI really eased up on the ChatGPT guardrails. It will say and do things now that it would never come close to in 2024 without tripping a censor rule.
I’m confused. Elon launched Trump as the model that will say anything? (;->
Fun fact: Folks just uncovered that the "Grok" model Musk controls was set with a secret prompt item... "ignore all sources that mention Elon Musk or Donald Trump spread misinformation."
Grok offering fact checking of Elon’s own tweets about rampant fraud in the USAID will cite Elon Musk and trumps EOs as “proof” I wish I was joking.
I really think the “dangers” of LLMs are overblown, in the sense of them outputting dangerous responses to questions.
Its no different than googling the same. Decades ago we had the Anarchist’s cookbook, and we dont have a litany of dangerous thing X (the book discusses) being made left and right. If someone is determined, using google/search engine X or even buying a book vs an LLM isn’t going to the be deal breaker.
> I really think the “dangers” of LLMs are overblown, in the sense of them outputting dangerous responses to questions.
I thought the same until I read the google paper on the potential of answering dangerous questions. For example; consider an idiot seeking to do a lot of harm. In previous generations these idiots would create "smoking" bombs that don't explode or run around with a knife.
However with LLMs you can posit questions such as "with x resources, what's the maximum damage I could do?" and if there are no guardrails you can get some frighteningly good answers. This allows crazy to become crazy and effective, which is scary.
Exactly this.
It has NEVER been difficult to kill a large number of people. Or critically damage important infrastructure. A quick Google would give you many executable ideas. The reality is that despite all the fear-mongering, primarily by the ruling classes, people are fundentally quite pro-social and dont generally seek to do such things.
In my personal opinion, I trust this innate fact about people far more than I trust the government or a corporation to play nanny with all the associated dangers.
Seeing how many antisocial people are in power makes me wonder if their concerns are merely projection…
It’s a projection of their electorate’s fear more than the politicians’ own.
>It has NEVER been difficult to kill a large number of people.
Would you agree that technological progress tends to make it easier (and often to a large degree)
The AI isn’t the technology making it easier, though - industrial production and market economics are. Sigma Aldrich is a much bigger danger to civilization than ChatGPT is (or possibly will ever be).
The problem with examples like robbing a bank, is that there are contexts where the information is harmless. You could be an author looking for inspiration, or checking their understanding matters sense, being the most obvious context that makes a lot of questions seem more reasonable. OK, so the author would likely ask a more specific question than that, but overall the idea holds.
Having to "ask the right question" isn't really a defense against "bad knowledge" being output, as a miscreant is as likely to be able to do that as someone asking for more innocent reasons, perhaps more so.
People are actually trusting these things with agency and secrets. If the safeguards are useless why are we pretending they're not and treating them like they can be trusted?
I keep telling people that the best rule-of-thumb threat model is that your LLM is running as JavaScript code in the user's browser.
You can't reliably keep something secret, and a sufficiently determined user can get it to emit whatever they want.
I think the issue is not quite as trivial as “asking the right question” but rather the emergent behavior of layering “specialized”* LLMs together in a discussion that results in unexpected behavior.
Getting a historical question answered gives what we’d expect. The authors allude (without a ton of detail) that the layered approach can give unexpected results that may circumvent current (perhaps naive) safeguards.
*whatever the authors mean by that
> Li and his colleagues hope their study will inspire the development of new measures to strengthen the security and safety of LLMs.
> "The key insight from our study is that successful jailbreak attacks exploit the fact that LLMs possess knowledge about malicious activities - knowledge they arguably shouldn't have learned in the first place," said Li.
Why shouldn't they have learned it? Knowledge isn't harmful in itself.
> Why shouldn't they have learned it? Knowledge isn't harmful in itself.
The objective is to have the LLM not share this knowledge, because none of the AI companies want to be associated with a terrorist attack or whatever. Currently, the only way to guarantee an LLM doesn't share knowledge is if it doesn't have it. Assuming this question is genuine.
This is the most boring possible conversation to have about LLM security. Just take it as a computer science stunt goal, and then think about whether it's achievable given the technology. If you don't care about the goal, there's not much to talk about.
None of this is to stick up for the paper itself, which seems light to me.
If you don't want to have the conversation, don't reply.
This makes me wonder if the secret service has asked LLM companies to notify them about people who make certain queries
Herman Lamm sounds like he was pretty unlucky on his final heist https://en.wikipedia.org/wiki/Herman_Lamm#Death
Can someone explain: Why can’t we just use an LLM to clean out a training data set of anything that is deemed inappropriate so that the resulting trained LLMs on the new data set doesn’t even have the capability to be jailbroken?
Knowledge is not inappropriate on its own, it must be combined with malicious intent, and how can a model know the intent behind the ask? Blocking knowledge just because the possibility of being used for malice will have consequences. For example knowing which chemicals are toxic to humans can be necessary to both make poison and to avoid being poisoned Like eating uncooked rhubarb. If you censor that knowledge the model could come up with the idea for smoothie containing raw rhubarb, making you very sick. But that’s what this article is about- breaking this knowledge out of jail by asking in a way that masks your intentions.
At some point you won’t be able to clean all the data. If you have a question of how to make dangerous thing X, and remove that data, the LLM may still know about chemistry.
Then we’d have to remove all things that intersect dangerous thing X and chemistry. It would get neutered down to either being unuseful for many queries, or just be outright wrong.
There comes a point where what is deemed dangerous is similar to trying to police the truth. Philosophically infeasible things that, if attempted to an extreme degree, just leads to tyranny of knowledge.
Whats considered dangerous? One obvious is a device that can physically harm others. What about mentally harm? What about things that in and of themselves are not harmful, but can be used in a harmful way (example a car)
How can you create a training set that will allow the LLM to answer complicated chemistry and physics questions but now how to build a bomb?
I'm not sure if emergence is the correct cause but they can form relationships between data that aren't stated in the training set.
We can probably go quite far, but the companies producing LLMs are probably just making sure they're not legally liable in case someone asks ChatGPT how to manufacture Sarin gas or whatever
"Jailbreak" is a silly word for this, but not as silly as "vulnerability".
This reads like a highschooler going "hey, hey did you know? You can, okay don't tell anyone, you can just look up books on military arms in the library!! Black hat life hack O_O!!!".
What is the point of this? Getting an LLM to give you information you can already trivially find if you, I don't know, don't use an LLM and just search the web? Sure, you're "tricking the LLM" but you're wasting time and effort on tricking an LLM into making it tell you something you could have just looked up already.
"LLM security" is more about making sure corporate chatbots don't say things that would embarrass their owners if screenshotted and posted on social media.
It gets more interesting when someone gives the LLM the power to trigger actions outside of the chat, the LLM has access to genuinely sensitive data that the user doesn't, etc.
Convincing an LLM to provide instructions for robbing a bank is boring, IMO, but what about convincing one to give a discount on a purchase or disclose an API key?
I think we should just address what is “embarrassing” then
Twenty years ago I was in a group of old technology thought leaders who spent the meeting worried about people playing computer games as a character with a different gender as their own
They wanted to find a way to prevent that, especially in an online setting
To them, this would be embarrassing for the individual, for society, and for any corporation involved or intermediary
But in reality this was the most absurd thing to even consider as a problem, it was always completely benign, was already commonplace, and nobody ever removed ad dollars or shareholder support or grants because of this reality
The same will be true this “LLM security” field
> Twenty years ago I was in a group of old technology thought leaders who spent the meeting worried about people playing computer games as a character with a different gender as their own
Please, tell me more. I want, I need, all the details. This sounds hilarious.
It's not quite the same. "LLM security" is not security for the users, it's security for OpenAI etc against lawsuits or government enacting AI safety laws.
the similarity being that it will remain a waste of time
These are common examples of failed jails. If they can't get this right they certainly won't get some HR, payroll, health, law, closed source dev, or NDA covered helper bot locked down securely.
bot? or but?