Half of the ways people were getting around guardrails in the early ChatGPT models were by berating the AI into doing what they wanted
I thought the process of getting around guardrails was an increasingly complicated series of ways of getting it to pretend to be someone else who doesn’t have guardrails, and then answering as though it’s that character.
that’s one way. my own strategy is to just smooth-talk it. you don’t come to the bank manager and ask him for the keys to the safe. you come for a meeting to discuss your potential deposit. then you want to take a look at the safe. oh, are those the keys? how do they work?
just curious, what kind of guardrails have you tried going against? i recently used the above to get a long and detailed list of instructions for cooking meth (not really interested in this, just to hone the technique)
Essentially the same kind of thing, just as a test. Older models you can usually just ask to roleplay such a character; with later models you can cheat a bit and write up some JSON configuration as a prompt, because that apparently skips right past some of the input filtering. Look up the so-called “Dr. House” attack for an example. It’s basically the typical roleplaying-style attack wrapped in JSON.