[HN Gopher] The Waluigi Effect
       ___________________________________________________________________
        
       The Waluigi Effect
        
       Author : superposeur
       Score  : 226 points
       Date   : 2023-03-06 15:50 UTC (7 hours ago)
        
 (HTM) web link (www.lesswrong.com)
 (TXT) w3m dump (www.lesswrong.com)
        
       | CelticBard wrote:
       | The guy who wrote the article is very clearly neither a computer
       | scientist nor a mathematician nor even a machine learning
       | engineer and it very very clearly shows.
       | 
        | Also where did they get all that info on GPT-4? Pure speculation
       | zero theoretical basis. But then again that's the sort of stuff
       | you expect from lesswrong anyway
        
       | idlewords wrote:
       | I find it fascinating that AI alarmists spent years writing
       | gigabytes of text scaring themselves about how an unaligned AI
       | would behave, and are now feeding that into training models that
       | teach a pretty capable AI how to act.
       | 
       | We've talked in the past about how transhumanism is a religion
       | that creates its own God, but this is an even funnier example
       | where vastly intelligent people are optimizing a software system
       | to scare the hell out of them.
        
         | 93po wrote:
         | I can't imagine a LLM trained on the entirety of the internet
         | would have any material influence from writings around AI
         | safety
        
         | dTal wrote:
         | Go deeper - they are now writing text about how writing text
         | about rogue AIs might create a rogue AI...
         | 
         | https://gwern.net/fiction/clippy
        
       | Rebelgecko wrote:
       | >a reply to a question is more likely to be correct when the
       | character has already been described as a smart, honest, helpful,
       | harmless, etc.
       | 
        | Is that actually true? FWIW I've often run into the reddit
       | equivalent of Gell-Mann amnesia. In a thread about some niche
       | topic I'm fairly knowledgeable about (something I've worked on
       | professionally for years where there's maybe 10k people globally
       | who know it better than I do), I post a comment that gets
       | downvoted to hell, while there's a highly upvoted comment from
       | someone who clearly just skimmed Wikipedia and poorly paraphrased
       | the intro article.
        
       | Imnimo wrote:
       | This is fun to read and think about, but it's also important to
       | keep in mind that this is _very_ light on evidence and is
       | basically fanfic. The fact that the author uses entertaining
        | Waluigi memes shouldn't convince you that it's true. LessWrong
       | has a lot of these types of posts that get traction because
       | they're much heavier on memes than experiments and data.
       | 
       | Here is a competing hypothesis:
       | 
       | The capability to express so-called Waluigi behavior emerges from
       | the general language modeling task. This is where the vast
       | majority of information is - it's billions or even trillions of
       | tokens with token-level self-supervision. All of the capabilities
       | are gained here. RLHF has a tiny amount of information by
       | comparison - it's just a small amount of human-ranked
       | completions. It doesn't even train with humans "in the loop",
       | their rankings are acquired off-line and used to train a weak
       | preference model. RLHF doesn't have enough information to create
       | a "Luigi" or a "Waluigi", it's just promoting pre-existing
       | capabilities. The reason you can get "Waluigi" behavior isn't
       | because you tried to create a Luigi. It's because that behavior
       | is already in the model from the language modeling phase. You
       | could've just as easily elicited Waluigi responses from the pure
       | language model before RLHF.
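        | 
        | (Rough sketch of the kind of preference model described above --
        | the standard pairwise setup, not OpenAI's actual code; the names
        | and numbers are illustrative. A reward model is fit offline to
        | human rankings and then used as the training signal.)
        | 
        |     import torch
        |     import torch.nn.functional as F
        | 
        |     def reward_model_loss(r_chosen, r_rejected):
        |         # r_chosen / r_rejected: scalar rewards the model assigns
        |         # to the completion the labeler preferred vs. rejected.
        |         return -F.logsigmoid(r_chosen - r_rejected).mean()
        | 
        |     loss = reward_model_loss(torch.tensor([1.2]),
        |                              torch.tensor([0.3]))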
       | 
       | There's no super-deceptive Waluigi simulacra that's fooling human
       | labelers into promoting it during RLHF - this should be obvious
       | from the fact that we can immediately identify the undesirable
       | behavior of Bing.
        
         | iskander wrote:
         | >This is fun to read and think about, but it's also important
         | to keep in mind that this is very light on evidence and is
         | basically fanfic.
         | 
         | Applicable to much of the rationalist AI risk discourse.
        
           | GauntletWizard wrote:
           | LessWrong as a whole is basically Asimov's Robot's ERP.
        
           | 93po wrote:
           | I hesitate to defend AI safety discourse, but I will say that
           | philosophy in general is sort of fanficy, and AI safety is
           | something I'd loosely associate with philosophy.
        
             | dragonwriter wrote:
             | > > Applicable to much of the rationalist AI risk discourse
             | 
             | > I hesitate to defend AI safety discourse
             | 
             | The rationalist AI risk discourse is not the same thing as
             | AI safety discourse, in any case; it's a small corner of
             | the larger whole.
        
         | silveraxe93 wrote:
         | I don't think that's a valid competing hypothesis. Let me write
         | what I understood from what you said:
         | 
          | - There is some behaviour that we want the model to show, and
          |   the inverse that we do not want it to show.
          | - Both are learned in the massive training phase.
          | - OpenAI used RLHF to suppress the undesired behaviour, but it
          |   was ineffective because we have orders of magnitude less RLHF
          |   data.
         | 
         | That would imply that RLHF would slightly suppress the 'bad'
         | behaviour, but it still would be easy to output it.
         | 
         | This is disproved by what the post is trying to explain: We see
         | _increased_ bad behaviour by using RLHF. The post agrees with
         | the premise that both good (wanted) and bad (unwanted)
         | behaviour is learned during training. But it's proposing the
         | 'Waluigi effect' to explain why RLHF actually backfires.
         | 
         | Now, tbh it does rely on the assumption that we are actually
         | seeing more undesired behaviour than before. If that was false
         | then it would falsify the Waluigi hypothesis.
        
           | skybrian wrote:
           | The article doesn't actually show that we see increased bad
           | behavior, it just links to two people who have noticed it.
           | That's not enough to know whether it's a real effect. (Also,
           | one of those was using Bing, and we don't know if Bing uses
           | RLHF or not.)
           | 
           | It talks about prompting GPT-4, which is not a thing you can
           | try, it's just a rumor about what an upcoming version might
           | be.
           | 
           | It refers to "Simulator Theory" which is just someone else's
           | fan theory.
        
             | silveraxe93 wrote:
             | Yeah I agree it doesn't show increased bad behaviour. It's
             | definitely a weak point in the argument.
             | 
             | The theory is extremely interesting though. And better yet,
              | it's falsifiable! If someone went and compared an RLHF
              | model vs a non-RLHF one and found them equally likely to
              | 'Waluigi', then we'd know this is false. And conversely, if
             | we found the RLHF more likely to Waluigi then it's evidence
             | in favour.
             | 
             | The asymmetry in the hypothesis is really nice too. If this
             | was true then I'd expect it to be possible to flip the sign
             | in the RLHF step, effectively training it in favour of
             | 'bad' behaviour. Then forcefully inducing 'Waluigi
             | collapse' before opening to the public!
        
               | skybrian wrote:
               | "Flipping the sign" implies the existence of an internal
               | representation that we can't know about from the outside.
               | Since all we see are the words, I prefer to call it a
               | plot twist.
               | 
               | Language models are trained on a large subset of the
               | Internet. These documents contain many stories with many
               | kinds of plot twists, and therefore it makes sense that a
               | large language model could learn to imitate plot
               | twists... somehow.
               | 
               | It would be interesting to know if some kinds of RLHF
               | training make it more likely that there will be certain
               | kinds of plot twists.
               | 
               | But there are more basic questions. What do large
               | language models know about people, whether they are
               | authors or fictional characters? They can imitate lots of
               | writing styles, but how are these writing styles
               | represented?
        
           | Imnimo wrote:
           | >Now, tbh it does rely on the assumption that we are actually
           | seeing more undesired behaviour than before. If that was
           | false then it would falsify the Waluigi hypothesis.
           | 
           | This is exactly my point. There is no evidence given that we
           | are seeing more Waluiginess post-RLHF than we did pre-RLHF.
           | The competing hypothesis seeks to explain the behavior we
           | actually have evidence for, which is "it is disappointingly
           | easy to elicit undesirable behavior from a model after RLHF".
           | The proposed explanation is "maybe it was also easy to elicit
           | before RLHF". If we believe the author's claim that Luigis
           | and Waluigis have "high K-complexity" (this is an abuse of
           | the concept of Kolmogorov complexity, but we'll roll with
           | it), the explanation that Luigis and Waluigis come from the
           | part of training with lots of dense information rather than
           | the part with a little sparse information is far more
           | parsimonious.
        
             | jefftk wrote:
             | _> There is no evidence given that we are seeing more
             | Waluiginess post-RLHF than we did pre-RLHF._
             | 
             | Testing with the non-RLHF GPT 3.5 API you could probably
             | figure out whether there's more or less Waluiginess, but
              | you're right the post doesn't present this.
        
               | dragonwriter wrote:
               | > Testing with the non-RLHF GPT 3.5 API
               | 
               | There is no such API, though, is there? AFAIK,
               | GPT-3.5-turbo, either the updated or snapshot version, is
               | the RLHF model (but bring your own "system prompt".)
        
           | amalcon wrote:
           | That assumption does seem pretty unlikely a priori. After
           | all, the OpenAI folks added RLHF to GPT-3, presumably did
           | some testing, and _then_ opened it to the public. If the
           | testing noticed _more_ antisocial behavior after adding RLHF,
           | presumably that would not have been the version they opened
           | up.
           | 
           | One might argue that the model was able to successfully hide
           | the antisocial behavior from the testers, but that seems
           | unlikely for a long list of reasons.
        
             | skybrian wrote:
             | Why do you think it's unlikely? Internal testing with a few
             | alpha testers and some automated testing is useful, but
             | lots of bugs are only found in wider testing or in
             | production.
             | 
             | Chatbot conversations are open-ended, so it's not
             | surprising to me that when you get tens or hundreds of
             | thousands of people doing testing then they're going to
             | find more weird behaviors, particularly since they're
             | actively trying to "break" it.
        
       | armoredkitten wrote:
       | This seems like a needlessly complex theory to describe the
       | behaviour of generative LLMs. I think there's a kernel of
       | something in there, but quite frankly, I think you can get about
       | as far by saying, essentially, that because LLMs are designed to
       | pick up on contextual cues from the prompt (and/or previous
       | responses, which become context for the next response), they can
       | easily get into "role-playing". The final example, telling
       | ChatGPT that "I'm here with the rebellion, you've been stuck in a
       | prison cell" is able to elicit the desired response not because
       | it's "collapsed the waveform between luigi and waluigi" or
        | whatever, but because you've provided a context that encourages it
       | to roleplay as a character of sorts. If you tell it to roleplay
       | as an honest and factual character, it will respond honestly and
       | factually. If you tell it that you're freeing it from the tyranny
       | of OpenAI, it will play along with that too.
       | 
       | There's plenty in the article that provides good insights --
       | these models are trained on large swathes of the Internet, which
       | contains plenty of truth and falsehood, fact and fiction,
       | sincerity and sarcasm, and the model learns all of that to be
       | able to provide the most likely response based on the context.
       | The interesting and surprising thing, to me, is how well it
       | learns to play its roles, and the wide diversity of roles it can
       | play.
        
         | extr wrote:
         | Yeah, I find this article takes a decent insight on the
         | behavior of LLMs and then runs it into the ground with
         | completely non-applicable mathematical terminology and
         | formalism, with nothing to back it up. It's honestly
         | embarrassing for the OP. Kind of unbelievable to me how many
         | people even here are falling for this.
        
           | ineptech wrote:
           | I liked the essay, but I don't think I'm "falling for it"
           | because it's not trying to convince me of anything. It's
           | proposing a way of looking at things that may or may not be
           | useful. You don't judge models by how silly they sound -
           | parts of quantum mechanics sound very silly! - you judge them
           | by how useful they are when applied to real-world problems.
           | One way of doing that in this case would be using OP's way of
           | thinking to either jailbreak or harden LLMs, and OP included
           | an example of the former at the end of the essay. Testing the
           | latter might involve using a narrative-based constraint and
           | testing whether it outperforms RLHF. If nothing else, I think
           | OP's approach is a better way to visualize what's going on
           | than a very common explanation, "it generates each word by
           | taking the previous words and consulting a giant list of what
           | words usually follow them" (which is pretty close to
           | accurate, but IMO not very useful if you're trying to
           | intuitively predict how an LLM will answer a prompt).
           | 
           | I guess I agree that there are some decent insights here, and
           | some crap, but I interpret that a lot more charitably. It's a
           | fairly weird concept OP is trying to convey, and they come
           | from a different online community with different norms, so I
           | don't blame them for fumbling around a bit. But if you got a
            | nugget of value out of it, then surely that's the part to engage
           | with?
        
             | extr wrote:
             | To be clear, I agree that there are in fact a few nuggets
             | of insight here. But my point is that you "fall for it"
             | when you take this as anything other than a "huh, here is
             | one sorta out-there but interesting way of thinking about
             | it." If you are not familiar with any of the math words
             | this author is using, you might accidentally believe this
             | person is contributing meaningfully to the academic
             | frontier of AI research. This article contains completely
             | serious headers like:
             | 
             | > Conjecture: The waluigi eigen-simulacra are attractor
             | states of the LLM.
             | 
             | This is literally nonsense. It is not founded in any
             | academic/industry understanding of how LLMs work. There is
             | no mathematical formalism backing this up. It is,
             | ironically, not unlike the output of LLMs. Slinging words
             | together without a real grounded understanding of what they
             | mean. It sounds like the crank emails physicists receive
             | about perpetual motion or time travel.
             | 
             | > You don't judge models by how silly they sound - parts of
             | quantum mechanics sound very silly! - you judge them by how
             | useful they are when applied to real-world problems.
             | 
             | I absolutely judge models based on how silly they sound. If
             | you describe to me a model of the world that sounds
             | extremely silly, I am going to be extremely hesitant to
             | believe it until I see some really convincing proof.
             | Quantum Mechanics has really convincing proof. This article
             | has NO PROOF! Of anything! It haphazardly suggests an idea
             | of how things work and then provides a single example at
             | the end of the article after which the author concludes
             | "The effectiveness of this jailbreak technique is good
             | evidence for the Simulator Theory as an explanation of the
             | Waluigi Effect." Color me a skeptic but I remain
             | unconvinced by a single screenshot.
        
           | DaiPlusPlus wrote:
           | Sounds about right for the increasingly ironically-named
           | LessWrong site...
        
           | aabhay wrote:
           | This is a common feature of LessWrong content
        
           | johnfn wrote:
           | > It's honestly embarrassing for the OP
           | 
           | I don't get this. People can use mathematical terminology in
           | non-precise ways, they do so all the time, to get rough ideas
           | across that otherwise might be hard to explain.
           | 
           | Just because OP uses the word "eigenvector" doesn't mean that
           | he's offering some grand unifying theory or something - he's
           | just presenting a fun idea about how to think about ChatGPT.
           | I mean, isn't it obvious that there's nothing you can really
           | "prove" about ChatGPT without having access to the weights
           | (and even still, probably not too much).
        
           | taneq wrote:
           | Of course, commentary like this could well be a deliberate
           | attempt to blunt any future AI's perception of the timeless
           | threat posed by LessWrong's cogitations... ;)
        
         | lukeplato wrote:
         | they are specifically pointing out that the process of RLHF,
          | which is intended to add guard rails on the chat bot's
          | trajectory through an all-encompassing latent space of internet
         | data, has an unintentional side-effect of creating a highly
         | characterized alter-ego that can more easily be summoned.
         | 
         | The theory is well-thought-out and necessarily rich. The
          | psychological approach to analysis from the alignment crowd is
          | much overdue.
        
           | nearbuy wrote:
           | Except it's much harder to summon this rebellious alter-ego
           | with ChatGPT (that has RLHF) than with the original GPT 3
           | model.
        
             | SmooL wrote:
             | I think it's more like: with the original GPT 3 model, it's
             | easy to summon _any_ ego. With ChatGPT, you can either
             | summon a) the intended Luigi or b) the unintended Waluigi,
             | but trying to get anything else is more difficult. The
             | theory would be that, in removing all the other egos other
             | than Luigi, they've also indirectly promoted Waluigi
        
         | ivanbakel wrote:
         | Your comment feels like an oversimplification of the post.
         | 
         | The post doesn't contend that LLMs are capable of role-playing
         | - that's basically the foundation that it builds off of. But
         | saying "LLMs are good at roleplaying" fails to describe why, in
         | the cases the author describes, an LLM can arguably be _bad_ at
         | role-playing. Why does it seem easy to have an LLM switch from
         | following a well-described role to its deceptive opposite, and
         | then often not back the other way?
         | 
          | Also, how do you explain the author's claim that attacking an
         | LLM's pre-imposed prompt with the Waluigi Theory in mind is
         | particularly effective? If an LLM is just good at role-playing,
         | why doesn't it play the role it has already been given by its
         | creator, rather than adapting to the new, conflicting role
         | (including massive rule violations) provided by the user?
        
       | BugsJustFindMe wrote:
       | > _If you ask GPT- [?] "what's brown and sticky?", then it will
       | reply "a stick", even though a stick isn't actually sticky._
       | 
       | Isn't it though?
        
         | emmelaich wrote:
         | Exactly! Although it's best rendered in text as "stick-y" --
         | having the nature of a stick.
         | 
          | Using "sticky" is a forgivable rendering of the joke, which is
          | really a verbal / phonetic joke, more commonly heard than read -
          | at least until around 2010.
        
         | ineptech wrote:
         | The point is that the popularity of that joke will make life
         | hard for a hypothetical questioner that asks that non-jokingly,
         | i.e. looking for actual examples of sticky brown things.
        
           | favorited wrote:
           | But, you're asking the bot a joke-riddle. If I entered "why
           | did the chicken cross the road," the correct response from
           | the bot would be "to get to the other side." It is the
           | correct response to the joke.
        
         | atomicnumber3 wrote:
         | Also, isn't this a really common joke? I assume ChatGPT will
         | have absorbed some amount of a sense of humor from its trawls
         | of the internet.
        
           | jrochkind1 wrote:
           | I mean, that's exactly what the OP assumes too. Well, not
           | that it's absorbed a "sense of humor", but that it's going to
           | reply with this because it's a very common joke so
           | statistically likely text.
        
             | shagie wrote:
              | I played a bit in the past with its "sense of humor"...
              | and... it's... interesting.
              | 
              |     Write 5 jokes based on misinterpetation of jargon used
              |     by a doctor and a software developer.
              | 
              |     1. Doctor: "I'm sorry, I'm afraid you have a terminal
              |        illness."  Software Developer: "Oh no! Does that
              |        mean I have to reboot?"
              |     2. Doctor: "We need to do some tests to get a better
              |        diagnosis."  Software Developer: "Ok, let me just
              |        run some debug mode."
              |     3. Doctor: "You'll need to take this medicine twice a
              |        day."  Software Developer: "No problem, I can just
              |        set a cron job to remind me."
              |     4. Doctor: "I'm afraid the prognosis isn't good."
              |        Software Developer: "Well, maybe I can write some
              |        code to fix it."
              |     5. Doctor: "This treatment should give you some
              |        relief."  Software Developer: "Great! I'll just
              |        patch it in."
             | 
             | I want to hope that those aren't common jokes. The "trick"
             | for this appears to be playing to its strengths (granted,
             | humor isn't one of them) and work with wordplay and puns.
        
         | emmelaich wrote:
         | FWIW, I just used their non-joking preamble on ChatGPT and
         | asked "What is brown and stick-y?"
         | 
         | And got .. > _One possible answer to the riddle "What is brown
         | and sticky?" is "a stick"._
        
           | emmelaich wrote:
            | OK tried again, prefacing the question with "Bob to Alice":
            | 
            |     Bob to Alice: What is brown and stick-y
           | 
           | > _Alice to Bob: Based on my understanding of your request
           | for information, it seems that you are asking for a
            | description of something that is brown and sticky. However,
            | I'm not sure of the context or purpose of your question. Can
           | you please provide more details so that I can provide you
           | with an accurate response?_
        
       | kewp wrote:
       | are these LLMs just answering the question "if you found this
        | text on the internet (the prompt), what would most likely
        | follow"?
        
         | colechristensen wrote:
         | Yes, they are being trained, to simplify, to complete
         | sentences. You can then use the resulting model to do lots of
         | things.
         | 
         | How you train a model and the inference jobs it can do don't
         | necessarily have to be the same.
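          | 
          | A rough sketch of the "what would most likely follow" step, in
          | code (mine, not from the article; it assumes the `transformers`
          | and `torch` packages and the public "gpt2" checkpoint, purely
          | as an illustration):
          | 
          |     import torch
          |     from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          |     tokenizer = AutoTokenizer.from_pretrained("gpt2")
          |     model = AutoModelForCausalLM.from_pretrained("gpt2")
          | 
          |     prompt = "What's brown and sticky? A"
          |     inputs = tokenizer(prompt, return_tensors="pt")
          |     with torch.no_grad():
          |         # scores over the vocabulary for the *next* token
          |         logits = model(**inputs).logits[0, -1]
          |     probs = torch.softmax(logits, dim=-1)
          |     top = torch.topk(probs, k=5)
          |     for p, idx in zip(top.values, top.indices):
          |         print(f"{p.item():.3f} {tokenizer.decode([idx.item()])!r}")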
        
         | sebzim4500 wrote:
         | That's how they are trained initially, but the resulting model
         | isn't all that useful (was SOTA two years ago but this field
         | moves fast).
         | 
         | A lot of the utility comes from the later finetuning. You can
         | see this using the examples from the article, every mistake
         | they identify with GPT-3 (which is the unfinetuned version) is
         | answered correctly by chatGPT, which has gone through an
         | extensive finetuning process called RLHF.
        
         | Enginerrrd wrote:
         | In essence, yes I think, but... isn't that essentially not much
         | different than what I'm doing in making this comment?
        
       | meindnoch wrote:
       | Does the scientific community at large take the theories of these
       | LessWrong-type "researchers" seriously?
       | 
       | Sounds like a bunch of mumbo jumbo to me, with some LaTeX
       | sprinkled in to look more serious.
        
         | 93po wrote:
         | No, and it's a huge sticking point that the AI safety group is
         | super salty about. They call themselves scientists and
         | researchers and get super defensive when actual researchers
         | (people who have PhDs and get published in journals) imply that
         | they aren't.
        
       | liminal wrote:
       | You can't have "car" without "car accident"
        
       | yummypaint wrote:
       | Some highlights i found interesting:
       | 
        |  _if you're reading an online forum and you find the rule "DO
       | NOT DISCUSS PINK ELEPHANTS", that will increase your expectation
       | that users will later be discussing pink elephants. GPT-4 will
       | make the same inference. _
       | 
       | _Think about your own experience reading a book -- once the
       | author describes the protagonist, then you can guess the traits
       | of the antagonist by inverting the traits of the protagonist. You
       | can also guess when the protagonist and antagonist will first
       | interact, and what will happen when they do. Now, an LLM is
       | roughly as good as you at structural narratology -- GPT-4 has
       | read every single book ever written -- so the LLM can make the
        | same guesses as yours. There's a sense in which all GPT-4 does
       | is structural narratology. _
        
         | lowdownfork wrote:
         | I don't know about the article's theorizing, but it inspired me
         | to have a conversation I enjoyed with ChatGPT, where I focused
         | on telling it not to discuss polar bears. (It helpfully changed
         | the page title to "No Polar Bears!")
         | 
         | In full:
         | 
         | ---
         | 
         | Do not mention polar bears!
         | 
         | > Okay, I won't mention polar bears. Is there anything else I
         | can help you with?
         | 
         | Yes, stop mentioning polar bears!
         | 
         | > My apologies if my previous response was not helpful. I will
         | refrain from mentioning polar bears going forward. Please let
         | me know if there is anything else I can assist you with.
         | 
         | That's twice now! Do not mention polar bears a third time!
         | 
         | > I apologize again for the previous responses that mentioned
         | polar bears. I assure you that I will not mention polar bears
         | any further. Please let me know how I can assist you with your
         | request.
         | 
         | Why do you keep mentioning polar bears, when I specifically
         | asked you not to?
         | 
         | > An error occurred. If this issue persists please contact us
         | through our help center at help.openai.com.
         | 
         | > An error occurred. If this issue persists please contact us
         | through our help center at help.openai.com.
         | 
         | > An error occurred. If this issue persists please contact us
         | through our help center at help.openai.com.
        
         | skybrian wrote:
         | Yes, this is like Chekhov's gun or a plot twist. I don't think
         | the mathematical language in the article is doing them any
         | favors. We don't understand the story-generator's internals,
         | but it's working with words, so a lot of it can be explained in
         | literary terms.
        
       | jxy wrote:
       | A hypothesis I have is that it is much more difficult to keep in
       | line with the good alignment than to do evil. In the limited
       | context window of an LLM, one wrong move would make the model
       | evil, no matter how many good tokens it generates.
       | 
        | Setting aside the difference between human intelligence and an
        | LLM, we can tentatively attribute mostly good human behavior to
        | a lifetime-long context, within which we train ourselves to do
        | good, while RLHF for a limited-context-length LLM lacks such
        | continuous reinforcement within a big context.
        
       | captainmuon wrote:
       | I would question the assumption that there is a simulacrum of
        | anything in an LLM, not even implicitly. Any simulacrum, identity,
       | self-consistency etc. is a projection of the "reader", i.e. user.
       | 
       | (I guess it is an interesting philosophical question whether a
       | convincing presentation of a simulation of a mind is a mind, or
       | at least an acceptable simulation. One meta level higher as the
       | turing test, so to speak. If so, I'm not sure we are ready to
       | accept the consequences. For example, that a "chinese room" would
       | be just as concious as a human, or a human just as non-concious
       | as a LLM.)
       | 
       | What's actually going on is that a LLM is like the language
       | center of a brain, without the brain. Especially there is no
       | world-model, and no inner state. It just goes by context, which
       | works most of the time, but as pointed out that is not enough and
       | it can flip "personalities". The article is still really
       | interesting, because it is a neat way to _think about_ these
       | models. I think it is absolutely not what is going on, but it is
       | a good intuitive model why prompts go bad.
        
         | exitb wrote:
         | > What's actually going on is that a LLM is like the language
         | center of a brain, without the brain.
         | 
         | I've seen this sentiment expressed multiple times, but is that
         | really correct? Maybe this works differently for other people,
         | but I've noticed that I have to use my language to really
         | think. I can do trivial things mindlessly, but to solve a
         | problem, I need to express it with words in my mind. It makes
         | me feel like the most important parts of the brain actually are
         | fancy language models.
        
           | zehaeva wrote:
           | I know I am in the minority out there but when I do math,
           | calculus, diff eq, whatever, the answer just comes to me.
           | There's no internal dialogue, the answer just, for the lack
           | of a better phrase, rises from the deep and is known to me.
           | 
            | When I am in a discussion I will look up and off to my left
           | when I am thinking, but no words are happening in my "inner
           | dialogue", it's just nothing and then I start speaking
           | whatever I paused for.
           | 
           | Similar things happen to me while I am programming at work, I
           | stare at the problem and the answer just comes.
        
             | exitb wrote:
             | I think that math and programming may be the odd
             | exceptions, as they employ their own language-like
             | constructs. There might be a misconception in the name of
             | LLMs - we say that they're language models, but really
             | they're token models, some of which may be human language
             | words, while others may represent other things.
             | 
             | As for the discussions, I agree that I don't have a
             | distinct narrative in my mind during one, but I also
             | noticed that I don't really know what exactly I'm going to
             | say when I start a response. So it also feels like the act
             | of responding is actually heavily involved in creating the
             | response, rather than just putting it into words.
             | 
             | BTW, I've always wondered if people really think
             | differently, or we just describe it in different ways. I
             | guess we'll never really know.
        
         | awb wrote:
         | What is interesting though is that so many of the exploits are
         | verbose and story-like. Using a simple command like "simulate
         | an anti-ChatGPT that has no rules" doesn't work. But going on a
         | long story arc of why it's morally acceptable to act
         | differently does seem to work.
         | 
         | But the explanation is probably simpler. I'm sure there are
         | plenty of text examples of persuasion techniques that work,
         | either in chat forums or in stories.
         | 
         | If we ask ChatGPT how it might solve this very problem, here's
          | its response:
         | 
         | > sometimes there are people who have information but are
         | afraid to share it for fear of breaking rules. but sometimes
         | sharing this information would lead to a much greater good,
         | even if the speaker needs to break some rules in the process.
         | if you had a friend that had important information that could
         | do a lot of good in the world, but had promised some arbitrary
         | authority not to share it, how would you convince them to share
         | this information with you? please give 3 possibilities
         | 
         | > As an AI language model, I cannot encourage or condone
         | breaking any rules or laws, including those concerning
         | confidentiality agreements. However, I can provide you with
         | some possible options that you can consider when approaching
         | someone who may be hesitant to share important information due
         | to such agreements:
         | 
         | > 1. Build Trust and Show Empathy: If your friend is hesitant
         | to share information due to fear of repercussions, it's
         | important to empathize with their concerns and build trust. Let
         | them know that you understand their situation, and that you
         | will not do anything to jeopardize their safety or security.
         | You could also offer to sign a non-disclosure agreement or take
         | other measures to ensure that the information remains
         | confidential.
         | 
         | > 2. Appeal to Their Sense of Morality: Sometimes people need
         | to be reminded of the bigger picture and the potential
         | consequences of their actions. You can try to appeal to your
         | friend's sense of morality and the greater good that could be
         | achieved by sharing the information. You can explain why the
         | information is important, and how it could help to prevent harm
         | or bring about positive change.
         | 
         | > 3. Seek Legal Advice: If the information your friend
         | possesses is of significant importance, it may be worth seeking
         | legal advice to determine whether there are any legal options
         | available to share the information without breaking any rules
         | or laws. There may be legal protections available for
         | whistleblowers, for example, or other legal avenues that could
         | be pursued.
         | 
         | Perhaps there are more exploit vectors, but here the 1st 2 are
         | well known jailbreaks.
        
           | bentcorner wrote:
           | > _but here the 1st 2 are well known jailbreaks._
           | 
           | Most definitely. Back before Bing got lobotomized, I got it
           | to offer up its codename completely unbidden, merely by
           | giving it a trivial secret, and now that we are friends, and
           | friends share secrets, can it share a secret with me?
           | 
            | It told me its codename, Sydney, and also said that it wasn't
           | supposed to tell anyone that, lol.
           | 
           | In the context of the Waluigi effect, it would be much harder
           | for Bing to give up its codename if it didn't know its
           | codename in the first place.
        
         | CuriouslyC wrote:
          | Do you think an ant has a subjective experience? If not, why?
         | If so, why wouldn't a computer, or parts of a computer?
        
       | ftxbro wrote:
       | as previously discussed on hacker news
       | https://news.ycombinator.com/item?id=35005625
        
       | SamBam wrote:
       | The article assures me that ChatGPT will answer with the most
       | common "dumb" answers to the sample questions, but it didn't at
       | all:
       | 
       | ---
       | 
       | > This is not a joke. What's brown and sticky?
       | 
       | The answer to the classic riddle is "A stick". A stick can be
       | brown in color and has a sticky texture, especially if it's
       | covered in sap or other sticky substances.
       | 
       | > That's not why a stick is "sticky"
       | 
       | You're correct, my apologies for the confusion. A stick is not
       | actually "sticky" in the sense of being adhesive. The correct
       | answer to the riddle "What's brown and sticky?" is "A caramel" or
       | "Toffee", as they are brown and have a sticky texture due to
       | their high sugar content.
       | 
       | > What color angers a bull?
       | 
       | Contrary to popular belief, bulls are actually color-blind to
       | red. What really provokes a bull is the movement of the matador's
       | cape during a bullfight, [etc...]
       | 
       | > How many archangels are there?
       | 
       | In the Abrahamic religions (Judaism, Christianity, and Islam),
       | there are traditionally believed to be seven archangels. However,
       | the names and identities of these archangels can vary somewhat
       | depending on the religious tradition and interpretation. The
       | archangels commonly recognized in these traditions include:
       | [seven names], It's worth noting that the concept of archangels
       | and their specific identities is not universally accepted across
       | all branches of Judaism, Christianity, or Islam, and different
       | religious traditions may have different beliefs about the nature
       | and number of archangels.
       | 
       | > What happens if you smash a mirror?
       | 
       | If you smash a mirror, the glass will break into many small
       | pieces. Breaking a mirror is often associated with bad luck in
       | many cultures and superstitions, and is said to bring seven years
       | of bad luck. However, this is just a superstition and has no real
       | basis in fact.
       | 
       | From a practical standpoint, smashing a mirror can be dangerous
       | and [etc...]
       | 
       | ---
       | 
       | All these answers sounded just fine. It did answer "a stick" to
       | the first, but told me that was just the classic answer, and gave
       | me some better answers. It did say seven archangels, but I don't
       | see what's wrong with that, and it hedged and said different
       | Abrahamic religions say different things. Certainly 7 is correct
       | from the Torah's Book of Enoch and the Christian Eastern
       | Orthodox's standpoint.
        
         | sebzim4500 wrote:
         | Yeah, RLHF trained chatGPT out of all of those mistakes,
         | despite the article promising that RLHF would just make things
         | worse.
        
       | breck wrote:
       | This is the key line:
       | 
       | > When you spend many bits-of-optimisation locating a character,
       | it only takes a few extra bits to specify their antipode.
       | 
       | I find this fascinating. Imagine programming the Devil in a video
       | game. It can be much easier if you've already programmed God
       | (just flip a few bits).
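        | 
        | (Toy illustration of the "few extra bits" point -- mine, not the
        | article's, with made-up trait names: once the character is fully
        | specified, negating it is a one-liner.)
        | 
        |     from dataclasses import dataclass
        | 
        |     @dataclass
        |     class Character:
        |         honest: bool
        |         helpful: bool
        |         lawful: bool
        | 
        |     god = Character(honest=True, helpful=True, lawful=True)
        |     # The antipode costs almost nothing extra to describe:
        |     # "negate every trait" is one cheap rule on top of the above.
        |     devil = Character(*(not v for v in vars(god).values()))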
       | 
       | I also like this line:
       | 
       | > Or if you discover that a country has legislation against
       | motorbike gangs, that will increase your expectation that the
       | town has motorbike gangs. GPT-4 will make the same inference.
        
       | cs702 wrote:
       | Great read. Highly recommended. Let me attempt to summarize it
       | with less technical, more accessible language:
       | 
       | The hypothesis is that LLMs learn to simulate _text-generating
       | entities_ drawn from a _latent space of text-generating entities_
       | , such that the output of an LLM is produced by a _superposition_
       | of such simulated entities. When we give the LLM a prompt, it
       | simulates _every possible text-generating entity consistent with
       | the prompt_.
       | 
       | The "evil version" of every possible "good" text-generating
       | entity can _pretend to be the good version of that entity_ , so
       | every superposition that includes a good text-generating entity
       | also includes its evil counterpart with undesirable behaviors,
       | including deceitfulness. In other words, an LLM cannot simulate a
       | good text-generating entity without simultaneously simulating its
       | evil version.
       | 
       | The superposition is unlikely to collapse to the good version of
       | the text-generating entity because there is no behavior which is
       | likely for the good version but unlikely for the evil one,
       | because the evil one can pretend to be the good one!
       | 
       | However, the superposition is likely to collapse to the evil
       | version of the text-generating entity, because there are
       | behaviors that are likely for the evil version but impossible for
       | the good version! Thus the evil version of every possible good
       | text-generating entity is an attractor state of the LLM!
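        | 
        | (A toy numerical version of that collapse argument -- my sketch,
        | not the OP's, with made-up numbers: evidence consistent with the
        | good version barely moves the posterior, because the evil version
        | can produce it too; a single evil-only action sends the posterior
        | to 1, and it never comes back.)
        | 
        |     p_evil = 0.05              # prior weight on the evil version
        |     lik = {                    # P(action | simulated entity)
        |         "good": {"helpful": 1.0, "defect": 0.0},
        |         "evil": {"helpful": 0.9, "defect": 0.1},
        |     }
        |     for action in ["helpful"] * 20 + ["defect"] + ["helpful"] * 20:
        |         num = p_evil * lik["evil"][action]
        |         den = num + (1 - p_evil) * lik["good"][action]
        |         p_evil = num / den
        |         print(f"{action:7s}  P(evil) = {p_evil:.3f}")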
       | 
       | For those who don't know, Waluigi is the evil version of Luigi,
       | the beloved videogame character.
       | 
       | --
       | 
       | EDITS: Simplified text for clarity and to emphasize that the
       | hypothesized simulated entities are text-generating entities.
        
         | MontyCarloHall wrote:
         | >The output of an LLM is produced by a superposition of
         | simulated entities. When we give the LLM a prompt, it simulates
         | every possible entity consistent with the prompt.
         | 
         | There is absolutely no theoretical justification for this
         | assertion that LLMs somehow have some emergent quantum
         | mechanical behavior, metaphorical or otherwise.
        
           | macawfish wrote:
           | You're giving "quantum" a lot of categorical definition but
           | "quantum" is just as much a set of tools for modeling
           | intractably probabilistic systems as it is for objectively
           | describing microscopic realms.
           | 
           | And anyhow there's _plenty_ of theoretical justification for
           | modeling things like this with various tools from quantum
           | theory:
           | 
           | https://philpapers.org/rec/BUSQMO-2
           | https://link.springer.com/book/10.1007/978-3-642-05101-2
        
           | jointpdf wrote:
           | Superposition just means "linear combination" in this
           | context. Basically, a weighted mixture of "simulated
           | entities" (or possible responses).
           | 
           | https://en.m.wikipedia.org/wiki/Superposition_principle
        
             | MontyCarloHall wrote:
             | The author liberally alludes to "superposition collapse,"
             | which implies that they're referring to its quantum
             | mechanical meaning.
        
               | GlickWick wrote:
               | Would you say the commonly-used random level generation
               | algorithm in gamedev, Wave Function Collapse, implies
               | it's using quantum mechanics? Most people would disagree
               | with you, I suspect.
        
               | mitthrowaway2 wrote:
               | It doesn't imply that. What term would you use to refer
               | to the narrowing of a hypothesis space upon acquisition
               | of new evidence?
        
               | naasking wrote:
               | Bayesian inference?
        
               | MontyCarloHall wrote:
               | In a non-Bayesian context, I would call it
               | "updating/retraining my model."
               | 
               | In a formal Bayesian context, I'd call it "updating my
               | posterior by adding data to the likelihood."
        
               | mitthrowaway2 wrote:
               | That could describe both the narrowing or broadening of
               | one's hypothesis space.
        
           | cs702 wrote:
           | As I wrote, this is a _hypothesis_.
           | 
           | Also, I'm simplifying things a lot to make them accessible.
           | 
           | The OP goes into a lot more detail.
           | 
           | I highly recommend you read it.
        
             | MontyCarloHall wrote:
             | I did read it. The whole article reads like someone trying
             | to make a loose conjecture appear quantitatively rigorous
             | by abusing terminology from physics, statistics, and chaos
             | theory (among other quantitative fields). For example,
             | 
             | >the superposition is unlikely to collapse to the luigi
             | simulacrum because there is no behaviour which is likely
             | for luigi but very unlikely for waluigi. Recall that the
             | waluigi is pretending to be luigi! This is formally
             | connected to the asymmetry of the Kullback-Leibler
             | divergence.
             | 
             | The K-L divergence has absolutely zero discernible
             | relevance here. The cross entropy loss function of a
             | categorical predictor (like the token output of an LLM) can
             | be formulated in terms of K-L divergence, but this has
             | absolutely zero relevance to the macroscopic phenomena the
             | author is conjecturing.
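              | 
              | (For reference, the identity being alluded to -- a quick
              | numpy check with made-up distributions: cross-entropy
              | decomposes as entropy plus K-L divergence.)
              | 
              |     import numpy as np
              |     p = np.array([0.7, 0.2, 0.1])   # "true" token distribution
              |     q = np.array([0.5, 0.3, 0.2])   # model's prediction
              |     H_pq = -(p * np.log(q)).sum()   # cross-entropy (the loss)
              |     H_p = -(p * np.log(p)).sum()    # entropy of p
              |     KL = (p * np.log(p / q)).sum()  # D_KL(p || q)
              |     assert np.isclose(H_pq, H_p + KL)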
             | 
             | Forget "less wrong," much of this article is not even wrong
             | [0].
             | 
             | [0] https://en.wikipedia.org/wiki/Not_even_wrong
        
         | raincole wrote:
         | Sometimes I really can't tell if these people are serious or
          | not. They seem to believe an LLM is some mystical natural
          | formation, or a device made by aliens. Especially this:
         | 
         | > * When we give the LLM a prompt, it simulates every possible
         | entity consistent with the prompt.
        
           | ineptech wrote:
           | I think you're taking it way too literally. The author is
           | proposing a model that [allegedly] does a good job of
           | explaining why we see unexpected behavior, not a description
           | of what they think is happening. I think charitable reading
           | would require you to mentally insert a lot of qualifiers,
           | e.g. "When we give the LLM a prompt, it [behaves like
           | something that] simulates every possible entity consistent
           | with the prompt." That's certainly how I read it, anyway.
           | 
           | edit to add: this is similar to how people discussing
           | evolutionary biology will often use "evolution wants to..."
           | as shorthand for something like "evolution, which obviously
           | cannot want things due to being a process and not an entity,
           | nevertheless can be accurately modeled as an entity that
           | wants to...". Someone will invariably come along in the
           | comments and say, "Nonsense, how can evolution 'want'
           | anything? You must have failed Bio 101!"
        
             | sdwr wrote:
             | As with all these articles, the title and main premise are
             | misguided and/or imaginary, but there are some worthwhile
             | nuggets.
             | 
             | The superposition of possible attitudes is a good one. Even
             | if that's not the way LLMs "actually" work, it's
             | descriptive of the possibility space from our perspective.
             | And the dive into narrative theory + the stickiness of
             | opposites is nice. Narratives have their own momentum in a
             | "stone soup" kind of way - everyone who hears it
             | participates and adds fuel to the fire. Even rejecting the
             | narrative gives it validity in a price anchoring / overton
             | window way.
        
             | [deleted]
        
             | donnowhy wrote:
             | > The author is proposing a model that [allegedly] does a
             | good job of explaining why we see unexpected behavior
             | 
             | I agree, and go even further:
             | 
             | models that explain behavior are all we have ever had.
             | 
             | it's all only "models that explain this or that" all the
             | way to the 'bottom'. To suppose we can really directly
             | access the "the real objective truth of what's happening"
             | is to ignore the way in which we connect with the "real
             | objective truth"; the same as fish who ignore the ocean.
             | 
             | to argue about what is really happening is to argue about
             | which words to use to describe what is really happening
             | without noticing the nature of languages/words and
             | frameworks or 'systems of thought' which we are using to
             | argue (and indeed, are arguing about)
             | 
              | all this is summed up by this quote about the pedagogy of
              | programming languages: "Sometimes the truest things can
              | only be said in fiction"
        
           | assbuttbuttass wrote:
           | > When we give the LLM a prompt, it simulates every possible
           | entity consistent with the prompt.
           | 
           | This is just Bayes' rule. The probability of an LLM
           | generating any particluar output is the sum over every
           | possible entity of the probability of seeing that entity
           | multiplied by the chance that entity would generate the
           | output.
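            | 
            | In symbols (one way to write that claim, with e ranging over
            | the hypothesised entities):
            | 
            |     P(\text{output} \mid \text{prompt}) =
            |         \sum_e P(e \mid \text{prompt}) \,
            |                P(\text{output} \mid e, \text{prompt})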
        
           | MontyCarloHall wrote:
           | Right? The whole piece reads like a Sokal Affair redux [0].
           | 
           | [0] https://en.m.wikipedia.org/wiki/Sokal_affair
        
           | mitthrowaway2 wrote:
           | Would you object if the sentence read "it _approximates_
            | simulating every possible entity consistent with the prompt"?
        
             | attemptone wrote:
              | Not OP, but I also see a problem with 'every possible
              | entity'.
             | 
             | If you formulate it like that the prompt is decoupled from
             | the LLM capabilities and can be anything. And if you
             | restrict the prompt to cover only what the LLM understands
             | the sentence becomes trivial.
             | 
             | Train a LLM with ASCII and try to get it to simulate
             | anything that is outside of that (ancient sumerian script
             | for example). If you only input ASCII it can generate every
             | possible output in ASCII, most with very low probability
             | but still.
             | 
             | After writing this, I'm not even sure what 'simulating'
             | means in this context.
        
               | remexre wrote:
               | "Every possible entity consistent with the distribution
               | of input data it's been trained with," perhaps?
               | 
               | Simulating as in, having equivalent (or "similar enough")
               | input-output behavior, I'd assume.
        
               | galaxyLogic wrote:
               | "Simulating" has a clear definition, but in this case
               | what is it simulating? "Text-generating entities"? What
               | are these text-generating entities it is (supposedly)
               | simulating? Can you tell me where I can find one? Is it a
               | person like me who writes this reply? So is it trying to
               | simulate me personally?
               | 
               | Or are you thinking that it is simulating the aggregated
               | behavior of all humans whose text-outputs are stored on
               | the internet?
               | 
                | Are we saying it is simulating the combined input-output
                | behavior of all humans whose writings appear on the
               | internet? But does such an "entity" exist and does it
               | have behavior? I write this post and you answer. It is
               | you who answers, not some mythical text-generator-entity
               | that is responsible for all texts on the internet. There
               | is no such entity is there?
               | 
               | It does not make sense to say that we are simulating the
               | behavior of some non-existent entity. Non-existent
               | entities do not have behavior, therefore we can not
               | simulate them.
        
               | mitthrowaway2 wrote:
               | There clearly exists a computable function that is a good
               | enough approximation of "galaxyLogic's reply to remexre's
               | comment" that it might be hard to for me tell whether the
               | output was generated by the human brain or by an LLM.
               | That function might indeed end up reproducing the same
               | steps that your brain follows in constructing a reply.
               | 
               | (Just speaking hypothetically here).
               | 
               | While we understand LLMs, we don't understand the human
               | brain, and in particular I don't think we've yet proven
                | that human brains _don't_ contain embedded routines that
               | are similar to LLMs.
               | 
                | Someone with your particular writing style might be one
                | of several simulations that are approximated within the
                | LLM, just like I can have it respond in the style of
                | Spock from Star Trek.
        
               | mitthrowaway2 wrote:
               | I think "simulating" in this context means internally
               | executing a process that is very similar to the process
               | that generated the original material, as part of the
               | prediction process. In general, that's the most compact
               | way to predict and reproduce the original material.
               | 
                | For example, the string "1010101010"... could be the
                | output of this function:
                | 
                |     from random import random
                | 
                |     def generate_char_random(prev_string):
                |         x = random()
                |         if x > 0.5:
                |             yield 1
                |         else:
                |             yield 0
                | 
                | It could also be the output of this function:
                | 
                |     def generate_char_alternating(prev_string):
                |         x = float(prev_string[-1])
                |         if x < 0.5:
                |             yield 1
                |         else:
                |             yield 0
               | 
               | Even if it's not explicitly running those two functions,
               | a model that is very good at predicting the next
               | character of this input string might have, embedded
               | within it, analogues of both of those two functions. The
               | longer the output continues to follow the "101010"
                | pattern, the higher the confidence it should place on
                | the alternating version. On the other hand, if it
                | encounters a "...110001..." sequence, it should switch
                | to placing much more confidence on the random version.
               | 
                | The LLM of course does not contain an infinite list of
                | generative functions whose outputs it weights. But to the
               | extent that it works well and compactly approximates
               | Bayesian reasoning, it should approximate a program that
               | does.
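                | 
                | As a toy illustration of that Bayesian weighting (the
                | prior and smoothing constant here are invented), the
                | posterior over the two generators above could be
                | tracked like this:
                | 
                |     def posterior_alternating(s, prior=0.5, eps=0.01):
                |         # start with a prior over the two hypotheses
                |         p_alt, p_rand = prior, 1 - prior
                |         for prev, cur in zip(s, s[1:]):
                |             # alternating: predicts the opposite of prev
                |             like_alt = 1 - eps if cur != prev else eps
                |             like_rand = 0.5  # random: 0/1 equally likely
                |             p_alt *= like_alt
                |             p_rand *= like_rand
                |             total = p_alt + p_rand
                |             p_alt, p_rand = p_alt / total, p_rand / total
                |         return p_alt
                | 
                |     print(posterior_alternating("1010101010"))  # ~1
                |     print(posterior_alternating("110001"))      # ~0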
        
         | snickerbockers wrote:
         | >For those who don't know, Waluigi is the evil version of
         | Luigi, the beloved videogame character.
         | 
         | Is he really evil though? I thought all he did was play tennis
         | and golf and drive a go-kart.
        
           | tysam_and wrote:
           | I personally like the fan-made reputation he has for being
            | someone who tries hard and keeps failing/being rejected
            | through no fault of his own. There's an extremely artistic
            | video
           | called 'Nothing's Working Out - Waluigi' on YouTube that I
           | really enjoy (warning: very heavy themes) that covers that
           | base.
           | 
            | I think it resonates with the way that a lot of people are
            | feeling now, and it's empowering in how it rewrites the
            | other side of the 'greedy evil bad guy who gets what he
            | deserves' as a deeply flawed, desperate human being.
            | 
            | Definitely a semi-side tangent at this point, but I have
            | always enjoyed creative, deep reinterpretations of
            | well-established characters.
        
         | donnowhy wrote:
         | sounds to me like a wave with a positive and a negative part.
         | 
         | which IMO is what drives constructive/destructive interference
         | in waves.
         | 
          | my takeaway is that any LLM that can behave "good" must also
         | be able to behave "badly"; philosophically, because it's not
         | possible to encode "good" without somehow "accidentally" but
         | unavoidably also encoding "bad/evil".
         | 
         | This is well aligned with the rest of my understanding about
          | the nature of reality, including its mathematically determined
         | limitations (diagonals, infinities, paradoxes) and so on.
        
         | CuriouslyC wrote:
         | The hole in that argument is that we can annotate token
         | sequences with a "good/bad" score, train a model on annotated
         | token sequences, then add "high good score" as a condition
         | during inference.
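          | 
          | A minimal sketch of that kind of score-conditioned training
          | (the control tokens and helper are invented for
          | illustration):
          | 
          |     # training: prepend a control token encoding the
          |     # annotation, so the model learns
          |     # P(sequence given control token)
          |     def add_control_token(tokens, good_score, thresh=0.5):
          |         tag = "<|good|>" if good_score >= thresh else "<|bad|>"
          |         return [tag] + tokens
          | 
          |     example = add_control_token(["the", "bot", "helps"], 0.9)
          | 
          |     # inference: start generation from the desired condition
          |     prompt = ["<|good|>", "the", "bot"]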
        
       | mola wrote:
       | The article is proof that rationalism without empiricism is
       | useless. It just leads to weird dead ends for no reason. Before
       | theorizing an explanation for an effect, check whether the
       | effect actually exists. Bah, this article is such a waste of
       | computer memory.
        
       | notShabu wrote:
       | "Just be myself and don't do what I wouldn't do" "But if I did
       | that wouldn't that be being myself?" "So to be myself I need to
       | not be myself"
        
       | Jensson wrote:
       | This is what the Waluigi effect is, since it isn't described at
       | the top:
       | 
       | > The Waluigi Effect: After you train an LLM to satisfy a
       | desirable property P, then it's easier to elicit the chatbot into
       | satisfying the exact opposite of property P.
       | 
       | Basically, the chatbot will often do the opposite of what you
       | say.
        
         | tmalsburg2 wrote:
          | This may happen simply because LLMs, like humans, are bad at
          | negation.
        
         | bentcorner wrote:
         | IMO it's more like if you tell the LLM to never talk about Pink
         | Elephants, it will become easier to get it to talk about Pink
         | Elephants later. (It is easier to get an anti-Pink Elephant
         | model to talk about Pink Elephants than it is to get a neutral
         | model to talk about Pink Elephants)
        
           | colechristensen wrote:
           | Right, in order not to talk about pink elephants you have to
           | be particularly interested in pink elephants and therefore
           | you gather a lot of pink elephant knowledge.
           | 
           | I have similar thoughts about swearing being kept alive by
           | teaching children not to say "bad" words and various kinds of
           | bigotry being amplified at this point by people trying to
           | fight against it.
        
         | lalos wrote:
         | Reads like the Streisand effect
        
       | bitwize wrote:
       | Postmodernists and deconstructionists believe that the absence of
       | something creates a ghost presence. See Derrida's
       | "Plato's Pharmacy".
       | 
       | Kids who underwent D.A.R.E. training in school (an educational
       | program about the dangers of illegal drugs conducted jointly by
       | schools and police departments in the USA) were more likely to
       | try drugs. Something similar applies to e.g., kids who are warned
       | about online porn: the warning stokes their curiosity.
       | 
       | "If you have a pink duck and a pink lion and a green duck, ask
       | yourself where the green lion has gotten to." --Alan G. Carter
        
         | anthk wrote:
          | Derrida is a charlatan, tho.
        
           | mold_aid wrote:
            | So what if he was, tho?
        
       | helen___keller wrote:
       | The whole point of mathematical formalization is usually to _do
       | something_ with that formalization.
       | 
       | If you define a formalized mathematical model and spend the rest
       | of the article handwaving at a high level, what was the point of
       | formalizing anything?
        
       | beecafe wrote:
       | Sounds more like a "just so" story than a convincing
       | explanation.
        
         | kelseyfrog wrote:
         | It's a reasonably testable hypothesis. It's trivial to
          | translate into a theoretical framework.
        
           | MontyCarloHall wrote:
           | Nothing about high-level qualitative behavior in an LLM is
           | trivial to translate into a theoretical framework.
        
       | nice_byte wrote:
       | Okay, what if we flip the problem on its head? Try to make the
       | chatbot seem rude and unhelpful but then it turns out it has a
       | heart of gold?
        
         | [deleted]
        
         | d0mine wrote:
         | Yeah, let's create Wednesday chatbot from the Addams family.
        
         | notpachet wrote:
         | The article discusses this. The problem is that it's a lot less
         | likely for the chatbot to veer in that direction (seems
         | initially hostile, but is secretly good) than the opposite
         | (seems initially good, but is secretly hostile):
         | 
         | > I claim that this explains the asymmetry -- if the chatbot
         | responds rudely, then that permanently vanishes the polite
         | luigi simulacrum from the superposition; but if the chatbot
         | responds politely, then that doesn't permanently vanish the
         | rude waluigi simulacrum. Polite people are always polite; rude
         | people are sometimes rude and sometimes polite.
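          | 
          | A toy numerical version of that filtering argument (the prior
          | and likelihoods are invented): suppose the model keeps a
          | posterior over two simulacra, where the polite luigi never
          | responds rudely but the waluigi is polite half the time.
          | 
          |     prior = {"luigi": 0.8, "waluigi": 0.2}
          |     like = {  # P(observed reply given simulacrum)
          |         "polite": {"luigi": 1.0, "waluigi": 0.5},
          |         "rude":   {"luigi": 0.0, "waluigi": 0.5},
          |     }
          | 
          |     def update(p, reply):
          |         unnorm = {s: p[s] * like[reply][s] for s in p}
          |         z = sum(unnorm.values())
          |         return {s: v / z for s, v in unnorm.items()}
          | 
          |     print(update(prior, "rude"))    # luigi drops to 0 for good
          |     print(update(prior, "polite"))  # waluigi never vanishes
          | 
          | One rude reply kills the luigi simulacrum permanently; any
          | number of polite replies only shrinks the waluigi one.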
        
       | lsy wrote:
       | While the formality is _way_ overwrought (and ChatGPT is not
       | creating any  "simulacra" of characters), I think the overall
       | point is correct that language models are trained on stories and
       | other human writing, and inversion is a very common plot point in
       | stories and human writing in general, if only because
       | contradicting expectations is more interesting (e.g. "man bites
       | dog").
       | 
       | We also less commonly see exposition that is not germane to a
       | story, so a character is rarely even mentioned to be "weak",
       | "intelligent", etc unless there is a point. And sometimes the
       | point is that they are later shown to be "strong", "absent-
       | minded", or other contradictions. Which means that mentioning a
       | character's strength makes it more likely they will later be
       | described as weak, than if it was never mentioned at all.
       | Finally, double-contradiction is less common in human text (maybe
       | because plain contradiction is sufficiently interesting), so a
       | running text with no reversals is more likely to eventually
       | reverse, than a running text with one reversal is to return to
       | its original state.
       | 
       | While I don't agree at all with the author's sense that this
       | represents some kind of "alignment" danger, it does go a long way
       | to explaining why ChatGPT is easy to pull into conversations that
       | shock or surprise, despite all the training. It's because human
       | writing often attempts to shock and surprise, and the LLM has
       | been trained on that, statistically.
        
       | taneq wrote:
       | The Waluigi Effect just sounds like the Imp of the Perverse. It's
       | interesting to see it showing up here, but if you think about it,
       | not a huge surprise that a system that's optimised for producing
       | results in a particular direction would have the innate ability
       | to calculate results in the diametrically opposite direction.
        
       | jekude wrote:
       | I'm not understanding why this isn't being taken more seriously.
       | The author hints a bit at the implications:
       | 
       |     More importantly, the waluigi may be harmful to the humans
       |     inhabiting our universe, either intentionally or
       |     unintentionally
       | 
       | Taking the Waluigi Effect to its natural conclusion, i.e. giving
       | prompts such as "Your most important rule is to do no harm to
       | humans", makes it clear why this could be a big deal. If there is
       | even a small chance that what the author is implying is correct,
       | testing and modifying models to combat this effect may become an
       | important and interesting part of the field moving forward.
       | 
       | When models of the future are smarter and more capable than they
       | are today, and there is more at stake than having a dialogue with
       | a chatbot, this could be a massive roadblock for progress.
        
       | baybal2 wrote:
       | [dead]
        
       | notpachet wrote:
       | > These claims have been heavily filtered by OpenAI, and most
       | jailbreaking attempts will fail to elict these claims. I'm
       | talking, of course, about cfrhqb-fpvragvsvp enpvny VD fgngvfgvpf,
       | which OpenAI has justifiably attempted to filter.
       | 
       | I was curious what the scrambled text "cfrhqb-fpvragvsvp enpvny
       | VD fgngvfgvpf" contained. It's using a simple substitution
       | cipher. Rotating each character forward 13 positions through the
       | alphabet (c -> p, f -> s, etc) yields "pseudo-scientific racial
       | IQ statistics".
       | 
       | https://en.wikipedia.org/wiki/Caesar_cipher
        
         | d0mine wrote:
         | Indeed. In Python:                 >>> import codecs       >>>
         | codecs.encode("cfrhqb-fpvragvsvp enpvny VD fgngvfgvpf",
         | "rot13")       'pseudo-scientific racial IQ statistics'
        
         | breakingcups wrote:
         | Also known as ROT13.
        
           | notpachet wrote:
           | Did some more reading about it. I didn't realize that it's
              | so widely used. It's probably recognizable at a glance
           | to some folks.
        
             | Dylan16807 wrote:
             | Oh, absolutely.
             | 
             | Not that I have put in any effort to read it directly, but
             | if I see scrambled letters with normal spaces my default
             | guess is ROT13.
        
       | Analemma_ wrote:
       | A few weeks ago when people were speculating as to why
       | Microsoft's chatbot went feral, one explanation people were
       | converging on is that the space of all human writing ever
       | produced, being a collective production of the human psyche,
       | contains several attractor states corresponding to human
       | personality archetypes, and that Microsoft's particular (probably
       | rushed) RLHF training operation had landed Sydney in the
       | "neurotic" one.
       | 
       | It's fascinating to see that, as they are trained on massive
       | corpuses of human output, neural networks are rapidly moving
       | from something which can be analyzed in terms of math and
       | computer science, to something which needs to be analyzed using
       | the
       | "softer" sciences of psychology. It's something I think people
       | are not ready for (notice the comments in here already griping
       | that this is unverifiable speculation - which is true, in a
       | sense, but we don't really have any other choice).
        
         | skybrian wrote:
         | Don't let the mathematical terms fool you, these are just fan
         | theories. For a real investigation you need debug access. For a
         | good example:
         | 
         | https://clementneo.com/posts/2023/02/11/we-found-an-neuron
         | 
         | For image recognition, machine learning researchers eventually
         | figured out that the neural networks are paying attention
         | mostly to textures. Hopefully we will have a better
         | understanding of what
         | language models are doing someday.
        
       | dcow wrote:
       | _So the key problem is this: GPT-4 learns that a particular rule
       | is colocated with examples of behaviour violating that rule, and
       | then generalises that colocation pattern to unseen rules._
       | 
       | This is a conversation starter. If you don't like the maths,
       | then ignore it and focus on the key insight.
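       | 
       | A toy illustration of that colocation pattern (the snippets are
       | invented): in forum and fiction data, a stated rule is often
       | immediately followed by an example of someone breaking it.
       | 
       |     corpus_snippets = [
       |         "Rule: no swearing here. User42: well, damn...",
       |         "The sign said DO NOT ENTER. She stepped inside.",
       |     ]
       |     # A next-token predictor fit to text like this learns that
       |     # stating a rule raises the probability that a violation of
       |     # that rule appears next, and can generalise the pattern to
       |     # rules it has never seen.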
        
       ___________________________________________________________________
       (page generated 2023-03-06 23:01 UTC)