[HN Gopher] The Big List of Naughty Strings
       ___________________________________________________________________
        
       The Big List of Naughty Strings
        
       Author : polm23
       Score  : 221 points
       Date   : 2020-05-24 13:44 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | 13415 wrote:
       | I don't quite understand the purpose of this list. It contains
       | potentially malicious input, but also emoticons based on Unicode
       | characters that are completely harmless and used in every second
       | post on Reddit.
        
         | kube-system wrote:
         | I think the purpose is to run these strings through your inputs
         | and make sure your application doesn't behave in unexpected ways.
        
         | MauranKilom wrote:
         | It's essentially a test suite for character encoding all
         | throughout your application. If you input all those strings
         | (e.g. send chat message) and they arrive incorrectly at some
         | other end (e.g. other user receiving chat message) then there's
         | a problem somewhere.
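
The round-trip idea in the comment above can be sketched in Python. This is only an illustration under stated assumptions: `round_trip`, `lossy_latin1`, `find_mangled` and the `samples` list are hypothetical names made up here, and a real test would push the strings through the actual application (form, API, database, other client) rather than a single encode/decode step.

```python
def round_trip(text: str, encoding: str = "utf-8") -> str:
    """Encode text to bytes and decode it back, as one transport hop would."""
    return text.encode(encoding).decode(encoding)


def lossy_latin1(text: str) -> str:
    """Simulate a broken hop that can only carry Latin-1 (assumption)."""
    return text.encode("latin-1", errors="replace").decode("latin-1")


def find_mangled(strings, transport=round_trip):
    """Return the strings that do not survive the transport unchanged."""
    mangled = []
    for s in strings:
        try:
            if transport(s) != s:
                mangled.append(s)
        except UnicodeError:
            # A hop that raises on a string is also a failure worth reporting.
            mangled.append(s)
    return mangled


# Hypothetical sample; the full list is blns.txt in the repository.
samples = ["plain ascii", "Ω≈ç√∫", "田中さんにあげて下さい", "👩🏽", "\ufdfd"]

print(find_mangled(samples))                # clean UTF-8 hop
print(find_mangled(samples, lossy_latin1))  # Latin-1-only hop
```

With the clean UTF-8 hop nothing is reported; with the Latin-1 hop every non-ASCII sample comes back changed, which is the kind of problem such a test is meant to surface.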
        
           | 13415 wrote:
           | That makes sense. Thanks a lot! Of course, it's very useful
           | for testing. I erroneously assumed it was for input
           | validation.
        
         | minimaxir wrote:
         | I made the list while I was a Software QA Engineer at Apple,
         | since there were a bunch of fun Unicode strings causing
         | particular issues there, which gave me the idea.
        
         | Dylan16807 wrote:
         | There are a lot of ways to mishandle Unicode. Checking that
         | non-BMP characters work, that emojis in various sections work,
         | and that emojis with modifiers work are all good tests.
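
To make the non-BMP and modifier cases concrete, here is a small Python sketch (the helper name `describe` is made up for this example). It shows why these strings trip up code that assumes one character equals one storage unit.

```python
def describe(text: str) -> dict:
    """Report how large a string is under several common measures."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
        "has_non_bmp": any(ord(ch) > 0xFFFF for ch in text),
    }


thumbs_up = "\U0001F44D"                   # emoji outside the BMP
thumbs_up_toned = "\U0001F44D\U0001F3FD"   # same emoji plus a skin-tone modifier

print(describe(thumbs_up))
# {'code_points': 1, 'utf16_units': 2, 'utf8_bytes': 4, 'has_non_bmp': True}
print(describe(thumbs_up_toned))
# {'code_points': 2, 'utf16_units': 4, 'utf8_bytes': 8, 'has_non_bmp': True}
```

A field limited to N UTF-16 units or N bytes will truncate these strings earlier than a naive character count suggests, and cutting between the emoji and its modifier changes what is displayed.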
        
         | toomanybeersies wrote:
         | It's useful for testing a variety of things that take
         | text/string inputs, such as forms in web applications. It's a
         | handy tool for testing a site (preferably one you have
         | permission to test) for XSS or SQL injection, character
         | encoding problems, or even just form length problems.
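
A minimal sketch of the three checks mentioned above (SQL injection, XSS, form length), using only the Python standard library. The table name, `MAX_LEN`, `store_comment`, `render_comment` and the sample strings are assumptions chosen for illustration, not taken from the list or from any particular application.

```python
import html
import sqlite3

MAX_LEN = 64  # hypothetical form length limit


def store_comment(conn: sqlite3.Connection, text: str) -> None:
    """Store text using a parameterized query, so it is treated as data."""
    if len(text) > MAX_LEN:
        raise ValueError("comment too long")
    conn.execute("INSERT INTO comments (body) VALUES (?)", (text,))


def render_comment(text: str) -> str:
    """Escape text before placing it into HTML."""
    return "<p>" + html.escape(text) + "</p>"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")

# Hypothetical naughty inputs in the spirit of the list.
naughty = [
    "'; DROP TABLE comments; --",
    "<script>alert(1)</script>",
    "1;DROP TABLE users",
]
for s in naughty:
    store_comment(conn, s)

rows = [r[0] for r in conn.execute("SELECT body FROM comments ORDER BY rowid")]
print(rows == naughty)              # strings stored verbatim, table intact
print(render_comment(naughty[1]))   # markup is escaped, not executed
```

The point of the test is that each string comes back exactly as entered and is displayed as text; anything else (a dropped table, an executed script, a silent truncation) is a bug.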
        
       | harunurhan wrote:
       | OK, seeing "﷽" [1] was unexpected :). For those who do not
       | know, it's very important for Muslims and it's all over the Quran.
       | 
       | [1] https://github.com/minimaxir/big-list-of-naughty-strings/blo...
        
         | cheez wrote:
         | what does it mean?
        
           | gnulinux wrote:
           | "In the name of God, the Most Gracious, the Most Merciful."
           | 
           | From Wikipedia: https://en.wikipedia.org/wiki/Basmala
           | 
           | Disclaimer: I'm not a Muslim, I don't know Arabic.
        
           | ctdonath wrote:
           | https://www.urbandictionary.com/define.php?term=%EF%B7%BD
           | 
           | Fun fact: it's a single Unicode character.
        
             | harunurhan wrote:
             | yeah I didn't know it until i tried to copy-paste to post
             | here :)
        
             | atomwaffel wrote:
             | Yup, you can put 280 of it into a single tweet.
        
               | robinhouston wrote:
               | I don't _think_ that's right. I looked into the way
               | Twitter counts characters when I was trying to work out
               | the largest prime number that could be written out in
               | full, in base ten, in a single tweet[1]; the rules are
               | more complicated than you might expect, and have changed
               | several times.
               | 
               | The current rule seems to be that all Unicode characters
               | count as two, except for the ranges 0-4351, 8192-8205,
               | 8208-8223 and 8242-8247 which count as one.
               | 
               | [1] In case you're wondering, I think it's, arguably:
               | https://twitter.com/robinhouston/status/1197294154738544641
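
The counting rule as described in the comment above can be written down directly. This Python sketch encodes only that description (the listed ranges count as one, everything else as two, against the 280 limit mentioned earlier in the thread); it is not Twitter's actual implementation, which also has to deal with URLs, normalization and other special cases. The names `weight`, `weighted_length` and `max_repeats` are made up here.

```python
# Code point ranges that count as one, as quoted in the comment (inclusive).
SINGLE_WEIGHT_RANGES = [(0, 4351), (8192, 8205), (8208, 8223), (8242, 8247)]
LIMIT = 280  # limit assumed from the thread


def weight(ch: str) -> int:
    """Weight of a single code point under the rule described above."""
    cp = ord(ch)
    return 1 if any(lo <= cp <= hi for lo, hi in SINGLE_WEIGHT_RANGES) else 2


def weighted_length(text: str) -> int:
    """Total weight of a string."""
    return sum(weight(ch) for ch in text)


def max_repeats(ch: str, limit: int = LIMIT) -> int:
    """How many copies of one code point fit within the limit."""
    return limit // weight(ch)


print(max_repeats("a"))        # 280
print(max_repeats("\ufdfd"))   # 140, since U+FDFD (65021) is outside the ranges
```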
        
               | atomwaffel wrote:
               | Good point! Still, I could swear I saw someone
               | (@FakeUnicode?) do exactly this once, but of course I
               | can't find that tweet any more, partly because it turns
               | out that search engines don't handle ﷽ well at all, and
               | I don't feel like testing it on my own followers somehow.
               | 
               | Edit: it looks like it might count it as two characters,
               | so that's only 140 per tweet.
        
           | mmastrac wrote:
           | https://charbase.com/fdfd-unicode-arabic-ligature-bismillah-...
           | 
           | Also fun is ﷺ
           | (https://charbase.com/fdfa-unicode-arabic-ligature-sallallaho...),
           | which has the longest Unicode decomposition IIRC.
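
The decomposition mentioned above can be inspected with Python's standard `unicodedata` module. This is a small sketch; `compat_decomposition` is a helper name invented for the example, and the exact length printed depends on the Unicode data shipped with the interpreter.

```python
import unicodedata


def compat_decomposition(ch: str) -> str:
    """Return the compatibility (NFKD) decomposition of a string."""
    return unicodedata.normalize("NFKD", ch)


for ch in ("\ufdfd", "\ufdfa"):
    expanded = compat_decomposition(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"1 code point -> {len(expanded)} after NFKD")
```

Any code that normalizes input with NFKC or NFKD before storing it should expect a single code point like U+FDFA to grow considerably, which matters for length limits and fixed-size columns.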
        
             | beobab wrote:
             | I had to zoom in to 400% to be able to see the detail
             | there.
        
         | lopmotr wrote:
         | Can't it be made up of individual characters or is it stylized
         | in a unique way?
        
       | toolslive wrote:
       | https://github.com/minimaxir/big-list-of-naughty-strings/blo...
       | 
       | just lovely ;)
        
         | bryanrasmussen wrote:
         | I did feel sort of let down that they didn't have man-hole
         | cover.
         | 
         | on edit: yeah, I'm not gonna send a pull request on that one.
        
         | duggable wrote:
         | This one got me:
         | 
         | > "If you're reading this, you've been in a coma for almost 20
         | years now. We're trying a new technique. We don't know where
         | this message will end up in your dream, but we hope it works.
         | Please wake up, we miss you."
         | 
         | Strangely terrifying....
        
           | ball_of_lint wrote:
           | Eh, it's just a meme:
           | 
           | https://www.reddit.com/r/copypasta/comments/5we0ny/if_youre_...
           | 
           | I guess it is intriguing, in a Roko's Basilisk sort of way.
        
           | willismichael wrote:
           | The question is, which one of us is the message meant for?
        
             | naniwaduni wrote:
             | Why would there be more than one of you?
        
               | myself248 wrote:
               | Why are there so many of me posting in this thread?
        
             | was_boring wrote:
             | Who says it's only meant for one?
        
               | bryanrasmussen wrote:
               | the real question is: Inception. Can it be done?
        
       | montroser wrote:
       | Almost related:
       | https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and...
        
         | jzl wrote:
         | Hilarious, but also important!
        
         | dorgo wrote:
         | What? Only 151 Russian words? The Russians have their own
         | dedicated sub-language which consists solely of bad words.
         | No idea or concept is too complicated to be expressed in bad
         | words alone. They switch from normal Russian to bad-words
         | Russian as soon as the situation allows it.
        
       | Udik wrote:
       | [text with heavily stacked combining diacritics; the marks did
       | not survive this text dump]
       | 
       | Wow, what's this? :)
        
         | EvanAnderson wrote:
         | It reminds me a little bit of Feynman diagrams.
        
         | majewsky wrote:
         | Layers upon layers of combining diacritics.
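
A rough Python sketch of how such stacked text is built and how it can be detected or stripped. The helper names and the choice of marks are assumptions for illustration; the string in the list uses many more marks than this.

```python
import unicodedata


def stack_marks(base: str, marks: list, layers: int) -> str:
    """Append the given combining marks to a base character, repeated."""
    return base + "".join(marks) * layers


def count_marks(text: str) -> int:
    """Count code points whose general category is a mark (Mn, Mc, Me)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("M"))


def strip_marks(text: str) -> str:
    """Decompose, then drop every mark, leaving the base characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if not unicodedata.category(ch).startswith("M"))


marks = ["\u0301", "\u0302", "\u0303"]   # combining acute, circumflex, tilde
stacked = stack_marks("d", marks, 5)

print(len(stacked))          # 16 code points for one visible letter
print(count_marks(stacked))  # 15
print(strip_marks(stacked))  # d
```

Stripping marks is destructive for legitimate accented text (it also turns "é" into "e"), so a cap on the number of consecutive marks is usually a gentler defence than removing them all.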
        
       | folkhack wrote:
       | Solid list for a quick SQL injection and XSS reference with lots
       | of examples. Even Unicode, accented, and two-byte characters are
       | super useful for checking handling all the way from the front end
       | to the persistent storage solution (DB, etc.).
       | 
       | Lost it laughing at the "Human Injection" section:
       | 
       | > # Strings which may cause human to reinterpret worldview
       | 
       | > If you're reading this, you've been in a coma for almost 20
       | years now. We're trying a new technique. We don't know where this
       | message will end up in your dream, but we hope it works. Please
       | wake up, we miss you.
        
         | yosito wrote:
         | I would wake up if I could, but I opened this string in vim and
         | I can't remember how to exit.
        
         | foresto wrote:
         | > Please wake up, we miss you.
         | 
         | I think that sentence gives itself away as modern. Were comma
         | splices in common use 20 years ago?
        
           | frank2 wrote:
           | Yes they were. IIRC at least one of the major manuals of
           | style endorsed them at least in some situations.
        
             | maxfan8 wrote:
             | That's interesting. Maybe it's considered hypercorrect?
             | 
             | Which style manual are you referring to?
        
       | gerdesj wrote:
       | Jimmy Clitheroe - the Clitheroe Kid. That brings back some
       | memories. It's also nice to see that England is suitably
       | represented in the place names, obviously Scunthorpe is the
       | classic. I'll tender Somerset for first amongst equals for daft
       | and downright odd place names.
        
       | jzl wrote:
       | Also tangentially related: the big list of usernames that should
       | be disallowed in any online system:
       | https://github.com/forwardemail/reserved-email-addresses-lis...
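
As a rough illustration of how such a list might be applied at sign-up, here is a Python sketch. The `RESERVED` set is a tiny hypothetical subset written for this example and is not copied from the linked repository; `is_reserved` is likewise an invented helper.

```python
# Hypothetical subset of reserved names; a real deployment would load
# the full list from a maintained source.
RESERVED = {"admin", "root", "postmaster", "mail", "support"}


def is_reserved(username: str) -> bool:
    """Case-insensitive check against the reserved set."""
    return username.strip().casefold() in RESERVED


print(is_reserved("Admin"))    # True
print(is_reserved(" mail "))   # True
print(is_reserved("polm23"))   # False
```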
        
         | DominikPeters wrote:
         | Ugh, that list might be why my email address mail@[personal
         | domain] is forbidden more and more often.
        
       | chris_wot wrote:
       | Strongly advise not using cat on the list, you will get beeped
       | at.
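
The beep presumably comes from control characters in the file, such as BEL (U+0007), reaching the terminal unfiltered; that cause is an assumption, not something stated in the thread. A small Python sketch for printing such text safely (`visible` is a made-up helper name):

```python
import unicodedata


def visible(text: str) -> str:
    """Replace control characters (except newline and tab) with \\xNN escapes."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cc" and ch not in "\n\t":
            out.append(f"\\x{ord(ch):02x}")
        else:
            out.append(ch)
    return "".join(out)


print(visible("beep\x07 and escape\x1b[0m"))
# beep\x07 and escape\x1b[0m
```

On the command line, `cat -v` gives a similar escaped view without a script.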
        
         | fareesh wrote:
         | would that be considered animal abuse :D
        
       | afandian wrote:
       | This is deliciously ironic:
       | 
       | > Also, do not send a null character (U+0000) string, as it
       | changes the file format on GitHub to binary and renders it
       | unreadable in pull requests.
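
Many tools decide whether a file is text or binary with a simple heuristic: look for a NUL byte near the start of the file. The Python sketch below shows that heuristic in general form; the function name and the 8000-byte probe size are assumptions for illustration, and GitHub's actual detection logic is not documented in this thread.

```python
def looks_binary(data: bytes, probe: int = 8000) -> bool:
    """Heuristic: treat data as binary if a NUL byte appears in the first bytes."""
    return b"\x00" in data[:probe]


print(looks_binary("plain text\n".encode("utf-8")))         # False
print(looks_binary("naughty\x00string\n".encode("utf-8")))  # True
```

Under a heuristic like this, a single U+0000 in an otherwise textual file is enough to flip the whole file to "binary", which matches the behaviour the README warns about.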
        
       | monax wrote:
       | Yup, can't view the file using the GitHub app for Android
        
         | minimaxir wrote:
         | Out of curiosity, what happens when you try to do so?
        
           | Johnjonjoan wrote:
           | Something went wrong
           | 
           | <button>TRY AGAIN</button>
           | 
           | Edit: as far as I could see, it's only opening blns.txt that
           | causes this error; the other files are fine in the app.
        
       | dhosek wrote:
       | I encountered an amusing instance of this recently watching my
       | six-year-old son playing music on the kitchen Alexa. Alexa felt
       | it was necessary to censor the name of a children's song
       | entitled, "Pussy Cat, Pussy Cat."
        
         | inetsee wrote:
         | When I saw the title I thought it was a list of profanity that
         | one might want to filter out from an open web application (i.e.
         | a list that also includes swear words from multiple languages).
        
       | dang wrote:
       | See also:
       | 
       | 2018 https://news.ycombinator.com/item?id=18466787
       | 
       | 2017 https://news.ycombinator.com/item?id=13406119
       | 
       | Show HN from 2015: https://news.ycombinator.com/item?id=10035008
        
       ___________________________________________________________________
       (page generated 2020-05-24 23:00 UTC)