[HN Gopher] The Greatest Regex Trick Ever (2014)
___________________________________________________________________
The Greatest Regex Trick Ever (2014)
Author : signa11
Score  : 249 points
Date   : 2021-07-08 16:49 UTC (6 hours ago)
(HTM) web link (rexegg.com)
(TXT) w3m dump (rexegg.com)
| capitalbreeze wrote:
| This is awesome!! Well done "Tarzan"
| ppierald wrote:
| I used to just go ask Friedl.
| Tipewryter wrote:
| The solution...
|     not_this|(but_this)
|
| ... is interesting. But since it returns the match in a submatch
| I would say the \K approach is better:
|     (?:not_this.*?)*\Kbut_this
|
| Because usually when you try hard to accomplish something with a
| regex, you do not have the luxury to say "And then please
| disregard the match and look at the submatch instead".
| lifthrasiir wrote:
| That doesn't work. `(?:"Tarzan".*?)*\KTarzan` should behave
| identically without `\K`, and it will match `"Tarzan" "Tarzan"`
| because the ungreedy quantifier ? still allows backtracking (it
| just changes the search order). You want the possessive
| quantifier + instead; `not_this|(but_this)` is equivalent
| because regexp engines will not look back into a string once it
| has been matched.
| high_byte wrote:
| it's nice. I'm way more dumbfounded by the prime thing though
| rprenger wrote:
| Me too. I had to look it up. This page has a pretty good
| breakdown:
|
| https://itnext.io/a-wild-way-to-check-if-a-number-is-prime-u...
|
| The main trick for me was you first have to convert the number
| to unary, which was done outside of the regex.
| asah wrote:
| speaking as an old regexp wizard from before perl5, this is
| indeed a great trick, have an upvote.
|
| sadly, this trick still requires a code comment to explain.
| Python example:
|     # match tarzan but not "tarzan"
|     # see https://news.ycombinator.com/item?id=27774584
|     if "tarzan" == re.search(r'"tarzan"|(tarzan)', myvar)[1]:
|         ...
|
| which in practice means it probably deserves a function:
|     if re_search_but_exclude(r'tarzan', myvar, '"tarzan"'):
|         ...
|
| I don't recommend monkeypatching re, i.e. re.search_but_exclude = ...
| dmurray wrote:
| Is there a reason you have an r-string for the first arg but
| not for the third one?
| nytgop77 wrote:
| A bit off topic, but the commented version was much clearer
| than the version with separate function. (full sentences are
| very good at explaining things)
| 123pie123 wrote:
| of all the things ever invented in software, regex still amazes
| me.
|
| It's almost like nature, many simple rules coming together to
| make extremely clever and fairly complex ideas
| z3t4 wrote:
| It took me over 15 years until I started to willingly use
| RegExp, but now I can't live without it. It's like the curse of
| knowledge, once you learn something you'll lose all empathy
| and assume everyone else knows it too. It still surprises me
| though, I've had bugs like my regex matching terminal color
| sequences messing up the data if it was colored.
| usrusr wrote:
| It feels like something that was more discovered than invented,
| something that would exist even if nobody knew of its
| existence. I get the same feeling when listening to Pharrell
| Williams' Happy.
| imglorp wrote:
| Is anyone having trouble reading the page? It renders as dark
| gray on slightly darker green and is illegible.
| dorianmariefr wrote:
| Please don't
| beders wrote:
| Please don't use regular expressions to parse Dyck languages. It
| doesn't work.
| lifthrasiir wrote:
| Regexp for _tokenization_ does work. This entire essay boils
| down to the fact that you can always postprocess matches and in
| this case that corresponds to tossing unwanted tokens out.
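A minimal sketch of the "postprocess matches" reading, in Python: the unwanted context is matched first, the wanted token is captured in a group, and any match where that group did not participate is tossed out. The name `re_search_but_exclude` and its argument order are taken from asah's comment; the body is only an assumption about how such a helper could look, not asah's implementation. It uses a named group (`keep`, a hypothetical name) so that capture groups inside the arguments do not shift the numbering.

```python
import re

def re_search_but_exclude(pattern, string, exclude):
    """Return the first match of `pattern` that is not part of a match of `exclude`.

    `pattern` and `exclude` are regex source strings. Returns None if every
    occurrence is swallowed by `exclude` or if there is no occurrence at all.
    """
    # The excluded context goes first; the wanted token is captured as "keep".
    combined = re.compile(f"(?:{exclude})|(?P<keep>{pattern})")
    for m in combined.finditer(string):
        # "keep" is None whenever the exclude branch consumed this stretch of text.
        if m.group("keep") is not None:
            return m
    return None

text = 'Me "Tarzan", you Jane. Tarzan swings.'
m = re_search_but_exclude(r"Tarzan", text, r'"Tarzan"')
print(m.group("keep"), m.start("keep"))
```

Unlike the one-liner with `re.search(...)[1]`, this version also copes with input that contains no match at all, where `re.search` returns None and indexing it would raise.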
| miloignis wrote:
| I'm not sure if any regex library exposes this, but since regular
| languages are closed over compliment and intersection you could
| theoretically do something like match("....string..",
| regex("Tarzan") - regex("\"Tarzan\"")), where the - operation is
| shorthand for intersection with the compliment. Does anyone know
| if any regex libraries expose these sorts of operations on the
| regular expression/underlying DFA?
| amenghra wrote:
| Greenery (python3) lets you manipulate regular expressions and
| do things like compute intersections:
| https://github.com/qntm/greenery
| miloignis wrote:
| This is exactly the type of thing I was thinking of, and
| seems quite fully featured - thank you!
| codeflo wrote:
| Unfortunately (or perhaps fortunately), "regexes" as commonly
| implemented in programming languages are only loosely related
| to regular expressions from automata theory. With all their
| extensions, they can recognize much, much more than just
| regular languages, and I don't think they're closed under
| complement (though I'm not sure). However, most regex engines
| have a feature called negative lookahead assertions, (?!do not
| match), which would almost work in the way you suggest.
|
| You have to be careful about inputs like this though:
|     "Inside a string"Tarzan"Again inside a string"
| User23 wrote:
| Yeah, a DFA that recognizes a regular language can easily be
| implemented with O(n) worst case behavior.
|
| My attitude is generally that one should use regexes for
| matching regular languages and if one needs a stack or even
| Turing completeness then handle that in code around the
| regex.
| contravariant wrote:
| Wouldn't that end up just being the same as 'regex(Tarzan)'?
| Those regexes can't match the same thing, they can only
| overlap.
|
| What you want is something like all matches of regex("Tarzan")
| not contained in a match for regex("\"Tarzan\""), which is a
| bit trickier.
| That would require something like:
|
|     regex("Tarzan") - all-substrings(regex("\"Tarzan\""))
|
| and I'm not sure regular languages are closed over the
| "all-substrings" operation. Actually I'm pretty sure they aren't.
| layer8 wrote:
| > compliment
|
| I'll take that as a complement.
| sixo wrote:
| Not exactly that but take a look at
| https://github.com/mtrencseni/rxe ("literate regex"). I found
| this on HN and recall the comment thread being good but I can't
| find it now.
| sodality2 wrote:
| This perhaps? second result on hn.algolia.com.
| https://news.ycombinator.com/item?id=20646174
| lifthrasiir wrote:
| My biggest grief with regexp is that it is just compact code
| disguised as something else. It is relatively common that you
| want to scan a string with action code intermixed. There is a way
| to do that with regexp (Perl (?{...}) etc. or PCRE callouts), but
| it is always awkward to put code into a regexp. As a result we
| typically end up with either complex code that really should
| have used a regexp but couldn't, or a contorted regexp that
| hinders understanding. The essay suggests `(*SKIP)(*FAIL)` at the
| end, which is further evidence that code and regexps don't mix
| well, so a regexp-only solution is somehow considered worthy.
| [deleted]
| 1970-01-01 wrote:
| For me, the site rendered dark gray text on a dark gray
| background and is a chore to read as-is. Outline.com fixed my
| issue with it: https://outline.com/YSYgsp
| nabilhat wrote:
| I got curious and looked back in archive.org to this page's
| initial release in 2014. The text background started out as
| good old reliable background-color: #EEEEEE, which was later
| replaced with
| background: url("http://a.yu8.us/bg-tile-parch.gif")
|
| ...because what could possibly go wrong?
| From the latest comment at the end of the page, the author would
| like you to know that the outcome is your problem, because you're
| using the wrong browser:
|
| _June 20, 2021 - 15:02_
|
| _Subject: RE: Undoing whatever is hiding this page._
|
| _Hi Allen, try a different browser. There's no strange
| shading on the page, your browser is deciding to display it in
| a weird way. Regards, -Rex_
| mmsc wrote:
| Most likely using the HTTPS Everywhere addon. That website is
| not available via HTTPS, and the user must visit the page
| first to accept the 'risk' of using the http version.
| nabilhat wrote:
| Firefox also defaults to HTTPS nowadays. Lots of
| content blockers block third party content too. Regardless,
| if _literally anything_ goes wrong with the third party
| dependency that the article's contrast depends on, the
| best case scenario here is that the text falls back on the
| body's background.
|
| Interestingly, the author also appears to control yu8.us
|
| Breaking one's own content by https-ing one site but not
| another is a great example of why to not prop up a
| website's basic legibility on a third party dependency,
| even if it's one you own and control.
| rentnorove wrote:
| It's definitely nothing to do with the following string in
| the response:
|
| > Page copy protected against web site content infringement
| by Copyscape
| extra88 wrote:
| Yes, the web author made the mistake of defining the
| <article> background-color: #EEEEEE within a min-width 960px
| media query. If the background image fails to load in a wider
| window, there's still a readable contrast between text and
| background but on a phone or other narrow screen, the dark
| background color set on the <body> is what's behind the
| article text.
| [deleted]
| dang wrote:
| "_Please don't complain about website formatting, back-button
| breakage, and similar annoyances. They're too common to be
| interesting. Exception: when the author is present.
| Then friendly feedback might be helpful._"
|
| (It's not that the annoyances aren't annoying, it's that
| they're so common that they lead to repetitive offtopicness
| that compounds into more boring threads.)
|
| https://news.ycombinator.com/newsguidelines.html
| [deleted]
| metalliqaz wrote:
| firefox shows it as black(ish) text on a light yellow
| background. I think you must be blocking something
| jrm4 wrote:
| Part of me reads these things and I'm like "neat trick", but most
| of the time they more-or-less prove to me that Regex is doomed to
| a steady and slow decline.
|
| It's just not a particularly good "interface" for the task it is
| intended to achieve, a little more ability to be "verbose" at the
| possible price of succinctness I think would go a long way. I'm
| more-or-less waiting for the "blank" in: "blank" is to Python
| what Regex is to Perl.
| gota wrote:
| I dream that we will have something like Copilot but
| exclusively for regex and working marvelously
|
| "Find every 2nd instance of a dollar amount that is not encased
| in quotes" outputting <insert regex here> would be awesome
| smnrchrds wrote:
| > The Greatest Regex Trick Ever
|
| was to convince programmers it didn't exist?
| [deleted]
| throwanem wrote:
| The greatest regex trick ever is knowing when _not_ to use one.
| IncRnd wrote:
| I've seen several regexes in various code reviews that are used
| to validate user input but do so in an exponential manner that
| can be exploited for simple DoS attacks.
| xtracto wrote:
| Ooooh or worse, I once caught someone's "email matching"
| RegEx code during a code review that was opening the door for
| some nasty SQL Injection or XSS attacks (kind of like
| validating if the text field _contained_ a valid email.. but
| not if it was ONLY a valid email).
|
| The problem with RegEx is its "obscurity".
| However, maybe someone could write a nice testing tool that would
| throw millions of known exploits into each regex it finds in your
| code to see if it is vulnerable.
| CyberDildonics wrote:
| Like what? I've never thought about what regex features are
| exponential.
| llbeansandrice wrote:
| From the same site:
| https://www.rexegg.com/regex-explosive-quantifiers.html
| throwanem wrote:
| It's more a question of which ones _can't_ be. There are
| some really nasty and not very obvious gotchas here;
| https://regular-expressions.mobi/catastrophic.html has a
| good dive into how, for example, backtracking combines with
| incautious regex design to produce exponential behavior in
| the length of input.
|
| I don't have a hard and fast rule of my own about regex
| complexity, but I do have a strong intuition over what's
| now ca. 25 years of working with regexes dating back to
| initial exposure in Perl 5 as a high schooler. That
| intuition boils down more or less to the idea that, when a
| regex grows too complex to comprehend at a glance, it's
| time to start thinking hard about replacing it with a
| proper parser, especially if it's operating over (as yet)
| imperfectly sanitized user input.
|
| Sure, it's maybe a little more work up front, at least
| until you get good at writing small fast parsers - which
| doesn't take long, in my experience at least; formal
| training might make it easier still, but I've rarely felt
| the lack. In exchange for that small investment, you gain
| reliability and maintainability benefits throughout the
| lifetime of the code. Much of that comes from the simple
| source of no longer having to re-comprehend the hairball of
| punctuation that is any complex regex, before being able to
| modify it at all - something at which I was actually really
| good, as recently as a decade or so ago.
| The expertise has since expired through disuse, and that's given
| me no cause for regret; the thing about being a regex expert is
| that it's a really good skill for writing unreadable and subtly
| dangerous code, and not a skill good for much of anything
| else. Unreadable and subtly dangerous code was fine when I
| was a kid doing my own solo projects for fun, where the
| worst that'd happen is I might have to hit ^C. As an
| engineer on a team of engineers building software for
| production, it's not even something I would _want_ to be
| good at doing.
| User23 wrote:
| > That intuition boils down more or less to the idea
| that, when a regex grows too complex to comprehend at a
| glance, it's time to start thinking hard about replacing
| it with a proper parser
|
| You can get some surprisingly complex yet readable
| regexes in Perl by using qr//x[1] and decomposing the
| pieces into smaller qr//s that are then interpolated into
| the final pattern, along with proper inline comments in
| the regexes themselves.
|
| [1] https://perldoc.perl.org/perlre#/x-and-/xx
| throwanem wrote:
| You still have to reason about the whole thing, though.
| This doesn't make that any easier, but I bet it makes it
| _feel_ easier.
| digitalsushi wrote:
| The greatest regex /skill/ is knowing that a regex cannot
| describe everything.
| locallost wrote:
| Very verbose writing for a very succinct regex.
| kogus wrote:
| This is a great trick. It says something about RegEx syntax that
| matching a simple rule with a relatively clear expression is a
| major accomplishment.
| nytgop77 wrote:
| Yup.
| Regex is not a silver bullet for "match stuff", and it is
| the wrong(ish) tool for the following jobs:
|
| - context sensitive matching
|
| - matching with multi-char-exclusions
|
| (regex is happy the most, when it's used to match "regular
| language" things)
| xrayarx wrote:
| Long Page with practical regex advice for programmers, most
| likely not useful for command line warriors
|
| Lookbehind
|
| Lookahead
|
| Advanced handling of tags
|
| Replace before matching
|
| the best regex trick ever:
|
| "Tarzan"|(Tarzan)
|
| The whole site contains useful regex advice
| jandrese wrote:
| The more general tip is that a single regex isn't the only tool
| you have. You don't have to get your final product in one step.
| Almost every "disaster" regex comes from someone trying to do too
| much at once.
|
| One other solution would have been to run the regex twice, once
| to pick up all instances of Tarzan, and a second on the results
| of the first to filter out all instances of "Tarzan".
| usrusr wrote:
| A big source of trying to do too much is environments that
| offer easy regex-based transformations defined as a pair of
| regex and a single replacement string (that may contain
| references to matching groups) and make other transformations
| hard ("while find + rest"). When you have the option to provide
| a "process match" closure instead of the replacement string the
| lure of putting too much into a single regex almost collapses.
| dang wrote:
| One past thread:
|
| _The Greatest Regex Trick Ever (2014)_ -
| https://news.ycombinator.com/item?id=10282121 - Sept 2015 (131
| comments)
| phl wrote:
| As the examples in the article use xml, I just wanted to point
| out that applying regex to xml has a lot of limitations and
| should be avoided. See:
| https://stackoverflow.com/questions/1732348/regex-match-open...
| rascul wrote:
| I was thinking about that great answer when I was reading the
| article. Thanks for sharing it.
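The "process match" closure usrusr mentions a few comments up can be sketched in Python, where `re.sub` accepts a function in place of the replacement string. This is only an illustration under assumptions: the function names and the sample sentence are made up, and the pattern is the one from the article.

```python
import re

# Quoted occurrences are matched first; only bare occurrences land in group 1.
PATTERN = re.compile(r'"Tarzan"|(Tarzan)')

def replace_unquoted(text, replacement):
    """Replace Tarzan with `replacement`, leaving "Tarzan" in quotes untouched."""
    def process_match(m):
        if m.group(1) is None:
            # Quoted occurrence: put the matched text back unchanged.
            return m.group(0)
        return replacement
    return PATTERN.sub(process_match, text)

print(replace_unquoted('"Tarzan" met Tarzan', "Jane"))
```

The decision about what to keep lives in ordinary code, so the regex itself can stay as simple as the two-alternative pattern.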
| ComputerGuru wrote:
| Very long build up to what is definitely a neat trick, although
| without SKIP FAIL, it might cause explosive growth in the memory
| usage as it allocates space for the results you don't need
| (unless you use a streaming regex option).
|
| Speaking of lengthy: this site breaks the iOS Safari scroll bar!
| It just disappears altogether (even when scrolling up or down to
| make it show, like you have to these days to please the UX
| designers in Palo Alto).
| toxik wrote:
| The scroll bar works but for some reason it gets rendered very
| bright. Scroll all the way up to the black background in the
| header and you'll see it.
| tus89 wrote:
| Clicking on a http:// link these days feels like I have been
| tricked into clicking on a phishing link in an email.
|
| Good trick though.
| ComputerGuru wrote:
| This is why any attempt to make plain http sites throw up
| scare warnings is a horrible idea. The internet is littered
| with old websites that contain a wealth of knowledge and
| deserve to remain accessible.
|
| Just make browsers go into "read only" mode where input cannot
| be accepted on non-secure pages. But don't wall them out!
| crazygringo wrote:
| > _"Tarzan"|(Tarzan)_
|
| OK that's pretty clever (I certainly never thought of putting a
| capturing group _inside_ only _one_ side of an "or")...
|
| ...but it doesn't seem particularly useful? It probably won't
| work in most cases where this is just part of a larger
| expression. You're usually using capturing groups in a particular
| way for a good reason, and this would mess that up.
|
| In contrast, the lookbehind+lookahead way is the "proper" and
| intuitive way to write it, and works as part of any larger
| expression.
|
| So... +100 points for cleverness, but don't actually _use_ this
| please. :)
| RheingoldRiver wrote:
| > In contrast, the lookbehind+lookahead way is the "proper" and
| intuitive way to write it, and works as part of any larger
| expression.
|
| I would say, the "proper" way is to have a separate line of
| code validating what's not there :)
| crazygringo wrote:
| I'm not following?
| diarrhea wrote:
| Not GP, but I'd go a very simple and verbose way, maybe
| that's what they meant too. Match:
|     (.)Tarzan(.)
|
| Then in an additional line of code assert
|     (Group 1 == Group 2) ≠ "
|
| This shifts the logic out of regex and into the surrounding
| programming language context. That's arguably better, but
| the resulting regex is extremely dull and unclever.
| pimlottc wrote:
| Don't forget to look out for matches at the boundaries of
| the original string. I think it should be something like:
|     (^|.)Tarzan(.|$)
|
| Though I'm not 100% sure offhand what the result in the
| capturing groups would be.
| RheingoldRiver wrote:
| Yeah, that's more or less what I meant. Write a regex
| (plus line of code) to make sure `Tarzan` appears. Then
| write another regex and line of code to make sure
| `"Tarzan"` doesn't appear.
|
| Maybe at this point you aren't using regex even. Nice,
| you solved two problems.
|
| (I do appreciate regex and even use them a lot. But, I
| use them enough to avoid them as much as possible.)
| crazygringo wrote:
| I mean, I guess if nobody on your team understands
| regexes.
|
| But generally, once you decide to use a regex in the
| first place, you might as well put as much regular
| everyday logic as you can in it. Otherwise you might as
| well look for "Tarzan" with a dumb string search.
|
| Lookbehinds and lookaheads aren't rocket science. And you
| can always leave a comment about what they're doing if
| you're worried other team members won't grok the syntax.
| kristopolous wrote:
| The ? syntax group has to be the most unmemorable of the bunch.
| I've used it maybe over 1,000 times or so and I still have to
| look up ?: or ?! or ?< or whatever else.
|
| I used to have a laminated sheet on my wall at an office because
| it was so terribly bad.
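For reference, here is a small sketch of the `(?...)` group forms mentioned above, run against a Tarzan sentence in Python. The sample text and the `starts` helper are made up for illustration; the group syntax itself is what Python's `re` module documents.

```python
import re

text = 'Tarzan, "Tarzan" and Tarzan!'

def starts(pattern):
    """Start offsets of every match of `pattern` in `text`."""
    return [m.start() for m in re.finditer(pattern, text)]

# (?:...)   non-capturing group: groups alternatives without creating a capture
no_captures = re.search(r'(?:"Tarzan"|Tarzan)', text).groups()

# (?=...)   positive lookahead: what follows must match, but is not consumed
before_bang = starts(r'Tarzan(?=!)')

# (?!...)   negative lookahead: what follows must not match
not_before_quote = starts(r'Tarzan(?!")')

# (?<=...)  positive lookbehind: what precedes must match
after_quote = starts(r'(?<=")Tarzan')

# (?<!...)  negative lookbehind: what precedes must not match
not_after_quote = starts(r'(?<!")Tarzan')

print(no_captures, before_bang, not_before_quote, after_quote, not_after_quote)
```

The mnemonic that seems to help: `=` asserts, `!` negates, and a `<` in front means "look behind" instead of ahead.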
| digitalsushi wrote:
| Let me take these PhD level regex down to elementary school
| awesome.
|
| I have a process table and I want to grep it for the phrase
| "banana":
|
|     ps auxww | grep banana
|
|     root 87 Jun21 0:26.78 /System/Library/CoreServices/FruitProcessor --core=banana
|
|     mikec 456 450PM 0:00.00 grep banana
|
| Argh! It also greps for the grep for banana! Annoying!
|
| Well, I'm sure there's pgrep or some clever thing, but my
| coworker showed me this and it took me a few minutes to realize
| how it works:
|
|     ps auxww | grep [b]anana
|
|     root 87 Jun21 0:26.78 /System/Library/CoreServices/FruitProcessor --core=banana
|
| Doc Brown spoke to me: "You're just not thinking fourth
| dimensionally!" Like Marty, I have a real problem with that. But
| don't you see: [b]anana matches banana but it doesn't match 'grep
| [b]anana' as a raw string. And so I get only the process I
| wanted!
| sandreas wrote:
| This is really clever... I usually ended up with adding
|
|     grep -v grep
|
| like in
|     ps auxww | grep banana | grep -v grep
| sigg3 wrote:
| _applause_
|
| Never thought of that. Nice.
| jackhalford wrote:
| but what's wrong with pgrep -f though? I don't want to search
| for clever trick every time I need to grep a process
| stonewareslord wrote:
| This almost always works, but it won't if the shell expands
| your bracketed letter. See for example:
|     $ echo [b]anana
|     [b]anana
|     $ touch banana
|     $ echo [b]anana
|     banana
|
| You can escape the bracket and it will work:
|     $ echo \[b]anana
|     [b]anana
| nick__m wrote:
| I use pgrep -laf the-wanted-string
| https://man7.org/linux/man-pages/man1/pgrep.1.html
|
| But nice regex though
|
| Edit : someone already posted that solution
| https://news.ycombinator.com/item?id=27777901
| Sniffnoy wrote:
| I dunno, the "logic" solution seems like the obvious one to me;
| if your boss really has that much trouble with propositional
| logic that they don't immediately see why it works, well, that's
| what code comments are for.
|
| (...the trick is still cool, though; I can imagine other
| situations where it would be more useful. However it does seem
| like it potentially depends on the particular regex engine being
| used, in contrast to the author's claim about it being totally
| portable; yes, it'll compile on anything, but will it _work_?)
| knodi123 wrote:
| PCRE is a pretty well-defined standard, isn't it? And it's the
| one used by most of the languages I've worked with, including
| in MariaDB.
| ComputerGuru wrote:
| It doesn't even rely on PCRE, just core regex.
| recursive wrote:
| How could it not work? I've regularly relied on order of
| matching, and never found an environment that didn't test
| left-to-right for the `|` operator in regex.
| bear8642 wrote:
| > operator in regex.
|
| regex is not regular expressions - if using NFA to match then
| you're matching all alternates simultaneously.
|
| Russ Cox has good pictures explaining idea in 'Regular
| Expression Search Algorithms' section of
| <https://swtch.com/~rsc/regexp/regexp1.html>
| recursive wrote:
| I'm talking about regex. Regex libraries in practical use
| do not use NFA. I'm talking about actual code that's
| written using normal languages. I'm familiar with the
| difference between "regular expressions" as in "regular
| languages".
| burntsushi wrote:
| Go's regexp package, Rust's regex crate and RE2 are
| examples of regex engines that are very much in practical
| use that use NFAs (among other things).
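A small Python sketch of what is being argued about here. Python's `re` is a backtracking engine with ordered alternation, so the first two lines show the order dependence recursive relies on. The last two suggest, as far as I can tell, that the Tarzan pattern does not actually need it: the quoted alternative starts one character earlier than the bare one, and a matcher that reports the leftmost match will pick it either way. Only Python is exercised here; nothing is claimed about how other engines behave beyond that reasoning.

```python
import re

# Ordered alternation: the first alternative that yields a match wins.
first_wins = re.match(r'a|ab', 'ab').group()
reordered = re.match(r'ab|a', 'ab').group()

haystack = '"Tarzan" Tarzan'

# The pattern from the article; findall returns group 1 for every match,
# with an empty string where the group did not participate.
trick = re.findall(r'"Tarzan"|(Tarzan)', haystack)

# Same alternatives in the opposite order.
swapped = re.findall(r'(Tarzan)|"Tarzan"', haystack)

print(first_wins, reordered, trick, swapped)
```

Order would matter again if the unwanted context began at the same position as the wanted token, which is the situation the Lex example in the next comment illustrates.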
| ivegotnoaccount wrote:
| Lex/Flex, which I think we can agree is used by "actual
| code that's written using normal languages", uses DFAs,
| both inside rules and between rules, and they do not try
| '|' cases left to right (They probably could have if they
| wanted since there is a REJECT action that already forces
| them to store the list of all the rules/texts that were
| matched):
|
|     a|ab { cout << "matched ab" << std::endl; }
|     b    { cout << "matched b" << std::endl; }
|
| if provided with "ab", will match the first rule with
| "ab", and not the first with "a" then the second with
| "b".
| [deleted]
| praptak wrote:
| This trick may be thought of as a simplification of the
| systematic approach to parsing stuff, that is the lexer-parser
| division of responsibilities.
|
| The lexer uses regexes but only for splitting the input stream of
| characters into tokens. Identifiers, integers, operators,
| strings, keywords, opening brackets and whatnot - each type of
| token is defined by a regex. This part is hopefully deterministic
| and simple, although the lexer matches regexes for all kinds of
| tokens at once, which is why lexer generators are often used to
| generate lexers.
|
| The heavy lifting is done by the actual parser which tries to
| combine the tokens into something that makes sense from the point
| of the grammar.
|
| So in this trick the sub-regexes between |'s define the tokens
| (the lexer part) while the group mechanism selects the single
| token that we want to keep (a very very simple parser).
| xtracto wrote:
| This site reminded me of the times when I interviewed candidates.
| One of the interview problems was to write a function that would
| validate if a given string was a valid IPv4 address (a la
| 10.10.10.1).
|
| Some of the candidates started by saying: "I know! I'll use a
| Regular Expression", to which I replied: "Great!, now you have TWO
| problems!"
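For what it's worth, a sketch of that interview problem without any regex, in Python. The rules assumed here are not from the comment: exactly four dot-separated decimal octets, each in 0..255, and no leading zeros (so "01" is rejected). The function name is made up.

```python
def is_valid_ipv4(s: str) -> bool:
    """Check dotted-decimal IPv4 syntax such as 10.10.10.1."""
    parts = s.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Rejects empty parts, signs, whitespace and non-ASCII digits.
        if not (part.isascii() and part.isdigit()):
            return False
        # Assumption: leading zeros are not allowed.
        if len(part) > 1 and part[0] == "0":
            return False
        if int(part) > 255:
            return False
    return True

print(is_valid_ipv4("10.10.10.1"), is_valid_ipv4("256.1.1.1"))
```

The range check is the part that gets ugly in a regex, since "0 to 255" has to be spelled out digit pattern by digit pattern; in code it is one comparison.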
___________________________________________________________________
(page generated 2021-07-08 23:00 UTC)