[HN Gopher] ReDoS: Regular Expression Denial of Service
___________________________________________________________________

ReDoS: Regular Expression Denial of Service

Author : davisjam314
Score  : 51 points
Date   : 2020-06-29 12:29 UTC (10 hours ago)

(HTM) web link (levelup.gitconnected.com)
(TXT) w3m dump (levelup.gitconnected.com)

| davisjam314 wrote:
| This is a "technical two-pager" of my doctoral research about
| regex denial of service.

|   | ifoundthetao wrote:
|   | Edit: Never mind, clicked on the article, lots of sources!
|   |
|   | Is your paper available for reading? I'm very interested in it.

| cbarrick wrote:
| The paper _Regular Expression Matching Can Be Simple And Fast_
| [1] has a time-complexity plot that clearly shows how
| backtracking engines are more exploitable than finite-automata
| engines.
|
| [1]: https://swtch.com/~rsc/regexp/regexp1.html

| snvzz wrote:
| This isn't a cheat sheet. I expected a PDF with one or two pages
| of very condensed key information, not a long web article.

|   | dang wrote:
|   | Ok, we've de-cheat-sheeted the title above.

| [deleted]

| hyperpape wrote:
| In one place, you say that programmers should pre-validate that
| inputs are not long. I would suggest quantifying that somehow. I
| suspect some developers will not realize that a 30-character
| input is "long".

| tyingq wrote:
| _"Your regex engine uses a slow regex matching algorithm. This
| is the case for the regex engines built into most programming
| language runtimes -- this includes Java, JavaScript (various
| flavors) and Node.js, C#/.NET, Ruby, Python, Perl, and PHP. On
| the safe side, Go and Rust both use fast regex engines, and
| Google's RE2 library is fast as well."_
|
| I don't feel like "slow" and "fast" are the right adjectives
| here. Perhaps some form of "backtracking-capable" instead of
| "slow".
|
| Edit: For example, the JIT version of PCRE2 is often faster than
| Rust's regex engine, but it still supports backtracking and can
| be DoS-ed.
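[Editor's note: the backtracking behavior discussed above can be seen directly. This is a minimal sketch using CPython's built-in `re` module (a backtracking engine); the pattern `^(a+)+$` and the helper `time_match` are illustrative choices, not from the article.]

```python
import re
import time

# Classic ReDoS-prone pattern: the nested quantifiers let the
# backtracking engine try exponentially many ways of splitting the
# run of 'a's between the inner and outer '+' before giving up.
EVIL = re.compile(r'^(a+)+$')

def time_match(n):
    """Time how long the engine takes to reject n 'a's plus a 'b'."""
    text = 'a' * n + 'b'  # the trailing 'b' forces a failed match
    start = time.perf_counter()
    assert EVIL.match(text) is None
    return time.perf_counter() - start

# Each extra 'a' roughly doubles the rejection time on a
# backtracking engine; a linear engine (RE2, Rust's regex crate)
# rejects the same input in time proportional to its length.
for n in (10, 14, 18):
    print(n, round(time_match(n), 4))
```

[Matching inputs succeed quickly; it is the *failing* inputs that trigger the exponential search, which is why attacker-controlled input is the danger.]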
|   | GuB-42 wrote:
|   | I don't know if it has a particular name, but it is a common
|   | problem: fast average case, slow worst case.
|   |
|   | A common example is the quicksort algorithm. It is often
|   | considered the fastest sorting algorithm in the general case,
|   | but it is still O(n^2) in the worst case. Statistically, that
|   | should never happen in practice; with cleverly ordered data,
|   | however, it can. That's why some libraries prefer a slightly
|   | slower algorithm (e.g. heapsort) with an O(n log n)
|   | guarantee, so that it can't be DoSed.
|   |
|   | Unbalanced binary trees can have the same problem: on random
|   | data they work fine, but improperly used they degenerate into
|   | much slower linked lists. Hash tables can also be exploited
|   | if they don't use a cryptographically secure hash function --
|   | and such functions are rarely used, because they can be slow
|   | enough to negate the advantage of using a hash table in the
|   | first place.

|   | davisjam314 wrote:
|   | Yes, those terms are definitely imprecise.
|   |
|   | My target audience for that blog post was "anyone vaguely
|   | technical, from programmers to managers". I therefore
|   | attempted to minimize the amount of technical detail and to
|   | focus on the core algorithmic concepts.

|   | cbarrick wrote:
|   | Maybe instead of "a slow regex matching algorithm" you could
|   | just say "a backtracking algorithm".
|   |
|   | And instead of "a fast regex engine" you could say "a linear
|   | algorithm".
|   |
|   | That way it's both more concise and more technically correct.

|   |   | labster wrote:
|   |   | You could have just used one sentence to explain that
|   |   | regular expressions in (list of languages) have a feature
|   |   | called backtracking which can be exploited to slow down
|   |   | and DoS a site. I mean, even that sentence, right there.
|   |   | Instead, you boil it down so the reader knows Go and Rust
|   |   | are good and other languages are bad.

| chubot wrote:
| I suggest using the term "regular language" in contrast to
| "regex" or "regular expression".
|
| https://news.ycombinator.com/item?id=23598078
|
| "Regular language" is an accurate mathematical term that's not
| "corrupted" by programming. I think you could also use
| "automata-based", but it's probably less clear and refers more
| to the implementation. [1]
|
| If you scroll down about 3 pages here, there is a table
| explaining regexes vs. regular languages:
|
| https://www.oilshell.org/share/05-31-pres.html
|
| It's the same stuff that's in RSC's articles, but those are too
| dense for a general programming audience (they're more for regex
| engine implementers). And the existence of the OP's article
| shows that people haven't gotten it :)
|
| You could also use the terms:
|
| - regex engine (Perl/Python/PCRE/Java/etc.)
|
| - regular language engine (libc, RE2, rust/regex)
|
| Or "hybrid engine" may be more accurate for some of those. That
| is, they may use either automata-based techniques or
| backtracking depending on the situation.
|
| -----
|
| BTW, the Oil shell has a new syntax, Eggex, for both regular
| languages and regexes that makes the difference clear. In other
| words, you should be able to see from the syntax whether a
| construct may cause exponential backtracking.
|
| https://www.oilshell.org/release/0.8.pre6/doc/eggex.html
|
| In other words, it's a static property that's easily detected
| (if you have a parser for the regex language itself, which is
| not hard at all in theory, but maybe it is a little in practice
| :) )
|
| -----
|
| [1] Aside: The terms NFA and DFA were also corrupted by Friedl
| in a popular O'Reilly book. I have this book and remember being
| super confused by it, since I read it long after I learned what
| a DFA and NFA were.
|
| https://swtch.com/~rsc/regexp/regexp1.html
|
| _What little text it devotes to implementation issues
| perpetuates the widespread belief that recursive backtracking
| is the only way to simulate an NFA.
| Friedl makes it clear that he neither understands nor respects
| the underlying theory._
|
| http://regex.info/blog/2006-09-15/248

|   | adrianmonk wrote:
|   | The "corrupted" term is _better_ here, because it describes
|   | what the article is about: how to avoid security issues with
|   | real-world software tools, some of which match true regular
|   | expressions and some of which match extended pseudo-"regular
|   | expressions".
|   |
|   | See where the article says, "Faster engines may not support
|   | advanced features, most notably backreferences and lookaround
|   | assertions."
|   |
|   | In a discussion about sucrose, fructose, aspartame, and
|   | stevia, you wouldn't tell someone to stop using the term
|   | "sweetener" and instead use the narrower and more precise
|   | term "sugar". If you did, you'd be advising them to use a
|   | word with the wrong meaning.
___________________________________________________________________
(page generated 2020-06-29 23:01 UTC)
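
[Editor's note: a small illustration of the "advanced features" quoted in the thread. Backreferences recognize non-regular languages, which is why linear-time engines such as RE2 and rust/regex do not support them. The pattern below is an illustrative example, not from the article; it uses CPython's backtracking `re` module, which does support backreferences.]

```python
import re

# The backreference \1 requires matching a second copy of whatever
# text group 1 captured. "A word, a space, then the same word again"
# is not a regular language for unbounded words, so no finite
# automaton (and hence no linear-time engine) can recognize it.
doubled_word = re.compile(r'\b(\w+) \1\b')

print(bool(doubled_word.search('it was was a typo')))  # -> True
print(bool(doubled_word.search('it was a typo')))      # -> False
```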