[HN Gopher] ReDoS: Regular Expression Denial of Service
       ___________________________________________________________________
        
       ReDoS: Regular Expression Denial of Service
        
       Author : davisjam314
       Score  : 51 points
       Date   : 2020-06-29 12:29 UTC (10 hours ago)
        
 (HTM) web link (levelup.gitconnected.com)
 (TXT) w3m dump (levelup.gitconnected.com)
        
       | davisjam314 wrote:
       | This is a "technical two-pager" of my doctoral research about
       | regex denial of service.
        
         | ifoundthetao wrote:
         | Edit: Never mind, clicked on the article, lots of sources!
         | 
         | Is your paper available for reading? I'm very interested in it.
        
       | cbarrick wrote:
       | The paper _Regular Expression Matching Can Be Simple And Fast_
       | [1] has a time complexity plot that clearly shows how
       | backtracking engines are more exploitable than finite-automata
       | engines.
       | 
       | [1]: https://swtch.com/~rsc/regexp/regexp1.html
        
       | snvzz wrote:
       | This isn't a cheat-sheet. I expected a PDF with one or two pages
       | of very condensed key information, not a long web article.
        
         | dang wrote:
         | Ok, we've de-cheat-sheeted the title above.
        
         | [deleted]
        
       | hyperpape wrote:
       | In one place, you say that programmers should pre-validate that
       | inputs are not long. I would suggest quantifying that somehow. I
       | suspect some developers will not realize that a 30 character
       | input is "long".
        
       | tyingq wrote:
       | _" Your regex engine uses a slow regex matching algorithm. This
       | is the case for the regex engines built into most programming
       | language runtimes -- this includes Java, JavaScript (various
       | flavors) and Node.js, C#/.NET, Ruby, Python, Perl, and PHP. On
       | the safe side, Go and Rust both use fast regex engines, and
       | Google's RE2 library is fast as well."_
       | 
       | I don't feel like "slow" and "fast" are the right adjectives
       | here. Perhaps some form of "backtracking capable" instead of
       | "slow".
       | 
       | Edit: For example, the jit version of pcre2 is often faster than
       | Rust's re engine. But it still supports backtracking and can be
       | "dos-ed".
        
         | GuB-42 wrote:
         | I don't know if it has a particular name, but it is a common
         | problem: fast average case, slow worst case.
         | 
         | A common example is the quicksort algorithm. It is often
         | considered the fastest sorting algorithm in the general case
         | but it is still O(n2) in the worst case. According to
         | statistics, it should never happen in practice, however, with
         | cleverly ordered data, it can. That's why some libraries prefer
         | to use a slightly slower algorithm (ex: heap sort) but with a
         | O(n.log(n)) guarantee, so that it can't be DDoSed.
         | 
         | Unbalanced binary trees can have the same problem, on random
         | data, they work fine, but improperly used, they degenerate into
         | much slower linked lists. Hash tables can also be exploited if
         | they don't used a cryptographically secure hash function, and
         | these are rarely used because they can be so slow as to negate
         | the advantage of using a hash table in the first place.
        
         | davisjam314 wrote:
         | Yes, those terms are definitely imprecise.
         | 
         | My target audience for that blog post was "anyone vaguely
         | technical, from programmers to managers". I therefore attempted
         | to minimize the amount of technical detail and to focus on the
         | core algorithmic concepts.
        
           | cbarrick wrote:
           | Maybe instead of "a slow regex matching algorithm" you could
           | just say "a backtracking algorithm".
           | 
           | And instead of "a fast regex engine" you could say "a linear
           | algorithm".
           | 
           | That way it's both more concise and more technically correct.
        
           | labster wrote:
           | You could have just used one sentence to explain that regular
           | expressions in (list of languages) have a feature called
           | backtracking which can be exploited to slow down and DoS a
           | site. I mean, even that sentence, right there. Instead, you
           | boil it down so the reader knows Go and Rust are good and
           | other languages are bad.
        
         | chubot wrote:
         | I suggest using the term "regular language" in contrast to
         | "regex" or "regular expression".
         | 
         | https://news.ycombinator.com/item?id=23598078
         | 
         | "Regular language" is an accurate mathematical term that's not
         | "corrupted" by programming. I think you could also use
         | "automata-based" but it's probably less clear and refers more
         | to the implementation [1]
         | 
         | If you scroll down about 3 pages here, there is table
         | explaining regexes vs. regular languages:
         | 
         | https://www.oilshell.org/share/05-31-pres.html
         | 
         | It's the same stuff that's in RSC's articles but those are too
         | dense for a general programming audience (they're more for
         | regex engine implementers). And the existence of the OP's
         | article shows that people haven't gotten it :)
         | 
         | You could also use the terms:
         | 
         | - regex engine (Perl/Python/PCRE/Java/etc.)
         | 
         | - regular language engine (libc, RE2, rust/regex)
         | 
         | Or "hybrid engine" may be more accurate for some of those. That
         | is, they may use either automata-based techniques or
         | backtracking depending on the situation.
         | 
         | -----
         | 
         | BTW, the Oil shell has a new syntax Eggex for both regular
         | languages and regexes that makes the difference clear. In other
         | words, you should be able to see from the syntax if a construct
         | may cause exponential backtracking.
         | 
         | https://www.oilshell.org/release/0.8.pre6/doc/eggex.html
         | 
         | In other words, it's a static property that's easily detected
         | (if you have a parser for the regex language itself, which is
         | not hard at all in theory, but maybe it is a little in practice
         | :) )
         | 
         | -----
         | 
         | [1] Aside: The terms NFA and DFA were also corrupted by Friedl
         | in a popular O'Reilly book. I have this book and remember being
         | super confused by it, since I read it long after I learned what
         | a DFA and NFA were.
         | 
         | https://swtch.com/~rsc/regexp/regexp1.html
         | 
         |  _What little text it devotes to implementation issues
         | perpetuates the widespread belief that recursive backtracking
         | is the only way to simulate an NFA. Friedl makes it clear that
         | he neither understands nor respects the underlying theory._
         | 
         | http://regex.info/blog/2006-09-15/248
        
           | adrianmonk wrote:
           | The "corrupted" term is _better_ here because it describes
           | what the article is about: how to avoid security issues with
           | real-world software tools, some of which match true regular
           | expressions and some which match extended pseudo- "regular
           | expressions".
           | 
           | See where the article says, "Faster engines may not support
           | advanced features, most notably backreferences and lookaround
           | assertions."
           | 
           | In a discussion about sucrose, fructose, aspartame, and
           | stevia, you wouldn't tell someone to stop using the term
           | "sweetener" and instead use the narrower and more precise
           | term "sugar". If you did, you'd be advising them to use a
           | word with the wrong meaning.
        
       ___________________________________________________________________
       (page generated 2020-06-29 23:01 UTC)