[HN Gopher] Ken Thompson's NFA regex patent (1971)
       ___________________________________________________________________
        
       Ken Thompson's NFA regex patent (1971)
        
       Author : jonstewart
       Score  : 62 points
       Date   : 2022-11-11 20:44 UTC (2 hours ago)
        
 (HTM) web link (patents.google.com)
 (TXT) w3m dump (patents.google.com)
        
       | joedrago wrote:
       | I read an article a million years ago which I thought described
       | this mechanism in a super friendly way:
       | 
       | https://perl.plover.com/Regex/article.html
       | 
       | It uses the idea of "putting a penny down" on the graph and
       | moving it around, and I found the conversational tone of the
       | description to be very easy to grok. Assuming this is the exact
       | same mechanism (which the author cites), it might resonate for
       | some of you too.
        
       | jonstewart wrote:
       | ken also wrote a paper about his algorithm, published in CACM in
       | 1968: https://dl.acm.org/doi/pdf/10.1145/363347.363387 (PDF)
        
       | HellsMaddy wrote:
       | https://en.wikipedia.org/wiki/Thompson's_construction
        
       | Cupertino95014 wrote:
       | Does anyone know if this patent was ever asserted by AT&T? (Of
       | course, it's long expired.)
        
       | ape4 wrote:
       | Do current regex implementations use this algorithm?
        
         | bla3 wrote:
         | Most don't, but re2 and a few others do. The ones that do use
         | it don't have exponential runtime on malicious inputs and lack
         | a few features (back references, mostly).
         | https://swtch.com/~rsc/regexp/ is a great resource on this.
        
           | iron2disulfide wrote:
           | > Most don't
           | 
           | Is this true? I'm pretty sure that most of the regex engines
           | I've used (grep, ripgrep, re2, hyperscan) use Thompson's
           | construction or at least some NFA-based algorithm (not
           | necessarily Thomopson's; Glushkov automata in particular are
           | used by hyperscan).
        
             | burntsushi wrote:
             | Yes, all of those are finite automata based. Although grep
             | usually uses the POSIX system regex library. GNU grep will
             | use its own finite automata based approach for a subset (a
             | large subset) of regexes.
             | 
             | But most regex engines are backtracking based. Perl, PCRE
             | (used by PHP and probably others), Oniguruma/Onigmo (used
             | by Ruby), Javascript, Java, Python. That covers a lot of
             | ground.
             | 
             | Plus, popular reference web sites like https://www.regular-
             | expressions.info/ are heavily biased toward regex engines
             | that have features typically only implemented by
             | backtracking engines.
        
               | masklinn wrote:
               | There's also postgres's which descends from spencer's
               | (Tcl Advanced Regular Expressions) and supports
               | backreferences but IME is very hard to catch into
               | catastrophic backtracking.
               | 
               | Possibly because you need to use the feature to trigger
               | the NFA mode and the baseline test case for catastrophic
               | backtracking doesn't, it just assumes the engine will
               | backtrack.
        
           | klodolph wrote:
           | As a side note, when converting the NFA to a DFA, you get an
           | exponential number of states, since each DFA state
           | corresponds to a set of NFA states.
        
             | burntsushi wrote:
             | True, and that's why libraries like RE2 don't build full
             | DFAs. They use lazy DFAs. DFA states are compiled as
             | needed. At most one state is built per byte of input, so
             | this sidesteps the exponential time bound.
             | 
             | In all implementations I know of, memory use is bounded.
             | When memory fills up, the cache of DFA states is cleared.
             | If this happens too many times, RE2 quits the lazy DFA and
             | falls back to a Pike VM. (Effectively the Thompson NFA
             | matcher, but with capture group support.)
        
       ___________________________________________________________________
       (page generated 2022-11-11 23:00 UTC)