[HN Gopher] Pne-more-re-nightmare - A fast regex compiler in Com...
       ___________________________________________________________________
        
       Pne-more-re-nightmare - A fast regex compiler in Common Lisp
        
       Author : rurban
       Score  : 36 points
        Date   : 2021-11-30 21:35 UTC (1 day ago)
        
 (HTM) web link (applied-langua.ge)
 (TXT) w3m dump (applied-langua.ge)
        
       | ogogmad wrote:
       | What about derivative-based regex?
       | 
       | I got worried at some point that some of the purported advantages
       | of derivative-based regexes might be drawbacks. One such
       | advantage/drawback is that it adds the _complement_ operation
       | into the language of Regex. But that gives you a way of deciding
       | whether two regexes are equivalent, which is supposed to be
       | PSPACE-COMPLETE, which ought to severely hamper finding an
       | efficient implementation.
        
         | kazinator wrote:
         | The complement operation doesn't provide a way of deciding
         | whether two regexes are equivalent, nor is any such thing
         | required to implement it. Using derivatives to interpret a
         | regular expression that contains a complement operator does not
         | require a function for calculating equivalence of two regular
         | expressions.
         | 
         | Equivalence is required for compiling. The reason is that
         | regular expressions have features like the Kleene closure which
         | express iteration, and these give rise to endless derivatives:
          | regular expressions whose derivative you can keep taking _ad
          | nauseam_. This is fine for interpreting, because the algorithm
         | terminates when you have consumed the input or hit an impasse.
         | The compiler, however, has to close the loops and turn them
         | into cycles in the state machine graph. To do that it has to
         | check each derivative:  "have I derived an expression which has
         | occurred before?" If so, then just hook to the previous state.
         | 
         | If you make the check literal, just comparing the regex syntax
         | for identical structure, the graph will sometimes be larger
         | than necessary, because it will contain redundant nodes that
         | should be equivalent states. The smarter you make the
         | equivalence function, the more compact the graphs will be.
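          | A minimal Python sketch of the interpreting side (a
          | hypothetical tuple AST; none of this is the article's code):
          | matching takes one derivative per input character and finishes
          | with a nullability check, so no equivalence test ever runs.

```python
# Brzozowski-derivative regex matching over a tiny tuple AST.
# Illustrative representation only, not the article's code.

EMPTY = ('empty',)   # matches nothing
EPS = ('eps',)       # matches only the empty string

def char(c):  return ('chr', c)
def cat(a, b): return ('cat', a, b)
def alt(a, b): return ('alt', a, b)
def star(a):  return ('star', a)

def nullable(r):
    """True iff r matches the empty string."""
    tag = r[0]
    if tag == 'eps' or tag == 'star': return True
    if tag == 'cat': return nullable(r[1]) and nullable(r[2])
    if tag == 'alt': return nullable(r[1]) or nullable(r[2])
    return False  # 'empty' and 'chr'

def deriv(r, c):
    """Derivative of r with respect to character c."""
    tag = r[0]
    if tag in ('empty', 'eps'): return EMPTY
    if tag == 'chr': return EPS if r[1] == c else EMPTY
    if tag == 'alt': return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == 'star': return cat(deriv(r[1], c), r)
    # 'cat': if the head is nullable, the tail may also start matching
    d = cat(deriv(r[1], c), r[2])
    return alt(d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, s):
    # Interpreting terminates on its own: one derivative per character,
    # then a nullability check -- no equivalence function required.
    for c in s:
        r = deriv(r, c)
    return nullable(r)

ab_star = star(cat(char('a'), char('b')))  # (ab)*
```

          | A compiler would instead memoise each derivative it produces;
          | comparing these tuples with plain equality is exactly the
          | "literal check" described above, and smarter equivalences
          | shrink the resulting graph.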
        
         | hayley-patton wrote:
         | Author here - I don't think this is an issue.
         | 
         | If you don't use submatching, you can just compute the minimal
         | DFAs and compare them to test for equivalence, and you don't
         | even need negation. Thus, introducing negation can't do any
         | harm there.
         | 
          | But as I generate Mealy machines for submatching, which can't
          | be minimised, we need another approach. Knowing that A - B = A
          | ∩ !B, one can produce a DFA for (A - B) ∪ (B - A), and if
          | it has no accepting states, then A and B must be equivalent, as
          | nothing can match one but not both. But DFA construction is
          | O(2^n), even if one takes the RE → NFA → DFA route, so this is
          | nothing new complexity-wise.
         | 
         | Such a decision procedure is never used in a derivative-based
         | regex implementation, anyway. The equivalence of regular
         | expressions only needs to be approximated, and the
         | approximation consists of some structural rules (see Definition
         | 4.1 of [1]). Relatively few rules are needed to guarantee that
         | a _finite_ machine can be generated, and only a dozen rules are
         | required to get very close to a minimal DFA (table 1, page 14
         | of [1]). So having the possibility of writing a precise regex
         | equivalence function is completely irrelevant to producing an
         | efficient regex engine.
         | 
         | A decision procedure is handy, but you can write one for
         | submatch-less regexes without negation, the complexity of such
         | a procedure is typical for DFA construction, and the problem of
         | deciding if two regular expressions are equivalent is
         | irrelevant.
         | 
         | [1] Regular-expression derivatives reexamined
         | https://www.ccs.neu.edu/home/turon/re-deriv.pdf
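          | A small Python sketch of this approximation (my names, and only
          | a subset of the rules in the spirit of Definition 4.1 of [1]):
          | smart constructors normalise each derivative, so plain equality
          | on the normalised forms stands in for regex equivalence and the
          | state set stays finite.

```python
# "Approximated equivalence": smart constructors rewrite regexes with a
# few structural rules, so derivatives can be compared by plain equality
# while exploration stays finite. Illustrative code, names are mine.

EMPTY = ('empty',)
EPS = ('eps',)

def char(c): return ('chr', c)

def cat(a, b):
    if a == EMPTY or b == EMPTY: return EMPTY   # r . empty ~ empty
    if a == EPS: return b                       # eps . r ~ r
    if b == EPS: return a
    return ('cat', a, b)

def alt(a, b):
    if a == EMPTY: return b                     # empty | r ~ r
    if b == EMPTY: return a
    if a == b: return a                         # r | r ~ r
    return ('alt', a, b) if a < b else ('alt', b, a)  # r | s ~ s | r

def star(a):
    if a in (EMPTY, EPS): return EPS            # empty*, eps* ~ eps
    if a[0] == 'star': return a                 # (r*)* ~ r*
    return ('star', a)

def nullable(r):
    t = r[0]
    return (t in ('eps', 'star')
            or (t == 'cat' and nullable(r[1]) and nullable(r[2]))
            or (t == 'alt' and (nullable(r[1]) or nullable(r[2]))))

def deriv(r, c):
    t = r[0]
    if t in ('empty', 'eps'): return EMPTY
    if t == 'chr': return EPS if r[1] == c else EMPTY
    if t == 'alt': return alt(deriv(r[1], c), deriv(r[2], c))
    if t == 'star': return cat(deriv(r[1], c), r)
    d = cat(deriv(r[1], c), r[2])
    return alt(d, deriv(r[2], c)) if nullable(r[1]) else d

def dfa_states(r, alphabet):
    """Explore all derivatives; the rules above keep this set finite."""
    seen, work = {r}, [r]
    while work:
        s = work.pop()
        for c in alphabet:
            d = deriv(s, c)
            if d not in seen:
                seen.add(d)
                work.append(d)
    return seen
```

          | Under these rules (a|b)* collapses to a single state, and ab*
          | explores just three.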
        
       | froh wrote:
       | As this is a regex->DFA implementation I wonder how it compares
       | to Google RE2 [1]
       | 
       | there are nice performance comparisons to glibc, rust regex and
       | cl-ppcre, but re2 was missing from that list.
       | 
       | [1] https://github.com/google/re2
        
         | hyperpape wrote:
         | If the benchmarks are to be believed, I think it substantially
         | outperforms re2, because it also beats Hyperscan, and I think
         | Hyperscan typically beats re2.
        
         | alexott wrote:
          | re2 has limited functionality. It works well for specific
          | patterns, like crawling, but it's not general-purpose.
        
       | reikonomusha wrote:
       | This is demonstrative of one of the great superpowers of Lisp:
       | that you can write mini native code compilers--without knowing a
       | lick of assembly. (But knowing assembly certainly helps you
       | optimize the heck out of it.)
       | 
       | More specifically, you can compile syntax (of your choosing) into
       | performant, low-level Lisp code, which the Lisp compiler can then
       | compile into efficient machine code. To do this, you need no
       | extra tools, no intermediate files, no compiler hooks, no virtual
       | machines, no byte codes, and no non-standard or non-portable
        | features. This has been built into Lisp since at least the 1980s.
       | 
       | Making miniature "JIT" compilers is also a technique a quantum
       | simulator in Lisp [0a, 0b] uses to optimize quantum gate
       | operations. It analyzes the quantum gate matrix and, on the fly,
       | compiles it into optimized numerical Lisp code that is equivalent
       | to that matrix acting on vectors of tensor product spaces (i.e.,
       | super huge arrays of complex double floats). Using another
       | language would require me to do arduous things like write
       | assembly code or learn a library like LibJIT [1] which may not
       | even interface well with my host language.
       | 
        | Other simulators out there written in C use huge hand-unrolled
        | loops with comments like
        | 
        |     // THIS CODE WAS GENERATED, DO NOT TOUCH
       | 
       | and/or in-line assembly code, and still can't optimize specific
       | matrix shapes and structures, or do algebraic simplifications to
       | eliminate work altogether.
       | 
        | The regex library FTA is a great, and clean, example of a
        | long-standing practice of compiling regexen, except it doesn't
        | use any fancy VMs or JITs, just "when you see this regex,
       | automatically turn it into this Common Lisp code, and let the
       | Lisp compiler handle the rest."
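        | A rough Python analogue of that trick (the function and its API
        | are mine, not from qvm): inspect a matrix at runtime, emit
        | specialized source containing only its nonzero terms, and hand
        | that to the host compiler via compile/exec.

```python
# Runtime specialization by code generation: build source text for a
# matrix-vector product that skips structural zeros, then compile it.

def compile_apply(matrix):
    """Return a function v -> matrix @ v, specialized to this matrix."""
    n = len(matrix)
    lines = ["def apply(v):", "    return ["]
    for i in range(n):
        # Emit only the nonzero terms of row i -- an algebraic
        # simplification a generic matrix-vector loop cannot perform.
        terms = [f"{matrix[i][j]!r} * v[{j}]"
                 for j in range(n) if matrix[i][j] != 0]
        lines.append("        " + (" + ".join(terms) or "0") + ",")
    lines.append("    ]")
    namespace = {}
    exec(compile("\n".join(lines), "<generated>", "exec"), namespace)
    return namespace["apply"]

# A diagonal "gate" touches each component exactly once.
apply_z = compile_apply([[1, 0], [0, -1]])
```

        | A Lisp version would emit an S-expression and call COMPILE
        | instead of pasting strings together, which is what makes the
        | technique so clean there.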
       | 
       | [0a] https://github.com/quil-lang/qvm
       | 
       | [0b] COMPILE-OPERATOR: https://github.com/quil-
       | lang/qvm/blob/master/src/compile-gat...
       | 
       | [1] https://www.gnu.org/software/libjit/
        
         | martincmartin wrote:
         | Perhaps most importantly, a good CL implementation can help
         | with debugging this, by keeping track of the original CL code
         | that generated the intermediate code, and pointing you to it.
         | With the "use a program to generate C" approach, if there's
         | some bug in the C code, the best the compiler or runtime can do
         | is point you to the line of C code that's crashing. It's up to
         | you to figure out where that came from.
         | 
         | This was a problem for early C++ compilers like Cfront that
         | translated C++ into C: the error messages from the C compiler
         | referenced the intermediate C, not the original C++.
        
           | gmfawcett wrote:
           | How old is the #line directive, I wonder? It's been a long
           | time since Cfront.
           | 
           | https://stackoverflow.com/questions/2216540/whats-the-
           | meanin...
        
       | i_am_proteus wrote:
       | I express my appreciation for the author's reference to King
       | Crimson - the track "One More Red Nightmare" off of the record
       | _Red_.
        
         | irq wrote:
         | I noticed this too and was hoping the author would mention it
         | in his article but couldn't find a reference to it :) This is a
         | great album to listen to while doing technical work.
        
           | hayley-patton wrote:
           | See the first footnote: "The name is of course a reference to
           | One More Red Nightmare."
        
       | zeruch wrote:
       | I'm sensing a subtle King Crimson reference here...
        
       | phoe-krk wrote:
       | This title has a typo - should be _One-more-re-nightmare_.
        
       | jwr wrote:
       | This is a good example of why I roll my eyes at most programming
       | language discussions. There is always someone that starts the
       | "language X is SLOW" bandwagon, and people hop on.
       | 
       | Languages are rarely "slow" or "fast" by themselves. Programs
       | written in those languages can be slow or fast.
        
         | regularfry wrote:
         | Yes and no. Some language features can make it incredibly
         | difficult to optimise. I know the JRuby folks had all sorts of
         | fun trying to make method dispatch fast in the face of "well,
         | this method definition can be changed at any time, arbitrarily
         | often, including during execution of the method itself". I know
         | CL must also have solved that problem but I presume the toolkit
         | is sufficiently different that the techniques just don't map.
        
           | pfdietz wrote:
           | A nice mechanism for optimizing such things was given at the
           | last European Lisp Symposium.
           | 
           | https://european-lisp-
           | symposium.org/static/proceedings/2021....
           | 
           | (the last paper, by Robert Strandh.)
        
         | AllegedAlec wrote:
          | I also feel that for 99% of applications, it _doesn't matter_.
          | Like, yeah, there are a few areas where hyper-optimized code
         | in some kind of near-bare-metal language is necessary. But most
         | of the time, if you're writing code, the language itself isn't
         | gonna be the deciding factor if something is performant enough
          | for your application. Furthermore, by using a "fast" language,
          | you're trading away your ability to quickly and cleanly
          | implement code that's human-readable.
        
         | simion314 wrote:
         | >Languages are rarely "slow" or "fast" by themselves.
         | 
          | For sure they are: language features limit the optimizations a
          | compiler/interpreter can do, and a language with many levels of
          | abstraction will have to pay the price in performance somewhere.
         | 
          | Every time someone improves the benchmark performance of a slow
          | language, that person is not a regular developer, but someone
          | who has a deeper understanding of what happens under the hood,
          | knows how to use profiling tools, maybe knows assembly, and most
          | of the time has to use tricks or advanced patterns.
         | 
         | Btw, optimizing some slow language in a hot path with clever
         | tricks or some advanced patterns is fine, it happens all the
          | time; you can't just switch the project language with one click.
        
           | the-alt-one wrote:
           | >language features will for sure limit the optimizations a
           | compiler/interpreter can do
           | 
           | For example:
           | 
            | Common Lisp has a full numerical tower, meaning it will
            | automatically promote fixed-size numbers to bignums (and
           | more cool stuff). This means that an addition has to branch
           | if it can't prove that a result won't overflow from fixnum to
           | bignum. Yes, C also needs to use bignums, but they're on an
           | opt-in basis. In CL they're opt-out.
           | 
           | There are more examples of this, often related to the
           | dynamism of your language. The more introspection a language
           | is capable of, the less a compiler designer can do to cheat
           | and optimize your program.
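            | A toy Python sketch of the branch described above, with the
            | fixnum/bignum distinction made explicit (the 62-bit fixnum
            | range is an assumption, roughly SBCL's on x86-64; Python
            | ints are already arbitrary-precision, so this only
            | illustrates the check that compiled CL code must emit):

```python
# Simulate the overflow check a Lisp compiler emits for (+ a b):
# stay on the fast fixnum path while the result fits in a machine-word
# range, otherwise fall back to a (heap-allocated, in CL) bignum.

FIXNUM_MAX = 2**62 - 1   # assumed fixnum range, similar to SBCL x86-64
FIXNUM_MIN = -2**62

def fixnum_add(a, b):
    """Add two integers and report which representation the result needs."""
    r = a + b
    if FIXNUM_MIN <= r <= FIXNUM_MAX:
        return r, "fixnum"   # fast path: fits in a tagged machine word
    return r, "bignum"       # slow path: must allocate a bignum
```

            | Declaring types (e.g. with a fixnum declaration) is one way
            | CL programmers let the compiler delete this branch.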
        
       ___________________________________________________________________
       (page generated 2021-12-01 23:01 UTC)