[HN Gopher] Pne-more-re-nightmare - A fast regex compiler in Com...
___________________________________________________________________

Pne-more-re-nightmare - A fast regex compiler in Common Lisp

Author : rurban
Score  : 36 points
Date   : 2021-11-30 21:35 UTC (1 day ago)

(HTM) web link (applied-langua.ge)
(TXT) w3m dump (applied-langua.ge)

| ogogmad wrote:
| What about derivative-based regex?
|
| I got worried at some point that some of the purported advantages
| of derivative-based regexes might be drawbacks. One such
| advantage/drawback is that it adds the _complement_ operation into
| the language of regexes. But that gives you a way of deciding
| whether two regexes are equivalent, which is supposed to be
| PSPACE-complete, and that ought to severely hamper finding an
| efficient implementation.

| kazinator wrote:
| The complement operation doesn't provide a way of deciding whether
| two regexes are equivalent, nor is any such thing required to
| implement it. Using derivatives to interpret a regular expression
| that contains a complement operator does not require a function
| for calculating the equivalence of two regular expressions.
|
| Equivalence is required for compiling. The reason is that regular
| expressions have features like the Kleene closure which express
| iteration, and these give rise to endless derivatives: regular
| expressions whose derivative you can keep taking _ad nauseam_.
| This is fine for interpreting, because the algorithm terminates
| when you have consumed the input or hit an impasse. The compiler,
| however, has to close the loops and turn them into cycles in the
| state machine graph. To do that it has to check each derivative:
| "have I derived an expression which has occurred before?" If so,
| it just hooks back to the previous state.
|
| If you make the check literal, just comparing the regex syntax for
| identical structure, the graph will sometimes be larger than
| necessary, because it will contain redundant nodes that should be
| equivalent states. The smarter you make the equivalence function,
| the more compact the graphs will be.

| hayley-patton wrote:
| Author here - I don't think this is an issue.
|
| If you don't use submatching, you can just compute the minimal
| DFAs and compare them to test for equivalence, and you don't even
| need negation. Thus, introducing negation can't do any harm there.
|
| But as I generate Mealy machines for submatching, which can't be
| minimised, we need another approach. Knowing that A - B = A ∩ !B,
| one can produce a DFA for (A - B) ∪ (B - A), and if it has no
| accepting states, then A and B must be equivalent, as nothing can
| match one but not the other. But DFA construction is O(2^n), even
| if one takes the RE -> NFA -> DFA route, so this is nothing new
| complexity-wise.
|
| Such a decision procedure is never used in a derivative-based
| regex implementation, anyway. The equivalence of regular
| expressions only needs to be approximated, and the approximation
| consists of some structural rules (see Definition 4.1 of [1]).
| Relatively few rules are needed to guarantee that a _finite_
| machine can be generated, and only a dozen rules are required to
| get very close to a minimal DFA (Table 1, page 14 of [1]). So
| having the possibility of writing a precise regex equivalence
| function is completely irrelevant to producing an efficient regex
| engine.
|
| A decision procedure is handy, but you can write one for
| submatch-less regexes without negation; the complexity of such a
| procedure is typical for DFA construction; and the problem of
| deciding whether two regular expressions are equivalent is
| irrelevant.
|
| [1] Regular-expression derivatives reexamined,
| https://www.ccs.neu.edu/home/turon/re-deriv.pdf
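A minimal sketch of the interpretation-only case described above:
matching by repeatedly taking Brzozowski derivatives, with no
equivalence test anywhere. The s-expression regex representation and
every name below are invented for illustration; this is not code
from one-more-re-nightmare.

    ;; Regexes as s-expressions: :empty (matches nothing), :eps
    ;; (matches ""), a character, (seq R S), (alt R S), (star R).
    (defun nullablep (re)
      "Does RE match the empty string?"
      (cond ((eq re :empty) nil)
            ((eq re :eps) t)
            ((characterp re) nil)
            (t (ecase (first re)
                 (seq (and (nullablep (second re))
                           (nullablep (third re))))
                 (alt (or (nullablep (second re))
                          (nullablep (third re))))
                 (star t)))))

    (defun deriv (c re)
      "Brzozowski derivative of RE with respect to the character C."
      (cond ((eq re :empty) :empty)
            ((eq re :eps) :empty)
            ((characterp re) (if (char= c re) :eps :empty))
            (t (ecase (first re)
                 (seq (let ((r (second re)) (s (third re)))
                        (if (nullablep r)
                            `(alt (seq ,(deriv c r) ,s) ,(deriv c s))
                            `(seq ,(deriv c r) ,s))))
                 (alt `(alt ,(deriv c (second re))
                            ,(deriv c (third re))))
                 (star `(seq ,(deriv c (second re)) ,re))))))

    (defun matchesp (re string)
      "Interpret RE over STRING by taking successive derivatives."
      (nullablep (reduce (lambda (r c) (deriv c r)) string
                         :initial-value re)))

    ;; (matchesp '(seq #\a (star (alt #\b #\c))) "abcb") => T

A compiler would additionally memoise the derivatives it has already
seen (up to some approximation of equivalence) so that the resulting
state graph stays finite, which is the check discussed above.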
| froh wrote:
| As this is a regex->DFA implementation, I wonder how it compares
| to Google's RE2 [1].
|
| There are nice performance comparisons to glibc, Rust's regex and
| cl-ppcre, but RE2 was missing from that list.
|
| [1] https://github.com/google/re2

| hyperpape wrote:
| If the benchmarks are to be believed, I think it substantially
| outperforms RE2, because it also beats Hyperscan, and I think
| Hyperscan typically beats RE2.

| alexott wrote:
| RE2 has limited functionality. It works well for specific
| patterns, like crawling, but not as a general-purpose engine.

| reikonomusha wrote:
| This is demonstrative of one of the great superpowers of Lisp:
| that you can write mini native-code compilers--without knowing a
| lick of assembly. (But knowing assembly certainly helps you
| optimize the heck out of it.)
|
| More specifically, you can compile syntax (of your choosing) into
| performant, low-level Lisp code, which the Lisp compiler can then
| compile into efficient machine code. To do this, you need no extra
| tools, no intermediate files, no compiler hooks, no virtual
| machines, no byte codes, and no non-standard or non-portable
| features. It has been built into Lisp since at least the 1980s.
|
| Making miniature "JIT" compilers is also a technique a quantum
| simulator in Lisp [0a, 0b] uses to optimize quantum gate
| operations. It analyzes the quantum gate matrix and, on the fly,
| compiles it into optimized numerical Lisp code that is equivalent
| to that matrix acting on vectors of tensor product spaces (i.e.,
| super huge arrays of complex double floats). Using another
| language would require me to do arduous things like write assembly
| code or learn a library like LibJIT [1], which may not even
| interface well with my host language.
|
| Other simulators out there written in C use huge hand-unrolled
| loops with comments like
|
|     // THIS CODE WAS GENERATED, DO NOT TOUCH
|
| and/or in-line assembly code, and still can't optimize specific
| matrix shapes and structures, or do algebraic simplifications to
| eliminate work altogether.
|
| The regex library FTA is a great, and clean, example of a
| long-standing practice of compiling regexen, except it doesn't use
| any fancy VMs or any fancy JITs, just "when you see this regex,
| automatically turn it into this Common Lisp code, and let the Lisp
| compiler handle the rest."
|
| [0a] https://github.com/quil-lang/qvm
|
| [0b] COMPILE-OPERATOR:
| https://github.com/quil-lang/qvm/blob/master/src/compile-gat...
|
| [1] https://www.gnu.org/software/libjit/
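A tiny illustration of the general technique described in the parent
comment (not the approach one-more-re-nightmare takes, and all names
here are invented): generate specialised Lisp source for one
particular input, hand it to COMPILE, and the host compiler emits
native code with no extra tooling.

    ;; A toy "compiler": turn a literal search string into a
    ;; specialised, fully unrolled Lisp function, then let the host
    ;; Lisp compiler produce native code for it.
    (defun make-literal-matcher (pattern)
      "Return a compiled predicate: does a simple-string start with PATTERN?"
      (let ((len (length pattern)))
        (compile nil
                 `(lambda (s)
                    (declare (type simple-string s)
                             (optimize speed))
                    (and (>= (length s) ,len)
                         ;; one open-coded CHAR= per pattern character
                         ,@(loop for ch across pattern
                                 for i from 0
                                 collect `(char= (schar s ,i) ,ch)))))))

    ;; (funcall (make-literal-matcher "foo") "foobar") => T
    ;; (funcall (make-literal-matcher "foo") "fob")    => NIL

The "compiler" here is a few lines of backquote; all of the hard
optimisation work is delegated to the Lisp implementation.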
| martincmartin wrote:
| Perhaps most importantly, a good CL implementation can help with
| debugging this, by keeping track of the original CL code that
| generated the intermediate code, and pointing you to it. With the
| "use a program to generate C" approach, if there's some bug in the
| C code, the best the compiler or runtime can do is point you to
| the line of C code that's crashing. It's up to you to figure out
| where that came from.
|
| This was a problem for early C++ compilers like Cfront that
| translated C++ into C: the error messages from the C compiler
| referenced the intermediate C, not the original C++.

| gmfawcett wrote:
| How old is the #line directive, I wonder? It's been a long time
| since Cfront.
|
| https://stackoverflow.com/questions/2216540/whats-the-meanin...

| i_am_proteus wrote:
| I express my appreciation for the author's reference to King
| Crimson - the track "One More Red Nightmare" off of the record
| _Red_.

| irq wrote:
| I noticed this too and was hoping the author would mention it in
| his article, but couldn't find a reference to it :) This is a
| great album to listen to while doing technical work.

| hayley-patton wrote:
| See the first footnote: "The name is of course a reference to One
| More Red Nightmare."

| zeruch wrote:
| I'm sensing a subtle King Crimson reference here...

| phoe-krk wrote:
| This title has a typo - it should be _One-more-re-nightmare_.

| jwr wrote:
| This is a good example of why I roll my eyes at most programming
| language discussions. There is always someone who starts the
| "language X is SLOW" bandwagon, and people hop on.
|
| Languages are rarely "slow" or "fast" by themselves. Programs
| written in those languages can be slow or fast.

| regularfry wrote:
| Yes and no. Some language features can make it incredibly
| difficult to optimise. I know the JRuby folks had all sorts of fun
| trying to make method dispatch fast in the face of "well, this
| method definition can be changed at any time, arbitrarily often,
| including during execution of the method itself". I know CL must
| also have solved that problem, but I presume the toolkit is
| sufficiently different that the techniques just don't map.

| pfdietz wrote:
| A nice mechanism for optimizing such things was given at the last
| European Lisp Symposium.
|
| https://european-lisp-symposium.org/static/proceedings/2021....
|
| (the last paper, by Robert Strandh.)

| AllegedAlec wrote:
| I also feel that for 99% of applications, it _doesn't matter_.
| Like, yeah, there are a few areas where hyper-optimized code in
| some kind of near-bare-metal language is necessary. But most of
| the time, if you're writing code, the language itself isn't gonna
| be the deciding factor in whether something is performant enough
| for your application. Furthermore, by using a "fast" language,
| you're trading away your ability to quickly and cleanly implement
| human-readable code.

| simion314 wrote:
| > Languages are rarely "slow" or "fast" by themselves.
|
| For sure they are: language features will for sure limit the
| optimizations a compiler/interpreter can do, and a language with
| many levels of abstraction will have to pay the price in
| performance somewhere.
|
| Every time someone improves the benchmark performance of a slow
| language, that person is not a regular developer but someone who
| has a deeper understanding of what happens under the hood, knows
| how to use profiling tools, maybe knows assembly, and most of the
| time has to use tricks or advanced patterns.
|
| Btw, optimizing a hot path in a slow language with clever tricks
| or advanced patterns is fine, it happens all the time; you can't
| just switch the project's language with one click.

| the-alt-one wrote:
| > language features will for sure limit the optimizations a
| > compiler/interpreter can do
|
| For example:
|
| Common Lisp has a full numerical tower, meaning it will
| automatically convert fixed-size numbers into bignums (and more
| cool stuff). This means that an addition has to branch if the
| compiler can't prove that a result won't overflow from fixnum to
| bignum. Yes, C also needs to use bignums, but they're on an
| opt-in basis. In CL they're opt-out.
|
| There are more examples of this, often related to the dynamism of
| your language. The more introspection a language is capable of,
| the less a compiler designer can do to cheat and optimize your
| program.
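A small sketch of that trade-off (assuming SBCL-style behaviour; the
function names are invented and this is not from the article): with
no declarations, + keeps the generic, branching code path so it can
promote to a bignum, while declarations let the compiler open-code
word-sized arithmetic at the cost of the programmer promising there
is no overflow.

    ;; Generic addition: + must be ready to overflow into a bignum,
    ;; so the compiler keeps the fully generic code path.
    (defun sum-generic (xs)
      (reduce #'+ xs))

    ;; Declared version: the programmer promises everything stays a
    ;; fixnum, which is the "opt-out" the parent comment mentions,
    ;; so the compiler can emit plain machine-word additions.
    (defun sum-declared (xs)
      (declare (type (simple-array fixnum (*)) xs)
               (optimize speed))
      (let ((acc 0))
        (declare (type fixnum acc))
        (loop for x across xs
              do (setf acc (the fixnum (+ acc x))))
        acc))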
___________________________________________________________________
(page generated 2021-12-01 23:01 UTC)