[HN Gopher] A Regex Barometer
___________________________________________________________________

A Regex Barometer

Author : burntsushi
Score  : 82 points
Date   : 2023-07-05 13:48 UTC (9 hours ago)

| pionar wrote:
| Are there any surprises here? For me, it's how well the .NET
| regex lib does in compiled mode.

  | burntsushi wrote:
  | I agree. I did not expect .NET to do so well.
  |
  | Another surprise is perhaps the quadratic benchmark[1]. Even
  | for automata-oriented engines, it provokes a case that is
  | unavoidably O(m * n^2). Basically, the automata-oriented
  | engines (rust/regex, go/regexp and re2) all have worst case
  | O(m * n) time complexity for an individual search. But when
  | you move to "iterate over all matches in a haystack," that
  | worst case gets bumped up to O(m * n^2) for certain classes
  | of regexes.
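  |
  | To make that concrete, here is a contrived sketch of the
  | mechanism (my own illustration, not one of the actual rebar
  | benchmark inputs). With `a.*y|a` on a haystack of all `a`s
  | and no `y`, leftmost-first semantics force every search to
  | scan to the end of the haystack before the higher-priority
  | `a.*y` branch can be ruled out, yet each reported match is a
  | single byte, so the iterator advances one position per O(n)
  | scan:
  |
  |     use std::time::Instant;
  |
  |     // Requires the `regex` crate. Each `find` call does O(n)
  |     // work but yields a one-byte match, so iterating over
  |     // all matches is O(n^2): doubling n roughly quadruples
  |     // the elapsed time.
  |     fn main() {
  |         let re = regex::Regex::new(r"a.*y|a").unwrap();
  |         for n in [10_000usize, 20_000, 40_000] {
  |             let haystack = "a".repeat(n);
  |             let start = Instant::now();
  |             let count = re.find_iter(&haystack).count();
  |             println!("n = {n}: {count} matches in {:?}",
  |                      start.elapsed());
  |         }
  |     }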
  |
  | All regex engines in the barometer, except for Hyperscan,
  | suffer from this problem. That's not because of any special
  | secret in Hyperscan, but rather that it implements different
  | match semantics than every other engine.
  |
  | I'm otherwise finding it hard to say what is "surprising"
  | haha. Perhaps others from the outside can comment.
  |
  | [1]: https://github.com/BurntSushi/rebar#quadratic

| dsemi wrote:
| Would be interested to see how cl-ppcre compares, as I've often
| heard people praise its performance.

  | burntsushi wrote:
  | Me too! I ran out of steam adding engines. If anyone wants to
  | take a crack at this, instructions are here:
  | https://github.com/BurntSushi/rebar/blob/master/CONTRIBUTING...
  |
  | Basically, you "just" need to build a Lisp program that
  | accepts a simple format on stdin, and then repeatedly
  | executes the benchmark given. Then it just needs to print out
  | the time and result of each sample collected. Finally, you
  | need to tell rebar how to build and run the program.

| Joker_vD wrote:
| Not to be confused with Rebar3 [0], which is the _de facto_
| package manager and build tool for Erlang.
|
| [0] https://github.com/erlang/rebar3

| mg wrote:
| No value for PHP?

  | burntsushi wrote:
  | PHP uses PCRE2, which is included in this barometer. I
  | mentioned PHP (among others) here:
  | https://github.com/BurntSushi/rebar/blob/master/WANTED.md
  |
  | If someone has a compelling argument for why PHP specifically
  | (and probably others) should be included, then please open an
  | issue and make your case. :-)

    | KerrAvon wrote:
    | Notes on other WANTED stuff: NSRegularExpression uses ICU
    | regex underneath. However, I think Swift's new native regex
    | support is a from-scratch new engine; it pretty much has to
    | be, to work well with Swift strings and the nifty extended
    | regex literal builder mode.

| Aachen wrote:
| I don't understand: what are the columns? Is this ratio thing a
| factor of how much slower one library is than some other,
| unspecified thing? And why aren't all benchmarks run on all
| engines if the results are averaged anyway? Does the benchmark
| count column mean something different?

  | Modified3019 wrote:
  | Yeah, it definitely needs at least a "higher/lower is better"
  | comment for people who don't want to sink time and effort
  | into a random GitHub project trying to figure out what these
  | values are saying.

  | [deleted]

  | burntsushi wrote:
  | Does this paragraph above the summary table answer your
  | question?
  |
  | > The summary statistic used is the geometric mean of the
  | speed ratios for each regex engine across all benchmarks that
  | include it. The ratio within each benchmark is computed by
  | taking the median of all timing samples for an engine and
  | dividing it by the best median among the regex engines that
  | participated in the benchmark. For example, given two regex
  | engines A and B with results 35 ns and 25 ns on a single
  | benchmark, A has a speed ratio of 1.4 and B has a speed ratio
  | of 1.0. The geometric mean reported here is then the
  | "average" speed ratio for that regex engine across all
  | benchmarks.
  |
  | So 1.0 represents the fastest possible ratio, where the
  | engine ranked first in all benchmarks.
  |
  | > And why aren't all benchmarks run on all engines
  |
  | Because they can't be. Some benchmarks, for example, test the
  | speed at which a regex engine reports capture group offsets.
  | But not all regex engines (like Hyperscan) support that
  | functionality. In other cases, it's because of various limits
  | that are hit, e.g., "regex is too big."
  |
  | If you look at the individual benchmarks, more information
  | about each can be seen by expanding the details. And if there
  | are engines missing, it should tell you why. Look at the
  | noseyparker benchmark for example[1]. It's absolutely brutal
  | and most engines just time out.
  |
  | [1]: https://github.com/BurntSushi/rebar#noseyparker

| Dowwie wrote:
| Could the barometer be run against the very first release of
| regex?

  | burntsushi wrote:
  | It can. I did it here:
  | https://github.com/BurntSushi/rebar/tree/ag/regexinit/record...
  |
  | And the PR: https://github.com/BurntSushi/rebar/pull/3
  |
  | Although I probably won't merge it. Overall results:
  |
  |     $ rebar rank record/curated/2023-07-05/*.csv -M compile --intersection
  |     Engine          Version  Geometric mean of speed ratios  Benchmark count
  |     ------          -------  ------------------------------  ---------------
  |     rust/regex      1.9.0    1.03                            34
  |     rust/regexold   1.7.3    3.12                            34
  |     rust/regexinit  1.0.0    4.35                            34
  |
  | You can explore individual benchmarks in either the README I
  | linked above, or by checking out that PR and running this:
  |
  |     rebar cmp record/curated/2023-07-05/*.csv -M compile --intersection
  |
  | (You'll need to build rebar first, of course.)

    | Dowwie wrote:
    | More than 2x improvement :). It started off fast and just
    | kept getting faster. Congrats.
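    |
    | (Working that out from the rank table: both engines cover
    | the same 34 benchmarks, so dividing the two geometric means
    | gives the average per-benchmark speedup. A minimal Rust
    | sketch of the ratio arithmetic described upthread; the
    | helper below is hand-rolled, not rebar's code:)
    |
    |     // Per-benchmark speed ratio, as defined upthread: an
    |     // engine's median timing divided by the best median
    |     // among all engines in that benchmark.
    |     fn ratio(median_ns: f64, best_ns: f64) -> f64 {
    |         median_ns / best_ns
    |     }
    |
    |     fn main() {
    |         // Worked example from upthread: A = 35 ns, B = 25 ns.
    |         println!("A = {}", ratio(35.0, 25.0)); // 1.4
    |         println!("B = {}", ratio(25.0, 25.0)); // 1.0
    |         // Average speedup implied by the table, 1.0.0 ->
    |         // 1.9.0:
    |         println!("{:.2}x", 4.35 / 1.03); // ~4.22x
    |     }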