[HN Gopher] A Regex Barometer
___________________________________________________________________

A Regex Barometer

Author : burntsushi
Score  : 82 points
Date   : 2023-07-05 13:48 UTC (9 hours ago)

| pionar wrote:
| Are there any surprises here? For me, it's how well the .NET
| regex lib does in compiled mode.

  | burntsushi wrote:
  | I agree. I did not expect .NET to do so well.
  |
  | Another surprise is perhaps the quadratic benchmark[1]. Even
  | for automata-oriented engines, it provokes a case that is
  | unavoidably O(m * n^2). Basically, the automata-oriented
  | engines (rust/regex, go/regexp and re2) all have worst case
  | O(m * n) time complexity for an individual search. But when
  | you move to "iterate over all matches in a haystack," that
  | worst case gets bumped up to O(m * n^2) for certain classes
  | of regexes.
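  |
  | To make that concrete, here is a contrived sketch of the
  | mechanism (my own illustration, not one of the actual rebar
  | benchmark inputs). With `a.*y|a` on a haystack of all `a`s
  | and no `y`, leftmost-first semantics force every search to
  | scan to the end of the haystack before the higher-priority
  | `a.*y` branch can be ruled out, yet each reported match is a
  | single byte, so the iterator advances one position per O(n)
  | scan:
  |
  |     use std::time::Instant;
  |
  |     // Requires the `regex` crate. Each `find` call does O(n)
  |     // work but yields a one-byte match, so iterating over
  |     // all matches is O(n^2): doubling n roughly quadruples
  |     // the elapsed time.
  |     fn main() {
  |         let re = regex::Regex::new(r"a.*y|a").unwrap();
  |         for n in [10_000usize, 20_000, 40_000] {
  |             let haystack = "a".repeat(n);
  |             let start = Instant::now();
  |             let count = re.find_iter(&haystack).count();
  |             println!("n = {n}: {count} matches in {:?}",
  |                      start.elapsed());
  |         }
  |     }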
  |
  | All regex engines in the barometer, except for Hyperscan,
  | suffer from this problem. That's not because of any special
  | secret in Hyperscan, but rather that it implements different
  | match semantics than every other engine.
  |
  | I'm otherwise finding it hard to say what is "surprising"
  | haha. Perhaps others from the outside can comment.
  |
  | [1]: https://github.com/BurntSushi/rebar#quadratic

| dsemi wrote:
| Would be interested to see how cl-ppcre compares, as I've often
| heard people praise its performance.

  | burntsushi wrote:
  | Me too! I ran out of steam adding engines. If anyone wants to
  | take a crack at this, instructions are here:
  | https://github.com/BurntSushi/rebar/blob/master/CONTRIBUTING...
  |
  | Basically, you "just" need to build a Lisp program that
  | accepts a simple format on stdin, and then repeatedly
  | executes the benchmark given. Then it just needs to print out
  | the time and result of each sample collected. Finally, you
  | need to tell rebar how to build and run the program.

| Joker_vD wrote:
| Not to be confused with Rebar3 [0], which is the _de facto_
| package manager and build tool for Erlang.
|
| [0] https://github.com/erlang/rebar3

| mg wrote:
| No value for PHP?

  | burntsushi wrote:
  | PHP uses PCRE2, which is included in this barometer. I
  | mentioned PHP (among others) here:
  | https://github.com/BurntSushi/rebar/blob/master/WANTED.md
  |
  | If someone has a compelling argument for why PHP specifically
  | (and probably others) should be included, then please open an
  | issue and make your case. :-)

    | KerrAvon wrote:
    | Notes on other WANTED stuff: NSRegularExpression uses ICU
    | regex underneath. However, I think Swift's new native regex
    | support is a from-scratch new engine; it pretty much has to
    | be, to work well with Swift strings and the nifty extended
    | regex literal builder mode.

| Aachen wrote:
| I don't understand: what are the columns? Is this ratio thing a
| factor of how much slower one library is than some other,
| unspecified thing? And why aren't all benchmarks run on all
| engines if the results are averaged anyway? Does the benchmark
| count column mean something different?

  | Modified3019 wrote:
  | Yeah, it definitely needs at least a "higher/lower is better"
  | comment for people who don't want to sink time and effort
  | into a random GitHub project trying to figure out what these
  | values are saying.

  | [deleted]

  | burntsushi wrote:
  | Does this paragraph above the summary table answer your
  | question?
  |
  | > The summary statistic used is the geometric mean of the
  | speed ratios for each regex engine across all benchmarks that
  | include it. The ratio within each benchmark is computed by
  | taking the median of all timing samples for an engine and
  | dividing it by the best median among the regex engines that
  | participated in the benchmark. For example, given two regex
  | engines A and B with results 35 ns and 25 ns on a single
  | benchmark, A has a speed ratio of 1.4 and B has a speed ratio
  | of 1.0. The geometric mean reported here is then the
  | "average" speed ratio for that regex engine across all
  | benchmarks.
  |
  | So 1.0 represents the fastest possible ratio, where the
  | engine ranked first in all benchmarks.
  |
  | > And why aren't all benchmarks run on all engines
  |
  | Because they can't be. Some benchmarks, for example, test the
  | speed at which a regex engine reports capture group offsets.
  | But not all regex engines (like Hyperscan) support that
  | functionality. In other cases, it's because of various limits
  | that are hit, e.g., "regex is too big."
  |
  | If you look at the individual benchmarks, more information
  | about each can be seen by expanding the details. And if there
  | are engines missing, it should tell you why. Look at the
  | noseyparker benchmark for example[1]. It's absolutely brutal
  | and most engines just time out.
  |
  | [1]: https://github.com/BurntSushi/rebar#noseyparker

| Dowwie wrote:
| Could the barometer be run against the very first release of
| regex?

  | burntsushi wrote:
  | It can. I did it here:
  | https://github.com/BurntSushi/rebar/tree/ag/regexinit/record...
  |
  | And the PR: https://github.com/BurntSushi/rebar/pull/3
  |
  | Although I probably won't merge it. Overall results:
  |
  |     $ rebar rank record/curated/2023-07-05/*.csv -M compile --intersection
  |     Engine          Version  Geometric mean of speed ratios  Benchmark count
  |     ------          -------  ------------------------------  ---------------
  |     rust/regex      1.9.0    1.03                            34
  |     rust/regexold   1.7.3    3.12                            34
  |     rust/regexinit  1.0.0    4.35                            34
  |
  | You can explore individual benchmarks in either the README I
  | linked above, or by checking out that PR and running this:
  |
  |     rebar cmp record/curated/2023-07-05/*.csv -M compile --intersection
  |
  | (You'll need to build rebar first, of course.)

    | Dowwie wrote:
    | More than 2x improvement :). It started off fast and just
    | kept getting faster. Congrats.
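    |
    | (Working that out from the rank table: both engines cover
    | the same 34 benchmarks, so dividing the two geometric means
    | gives the average per-benchmark speedup. A minimal Rust
    | sketch of the ratio arithmetic described upthread; the
    | helper below is hand-rolled, not rebar's code:)
    |
    |     // Per-benchmark speed ratio, as defined upthread: an
    |     // engine's median timing divided by the best median
    |     // among all engines in that benchmark.
    |     fn ratio(median_ns: f64, best_ns: f64) -> f64 {
    |         median_ns / best_ns
    |     }
    |
    |     fn main() {
    |         // Worked example from upthread: A = 35 ns, B = 25 ns.
    |         println!("A = {}", ratio(35.0, 25.0)); // 1.4
    |         println!("B = {}", ratio(25.0, 25.0)); // 1.0
    |         // Average speedup implied by the table, 1.0.0 ->
    |         // 1.9.0:
    |         println!("{:.2}x", 4.35 / 1.03); // ~4.22x
    |     }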