[HN Gopher] Regex engine internals as a library
       ___________________________________________________________________
        
       Regex engine internals as a library
        
       Author : kretaceous
       Score  : 301 points
       Date   : 2023-07-05 13:36 UTC (9 hours ago)
        
 (HTM) web link (blog.burntsushi.net)
 (TXT) w3m dump (blog.burntsushi.net)
        
       | recov wrote:
       | Burntsushi is a machine.
        
         | agumonkey wrote:
         | lets him and borkdude meet
        
         | kzrdude wrote:
         | Deterministic and finite such, or what?
        
           | sd9 wrote:
           | Non-deterministic, deterministic, it matters not
        
       | ttul wrote:
       | Story time. At ActiveState, my colleague and I - both fresh out
       | of school - were tasked with building a regular expression
       | debugger for the Komodo editor. We hired the legendary Perl guru
       | Mark Jason Dominus to hack some hooks into Perl's regex engine
       | and then by exposing these hooks to the UI, we allowed users to
       | watch the execution of their regexes step by step.
       | 
       | These days, various web-based tools offer superior functionality.
       | But in 2001, Komodo's Rx Debugger was absolutely state of the art
       | and so much fun to work on.
        
         | mschaef wrote:
         | Any recommendations on specific web based tools for this?
        
           | imaginary_unit wrote:
           | not OP, but these are some I used in the past:
           | 
           | https://www.debuggex.com/
           | 
           | https://regex101.com/
        
         | Twirrim wrote:
         | I've found myself in the past in a situation where I needed an
         | offline regex debugger (dealing with air-gapped networks, so
         | the folks that needed to use the tool literally couldn't get at
         | any online site, nor would they be permitted to even think of
         | putting what they were working with in to an online one, no
         | matter how it's designed), and it seems almost all of the work
         | has gone in to online tools. The offline tools are scarce and
         | somewhat lacking in comparison to, say, https://regex101.com/.
        
           | nerdponx wrote:
           | RegexBuddy is great and runs fine on WINE.
        
           | aorth wrote:
           | Good question! I use regex101.com a few times a month and
           | I've thought about offline tools too. I just happened to see
           | this GNOME application called Wildcard recently:
           | 
           | https://flathub.org/apps/com.felipekinoshita.Wildcard
           | 
           | Haven't tried yet...
        
           | lesuorac wrote:
           | It seems like the site doesn't really need a server so people
           | have made offline versions.
           | 
           | So should be able to burn something like [1] to a disk.
           | 
           | [1]: https://github.com/ibaaj/Regex101.com-offline-
           | app/pull/1/fil...
        
       | jakobnissen wrote:
       | BioJulia has published Automa.jl, which is a pure-Julia regex
       | engine that can insert arbitrary Julia code at compile time.
       | 
       | Not to take away from Rusts regex which are far more advanced
       | than Automa, but I just take issue with that library being the
       | first time regex internals have been exposed as a library :)
        
         | burntsushi wrote:
         | These sound like two different things. PCRE2, for example, has
         | support for "callouts" which sound similar to what you're
         | saying? https://www.pcre.org/current/doc/html/pcre2callout.html
         | --- Other things like ragel and re2c have done similar things.
         | 
         | What I'm talking about in this blog is very different from
         | that. This is about taking the internals of the regex library
         | itself and turning them in a separately versioned library that
         | others can compose.
         | 
         | This makes _somewhat_ less sense for a backtracker because most
         | backtrackers just have one engine: the backtracker. But an
         | automata based library usually has many different engines that
         | can be composed in different ways. Still, backtrackers have
         | things they _could_ expose but don 't in practice, such as a
         | regex parser and an AST.
        
           | nsajko wrote:
           | You're both correct. In the context of Julia, where
           | compilation and run time are arbitrarily interleaved,
           | something like re2c can be a library and used at run time.
        
       | alpaca128 wrote:
       | I experimented with the regex-automata crate in the past and it's
       | the only one I found which can be used for text editors because
       | the direct access to the underlying DFA makes it compatible with
       | any text data structure. The usual regex library APIs expect the
       | input to be one contiguous string.
        
       | yazaddaruvala wrote:
       | For a problem at work, I needed to create a RegexSet of over 10
       | MM really long regexes. No engine out of the box could handle it.
       | Even Rust's RegexSet wasn't good enough by default.
       | 
       | However, trying to use and read through regex-automata and regex-
       | syntax (even in 2018) was such a very valuable and helpful
       | learning resource. I ended up modeling the project at work off
       | Lucene APIs, but it was all only after learning the basics from
       | the regex crates.
       | 
       | Best set of libs! Thanks sushi!
        
         | OJFord wrote:
         | I'm guessing the answer's 'no' because you didn't elaborate to
         | begin with, but on the off-chance - can you share any more
         | about what the problem/project was?
        
         | burntsushi wrote:
         | Yeah 10 million regexes is huuuuge. Aho-Corasick can barely
         | handle 10 million literals.
         | 
         | Future work is definitely trying to make the regex engine scale
         | better with more patterns. Currently, it will crap out well
         | before 10 million regexes, and I'm not sure that goal is
         | actually attainable. But it can certainly be better than what
         | it is today.
         | 
         | And of course, Hyperscan is really the gold standard here as
         | far as multi-pattern searching goes. I'm not sure how well it
         | will handle 10 million patterns though.
        
       | jjice wrote:
       | I've only given a quick read/skim, but this is extremely
       | impressive. BurntSushi has a lot of create stuff out there, but
       | the Rust regex crate is legendary. The fact that Rust has had
       | such a performant and easy to use regular expression library for
       | so long is a real treat to the community.
       | 
       | This article as a whole is also a treasure. There's a fairly
       | limited number of people who have written a ton about regular
       | expressions, but they all do so extremely well. Russ Cox's
       | article series (as mentioned in this one) is really great. I used
       | it in college to write a regular expression engine one summer
       | after I had started to fall in love with the perfect cross
       | between theory and practice that regular expressions are.
       | 
       | The changes for more in depth testing here are also very
       | interesting, especially for a crate that I'm sure if critical so
       | a lot of the ecosystem, and I appreciate the writeup on such a
       | deep dive topic.
       | 
       | Are regular expressions hard to read? Sometimes they can,
       | especially if you haven't gone out of your way to take deep
       | dives. Are they not perfect for a lot of parsing tasks? Sure,
       | there are definitely things people shouldn't use regexs for that
       | they do, like validating emails. At their core though, regular
       | expressions are one of the most power dense tools we have in
       | pretty much every language. The insane stuff you can get done
       | (sometimes in a way that would be frowned upon) in so little time
       | if just incredible.
       | 
       | I'd love for there to be more content on regular expressions as
       | well. Right now I'm only familiar with one book that really does
       | a great job (at a practical level), and that's Mastering Regular
       | Expressions by Jeffrey Friedl. On a theory level, a lot of
       | compiler books will talk about them to varying degrees. The
       | Dragon Book has some good content on regular expressions at an
       | implementation level too.
       | 
       | Does anyone have any other book recommendations for regular
       | expressions?
       | 
       | And if BurntSushi or any other regular expression engine
       | contributors see this, I really appreciate the effort y'all put
       | in to building such powerful software.
        
         | btown wrote:
         | https://www.cs.princeton.edu/courses/archive/fall19/cos226/l...
         | and https://kean.blog/post/lets-build-regex are excellent
         | introductions to implementing a (very) simplified regex engine:
         | construct a nondetermistic finite state automaton for the
         | regex, then perform a graph search on the resulting digraph; if
         | the vertex corresponding to your end state is reachable, you
         | have a match.
         | 
         | I think this exercise is valuable for anyone writing regexes to
         | not only understand that there's less magic than one might
         | think, but also to visualize a bunch of balls bouncing along an
         | NFA - that bug you inevitably hit in production due to
         | catastrophic backtracking now takes on a physical meaning!
         | 
         | Separately re: the OP, https://github.com/rust-
         | lang/regex/issues/822 (and specifically BurntSushi's comment at
         | the very end of the issue) adds really useful context to the
         | paragraph in the OP about niche APIs:
         | https://blog.burntsushi.net/regex-internals/#problem-request...
         | - searching with multiple regexes simultaneously against a text
         | is both incredibly complex and incredibly useful, and I can't
         | wait to see what the community comes up with for this pattern!
        
         | jcranmer wrote:
         | > Are regular expressions hard to read? Sometimes they can,
         | especially if you haven't gone out of your way to take deep
         | dives. Are they not perfect for a lot of parsing tasks? Sure,
         | there are definitely things people shouldn't use regexs for
         | that they do, like validating emails. At their core though,
         | regular expressions are one of the most power dense tools we
         | have in pretty much every language. The insane stuff you can
         | get done (sometimes in a way that would be frowned upon) in so
         | little time if just incredible.
         | 
         | The real crowning use case for regular expressions (at least in
         | terms of parsing-like tasks) is when you've got formats with
         | varied delimiters. So one format I was parsing this weekend was
         | basically header:field1,field2,field3"data"hash (with a fixed
         | number of fields). Or another format I work with is
         | suite~split/test1,test2@opt1:opt2^hw1^hw2#flags1#flags2 (most
         | of the elements of which are optional). Regular expressions
         | shine here; using primitives like split don't quite cut it in
         | these formats.
         | 
         | And I think this also is a major cause of why regular
         | expressions tend to become unreadable quickly. If you look at
         | parsing via regular expressions as a whole, there's basically
         | three things that are being jammed into one expression: what
         | are the delimiters between fields, what is valid in each field,
         | and which fields are optional. These are basically three
         | separate concerns, but most regex APIs are absolutely terrible
         | at letting you separate these concerns into separate steps (all
         | you can do is provide the string that combines all of them).
        
           | tialaramex wrote:
           | > header:field1,field2,field3"data"hash (with a fixed number
           | of fields).
           | 
           | In a language where regex is just right there, like Perl, I
           | agree that the natural way to parse this is always a regex,
           | although on the other hand in Perl the natural way to do
           | almost everything is a regex so maybe Perl was a bad example.
           | 
           | In a language with split_once (and a proper string reference
           | type) I actually rather like split_once here, splitting away
           | header, and then field1, and then field2, and then field3,
           | and then data, leaving just hash.
           | 
           | I guess this gets to your concern, by writing it with
           | split_once we're clear about a lot of your answers, each
           | delimiter is specified with what it's splitting, none of the
           | fields are optional (unless we write that) and (if we write
           | code to check) what is valid in each field.
        
             | jcranmer wrote:
             | Yeah, split_once is pretty handy, although chaining
             | together can get a little verbose. It would be nice to
             | write this:                 let (y,m,d,h,m,s) =
             | break_str!(s, year-month-day hour:minute:second)?;
             | 
             | instead of this:                 let (y, split) =
             | s.split_once('-')?;       let (m, split) =
             | split.split_once('-')?;       let (d, split) =
             | split.split_once(' ')?;       let (h, split) =
             | split.split_once(':')?;       let (m, s) =
             | split.split_once(':')?;
        
               | burntsushi wrote:
               | I'll use this as an opportunity to plug the regex crate's
               | new 'extract' API. :-)                   use
               | regex::Regex;              fn main() {
               | println!("{:?}", extract("1973-01-05 09:30:00"));
               | }              fn extract(haystack: &str) ->
               | Option<(&str, &str, &str, &str, &str, &str)> {
               | let re = Regex::new(
               | r"([0-9]{4})-([0-9]{2})-([0-9]{2})
               | ([0-9]{2}):([0-9]{2}):([0-9]{2})",
               | ).unwrap();             let (_, [y, m, d, h, min, s]) =
               | re.captures(haystack)?.extract();             Some((y, m,
               | d, h, min, s))         }
               | 
               | Output:                   Some(("1973", "01", "05", "09",
               | "30", "00"))
               | 
               | That gets you pretty close to what you want here.
               | 
               | (The regex matches more than what is a valid date/time of
               | course.)
        
               | Izkata wrote:
               | Python (even in 2):                 >>> import re
               | >>> pattern =
               | re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})
               | ([0-9]{2}):([0-9]{2}):([0-9]{2})")       >>> results =
               | pattern.match('1973-01-05 09:30:00')       >>>
               | results.groups()       ('1973', '01', '05', '09', '30',
               | '00')       >>> (y, m, d, h, min, s) = results.groups()
               | 
               | (`results` will be None if the regex didn't match)
               | 
               | Regexes are one of those things where, once you
               | understand it (and capture groups in particular) and it's
               | available in the language you're working in, string-
               | splitting usually doesn't feel right anymore.
        
               | burntsushi wrote:
               | I believe all of the people in this thread understand
               | regexes extremely well. :-)
               | 
               | There is a lot of reasonable room to disagree about when
               | and where regexes should be used.
        
           | kbenson wrote:
           | The thing that I think sometimes gets lost when people are
           | talking about the strengths and weaknesses of regular
           | expression use, is that it can matter quite a bit what the
           | language in question is like. If you're using a language that
           | compiles down to a low level or is expected to have a fairly
           | close match of expressions to specific code run (C, C++,
           | Java, etc) then the choice is between using a regular
           | expression library which often does an _good enough_ match of
           | what you want to accomplish in a performance manner that
           | rolling your own optimized parser is hard to justify.
           | 
           | When using an interpreted language like Perl or Python, it's
           | a slightly different story, since often there's no way to get
           | as good of performance out of a custom parser written in that
           | language as you can get from the optimized regular expression
           | library included (if you use it competently). The overhead in
           | the interpreted language can make some tasks impossible to do
           | as quickly as teh regular expression engine can. In a way, a
           | regular expression library is to parsing what Numpy is to
           | computation.
           | 
           | I remember a task I had to parse a bunch of _very simplistic_
           | XML in Perl a decade ago, where it consisted of the
           | equivalent of a bunch of records of dictionaries /hashes
           | (solr), and needed them as Perl hashes to work on them. I
           | surveyed every XML parsing module available to me in Perl,
           | and couldn't quite get the performance I needed, given the
           | file was a few GB and it needed to be done as quick as
           | possible multiple times a day. I implemented a simple outer
           | loop to grab a working chunk of the file, a regex to split
           | that chunk into individual records text chunks, and another
           | regex to return each key/value as an alternating list I could
           | assign directly to a hash. It was 5-6x faster than the faster
           | XML streaming solution I found that I could use as a Perl
           | module (which were all just wrapping X libs).
           | 
           | Could I have coded a custom solution in C that parsed this
           | specialized format faster? Undoubtedly, but I'd still be
           | marshaling data through the C/Perl interface, so would have
           | taken a hit to get the data easily accessible to the rest of
           | the codebase, which is less of an issue for regular
           | expressions in Perl since they're a native capability of the
           | language and deeply integrated. Using regular expressions as
           | a parsing toolkit yielded very good results at a fraction of
           | the time and effort, and honestly in my opinion that's what
           | regular expressions are all about.
        
         | chubot wrote:
         | What kind of content are you looking for? From a user POV, I
         | think there is thankfully not much to learn about automata-
         | based regexes like RE2 and rust/regex. The exception is if
         | you're writing a performance-sensitive system entirely around
         | regexes, like an intrusion detection system or search engine.
         | 
         | A regex is a very small concept with a very large syntax :)
         | It's "just" literals, alternation, repetition, and
         | concatenation. So it's mostly being able to see through the
         | syntax IMO.
         | 
         | Nice animated diagrams in this recent post -
         | https://news.ycombinator.com/item?id=35503089
         | 
         | (On the other hand, from the implementation POV, there is a
         | wealth of material online, but you already know where to find
         | that :) I still think someone should "cliff notes" the Cox RE2
         | articles, since the ideas are important but there is a lot of
         | detail. There are also dozens of references in those articles
         | for people to read; it was a comprehensive survey at the time.)
         | 
         | ---
         | 
         | BTW Friedl's book covers backtracking engines, which makes
         | sense because PCRE, Perl itself, Python, Java, JavaScript, etc.
         | all use that style
         | 
         | But as far as I remember, most of the book is irrelevant for
         | users of automata-based engines. It's a whole bunch of detail
         | about optimizing backtracking, reordering your clauses, and
         | lots of detail about engine-specific and even version-specific
         | differences.
         | 
         | Personally I just ignore all of that, and it doesn't cause me a
         | problem. I use regexes all the time in every language, and
         | stick to the features that an automata-based engine can handle.
         | 
         | Once you go into the backtracking stuff, then you might want
         | the Friedl book, and that's a huge waste of time IMO. It's way
         | easier to do that kind of thing outside of a regex engine, with
         | normal parsing techniques.
         | 
         | And honestly once you go outside, you'll find you don't even
         | need to backtrack. Perl-style regexes can be an astoundingly
         | inefficient way to write a pattern recognition algorithm.
         | 
         | ---
         | 
         | Eggex is my attempt to separate the two kinds of things, since
         | it's not obvious based on the syntax:
         | https://www.oilshell.org/release/latest/doc/eggex.html
         | 
         | It makes the syntax smaller, more familiar, and more
         | composable. It's closer to the classic lex tool (e.g. GNU
         | flex). It's funny that lex has a way better syntax for regular
         | expressions for Perl, but it's used probably 10,000x less often
         | because most people want an interpreter and not a compiler to C
         | code. But lex has 90%-100% of what you need, and it's way
         | simpler, with a nicer syntax.
         | 
         | Literals are quoted so you don't have the literal/metachar
         | confusion, and you can reuse patterns by name. The manual is
         | worth skimming:
         | 
         | https://westes.github.io/flex/manual/Simple-Examples.html#Si...
         | 
         | I use re2c though, not flex, and its manual is also worth
         | skimming: https://re2c.org/manual/manual_c.html
        
         | amelius wrote:
         | Does this RegEx library utilize a JIT, just like the ones in
         | most JavaScript implementations?
         | 
         | If not, then perhaps this is a case where JavaScript beats
         | Rust.
        
           | burntsushi wrote:
           | Nope. Usually only backtrackers have a JIT (not strictly
           | true, I believe there is a JIT of a Thompson NFA and there is
           | also redgrep). This library is not a backtracker.
           | 
           | I would be interested in experimenting with a JIT in the
           | future, primarily as a way to speed up capture group
           | extraction. But this regex library uses a lazy DFA, which has
           | excellent theoughput characteristics compared to the normal
           | backtracking interpreter (and perhaps even a backtracking
           | JIT).
           | 
           | > If not, then perhaps this is a case where JavaScript beats
           | Rust.
           | 
           | Haha. No. Follow the link to rebar to see performance
           | comparisons. A JIT isn't everything.
           | https://github.com/BurntSushi/rebar
        
       | BoppreH wrote:
       | Can this be used for non-string lists? I always found it
       | frustrating that we have this amazing machinery for searching and
       | modifying lists of _characters_ , but as soon as I have a list of
       | _numbers_ or _dates_ it all goes out of the window.
       | 
       | Let's say I have a list of login attempt dates, and I want all
       | sequences of 5+ failed login attempts followed by a success. With
       | a regex it's trivial, but I can't use that; I have to roll my own
       | loop, flags, and temporary lists.
       | 
       | Sure, I could convert my list to a string, process it, and (try
       | to) convert it back, but the downsides are obvious.
       | 
       | Even if the performance is not as good as string-based regex, why
       | shouldn't we have regexes for arbitrary list types?
       | 
       | ---
       | 
       | Edit: I realized this is not the first time I had this idea, and
       | found an old Python prototype:
       | https://github.com/boppreh/listregex
       | 
       | It's horribly slow, but as an API experiment I'm quite happy. It
       | also gives some tools not available to regexes, like inverting
       | and intersect patterns, and matched pairs.
        
         | saagarjha wrote:
         | You're going to have to define some sort of API for matching on
         | a sliding window of values. This isn't impossible, but most
         | languages don't really have great interfaces for it.
        
           | BoppreH wrote:
           | Parser combinators[1] seem perfectly suited for this.
           | # A sequence of 5 or more failed login attempts followed by a
           | successful one.          obj_regex.search(repeat(is_failed,
           | min=5)+is_success, my_list)
           | 
           | [1]: https://en.wikipedia.org/wiki/Parser_combinator
        
         | kccqzy wrote:
         | The C++ standard library std::basic_regex attempts to do this
         | by exposing a templated class over a custom character type.
         | https://en.cppreference.com/w/cpp/regex/basic_regex You can
         | provide your own trait class to define things for your custom
         | "characters".
         | 
         | Unfortunately performance goes out of the window. And it's
         | likely to work as well as storing arbitrary non-character
         | things into a custom std::basic_string.
        
         | burntsushi wrote:
         | Nope. The regex library is tightly coupled with searching
         | strings. That's an intentional design decision. Making a regex
         | engine like this have a generic alphabet is a total non-
         | starter. It's impractically difficult for me to do. Especially
         | the API design and doing it in a way that doesn't impact the
         | perf of the primary use case.
         | 
         | The kind of regex engine you want, especially one where you
         | don't care about perf, is not hard to build. You could take the
         | `regex-lite` crate I published, for example, and code it up to
         | be as generic as you want. In so doing, I expect you'll run
         | into a number of interesting challenges.
         | 
         | Anyway, it's not like these things don't exist. They do. People
         | attempt to build them[1]. They just usually don't gain much
         | traction because I suspect you're overstating their general
         | utility. :-)
         | 
         | [1]:
         | https://docs.rs/automata/latest/automata/trait.Alphabet.html
        
           | schuyler2d wrote:
           | That said it would be a really cool feature to have some of
           | the more complicated but common regexes (urls, emails, dates)
           | be some kind of "template" in the syntax. Eg what if I could
           | do '(?D<h=22>)' or something like that to search for all
           | dates in any format that had some version of 8pm represented.
           | 
           | Anyway, super appreciate your work and your writing!
        
             | burntsushi wrote:
             | It's an interesting idea, but I think it would be worth
             | trying to actually prototype it as its own library first
             | and see if it catches on. My suspicion is that it probably
             | won't work as nicely as you think it will unfortunately.
        
           | IshKebab wrote:
           | I've also been looking for the same thing for something like
           | SystemVerilog Assertions (SVA) which is very regex-like, so
           | it isn't just him!
        
             | burntsushi wrote:
             | Indeed. I've been responding to questions like this for a
             | long time. :-) https://old.reddit.com/r/rust/comments/4a8pb
             | v/the_regex_crat...
             | 
             | Someone should go out and built it. It won't be me though,
             | that's for sure. Believe it or not, such a library wouldn't
             | leverage much of my experience. The vast majority of the
             | complexity inside the regex crate is about optimizing for
             | sequences of bytes.
        
               | IshKebab wrote:
               | Yeah I am slowly working on it. I think even custom
               | alphabets wouldn't work for me. I'm trying to build
               | something so this works:                       let
               | is_even = |x: &i32| *x % 2 == 0;             let is_odd =
               | |x: &i32| *x % 2 == 0;             let pattern =
               | sequence(&[                 is_even,
               | is_odd,                 repeat(or(is_even, is_odd), 2,
               | 4),             ]);             let mut matcher =
               | Matcher::compile(pattern);             let elements = [4,
               | 1, 23, 5, 6];             for element in elements {
               | matcher.step(element);             }
               | assert!(matcher.matches());
               | 
               | I'll get there one day.
               | 
               | Amazing work on the Regex crate by the way!
        
           | nerdponx wrote:
           | Their utility is overstated until the one time when you
           | actually need/want it.
           | 
           | Last time I tried to figure out if there were existing tools
           | to solve a problem like this, I came across Event Calculus:
           | https://en.wikipedia.org/wiki/Event_calculus
           | 
           | I'm sure there's some interesting CS theory to be uncovered
           | here.
        
             | burntsushi wrote:
             | Oh it's definitely overstated by suggesting it should live
             | alongside a general purpose regex engine. :-)
             | 
             | I have no doubt that its utility can be great in niche use
             | cases. I've never come across one in my decades of
             | programming that I know of, but I'm sure they exist.
        
               | BoppreH wrote:
               | > I've never come across one in my decades of programming
               | that I know of, but I'm sure they exist.
               | 
               | The login attempt example was convoluted, so here are
               | more common scenarios:
               | 
               | - Grouping a list into pairs by matching /../
               | 
               | - Finding or collapsing repeated sequences with /(.)\1+/
               | 
               | - Parsing tag+value binary formats.
               | 
               | - Searching structured logs.
               | 
               | - DSL's for unit tests.
               | 
               | Are there better ways to do each of those? Yes, but
               | either because someone implemented those specific
               | functions, or it's a much longer solution.
               | 
               | Though I'm not recommending list-regexes for production
               | code anytime soon. Prototypes and code golfing, sure, but
               | not until the community understands it better.
        
               | burntsushi wrote:
               | Yeah I just wouldn't make use of regexes with arbitrary
               | alphabets for any of those tasks personally. Doesn't
               | really seem like a win to me.
               | 
               | But like I said, someone should go build it.
        
           | LoganDark wrote:
           | > They just usually don't gain much traction because I
           | suspect you're overstating their general utility. :-)
           | 
           | I find myself desiring a DFA over LLM tokens right now, which
           | are 16 bits each, at least for this model.
           | 
           | I, personally, am very pleased with the idea of attempting to
           | build a dense DFA with that vocabulary. Maybe your computer
           | would explode!
        
             | telotortium wrote:
             | Could you not encode the tokens into 8-bit bytes? Rust
             | regex supports byte strings:
             | https://docs.rs/regex/latest/regex/bytes/struct.Regex.html
        
       | otterpro wrote:
       | I use Ripgrep daily for searching anything in code or in text
       | files, and I'm grateful every time I use it (on windows, linux,
       | mac, vscode, vim, etc...). It's one of those software that has
       | changed my life and how I work. Whenever I'm forced to use
       | `grep`, it feels like I've travelled back in time when everything
       | ran on a single core CPU and data were on slow spinning hard disk
       | on PATA/IDE. Anyway, burntsushi definitely deserves honor amongst
       | other great programmers.
        
         | kzrdude wrote:
         | ripgrep comes out of a "lineage", before it there was ag and
         | before that ack, that did similar things to provide a much
         | better interface than just grep.
        
       | LoganDark wrote:
       | I was in the middle of writing some regex-automata code (using
       | the early 0.2.0 release) when this post dropped. Welp, time to
       | see if I'm going to have to start all over digging into some
       | brand new internals~
       | 
       | (I haven't read the post yet because I have an important call in
       | a few minutes, but it looks like a _very_ interesting and also
       | _conveniently-timed_ blog post.)
       | 
       | (Edit a few minutes later: looks like the answer might be yes,
       | but since this is a polished release, I might be able to simplify
       | my code massively. Wish me luck~)
       | 
       | (Edit 2 about 10 minutes later: well that was pretty painless and
       | the new Builder::patch method is a total upgrade. Awesome~)
       | 
       | P.S. I'm still blocked from all your GitHub repositories and I
       | think that's kind of unfair considering how widespread a lot of
       | your crates are. I don't remember the original incident anymore.
       | I believe the regex crate itself is under the rust-lang
       | organization now, but there are still others I can't interact
       | with.
        
         | burntsushi wrote:
         | The regex-automata 0.2.0 docs did have a giant warning about
         | this, and strongly recommended going with 0.1:
         | https://docs.rs/regex-automata/0.2.0/regex_automata/
         | 
         | > I'm still blocked from all your GitHub repositories and I
         | think that's kind of unfair considering how widespread a lot of
         | your crates are. I don't remember the original incident
         | anymore.
         | 
         | I don't either. I block a lot of people for a lot of reasons. I
         | unblocked you.
        
           | LoganDark wrote:
           | > The regex-automata 0.2.0 docs did have a giant warning
           | about this
           | 
           | I didn't listen, and it paid off, because a bunch of the
           | knowledge I learned on it is still perfectly applicable to
           | 0.3.0; I already have the same NFA as before. :)
           | 
           | Now my appointment got canceled so I get to read your blog
           | post _and_ play with the new version. Fun!
        
       | celeritascelery wrote:
       | My biggest wish for rust regex was support for ARM SIMD.
       | Currently the literal optimization is only vectorized on x86. It
       | seems that the author doesn't own an ARM machine, and therefore
       | can't test implementation and so doesn't want to support it.
        
         | AceJohnny2 wrote:
         | > _It seems that the author doesn't own an ARM machine, and
         | therefore can't test implementation and so doesn't want to
         | support it._
         | 
         | Tangential (since he answered already), but in this era of
         | cheap cloud instances, I'd be surprised if "not owning a
         | <blank>" is any hindrance to support.
        
           | cesarb wrote:
           | > but in this era of cheap cloud instances, I'd be surprised
           | if "not owning a <blank>" is any hindrance to support.
           | 
           | When you own a device, it doesn't cost you more (other than
           | electricity) if you take longer while playing with it, and
           | there's no extra cost (other than space) to having it ready
           | to be used when necessary. When you rent a cloud instance,
           | there's a direct monetary cost for every extra minute you
           | take while investigating an issue, and it's less guaranteed
           | that it'll be available when you need it.
           | 
           | Or, to put it more simply: unless he has his own personal
           | device to play with, he'll have to spend extra money for
           | every issue he investigates and every change he makes to the
           | ARM support.
        
             | burntsushi wrote:
             | Yes, exactly. There is a ton of friction.
        
         | burntsushi wrote:
         | I actually have an M2 mac mini in the mail from Apple for
         | exactly this purpose.
         | 
         | My time horizon is very long. It takes me a long time to do
         | things these days.
         | 
         | It has never been true that I don't want to support it. Merely
         | that it is difficult to verify and test if I can't do it
         | myself. There is also the problem that the port from x86 to arm
         | is not straight-forward, due to both my own ignorance and what
         | I believe are important missing vector operations such as
         | movemask.
         | 
         | This is discussed a bit more here (including the bit about
         | movemask): https://github.com/BurntSushi/memchr/issues/76
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2023-07-05 23:00 UTC)