[HN Gopher] Java Verbal Expressions ___________________________________________________________________ Java Verbal Expressions Author : victor106 Score : 220 points Date : 2020-11-25 15:29 UTC (7 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | 6gvONxR4sf7o wrote: | regex suffers from the same problem that inlined code does. | | For example, if you saw this in a code review, what would you | say: log_and_return(rank_by_time(compute_recommendations(get_data | (client_id,date), find_nearest_neighbors(client_id)))) | | You'd tell them to create some intermediate variables. But when | it's a regex, apparently we're all fine with this: | | /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\\+\$,\w]+@)?[A-Za-z0-9.-]+| | (?:www.|[-;:&=\\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\\+~%\/.\w-_])?\? | ?(?:[-\\+=&;%@.\w_])#?(?:[\w]*))?)/ | TazeTSchnitzel wrote: | By using this "builder" syntax you gain: | | * not having to distinguish special characters in the pattern | being matched from special characters part of regex syntax | | * no ambiguity as to whether something is a digraph or not | | * no escaping hell | | * unambiguous human-readable names for all the regex features | used | | * the ability to use whitespace to clearly separate different | parts of the regex | | * the ability to comment parts of the regex | | It sounds great to me. Have you ever tried making a regex | matching something with backslashes in it, and then you have to | put that regex inside a string literal? Have you ever had to | switch between different regex environments and not known which | symbols require escaping, or what is the correct way to write | something in a particular environment? I've had all these | problems. 
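The backslash problem described in the comment above can be sketched with plain `java.util.regex`. This is only an illustration, not code from the library under discussion; the class name `EscapingDemo` and its helper method are hypothetical.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // Matches one literal backslash. The regex itself is \\ and each of
    // those backslashes must be doubled again inside a Java string literal.
    public static final Pattern HAND_ESCAPED = Pattern.compile("\\\\");

    // Pattern.quote wraps the text in \Q...\E, so no regex-level escaping
    // has to be written by hand (the string literal still needs its own).
    public static final Pattern QUOTED = Pattern.compile(Pattern.quote("\\"));

    public static boolean containsBackslash(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String path = "C:\\temp"; // the 7-character text C:\temp
        System.out.println(containsBackslash(HAND_ESCAPED, path));
        System.out.println(containsBackslash(QUOTED, path));
    }
}
```

Both patterns accept the same inputs; the difference is only in how many layers of escaping the author has to keep in their head.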
| TimTheTinker wrote: | Many of those gains can be had by using first-class and more | full-featured regexes, like those that are available in | other languages (Ruby, Perl): | | - escaping hell isn't that much of a problem, since you're only | ever escaping something once (not like a regex in a string) | | - several languages support separating regexes across several | lines | | - regex commenting (including named groups) is a standard | feature in many languages, and that's besides using first-class | comments across multiple lines | | I think you do have a point about digraphs (or homographs), but | unless I misunderstand, those would be a problem whether the | character(s) are part of a string or a first-class regex. | As for unambiguous human-readable names for regex features | used, tools like this (https://regexr.com/) are available and | very effective. | | I might prefer Java Verbal Expressions over java.util.regex, | but to me that's more of a knock on Java and its lack of | proper, first-class regexes than anything else. | jehna1 wrote: | Anyone looking for a non-Java implementation: This library has | been ported to 30+ languages, and you can find a list of them at | http://verbalexpressions.github.io/ | yoz wrote: | Sure, it's much easier to read, especially when it comes to | finding and understanding a two-character diff in a 50-char | regex. | | Sure, I get the benefits of type safety. | | Sure, it'll save me time debugging when I accidentally create an | invalid regex. | | But _what am I meant to do with all that time saved?_ Read a | book? Write more code? I don't get it. Let me waste the time on | a ludicrously arcane syntax where I spend half the time looking | at every bracket trying to understand if it's a control character | in that particular context, because the ego trip I get from | mastering this ridiculousness is HUGE! | | (Yes, I understand regex syntax. 
I've been able to explain the | phrase "zero-width negative lookbehind assertion" for the past | twenty years. I inhaled the Friedl book and got utterly high on | the idea that the awesome power of regular expressions - which | are genuinely great in how they ease flexibility in accepting | input - is entwined with their completely inhuman syntax. But I | was wrong.) | grishka wrote: | I've never had any problems writing and debugging regular | expressions after I came across this: https://regex101.com | | And since regexes are usually write-once, adding this | complexity on top of them serves no additional benefit. If | anything, it'd probably make it _harder_ for the next guy to | understand your code. | ziml77 wrote: | I bought RegexBuddy years ago and have loved it for | debugging. However it only runs on Windows. Found regex101 | recently and I think it's a great alternative (though I | almost didn't check it out because the domain has SEO abuse | site vibes). | sixo wrote: | It's fun to write regex but it is absolutely miserable to read | it. This looks like an improvement. | szatkus wrote: | It is. For the most part regexes like "\d+" are ok, but when | there is something more complicated I pull Verbal Expressions | into a project. To this day, reactions in code review have been | mostly positive, or neutral at worst. If it was built into the | standard library I would probably use it instead of regex, | but adding a new dependency and interfacing it with libraries | that expect Java regex objects has its cost. | sergeykish wrote: | A regular expression defines a graph, so a graphical | representation looks like a better choice: | | https://jex.im/regulex/#!flags=&re=%5E(%3F%3Ahttp)(%3F%3As)%... 
| | Usage example -- CSS Syntax Module Level 3 documentation: | | https://www.w3.org/TR/css-syntax-3/#string-token-diagram | | and JSON specification: | | https://www.json.org/json-en.html | | Have not found visual editor, made sample in quiver: | | https://q.uiver.app/?q=WzAsMTEsWzIsM10sWzAsNl0sWzEsMCwiXiJdL... | wwright wrote: | Graphs may be more clear, but if we rule out visual editors, | IMO this approach is still a net positive. | maweki wrote: | > Sure, I get the benefits of type safety. | | It seems not ;) | | It's as if your java-compiler would stop warning you on | forgotten semicolons and would instead error out during runtime | when it reaches the statement with the missing semicolon. | | It's not your time saved. It's time saved not running the test | suite, for example. An uncompilable regex is a category of | errors that you can ban completely from your program. Like java | bans syntax errors as a category of (runtime) errors. It's time | saved as any developer will not break this in a way that is not | a semantic error. It's time and mind saved not thinking about a | whole class of errors. | lgeorget wrote: | Fortunately C++ won't suffer from this kind of problems since | there, you can make your regex builder a constexpr! | | (lol) | [deleted] | yoz wrote: | Thank you for the clarification! To be clear: my post is | sarcastic, and I was trying to say that this library looks | like a significant usability improvement over traditional | regex syntax. | brown9-2 wrote: | Saving time is not just about getting to use it elsewhere - you | also save time fixing bugs and the harm they can cause. | pwdisswordfish4 wrote: | Verbose does not 'easier to read' make; especially when you | don't know whether 'anythingBut' means (?!...) or [^...]. | | Type safety is nice, sure, but it's a rather small benefit in | this case. It doesn't mean abandoning commonly-understood | syntax is worth it. 
Most regular expressions are short enough | to make errors visible with the naked (or IDE-assisted) eye. | | This library at best looks like a crutch for a deficient | language (which Java admittedly is), and at worst an | unnecessary obfuscation layer. | deepsun wrote: | You don't like Java, I see. | | This tool has little to do with Java, except that the author | decided to implement it in Java. It's a regular expression | composer. You could implement it in any other general-purpose | language. | jehna1 wrote: | It already is. In 30+ of them. You can find them on: | http://verbalexpressions.github.io/ | shawnz wrote: | Surprisingly it emits [^...]* | | See: https://github.com/VerbalExpressions/JavaVerbalExpressio | ns/b... | alisonkisk wrote: | The "deficient language" is "regex" not Java. | | Escape codes and special characters for regex semantics is a | deficiency of the 1980s programming world, not Java. | romanoderoma wrote: | They can be hard to read, but I don't think they are | deficient, on the contrary I think they are very elegant | | Stephen Cole Kleene was a brilliant mathematician and when | he invented regexes in the 1950s, he | anticipated a lot of concepts that became popular in | computer science, such as recursion (which he also founded | as a branch of mathematics and computer science together | with Alonzo Church, Kurt Gödel and Alan Turing) | | Java on the other hand has some deficiencies here and there | and it's not really a modern language free from old cruft | admax88q wrote: | > (?:[a-z0-9!#$%&' _+ | /=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'_+/=?^_`{|}~-]+) _| "(?:[ | \x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\\\[\x | 01-\x09\x0b\x0c\x0e-\x7f])_")@(?:(?:[a-z0-9](?:[a-z0-9-] | _[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]_ [a-z0-9])?|\\[(?:(? 
| :25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2 | [0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\ | x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\\\[\x01-\x09\x0 | b\x0c\x0e-\x7f])+)\\]) | | Such elegance. | | Might as well write code in machine code while we're at | it. | Tainnor wrote: | That's like complaining that you can write ugly code in | any language. The problem isn't the regular expression, | it's that email addresses, while they technically may | form a regular language (not sure if they 100% do), are | an insanely complicated such language and not a very nice | one. | | How would you write a specification for that language in | any other way that was _more_ elegant? Sure, you could | make it more verbose, but that wouldn't make it easier | to understand the whole of it, or why it is the way it | is. | admax88q wrote: | Almost every RFC writes its grammar in some form of | BNF, not in regular expressions. RFCs are written to be | understandable. | | > Sure, you could make it more verbose, but that wouldn't | make it easier to understand the whole of it, or why it | is the way it is. | | Absolutely it would. The way to understand a large thing, | is to understand the smaller components and then put them | together. Regular expressions do not compose well. | Tainnor wrote: | > Regular expressions do not compose well. | | That is patently untrue. Regular expressions compose | under a number of important mathematical operations, such | as union, intersection and concatenation. If your PL | supports string interpolation, it's trivial to compose | them in these ways (well ok, maybe not intersection). | Nobody says that your regex needs to be written as a | single string. | ajuc wrote: | Elegant regexes are almost unheard of in real world no | matter if it's e-mail or anything else. | notreallytrue wrote: | Principia Mathematica wants a word in private | | P.s. 
do you realise how much harder it would be to | understand the same thing in machine language? | Yeroc wrote: | Is there any language 20+ years old without cruft? | notreallytrue wrote: | Haskell? (30 years old) | | Where are the Lispers when you need them? :) | wwright wrote: | Haskell absolutely has waaaay too much cruft. Have you | read the 30 page articles recommending which extensions | to use? Have you ever seen MTL? Read any documentation | written by Edward Kmett? | colonwqbang wrote: | > Verbose does not 'easier to read' make | | Java is founded on the opposite principle, I think. | dehrmann wrote: | Yeah; I'd rather have proper multiline strings in Java and a | regex documented with the COMMENTS flag set. What I don't | need is a regex builder. Or SQL builder, for that matter. | [deleted] | AlphaSite wrote: | I think Java has (or is getting) multiline strings now. | MHordecki wrote: | Java got them in the most recent version 15: | https://openjdk.java.net/jeps/378 | dehrmann wrote: | This took them _way_ too long for how easy it is to add | and how many headaches it would prevent. | oweiler wrote: | I'm an average developer but never found regular expressions too | hard to write or even read. | theparanoid wrote: | Anything but the simplest regexes are tricky to correctly | write. | jacobwilliamroy wrote: | How do I learn regex? I get confused because it seems like maybe | there's more than one kind of regex floating around out there, | and since regex is made of lots of punctuation symbols, it's very | hard to search for things about it on the web. Is there a single | book I can read? A couple books? Does it depend on my runtime | environment? | gambler wrote: | Once you start thinking about it, it's mind boggling that we have | thousands of languages and yet most of them don't have built-in | facilities to construct and parse grammars (at least context-free | ones). 
Every single designer seems to think that _their_ language | is finally good enough and will not be used as a starting point | for another one. | throwaway_pdp09 wrote: | Because building in a parser is inappropriate; it isn't | generally worth it. You use a separate tool or framework, you | don't build it into a language. | rbonvall wrote: | Raku (Perl 6) has grammars as first-class citizens: | https://docs.raku.org/language/grammars#Creating_grammars | TimTheTinker wrote: | I am all for developer ergonomics, and I'm a fan of Ruby... but | the problems this library would add to a codebase/project seem | too big to be worth the benefits: | | - non-standard syntax requiring its own documentation, which | developers would have to consult separately (even if they already | know regular expressions) to modify generated regular expressions | | - removing the ability to test and validate regular expressions | independently of the codebase (say, in the terminal, a small | shell script, or using an online tool) | | - a new rabbit hole to traverse when debugging a problem | | - assuming the security risks associated with handing over regex- | building to a library built by someone else (even more so if the | regex is parsing private or protected data) | | - adding a new dependency that may or may not be maintained in | the future | | For those who would want to use this library, I would suggest | using a separate tool to build and/or understand regular | expressions. Here's one example, and I'm sure there are others: | https://regexr.com/ | tasogare wrote: | I did a little class like this with a fluent API in C# to | generate regex in a project that requires big ones. It makes | working with regex super easy and super maintainable. | soco wrote: | Looks a bit abandoned though, doesn't it. 
Otherwise I'd love it | for the safety and readability (while I'd still need to re-learn | everything I've forgotten in the half year since I last used | regex) | swlkr wrote: | It's semi-related, but if you're into easier regex, have a look | at janet's PEGs | | https://janet-lang.org/docs/peg.html | jjevanoorschot wrote: | For everyone that doesn't see the point, take a look at the | example of parsing a long string [0]. The verbal expression is | _much_ easier to read than the regular expression. | | [0] | https://github.com/VerbalExpressions/JavaVerbalExpressions/w... | pavon wrote: | I don't see it. The regex is mostly hard to read because they | formatted it poorly and put in a bunch of unnecessary non- | capturing groups. I find this to be just as easy (if not | easier) to read as their first example: | String pattern = ( "(\d+)\t"+ | "(\d+)\t"+ "([0-1])\t"+ | "(http://localhost:20\d{3})\t"+ "([0-1])\t"+ | "(\d+)\t"+ "([0-1])\t"+ "(\d+)\t"+ | "(\d+)\t"+ "([0-1])\t"+ "(\d+)\t"+ | "(STR[0-2])" ); | | And this is just as easy to read as their second example: | String num = "(\d+)\t"; String bool = "([0-1])\t"; | String url = "(http://localhost:20\d{3})\t"; String str | = "(STR[0-2])"; String pattern = | num+num+bool+url+bool+num+bool+num+num+bool+num+str; | | And yes, I do frequently split up my regexes like that to make | them more readable. | | The only improvement I see is that you don't have messy | escaping in the url. That is genuinely nice. It motivates me to | start using a regEsc() function instead of doing it by hand. | However, I find "capt().endCapture()", and other verboseness to | be a step backwards. | | Edit: Actually, from what I can tell, all the escaping was | unnecessary in this case as well. Updated examples without | unneeded escape characters. | flying_sheep wrote: | That really depends on how complicated the regular expression is. | For me this debate sounds like arguing assembly vs C. 
We will | need some sort of abstraction to develop higher-level stuff in | case we need it. | jjice wrote: | While neat, I think that if you're a developer, you'd be better | off learning basic regular expressions instead so you can use | them in whatever language you'd like. Depending on this would | probably just make moving to a new code base that doesn't use | this a lot more confusing. | | A normal regex with a comment above it explaining what it does | (for complex cases) always worked well for me. | nsxwolf wrote: | I can't learn them. I've tried for over 20 years and every time | I use them the knowledge is deleted from my brain immediately. | A library like this would be very helpful if it worked. | | One problem is that I'm more likely to need regex almost | anywhere but Java code. | rhacker wrote: | At the bottom there's a list of other languages that support | the same API | TonyTrapp wrote: | Even as a developer you may have to assemble a regular | expression at runtime, at which point a library that can do it | for you may be much more handy than having to assemble the | string yourself. | | And even if you know regex by heart - assembling it with | function calls can still be better / safer just like you | shouldn't insert SQL parameters by hand into your SQL query | strings. | spatx wrote: | I think there is value in both cases. I've seen many developers | that have struggled with regex even with all those hundreds of | tools to learn and to build/test regex. This could be useful to | them to start with, and they can learn regex according to their | time/needs. I see solutions like this as a choice, and the fact | that people are using these shows that there is value in having | that choice, even if it is not obvious to us at first glance. | patal wrote: | That does not work so well if you're working in a team. A | fairly complex regular expression is always hard to read. 
| | We see this as early as in code review and as late as when you | find a production bug because expectations of the surrounding | code have changed. | | For those reasons, we usually break regexes into parts anyway, | and name and comment the single parts. Using the library's | example, we might have: protocol = | "^(?:http)(?:s)?" protocol_separator = "(?:\:\/\/)" | url = "(?:www\.)?(?:[^\ ]*)$" regex = protocol + | protocol_separator + url | | Which turns out to be in the direction of these Java Verbal | Expressions. I find the Verbal Expressions idea really | enlightening. | 1f60c wrote: | The HN title isn't very informative. | | Maybe you could change it to something like: Java | Verbal Expressions: a DSL for regular expressions | [deleted] | skocznymroczny wrote: | Looks interesting. I find that all my regexes are pretty much | write-only. When I come back to them a few months later, I can't | make much of them and it's easier for me to start from scratch. | Tools such as https://regex101.com/ are amazing though for | development of regexes and later trying to make sense of them. | jug wrote: | I'm not sure if this is great or crazy! I try to not be swayed by | the handpicked examples because this at least _feels_ like a | design that could get messy once you try to do the particularly | gnarly regexes that this library claims it was designed for. If | it's great, it should already have been done long ago, hmm... | bwestergard wrote: | This is a nice API. It seems to get right up to the edge of | becoming a parser combinator library. | | Is it actually improving performance to use the regular | expressions internally to evaluate matches? | justin_oaks wrote: | This project would be better if it wasn't exactly a 1-to-1 | mapping from words/methods to regular expressions. For example, | the regex "\d+" maps to the code "digits().oneOrMore()". That | doesn't read well in English because it's odd to have an | adjective after the noun (i.e. 
we say "red bird" not "bird red"). | | Also, a serious weakness in regex is they are "write only", or | hard to read. That's because they are compact and don't have | discernible sections that are then assembled together. | | You can do that yourself in Java by assigning chunks of regex to | variables and then concatenating them together, but the regex | engine doesn't let you do that itself. You can't name sections of | the regex or insert comments into it. | | The example | ^(?:http)(?:s)?(?:\:\/\/)(?:www\.)?(?:[^\ ]*)$ | | could be better if it could be broken down into named pieces or | commented like this: ^ | (?:http)(?:s)? # http or https (?:\:\/\/) # :// | (?:www\.)? # optional www. (?:[^\ ]*) # rest of | URL (no spaces) $ | dmarlow wrote: | I love your example of how it should be explained. This helps | people correlate the verbal aspects to the regex parts they | described. This ultimately reinforces and helps people learn | regex more deeply. | throwaway_pdp09 wrote: | I thought java regexes had comments? | | https://docs.oracle.com/en/java/javase/11/docs/api/java.base... | justin_oaks wrote: | Huh, I didn't know that. I've read through a fair amount of | Java code with regexes and never seen anyone use comments. | Maybe it's because Java doesn't have proper multi-line string | support built into the language. | | If you don't have multi-line support in the language then | you're more likely to put the comments outside the string: | String regex= "^" +"(?:http)(?:s)?" // | http or https +"(?:\:\/\/)" // :// | +"(?:www\.)?" // optional www. +"(?:[^\ ]*)" | // rest of URL (no spaces) +"$"; | throwaway_pdp09 wrote: | ... which has to be a better way of doing it (comments + | regexps in digestible chunks) than having a rather wordy | library. | abhinai wrote: | Beautiful though a little verbose! | chrisbrandow wrote: | solve a problem with regex: now you have 2 problems. | | well, now you have 3. 
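The exchange above about comments inside Java regexes can be made concrete with `Pattern.COMMENTS`. The sketch below is not from the thread or the library: the class name `CommentedUrlRegex` is hypothetical, the pattern is a slightly simplified form of the URL example, and `\x20` stands in for the literal space because unescaped whitespace is ignored in this mode.

```java
import java.util.regex.Pattern;

class CommentedUrlRegex {
    // With Pattern.COMMENTS, whitespace in the pattern is ignored and
    // everything from # to the end of the line is treated as a comment.
    public static final Pattern URL = Pattern.compile(
        "^              # start of input\n" +
        "https?         # http or https\n" +
        "://            # scheme separator\n" +
        "(?:www\\.)?    # optional www.\n" +
        "[^\\x20]*      # rest of URL (no spaces); \\x20 is a literal space\n" +
        "$              # end of input",
        Pattern.COMMENTS);

    public static boolean looksLikeUrl(String input) {
        return URL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUrl("https://www.example.com/path"));
        System.out.println(looksLikeUrl("ftp://example.com"));
    }
}
```

On Java 15 and later the same pattern could sit in a text block instead of concatenated strings, which removes the explicit `\n` at the end of each piece.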
| ebiester wrote: | I wrote one of these in 2002, back in college, after being | inspired by Icon and SNOBOL. From Wikipedia: s := | "this is a string" s ? { # | Establish string scanning environment while not pos(0) | do { # Test for end of string tab(many(' | ')) # Skip past any blanks word := | tab(upto(' ') | 0) # the next word is up to the next blank -or- | the end of the line write(word) # | write the word } } | | I really think we lost out when we went toward regular | expressions rather than SNOBOL/Icon syntax, but I don't think a | direct substitute is as much the issue. | prabhatjha wrote: | This is a fantastic idea -- the kind you see and go why the heck | this was not done before. Such a huge time saver. | redmorphium wrote: | Reminds me of https://github.com/francisrstokes/super-expressive | PaulHoule wrote: | That kind of thing works even better in Java because the static | type system enforces it. | | In particular generic methods don't have the problem of type | erasure that affect generic classes so many things you would | want to do with types "just work". | | Almost everybody is afraid of it, but $ works just fine as an | identifier and can be used to make a DSL that looks like jQuery | in Java. | | Maybe someday I will write a class like: | class("some.namespace.MyClass").method(...) | antpls wrote: | That would definitely help code reviews and maintenance. Is there | anything similar for Python? | ajainy wrote: | Of course, as others pointed out, writing the expression directly | might be optimal, and perhaps every dev should learn regex. | | BUT in my whole career span, whenever I have to use regex, I | spend a couple of hours learning and testing. This kind of library | for Java opens doors for many other things (testability, default | library using default methods etc., integration with streaming | ). And as the community adds to it, it can be optimized internally. | All the end user needs to do is upgrade versions. 
It could be | extended as part of the javax validation specs. 
| murkle wrote: | Another key point: makes the code readable! | dailygrind___ wrote: | I think Regex is too low-level and a problem worth abstracting. | It works fine with simple patterns but I don't really see how a | pattern like this: | | /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\\+\$,\w]+@)?[A-Za-z0-9.-]+| | (?:www.|[-;:&=\\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\\+~%\/.\w-_] | _)?\??(?:[-\\+= &;%@.\w_]_)#?(?:[\w]*))?)/ | | (https://stackoverflow.com/questions/161738/what-is-the-best-...) | | contributes to readability. | ridaj wrote: | Is the implied contention that | `regex().capt().digit().oneOrMore().endCapt().tab()` is easier to | read than `([0-9]+)`? | | If so, maybe this isn't for everyone :) | miked85 wrote: | I feel like one would be much better off and more efficient by | just learning regular expressions. | hcarvalhoalves wrote: | The fact there are countless RegEx cheatsheets and pages like | https://regex101.com/ or https://regexr.com/ is evidence | RegExes are not intuitive or easy to remember. Composing plain- | english functions can be easier to remember, and editors can | provide auto-complete. | ed25519FUUU wrote: | The argument isn't that they're easy to use, it's that | they're widely used and widely available in virtually all | languages. You'll encounter them. | | The time it takes to learn regular expressions will pay off | because you'll be reading them and writing them for your | whole career. | king_magic wrote: | Eh, not necessarily. 15 years into my career, I can count | the number of times I've needed regular expressions on two | fingers. | flatiron wrote: | IntelliJ at least has a built in regex maker that you can | test against strings in the IDE. Pretty close to auto | complete. | pwdisswordfish4 wrote: | It's only evidence that they have to be learned, like | everything else. | djeiasbsbo wrote: | I'd say the bigger issue are the different regex | implementations. 
If you use Java, Javascript and grep you | already have to know the peculiarities of each | implementation... | teknopurge wrote: | I love seeing new things and building, I also want to | understand why people would find value in this? Is it because | people are learning things differently and find this easier to | digest instead of using regexs? or native substring | tokenization/boolean primitives? | | The new me is being less critical and positive... | (smileyface.jpg) | nerdponx wrote: | I agree. | | But if you want better readability and comments, Python's | "verbose" regex (?x) is a beautiful thing. You can usually also | just construct regular expressions incrementally by | concatenating strings or whatever your language supports. | tekknolagi wrote: | A buddy of mine made Remake | (https://docs.rs/remake/0.1.0/remake/) with this kind of thing in | mind. It's a DSL for composing regular expressions in a readable | way. | throwsofaraway wrote: | Trying to simplify something that doesn't simplify inherently | isn't always a good idea. Regex is pretty close to the least | level of abstraction that is necessary to get the job done. It | could probably be improved on, but probably not by much. | | Some commenters below mentioned this Java syntax is a good idea | and using endless number of regex cheatsheets as a testament to | why regex is not simple enough and should be replaced. It's | almost silly that this is even an argument on HN. Take for | example quantum physics, there are lots of videos and guides that | try to explain how it works, in fact some of the smartest people | tried to explain it, even Richard Feynman. But he famously said | if you think you understand quantum mechanics you don't | understand quantum mechanics. | | Some things cannot be reduced any further, this does not mean | those things are always simple in nature or somehow were designed | in a convoluted way on purpose. 
| | At least when it comes to regex it's important to keep in mind | what Einstein said, "everything should be as simple as possible | but no simpler." | | It's ironic that people apply reductionism to simplify regex, a | thing that itself one could argue is a prime example of | reductionistic design, yet they complain it's too abstract while | applying reductionism. | rendall wrote: | This project seems to be rediscovered every so often | | https://news.ycombinator.com/from?site=github.com/verbalexpr... | chubot wrote: | Related: Oil has a regex syntax that composes and doesn't have | escaping problems: | | https://www.oilshell.org/release/latest/doc/eggex.html | | Direct link to example: | | https://www.oilshell.org/release/latest/doc/eggex.html#examp... | | A longer example: | | http://www.oilshell.org/blog/2019/12/22.html#eggex | cutler wrote: | Maybe if Java left the Stone Age and fixed the need to escape | regex metacharacters this wouldn't be necessary. | cratermoon wrote: | This link has been posted 15 times on HN. First time was over 7 | years ago https://news.ycombinator.com/item?id=6200070 | tomp wrote: | Is it just me or does this seem like a very bad idea? I mean it | _seems_ nicer but the reality is, if you don't know how Regexes | work, you won't understand the nuances of the "verbal" regex | either... Also, some optimisation maybe? | ^(?:http)(?:s)?(?:\:\/\/)(?:www\.)?(?:[^\ ]*)$ | | could be better written as | ^https?://(?:www[.])?[^ ]*$ | | Or am I missing something? In that case, I'll readily admit this | library is a good idea :) | wffurr wrote: | Now we have three problems... | dkarl wrote: | Sadly, the ability to mash autocomplete instead of looking at | the doc page for regular expressions will be a major selling | point. | Gaelan wrote: | Why sadly? Autocomplete does a ton for making APIs more | discoverable and easy to use. 
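As a quick sanity check on tomp's rewrite earlier in this subthread, both patterns can be compiled with `java.util.regex` and compared on a few inputs. This is a sketch only: the class name `UrlRegexComparison` and the sample strings are made up for illustration, and agreeing on a handful of samples is not a proof that the two patterns are equivalent.

```java
import java.util.regex.Pattern;

class UrlRegexComparison {
    // The pattern as quoted in the thread, every piece in its own group.
    public static final Pattern VERBOSE =
        Pattern.compile("^(?:http)(?:s)?(?:\\:\\/\\/)(?:www\\.)?(?:[^\\ ]*)$");

    // The hand-simplified version suggested in the comment.
    public static final Pattern SIMPLE =
        Pattern.compile("^https?://(?:www[.])?[^ ]*$");

    // True when both patterns give the same verdict for this input.
    public static boolean agree(String input) {
        return VERBOSE.matcher(input).find() == SIMPLE.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com",
            "https://www.example.com/path",
            "ftp://example.com",
            "http://has space.example"
        };
        for (String s : samples) {
            System.out.println(s + " -> verbose=" + VERBOSE.matcher(s).find()
                + ", simple=" + SIMPLE.matcher(s).find());
        }
    }
}
```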
| dkarl wrote: | The intersection of debugging regexes and debugging code | written by someone cycling through autocomplete looking for | methods that sound right should not be real. It should be a | myth, a region of programmer hell, a scary story to tell | children about what will happen to them after they die if | they don't document their code. May Dijkstra strike down | anyone who succeeds in bringing this horrible idea to | production. | InfiniteRand wrote: | I think there's certain use case for this, a moderate regex | user who's not an expert and not fully comfortable with regexes | but knows the basics, and who is in a project where they need | to heavily use regexes for a limited amount of time and where | they will need to maintain this code going forward. | | If you use regexes a lot, you are better off learning regexes, | if you use regexes a little, this is a lot to learn to avoid | learning a little about regexes. But there is a moderate user | sweet spot where I could see this useful. | ajuc wrote: | I like it as a simplistic builder. Much easier to read, | autocompletes, and (I assume) handles escaping for you (because | it knows you put only raw data inside). | | Just escaping alone is a big selling point for me. | toxik wrote: | In fact, why test for www at all? It is a subset of [^ ]* | anyway. | [deleted] | CapacitorSet wrote: | It seems that the cruft really boils down to using groups even | where there is no ?/*/+ qualifier. | simias wrote: | I think it's a great idea... if you already know regex. | Effectively it's just a different syntax for the same construct | after all, it doesn't simplify anything, it just makes it more | readable. Oh and it makes escaping a non-issue, which already | almost sells me on the idea completely, since it seems that 50% | of the time I spend writing regex is figuring out what needs | escaping and how. 
| | Writing regexes is not much of an issue usually (although the | many dialects in common use are always a source of frustration) | but reading them is always a pain, for me at least. For quick | and dirty shell scripts or vim editing it's great; for stuff | that's supposed to be long lived and actively maintained in a | codebase I think this verbal approach is a great idea, at least | in theory. | | Regarding the optimization of the intermediate result it should | only be a problem if you actually need to output these regexes | for other uses or if you need to compile many of them at | runtime with performance constraints. If your regexes are | precompiled then the resulting DFA should look the same as far as | I can tell. | | If somebody makes a Rust crate with a similar concept I'll be | sure to try it out next time I have to write regexes in a | codebase. | dehrmann wrote: | > I think it's a great idea... if you already know regex | | It's actually a bad idea in this case because regex is mostly | the same in every modern language, so if you know it, you | know it everywhere. What you don't know is this. | | I agree with the common complaint that regex is effectively | write-only, but this is only half due to its terse syntax. A | pattern can be pretty complex on its own, and complex things | are hard to understand. Imagine what code matching the behavior | of a complex regex would look like. | simias wrote: | > It's actually a bad idea in this case because regex is | mostly the same in every modern language, so if you know | it, you know it everywhere. What you don't know is this. | | I disagree, at least in my experience there are significant | differences between the multiple regex engines I use | regularly. In no particular order: are parens and other | operators treated literally by default or do they need to | be escaped? Are character classes like '[:alpha:]' | understood, or do I need to write them explicitly?
| Similarly, do I have access to \w \W \s and friends? Can I | use + to mean {1,} ? Can I use '?' to match 0 or 1 (common) | or do I have to use = (vim)? Or maybe just {0,1}? But then | should I escape the braces? Do I have recursion? Do I have | named captures? | | Those are not theoretical concerns, that's stuff I | routinely end up getting wrong because I forget that this | one feature that works in pcre does not work in vim or | works differently in sed etc... | dehrmann wrote: | > are parens and other operators treated literally by | default or do they need to be escaped? | | > Can I use + to mean {1,} ? Can I use '?' to match 0 or | 1 (common) or do I have to use = (vim)? Or maybe just | {0,1}? But then should I escape the braces? | | I think that's just older tools like vi and sed. Perl, | Python, Java, and Javascript use a similar modern version | where + and ? work, and parentheses and braces don't need | to be escaped. | lucb1e wrote: | > if you know it, you know it everywhere. What you don't | know is this. | | Right, one language might have anythingBut(" ").endofline() | and the next language might have a different . operator | like anythingBut(" ")->endofline() or it might even require | nesting calls. None of these things are a significant | hurdle and if we standardize the names (endofline, | anythingBut, ...) then you can make the same argument. It's | a chicken and egg argument: just use regex because that | works everywhere -> it's not universally implemented -> it | won't work everywhere. | | And aside from that, I have a similar experience to the | sibling comment: when using some command line tool that I | forgot (is it sed? Vim?) the default is that \\( is a | capture group whereas in normal regex ( is a capture group. | Grep offers you three regex variants to choose from. I have | to look up regex syntax or do trial and error every time I | don't use a language that I use daily. 
And I don't know all | of regex to begin with, I just know everything I ever | needed but people posted examples here with (?:x) which I | don't know. I once read it and remembered it for a few days | I think... so anyway, consistent and descriptive method | names seem a lot easier especially when you consider | autocompleting IDEs. | hansjorg wrote: | Rust version of the same library: | | https://github.com/VerbalExpressions/RustVerbalExpressions | | Implementations for 36 different languages: | | http://verbalexpressions.github.io/ | pwdisswordfish4 wrote: | Well, there's at least one advantage: apparently this builder | library automatically escapes literal strings passed to it, so | you no longer need to worry about injection bugs if you | construct patterns dynamically (cf. parametrised queries versus | 'come on, just use mysql_real_escape_string, it's not that | hard'). | | I'm not sure this alone pulls its weight, though; most of the | time, regular expressions are fixed at compile time. And I'd | still prefer something that mostly preserves commonly | understood pattern syntax. Having to guess whether | 'anythingBut' means (?!...) or [^...] is not encouraging. | | (This was apparently ported from JavaScript, where it is even | more pointless: template literals can take care of the escaping | part without abandoning standard pattern syntax. But as far as | I know, Java has no equivalent feature.) | bjarneh wrote: | > Is it just me or does this seem like a very bad idea? | | It's not just you. As you say this can only truly be used by | people who understand regular expressions; and they would most | likely prefer not to use this stuff. | | It seems the whole IT industry is obsessed with helping us do | all sorts of things, even simple things, which in the end often | makes things more complex. Different query languages that | translate to SQL to help us out, which often create | super-complex SQL.
All sorts of wrappers to avoid us having to deal | with all sorts of formats (JSON/XML...). Hopefully those | wrappers do something useful with those date-objects you know | you have in there somewhere... | marcinzm wrote: | >It's not just you. As you say this can only truly be used by | people who understand regular expressions; and they would | most likely prefer not to use this stuff. | | I know regex and I hate writing it. It's unreadable and I | need to spend time remembering/googling/checking the exact | syntax. And, of course, the syntax differs from | implementation to implementation in subtle but important ways | (e.g. the need to double escape in Python). | cutler wrote: | Perl and Ruby don't need to escape regex metacharacters so | why do Python and Java? It's just archaic. | wutbrodo wrote: | > It's not just you. As you say this can only truly be used | by people who understand regular expressions; and they would | most likely prefer not to use this stuff. | | There's a niche where this might be useful, but by definition | it's small. I understand regexes a moderate amount, and can | construct arbitrarily complex ones when necessary. But I do | it just infrequently enough that it can be painful and | halting above a certain level of complexity, with lots of | testing and reference-checking. It'd be nice to use something | sane like this, and I think I fall squarely into the category | of "people who understand regexes but would prefer to use | stuff like this". Though as I said, this niche is almost by | definition small, and on top of that I can't remember the | last time I used Java. | | Completely independently, in any non-trivial engineering | system, readability is important, and this helps a lot there. | pydry wrote: | A lot of IT is the parsing and mapping of one kind of | language (whether markup, DSL, Turing complete) onto | another.
| | Doing it right is a delicate balancing act of being just | powerful enough to express everything the user needs without | devolving into an unreadable or repetitive mess. Some people | manage to achieve neither. | dehrmann wrote: | > Different query languages that translate to SQL to help us | out | | That and UI SQL builders. What I want is typeahead column | names, not a dropdown for the column, the operator, etc. | _jal wrote: | Yep, and SQL-builders are the first thing I thought of, too. | | These tools are great for letting someone build something | they don't understand, and leaves them completely adrift when | something goes wrong. | | The next step is they bring this nonstandard thing to "the | expert", who has to figure out their tool before they can | figure out what's going wrong... | simias wrote: | I don't think SQL builders are a good comparison because: | | - SQL can already be made fairly readable by default, it's | not just a long series of cryptic tokens. The main point of | SQL builders is not to make SQL more readable, it's to make | SQL approachable by people who don't know SQL. | | - There can be several ways of achieving the same result in | SQL, with sometimes deep performance implications, so it's | really important to understand what is being executed and | in what order. Regular languages are much simpler and while | the string representation of the regex might end up longer | than the handcrafted equivalent, the runtime performance | should end up being the same since in the end it's all | deterministic finite automatons. | | - SQL builders have to be at least a little bit opinionated | to be really useful, in general they make it easy to create | simple queries but can quickly become limiting for complex | queries, especially if you already know SQL. 
These "verbal | expressions" on the other hand can easily map 1:1 with raw | regex constructs, allowing somebody who already knows regex | to express exactly the same logic, just in a more verbose | and human-readable way. | | This verbose syntax operates at exactly the same level of | abstraction as normal regex, it's just a syntactical | transform effectively. It's like JSON vs. CBOR or something | like that. | _jal wrote: | > There can be several ways of achieving the same result | in SQL, with sometimes deep performance implications | | Which is also very true of regexes, especially the more | feature-rich variants. | | And the existence of variants was a large part of what I | was getting at. | | > it's just a syntactical transform effectively | | Yes, it is tooling that helps people do things they don't | understand. | lmilcin wrote: | No, you are not. These "verbal" expressions are nothing more | than a builder for the actual expression. So you can't actually use | it without understanding regular expressions. | disgruntledphd2 wrote: | They're much easier to scan in a large codebase though, which | I suspect is the major advantage. | jariel wrote: | "These "verbal" expressions are nothing more than a builder | for the actual expression." | | It may be under the hood, but there's no reason for it to be. | | There's nothing inherent in our regexes that would imply they | are 'the language' for that purpose, it just so happens we | really only have one commonly used one. | | Like most things invented forever ago, there might be | opportunities for a 'cleaner, better way'. | lmilcin wrote: | Obviously, there might be occasions to improve. | | But, regular expressions seem quite well optimized from my | point of view. | | Regular expressions are used for the exact same task regardless | of programming language -- using a single expression language | regardless of programming environment seems like a huge | advantage.
It can be embedded in a configuration file, as a | string in a database, on a web page or deep in backend | code, and it will still work the same. | | The "Java Verbal Expressions" already have "Java" in the | name and so are a complete loss when it comes to portability. | | Then comes the fact that "Java Verbal Expressions" are many | times more code than actual regular expressions. That isn't | easier to scan, it is much worse. | | Regular expressions are very succinct and you can express a | lot in a single line. Comparable JVEs would require | many lines and wouldn't be more readable for anybody other | than a person that doesn't know regexes at all. | lmilcin wrote: | It is a huge amount of code for a relatively simple expression. | | I see not a single situation where this would actually look more | readable than a proper regex. | | Unless... you don't want to learn regular expressions and then | you have two problems... | [deleted] | mrkeen wrote: | Even though I still use regexes in rare circumstances - e.g. | inside config files, parser combinators already do a much better | job than this (or regexes) when you are writing maintainable | code: warcEntry = do header <- | warcHeader crlf body <- do | contentLength <- getContentLength header | compressionMode <- getCompressionMode header | warcbody contentLength compressionMode crlf | crlf return (WarcEntry header body) | | If you accept crlf as "carriage-return-line-feed", the rest | basically reads as pseudocode. crlf could have just as easily | been written (string "\r\n") I guess. | | Parser combinators can: | | * call out to other parsing functions (e.g. warcHeader) - so you | can build your code out of testable units. | | * bind results to variables and start using them during the | parse, e.g. warcHeader returns data containing contentLength and | compressionMode, which is then fed to the warcbody function so it | knows what to expect.
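The second bullet above, feeding an already-parsed value into the next parser, is the part a regex cannot express directly. Below is a minimal sketch of that idea in Java rather than Haskell. Everything in it (`TinyParsers`, `Parser`, `Result`, the `5:hello` length-prefixed format) is a hypothetical illustration and is not taken from mrkeen's WARC code or from any parser-combinator library.

```java
import java.util.Optional;
import java.util.function.Function;

final class TinyParsers {
    /** A successful parse: the parsed value plus the unconsumed rest of the input. */
    static final class Result<T> {
        private final T value;
        private final String rest;
        Result(T value, String rest) { this.value = value; this.rest = rest; }
        T value() { return value; }
        String rest() { return rest; }
    }

    @FunctionalInterface
    interface Parser<T> {
        Optional<Result<T>> parse(String input);

        /** Runs this parser, then picks the next parser based on the value just parsed. */
        default <U> Parser<U> flatMap(Function<T, Parser<U>> next) {
            return input -> parse(input)
                    .flatMap(r -> next.apply(r.value()).parse(r.rest()));
        }
    }

    /** Parses 1 to 9 ASCII digits as an int (the cap avoids int overflow). */
    static Parser<Integer> number() {
        return input -> {
            int i = 0;
            while (i < input.length() && i < 9
                    && input.charAt(i) >= '0' && input.charAt(i) <= '9') {
                i++;
            }
            if (i == 0) {
                return Optional.empty();
            }
            int value = Integer.parseInt(input.substring(0, i));
            return Optional.of(new Result<Integer>(value, input.substring(i)));
        };
    }

    /** Matches exactly the expected text at the start of the input. */
    static Parser<String> literal(String expected) {
        return input -> input.startsWith(expected)
                ? Optional.of(new Result<String>(expected, input.substring(expected.length())))
                : Optional.empty();
    }

    /** Consumes exactly n characters, failing if fewer remain. */
    static Parser<String> take(int n) {
        return input -> input.length() >= n
                ? Optional.of(new Result<String>(input.substring(0, n), input.substring(n)))
                : Optional.empty();
    }

    // "<length>:<body>": the parsed length decides how much the body parser consumes.
    static final Parser<String> LENGTH_PREFIXED =
            number().flatMap(len -> literal(":").flatMap(colon -> take(len)));

    public static void main(String[] args) {
        LENGTH_PREFIXED.parse("5:hello world").ifPresent(r ->
                System.out.println("body=[" + r.value() + "] rest=[" + r.rest() + "]"));
    }
}
```

Each small parser (`number`, `literal`, `take`) can be tested on its own, which corresponds to the first bullet about building code out of testable units.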
| pandemic_region wrote: | WHERE HAVE YOU BEEN ALL MY LIFE | surfsvammel wrote: | Unlike many others, I actually like this idea. I know regular | expressions, but many of my colleagues do not. They often have a | hard time understanding what a particular regex does, even though | I often document them step by step. Something like this would | make it more readable. | | I do agree with others here that it seems a bit rough around the | edges and some optimisation might be needed. But I think the idea | itself is sound. | maweki wrote: | Maybe you should look at visualizations like Regex Railroad | Diagrams. This is what helps me most. | zvrba wrote: | I limit my brain-time on constructing a regex to 5 minutes max. | If it takes me longer than that, I reach for a parser. Pick the | right tool for the job. | maweki wrote: | It's pretty verbose, but it is useful in the sense that you have | type-safety between character groups and the control characters. | It's neat that it only allows you to create valid Regexes (I hope | it does). At least you have static safety that your parentheses | for capture groups are properly closed. | | This advantage is not explained. Not being able to construct | invalid regular expressions is a good static safety guarantee | that you don't get when you embed DSLs as strings. | | Edit: This is the same reason why we would prefer jOOQ to | embedded String-SQL, if speed/dependencies are of no concern. | You're not allowed to construct invalid SQL as the Java type | system gives you these guarantees when using an embedded DSL | instead of a String-DSL. This is very powerful, but of course | only works if the type system of the host language is powerful | enough. | laszlokorte wrote: | is(4).equalTo(5.plus(eulers_constant.toThePowerOf(1.toTmaginaryUnit().times(rationBetweenCircumferenceOfACircleToItsDiameter)))) | stickfigure wrote: | This is cool, but I'm disappointed to see the horrid builder | pattern show up again.
Imagine you had to use StringBuilder every | time you wanted to manipulate a String? | | Just make all fields final and combine the builder and 'working' | class into a single immutable object. Like String. | | `build()` everywhere is syntactic noise, and you either lose | immutable safety (by passing around builders everywhere, as in | the examples) or composability (by passing around the 'sealed' | objects). Builders are an antipattern that should only be used in | cases where extreme performance is required. | x87678r wrote: | I wish you were interviewing me. This is Java world you're | talking about and if you can't squeeze a dozen Gamma Design | Patterns into your code you aren't good enough. | noema wrote: | The main intent of Builder isn't performance, but to avoid a | combinatorial explosion of constructors for every possible set | of parameters. | lalaithion wrote: | So why have constructors for every possible set of | parameters? | | Why VerbalExpression.regex() | .startOfLine().then("http").maybe("s") | .then("://") .maybe("www.").anythingBut(" ") | .endOfLine() .build(); | | Instead of new VerbalExpression() | .startOfLine().then("http").maybe("s") | .then("://") .maybe("www.").anythingBut(" ") | .endOfLine(); | szatkus wrote: | Both are equally readable to me, but with the builder | pattern you have an ability to fork a builder. Cloning | objects in Java could be messy. | a_e_k wrote: | Emacs has had an Emacs Lisp version of this for a long time. It's | implemented as a macro so it can build the string regexp at | compile time. | | https://www.gnu.org/software/emacs/manual/html_node/elisp/Rx... 
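stickfigure's suggestion above (final fields, no separate builder) and szatkus's point about forking can be reconciled in a small sketch. The `Rx` class below is a hypothetical illustration and not the API of Java Verbal Expressions; the method names are borrowed from the example quoted by lalaithion, and literals are escaped with `Pattern.quote` as an assumption of this sketch.

```java
import java.util.regex.Pattern;

/** Hypothetical immutable fluent wrapper: every step returns a new instance. */
final class Rx {
    private final String source; // regex source accumulated so far

    private Rx(String source) {
        this.source = source;
    }

    static Rx start() {
        return new Rx("");
    }

    Rx startOfLine() {
        return new Rx(source + "^");
    }

    /** Appends a literal that must appear next. */
    Rx then(String literal) {
        return new Rx(source + Pattern.quote(literal));
    }

    /** Appends an optional literal. */
    Rx maybe(String literal) {
        return new Rx(source + "(?:" + Pattern.quote(literal) + ")?");
    }

    Rx endOfLine() {
        return new Rx(source + "$");
    }

    /** True if the accumulated pattern is found anywhere in the input. */
    boolean test(String input) {
        return Pattern.compile(source).matcher(input).find();
    }

    public static void main(String[] args) {
        Rx scheme = Rx.start().startOfLine().then("http").maybe("s").then("://");
        // "Forking" is just reuse: scheme is never modified by the calls below.
        Rx withWww = scheme.then("www.");
        System.out.println(scheme.test("https://example.com"));
        System.out.println(withWww.test("https://example.com"));
    }
}
```

Because each instance is immutable, a shared prefix such as `scheme` can be passed around and extended in several directions without cloning and without a `build()` call. The trade-off in this sketch is that the pattern is recompiled on every `test` call; caching the compiled `Pattern` in a final field would address that.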
| quickthrower2 wrote: | Take 3 more steps in this direction and you can shed the regex | entirely and have parser combinators. | kleiba wrote: | Looking at the example, my immediate reaction was that the main | advantage would be the `anything_but` method, relieving me from | the cumbersome construction of stuff like this: | (?:[^t]|t(?:[^r]|r(?:[^u]|u(?:[^m]|m[^p])))) | | What a time-saver it would be to write | anything_but("trump") | | Except, then you look at the source code and see this: | public Builder anythingBut(final String pValue) { | return this.add("(?:[^" + sanitize(pValue) + "]*)"); } | | Sad face :( | recursive wrote: | Why would you be constructing stuff like that? It consumes the | input string up until it differs. When is that useful in a | regex? | enricozb wrote: | How else would you write that you want to match all strings | that don't contain string X? If you were matching at a | specific position, you should use a negative lookahead | (?!xyz), but I think in some cases you might need the mess | above. | recursive wrote: | I can't imagine a case where the mess would be useful at | all. | | Negative lookahead is the only way I can imagine this being | possibly useful. | | I.e. "Give me a string that's not trump and has a vowel in | it". | | Given "trunk", that mess above would match all of "trun". | What good is matching a prefix ever going to do? | aparsons wrote: | The example isn't a correct URL test regex (far from correct | actually - even though there are plenty of edge cases regular | regex strings tend to miss also) | jefftk wrote: | Their example is showing what the library can do, not trying to | determine which strings are URLs. | im3w1l wrote: | If you put it as a showcase, then people will use it. | cfv wrote: | It'd be absolutely bonkers if you could use this exact same DSL | to _generate_ valid strings | recursive wrote: | Well, since you can use the regex itself to generate valid | strings, it's certainly possible.
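The difference kleiba and enricozb discuss above, a negated character class versus "does not contain this word", is easy to see side by side. The source quoted by kleiba wraps the argument in `[^...]*`, which excludes individual characters. The sketch below contrasts that with a negative-lookahead pattern using only java.util.regex; the class name, method names, and sample words are illustrative assumptions.

```java
import java.util.regex.Pattern;

final class AnythingButDemo {
    // Character-class reading: a run of characters, none of which is t, r, u, m, or p.
    static final Pattern NONE_OF_THESE_CHARS = Pattern.compile("[^trump]*");

    // Word reading: at every position, the text ahead must not start with "trump".
    static final Pattern NOT_CONTAINING_WORD =
            Pattern.compile("(?:(?!trump).)*", Pattern.DOTALL);

    /** True if the whole input avoids each of the letters t, r, u, m, p. */
    static boolean hasNoneOfChars(String s) {
        return NONE_OF_THESE_CHARS.matcher(s).matches();
    }

    /** True if the whole input does not contain the substring "trump". */
    static boolean lacksWord(String s) {
        return NOT_CONTAINING_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"hello", "tax", "trunk", "donald trump"}) {
            System.out.println(s + ": noneOfChars=" + hasNoneOfChars(s)
                    + " lacksWord=" + lacksWord(s));
        }
    }
}
```

Inputs such as "tax" and "trunk" are where the two readings disagree: they contain letters from the set but not the word itself.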
| slifin wrote: | https://github.com/lambdaisland/regal | | Is a regex DSL that will let you do that, wouldn't be surprised | to see others | m12k wrote: | You can - use this to generate a regex, then run that regex | through one of these libraries: | https://stackoverflow.com/a/22133/126183 | s4n1ty wrote: | Wow, pretty sure I played with something like this for Python in | the 90s. People have been trying to replace regexps with | something more readable for a _long_ time. | | This seems like a decent attempt, although the syntax for | captures looks a little clumsy. ___________________________________________________________________ (page generated 2020-11-25 23:00 UTC)