[HN Gopher] Show HN: RegEx for Regular Folk - A visual, example-... ___________________________________________________________________ Show HN: RegEx for Regular Folk - A visual, example-based introduction Author : shreyasminocha Score : 315 points Date : 2020-05-01 14:05 UTC (8 hours ago) (HTM) web link (refrf.shreyasminocha.me) (TXT) w3m dump (refrf.shreyasminocha.me) | binstub wrote: | Nice intro. Tangential question. Is there a regex tool that shows | where the expression failed ? Not in syntax, but the logical | failure point? Would be useful for when an expression gets a | little long and nested and modifications need to be made. | | Edit: I mean like: | | Target text is abcde | | Regex is /abe/ | | Is there a tool that will tell me it matched a and b and then | failed trying to match e ? | | Those sites are great resources but they are showing pass/fail | and do show an excellent breakdown when something satisfies the | expression, but I'm just wondering if there is something that | shows partial matching until the failure point? | nickysielicki wrote: | https://regexr.com/ | | https://regex101.com/ | | These websites have saved me hours of time at this point. | scottfits wrote: | Incredible resource - I liked the structured approach as opposed | to guess and check regex which is what most tools offer | evo_9 wrote: | Just some 2cent feedback - don't assume anything is known. | | The BASIC lesson doesn't mention anything about /g. Having not | touched regex in years I had no idea what that was and kept | thinking 'why isn't he showing it matching a g if he has that in | the example'. | shreyasminocha wrote: | Point. I've made the temporary omission of an explanation for | /g explicit. I've also included a link to the relevant section | of Flags in the note. | shanecoin wrote: | RegExr [0] does a great job of showing individual highlights even | when they are in a sequential string. 
You can try to implement | this if you want instead of showing a callout with a note to let | the reader know that the highlights should be on individual | characters. | | [0] https://regexr.com/ | iluvblender wrote: | This is looking great. Thank you. | iluvblender wrote: | Also, I rely on https://regexr.com/ for interactive RE | debugging. | shreyasminocha wrote: | Me too! There are links to RegExr next to each example. Glad | they have query string support. | [deleted] | Zhyl wrote: | Lots of praise here, so I won't re-iterate the good points | (presentation, pleasant tone, good structuring) and will head | straight to the meat of my issue with the title: | | This is not a book for regular folk. | | A regular HN reader, sure. A technically inclined interested | party who wants to break the ice with Regexes, sure. But not | regular folk. | | Here is what I'm talking about: | | > Introduction | | > Regular expressions ("regexes") allow defining a pattern | | Ok, with you so far. As a layman, though, I would very much be | looking for you to expand on what you mean by 'pattern'. | | > and executing it against strings. | | "Executing" gets a wrinkled brow. "Strings" gets a squinty eye. | "executing against strings" and you've lost me. There's now too | much new information in this sentence for me to be on board with | it. If I knew what all those terms meant and the context in | which they are meaningful, I probably wouldn't be trying to read | 'RegEx for Regular Folk'. | | > Substrings which match the pattern are termed "matches". | | As above, but it's also slightly confusing here that we're | defining matches and we haven't even talked about what a pattern | is yet. As such, I can't even visualise or conceptualise what I | would be matching or similar. If I press on regardless, this is | just some unresolved debt that I will have to reconcile later or | I will just get frustrated and put the book down. 
| | > A regular expression is a sequence of characters that define a | search pattern. | | Ah, good, we're defining a pattern _after_ we've already | described a 'match'. | | > Regex finds utility in: | | >input validation | | And straight off the bat we're hit with a term that is only going | to be relevant for techie people. Unless you are aiming this at | techie people. But aren't we aiming this at 'regular folk'? | | The above is really just my long drawn out beef with 'x for the | masses', 'y for mere mortals' and the like. For me the best | explanation of regular expressions comes from Al Sweigart in | 'Automate the Boring Stuff with Python' [1]. He not only gives a | pretty thorough explanation of pattern matching before bringing | in any domain-specific terms, but he also motivates why you would | want to pattern match in the first place. He gives context for | circumstances under which you might reach for regex as a tool. | | I'm looking through the later pages of this book and as a | techperson I'm thinking 'this is beautiful. I can see the | examples clearly, there is a clear correlation between the | visuals and the exercise.' I'm also thinking as a folk person | 'when the hell will I need a match? Under what circumstances am | I going to need to know that there is one 'p' in 'grape' but two | 'p's in 'apple'? What use is writing a pattern to match against | certain fruits and utility items?' | | So yes, basically, after all that I can summarise "good book, bad | title". | | [1] https://automatetheboringstuff.com | shreyasminocha wrote: | Valid points. I agree. | | I'll try easing the curve, especially early on and make clearer | the intended audience. | parhamn wrote: | I sometimes wonder what a syntactically clarified regex could | look like. There are a few things that often confuse newcomers: | | - What are escapes and what needs to be escaped? | | - The <character-class><repetitions> structure of a regex. 
| | - Syntax around things like capture (are the parens part of some | matcher? what to escape?) | | We should have a version of regex that separates characters, | character classes and operators, or whatever the regex jargon for | those things is. Half the things I usually want to regex for, | like parens on a function or dot accessors need to be escaped! | | A quick example for illustration purposes (please don't point out | why this grammar won't map to regex): | <startofline>(['a' or 'b']<2,4,greedy>, | captureAs="prefix")[number or '.']<2><endofline> | | is definitely more approachable and easier to explain than the | regex equivalent (which I'm avoiding writing because I don't | have time to test if I got capture syntax right). | | Maybe someone makes a wasm regex-simple transformer we can use in | multiple languages. Regex is too useful to have such a scary | syntax for newcomers! | yoz-y wrote: | I think most people just like to hate on regex syntax because | when just glanced over it looks like spilled tea leaves. | | However I'd argue that it's not actually very hard to learn and | its brevity makes it easier to retain. (personally I did so | using https://www.regular-expressions.info/tutorial.html) | | I agree that escaping is a problem, mainly because languages | often have different rules for this. | romes wrote: | Hey! I'm wondering if in the first example of regex negation the | (^) should appear after the bracket ([). | shreyasminocha wrote: | Yes, it must appear immediately after `[` | trevor-e wrote: | FWIW Cisco Umbrella is blocking/reporting your site as a security | threat. | shreyasminocha wrote: | Very odd, no idea. Perhaps because the certificate is just a | few days old? | busterarm wrote: | We need this approach for more advanced RegEx and the regular | language subject in general for actual working programmers, from | some of what I've seen. | filmgirlcw wrote: | I love this! 
I love RegEx but have struggled trying to "teach" | others over the years. In addition to books like this, I often | find writing RegEx with something like Expressions[1] (and I know | there are many great website solutions, Expressions is just a | great app that I find very approachable to newcomers) is a great | way to learn. When you can see what you're writing select what | you want, you get a great grasp of how it works. This, alongside | a good book with good examples, is pretty much how I learned | RegEx ~12 years ago. | | [1]: https://www.apptorium.com/expressions | T3RMINATED wrote: | Mac only apps should die. | donaldihunter wrote: | This is nicely done. It could benefit from some non /g examples | on the basics page, especially since flags are not covered until | chapter 8. | | One visual enhancement that could be really helpful would be to | hover over the regex or the match and see the reciprocal | highlighted. | 120bits wrote: | I started teaching Python to my GF(she is working from home and | now has plenty of time to do some extra learning). She is not a | programmer and I have been giving her small functions to write. | Recently, we started with RegEx and she finds it really hard to | get into. She wants to see examples and follow along. I think | this will be perfect for her and anyone starting out to learn | regex. | canada_dry wrote: | Decent guide. It's great that all the examples are linked to | https://regex101.com/ so people can play/explore! | | More regex resources I rely on: | | http://www.regexr.com/ | | https://gchq.github.io/CyberChef | | https://regexper.com/#.%3F%5Bv%2Ci%5D.* | | https://cheatography.com/davechild/cheat-sheets/regular-expr... | shreyasminocha wrote: | The examples are linked to RegExr! There's a link titled | `[RegExr]` in blue next to each example. | | Also, those are some amazing resources, especially CyberChef. | canada_dry wrote: | Corrected! ( _Covid brain_ ) Thanks. 
| ben509 wrote: | It might be nice to touch on composition, as a good way to get | started is to test out individual pieces and be confident they | work when you're putting them together. | | If you're building a complex regular expression, setting smaller | parts in variables and dropping them in with (?:${part}) makes | things a bit more readable. | | It also exposes a real weakness of most regex engines. In | particular, alternation is a first-class operation, but | complement and intersection, while theoretically possible [1], are | typically not. | | A person might guess that the way to match three keywords is | /.*keyword1.*&.*keyword2.*&.*keyword3.*/ | | Or maybe /.*keyword1.*&(.*keyword2.*)!/ to match keyword1 and | not keyword2. | | But those won't work, so it's a good idea to explain some | options, an obvious one being /keyword1/.test() && | !/keyword2/.test() | | In the section on lookaround assertions, it's probably useful to | note that (?=thing1)(?=thing2) can match both, and it's a good | mental model for it, but that it comes with a few gotchas. | | [1]: | https://www.researchgate.net/publication/220994310_Succinctn... | digitalmaster wrote: | Learned more in just the first few pages than I have in my many | years of copy-pasting regex from StackOverflow. Cheers | SPBesui wrote: | Maybe I missed it, but there doesn't seem to be any credit given | for the xkcd comic (https://xkcd.com/208/) shown on the Next | Steps page (https://refrf.shreyasminocha.me/chapters/next-steps). | Does Randall even require it? | shreyasminocha wrote: | He does require it per CC-BY-NC-2.5. I felt it was sufficient | to permalink to the version on his domain, but I shall make it | more explicit. | saberworks wrote: | This looks really nice but I think it suffers from the same issue | a lot of regex tutorials suffer from. It's focusing solely on the | regex and not at all on how to actually execute them. 
This site | in particular says it's going to use javascript but at least the | first few pages don't show anything except raw regular | expressions. | | For any tutorial about regular expressions I think the second | thing (beyond a very simple example regex) to show should be how | to actually execute one in code. Is it that all the tutorials | want to be language-agnostic? Maybe just show a javascript | example and point out which part is the js function/method call | and which part is the actual regex. | | It's nice to be told what /[aeiou]/ means but without actually | typing it in and executing it (against various inputs, not just | one) it wouldn't really sink in for me. | jehlakj wrote: | Good. Unless you're trying to write a one off script, you | shouldn't manually use them in real world projects. They're a | big source of bugs | ben509 wrote: | Every compiler and interpreter as well as all text formats | like HTML, XML, JSON, etc. have a lexing pass that uses | manually crafted regular expressions. Those are real world | projects. | | The defensible form of this argument is that one should | prefer a serialization library or a properly normalized | database rather than trying to "stringly type" data and then | pull it out via regexes. | wasureru wrote: | "Unless you're trying to write a one off script, you | shouldn't manually use them in real world projects." | | What should one use instead of manual use? | yoz-y wrote: | I've read this a lot but I've yet to encounter bugs in | regexes that wouldn't be otherwise encountered in a parser or | validator that uses other techniques. Code is just code. | thedirt0115 wrote: | I agree that it would be nice if the learner was given some | instruction on how to experiment right away. I'd recommend | https://regex101.com/ as a tool to complement this (or any) | regex tutorial, as well as when you're crafting/debugging | regex. 
It's language-agnostic in that you don't need to write | any code, just the regex and input -- but you can still pick | different regex flavors like PCRE vs JavaScript. | olq wrote: | Nice guide! As a complete regex-dyslectic I didn't know the slash | '/' could be used to make expressions more readable. | | On another note, since this is supposed to be a book and all, is | there a simple way to get this on a single page and make it | easier to print? | catblast wrote: | That's not really true. The guide seems to be JavaScript-centric. | The slashes and flags are not part of the regex; the slash is a | delimiter. In certain contexts like sed, perl, php, an | arbitrary delimiter can be used to avoid needing to escape | slashes. If you pass a string to a regex engine with a /, it | does what you would expect, match a literal /. For instance, | python and grep do not interpret slashes and flags. Those are | pretty common. | olq wrote: | Thanks for the heads up, it seems I have some more reading to | do :) | shreyasminocha wrote: | > is there a simple way to get this on a single page and | make it easier to print? | | That's on my TODO! | | - https://github.com/shreyasminocha/regex-for-regular- | folk/iss... - https://github.com/shreyasminocha/regex-for- | regular-folk/iss... | | I was having some trouble with wkhtmltopdf, but I'll try to | figure this out ASAP. | olq wrote: | I don't know what that is but just having all html pages | merged into one would be enough and super useful for me and | my printing purposes, and probably good for web UX too. Keep | up the good work mate | bogomipz wrote: | This is very nice looking. Especially so given that regexes are | often not easy to look at. Kudos. | vongomben wrote: | Link saved for later. Looks really well done! | Pxtl wrote: | I've always wanted to introduce regexes to non-programmers | because they have always struck me as staggeringly useful. 
Just | simple things like find/replace in my filenames and find/replace | in documents. | | This guide, along with a simple web-based regex tester would be | great for this... | | But it's missing a 3rd part: regex plugins for common non- | programmers tools, like for ms office, the windows explorer, etc. | cjhveal wrote: | Great work! These examples are super clear. | | One thought: it would be great to highlight a given match on | hover. I know that each match has its own undertie and it's | explicitly mentioned early on but it might help really drive it | home if each match reacted individually when hovered. | wonnage wrote: | I personally had the most trouble with regexes because I didn't | have a good mental model of how they worked. The hard part wasn't | finding the correct symbol/character class I was trying to match, | but coming to grips with repetition, greedy/nongreedy, etc. | | I took a compilers class in college where one of the projects was | to implement a simple regex matcher using NFAs. Bashing my head | against this for a week really helped with being able to "read" a | regex. Not sure if this was due to finally understanding the | algorithm, or the fact that I was just constantly staring at | broken regex matches all day. | | IMO it was a fairly small time investment for something that is | so widely used. | | I'll recommend this post that's been on HN many times: | https://swtch.com/~rsc/regexp/regexp1.html | btilly wrote: | I think it is understanding the algorithm that does it. | | I also recommend that people learn how to read a regex by | writing a small recursive program to match specific regexes. | After you look at a regex and think about how it might work, | intuition follows. | | Actually writing the bit that turns the regex expression into | said program isn't as important though. Doing that by hand 5 | times is enough IMO. 
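The "small recursive program" idea from the comments above can be sketched in a few lines. This is only an illustration under stated assumptions: it is a backtracking matcher loosely modelled on the well-known tiny matcher by Rob Pike and Brian Kernighan, not the NFA construction from the linked article, and it supports only literal characters, `.`, `*`, `^` and `$`. The function names are made up for this sketch.

```python
def match(pattern: str, text: str) -> bool:
    """Return True if pattern matches anywhere in text."""
    if pattern.startswith("^"):
        # Anchored: the rest of the pattern must match at position 0.
        return match_here(pattern[1:], text)
    # Unanchored: try every starting position, including the empty suffix.
    for start in range(len(text) + 1):
        if match_here(pattern, text[start:]):
            return True
    return False


def match_here(pattern: str, text: str) -> bool:
    """Return True if pattern matches at the very start of text."""
    if not pattern:
        return True  # nothing left to match
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return text == ""  # end anchor: text must be exhausted
    if text and pattern[0] in (".", text[0]):
        # One character consumed on both sides; recurse on the rest.
        return match_here(pattern[1:], text[1:])
    return False


def match_star(char: str, pattern: str, text: str) -> bool:
    """Match zero or more copies of char, then the rest of pattern."""
    i = 0
    while True:
        if match_here(pattern, text[i:]):
            return True
        if i < len(text) and char in (".", text[i]):
            i += 1  # consume one more repetition and try again
        else:
            return False


print(match("abe", "abcde"))     # False: 'a' and 'b' match, then 'e' != 'c'
print(match("^a.*e$", "abcde"))  # True
```

Stepping through `match("abe", "abcde")` by hand shows the partial progress (a, then b, then a failure at e) that an earlier comment in this thread asked tools to display.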
| the-pigeon wrote: | I'm not sure how to explain it but the most important thing | I've learned in over a decade of programming experience is to | not use regular expressions for many things that may seem like | a regular expression problem. | | For example, even something as simple as a phone number can have | all sorts of weird but valid variants. Be sure you really need | to even validate its format and not just that it's present. | | Trying to handle all of those variants via regex expression is | doable but a pain. And in practice you as the programmer should | not be defining those variants that are valid as it's up to the | business itself to define what type of data it considers to be | valid for the field. | | That said I've also worked for companies with small engineering | teams where the goal has always been to be as efficient with | development time as possible, as opposed to making a near ideal | system. Software has different needs when it's used by a | thousand people than when it's used by millions. | m463 wrote: | I think what regexes need is a really powerful syntax and | language aware regex editor. | | I've been using regexes for most of my career, and still struggle | to get them right on first writing. | | The #1 problem I run into is: | | what is a literal character and what is a control character? | | for example, both these are very common: | | - match a parenthesis character or a period character | | - use a parenthesis to group a match or use a period to match any | one character | | You would think I would learn it once, and be good. | | but my #2 problem confounds this: | | what is a literal character and what is a control character - in | the language I am using? | | for example I might need to escape a period to make it a literal | for a regex. 
| | If I am checking the files filexc and file.c and want to match | the second, the regex I want is ^.*\.c$ | | in perl, I could say: $rx = "^.*\\.c\$"; ($" | is a thing) if ($f =~ /$rx/) { ... | | better would be: if ($f =~ /^.*\.c$/) {... | | in python I would write: m = | re.search("^.*\\.c$",f) | | better would be: m = re.search(r'^.*\.c$', f) | | in a shell script, I might say: grep "^.*\\.c$" | | EDIT: crap, I had to escape _my comment_ because the asterisk in | the regex was making my text italic | mycall wrote: | Recursive RegEx has always been confusing to me. | | https://www.rexegg.com/regex-recursion.html | asicsp wrote: | Railroad diagrams may help. I use a three step approach to | present one example of recursion, which includes showing the | difference between two-level nesting vs recursive version [1]. | This is the railroad visualization for two-level [2]. | | [1] | https://github.com/learnbyexample/py_regular_expressions/blo... | | [2] https://www.debuggex.com/r/SMLRfiyt0Ag2hXu5 | arthurofbabylon wrote: | In case it's useful for others, I found "Mastering Regular | Expressions" by "Jeffrey E. F. Friedl" to be very effective for | becoming proficient with Regex. | | Also - reading it was just a useful look into systems mapping | (which is what language is!) with insights that apply in many | contexts. | phillipseamore wrote: | Very clear and concise with simple presentation. Good job! | cryptoslug wrote: | Thank you so much for making this! At a certain point, on our | team at least, we have to compile regex resources into a guide | and this is incredibly helpful. | bane wrote: | This is a nice, non-threatening intro. One piece of advice that | I've learned from teaching people regex syntax is that it's much | easier to keep it to three basic topics at first (Repetition, | Alternation, Concatenation), and then describe the rest of the | stuff (character classes, character escapes, etc.) 
as "syntactic | sugar" that makes those previous topics simpler, or provides more | power. I usually introduce groups pretty early but most people | get them notionally because they're kind of like algebraic | parentheses. And then I'll expand on groups as well to show more | power, escapes, etc. | | For example, | | (a|c|d|e|f|g|...|z) uses only notional groups and basic | alternation while [abcdefghi...xyz] shows character classes, and | [a-z] shows ranges - each step builds on the previous step and | shows how to make them easier. For the learner this seems to act | as building blocks rather than "separate things that are kind of | alike I need to learn" | | This is similar to how you can talk about repetition as | | aaaaa, then aaaaaaaaaaaaa, then a*, then aa*, then a+, then a?, | a{0,1}, a{0,5}, a{1,5}, a{,10}, etc. which simplifies, then | generalizes the idea of repetition from a very natural concept | built on concatenation to an opaque looking syntax that turns out | to be both general and powerful | | After that, most of the time I need to explain how capturing | works, and how to turn it off and so on. Good tools help here and | it starts to move away from a whiteboard exercise into something | more active. But if students have followed you to this point it | starts to make them feel very powerful as they're suddenly | parsing things apart and transforming them. | | At the end I usually follow up with a bit on anchors (^ and $) | and other odds and ends (case insensitivity, global search, | greedy and non-greedy, etc.) and usually turn people loose after | that. I've rarely found people who actually need lookarounds and | other advanced topics and those are usually covered one-on-one as | they need. | | But this is fairly minor quibbling and is just rearrangement of | what's here. I think this is overall a nice clear explanation. | Regex syntax is honestly pretty simple once the syntax magic is | explained. 
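The "each step is sugar for the previous one" progression described above can be checked mechanically. The following is a minimal sketch using Python's standard `re` module; the helper name `same_matches` and the sample strings are made up for illustration, and agreement on a handful of samples is of course not a proof of equivalence.

```python
import re


def same_matches(pattern_a: str, pattern_b: str, samples: list[str]) -> bool:
    """True if both patterns fully match exactly the same sample strings."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )


letters = ["a", "b", "c", "d", "", "ab"]
runs = ["", "a", "aa", "aaaaa", "b", "ab"]

# Alternation -> character class -> range: three spellings of one idea.
alternation_vs_class = same_matches(r"(a|b|c)", r"[abc]", letters)
class_vs_range = same_matches(r"[abc]", r"[a-c]", letters)

# Repetition shorthands and their counted equivalents.
optional_vs_counted = same_matches(r"a?", r"a{0,1}", runs)
plus_vs_counted = same_matches(r"a+", r"a{1,}", runs)
star_vs_counted = same_matches(r"a*", r"a{0,}", runs)

print(alternation_vs_class, class_vs_range,
      optional_vs_counted, plus_vs_counted, star_vs_counted)
```

`re.fullmatch` is used so that each pattern has to account for the whole sample, which keeps the comparison about the construct itself rather than about where in the string a match happens to start.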
| | What I think would be really helpful is a tool where somebody can | type in a regex, have it checked for syntax and then generate the | list of strings that would match it (within the constraints of | limits on infinite repetition operators, like turning * to {0,2} | or something). | LeonB wrote: | I like the example based approach. I learn from examples far | quicker than I learn from "explanations". If I attempt to learn | from an example and my brain hits an exception, only then do I | start reading the supporting text. | | Nice approach. You've made a valuable thing and implemented a | powerful idea. | ehsankia wrote: | I honestly wish a lot more documentation started like that | with a bunch of examples. I think one I really enjoyed recently | was attrs [0]. | | [0] https://www.attrs.org/en/stable/examples.html | kmundnic wrote: | I've been using tldr [1] instead of man pages lately to get | started with a command (or to remind myself how to use one). | I've learned a lot just by reading the examples shown, and | then read the man pages if I am missing something. | | [1] https://github.com/tldr-pages/tldr | seph-reed wrote: | Examples being greater than explanations is one of the main | reasons I emphasize explanative error messaging, clear simple | typings, and verbose function/variable names over | documentation. | | Docs are really good for discovery and should cover many topics | shallowly so you can glean a big picture quickly. I generally | don't like going to them for specs that could have just been an | error message, a type, or a better naming convention. | nobrains wrote: | Hi, very nice RegEx educational site/book. | | Feedback: | | - In the chapter | https://refrf.shreyasminocha.me/chapters/character-classes an | example is given which uses: o ^ character | outside brackets o $ end of line o + | | But the explanation above does not introduce these yet, so a real | beginner user (like me) is lost. 
The ambiguous characters example | is fine, since it uses all the concepts already explained. | shreyasminocha wrote: | Thanks. Yeah, others pointed this out too. I'll get to it soon! | tragomaskhalos wrote: | Final exam here: https://www.i-programmer.info/news/144-graphics- | and-games/54... | | :) | airstrike wrote: | This is truly wonderful haha Thank you for sharing | donaldihunter wrote: | And https://regexcrossword.com/ | shreyasminocha wrote: | Yep! I included it on the next steps page -- | https://refrf.shreyasminocha.me/chapters/next-steps. Great | fun. | asicsp wrote: | Neatly presented. | | However, I'd suggest to reorganize the chapters so that features | not yet introduced aren't shown in examples without explanations. | For example, you explain anchors and quantifiers many chapters | later but use them liberally in earlier chapters without | explaining them. | shreyasminocha wrote: | Yep, thanks for pointing that out. I was finding it tricky to | present features in isolation without making the examples | trivial. | | I'll work on making things clearer. | nicoburns wrote: | I wouldn't worry too much about making the examples trivial. | That just makes it easy to learn! There are probably lots of | good orders, but I'd probably go something like: | | - Literal strings - Optional characters - Optional strings of | characters (using groups) - Alternations (using groups) - | Repetitions (using groups) | | Then move onto to things like character classes. | | IMO character classes are quite an advanced feature (or at | least confusing for beginners) because of being character | orientated. They also don't tend to be very useful unless you've | already covered repetition. | davinic wrote: | Yes - in the Character classes chapter you had just | introduced the negate operator, then in the next example use | the beginning operator, which happens to be the same | character in a different context. That might be a leap too | far. Great resource! 
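The double life of `^` raised above (negation directly after `[`, start anchor outside brackets) is easy to see side by side. A small sketch using Python's standard `re` module; the sample strings are arbitrary and chosen only for illustration.

```python
import re

# Outside brackets, ^ is an anchor: match "a" only at the start of the string.
anchored = re.findall(r"^a", "apple avocado banana")

# Directly after "[", ^ negates the class: any character that is not "a".
negated = re.findall(r"[^a]", "banana")

# Anywhere else inside brackets, ^ is just a literal caret.
literal = re.findall(r"[a^]", "a^b")

print(anchored)  # ['a']
print(negated)   # ['b', 'n', 'n']
print(literal)   # ['a', '^']
```

Without the `re.MULTILINE` flag, the anchored pattern only ever matches at the very beginning of the input, which is why the later words starting with "a" are not reported.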
| [deleted] | [deleted] | tuan wrote: | Related: I found this online tool to be very useful for debugging | regex: https://www.debuggex.com/ | johnnythunder wrote: | This is literally the best RegEx tutorial that I've ever gone | through. | pmarreck wrote: | One thing not mentioned here which I think is good to be aware of | as you write intermediate to advanced regexes is understanding | "catastrophic backtracking" and how to mitigate it: | https://www.regular-expressions.info/catastrophic.html | | For some reason I enjoy figuring regexes out. What I usually do | is TDD them, I have a mini test suite of examples of strings I | want to match and strings I don't want to match and I write some | code to apply a candidate regex to them all and validate, and | then I iterate until it passes. Then I rewrite the regex in | extended regex format and add comments so that _other people or | future me_ understand what's going on. | | Doing what a good regex can do with regular code instead (which | you might do with the goal of readability or maintainability) is | usually much much MUCH slower, FYI | backzerman wrote: | Constructive criticism: I was about to send this to a friend who | is new to programming, but the introduction is just too short. It | would be great if the introduction included one or two | motivational examples for the types of trouble you run into when | you _don't_ know regular expressions. | shreyasminocha wrote: | Makes sense--thank you! I'll add that in. | twicetwice wrote: | This looks like a great resource! Like others, I vastly prefer an | example-based style, and the examples are really well chosen and | very illustrative. I generally think I know my regexes, but I've | already learned a few tricks. (Backreferences to match different | delimiter options but not mixed delimiters is very cool!) | | Feedback: | | The highlighting of matches is slightly shifted to the left for | me in Firefox 75 but not in Chrome (both on Ubuntu 16.04). 
The | shift is subtle but enough to make me have to look two or three | times at most examples, as the highlight covers half of the | character before the match and only half of the last character in | the match. Can I suggest adding Firefox to your test regimen, if | you haven't already? :) | | Also, on the Anchors page, I believe "carat" should be spelled | "caret." | | Thanks for this once again! I will definitely be revisiting this | site to brush up and learn new tricks. Especially lookaround, | which I have never quite wrapped my head around! | shreyasminocha wrote: | > The highlighting of matches is slightly shifted to the left | ... Can I suggest adding Firefox to your test regimen, if you | haven't already? :) | | Oh, I thought I had fixed that. I primarily test with Firefox, | so this is a bit of a surprise. I'll check it out--I think it's | something to do with CSS's `letter-spacing`. | | I've fixed the typo, thanks for pointing it out. | | Thanks for the comments! ___________________________________________________________________ (page generated 2020-05-01 23:00 UTC)