[HN Gopher] In search of the perfect URL validation regex (2010)
___________________________________________________________________

  In search of the perfect URL validation regex (2010)

  Author : Jonhoo
  Score  : 105 points
  Date   : 2021-09-25 17:28 UTC (5 hours ago)

  (HTM) web link (mathiasbynens.be)
  (TXT) w3m dump (mathiasbynens.be)

 | mpeg wrote:
 | I was once failed in a technical interview, partly because on the
 | coding test I was asked to write a URL parser "from scratch, the
 | way a browser would do it", and I explained that it would take way
 | too long to account for every edge case in the URL RFC, but that I
 | could do a quick-and-dirty approach for common URLs.
 |
 | After I did this, the interviewer stopped me and told me, in a
 | negative way, that he had expected me to use a regex, which kind of
 | shows he had no idea how a web browser works.
 | axiosgunnar wrote:
 | How do browsers parse URLs then?
 | wolfgang42 wrote:
 | Here's a polyfill for the JS URL() interface, which should
 | give you a taste:
 | https://github.com/zloirock/core-js/blob/272ac1b4515c5cfbf34...
 | (I tried finding the one in Firefox, but I couldn't actually work
 | out where it started; this one is much easier to follow.)
 |
 | TLDR: it's a traditional parser -- a big state machine that
 | steps through the URL character by character and tokenizes it
 | into the relevant pieces.
 | djur wrote:
 | There's actually a standard for it these days:
 |
 | https://url.spec.whatwg.org/#url-parsing
 | specialist wrote:
 | Ya. I've also suffered copypasta trials administered by bar
 | raisers, Mensa members, and other self-appointed keepers of the
 | sacred nerd flame.
 |
 | My imagined remedies are no 1:1 interviews, and recording these
 | sessions for "possible quality assurance and training
 | purposes".
 | elif wrote:
 | How would you even "parse" a URL with a regex? Dynamically
 | defined named subpatterns for each URL parameter?
I think the
 | best I could do on paper with a regex is say "yup, this is a
 | URL" or maybe "yup, I can count the number of params".
 |
 | Unless it was a specific URL with specific params?
 | tyingq wrote:
 | I assume they meant _"some regex implementation, including
 | replace and/or match groups"_.
 |
 | Like, for just the params part (yes, broken and simplistic):
 |
 |     #!/usr/bin/perl
 |     $_ = "a=b&c=d&e=f&whatever=some thing";
 |     while (s/^([^&]*)=([^&]*)(&|$)//) {
 |         print "[$1] [$2]\n";
 |     }
 |
 | chrismorgan wrote:
 | Match groups, so you can split it up into scheme, username,
 | password, host, port, path, query, and fragment. Not difficult to
 | approximate, though for best results with diverse schemes
 | you'd want an engine that allows repeated named groups, and I
 | don't know of any that do (JavaScript and Python don't).
 | powersnail wrote:
 | Python's `regex` package does allow repeated named groups.
 | mercora wrote:
 | It's not very likely that this is what's happening here, but I
 | feel like this could be done on purpose, to see how you act in
 | this kind of situation. It kind of tells how you would act once
 | you inevitably get into a conflict with colleagues arguing over
 | stuff like that.
 | codetrotter wrote:
 | In that case I think the proper response should be: "I am
 | very sure that browsers don't do it that way. But let's have
 | a look." And then pull up the source code for Chromium and
 | Firefox. Assuming it's not whiteboard-only.
 |
 | And if they still insist even after the source of Chromium
 | and FF has been consulted? Well, then it's time to leave.
 | Don't want to work with anyone like that.
 | axiosgunnar wrote:
 | How do browsers parse URLs then?
 | Sephr wrote:
 | See https://chromium.googlesource.com/chromium/src/+/HEAD/url/#c...
 | tapland wrote:
 | Or it could be one of those outsourced interviews.
 | [deleted]
 | jhgb wrote:
 | Did you point out that his two requirements were contradictory?
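The match-group approach chrismorgan describes above can be sketched as one regex with named capture groups that splits an absolute URL into scheme, userinfo, host, port, path, query, and fragment. The pattern and group names below are illustrative, not the WHATWG algorithm:

```javascript
// Hypothetical sketch: approximate an absolute URL's structure with
// named capture groups. Real browsers use a state-machine parser instead.
const URL_PARTS = new RegExp(
  "^(?<scheme>[a-z][a-z0-9+.-]*)://" +            // scheme, e.g. "http"
  "(?:(?<user>[^:@/]*)(?::(?<pass>[^@/]*))?@)?" + // optional user:pass@
  "(?<host>[^:/?#]+)" +                           // host
  "(?::(?<port>\\d+))?" +                         // optional :port
  "(?<path>/[^?#]*)?" +                           // optional /path
  "(?:\\?(?<query>[^#]*))?" +                     // optional ?query
  "(?:#(?<frag>.*))?$",                           // optional #fragment
  "i"
);

const m = "http://uid:pw@example.com:8080/a/b?q=1#top".match(URL_PARTS);
// m.groups.scheme === "http", m.groups.host === "example.com",
// m.groups.port === "8080", m.groups.query === "q=1"
```

As the thread notes, this only approximates: diverse schemes, IPv6 hosts, and the browser's normalization rules all fall outside what a pattern like this captures.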
| maciejgryka wrote:
 | Using https://regex.help/, I got this beauty, which passes all the
 | ones that should pass. Obviously some room for improvement ;)
 | But it works!
 |
 |     ^(?:http(?:(?:://(?:(?:(?:code\.google\.com/events/#&product=browser|\-\.~_!\$&'\(\)\*\+,;=:%40:80%2f::::::@ex\.com|foo\.(?:bar/\?q=Test%20URL\-encoded%20stuff|com/(?:\(something\)\?after=parens|unicode_\(\)_in_parens|b_(?:\(wiki\)(?:_blah)?#cite\-1|b(?:_\(wiki\)_\(again\)|/))))|uid(?::password@ex\.com(?::8080)?/|@ex\.com(?::8080)?/)|www\.ex\.com/wpstyle/\?p=364|223\.255\.255\.254|udaahrnn\.priikssaa|1(?:42\.42\.1\.1/|337\.net)|mthl\.khtbr|df\.ws/123|a\.b\-c\.de|\.ws/|[?]\.ws/|Li Zi \.Ce Shi |j\.mp)|142\.42\.1\.1:8080/)|\.damowmow\.com/)|s://(?:www\.ex\.com/foo/\?bar=baz&inga=42&quux|foo_bar\.ex\.com/))|://(?:uid(?::password@ex\.com(?::8080)?|@ex\.com(?::8080)?)|foo\.com/b_b(?:_\(wiki\))?|[?]\.ws))|ftp://foo\.bar/baz)$
 |
 | I had to replace some words with shorter ones to squeeze under
 | the 1000-char limit, and there's no way to provide negative examples
 | right now. Something to fix!
 | axiosgunnar wrote:
 | > [?]\.ws
 |
 | I guess this is the regex equivalent of overfitting :)
 | saghm wrote:
 | Yeah, not to mention "code.google.com" being right in there!
 | maciejgryka wrote:
 | Yeah, grex (the library powering this) is really cool, but
 | doesn't generalize very well. I'm sure there are ways to
 | improve it, but it's not a trivial thing to do.
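The generate-from-examples idea behind the pattern above can be reduced to its degenerate core (a hypothetical sketch, far simpler than what grex actually does): escape each positive example and alternate them. The result "overfits" in exactly the sense discussed: it matches the examples and nothing else.

```javascript
// Build a regex that matches exactly the given positive examples by
// escaping each one and joining them with alternation.
const escapeRe = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const fromExamples = (examples) =>
  new RegExp("^(?:" + examples.map(escapeRe).join("|") + ")$");

const re = fromExamples(["http://a.ws/", "http://j.mp"]);
re.test("http://a.ws/"); // true: seen verbatim
re.test("http://a.ws");  // false: no generalization at all, missing slash
```

A real generator's value lies in merging common prefixes and classes of characters, which is where the hard (and failure-prone) generalization work happens.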
| Sephr wrote:
 | > Assume that this regex will be used for a public URL shortener
 | written in PHP, so URLs like http://localhost/, //foo.bar/,
 | ://foo.bar/, data:text/plain;charset=utf-8,OHAI and
 | tel:+1234567890 shouldn't pass (even though they're technically
 | valid)
 |
 | At Transcend, we need to allow site owners to regulate any
 | arbitrary network traffic, so our data flow input UI [1] was
 | designed to detect all valid hosts (including local hosts, IDN,
 | IPv6 literal addresses, etc.) and URLs (host-relative, protocol-
 | relative, and absolute). If the site owner inputs content that is
 | not a valid host or URL, then we treat their input as a regex.
 |
 | I came up with these simple utilities, built on top of the URL
 | interface standard [2], to detect all valid hosts & URLs:
 |
 | * isValidHost:
 | https://gist.github.com/eligrey/6549ad0a635fa07749238911b429...
 |
 | Example valid inputs:
 |     host.example
 |     hazimeyou.minna (IDN domain; xn--p8j9a0d9c9a.xn--q9jyb4c)
 |     [::1] (IPv6 address)
 |     0xdeadbeef (IPv4 address; 222.173.190.239)
 |     123.456 (IPv4 address; 123.0.1.200)
 |     123456789 (IPv4 address; 7.91.205.21)
 |     localhost
 |
 | * isValidURL (and isValidAbsoluteURL):
 | https://gist.github.com/eligrey/443d51fab55864005ffb3873204b...
 |
 | Example valid inputs to isValidURL:
 |     https://absolute-url.example
 |     //relative-protocol.example
 |     /relative-path-example
 |
 | 1. https://docs.transcend.io/docs/configuring-data-flows
 |
 | 2. https://developer.mozilla.org/en-US/docs/Web/API/URL
 | mercora wrote:
 | While not terribly important, or outright not required, this
 | fails (treats URLs as regex) for link-local addresses with a
 | device identifier (zone-id) applied, like
 | "[fe80::8caa:8cff:fe80:ff32%eth0]", although that would need to
 | be fixed in the standard if it's desired :)
 |
 | I've found some reasoning[0] as to why it's not supported, with
 | browsers in mind, though.
|
 | [0] https://www.w3.org/Bugs/Public/show_bug.cgi?id=27234#c2
 | gregsadetsky wrote:
 | I was just struggling with this -- specifically, our users' "UX"
 | expectation that entering "example.com" should work when asked
 | for their website URL.
 |
 | Most URL validation rules/regexes/libraries/etc. reject
 | "example.com". However, if you head over to Stripe (for example),
 | in the account settings, when asked for your company's URL,
 | Stripe will accept "example.com" and assume "http://" as the
 | prefix (which, yes, can have its own problems).
 |
 | What's a good solution? I both want to validate URLs and let
 | users enter "example.com". But if I simply do
 |
 |     if (validateURL(url)) {
 |         return true;
 |     } else if (validateURL("http://" + url)) {
 |         return true;
 |     } else {
 |         return false;
 |     }
 |
 | i.e. validate the given URL and, as a fallback, try to validate
 | "http://" + the given URL, that opens the door to weird, non-URL
 | strings being incorrectly validated...
 |
 | Help :-)
 | Rebelgecko wrote:
 | This could potentially be abused, but you could actually try to
 | resolve the DNS to determine if it's valid (could be weird for
 | some cases like localhost or IP addresses). Or just do a "curl
 | https://whatever.com" and see what happens (assuming that all
 | of the websites are running a webserver, although idk if that
 | is true in your situation).
 | dilatedmind wrote:
 | I would suggest biasing your implementation against false
 | negatives. They can always come back and update it if it's
 | wrong, and their URL could just as easily be "valid" but
 | incorrect, e.g. any typo in a domain name.
 |
 | If it's really important, you could try making a request to the
 | URL and seeing if it loads, but that still doesn't validate that
 | it's the URL they intended to input.
 |
 | It might be cool to load the URL with puppeteer and capture a
 | screenshot of the page. If they can't recognize their own
 | website, it's on them.
 | anderskaseorg wrote:
 | Parse, don't validate.
If you need a heuristic that accepts
 | non-URL strings as if they were valid URLs, you should
 | _convert_ those non-URL strings to valid URLs so the rest of
 | your code can just deal with valid URLs.
 |
 |     if (validateURL(url)) {
 |         return url;
 |     } else if (validateURL("http://" + url)) {
 |         return "http://" + url;
 |     } else {
 |         return null;
 |     }
 |
 | JadeNB wrote:
 | I know we're not golfing, but it pains me to see that
 | repetition in the middle. Mightn't we write
 |
 |     if (!validateURL(url)) {
 |         url = "http://" + url;
 |         if (!validateURL(url)) {
 |             url = null;
 |         }
 |     }
 |     return url;
 |
 | to snip a small probability of a bug?
 | wolfgang42 wrote:
 | I find that branchiness (and mutation of the variable)
 | harder to follow. Personally, I'd just take "parse, don't
 | validate" to its logical conclusion and go for:
 |
 |     const parseUrl = url => validateUrl(url) ? url : null;
 |     return parseUrl(url) || parseUrl('http://' + url) || null;
 |
 | lelandfe wrote:
 | Address validators for online checkout are notoriously
 | inaccurate, though they still help a lot. You just have to
 | prompt the user: "Did you mean 123 Example St?"
 |
 | I'd probably do the same for poorly formatted URLs. When the
 | user hits Submit, a prompt appears saying, "Did you mean
 | `https://example.com`?"
 | dang wrote:
 | Two past discussions, for the curious:
 |
 | _In search of the perfect URL validation regex_ -
 | https://news.ycombinator.com/item?id=10019795 - Aug 2015 (77
 | comments)
 |
 | _In search of the perfect URL validation regex_ -
 | https://news.ycombinator.com/item?id=7928968 - June 2014 (81
 | comments)
 | dmix wrote:
 | @stephenhay seems to be the winner here if you don't need IP
 | addresses (or weird dashed URLs). It's only 38 characters long
 | and easy to understand:
 |
 |     @^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS
 |
 | The simpler the better, if you're going to use something that is
 | not ideal.
 | loloquwowndueo wrote:
 | Doesn't cover mailto:, which is fairly common. To be
 | pedantic/strict, mailto: addresses are URIs, not URLs.
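The @stephenhay pattern quoted above is written with PHP's `@...@iS` delimiters. Transcribed into a JavaScript regex literal (a sketch; PHP's `S` "study" modifier has no JS equivalent, and only the case-insensitive flag carries over), its behavior is easy to probe:

```javascript
// dmix's @stephenhay pattern with the PHP delimiters and S modifier removed.
const STEPHENHAY = /^(https?|ftp):\/\/[^\s\/$.?#].[^\s]*$/i;

STEPHENHAY.test("http://example.com/");     // true
STEPHENHAY.test("ftp://foo.bar/baz");       // true
STEPHENHAY.test("http://.example.com/");    // false: host can't start with "."
STEPHENHAY.test("mailto:user@example.com"); // false: scheme isn't http(s)/ftp
```

The last case is loloquwowndueo's point: mailto: is excluded by construction, which is either a feature or a bug depending on your use case.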
| dmurray wrote:
 | > I also don't want to allow every possible technically valid URL
 | -- quite the opposite.
 |
 | Well, that should make things a lot easier. What does he mean
 | here? The rest of the text doesn't make it clear to me, unless
 | it's meant to be "every possibly valid HTTP, HTTPS, or FTP URL",
 | which isn't exactly "the opposite".
 | lelandfe wrote:
 | The next paragraph _might_ be that clarification, although I
 | agree it isn't totally clear what he meant there:
 |
 | > Assume that this regex will be used for a public URL
 | shortener written in PHP, so URLs like http://localhost/,
 | //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and
 | tel:+1234567890 shouldn't pass (even though they're technically
 | valid). Also, in this case I only want to allow the HTTP, HTTPS
 | and FTP protocols.
 | jabo wrote:
 | Tangentially related, but mentioning it to hopefully save someone
 | time: if you ever find yourself wanting to check whether a version
 | string is semver or not, before inventing your own, there is an
 | official regex that's provided.
 |
 | I just discovered this yesterday and I'm glad I didn't have to
 | come up with it myself:
 |
 | https://semver.org/#is-there-a-suggested-regular-expression-...
 |
 | My use case for it:
 | https://github.com/typesense/typesense-website/blob/25562d02...
 | azalemeth wrote:
 | Honest question: there is a famous and very funny Stack Exchange
 | answer on the topic of parsing HTML with a regex [1] that states
 | that the problem is in general impossible, and that if you find
 | yourself doing this, something has gone wrong and you should re-
 | evaluate your life choices / pray to Cthulhu.
 |
 | So, does this apply to URLs? The fact that these regexes
 | are... so huge... makes me think that something is fundamentally
 | wrong. Are URLs describable in a Chomsky Type-3 grammar? Are they
 | sufficiently regular that using a regex is sensible? What do the
 | actual browsers do?
 |
 | [1] https://stackoverflow.com/questions/1732348/regex-match-open...
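On azalemeth's closing question (what do the actual browsers do?): they follow the WHATWG URL Standard's state-machine parser rather than any regex. Node ships the same URL class, so browser-style parsing can be probed from a script; a minimal sketch:

```javascript
// WHATWG URL parsing via the built-in URL class: attempt a parse and
// treat a throw as "invalid". This accepts inputs that RFC-flavored
// regexes typically reject.
const parses = (s) => {
  try { new URL(s); return true; } catch { return false; }
};

parses("http://example.com./"); // true: trailing-dot FQDN is accepted
parses("http:///example.com/"); // true: extra slashes are normalized away
parses("example.com");          // false: no scheme, so not an absolute URL
```

Note that the parser also normalizes as it goes (lowercasing hosts, percent-encoding, collapsing slashes), which is something a pure yes/no regex cannot do.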
| likium wrote:
 | Even if you built a URL validation regex that follows RFC 3986 [1]
 | and RFC 3987 [2], you would still get user bug reports, because web
 | browsers follow a different standard.
 |
 | For example, <http://example.com./>, <http:///example.com/> and
 | <https://en.wikipedia.org/wiki/Space (punctuation)> are
 | classified as invalid URLs in the blog post, but they are accepted
 | by browsers.
 |
 | As the creator of cURL puts it, there is no URL standard [3].
 |
 | [1]: https://www.ietf.org/rfc/rfc3986.txt
 |
 | [2]: https://www.ietf.org/rfc/rfc3987.txt
 |
 | [3]: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/
 | jt2190 wrote:
 | There's also a question of what we're _really_ trying to
 | validate, IMHO. All of these regex patterns will tell you that
 | a string looks like a URL, but they won't actually tell you:
 | whether there's any web server listening at that particular URL;
 | whether that server has the resource at that location; whether
 | that server is reachable from where you want to fetch it; etc.
 | yyyk wrote:
 | <http://example.com./> is a valid URL; see, for example:
 |
 | https://jdebp.uk/FGA/web-fully-qualified-domain-name.html
 | MildlySerious wrote:
 | Tangentially, YouTube had a bug surface last year where
 | adding that extra dot let you avoid all ads. Previous
 | discussion[1]
 |
 | [1] https://news.ycombinator.com/item?id=23479435
 | dhsysusbsjsi wrote:
 | Also nearly every paywalled media site.
 | Sephr wrote:
 | There might not have been a generally accepted standard then,
 | but there is now: https://url.spec.whatwg.org/
 | queuebert wrote:
 | Uh oh, regex is approaching sentience.
 | MaxBarraclough wrote:
 | Every known sentient being is a finite state machine. Every
 | finite state machine corresponds to a regular expression, and
 | vice versa.
 | JadeNB wrote:
 | > Every known sentient being is a finite state machine.
 |
 | I know this is just a cutesy slogan, but how could you
 | possibly know whether a living creature is a finite state
 | machine?
What would it even mean? I know I don't respond
 | identically to identical stimuli presented on different
 | occasions...
 | throwamon wrote:
 | Obnoxious -- I mean, trivial -- answer: just make "occasions" a
 | variable. Assuming your lifetime is finite, you could
 | simply assign each point in time to a value, and there you
 | have it: a finite mapping from each moment to a state.
 | MaxBarraclough wrote:
 | > I know this is just a cutesy slogan
 |
 | Mostly, yes, but I do think there's a real point here as
 | well.
 |
 | > how could you possibly know whether a living creature is
 | a finite state machine?
 |
 | As I understand it, physicists don't really know whether
 | the physical world has a finite number of states or an
 | infinite number. I think they tend to lean toward finite,
 | though.
 |
 | Even if it's infinite, I doubt it's of consequence. That is
 | to say, I doubt that sentience depends on the physical
 | possibility of an infinite number of states. (Of course, if
 | it turns out the physical world only has a finite number of
 | states, that demonstrates that sentience is compatible with
 | the finite-states constraint.)
 |
 | > What would it even mean?
 |
 | Systems can be modelled as finite state machines. Sentient
 | entities like people are extremely sophisticated systems,
 | but that's just a matter of degree, not of category.
 |
 | > I know I don't respond identically to identical stimuli
 | presented on different occasions
 |
 | Right, because you're in a different state. You'll never be
 | in the same state twice. We don't need to resort to
 | non-determinism.
___________________________________________________________________
(page generated 2021-09-25 23:00 UTC)