[HN Gopher] In search of the perfect URL validation regex (2010)
___________________________________________________________________

  In search of the perfect URL validation regex (2010)

  Author : Jonhoo
  Score  : 105 points
  Date   : 2021-09-25 17:28 UTC (5 hours ago)

  (HTM) web link (mathiasbynens.be)
  (TXT) w3m dump (mathiasbynens.be)

 | mpeg wrote:
 | I was once failed in a technical interview, partly because on the
 | coding test I was asked to write a URL parser "from scratch, the
 | way a browser would do it", and I explained that it would take way
 | too long to account for every edge case in the URL RFC, but that I
 | could do a quick-and-dirty approach for common URLs.
 |
 | After I did this, the interviewer stopped me and told me, in a
 | negative way, that he had expected me to use a regex, which kind of
 | shows he had no idea how a web browser works.
 | axiosgunnar wrote:
 | How do browsers parse URLs then?
 | wolfgang42 wrote:
 | Here's a polyfill for the JS URL() interface, which should
 | give you a taste:
 | https://github.com/zloirock/core-js/blob/272ac1b4515c5cfbf34...
 | (I tried finding the one in Firefox, but I couldn't actually work
 | out where it started; this one is much easier to follow.)
 |
 | TLDR: it's a traditional parser -- a big state machine that
 | steps through the URL character by character and tokenizes it
 | into the relevant pieces.
 | djur wrote:
 | There's actually a standard for it these days:
 |
 | https://url.spec.whatwg.org/#url-parsing
 | specialist wrote:
 | Ya. I've also suffered copypasta trials administered by bar
 | raisers, Mensa members, and other self-appointed keepers of the
 | sacred nerd flame.
 |
 | My imagined remedies are no 1:1 interviews, and recording these
 | sessions for "possible quality assurance and training
 | purposes".
 | elif wrote:
 | How would you even "parse" a URL with a regex? Dynamically
 | defined named subpatterns for each URL parameter?
I think the
 | best I could do on paper with a regex is say "yup, this is a
 | URL" or maybe "yup, I can count the number of params".
 |
 | Unless it was a specific URL with specific params?
 | tyingq wrote:
 | I assume they meant _"some regex implementation, including
 | replace and/or match groups"_.
 |
 | Like, for just the params part (yes, broken and simplistic):
 |
 |     #!/usr/bin/perl
 |     $_ = "a=b&c=d&e=f&whatever=some thing";
 |     while (s/^([^&]*)=([^&]*)(&|$)//) {
 |         print "[$1] [$2]\n";
 |     }
 |
 | chrismorgan wrote:
 | Match groups, so you can split it up into scheme, username,
 | password, host, port, path, query, and fragment. Not difficult to
 | approximate, though for best results with diverse schemes
 | you'd want an engine that allows repeated named groups, and I
 | don't know of any that do (JavaScript and Python don't).
 | powersnail wrote:
 | Python's `regex` package does allow repeated named groups.
 | mercora wrote:
 | It's not very likely that this is what's happening here, but I
 | feel like this could be done on purpose, to see how you act in
 | this kind of situation. It kind of tells how you would act once
 | you inevitably get into a conflict with colleagues arguing over
 | stuff like that.
 | codetrotter wrote:
 | In that case I think the proper response should be: "I am
 | very sure that browsers don't do it that way. But let's have
 | a look." And then pull up the source code for Chromium and
 | Firefox. Assuming it's not whiteboard-only.
 |
 | And if they still insist even after the source of Chromium
 | and FF has been consulted? Well, then it's time to leave.
 | Don't want to work with anyone like that.
 | axiosgunnar wrote:
 | How do browsers parse URLs then?
 | Sephr wrote:
 | See https://chromium.googlesource.com/chromium/src/+/HEAD/url/#c...
 | tapland wrote:
 | Or it could be one of those outsourced interviews.
 | [deleted]
 | jhgb wrote:
 | Did you point out that his two requirements were contradictory?
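The match-group approach chrismorgan describes above can be sketched as one regex with named capture groups that splits an absolute URL into scheme, userinfo, host, port, path, query, and fragment. The pattern and group names below are illustrative, not the WHATWG algorithm:

```javascript
// Hypothetical sketch: approximate an absolute URL's structure with
// named capture groups. Real browsers use a state-machine parser instead.
const URL_PARTS = new RegExp(
  "^(?<scheme>[a-z][a-z0-9+.-]*)://" +            // scheme, e.g. "http"
  "(?:(?<user>[^:@/]*)(?::(?<pass>[^@/]*))?@)?" + // optional user:pass@
  "(?<host>[^:/?#]+)" +                           // host
  "(?::(?<port>\\d+))?" +                         // optional :port
  "(?<path>/[^?#]*)?" +                           // optional /path
  "(?:\\?(?<query>[^#]*))?" +                     // optional ?query
  "(?:#(?<frag>.*))?$",                           // optional #fragment
  "i"
);

const m = "http://uid:pw@example.com:8080/a/b?q=1#top".match(URL_PARTS);
// m.groups.scheme === "http", m.groups.host === "example.com",
// m.groups.port === "8080", m.groups.query === "q=1"
```

As the thread notes, this only approximates: diverse schemes, IPv6 hosts, and the browser's normalization rules all fall outside what a pattern like this captures.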
| maciejgryka wrote:
 | Using https://regex.help/, I got this beauty, which passes all the
 | ones that should pass. Obviously some room for improvement ;)
 | But it works!
 |
 |     ^(?:http(?:(?:://(?:(?:(?:code\.google\.com/events/#&product=browser|\-\.~_!\$&'\(\)\*\+,;=:%40:80%2f::::::@ex\.com|foo\.(?:bar/\?q=Test%20URL\-encoded%20stuff|com/(?:\(something\)\?after=parens|unicode_\(\)_in_parens|b_(?:\(wiki\)(?:_blah)?#cite\-1|b(?:_\(wiki\)_\(again\)|/))))|uid(?::password@ex\.com(?::8080)?/|@ex\.com(?::8080)?/)|www\.ex\.com/wpstyle/\?p=364|223\.255\.255\.254|udaahrnn\.priikssaa|1(?:42\.42\.1\.1/|337\.net)|mthl\.khtbr|df\.ws/123|a\.b\-c\.de|\.ws/|[?]\.ws/|Li Zi \.Ce Shi |j\.mp)|142\.42\.1\.1:8080/)|\.damowmow\.com/)|s://(?:www\.ex\.com/foo/\?bar=baz&inga=42&quux|foo_bar\.ex\.com/))|://(?:uid(?::password@ex\.com(?::8080)?|@ex\.com(?::8080)?)|foo\.com/b_b(?:_\(wiki\))?|[?]\.ws))|ftp://foo\.bar/baz)$
 |
 | I had to replace some words with shorter ones to squeeze under
 | the 1000-char limit, and there's no way to provide negative examples
 | right now. Something to fix!
 | axiosgunnar wrote:
 | > [?]\.ws
 |
 | I guess this is the regex equivalent of overfitting :)
 | saghm wrote:
 | Yeah, not to mention "code.google.com" being right in there!
 | maciejgryka wrote:
 | Yeah, grex (the library powering this) is really cool, but
 | doesn't generalize very well. I'm sure there are ways to
 | improve it, but it's not a trivial thing to do.
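The generate-from-examples idea behind the pattern above can be reduced to its degenerate core (a hypothetical sketch, far simpler than what grex actually does): escape each positive example and alternate them. The result "overfits" in exactly the sense discussed: it matches the examples and nothing else.

```javascript
// Build a regex that matches exactly the given positive examples by
// escaping each one and joining them with alternation.
const escapeRe = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const fromExamples = (examples) =>
  new RegExp("^(?:" + examples.map(escapeRe).join("|") + ")$");

const re = fromExamples(["http://a.ws/", "http://j.mp"]);
re.test("http://a.ws/"); // true: seen verbatim
re.test("http://a.ws");  // false: no generalization at all, missing slash
```

A real generator's value lies in merging common prefixes and classes of characters, which is where the hard (and failure-prone) generalization work happens.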
| Sephr wrote:
 | > Assume that this regex will be used for a public URL shortener
 | written in PHP, so URLs like http://localhost/, //foo.bar/,
 | ://foo.bar/, data:text/plain;charset=utf-8,OHAI and
 | tel:+1234567890 shouldn't pass (even though they're technically
 | valid)
 |
 | At Transcend, we need to allow site owners to regulate any
 | arbitrary network traffic, so our data flow input UI [1] was
 | designed to detect all valid hosts (including local hosts, IDN,
 | IPv6 literal addresses, etc.) and URLs (host-relative, protocol-
 | relative, and absolute). If the site owner inputs content that is
 | not a valid host or URL, then we treat their input as a regex.
 |
 | I came up with these simple utilities, built on top of the URL
 | interface standard [2], to detect all valid hosts & URLs:
 |
 | * isValidHost:
 | https://gist.github.com/eligrey/6549ad0a635fa07749238911b429...
 |
 | Example valid inputs:
 |     host.example
 |     hazimeyou.minna (IDN domain; xn--p8j9a0d9c9a.xn--q9jyb4c)
 |     [::1] (IPv6 address)
 |     0xdeadbeef (IPv4 address; 222.173.190.239)
 |     123.456 (IPv4 address; 123.0.1.200)
 |     123456789 (IPv4 address; 7.91.205.21)
 |     localhost
 |
 | * isValidURL (and isValidAbsoluteURL):
 | https://gist.github.com/eligrey/443d51fab55864005ffb3873204b...
 |
 | Example valid inputs to isValidURL:
 |     https://absolute-url.example
 |     //relative-protocol.example
 |     /relative-path-example
 |
 | 1. https://docs.transcend.io/docs/configuring-data-flows
 |
 | 2. https://developer.mozilla.org/en-US/docs/Web/API/URL
 | mercora wrote:
 | While not terribly important, or outright not required, this
 | fails (treats URLs as regex) for link-local addresses with a
 | device identifier (zone-id) applied, like
 | "[fe80::8caa:8cff:fe80:ff32%eth0]", although that would need to
 | be fixed in the standard if it's desired :)
 |
 | I've found some reasoning[0] as to why it's not supported, with
 | browsers in mind, though.
|
 | [0] https://www.w3.org/Bugs/Public/show_bug.cgi?id=27234#c2
 | gregsadetsky wrote:
 | I was just struggling with this -- specifically, our users' "UX"
 | expectation that entering "example.com" should work when asked
 | for their website URL.
 |
 | Most URL validation rules/regexes/libraries/etc. reject
 | "example.com". However, if you head over to Stripe (for example),
 | in the account settings, when asked for your company's URL,
 | Stripe will accept "example.com" and assume "http://" as the
 | prefix (which, yes, can have its own problems).
 |
 | What's a good solution? I both want to validate URLs and let
 | users enter "example.com". But if I simply do
 |
 |     if (validateURL(url)) {
 |         return true;
 |     } else if (validateURL("http://" + url)) {
 |         return true;
 |     } else {
 |         return false;
 |     }
 |
 | i.e. validate the given URL and, as a fallback, try to validate
 | "http://" + the given URL, that opens the door to weird, non-URL
 | strings being incorrectly validated...
 |
 | Help :-)
 | Rebelgecko wrote:
 | This could potentially be abused, but you could actually try to
 | resolve the DNS to determine if it's valid (could be weird for
 | some cases like localhost or IP addresses). Or just do a "curl
 | https://whatever.com" and see what happens (assuming that all
 | of the websites are running a webserver, although idk if that
 | is true in your situation).
 | dilatedmind wrote:
 | I would suggest biasing your implementation against false
 | negatives. They can always come back and update it if it's
 | wrong, and their URL could just as easily be "valid" but
 | incorrect, e.g. any typo in a domain name.
 |
 | If it's really important, you could try making a request to the
 | URL and seeing if it loads, but that still doesn't validate that
 | it's the URL they intended to input.
 |
 | It might be cool to load the URL with puppeteer and capture a
 | screenshot of the page. If they can't recognize their own
 | website, it's on them.
 | anderskaseorg wrote:
 | Parse, don't validate.
If you need a heuristic that accepts
 | non-URL strings as if they were valid URLs, you should
 | _convert_ those non-URL strings to valid URLs so the rest of
 | your code can just deal with valid URLs.
 |
 |     if (validateURL(url)) {
 |         return url;
 |     } else if (validateURL("http://" + url)) {
 |         return "http://" + url;
 |     } else {
 |         return null;
 |     }
 |
 | JadeNB wrote:
 | I know we're not golfing, but it pains me to see that
 | repetition in the middle. Mightn't we write
 |
 |     if (!validateURL(url)) {
 |         url = "http://" + url;
 |         if (!validateURL(url)) {
 |             url = null;
 |         }
 |     }
 |     return url;
 |
 | to snip a small probability of a bug?
 | wolfgang42 wrote:
 | I find that branchiness (and mutation of the variable)
 | harder to follow. Personally, I'd just take "parse, don't
 | validate" to its logical conclusion and go for:
 |
 |     const parseUrl = url => validateUrl(url) ? url : null;
 |     return parseUrl(url) || parseUrl('http://' + url) || null;
 |
 | lelandfe wrote:
 | Address validators for online checkout are notoriously
 | inaccurate, though they still help a lot. You just have to
 | prompt the user: "Did you mean 123 Example St?"
 |
 | I'd probably do the same for poorly formatted URLs. When the
 | user hits Submit, a prompt appears saying, "Did you mean
 | `https://example.com`?"
 | dang wrote:
 | Two past discussions, for the curious:
 |
 | _In search of the perfect URL validation regex_ -
 | https://news.ycombinator.com/item?id=10019795 - Aug 2015 (77
 | comments)
 |
 | _In search of the perfect URL validation regex_ -
 | https://news.ycombinator.com/item?id=7928968 - June 2014 (81
 | comments)
 | dmix wrote:
 | @stephenhay seems to be the winner here if you don't need IP
 | addresses (or weird dashed URLs). It's only 38 characters long
 | and easy to understand:
 |
 |     @^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS
 |
 | The simpler the better, if you're going to use something that is
 | not ideal.
 | loloquwowndueo wrote:
 | Doesn't cover mailto:, which is fairly common. To be
 | pedantic/strict, mailto: addresses are URIs, not URLs.
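The @stephenhay pattern quoted above is written with PHP's `@...@iS` delimiters. Transcribed into a JavaScript regex literal (a sketch; PHP's `S` "study" modifier has no JS equivalent, and only the case-insensitive flag carries over), its behavior is easy to probe:

```javascript
// dmix's @stephenhay pattern with the PHP delimiters and S modifier removed.
const STEPHENHAY = /^(https?|ftp):\/\/[^\s\/$.?#].[^\s]*$/i;

STEPHENHAY.test("http://example.com/");     // true
STEPHENHAY.test("ftp://foo.bar/baz");       // true
STEPHENHAY.test("http://.example.com/");    // false: host can't start with "."
STEPHENHAY.test("mailto:user@example.com"); // false: scheme isn't http(s)/ftp
```

The last case is loloquwowndueo's point: mailto: is excluded by construction, which is either a feature or a bug depending on your use case.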
| dmurray wrote:
 | > I also don't want to allow every possible technically valid URL
 | -- quite the opposite.
 |
 | Well, that should make things a lot easier. What does he mean
 | here? The rest of the text doesn't make it clear to me, unless
 | it's meant to be "every possibly valid HTTP, HTTPS, or FTP URL",
 | which isn't exactly "the opposite".
 | lelandfe wrote:
 | The next paragraph _might_ be that clarification, although I
 | agree it isn't totally clear what he meant there:
 |
 | > Assume that this regex will be used for a public URL
 | shortener written in PHP, so URLs like http://localhost/,
 | //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and
 | tel:+1234567890 shouldn't pass (even though they're technically
 | valid). Also, in this case I only want to allow the HTTP, HTTPS
 | and FTP protocols.
 | jabo wrote:
 | Tangentially related, but mentioning it to hopefully save someone
 | time: if you ever find yourself wanting to check whether a version
 | string is semver or not, before inventing your own, there is an
 | official regex that's provided.
 |
 | I just discovered this yesterday and I'm glad I didn't have to
 | come up with it myself:
 |
 | https://semver.org/#is-there-a-suggested-regular-expression-...
 |
 | My use case for it:
 | https://github.com/typesense/typesense-website/blob/25562d02...
 | azalemeth wrote:
 | Honest question: there is a famous and very funny Stack Exchange
 | answer on the topic of parsing HTML with a regex [1] that states
 | that the problem is in general impossible, and that if you find
 | yourself doing this, something has gone wrong and you should re-
 | evaluate your life choices / pray to Cthulhu.
 |
 | So, does this apply to URLs? The fact that these regexes
 | are... so huge... makes me think that something is fundamentally
 | wrong. Are URLs describable in a Chomsky Type-3 grammar? Are they
 | sufficiently regular that using a regex is sensible? What do the
 | actual browsers do?
 |
 | [1] https://stackoverflow.com/questions/1732348/regex-match-open...
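On azalemeth's closing question (what do the actual browsers do?): they follow the WHATWG URL Standard's state-machine parser rather than any regex. Node ships the same URL class, so browser-style parsing can be probed from a script; a minimal sketch:

```javascript
// WHATWG URL parsing via the built-in URL class: attempt a parse and
// treat a throw as "invalid". This accepts inputs that RFC-flavored
// regexes typically reject.
const parses = (s) => {
  try { new URL(s); return true; } catch { return false; }
};

parses("http://example.com./"); // true: trailing-dot FQDN is accepted
parses("http:///example.com/"); // true: extra slashes are normalized away
parses("example.com");          // false: no scheme, so not an absolute URL
```

Note that the parser also normalizes as it goes (lowercasing hosts, percent-encoding, collapsing slashes), which is something a pure yes/no regex cannot do.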
| likium wrote:
 | Even if you built a URL validation regex that follows RFC 3986 [1]
 | and RFC 3987 [2], you would still get user bug reports, because web
 | browsers follow a different standard.
 |
 | For example, <http://example.com./>, <http:///example.com/> and
 | <https://en.wikipedia.org/wiki/Space (punctuation)> are
 | classified as invalid URLs in the blog post, but they are accepted
 | by browsers.
 |
 | As the creator of cURL puts it, there is no URL standard [3].
 |
 | [1]: https://www.ietf.org/rfc/rfc3986.txt
 |
 | [2]: https://www.ietf.org/rfc/rfc3987.txt
 |
 | [3]: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/
 | jt2190 wrote:
 | There's also a question of what we're _really_ trying to
 | validate, IMHO. All of these regex patterns will tell you that
 | a string looks like a URL, but they won't actually tell you:
 | whether there's any web server listening at that particular URL;
 | whether that server has the resource at that location; whether
 | that server is reachable from where you want to fetch it; etc.
 | yyyk wrote:
 | <http://example.com./> is a valid URL; see, for example:
 |
 | https://jdebp.uk/FGA/web-fully-qualified-domain-name.html
 | MildlySerious wrote:
 | Tangentially, YouTube had a bug surface last year where
 | adding that extra dot let you avoid all ads. Previous
 | discussion[1]
 |
 | [1] https://news.ycombinator.com/item?id=23479435
 | dhsysusbsjsi wrote:
 | Also nearly every paywalled media site.
 | Sephr wrote:
 | There might not have been a generally accepted standard then,
 | but there is now: https://url.spec.whatwg.org/
 | queuebert wrote:
 | Uh oh, regex is approaching sentience.
 | MaxBarraclough wrote:
 | Every known sentient being is a finite state machine. Every
 | finite state machine corresponds to a regular expression, and
 | vice versa.
 | JadeNB wrote:
 | > Every known sentient being is a finite state machine.
 |
 | I know this is just a cutesy slogan, but how could you
 | possibly know whether a living creature is a finite state
 | machine?
What would it even mean? I know I don't respond
 | identically to identical stimuli presented on different
 | occasions...
 | throwamon wrote:
 | Obnoxious -- I mean, trivial -- answer: just make "occasions" a
 | variable. Assuming your lifetime is finite, you could
 | simply assign each point in time to a value, and there you
 | have it: a finite mapping from each moment to a state.
 | MaxBarraclough wrote:
 | > I know this is just a cutesy slogan
 |
 | Mostly, yes, but I do think there's a real point here as
 | well.
 |
 | > how could you possibly know whether a living creature is
 | a finite state machine?
 |
 | As I understand it, physicists don't really know whether
 | the physical world has a finite number of states or an
 | infinite number. I think they tend to lean toward finite,
 | though.
 |
 | Even if it's infinite, I doubt it's of consequence. That is
 | to say, I doubt that sentience depends on the physical
 | possibility of an infinite number of states. (Of course, if
 | it turns out the physical world only has a finite number of
 | states, that demonstrates that sentience is compatible with
 | the finite-states constraint.)
 |
 | > What would it even mean?
 |
 | Systems can be modelled as finite state machines. Sentient
 | entities like people are extremely sophisticated systems,
 | but that's just a matter of degree, not of category.
 |
 | > I know I don't respond identically to identical stimuli
 | presented on different occasions
 |
 | Right, because you're in a different state. You'll never be
 | in the same state twice. We don't need to resort to
 | non-determinism.
___________________________________________________________________
(page generated 2021-09-25 23:00 UTC)