* * * * *

How can a “commercial grade” web robot be so badly written?

Alex Schroeder was checking the status of web requests [1], and it made me wonder about the stats on my own server. One quick script later and I had some numbers:

Table: Status of requests for boston.conman.org so far this month
Status  result              requests  percent
------------------------------
200     OKAY                   53457    82.83
206     PARTIAL_CONTENT           12     0.02
301     MOVE_PERM               2421     3.75
304     NOT_MODIFIED            6185     9.58
400     BAD_REQUEST              101     0.16
401     UNAUTHORIZED             147     0.23
404     NOT_FOUND               2000     3.10
405     METHOD_NOT_ALLOWED        41     0.06
410     GONE                       5     0.01
500     INTERNAL_ERROR           173     0.27
------------------------------
Total   -                      64542   100.01

I'll have to check the INTERNAL_ERRORs and look into those 12 PARTIAL_CONTENT responses, but the rest seem okay.

I was curious to see what was being requested that I didn't have, and that's when I noticed that the MJ12Bot [2] was producing the majority of the NOT_FOUND responses.

Yes, sadly, most of the traffic around here is from bots [3]. Lots and lots of bots.

Table: Top agents requesting pages
requests  percentage  user agent
------------------------------
   16952          26  The Knowledge AI
    9159          14  Mozilla/5.0 (compatible; SemrushBot/3~bl; +http://www.semrush.com/bot.html)
    5633           9  Mozilla/5.0 (compatible; VelenPublicWebCrawler/1.0; +https://velen.io)
    4272           7  Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)
    4046           6  Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
    3170           5  Mozilla/5.0 (compatible; Go-http-client/1.1; +centurybot9@gmail.com)
    2146           3  Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)
    1197           2  Mozilla/5.0 (compatible; DotBot/1.1; http://www.opensiteexplorer.org/dotbot, help@moz.com)
    1146           2  istellabot/t.1.13
------------------------------
   47721          74  Total (out of 64542)

But it's been that way for years now. C'est la vie.
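For the curious, tallies like the two tables above could be produced with something along these lines. This is only a sketch, not the script that generated those numbers: it assumes the server writes Apache-style “combined” log lines (the post doesn't say which format is in use), and the names LOG_LINE, tally and report are made up for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: Apache "combined" log format, e.g.
#   host ident user [date] "request" status bytes "referer" "user agent"
# The real log format on the server is not stated in the post.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def tally(lines):
    """Count requests per status code and per user agent."""
    statuses = Counter()
    agents = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that don't fit the assumed format
        statuses[match.group("status")] += 1
        agents[match.group("agent")] += 1
    return statuses, agents


def report(counter):
    """Return (key, count, percent) rows, most frequent first."""
    total = sum(counter.values())
    return [
        (key, count, round(100.0 * count / total, 2))
        for key, count in counter.most_common()
    ]
```

Feeding it the lines of a log file (for example `tally(open(path))`, with `path` being wherever the log lives) and passing each counter to `report` gives rows that can be printed as a table. Because each percentage is rounded on its own, the column may not add up to exactly 100, which is how a total like 100.01 comes about.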

So I started looking closer at MJ12Bot and the requests it was generating, and … they were odd:

* //%22http://www.thomasedison.com//%22
* //%22https://github.com/spc476/NaNoGenMo-2018/blob/master/run.lua/%22
* //%22/2018/08/24.1/%22
* //%22https://kottke.org/19/04/life-sized-lego-electronics/%22

And so on. As they describe it:

> Why do you keep crawling 404 or 301 pages?
>
> We have a long memory and want to ensure that temporary errors, website
> down pages or other temporary changes to sites do not cause irreparable
> changes to your site profile when they shouldn't. Also if there are still
> links to these pages they will continue to be found and followed. Google
> have published a statement since they are also asked this question, their
> reason is of course the same as ours and their answer can be found here:
> Google 404 policy. [4]

But those requests? They have a real issue with their bot. Looking over the requests, I see that they're for pages I've linked to, but for whatever reason, their bot is requesting remote pages from my server. Worse yet, they're quoted! The %22 parts are encoded double quotes. It's as if their bot saw “<a href="http://www.thomasedison.com/">” and not only treated it as a link on my server, but escaped the quotes when making the request!

Pssst! MJ12Bot! Quotes are optional! Both “<a href="http://www.thomasedison.com/">” and “<a href=http://www.thomasedison.com/>” are equivalent!

Sigh.
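
To see what is hiding inside those request paths, it helps to undo the percent-encoding. The following is a small sketch using Python's standard urllib.parse module; the helper names embedded_target and looks_remote are invented for this example and have nothing to do with how MJ12Bot works internally.

```python
from urllib.parse import unquote, urlsplit


def embedded_target(path):
    """Return the text between the first and last decoded double quote.

    Returns None when the path does not contain a pair of quotes.
    """
    decoded = unquote(path)  # turns %22 into " and %5C into a backslash
    start = decoded.find('"')
    end = decoded.rfind('"')
    if start == -1 or end <= start:
        return None
    return decoded[start + 1:end]


def looks_remote(target):
    """True when the embedded text carries its own scheme (http, gopher, ...)."""
    return urlsplit(target).scheme != ""


if __name__ == "__main__":
    # Request paths copied from the list above.
    for path in (
        "//%22http://www.thomasedison.com//%22",
        "//%22/2018/08/24.1/%22",
    ):
        target = embedded_target(path)
        print(path, "->", target, "(remote)" if looks_remote(target) else "(local)")
```

Run against the paths above, this shows each request is a link target wrapped in literal quote characters, and that several of those targets are full URLs for other sites entirely.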

Annoyed, I sent them the following email:

> From: Sean Conner
> To: bot@majestic12.co.uk
> Subject: Your robot is making bogus requests to my webserver
> Date: Tue, 9 Jul 2019 17:49:02 -0400
>
> I've read your page on the mj12 bot, and I don't necessarily mind the 404s
> your bot generates, but I think there's a problem with your bot making
> totally bogus requests, such as:
>
> //%22https://www.youtube.com/watch?v=LnxSTShwDdQ%5C%22
> //%22https://www.zaxbys.com//%22
> //%22/2003/11/%22
> //%22gopher://auzymoto.net/0/glog/post0011/%22
> //%22https://github.com/spc476/NaNoGenMo-2018/blob/master/valley.l/%22
>
> I'm not a proxy server, so requesting a URL will not work, and even if I
> was a proxy server, the request itself is malformed so badly that I have to
> conclude your programmers are incompetent and don't care.
>
> Could you at the very least fix your robot so it makes proper requests?

I then received a canned reply saying that they have, in fact, received my email and are looking into it.

Nice.

But I did a bit more investigation, and the results aren't pretty:

Table: Requests and results for MJ12Bot
Status  result     number  percentage
------------------------------
200     OKAY          505       23.34
301     MOVE_PERM       4        0.18
404     NOT_FOUND    1655       76.48
------------------------------
Total   -            2164      100.00

So not only are they responsible for 83% of the bad requests I've seen, but nearly 77% of the requests they make are bad!

Just amazing programmers they have!

[1] https://alexschroeder.ch/wiki/2019-07-09_Web_Requests
[2] http://mj12bot.com/
[3] https://en.wikipedia.org/wiki/Internet_bot
[4] https://www.seroundtable.com/google-404-memory-16616.html

Email author at sean@conman.org