* * * * *

Notes on blocking the MJ12Bot

The MJ12Bot [1] is the first robot listed in Wikipedia's [2] robots.txt [3] file, which I find amusing for obvious reasons [4]. In the Hacker News comments [5] there's a thread [6] specifically about the MJ12Bot, and I replied to a comment about blocking it [7]. It's not that easy, because it's a distributed bot that used 136 unique IP (Internet Protocol) addresses just last month. Because of that comment, I decided I should expand on some of those numbers here.

The first table is the number of addresses from January through June, 2019, to show they're not all from a single netblock. The address format “A.B.C.D” will represent a unique IP address, like 172.16.15.2; “A.B.C” will represent the IP addresses 172.16.15.0 to 172.16.15.255; “A.B” will represent the range 172.16.0.0 to 172.16.255.255; and finally “A” will represent the range 172.0.0.0 to 172.255.255.255.

Table: Number of distinct IP addresses used by MJ12Bot in 2019 (January through June) when hitting my site

Address format  number
------------------------------
A.B.C.D         312
A.B.C           256
A.B              86
A                53

Next are the unique addresses from all of 2018 used by MJ12Bot:

Table: Number of distinct IP addresses used by MJ12Bot in 2018 when hitting my site

Address format  number
------------------------------
A.B.C.D         474
A.B.C           370
A.B             125
A                66

This wide distribution can easily explain why Wikipedia found that it ignored any rate limits set. Each individual node of MJ12Bot probably followed the rate limit, but it's a hard problem to coordinate across … what? 500 machines across the world?

It seems the best bet is to ban MJ12Bot via robots.txt:

-----[ data ]-----
User-agent: MJ12bot
Disallow: /
-----[ END OF LINE ]-----

While I haven't added MJ12Bot to my own robots.txt [8] file, it hasn't hit my site since they removed me from their crawl list [9], so it appears it can be tamed.
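
For anyone who wants to produce the same kind of per-format counts from their own logs, here is a minimal sketch in Python. It is not the script used for the tables above. It assumes you have already pulled out the IPv4 addresses of the requests whose user agent mentions MJ12bot; the function name `prefix_counts` and the sample addresses are made up for illustration.

```python
import ipaddress


def prefix_counts(addresses):
    """Count distinct prefixes at each of the four dotted-quad levels.

    `addresses` is any iterable of IPv4 address strings (assumed to be
    already filtered down to the bot's requests). Duplicates are fine;
    anything that does not parse as an IPv4 address is skipped.
    """
    # One set of seen prefixes per number of leading octets kept.
    seen = {n: set() for n in (4, 3, 2, 1)}

    for raw in addresses:
        try:
            addr = ipaddress.IPv4Address(raw.strip())
        except ValueError:
            continue  # not a valid IPv4 address, ignore it
        octets = str(addr).split(".")
        for n in seen:
            seen[n].add(".".join(octets[:n]))

    labels = {4: "A.B.C.D", 3: "A.B.C", 2: "A.B", 1: "A"}
    return {labels[n]: len(seen[n]) for n in (4, 3, 2, 1)}


if __name__ == "__main__":
    # Hypothetical sample data, not real MJ12Bot addresses.
    sample = [
        "192.0.2.1",
        "192.0.2.7",
        "192.0.2.7",
        "192.168.5.4",
        "198.51.100.4",
        "203.0.113.9",
        "not-an-address",
    ]
    for fmt, count in prefix_counts(sample).items():
        print(f"{fmt:<15} {count:>4}")
```

The smaller the gap between the “A.B.C.D” count and the “A” count, the more spread out the addresses are, which is why blocking by netblock doesn't get you very far with this bot.
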

[1] https://mj12bot.com/
[2] https://www.wikipedia.org/
[3] https://en.wikipedia.org/robots.txt
[4] gopher://gopher.conman.org/0Phlog:2019/07/09-12
[5] https://news.ycombinator.com/item?id=20453189
[6] https://news.ycombinator.com/item?id=20453542
[7] https://news.ycombinator.com/item?id=20455003
[8] http://boston.conman.org/robots.txt
[9] gopher://gopher.conman.org/0Phlog:2019/07/12.1

Email author at sean@conman.org.