[HN Gopher] 10% of the top million sites are dead
       ___________________________________________________________________
        
       10% of the top million sites are dead
        
       Author : Soupy
       Score  : 235 points
       Date   : 2022-07-15 17:22 UTC (5 hours ago)
        
 (HTM) web link (ccampbell.io)
 (TXT) w3m dump (ccampbell.io)
        
       | gumby wrote:
       | His 'www' logic is flawed: https://www.example.com and
       | https://example.com need not return the same results, but his
       | checking code sends the output straight to /dev/null so he has no
       | way of knowing.
        
         | cbarrick wrote:
         | In theory, sure.
         | 
         | In practice, how many orgs serve on both example.com and
         | www.example.com yet operate each as entirely separate sites?
         | 
         | I cannot think of any example.
        
           | gumby wrote:
           | MIT was, for decades, though they seem to have changed.
        
       | phkahler wrote:
        | Read that again, folks:
       | 
       | "a very reasonable but basic check would be to check each domain
       | and verify that it was online and responsive to http requests.
       | With only a million domains, this could be run from my own
       | computer relatively simply and it would give us a very quick
       | temperature check on whether the list truly was representative of
       | the "top sites on the internet". "
       | 
       | This took him 50 minutes to run. Think about that when you want
       | to host something smaller than a large commercial site. We live
       | in the future now, where bandwidth is relatively high and
       | computers are fast. Point being that you don't need to rent or
       | provision "big infrastructure" unless you're actually quite big.
        
         | cratermoon wrote:
         | > you don't need to rent or provision "big infrastructure"
         | unless you're actually quite big.
         | 
         | Or if you have hard response-time requirements. I really don't
         | think it would be good to, for example, wait an hour to process
         | the data from 800K earthquake sensors and send out an alert to
         | nearby affected areas.
        
         | stevemk14ebr wrote:
          | your point has a truth behind it for sure, but there's a large
          | difference between serving requests and making requests. Many
          | sites are simple html and css pages, but many others have
          | complex backends. It's those that are often hard to scale, and
          | that's why the cloud is hugely popular: maintaining and
          | scaling the backend is hard.
        
           | phkahler wrote:
           | Oh absolutely, but he also said this:
           | 
            | > I found that my local system could easily handle 512
            | > parallel processes, with my CPU @ ~35% utilization, 2GB of
            | > RAM usage, and a constant 1.5MB down on the network.
           | 
           | Another thing that happened in the early web days was Apache.
           | People needed a web server and it did the job correctly.
           | Nobody ever really noticed that it had terrible performance,
           | so early on infrastructure went to multiple servers and load
           | balancers and all that jazz. Now with nginx, fast multi-core,
           | and speedy networks even at home, it's possible to run sites
           | with a hundred thousand users a day at home on a laptop. Not
           | that you'd really want to do exactly that but it could be
           | done.
           | 
           | Because of this I think an alternative to github would be
           | open source projects hosted on peoples home machines. CI/CD
           | might require distributing work to those with the right
           | hardware variants though.
        
         | [deleted]
        
         | jayd16 wrote:
         | The flip side is anyone can run these kinds of tools against
         | your site easily and cheaply.
        
       | kozziollek wrote:
        | Most cities in Poland have their own $city.pl domain and allow
        | websites to buy $website.$city.pl. That might not be well known.
        | And cities have their websites, so I guess it's OK.
       | 
       | But info.pl and biz.pl? Did nobody hear about country variants of
       | gTLDs?!
        
         | drdaeman wrote:
         | Those are called Public Suffixes or effective TLDs (eTLDs):
         | https://en.wikipedia.org/wiki/Public_Suffix_List
         | 
          | And you're entirely correct that the author should've referred
          | to such a list.
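          | 
          | A quick-and-dirty check against it looks something like this
          | (rough sketch; a plain grep ignores the list's wildcard and
          | exception rules, so use a real PSL parser for anything
          | serious):
          | 
          |   # is "info.pl" itself a public suffix rather than a website?
          |   curl -s https://publicsuffix.org/list/public_suffix_list.dat \
          |     | grep -qxF 'info.pl' && echo 'public suffix, not a site'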
        
       | macintux wrote:
       | Title is misleading: that's the outcome, but the bulk of the
       | story is the data processing to reach that conclusion.
        
         | hinkley wrote:
         | It happens. Most of the stuff we do these days invokes a number
         | of disciplines. I forget sometimes that maybe ten percent of us
         | just play with random CS domains for "fun" and that most people
         | are coming into big problems blind, even sometimes the
         | explorers (though having comfort with exploring random fields
         | is a skill set unto itself).
         | 
         | Before the Cloud, when people would ask for a book on
         | distributed computing, which wasn't that often, I would tell
         | them seriously "Practical Parallel Rendering". That book was
         | almost ten years old by then. 20 now. It's ostensibly a book
         | about CGI, but CGI is about distributed work pools, so half the
         | book is a whirlwind tour of distributed computing and queuing
         | theory. Once they start talking at length about raytracing, you
         | can stop reading if CGI isn't your thing, but that's more than
         | halfway through the book.
         | 
         | I still have to explain some of that stuff to people, and it
         | catches them off guard because they think surely this little
         | task is not so sophisticated as that...
         | 
         | I think this is where the art comes in. You can make something
         | fiddly that takes constant supervision, so much so that you get
         | frustrated trying to explain it to others, or you can make
         | something where you push a button and magic comes out.
        
       | crikeyjoe wrote:
        
       | allknowingfrog wrote:
       | I don't have any particular opinions on the author's conclusions,
       | but I learned a thing or two about the power of terminal commands
       | by reading through the article. I had no idea that xargs had a
       | parallel mode.
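          | 
          | For anyone else who hadn't seen it, the shape of the trick is
          | roughly this (a sketch with made-up file names, not the
          | article's exact command; -P sets the number of parallel
          | workers):
          | 
          |   # one domain per line in domains.txt; GNU xargs assumed
          |   xargs -P 64 -I {} -a domains.txt \
          |     curl -s -o /dev/null -w '{} %{http_code}\n' \
          |     --max-time 10 'http://{}' > results.txt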
        
         | thelamest wrote:
         | Probably not news to anyone who works with big data(tm), but I
         | learned, after additional searches, that using (something like)
         | duckdb as a CSV parser makes sense, especially if the
         | alternative is loading the entire thing to memory with
         | (something like) base R. This was informative for me:
         | https://hbs-rcs.github.io/large_data_in_R/.
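          | 
          | For example, the duckdb CLI (assuming it's installed; the file
          | name here is made up) can aggregate a CSV without first
          | loading all of it into memory:
          | 
          |   echo "SELECT count(*) \
          |         FROM read_csv_auto('majestic_million.csv')" | duckdb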
        
       | zinekeller wrote:
        | TLDR: Campbell's methodology is flawed, doesn't consider edge
        | cases (one of which, equating apex-only and www-prefixed
        | domains, I consider reckless), and doesn't account for how
        | Majestic collects and processes its data.
        | 
        | Longer version: This isn't comprehensive, but I can think of two
        | main reasons why:
       | 
        | - The Majestic Million lists only the registrable part (with some
        | exceptions), and this sometimes leads to central CDNs being
        | listed. For example, the Majestic Million lists wixsite.com (for
        | those who are unaware, a CDN domain used by Wix.com with
        | separate subdomains), but if you visit wixsite.com you wouldn't
        | get anything. Same with Azure: subdomains of azureedge.net and
        | azurewebsites.net do exist (for example
        | https://peering.azurewebsites.net/) but azureedge.net and
        | azurewebsites.net themselves don't exist. Without similar
        | filtering, using the Cisco list (https://s3-us-
        | west-1.amazonaws.com/umbrella-static/index.htm...) would quickly
        | lead you to this precise problem (mainly because the number one
        | is "com", but phew, at least http://ai./ does exist!)
       | 
        | - Also, shame on the author for treating www-prefixed and apex-
        | only domains as one and the same. For some websites, they
        | aren't. Take this example: jma.go.jp (Japan Meteorological
        | Agency) doesn't respond (actually NODATA) on http://jma.go.jp/
        | but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn
        | (Chinese ICP Licence Administrator) won't respond at all but
        | _www_.beian.gov.cn will. And ncbi.nlm.nih.gov (National Center
        | for Biotechnology Information)? I can't blame Majestic:
        | https://www.ncbi.nlm.nih.gov/ and https://ncbi.nlm.nih.gov/
        | don't redirect to a canonical domain, and unless you've compared
        | the HTTP pages there's no way you would know that they are the
        | same website!
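        | 
        | The jma.go.jp case is visible from DNS alone (sketch; true as of
        | this writing):
        | 
        |   dig +short a jma.go.jp      # prints nothing: NODATA at apex
        |   dig +short a www.jma.go.jp  # prints addresses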
       | 
        | Edit: I've downloaded the CSV to check my claims, and it shows:
        | 
        |   wixsite.com 0
        |   beian.gov.cn 0
        | 
        | Please, for the love of sanity, consider what the inclusion
        | criterion of the Majestic Million (and similar lists) actually
        | is. I can't believe I'm saying this, but can we crowd-source
        | "Falsehoods programmers believe about domains"?
       | 
        | Also, an addendum on crawling that I consider "probably
        | forgivable":
       | 
       | - Some websites are only available in certain countries (internal
       | Russian websites don't respond at all outside Russia for
       | example). This can skew the numbers a little bit.
        
         | [deleted]
        
         | zepearl wrote:
          | > _Take this example: jma.go.jp (Japan Meteorological Agency),
          | which doesn't respond (actually NODATA) on http://jma.go.jp/
          | but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn
          | (Chinese ICP Licence Administrator) wouldn't respond at all but
          | www.beian.gov.cn will._
         | 
          | I can confirm stuff like that - I'm writing a crawler&indexer
          | program (prototype in Python, now writing the final version in
          | Rust) and assuming anything while crawling is not OK. I ended
          | up adding URLs to my "to-index" list by considering only links
          | explicitly mentioned by other websites (or by pages within the
          | same site).
        
         | cratermoon wrote:
         | It even says right at the top of the Majestic Million site "The
         | million domains we find with the most referring subnets", not
         | implying anything about reachability for http(s) requests.
        
       | nr2x wrote:
       | Majestic is a shit list. Mystery solved.
        
       | softwaredoug wrote:
        | My current beliefs about how people use and trust information on
        | the Web:
        | 
        | First, trust is _everything_ on the Web; it's the first thing
        | people assess when they arrive at some information. But how
        | people evaluate trust has changed dramatically over the last 10
        | years.
       | 
        | - Trust now comes almost exclusively from social proof:
        | searching reddit, youtube, and other extremely _moderated_
        | sources of information, where the most work is done to ensure
        | content comes from actual human beings. How many of us now
        | google `<topic> reddit` instead of just `<topic>`?
       | 
       | - Of course a lot of this trust is misplaced. There's a very thin
       | line between influencers and cult leaders / snake oil salesmen.
       | Our last President used this hack really effectively.
       | 
        | - Few trust Google's definition of trust anymore -- essentially
        | PageRank. This made more sense when the Web was essentially
        | social and inbound links were very organic. Now, with trust in
        | general Web sites evaporated, the main 'inbound links' anyone
        | cares about come from individuals or communities they trust or
        | identify with. They don't trust Google's algorithm (it's too
        | opaque, and too easily gamed).
       | 
       | This of course means the fracturing of truth away from elites.
       | Sometimes this could be a good thing, but in many cases _cough_
        | Covid _cough_ it might be pretty disastrous for misinformation.
        
         | wolverine876 wrote:
         | > How many of us now google `<topic> reddit` instead of just
         | `<topic>`?
         | 
         | One of us lives in a bubble. I don't trust Reddit for anything,
         | or YouTube or any social media. IME, it's mis/disinformation -
         | not only a lack of information, but a negative; it leaves me
         | believing something false. My experience is, and plenty of
         | research shows, that we have no way to sort truth from fiction
          | without prior expertise in the domain. The misinformation and
          | disinformation on social media, and their persuasiveness, are
          | very well known. The results are evident before us, in the
          | madness and disasters, in dead people, in threats to freedom,
          | prosperity, and stability.
         | 
         | Why would people in this community, who are aware of these
         | issues, trust social media? How is that working out?
         | 
         | > This of course means the fracturing of truth away from
         | elites. Sometimes this could be a good thing
         | 
         | I think that's mis/disinformation. 'Elite' is a loaded,
         | negative (in this context) word. It makes the question about
         | power and the conclusion inevitable.
         | 
         | Making it about power distracts from the core issue of
         | knowledge, which is truth. I want to hear from the one person,
         | or one of the few people, with real knowledge about a topic; I
         | don't want to hear from others.
         | 
         |  _In matters of science the authority of thousands is not worth
         | the humble reasoning of one single person._
        
           | Brian_K_White wrote:
           | They already acknowledge the problem of trusting the crowd,
           | but you seem to not acknowledge the problem of trusting a
           | central dispensary. In fact it's unwise to trust either one.
            | Everything has to be evaluated case by case. The same source
            | might deserve trust for one thing today, and not for some
            | other thing tomorrow.
        
         | mountainriver wrote:
         | > How many of us now google `<topic> reddit` instead of just
         | `<topic>`
         | 
            | I sure hope not, Reddit is a horrible place for information.
        
           | failTide wrote:
           | I use the strategy for a few things - including when I want
           | to get reviews of a product or service. There's still
           | potential for manipulation there, but you can judge the
           | replies based on the user history - and you know that
           | businesses aren't able to delete or hide bad reviews there.
           | 
           | But in general I agree with you - reddit is full of
           | misinformation, propaganda and astroturfing
        
           | romanhn wrote:
           | When I have a specific technical question, I append
           | "stackoverflow" to my search queries. When I want to read a
           | discussion, I add "reddit" (or "hacker news").
        
       | zX41ZdbW wrote:
       | This looks surprisingly similar to the unfinished research that I
       | did: https://github.com/ClickHouse/ClickHouse/issues/18842
        
       | mouzogu wrote:
       | whenever i go through my bookmarks, i tend to find maybe 5-10%
       | are now 404.
       | 
        | this is why i like the archive.ph project so much and am using
        | it more as a kind of bookmarking service.
        
         | system2 wrote:
         | archive.ph = Russian federation website. Blocked by most
         | firewalls by default.
        
         | syedkarim wrote:
         | What's the benefit to using archive.ph instead of archive.org
         | (Internet Archive)? Seems like the latter is much more likely
          | to be around for a while.
        
           | mouzogu wrote:
           | i find archive.ph does a better job of preserving the page as
           | is (it also takes a screenshot) compared to internet archive
           | which can be flaky at best.
           | 
           | i also find archive.ph much faster at searching, and the
           | browser extension is really useful too.
           | 
           | the faq does a great job of explaining too
           | https://archive.ph/faq
        
             | yellow_lead wrote:
             | Isn't archive.ph/today the one with questionable funding
             | sources and backing? Who is behind it and can it be trusted
             | for longevity?
        
               | mouzogu wrote:
               | yeah funding is a grey area...
               | 
                | fwiw the website is only accessible by VPN in a lot of
                | countries, which says a lot for me... and i don't think
                | they've taken down any content, although i can't say for
                | sure.
        
               | NavinF wrote:
               | In this case the less we know, the longer it will last.
               | Notice how this site ignores robots.txt and copyright
               | claims by litigious companies that would like to see
               | their past erased.
               | 
               | The data saved on your NAS will outlast this site
               | regardless of who owns/funds it.
        
               | fragmede wrote:
               | How do you figure?
        
               | NavinF wrote:
               | What do you mean? There's a line of companies waiting to
               | sue anyone involved with that site. That's been the case
               | for many years.
        
             | mgdlbp wrote:
             | archive.today does that by rewriting the page to mostly
             | static HTML at the time of capture.
             | 
             | archive.org indexes all URLs first-class and presents as
             | close to what was originally served as possible. It also
             | stores arbitrary binary files and captures JS and Flash
             | interactivity with remarkable fidelity.
             | 
             | When logged in, the archive.org Save Page Now interface
             | gains the options of taking a screenshot and non-
              | recursively saving all linked pages. I can't work out
              | why--the more saved, the better, right?
             | 
             | archive.org has a browser extension too
        
       | yajjackson wrote:
       | Tangential, but I love the format for your site. Any plans to do
       | a "How I built this blog" post?
        
         | kerbersos wrote:
         | Likely using Hugo with the congo theme
        
           | Soupy wrote:
           | Yup, nailed it. Hugo with Congo theme (and a few minor layout
           | tweaks). Hosted on cloudflare pages for free
        
       | the_biot wrote:
       | By what possible criteria are these the "top" million sites, if
       | 10% are dead? I'd start with questioning _that_ data.
        
         | kjeetgill wrote:
         | Dude, it's the second sentence of the first paragraph:
         | 
         | > For my purposes, the Majestic Million dataset felt like the
         | perfect fit as it is ranked by the number of links that point
         | to that domain (as well as taking into account diversity of the
         | origin domains as well).
        
           | MatthiasPortzel wrote:
           | And moreover, the author's conclusion is that the dataset is
           | bad.
           | 
           | > While I had expected some cleanliness issues, I wasn't
           | expecting to see this level of quality problems from a
           | dataset that I've seen referenced pretty extensively across
           | the web
        
           | winddude wrote:
           | part of the problem is it's not the number of links, it's
            | referring subnets. Fairly certain this includes script tags.
        
           | the_biot wrote:
           | Yeah, but they're still providing a dataset that's just plain
           | bad. It's hardly relevant how many sites link to some other
           | site, if it's dead.
        
             | Brian_K_White wrote:
             | It's only bad data if it does not include what it claims to
             | include.
             | 
             | If the dataset is defined as inlinks, and it is inlinks,
             | then the data is good.
        
         | deltree7 wrote:
         | Exactly!
         | 
         | Garbage In == Garbage Out
        
       | winddude wrote:
       | No they're not.
        
       | gojomo wrote:
       | Many issues with this analysis, some others have already
       | mentioned, including:
       | 
       | * The 'domains' collected by the source, as those "with the most
       | referring subnets", aren't necessarily 'websites' that now, or
        | ever, responded to HTTP
       | 
       | * In many cases any responding web server will be on the `www.`
       | subdomain, rather than the domain that was listed/probed - & not
       | everyone sets up `www.` to respond/redirect. (Author
       | misinterprets appearances of `www.domain` and `domain` in his
       | source list as errant duplicates, when in fact that may be an
       | indicator that those `www.domain` entries also have significant
       | `subdomain.www.domain` extensions - depending on what Majestic
       | means by 'subnets'.)
       | 
       | * Many sites may block `curl` requests because they _only_ want
       | attended human browser traffic, and such blocking (while usually
       | accompanied with some error response) _can_ be a more aggressive
       | drop-connection.
       | 
       | * `curl` given a naked hostname likely attempts a plain HTTP
       | connection, and given that even browsers now auto-prefix `https:`
       | for a naked hostname, some active sites likely have _nothing_
        | listening on the plain-HTTP port anymore.
       | 
       | * Author's burst of activity could've triggered other rate-
       | limits/failures - either at shared hosts/inbound proxies
       | servicing many of the target domains, or at local ISP egresses or
        | DNS services. He'd need to drill down into individual failures
        | to get a better idea of the extent to which this might be
        | happening.
       | 
        | If you want to probe if _domains_ are still active (a rough
        | sketch follows the list):
       | 
       | * confirm they're still registered via a `whois`-like lookup
       | 
       | * examine their DNS records for evidence of current services
       | 
       | * ping them, or any DNS-evident subdomains
       | 
       | * if there are any MX records, check if the related SMTP server
       | will confirm any likely email addresses (like postmaster@) as
       | deliverable. (But: don't send an actual email message.)
       | 
       | * (more at risk of being perceived as aggressive) scan any extant
       | domains (from DNS) for open ports running any popular (not just
       | HTTP) services
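        | 
        | Something like this covers the first few checks (a rough sketch
        | assuming standard whois/dig and Linux ping; whois output formats
        | vary by registry, so the grep is approximate):
        | 
        |   d=example.com
        |   whois "$d" | grep -qi 'domain name:' && echo "$d registered"
        |   dig +short ns "$d"            # current DNS service?
        |   ping -c 1 -W 2 "$d" >/dev/null && echo "$d answers ping"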
       | 
       | If you want to probe if _web sites_ are still active, start with
       | an actual list of web site URLs that were known to have been
       | active at some point.
        
         | spc476 wrote:
         | It dawned on me when I hit the Majestic query page [1] and saw
         | the link to "Commission a bespoke Majestic Analytics report."
         | They run a bot that scans the web, and (my opinion, no real
         | evidence) they probably don't include sites that block the
         | MJ12bot. This could explain why my site isn't in the list, I
         | had some issues with their bot [2] and _they_ blocked
         | themselves from crawling my site.
         | 
         | So, is this a list of the actual top 1,000,000 sites? Or just
         | the top 1,000,000 sites they crawl?
         | 
         | [1] https://majestic.com/reports/majestic-million
         | 
         | [2] http://boston.conman.org/2019/07/09-12
        
           | useruserabc wrote:
           | As near as I can tell, these are the top 1,000,000 domains
           | referred to by other websites they crawled.
           | 
           | The report is described as "The million domains we find with
            | the most referring subnets"[1], and a referring subnet is a
            | subnet with at least one host serving a webpage that points
            | at the domain.
           | 
           | So to the grandparent, presumably if something is "linking"
           | to these domains, they probably were meant to be websites.
           | 
           | [1] https://majestic.com/reports/majestic-million [2]
           | https://majestic.com/help/glossary#RefSubnets,
           | https://majestic.com/help/glossary#RefIPs and also
           | https://majestic.com/help/glossary#Csubnet
        
       | bioemerl wrote:
       | I'm honestly amazed that out of the top million sites, which
       | probably includes a ton of tiny tiny sites that are idle or
       | abandoned, only ten percent are offline.
        
         | MonkeyMalarky wrote:
         | How many are placeholder pages thrown up by registrars like
         | Network Solutions?
        
           | denton-scratch wrote:
            | If they're placeholder _pages_, they're not dead. Those 10%
           | are not responding at all; the requests aren't reaching any
           | HTTP server.
        
             | winddude wrote:
             | at least from his computer/script. A number could have been
              | blocked after simply detecting him as a bot.
        
             | zamadatix wrote:
             | Not all placeholder pages will forever stay placeholder
             | pages though. Some may get sold, become a site, then stop
             | being a site again. Some may not get sold, come up for
             | renewal and be deemed unlikely to be worth trying to sell
             | anymore (renewal is cheap for a registrar but the registry
             | will still charge a small fee).
             | 
              | Of course, the vast majority with enough interest to make
              | this list will either be sold and become an active page or
              | still be an active placeholder, but I wouldn't rule out a
              | good number of pages towards the lower end of the top
              | million being placeholders that were eventually deemed not
              | worth trying for anymore.
        
         | mike_hock wrote:
          | Yeah, I'd expect a list of 1,000,000 "top" "sites" to contain
          | plenty that can't really be called a "site," especially in
          | 2022 when the internet has been all but destroyed and all
          | that's left is a corporate oligopoly.
        
         | ehsankia wrote:
         | How is "top" defined here? If they were dead, wouldn't they
         | fairly quickly stop being "top"?
         | 
         | EDIT: the article uses a list sorted by inlinks, and I guess
            | other websites don't necessarily update broken links, but
            | that may be less true in the modern age, where we have tools
            | and automated services that warn us about dead links on our
            | websites.
        
           | nine_k wrote:
           | I can expect large SEO spam clusters of "sites" with many
           | links inside a cluster to make them look legit. For some time
           | such bits of SEO spam were on top of certain google searches
           | and enjoyed significant traffic, putting them firmly into
           | "top 1M".
           | 
           | Once a particular SEO trick is understood and "deoptimized"
           | by Google, these "sites" no longer make money, and get
           | abandoned.
        
         | Swizec wrote:
         | Blows my mind that my blog is 210863rd on that list. That makes
         | the web feel somehow smaller than I thought it was.
        
           | wincent wrote:
           | Eyeing you jealously from my position at 237,014 on the
           | list... We're almost neighbors, I guess.
        
       | gravitate wrote:
       | > Domain normalization is a bitch
       | 
        | I'm a no-www advocate. All my sites can be accessed from the
        | apex domain. But some people for whatever reason like to prepend
        | www to my domains, so I wrote a rule in Apache's .htaccess to
        | rewrite www to the apex.
       | 
       | Here's a tutorial for doing that: https://techstream.org/Web-
       | Development/HTACCESS/WWW-to-Non-W...
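        | 
        | The gist of such a rule, for reference (a common mod_rewrite
        | pattern, not necessarily the tutorial's exact version):
        | 
        |   cat > .htaccess <<'EOF'
        |   RewriteEngine On
        |   RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
        |   RewriteRule ^ https://%1%{REQUEST_URI} [R=301,L]
        |   EOF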
        
         | noizejoy wrote:
         | > I'm a no-www advocate.
         | 
         | I used to feel the same way. -- Until the arrival of so many
         | new TLDs.
         | 
         | Since then I always use www, because mentioning www.alice.band
         | in a sentence is much more of a hint to a general audience as
         | to what I'm referring to than just alice.band
        
           | gravitate wrote:
           | I hear you. But a redirect is a good solution in that case.
        
             | noizejoy wrote:
             | Yes it is.
             | 
             | I just redirect the other way round, so those ever rarer
             | individuals typing in domains are also served fine on my
             | websites. And also to automatically grab the https.
             | 
              | I just find it ever so slightly more "honest" to have the
              | server name I mention also be the one that's actually
              | being served. -- And that's also because I'm quite annoyed
              | at URL shorteners and all kinds of redirect trickery
              | having been weaponized over the years.
             | 
             | So I optimize for honesty and facilitate convenience.
             | 
              | But this is pretty subtle stuff and I'm not advocating
             | anymore. -- I don't think it's that big of a deal either
             | way and I'm just expressing my little personal vote and
             | priorities on the big Internet. :-)
             | 
             | So my post wasn't intended to change your mind, but more as
             | a bit of an alternative view and what made me get there.
        
         | macintux wrote:
         | 25 years ago I added a rule to my employer's firewall to allow
         | the bare domain to work on our web server.
         | 
         | Inbound email immediately broke. I was still very new, and
         | didn't want to prolong the downtime, so I reverted instead of
         | troubleshooting.
         | 
         | A few months after I left, I sent an email to a former co-
         | worker, my replacement, and got the same bounce message. I rang
         | him up and verified that he had just set up the same firewall
         | rule.
         | 
         | Been much too long to have any clue now what we did wrong.
        
           | JackMcMack wrote:
           | You probably created a cname from the apex to www? This
           | problem still exists today.
           | 
           | From https://en.wikipedia.org/wiki/CNAME_record: "If a CNAME
           | record is present at a node, no other data should be present;
           | this ensures that the data for a canonical name and its
           | aliases cannot be different."
           | 
           | So if you're looking up the MX record for domain, but happen
            | to find a cname for domain to www.domain, it will follow
           | that and won't find any MX records for www.domain.
           | 
           | The correct approach is to create a cname record from
           | www.domain to domain, and have the A record (and MX and other
           | records) on the apex.
           | 
           | Most DNS providers have a proprietary workaround to create
           | dns-redirects on the apex (such as AWS Route53 Alias records)
           | and serve them as A records, but those rarely play nice with
           | external resources.
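            | 
            | A quick way to spot the broken setup from outside (sketch):
            | 
            |   dig +short cname example.com  # any answer here means a
            |                                 # CNAME at the apex: bad
            |   dig +short mx example.com     # MX should resolve directly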
        
             | tux2bsd wrote:
             | > You probably created a cname from the apex to
             | 
             | You can't do that, period.
             | 
              | A lot of "cloud" and other GUI interfaces deceive people
              | into thinking it's possible; they just do A record fuckery
              | behind the scenes (clever in its own right, but it causes
              | misunderstanding).
        
       | MonkeyMalarky wrote:
       | Last time I tried to crawl that many domains, I ran into problems
       | with my ISP's DNS server. I ended up using a pool of public DNS
       | servers to spread out all the requests. I'm surprised that wasn't
       | an issue for the author?
        
         | wumpus wrote:
         | You have to run your own resolver. Crawling 101.
        
           | MonkeyMalarky wrote:
           | This is of course the correct answer. It just felt like
           | shaving a big yak at the time.
        
             | mh- wrote:
             | A properly configured unbound running locally can be a
             | decent compromise.
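              | 
              | Roughly this shape, for the curious (a sketch, not a
              | hardened config):
              | 
              |   # as root; then check the config and restart:
              |   cat > /etc/unbound/unbound.conf <<'EOF'
              |   server:
              |       interface: 127.0.0.1
              |       access-control: 127.0.0.0/8 allow
              |   EOF
              |   unbound-checkconf && systemctl restart unbound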
        
               | denton-scratch wrote:
               | That _is_ running your own resolver. Unbound is a
               | resolver.
        
               | mh- wrote:
               | well, yes, but I guess I think of unbound in a different
               | category from setting up (e.g.) bind. but, my experience
               | configuring bind is probably more than 20 years out of
               | date.
               | 
               | you're right to make that correction though, so thank
               | you. :)
        
               | fullstop wrote:
               | BIND is odd in that it combines a recursive resolver with
               | an authoritative name server, and this has actually led
               | to a number of security vulnerabilities over the years.
                | Other alternatives, such as djb's dnscache/tinydns and
                | NLNet Labs' Unbound/nsd, separate the two to avoid this
                | entirely.
        
       | altdataseller wrote:
        | All these top million lists are very good at telling you the
        | topmost 10K-50K sites on the web. After that, you're going into
       | 'crapshoot' land, where the 500,000th most popular site is very
       | likely to be a site that got some traffic a long time ago, but
       | now isn't even up.
       | 
       | So I would take this data with a grain of salt. You're better off
       | just analyzing the top 100K sites on these lists.
        
         | giantrobot wrote:
         | > where the 500,000th most popular site is very likely to be a
         | site that got some traffic a long time ago, but now isn't even
         | up.
         | 
         | That's literally the phenomenon the article is describing.
        
           | altdataseller wrote:
            | Ok, let me reword it differently: the 500,000th most popular
            | site on these lists most likely isn't the 500,000th most
            | visited, and it might not even be in the top 5 million.
            | These data sources are so bad at capturing popularity after
            | 50k sites or so simply because they don't have enough data.
        
             | iruoy wrote:
             | I haven't tested this, but the "Cisco Umbrella 1 Million"
              | is generated daily from DNS requests made to the Cisco
             | Umbrella DNS service. That seems to be a very good and
             | recent dataset.
             | 
             | It does count more than just visiting websites though. If
             | all Windows computers query the IP of microsoft.com once a
             | day that'll move them up quite a bit. And things in their
             | top 10 like googleapis.com and doubleclick.net are
             | obviously not visited directly.
             | 
             | So while it is quite a reliable and recent dataset, it is
             | not a good test of popularity.
        
       | spaceman_2020 wrote:
       | Not surprising. We're far away from the glory days of the
       | vibrant, chaotic web.
       | 
       | In countries like India that onboarded most users through
       | smartphones instead of computers, websites are not even
       | necessary. There's a huge dearth of local-focused web content as
       | well since there just isn't enough demand.
        
         | [deleted]
        
       | superb-owl wrote:
       | One of the few things I like about blockchain is the promise of a
       | less ephemeral web.
        
         | bergenty wrote:
            | Is that actually true? Don't most nodes hold only heavily
            | compressed pointers, while only a percentage of nodes host
            | the entire blockchain? I mean, if what you're saying is true
            | then each node needs to have a copy of the entire internet,
            | which isn't reasonable.
        
         | matkoniecz wrote:
         | One of many things I dislike about cryptoscams is making
         | promises which are lies.
        
         | deltree7 wrote:
         | spoken like someone who is clueless about Blockchain
        
       | pahool wrote:
       | zombo.com still kicking!
        
         | system2 wrote:
         | The png rotates with this:
         | 
         | .rotate {animation: rotation .5s infinite linear;}
         | 
         | I think it wasn't like this before. They must've updated it at
         | one point.
        
       | smugma wrote:
        | I downloaded the file and looked at the second 000 entry, which
        | refers to wixsite.com.
       | 
       | It appears that wixsite.com isn't valid but www.wixsite.com is,
       | and redirects to wix.com.
       | 
        | It's misleading to say that the sites are dead. As noted
        | elsewhere, his source data is crap (other sites I checked, such
        | as wixstatic.com, don't appear to be valid either), but his
        | methodology is also bad, or at least his describing the sites
        | as dead is misleading.
        
         | code123456789 wrote:
          | wixsite.com is a domain for free sites built on Wix, so if
          | your username on Wix is smugma and your site name is mysite,
          | then you'll have a URL like smugma.wixsite.com/mysite for
          | your home page.
          | 
          | That's why this domain is in the top.
        
           | smugma wrote:
           | Correct, that's why it's in the top. Your example further
           | confirms why the author's methodology is broken.
        
         | winddude wrote:
         | 100% agree his methodology is broken. Another example like this
          | is googleapis.com. If I remember correctly there are quite a
          | number of domains like this in the Majestic Million.
         | 
         | Not to mention a number of his requests may have been blocked.
        
         | zinekeller wrote:
         | > other sites I checked such as wixstatic.com don't appear to
         | be valid
         | 
         | But docs.wixstatic.com _is_ valid.
        
         | quickthrower2 wrote:
         | He takes this into account by generously considering _any_
         | returned response code as "not dead".
         | 
         | > there's a longtail of sites that had a variety of non-200
          | response codes but just to be conservative we'll assume that
         | they are all valid
        
           | mort96 wrote:
           | That doesn't take this into account, no. `curl wixsite.com`
           | returns a "Could not resolve host" error; it doesn't return a
           | response code, so the author would consider it invalid, even
           | though `curl www.wixsite.com` does return a response (a 301
           | redirect to www.wix.com).
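            | 
            | With the article's kind of invocation the difference shows
            | up as curl's 000 pseudo-status rather than a real response
            | code (sketch; curl prints the -w string even when the
            | transfer fails):
            | 
            |   curl -s -o /dev/null -w '%{http_code}\n' wixsite.com
            |   # 000 -- could not resolve host, no HTTP response at all
            |   curl -s -o /dev/null -w '%{http_code}\n' www.wixsite.com
            |   # 301 -- redirect to www.wix.com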
        
             | quickthrower2 wrote:
             | Oh how does that work then? How does the browser get to the
             | redirect when curl doesn't get any response at all? Is this
             | a DNS thing?
        
               | chrisweekly wrote:
               | apex domain is different from www cname
        
       | zzzeek wrote:
       | irony that the site is not responding?
        
       | ghostly_s wrote:
       | Wow, I would not have suspected `tee` is able to handle multiple
       | processes writing to the same file. Doesn't seem to be mentioned
       | on the man-page, either.
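        | 
        | Presumably the magic is less tee than the pipe feeding it: all
        | the parallel writers share one pipe, and single write()s smaller
        | than PIPE_BUF (at least 512 bytes; 4096 on Linux) are atomic, so
        | short lines come out intact. A toy demo (sketch):
        | 
        |   seq 1000 | xargs -P 16 -I {} echo "line {}" | tee out.txt \
        |     >/dev/null
        |   wc -l out.txt   # still 1000 intact lines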
        
       | ocdtrekkie wrote:
        | I've been working lately on migrating sites I ran in 2008 or so
        | to my new preferred hosting strategy: I know zero people look at
        | them, since many are functionally broken at present, but I don't
        | like the idea of actually removing them from the web. So
       | I'm patching them up, migrating them to a more maintainable
       | setting, and keeping them going. Maybe someday some historian
       | will get something out of it.
        
       | tete wrote:
        | The biggest problem I find is that it seems to be pretty
        | "outdated" to keep redirects in place if you move stuff. So many
        | links to news websites, etc. will cause a redirect to either /
        | or a 404 (which is a very odd thing to redirect to, in my
        | opinion).
        | 
        | If you are unlucky, an article you wanted to find has also
        | completely disappeared. This is scary, because it's basically
        | history disappearing.
        | 
        | I also wonder what will happen to text on websites that use some
        | ajax, when the javascript breaks because a third party goes
        | down. While the internet archive seems to be building tools for
        | people to use to mitigate this, I found that they barely worked
        | on websites that do something like this.
        | 
        | Another worry is the ever-increasing size of these scripts,
        | making archiving more expensive.
        
         | Kye wrote:
         | You can often pop the URL into the Wayback Machine to bring up
         | the last live copy. It's better at handling dynamic stuff the
         | more recent it is. Older stuff, especially early AJAX pages,
          | is just gone because the crawler couldn't handle it at the
         | time. It's far from a perfect solution, especially in light of
         | the big publishers finally getting their excuse to go after the
         | Internet Archive legally. It's a good silo, but just as
         | vulnerable as any other.
        
       ___________________________________________________________________
       (page generated 2022-07-15 23:00 UTC)