[HN Gopher] 10% of the top million sites are dead ___________________________________________________________________ 10% of the top million sites are dead Author : Soupy Score : 235 points Date : 2022-07-15 17:22 UTC (5 hours ago) (HTM) web link (ccampbell.io) (TXT) w3m dump (ccampbell.io) | gumby wrote: | His 'www' logic is flawed: https://www.example.com and | https://example.com need not return the same results, but his | checking code sends the output straight to /dev/null so he has no | way of knowing. | cbarrick wrote: | In theory, sure. | | In practice, how many orgs serve on both example.com and | www.example.com yet operate each as entirely separate sites? | | I cannot think of any example. | gumby wrote: | MIT was, for decades, though they seem to have changed. | phkahler wrote: | Read that again folks: | | "a very reasonable but basic check would be to check each domain | and verify that it was online and responsive to http requests. | With only a million domains, this could be run from my own | computer relatively simply and it would give us a very quick | temperature check on whether the list truly was representative of | the "top sites on the internet". " | | This took him 50 minutes to run. Think about that when you want | to host something smaller than a large commercial site. We live | in the future now, where bandwidth is relatively high and | computers are fast. Point being that you don't need to rent or | provision "big infrastructure" unless you're actually quite big. | cratermoon wrote: | > you don't need to rent or provision "big infrastructure" | unless you're actually quite big. | | Or if you have hard response-time requirements. I really don't | think it would be good to, for example, wait an hour to process | the data from 800K earthquake sensors and send out an alert to | nearby affected areas. 
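gumby's objection is checkable directly by probing both host variants instead of discarding the output. A minimal sketch, with hypothetical helper names (the article's actual script differs), and the network call left commented so the sketch stays offline-safe:

```shell
# variants: the two host names worth probing for a given list entry.
variants() {
  printf '%s\n' "$1" "www.$1"
}

# probe: report the HTTP status for one host, or FAIL when nothing
# answers at all (DNS failure, dropped connection, timeout).
probe() {
  if status=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "http://$1"); then
    echo "$1 $status"
  else
    echo "$1 FAIL"
  fi
}

# Live check (commented out to avoid network access):
# for host in $(variants example.com); do probe "$host"; done
variants example.com
```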
| stevemk14ebr wrote: | your point has some truth to it, for sure, but there's a large | difference between serving requests and making requests. Many | sites are simple html and css pages, but many others also have | complex backends. It's those that often are hard to scale and | why the cloud is hugely popular: maintaining and scaling the | backend is hard | phkahler wrote: | Oh absolutely, but he also said this: | | I found that my local system could easily handle 512 parallel | processes, with my CPU @ ~35% utilization, 2GB of RAM usage, | and a constant 1.5MB down on the network. | | Another thing that happened in the early web days was Apache. | People needed a web server and it did the job correctly. | Nobody ever really noticed that it had terrible performance, | so early on infrastructure went to multiple servers and load | balancers and all that jazz. Now with nginx, fast multi-core, | and speedy networks even at home, it's possible to run sites | with a hundred thousand users a day at home on a laptop. Not | that you'd really want to do exactly that, but it could be | done. | | Because of this I think an alternative to github would be | open source projects hosted on people's home machines. CI/CD | might require distributing work to those with the right | hardware variants though. | [deleted] | jayd16 wrote: | The flip side is anyone can run these kinds of tools against | your site easily and cheaply. | kozziollek wrote: | Most cities in Poland have their own $city.pl domain and allow | websites to buy $website.$city.pl. That might not be well known. | And cities have their own websites, so I guess it's OK. | | But info.pl and biz.pl? Did nobody hear about country variants of | gTLDs?! | drdaeman wrote: | Those are called Public Suffixes or effective TLDs (eTLDs): | https://en.wikipedia.org/wiki/Public_Suffix_List | | And you're entirely correct that the author should've consulted | such a list.
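drdaeman's suggestion can be sketched with standard tools: intersect the list with the Public Suffix List so entries like info.pl or azurewebsites.net get flagged before probing. Tiny inline stand-ins are used here; the real PSL would be downloaded from publicsuffix.org.

```shell
# Tiny inline stand-in for the real PSL (normally fetched from
# https://publicsuffix.org/list/public_suffix_list.dat):
cat > suffixes.txt <<'EOF'
azurewebsites.net
biz.pl
info.pl
EOF

# Stand-in slice of the Majestic list:
cat > domains.txt <<'EOF'
azurewebsites.net
example.com
info.pl
EOF

# comm -12 prints lines common to both sorted inputs: list entries
# that are themselves public suffixes, not registrable sites.
sort suffixes.txt -o suffixes.txt
sort domains.txt | comm -12 - suffixes.txt
```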
| macintux wrote: | Title is misleading: that's the outcome, but the bulk of the | story is the data processing to reach that conclusion. | hinkley wrote: | It happens. Most of the stuff we do these days invokes a number | of disciplines. I forget sometimes that maybe ten percent of us | just play with random CS domains for "fun" and that most people | are coming into big problems blind, even sometimes the | explorers (though having comfort with exploring random fields | is a skill set unto itself). | | Before the Cloud, when people would ask for a book on | distributed computing, which wasn't that often, I would tell | them seriously "Practical Parallel Rendering". That book was | almost ten years old by then. 20 now. It's ostensibly a book | about CGI, but CGI is about distributed work pools, so half the | book is a whirlwind tour of distributed computing and queuing | theory. Once they start talking at length about raytracing, you | can stop reading if CGI isn't your thing, but that's more than | halfway through the book. | | I still have to explain some of that stuff to people, and it | catches them off guard because they think surely this little | task is not so sophisticated as that... | | I think this is where the art comes in. You can make something | fiddly that takes constant supervision, so much so that you get | frustrated trying to explain it to others, or you can make | something where you push a button and magic comes out. | crikeyjoe wrote: | allknowingfrog wrote: | I don't have any particular opinions on the author's conclusions, | but I learned a thing or two about the power of terminal commands | by reading through the article. I had no idea that xargs had a | parallel mode. 
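For anyone else new to it, xargs's parallel mode looks roughly like this. This is a sketch, not the article's exact command; domains.txt is a stand-in list, and the curl probe is shown commented out so the demonstration stays offline.

```shell
# A hypothetical one-host-per-line list standing in for the real million:
printf '%s\n' example.com example.org example.net > domains.txt

# -I {} runs one command per input line; -P 16 keeps up to 16 of them
# in flight at once. The real probe would be something like:
#   xargs -P 16 -I {} curl -s -o /dev/null -m 10 \
#     -w '%{http_code} {}\n' 'http://{}' < domains.txt
# Offline-safe demonstration of the same fan-out:
xargs -P 16 -I {} echo "would check http://{}" < domains.txt
```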
| thelamest wrote: | Probably not news to anyone who works with big data(tm), but I | learned, after additional searches, that using (something like) | duckdb as a CSV parser makes sense, especially if the | alternative is loading the entire thing into memory with | (something like) base R. This was informative for me: | https://hbs-rcs.github.io/large_data_in_R/. | zinekeller wrote: | TLDR: Campbell's methodology is flawed, does not consider edge | cases (one of which (equating apex-only and www-prefixed domains) | I consider reckless), and didn't understand how Majestic collects | and processes its data. | | Longer version: This isn't comprehensive, but I can think of two | main reasons why: | | - The Majestic Million lists only the registrable part (with some | exceptions), and this sometimes leads to central CDNs being | listed. For example, the Majestic Million lists wixsite.com (for | those who are unaware, a CDN domain used by Wix.com with | separate subdomains), but if you visit wixsite.com you wouldn't | get anything. Same with Azure: subdomains of azureedge.net and | azurewebsites.net do exist (for example | https://peering.azurewebsites.net/) but azureedge.net and | azurewebsites.net themselves don't exist. Without similar | filtering, using the Cisco list (https://s3-us-west-1.amazonaws.com/umbrella-static/index.htm...) would quickly | lead you to this precise problem (mainly because the number one | is "com", but phew, at least http://ai./ does exist!) | | - Also, shame on the author for considering www-prefixed and | apex-only as one and the same. For some websites, it isn't. Take this | example: jma.go.jp (Japan Meteorological Agency), which doesn't | respond (actually NODATA) on http://jma.go.jp/ but is fine on | https://www.jma.go.jp/. Similarly, beian.gov.cn (Chinese ICP | Licence Administrator) wouldn't respond at all but | _www_.beian.gov.cn will. And what about ncbi.nlm.nih.gov (National | Center for Biotechnology Information)?
I can't blame Majestic: | https://www.ncbi.nlm.nih.gov/ and https://ncbi.nlm.nih.gov/ don't | redirect to a canonical domain, and unless you've compared the | HTTP pages there's no way you would know that they are the same | website! | | Edit: I've downloaded the CSV to check my claims, and it | shows: wixsite.com 0 beian.gov.cn 0 | | Please, for the love of sanity, consider what the Majestic | Million's (and similar lists') criteria for inclusion actually | are. I can't believe I'm saying this, but can we crowd-source | "Falsehoods programmers believe about domains"? | | Also, an addendum on crawling, which I consider "probably | forgivable": | | - Some websites are only available in certain countries (internal | Russian websites don't respond at all outside Russia, for | example). This can skew the numbers a little bit. | [deleted] | zepearl wrote: | > _Take this example: jma.go.jp (Japan Meteorological Agency), | which doesn't respond (actually NODATA) on http://jma.go.jp/ | but is fine on https://www.jma.go.jp/. Similarly, beian.gov.cn | (Chinese ICP Licence Administrator) wouldn't respond at all but | www.beian.gov.cn will._ | | I can confirm stuff like that - I'm writing a crawler & indexer | program (prototype in Python, now writing the final version in | Rust) and assuming anything while crawling is not OK. I ended up | adding URLs to my "to-index" list by considering only links | explicitly mentioned by other websites (or by pages within the | same site). | cratermoon wrote: | It even says right at the top of the Majestic Million site "The | million domains we find with the most referring subnets", not | implying anything about reachability for http(s) requests. | nr2x wrote: | Majestic is a shit list. Mystery solved. | softwaredoug wrote: | My current beliefs about how people use and trust information on | the Web. | | First, trust is _everything_ on the Web; it is the first thing | people weigh when arriving at some information.
But how people | evaluate trust has changed dramatically over the last 10 years. | | - Trust now comes almost exclusively from social proof. Searching | reddit, youtube, etc., and other extremely _moderated_ sources of | information, where the most work is done to ensure content comes | from actual human beings. How many of us now google `<topic> | reddit` instead of just `<topic>`? | | - Of course a lot of this trust is misplaced. There's a very thin | line between influencers and cult leaders / snake oil salesmen. | Our last President used this hack really effectively. | | - Few trust Google's definition of trust anymore -- essentially | page rank. This made more sense when the Web itself was | essentially social, where inbound links were very organic. Now, | with trust in general Web sites evaporated, the main 'inbound | links' anyone cares about come from individuals or communities | they trust or identify with. They don't trust Google's algorithm | (it's too opaque, and too easily gamed). | | This of course means the fracturing of truth away from elites. | Sometimes this could be a good thing, but in many cases _cough_ | Covid _cough_ it might be pretty disastrous, fueling | misinformation. | wolverine876 wrote: | > How many of us now google `<topic> reddit` instead of just | `<topic>`? | | One of us lives in a bubble. I don't trust Reddit for anything, | or YouTube or any social media. IME, it's mis/disinformation - | not only a lack of information, but a negative; it leaves me | believing something false. My experience is, and plenty of | research shows, that we have no way to sort truth from fiction | without prior expertise in the domain. The misinformation and | disinformation on social media, and its persuasiveness, is very | well known. The results are evident before us, in the madness | and disasters, in dead people, in threats to freedom, | prosperity, and stability. | | Why would people in this community, who are aware of these | issues, trust social media?
How is that working out? | | > This of course means the fracturing of truth away from | elites. Sometimes this could be a good thing | | I think that's mis/disinformation. 'Elite' is a loaded, | negative (in this context) word. It makes the question about | power and the conclusion inevitable. | | Making it about power distracts from the core issue of | knowledge, which is truth. I want to hear from the one person, | or one of the few people, with real knowledge about a topic; I | don't want to hear from others. | | _In matters of science the authority of thousands is not worth | the humble reasoning of one single person._ | Brian_K_White wrote: | They already acknowledge the problem of trusting the crowd, | but you seem to not acknowledge the problem of trusting a | central dispensary. In fact it's unwise to trust either one. | Everything has to be evaluated case by case. The same source | may deserve trust for one thing today, and not for some other | thing tomorrow. | mountainriver wrote: | > How many of us now google `<topic> reddit` instead of just | `<topic>` | | I sure hope not; Reddit is a horrible place for information. | failTide wrote: | I use that strategy for a few things - including when I want | to get reviews of a product or service. There's still | potential for manipulation there, but you can judge the | replies based on the user history - and you know that | businesses aren't able to delete or hide bad reviews there. | | But in general I agree with you - reddit is full of | misinformation, propaganda and astroturfing. | romanhn wrote: | When I have a specific technical question, I append | "stackoverflow" to my search queries. When I want to read a | discussion, I add "reddit" (or "hacker news"). | zX41ZdbW wrote: | This looks surprisingly similar to the unfinished research that I | did: https://github.com/ClickHouse/ClickHouse/issues/18842 | mouzogu wrote: | whenever i go through my bookmarks, i tend to find maybe 5-10% | are now 404.
| | this is why i like the archive.ph project so much, and i'm using | it more as a kind of bookmarking service. | system2 wrote: | archive.ph = Russian federation website. Blocked by most | firewalls by default. | syedkarim wrote: | What's the benefit to using archive.ph instead of archive.org | (Internet Archive)? Seems like the latter is much more likely | to be around for a while. | mouzogu wrote: | i find archive.ph does a better job of preserving the page as | is (it also takes a screenshot) compared to internet archive, | which can be flaky at best. | | i also find archive.ph much faster at searching, and the | browser extension is really useful too. | | the faq does a great job of explaining too: | https://archive.ph/faq | yellow_lead wrote: | Isn't archive.ph/today the one with questionable funding | sources and backing? Who is behind it and can it be trusted | for longevity? | mouzogu wrote: | yeah, funding is a grey area... | | fwiw the website is only accessible by VPN in a lot of | countries, which says a lot for me... and i don't think | they've taken down any content, although i can't say for | sure. | NavinF wrote: | In this case the less we know, the longer it will last. | Notice how this site ignores robots.txt and copyright | claims by litigious companies that would like to see | their past erased. | | The data saved on your NAS will outlast this site | regardless of who owns/funds it. | fragmede wrote: | How do you figure? | NavinF wrote: | What do you mean? There's a line of companies waiting to | sue anyone involved with that site. That's been the case | for many years. | mgdlbp wrote: | archive.today does that by rewriting the page to mostly | static HTML at the time of capture. | | archive.org indexes all URLs first-class and presents as | close to what was originally served as possible. It also | stores arbitrary binary files and captures JS and Flash | interactivity with remarkable fidelity.
| | When logged in, the archive.org Save Page Now interface | gains the options of taking a screenshot and non-recursively | saving all linked pages. I can't work out why -- the more | saved, the better, right? | | archive.org has a browser extension too. | yajjackson wrote: | Tangential, but I love the format of your site. Any plans to do | a "How I built this blog" post? | kerbersos wrote: | Likely Hugo with the Congo theme. | Soupy wrote: | Yup, nailed it. Hugo with the Congo theme (and a few minor layout | tweaks). Hosted on Cloudflare Pages for free. | the_biot wrote: | By what possible criteria are these the "top" million sites, if | 10% are dead? I'd start with questioning _that_ data. | kjeetgill wrote: | Dude, it's the second sentence of the first paragraph: | | > For my purposes, the Majestic Million dataset felt like the | perfect fit as it is ranked by the number of links that point | to that domain (as well as taking into account diversity of the | origin domains as well). | MatthiasPortzel wrote: | And moreover, the author's conclusion is that the dataset is | bad. | | > While I had expected some cleanliness issues, I wasn't | expecting to see this level of quality problems from a | dataset that I've seen referenced pretty extensively across | the web | winddude wrote: | part of the problem is it's not the number of links, it's | referring subnets. Fairly certain this includes script tags. | the_biot wrote: | Yeah, but they're still providing a dataset that's just plain | bad. It's hardly relevant how many sites link to some other | site, if it's dead. | Brian_K_White wrote: | It's only bad data if it does not include what it claims to | include. | | If the dataset is defined as inlinks, and it is inlinks, | then the data is good. | deltree7 wrote: | Exactly! | | Garbage In == Garbage Out | winddude wrote: | No they're not.
| gojomo wrote: | Many issues with this analysis, some of which others have already | mentioned, including: | | * The 'domains' collected by the source, as those "with the most | referring subnets", aren't necessarily 'websites' that now, or | ever, responded to HTTP. | | * In many cases any responding web server will be on the `www.` | subdomain, rather than the domain that was listed/probed - & not | everyone sets up `www.` to respond/redirect. (The author | misinterprets appearances of `www.domain` and `domain` in his | source list as errant duplicates, when in fact that may be an | indicator that those `www.domain` entries also have significant | `subdomain.www.domain` extensions - depending on what Majestic | means by 'subnets'.) | | * Many sites may block `curl` requests because they _only_ want | attended human browser traffic, and such blocking (while usually | accompanied by some error response) _can_ be a more aggressive | dropped connection. | | * `curl` given a naked hostname likely attempts a plain HTTP | connection, and given that even browsers now auto-prefix `https:` | for a naked hostname, some active sites likely have _nothing_ | listening on the plain-HTTP port anymore. | | * The author's burst of activity could've triggered other | rate-limits/failures - either at shared hosts/inbound proxies | servicing many of the target domains, or at local ISP egresses or | DNS services. He'd need to drill down into individual failures to | get a better idea of the extent to which this might be happening. | | If you want to probe if _domains_ are still active: | | * confirm they're still registered via a `whois`-like lookup | | * examine their DNS records for evidence of current services | | * ping them, or any DNS-evident subdomains | | * if there are any MX records, check if the related SMTP server | will confirm any likely email addresses (like postmaster@) as | deliverable. (But: don't send an actual email message.)
| | * (more at risk of being perceived as aggressive) scan any extant | domains (from DNS) for open ports running any popular (not just | HTTP) services | | If you want to probe if _web sites_ are still active, start with | an actual list of web site URLs that were known to have been | active at some point. | spc476 wrote: | It dawned on me when I hit the Majestic query page [1] and saw | the link to "Commission a bespoke Majestic Analytics report." | They run a bot that scans the web, and (my opinion, no real | evidence) they probably don't include sites that block the | MJ12bot. This could explain why my site isn't in the list, I | had some issues with their bot [2] and _they_ blocked | themselves from crawling my site. | | So, is this a list of the actual top 1,000,000 sites? Or just | the top 1,000,000 sites they crawl? | | [1] https://majestic.com/reports/majestic-million | | [2] http://boston.conman.org/2019/07/09-12 | useruserabc wrote: | As near as I can tell, these are the top 1,000,000 domains | referred to by other websites they crawled. | | The report is described as "The million domains we find with | the most referring subnets"[1] and a referring subnet is a | host with a webpage which points at the domain. | | So to the grandparent, presumably if something is "linking" | to these domains, they probably were meant to be websites. | | [1] https://majestic.com/reports/majestic-million [2] | https://majestic.com/help/glossary#RefSubnets, | https://majestic.com/help/glossary#RefIPs and also | https://majestic.com/help/glossary#Csubnet | bioemerl wrote: | I'm honestly amazed that out of the top million sites, which | probably includes a ton of tiny tiny sites that are idle or | abandoned, only ten percent are offline. | MonkeyMalarky wrote: | How many are placeholder pages thrown up by registrars like | Network Solutions? | denton-scratch wrote: | If they're placeholder _pages_ , they're not dead. 
Those 10% | are not responding at all; the requests aren't reaching any | HTTP server. | winddude wrote: | at least from his computer/script. A number could have been | blocked simply detecting him as a bot. | zamadatix wrote: | Not all placeholder pages will forever stay placeholder | pages though. Some may get sold, become a site, then stop | being a site again. Some may not get sold, come up for | renewal and be deemed unlikely to be worth trying to sell | anymore (renewal is cheap for a registrar but the registry | will still charge a small fee). | | Of course the vast majority with enough interest to make | this list will either be sold and be an active page or | still be an active placeholder but I wouldn't rule out | there being a good count of pages towards the lower end of | the top million being placeholders that were eventually | deemed not worth trying for anymore. | mike_hock wrote: | Yeah, I'd expect a list of 1,000,000 "top" "sites" to contain | much more than what can be called a "site," especially in 2022 | when the internet has been all but destroyed and all that's | left is a corporate oligopoly. | ehsankia wrote: | How is "top" defined here? If they were dead, wouldn't they | fairly quickly stop being "top"? | | EDIT: the article uses a list sorted by inlinks, and I guess | other websites don't necessarily update broken links, but that | may be less true in the modern age where we have tools and | automated services to automatically warn us about dead links on | our websites. | nine_k wrote: | I can expect large SEO spam clusters of "sites" with many | links inside a cluster to make them look legit. For some time | such bits of SEO spam were on top of certain google searches | and enjoyed significant traffic, putting them firmly into | "top 1M". | | Once a particular SEO trick is understood and "deoptimized" | by Google, these "sites" no longer make money, and get | abandoned. 
| Swizec wrote: | Blows my mind that my blog is 210,863rd on that list. That makes | the web feel somehow smaller than I thought it was. | wincent wrote: | Eyeing you jealously from my position at 237,014 on the | list... We're almost neighbors, I guess. | gravitate wrote: | > Domain normalization is a bitch | | I'm a no-www advocate. All my sites can be accessed from the apex | domain. But some people for whatever reason like to prepend www | to my domains, so I wrote a rule in Apache's .htaccess to rewrite | the www to the apex. | | Here's a tutorial for doing that: https://techstream.org/Web-Development/HTACCESS/WWW-to-Non-W... | noizejoy wrote: | > I'm a no-www advocate. | | I used to feel the same way -- until the arrival of so many | new TLDs. | | Since then I always use www, because mentioning www.alice.band | in a sentence is much more of a hint to a general audience as | to what I'm referring to than just alice.band. | gravitate wrote: | I hear you. But a redirect is a good solution in that case. | noizejoy wrote: | Yes it is. | | I just redirect the other way round, so those ever rarer | individuals typing in domains are also served fine on my | websites. And also to automatically grab the https. | | I just find it ever so slightly more "honest" to have the | server name I mention also be the one that's actually | being served. -- And that's also because I'm quite annoyed | at URL shorteners and all kinds of redirect trickery having | been weaponized over the years. | | So I optimize for honesty and facilitate convenience. | | But this is pretty subtle stuff and I'm not advocating | anymore. -- I don't think it's that big of a deal either | way and I'm just expressing my little personal vote and | priorities on the big Internet. :-) | | So my post wasn't intended to change your mind, but more as | a bit of an alternative view and what made me get there.
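For reference, the www-to-apex rule gravitate mentions is usually written like this in mod_rewrite (a generic sketch; the linked tutorial may differ in details, and the domain matched here is whatever host the request arrived on):

```apache
# Redirect any www.<host> request to the bare <host>, preserving the path.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
RewriteRule ^ https://%1%{REQUEST_URI} [R=301,L]
```

The `%1` backreference reuses whatever followed `www.` in the Host header, so one rule covers every domain served from the same .htaccess.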
| macintux wrote: | 25 years ago I added a rule to my employer's firewall to allow | the bare domain to work on our web server. | | Inbound email immediately broke. I was still very new, and | didn't want to prolong the downtime, so I reverted instead of | troubleshooting. | | A few months after I left, I sent an email to a former | co-worker, my replacement, and got the same bounce message. I rang | him up and verified that he had just set up the same firewall | rule. | | Been much too long to have any clue now what we did wrong. | JackMcMack wrote: | You probably created a CNAME from the apex to www? This | problem still exists today. | | From https://en.wikipedia.org/wiki/CNAME_record: "If a CNAME | record is present at a node, no other data should be present; | this ensures that the data for a canonical name and its | aliases cannot be different." | | So if you're looking up the MX record for domain, but happen | to find a CNAME for domain to www.domain, it will follow | that and won't find any MX records for www.domain. | | The correct approach is to create a CNAME record from | www.domain to domain, and have the A record (and MX and other | records) on the apex. | | Most DNS providers have a proprietary workaround to create | DNS redirects on the apex (such as AWS Route 53 Alias records) | and serve them as A records, but those rarely play nice with | external resources. | tux2bsd wrote: | > You probably created a cname from the apex to | | You can't do that, period. | | A lot of "cloud" and other GUI interfaces deceive people | into thinking it's possible; they just do A record fuckery | behind the scenes (clever in its own right, but it causes | misunderstanding). | MonkeyMalarky wrote: | Last time I tried to crawl that many domains, I ran into problems | with my ISP's DNS server. I ended up using a pool of public DNS | servers to spread out all the requests. I'm surprised that wasn't | an issue for the author?
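The resolver-pool approach MonkeyMalarky describes can be sketched like this, with a hypothetical pool and domain list; the actual dig call is commented out so nothing touches the network.

```shell
# pick_resolver N: choose the N-th resolver, wrapping around the pool.
pick_resolver() {
  n=$1
  set -- 1.1.1.1 8.8.8.8 9.9.9.9   # hypothetical pool of public resolvers
  shift $(( n % $# ))
  echo "$1"
}

# Spread lookups across the pool, one resolver per query:
i=0
while read -r domain; do
  resolver=$(pick_resolver "$i")
  echo "$domain via $resolver"
  # dig +short @"$resolver" "$domain" A
  i=$(( i + 1 ))
done <<'EOF'
example.com
example.org
example.net
example.edu
EOF
```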
| wumpus wrote: | You have to run your own resolver. Crawling 101. | MonkeyMalarky wrote: | This is of course the correct answer. It just felt like | shaving a big yak at the time. | mh- wrote: | A properly configured unbound running locally can be a | decent compromise. | denton-scratch wrote: | That _is_ running your own resolver. Unbound is a | resolver. | mh- wrote: | well, yes, but I guess I think of unbound in a different | category from setting up (e.g.) bind. but, my experience | configuring bind is probably more than 20 years out of | date. | | you're right to make that correction though, so thank | you. :) | fullstop wrote: | BIND is odd in that it combines a recursive resolver with | an authoritative name server, and this has actually led | to a number of security vulnerabilities over the years. | Other alternatives, such as djb's dnscache/tinydns and | NLnet Labs' Unbound/NSD, separate the two to avoid this | entirely. | altdataseller wrote: | All these top million lists are very good at telling you the | topmost 10K-50K sites on the web. After that, you're going into | 'crapshoot' land, where the 500,000th most popular site is very | likely to be a site that got some traffic a long time ago, but | now isn't even up. | | So I would take this data with a grain of salt. You're better off | just analyzing the top 100K sites on these lists. | giantrobot wrote: | > where the 500,000th most popular site is very likely to be a | site that got some traffic a long time ago, but now isn't even | up. | | That's literally the phenomenon the article is describing. | altdataseller wrote: | Ok, let me reword it differently: the 500,000th most popular | site on these lists most likely isn't the 500,000th most | visited, and it might not even be in the top 5 million.
These | data sources are so bad at capturing popularity after 50k | sites or so simply because they don't have enough data. | iruoy wrote: | I haven't tested this, but the "Cisco Umbrella 1 Million" | is generated daily from DNS requests made to the Cisco | Umbrella DNS service. That seems to be a very good and | recent dataset. | | It does count more than just visiting websites though. If | all Windows computers query the IP of microsoft.com once a | day, that'll move them up quite a bit. And things in their | top 10 like googleapis.com and doubleclick.net are | obviously not visited directly. | | So while it is quite a reliable and recent dataset, it is | not a good test of popularity. | spaceman_2020 wrote: | Not surprising. We're far away from the glory days of the | vibrant, chaotic web. | | In countries like India that onboarded most users through | smartphones instead of computers, websites are not even | necessary. There's a huge dearth of locally-focused web content as | well, since there just isn't enough demand. | [deleted] | superb-owl wrote: | One of the few things I like about blockchain is the promise of a | less ephemeral web. | bergenty wrote: | Is that actually true? Don't most nodes hold heavily compressed | pointers, while only a percentage of nodes host the entire | blockchain? I mean, if what you're saying is true, then each node | needs to have a copy of the entire internet, which isn't | reasonable. | matkoniecz wrote: | One of many things I dislike about cryptoscams is making | promises which are lies. | deltree7 wrote: | spoken like someone who is clueless about Blockchain | pahool wrote: | zombo.com still kicking! | system2 wrote: | The png rotates with this: | | .rotate {animation: rotation .5s infinite linear;} | | I think it wasn't like this before. They must've updated it at | one point. | smugma wrote: | I downloaded the file and looked at the second 000 in his file, | which refers to wixsite.com.
| | It appears that wixsite.com isn't valid but www.wixsite.com is, | and redirects to wix.com. | | It's misleading to say that the sites are dead. As noted | elsewhere, his source data is crap (other sites I checked, such as | wixstatic.com, don't appear to be valid), but his methodology is | bad, or at least his describing the sites as dead is misleading. | code123456789 wrote: | wixsite.com is a domain for free sites built on Wix, so if your | username on Wix is smugma, and your site name is mysite, then | you'll have a URL like smugma.wixsite.com/mysite for your Home | page. | | That's why this domain is at the top of the list. | smugma wrote: | Correct, that's why it's at the top. Your example further | confirms why the author's methodology is broken. | winddude wrote: | 100% agree his methodology is broken. Another example like this | is googleapis.com. If I remember correctly, there are quite a | number of domains like this in the Majestic Million. | | Not to mention a number of his requests may have been blocked. | zinekeller wrote: | > other sites I checked such as wixstatic.com don't appear to | be valid | | But docs.wixstatic.com _is_ valid. | quickthrower2 wrote: | He takes this into account by generously considering _any_ | returned response code as "not dead". | | > there's a long tail of sites that had a variety of non-200 | response codes but just to be conservative we'll assume that | they are all valid | mort96 wrote: | That doesn't take this into account, no. `curl wixsite.com` | returns a "Could not resolve host" error; it doesn't return a | response code, so the author would consider it invalid, even | though `curl www.wixsite.com` does return a response (a 301 | redirect to www.wix.com). | quickthrower2 wrote: | Oh, how does that work then? How does the browser get to the | redirect when curl doesn't get any response at all? Is this | a DNS thing? | chrisweekly wrote: | apex domain is different from www cname | zzzeek wrote: | irony that the site is not responding?
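The distinction mort96 describes shows up in curl's exit status: exit 6 is "could not resolve host", while exit 0 means some HTTP response (any status code) came back. A sketch with hypothetical helper names, with the live calls commented out to stay offline:

```shell
# describe CODE HOST: map a curl exit status to a verdict.
describe() {
  case $1 in
    0) echo "$2: responded" ;;
    6) echo "$2: no DNS record (try www.$2)" ;;
    *) echo "$2: other failure (timeout, refused, ...)" ;;
  esac
}

# classify HOST: probe it over HTTP and report.
classify() {
  curl -s -o /dev/null -m 10 "http://$1"
  describe $? "$1"
}

# classify wixsite.com       # apex has no A record: "no DNS record"
# classify www.wixsite.com   # answers (a 301 to www.wix.com): "responded"
```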
| ghostly_s wrote: | Wow, I would not have suspected `tee` is able to handle multiple | processes writing to the same file. Doesn't seem to be mentioned | on the man page, either. | ocdtrekkie wrote: | I've been working on trying to migrate sites I ran in 2008 or so | into my new preferred hosting strategy lately: I know zero people | look at them, since many were functionally broken anyway, but | I don't like the idea of actually removing them from the web. So | I'm patching them up, migrating them to a more maintainable | setting, and keeping them going. Maybe someday some historian | will get something out of it. | tete wrote: | The biggest problem I find is that it seems to be considered | "outdated" to keep redirects in place if you move stuff. So many | links to news websites, etc. will cause a redirect to either / or | a 404 (which is a very odd thing to redirect to, in my opinion). | | If you are unlucky, an article you wanted to find has also completely | disappeared. This is scary, because it's basically history | disappearing. | | I also wonder what will happen to the text on websites built on | Ajax when the JavaScript breaks because a third party goes down. | While the Internet Archive seems to be building tools for people | to use to mitigate this, I found that they barely worked on | websites that do something like this. | | Another worry is the ever-increasing size of these scripts making | archiving more expensive. | Kye wrote: | You can often pop the URL into the Wayback Machine to bring up | the last live copy. It's better at handling dynamic stuff the | more recent it is. Older stuff, especially early AJAX pages, | is just gone because the crawler couldn't handle it at the | time. It's far from a perfect solution, especially in light of | the big publishers finally getting their excuse to go after the | Internet Archive legally. It's a good silo, but just as | vulnerable as any other.
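On the `tee` point above: the reason concurrent writers don't corrupt the stream is that POSIX guarantees writes to a pipe of at most PIPE_BUF bytes (at least 512, typically 4096 on Linux) are atomic, so short lines interleave whole rather than tearing mid-line. A small offline demonstration:

```shell
# Four concurrent writers share one pipe into a single tee.
# Each line is far smaller than PIPE_BUF, so every write lands intact;
# only the ordering across writers is nondeterministic.
( for i in 1 2 3 4; do
    printf 'writer-%s\n' "$i" &
  done
  wait
) | tee combined.txt > /dev/null

# Sort to get a deterministic view of what arrived:
sort combined.txt
```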
___________________________________________________________________ (page generated 2022-07-15 23:00 UTC)