[HN Gopher] Cached Chrome Top Million Websites
       Cached Chrome Top Million Websites
       Author : edent
       Score  : 199 points
       Date   : 2022-12-31 13:49 UTC (9 hours ago)
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
       | cronaday wrote:
       | This is _very_ ethically dubious. Google is collecting _raw URLs_
       | from Chrome users who turned on history syncing across their
       | _own_ devices, then reusing the data and funneling it through
       | Stanford. No way Chrome users understand or approve of this.
       | The paper tries to justify its ethics with Google's privacy
       | policy, which is laughable. There are so many papers about how
       | meaningless privacy policies are. If Apple or Mozilla did
       | anything remotely like this, Hacker News would riot.
       | Edit: I don't want to be a conspiracy theorist, but this post
       | suddenly got a bunch of downvotes at the same time as defensive
       | comments from a current Googler and recent ex-Googler. Then one
       | of my responses below to a Chrome developer got flagged for no
       | obvious reason. Hmm.
         | jeffbee wrote:
         | Maybe your posts would get better votes if you made any effort
         | at all to back up your claim on unethical behavior. You
         | provided nothing.
         | dang wrote:
         | Can you please make your substantive points without breaking
         | the site guidelines? You did that here with your last
         | paragraph, and worse at
         | https://news.ycombinator.com/item?id=34197958.
         | If you wouldn't mind reviewing
         | https://news.ycombinator.com/newsguidelines.html and taking the
         | intended spirit of the site more to heart, we'd be grateful.
         | soneca wrote:
         | > _" If Apple or Mozilla did anything remotely like this,
         | Hacker News would riot."_
         | My perception is that, collectively, HN hates and criticizes
         | Google much more than Apple and Mozilla. I mean, _much_ more.
         | This last sentence accusation sounded bizarre to me.
           | dmitriid wrote:
           | > I mean, much more.
           | Because Google is a web advertisement company that dominates
           | many large spheres: search, browsers (including standards
           | committees), email, mobile (Android is 77% market share) etc.
           | All are things that we've come to view as crucial to modern
           | life.
           | And time and again they've shown that they only view that
           | dominance as a funnel for ad revenue, data collection, and
           | whatever benefits them at this particular moment.
           | hericium wrote:
           | > My perception is that, collectively, HN hates and
           | criticizes Google much more than Apple and Mozilla.
           | Not the entirety of HN. As I have more than once delicately
           | pointed out[1], Mozilla is Google's bitch.
           | [1] https://news.ycombinator.com/item?id=30732539
           | cronaday wrote:
           | Just suggesting that prior browser and OS privacy blowups
           | involving those companies have been over less worrisome
           | things, not that those companies are subject to more or less
           | criticism. Looking back on outraged discussions of Mozilla's
           | telemetry is kinda quaint in comparison.
           | marcosdumay wrote:
           | > HN hates and criticizes Google much more than Apple and
           | Mozilla.
           | That is mostly because Apple almost never does something like
           | this, and Mozilla literally never does.
         | chiefalchemist wrote:
         | re: Edit.
         | I've noticed similar behavior in HN voting. Down vote spikes
         | but few if any comments in line with the voting. Not sure if
         | it's bots, human-based click farms, or people who just don't
         | understand that disagreement is not grounds for down voting.
         | Perhaps a bit of all three?
           | pvg wrote:
           | _just don't understand that disagreement is not grounds for
           | down voting._
           | It is perfectly fine on HN and always has been.
             | chiefalchemist wrote:
             | Nah. I see it differently.
             | The Guidelines are clear about why we're here and
             | expectations. The emphasis is on discussion, learning and
             | objectivity. Yes, disagreement is mentioned (i.e., allowed)
             | but even that needs to be constructive, yes?
             | A down vote - with no discussion - well, frankly in the
             | context of the Guidelines, is:
             | 1) Not in the spirit of the guidelines; 2) Perhaps
             | redundant to 1, but lazy; 3) At best, small-minded and
             | childish;
             | If people want to pout about reading something they don't
             | like, this isn't the place for them.
             | Yeah, I see who you are. And I'm ok w/ pushing back. That's
             | what makes HN what it is ;)
               | pvg wrote:
               | You see it differently, but that's just not accurate. It's
               | not in the guidelines super-explicitly, so it's easy to miss.
               | https://news.ycombinator.com/item?id=22910444
               | https://news.ycombinator.com/item?id=16131314 and there
               | are many many others
               |  _Yeah, I see who you are._
               | I'm literally a random scold on the internet, I just
               | happen to be right about this.
               | chiefalchemist wrote:
               | I disagree.
               | I'm not going to explain why.
               | How does that feel? What value does it add? (Sweet FA,
               | eh.)
               | You're right, you might be right. But that doesn't make it
               | right. I get zero satisfaction from context-less down
               | votes. I don't do them. I ignore them when I get them
               | (i.e., they have zero influence on my HN behavior). If
               | I'm changing my mind over some lazy a-holes' click, I'm
               | losing. Big time.
               | I can't imagine why anyone feels any differently. The
               | reality is, they are pointless noise. There's not enough
               | context to drive anything actionable for anybody.
               | But while I have your attention: how about a feature
               | request: Karma points that consider the discussion below
               | a top-parent comment.
         | jefftk wrote:
         | Google has written publicly about how this system works:
         | https://developer.chrome.com/docs/crux/methodology/
         | https://www.google.com/chrome/privacy/whitepaper.html#usages...
         | This includes only listing publicly discoverable pages, only
         | including data from users who have turned on "Make searches and
         | browsing better (Sends URLs of pages you visit to Google)", and
         | only including pages that are visited by a minimum number of
         | users.
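The "minimum number of users" rule described above can be pictured with a small sketch. This is not Google's pipeline: the function name, the `(user_id, origin)` data shape, and the threshold value are all hypothetical and only illustrate the general idea of dropping origins seen by too few distinct users.

```python
from collections import defaultdict


def publishable_origins(visits, min_users):
    """Return origins visited by at least `min_users` distinct users.

    `visits` is an iterable of (user_id, origin) pairs. The data shape and
    the threshold are hypothetical, chosen only for illustration.
    """
    users_per_origin = defaultdict(set)
    for user_id, origin in visits:
        # A set ignores repeat visits by the same user.
        users_per_origin[origin].add(user_id)
    return sorted(
        origin
        for origin, users in users_per_origin.items()
        if len(users) >= min_users
    )


# Made-up sample data: three users visit one origin, one user visits another twice.
sample_visits = [
    ("u1", "https://example.com"),
    ("u2", "https://example.com"),
    ("u3", "https://example.com"),
    ("u1", "https://rare.example"),
    ("u1", "https://rare.example"),
]

print(publishable_origins(sample_visits, min_users=3))
```

With a threshold of 3, only the origin seen by three distinct users survives; the origin visited twice by a single user is dropped.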
           | blast wrote:
           | "Users who have turned on" implies that they opted in. Is
           | this behavior opt-in or opt-out?
           | cronaday wrote:
           | > Google has written publicly about how this system works
           | If this is news to Hackers News, there is no way that regular
           | Chrome users are aware of it. Saying something in a privacy
           | policy or on a developer website just can't be enough for
           | _analyzing a person's URL data_.
           | > This includes only listing publicly discoverable pages,
           | only including data from users who have turned on "Make
           | searches and browsing better (Sends URLs of pages you visit
           | to Google)", and only including pages that are visited by a
           | minimum number of users.
           | Since when does aggregating this type of data make it fair
           | game? This is _analyzing a person's URL data_ from _their
           | own devices_. There has always been a big bright red line for
           | browsers touching a user's browsing history. Google crossed
           | that line.
           | Also, I just checked on a fresh Chrome install. The "Make
           | searches and browsing better" option is _enabled by default_
           | and _buried in Chrome settings_. How is that acceptable
           | consent for _analyzing a person's URL data_?
             | jasonlotito wrote:
             | > If this is news to Hackers News
             | This isn't news to Hacker News. This might be new to you,
             | but that does not mean it's some new information that's
             | been hidden.
             | > there is no way that regular Chrome users are aware of
             | it.
             | There is no way that a regular Chrome user is aware of all
             | the features Chrome offers, let alone all the details. Not
             | because Google is hiding this from them, but because there
             | is just a LOT Chrome does.
             | I don't even use Chrome, and none of this is new.
             | It's fine that this is all new to you, but it's not new to
             | you because anyone has kept this secret. At this point,
             | you've chosen to remain ignorant.
               | dang wrote:
               | Please don't cross into personal attack.
               | https://news.ycombinator.com/newsguidelines.html
               | Edit: you've unfortunately been breaking the site
               | guidelines a ton lately. Seriously not cool, and well
               | past the line at which we start banning the account.
               | I don't want to ban you, but if you keep this up, we'll
               | have to. If you'd please review
               | https://news.ycombinator.com/newsguidelines.html and
               | stick to the rules, we'd appreciate it.
               | marcosdumay wrote:
               | Is it opt-in or opt-out? And if it's opt-in, does it come
               | with infinite nagging until you opt-in?
               | I know logging in and syncing your data are "opt-in"
               | options that come with infinite nagging (so, actually
               | required options). The information that there are
               | different levels of syncing is news to me.
               | lucb1e wrote:
               | > This might be new to you, but that does not mean it's
               | some new information that's been hidden.
               | I downloaded Chrome on a new laptop an hour ago (at my
               | employer's request, I'd use Firefox myself) and was
               | certainly not aware of this.
               | This information was not on any screen at any point.
               | There was a default-checked checkmark for some general
               | statistics sharing which I only noticed after clicking
               | download (because it was small and _below_ the download
               | button), but I didn't click through to the privacy policy
               | to learn more.
               | Guess I should have read the privacy policy. I'm trying
               | to find what it said now, but I can't see it anymore
               | because different terms apply to Linux downloads and
               | there's no button to download the Windows version.
               | Basically, visiting the same page in Firefox on Linux
               | (instead of Edge on Windows, which I don't have access to
               | atm) gives me different contents and no checkmark.
               | cronaday wrote:
               | If a user can somehow someway somewhere learn about what
               | a company is doing, then it's OK? Really?
               | jefftk wrote:
               | I could see two main arguments for this not being okay:
               | * Chrome is secretly collecting data.
               | * Chrome is doing something users would object to if they
               | knew and understood it.
               | I don't think either of these is the case here: they are
               | sharing data about what sites people generally visit, in
               | an aggregated form that doesn't reveal any individual's
               | browsing (what's to object to?), and they talk about it in
               | the place people would go to learn about what data they
               | collect.
               | jstummbillig wrote:
               | What is it you are proposing? If it were every
               | institution's obligation to make sure that all its
               | instrumental functions were obvious to every potential
               | user, and to keep any user from engaging with the
               | institution under any false assumptions, nothing in our
               | society would work.
               | That is not to say that scrutiny is not important. You
               | should certainly be allowed to point at any individual
               | function and demand more upfront transparency than what
               | is currently being offered. But be aware of the massive
               | additional cognitive load you create, for everyone, when
               | you are not just demanding information _availability_,
               | but that this information is being _delivered_ to anyone
               | it _might_ concern. Any individual's preference to not
               | care about a function would have to take the backseat to
               | the opinion that they have to at least somewhat consider
               | the function before engaging.
               | Considering how expensive this process is, "Google Chrome
               | CrUX" would probably be pretty far down on the list for
               | me personally, as "crucial things everyone should
               | definitely know about before possibly engaging" goes, but
               | to each their own.
               | dmitriid wrote:
               | > It's fine that this is all new to you, but it's not new
               | to you because anyone has kept this secret. At this
               | point, you've chosen to remain ignorant.
               | Ah yes. Blame the user for not understanding yet another
               | piece in Google's gargantuan data collecting machinery.
               | Recent court cases revealed that Google's own employees
               | don't know what's tracked and how to turn it off. But I'm
               | sure it's only ignorance that keeps users uninformed.
           | vachina wrote:
           | If crux is what Google is willing to make public, it makes
           | one wonder what else is collected and stored for their own
           | use (i.e. their moat).
           | I'm not using Chrome on all my devices.
           | addingnumbers wrote:
           | > only including data from users who have turned on "Make
           | searches and browsing better (Sends URLs of pages you visit
           | to Google)"
           | One big problem there is that we don't know for what
           | percentage of users "turned on" is a euphemism for "didn't
           | notice."
           | kevin_thibedeau wrote:
           | Does this setting apply to Android assistant?
         | tristor wrote:
         | I very much agree with you. This type of data collection MUST
         | be opt-in to be ethical, and in Chrome it's enabled by default
         | and buried. The VAST majority of users have no idea this is
         | even happening. It is grossly unethical and it is obvious that
         | it is so, but unsurprisingly folks at Google are happy to do
         | things like this given their salaries.
         | [deleted]
         | pvg wrote:
         | _Edit: I don't want to be a conspiracy theorist_
         | That's not merely a good idea but also
         | https://news.ycombinator.com/newsguidelines.html
         | _Please don't post insinuations about astroturfing, shilling,
         | bots, brigading, foreign agents and the like. It degrades
         | discussion and is usually mistaken. If you're worried about
         | abuse, email hn@ycombinator.com and we'll look at the data._
         | There's also just not writing in the high-dudgeon flamewar
         | style which helps with the downvotes.
         | dadrian wrote:
         | 1. They're not funneling it through Stanford. They're posting
         | it publicly, but on BigQuery
         | https://developer.chrome.com/docs/crux/
         | 2. Chrome prompts you to opt-out of metrics collection on
         | install.
         | None of the reasons you've listed for this being ethically
         | dubious are true.
           | dmitriid wrote:
           | This is ethically dubious: "Chrome prompts you to opt-out of
           | metrics collection on install.".
           | Because the _default_ should be "opted out by default, let
           | the user opt-in if they so wish"
           | cronaday wrote:
           | > 1. They
           | You appear to work on Chrome at Google and have cofounded a
           | business with the Stanford person. That seems relevant
           | context.
           | > 1. They're not funneling it through Stanford. They're
           | posting it publicly, but on BigQuery
           | https://developer.chrome.com/docs/crux/
           | The paper says there was special data access.
           | > 2. Chrome prompts you to opt-out of metrics collection on
           | install.
           | An opt out for reusing personal URL data is wholly
           | unacceptable.
             | dang wrote:
             | Please don't cross into personal attack.
             | https://news.ycombinator.com/newsguidelines.html
         | [deleted]
       | wirthjason wrote:
       | Curious where PornHub and other sites rank. I always hear that
       | porn sites are in the top X of all traffic, but people don't
       | talk about it due to its nature.
       | I'm always amazed that they have a data science team. It's not
       | something many would expect from the porn industry. I certainly
       | didn't expect it.
       | https://www.pornhub.com/insights/2022-year-in-review
         | layer8 wrote:
         | A quick grep shows that there are almost 2.5K domains at 1M
         | rank with "porn" in their name.
         | The data science teams likely provide a considerable ROI in
         | that industry.
           | system2 wrote:
           | The majority of porn sites do not have the word "porn" in
           | them, though. I wish there was a categorization of these.
           | Giorgi wrote:
           | does it though? Pretty sure adult results are being filtered
           | out by a Google tool named "SafeSearch". It removes anything
           | adult from SERP and it is on by default.
             | yorwba wrote:
             | This appears to have some unintuitive consequences. When I
             | searched for "porn" in a cookie-less session just now,
             | there were still porn results, but no well-known sites (at
             | least I didn't recognize the names). Searching for
             | literally "pornhub", the first result is "porhub.com"
             | without the "n".
             | Seems like the "SafeSearch" filter is based on a list of
             | "adult domains" instead of the indexed content at the URL.
         | mtmail wrote:
         | "New Year's Eve kicked holiday ass with a massive -40% drop in
         | worldwide traffic from 6pm to Midnight on December 31st." It's
         | Dec/31, 1pm in New York right now.
         | E39M5S62 wrote:
         | If you ignore the content, large-scale adult sites are just
         | like any other high traffic (bandwidth, RPS) site out there. A
         | lot of planning goes into where their content delivery PoPs
         | should be placed.
         | mtmail wrote:
         | "Pornhub's statisticians make use of Google Analytics to figure
         | out the most likely age and gender of visitors. This data is
         | anonymized from billions of visits to Pornhub annually, giving
         | us one of the richest and most diverse data sets to analyze
         | traffic from all around the world."
         | oars wrote:
         | "Pornhub's statisticians make use of Google Analytics to figure
         | out the most likely age and gender of visitors. This data is
         | anonymized from billions of visits to Pornhub annually, giving
         | us one of the richest and most diverse data sets to analyze
         | traffic from all around the world."
       | mg wrote:
       | Top level domains by popularity:
       |     grep -oP '\.[a-z]+(?=,)' current.csv | sort | uniq -c | sort -n
       |     ...
       |      15840 .pl
       |      17914 .it
       |      20182 .de
       |      21690 .in
       |      27812 .ru
       |      29194 .jp
       |      30359 .org
       |      35741 .br
       |      36675 .net
       |     406052 .com
       | .com domains by popularity:
       |     grep -oP '[a-z0-9-]+\.com(?=,)' current.csv | sort | uniq -c | sort -n
       |     ...
       |      365 tistory.com
       |      370 fc2.com
       |      408 skipthegames.com
       |      489 online.com
       |      515 wordpress.com
       |      707 uptodown.com
       |      880 schoology.com
       |     2570 fandom.com
       |     2651 instructure.com
       |     3244 blogspot.com
         | voytec wrote:
         | sort -r will reverse the order, from most to least popular.
           | layer8 wrote:
           | .
             | codetrotter wrote:
             | uniq -c is in the pipeline because it counts the number of
             | uniques
               | layer8 wrote:
               | Ah, right, missed that.
         | slim wrote:
         | it's amazing that fandom is number 3 and wikipedia is not even
         | there
           | csande17 wrote:
           | Wikipedia uses a .org domain, so it won't show up on "most
           | popular .com domains" lists. (And I think the parent comment
           | is searching for domains with lots of subdomains, which is
           | why providers like Blogspot and Fandom show up.)
         | [deleted]
         | azeemba wrote:
         | It might be worth updating this comment and explaining your
         | second query.
         | People seem to think it is somehow measuring visits to those
         | origins. But it's measuring how many unique subdomains are
         | listed for those domains
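To make the point about the second query concrete, here is a minimal Python sketch that mirrors the same regular expression on a few made-up CSV rows. The sample origins and ranks are hypothetical; only the pattern is taken from the comment being discussed.

```python
import re
from collections import Counter

# Same pattern as the grep in the parent comment: the label right before
# ".com", followed by the comma that separates origin from rank.
COM_DOMAIN = re.compile(r"[a-z0-9-]+\.com(?=,)")


def count_com_rows(csv_lines):
    """Count how many CSV rows fall under each *.com domain."""
    counts = Counter()
    for line in csv_lines:
        counts.update(COM_DOMAIN.findall(line))
    return counts


# Made-up rows in the "origin,rank" shape of the dataset.
sample = [
    "https://alice.blogspot.com,1000000",
    "https://bob.blogspot.com,1000000",
    "https://carol.blogspot.com,1000000",
    "https://www.example.com,1000",
]

counts = count_com_rows(sample)
for domain, rows in counts.most_common():
    print(rows, domain)
```

In this sample the blog host comes out on top with three rows even though each of its subdomains sits in a lower rank bucket than the single `www.example.com` row. The count reflects how many listed origins share a domain, not how often that domain is visited.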
         | kristianp wrote:
         | Loading the data into the duckdb cli [0] and doing the first
         | query:
         |     create table current as select * from '202211.csv';
         |     select * from current;
         |     +------------------------------------+---------+
         |     |               origin               |  rank   |
         |     |              varchar               |  int32  |
         |     +------------------------------------+---------+
         |     | https://hochi.news                 |    1000 |
         |     | https://www.xnxx.xxx               |    1000 |
         |     | https://www.wordreference.com      |    1000 |
         |     | https://finance.naver.com          |    1000 |
         |     | https://www.macys.com              |    1000 |
         |     | https://www.xv-videos1.com         |    1000 |
         |     | https://fr.xhamster.com            |    1000 |
         |     | https://poki.com                   |    1000 |
         |     | https://salonboard.com             |    1000 |
         |     | https://clgt.one                   |    1000 |
         |     select tld, count(*)
         |     from (select reverse(substr(reverse(origin),1, position('.' in reverse(origin))-1)) tld
         |           from current)
         |     group by tld
         |     order by count(*) desc;
         |     +-----------+--------------+
         |     |    tld    | count_star() |
         |     |  varchar  |    int64     |
         |     +-----------+--------------+
         |     | com       |       406052 |
         |     | net       |        36675 |
         |     | br        |        35741 |
         |     | org       |        30359 |
         |     | jp        |        29194 |
         |     | ru        |        27812 |
         |     | in        |        21690 |
         |     | de        |        20182 |
         |     | it        |        17914 |
         |     | pl        |        15840 |
         |     | *         |            * |
         |     | *         |            * |
         |     | *         |            * |
         |     | za:5002   |            1 |
         |     | lk:8090   |            1 |
         |     | org:1445  |            1 |
         |     | co:14443  |            1 |
         |     | ar:3016   |            1 |
         |     | net:8001  |            1 |
         |     | care:9624 |            1 |
         |     | au:8443   |            1 |
         |     | com:333   |            1 |
         |     | edu:9016  |            1 |
         |     +-----------+--------------+
         |     |   2076 rows (20 shown)   |
         |     +--------------------------+
         | [0] https://duckdb.org/docs/installation/
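The tail of the TLD result above contains entries such as `com:333` because taking everything after the last dot also picks up an explicit port. A minimal Python sketch of one way around that, assuming each origin is a well-formed URL (the helper names and sample origins are illustrative, not part of the original query):

```python
from urllib.parse import urlsplit


def naive_tld(origin):
    # Mirrors the SQL above: everything after the last "." in the raw string.
    return origin.rsplit(".", 1)[-1]


def tld_of(origin):
    # urlsplit(...).hostname drops any ":port" suffix before we split.
    host = urlsplit(origin).hostname
    return host.rsplit(".", 1)[-1]


# Hypothetical origins: one without a port, one with an explicit port.
for origin in ["https://hochi.news", "https://www.example.com:333"]:
    print(origin, naive_tld(origin), tld_of(origin))
```

With the port stripped, the second origin would be counted under `com` rather than as its own one-off bucket.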
         | [deleted]
         | egman_ekki wrote:
         | Rather amazing seeing almost abandoned blogspot.com there at
         | the top.
         | Also interesting I haven't heard about half of them. Some are
         | nsfw, apparently.
           | mometsi wrote:
           | These aren't sorted by number of visits, but by the number of
           | rows in the list of most visited sites. Essentially which
           | sites have the greatest number of frequently visited
           | subdomains.
       | anonu wrote:
       | > grep http: current.csv | wc -l
       | 54679
       | So over 5% of the top 1m sites still don't use HTTPS.
         | _nhynes wrote:
         | "The 5% rule"
         | alfu wrote:
         | If I am not mistaken, 8310 sites offer http and https:
         |     grep -o -E "://.*?," current.csv | sort | uniq -c | grep -v "1 ://" | wc -l
         |     8310
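The same two questions (which hosts appear only over http, and which appear under both schemes) can be sketched in Python. This assumes the origins have already been read out of the CSV's first column; the sample origins below are made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def schemes_by_host(origins):
    """Map each host (including any port) to the set of schemes it is listed under."""
    seen = defaultdict(set)
    for origin in origins:
        parts = urlsplit(origin)
        seen[parts.netloc].add(parts.scheme)
    return seen


# Hypothetical origins for illustration.
sample_origins = [
    "https://a.example",
    "http://a.example",
    "http://b.example",
    "https://c.example",
]

seen = schemes_by_host(sample_origins)
both = sorted(host for host, schemes in seen.items() if schemes == {"http", "https"})
http_only = sorted(host for host, schemes in seen.items() if schemes == {"http"})

print("listed under both schemes:", both)
print("listed under http only:", http_only)
```

Separating the two groups matters for the "5%" reading: a host that is listed under both schemes does support HTTPS, so only the http-only group represents sites that never showed up over TLS in the list.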
         | philipphutterer wrote:
         | How about websites that are browsed http first and then
         | redirected? People might browse for a domain without the https
         | prefix for convenience (or old links) and the browser defaults
         | to http.
         | Proven wrote:
         | [dead]
         | forgotmypw17 wrote:
         | [flagged]
           | bedatadriven wrote:
           | What accessibility challenges does https pose?
             | forgotmypw17 wrote:
             | The accessibility challenges are all the extra different
             | failure modes HTTPS presents, such as client date offset,
             | older devices, expired certificates, hostname mismatches,
             | and many others.
             | Security is not the only priority in existence. Sometimes
             | people just want to access the information. And when that
             | is the case, HTTPS can be a huge impediment.
               | judge2020 wrote:
               | > Security is not the only priority in existence.
               | Sometimes people just want to access the information. And
               | when that is the case, HTTPS can be a huge impediment.
               | I suppose you'd be fine if your government started
               | replacing the content of Wikipedia with their own
               | propaganda/removing critical information about themselves
               | from traffic?
               | forgotmypw17 wrote:
               | So far they haven't. Meanwhile, millions of people with
               | older devices cannot use them to access wikipedia.
               | slim wrote:
               | you're aware that your government is already doing this,
               | right ? (your argument is invalid)
               | judge2020 wrote:
               | how? Edits on Wikipedia are public, including historical
               | monthly backups available over bt all the way back to
               | 2006, and I can ensure Wikipedia servers are serving it
               | correctly by cross-referencing that and the edits. With
               | http, any ISP (whose operators all tend to favor
               | government cooperation) or switch in the middle could sed
               | content to remove or slightly alter known-critical
               | content.
               | 2Gkashmiri wrote:
               | yeah... i havent gotten a good response why localhost
               | should scream "insecure" or why wikipedia should fail
               | if my rtc clock is wonky.
               | i am not denying "security from snoops while paying with
               | credit cards" and all that banking shit or messaging.
               | heck, email is sent over the clear but we are told to use
               | https to connect to the website (for webmails) using
               | https for "security"...
               | sure sure security is all good and snazzy but i regularly
               | come across websites who have had certs expired and the
               | website makes it appear as if the sky will fall if i
               | click on continue.
               | then we have ISPs who use DPI (my current ISP, reliance
               | jio is doing it from day 1) so whats the point of
               | pretending anyway?
               | jefftk wrote:
               | _> why localhost should scream "insecure"_
               | Localhost, even with HTTP, is a secure context:
               | https://developer.mozilla.org/en-US/docs/Web/Security/Secure...
               | What tool is screaming at you that localhost is insecure?
               | 2Gkashmiri wrote:
               | browsers. padlock icon is crossed out
               | jefftk wrote:
               | I just tested this and don't see that. I compared
               | http://neverssl.com to running "python3 -m http.server"
               | and visiting http://localhost:8000
               | * Chrome: "Not Secure" on neverssl, "i" in a circle on
               | localhost
               | * Firefox: Padlock with a red line through it on
               | neverssl, page icon on localhost
               | * Safari: "Not Secure" on neverssl, no message on
               | localhost
               | scrose wrote:
               | They may be using a self-signed cert so it's
               | https://localhost and the browser is flagging the cert
               | rather than localhost itself.
               | rileymat2 wrote:
               | > email is sent over the clear but we are told to use
               | https to connect to the website (for webmails) using
               | https for "security"
               | This is not true, most is encrypted in transit. It is not
               | end to end, because your email service stores them
               | (perhaps encrypted perhaps not).
               | Edit: https://transparencyreport.google.com/safer-email/overview?h...
               | You can see 84% of outbound is encrypted. This probably
               | is generally a good proxy for the state of email tls
               | transport.
         | zX41ZdbW wrote:
         | I have prepared a nice report: the websites grouped by rank
         | (1..10, 11..100, ...), the percentage using TLS, and an example
         | of a non-TLS website in each group:
         | https://play.clickhouse.com/play?user=play#U0VMRUNUIGZsb29yK...
         | SELECT
         |     floor(log10(rank)) AS r,
         |     count() AS total,
         |     sum(log LIKE '%TLS%') AS tls,
         |     round(tls / total, 2) AS ratio,
         |     anyIf(domain, log NOT LIKE '%TLS%')
         | FROM minicrawl
         | WHERE log LIKE '%Content-Length:%'
         | GROUP BY r
         | ORDER BY r
         | +-r-+---total-+-----tls-+-ratio-+-anyIf(domain, notLike(log, '%TLS%'))-+
         | | 0 |       6 |       6 |     1 |                                      |
         | | 1 |      61 |      58 |  0.95 | baidu.com                            |
         | | 2 |     599 |     562 |  0.94 | google.cn                            |
         | | 3 |    5591 |    5057 |   0.9 | volganet.ru                          |
         | | 4 |   51279 |   44291 |  0.86 | furbo.co                             |
         | | 5 |  476181 |  361910 |  0.76 | funygold.com                         |
         | | 6 | 3797023 | 2927052 |  0.77 | funyo.vip                            |
         | +---+---------+---------+-------+--------------------------------------+
         | 7 rows in set. Elapsed: 0.844 sec. Processed 7.59 million rows, 43.74 GB (8.99 million rows/s., 51.83 GB/s.)
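The bucketing in the query above can be mirrored in plain Python as a rough sketch. The sample rows and the helper name `tls_share_by_bucket` are hypothetical and only illustrate the logic: keep rows whose log contains `Content-Length:`, group by `floor(log10(rank))`, and count logs that mention `TLS`.

```python
import math
from collections import defaultdict

# Hypothetical sample rows: (rank, domain, curl log text). Not real data.
rows = [
    (3, "example-a.test", "* SSL connection using TLSv1.3\n< Content-Length: 10"),
    (42, "example-b.test", "< HTTP/1.1 200 OK\n< Content-Length: 20"),
    (57, "example-c.test", "* SSL connection using TLSv1.2\n< Content-Length: 30"),
    (950, "example-d.test", "* SSL connection using TLSv1.3"),  # no Content-Length
]


def tls_share_by_bucket(rows):
    """Return {bucket: (total, tls, ratio)} keyed by floor(log10(rank))."""
    totals = defaultdict(int)
    tls = defaultdict(int)
    for rank, _domain, log in rows:
        # Same filter as WHERE log LIKE '%Content-Length:%'
        if "Content-Length:" not in log:
            continue
        bucket = math.floor(math.log10(rank))
        totals[bucket] += 1
        # Same test as log LIKE '%TLS%' (case-sensitive substring match)
        if "TLS" in log:
            tls[bucket] += 1
    return {b: (totals[b], tls[b], round(tls[b] / totals[b], 2)) for b in sorted(totals)}


result = tls_share_by_bucket(rows)
for bucket, (total, with_tls, ratio) in result.items():
    print(bucket, total, with_tls, ratio)
```

Bucket 0 covers ranks 1 to 9, bucket 1 covers ranks 10 to 99, and so on, which is why the first row of the report holds fewer than ten sites.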
           | kedmi wrote:
           | Excuse my ignorance, what CLI tool did you use to execute
           | this query? Thanks!
             | [deleted]
             | zX41ZdbW wrote:
             | clickhouse-client
             | Download it as:
             |     curl https://clickhouse.com/ | sh
             | Connect to the demo service:
             |     clickhouse-client --host play.clickhouse.com --user play --secure
       | zX41ZdbW wrote:
       | If you are interested in researching the technologies used on the
       | Internet, I recommend playing with the "Minicrawl" dataset.
       | It contains data about ~7 million top websites, and for every
       | website, it also contains:
       | - the full content of the main page;
       | - the verbose output of curl, containing various timing info; the
       |   HTTP headers, protocol info...
       | Using this dataset, you can build a service similar to
       | https://builtwith.com/ for your research.
       | Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw...
       | (129 GB compressed, ~1 TB uncompressed).
       | Description:
       | https://github.com/ClickHouse/ClickHouse/issues/18842
       | You can easily try it with clickhouse-local without downloading:
       | $ curl https://clickhouse.com/ | sh
       | $ ./clickhouse local
       | ClickHouse local version (official build).
       | milovidov-desktop :) DESCRIBE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
       | DESCRIBE TABLE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
       | Query id: 6746232f-7f5f-4c5a-ac68-d749d949a2dc
       | +-name----+-type---+-default_type-+-default_expression-+-comment-+-codec_expression-+-ttl_expression-+
       | | rank    | UInt32 |              |                    |         |                  |                |
       | | domain  | String |              |                    |         |                  |                |
       | | log     | String |              |                    |         |                  |                |
       | | content | String |              |                    |         |                  |                |
       | +---------+--------+--------------+--------------------+---------+------------------+----------------+
       | 4 rows in set. Elapsed: 1.390 sec.
       | milovidov-desktop :) SELECT rank, domain, log, substringUTF8(content, 1, 100) FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst') LIMIT 1 FORMAT Vertical
       | SELECT
       |     rank,
       |     domain,
       |     log,
       |     substringUTF8(content, 1, 100)
       | FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
       | LIMIT 1
       | FORMAT Vertical
       | Query id: 8dba6976-0bf6-4ce8-a0f1-aa579c828175
       | Row 1:
       | ------
       | rank:                           1907977
       | domain:                         0--0.uk
       | log:                            *   Trying
       | * Connected to 0--0.uk ( port 80 (#0)
       | > GET / HTTP/1.1
       | > Host: 0--0.uk
       | > Accept: */*
       | > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
       | >
       | * Mark bundle as not supporting multiuse
       | < HTTP/1.1 302 Moved Temporarily
       | < Server: nginx
       | < Date: Sun, 29 May 2022 06:27:14 GMT
       | < Content-Type: text/html
       | < Content-Length: 154
       | < Connection: keep-alive
       | < Location: https://0--0.uk/
       | <
       | * Ignoring the response-body
       | { [154 bytes data]
       | * Connection #0 to host 0--0.uk left intact
       | * Issue another request to this URL: 'https://0--0.uk/'
       | *   Trying
       | * Connected to 0--0.uk ( port 443 (#1)
       | * ALPN, offering h2
       | * ALPN, offering http/1.1
       | *  CAfile: /etc/ssl/certs/ca-certificates.crt
       | *  CApath: /etc/ssl/certs
       | * TLSv1.0 (OUT), TLS header, Certificate Status (22):
       | } [5 bytes data]
       | * TLSv1.3 (OUT), TLS handshake, Client hello (1):
       | } [512 bytes data]
       | * TLSv1.2 (IN), TLS header, Certificate Status (22):
       | { [5 bytes data]
       | * TLSv1.3 (IN), TLS handshake, Server hello (2):
       | { [108 bytes data]
       | * TLSv1.2 (IN), TLS header, Certificate Status (22):
       | { [5 bytes data]
       | * TLSv1.2 (IN), TLS handshake, Certificate (11):
       | { [4150 bytes data]
       | * TLSv1.2 (IN), TLS header, Certificate Status (22):
       | { [5 bytes data]
       | * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
       | { [333 bytes data]
       | * TLSv1.2 (IN), TLS header, Certificate Status (22):
       | { [5 bytes data]
       | * TLSv1.2 (IN), TLS handshake, Server finished (14):
       | { [4 bytes data]
       | * TLSv1.2 (OUT), TLS header, Certificate Status (22):
       | } [5 bytes data]
       | * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
       | } [70 bytes data]
       | * TLSv1.2 (OUT), TLS header, Finished (20):
       | } [5 bytes data]
       | * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
       | } [1 bytes data]
       | * TLSv1.2 (OUT), TLS header, Certificate Status (22):
       | } [5 bytes data]
       | * TLSv1.2 (OUT), TLS handshake, Finished (20):
       | } [16 bytes data]
       | * TLSv1.2 (IN), TLS header, Finished (20):
       | { [5 bytes data]
       | * TLSv1.2 (IN), TLS header, Certificate Status (22):
       | { [5 bytes data]
       | * TLSv1.2 (IN), TLS handshake, Finished (20):
       | { [16 bytes data]
       | * SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
       | * ALPN, server accepted to use http/1.1
       | * Server certificate:
       | *  subject: CN=mail.htservices.co.uk
       | *  start date: May 15 18:36:37 2022 GMT
       | *  expire date: Aug 13 18:36:36 2022 GMT
       | *  subjectAltName: host "0--0.uk" matched cert's "0--0.uk"
       | *  issuer: C=US; O=Let's Encrypt; CN=R3
       | *  SSL certificate verify ok.
       | * TLSv1.2 (OUT), TLS header, Supplemental data (23):
       | } [5 bytes data]
       | > GET / HTTP/1.1
       | > Host: 0--0.uk
       | > Accept: */*
       | > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
       | >
       | * TLSv1.2 (IN), TLS header, Supplemental data (23):
       | { [5 bytes data]
       | * Mark bundle as not supporting multiuse
       | < HTTP/1.1 200 OK
       | < Server: nginx
       | < Date: Sun, 29 May 2022 06:27:15 GMT
       | < Content-Type: text/html;charset=utf-8
       | < Transfer-Encoding: chunked
       | < Connection: keep-alive
       | < X-Frame-Options: SAMEORIGIN
       | < Expires: -1
       | < Cache-Control: no-store, no-cache, must-revalidate, max-age=0
       | < Pragma: no-cache
       | < Content-Language: en-US
       | < Set-Cookie: ZM_TEST=true;Secure
       | < Set-Cookie: ZM_LOGIN_CSRF=b2dda010-d795-4759-a9c3-80349f3b46ed;Secure;HttpOnly
       | < Vary: User-Agent
       | < X-UA-Compatible: IE=edge
       | < Vary: Accept-Encoding, User-Agent
       | <
       | { [13068 bytes data]
       | * Connection #1 to host 0--0.uk left intact
       | substringUTF8(content, 1, 100): <!DOCTYPE html>
       | <!-- set this class so CSS definitions that now use REM size, would work relative to
       | 1 row in set. Elapsed: 0.539 sec.
       | Processed 4.60 thousand rows, 273.86 MB (8.54 thousand rows/s., 508.28 MB/s.)
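As a starting point for the builtwith-style idea mentioned above, here is a minimal sketch of pulling response headers out of a curl verbose log like the one stored in the `log` column. The function name `response_headers` is hypothetical, and the sample text is a shortened excerpt of the log shown above; it assumes response header lines start with `< `, as they do in curl's verbose output.

```python
def response_headers(verbose_log):
    """Collect headers of the last HTTP response found in a curl -v log.

    Header names are lower-cased. If a header repeats (e.g. Vary),
    the last value wins in this simplified version.
    """
    headers = {}
    for line in verbose_log.splitlines():
        line = line.strip()
        if line.startswith("< HTTP/"):
            # A new response begins (for example after a redirect): start over.
            headers = {}
        elif line.startswith("< ") and ":" in line:
            name, _, value = line[2:].partition(":")
            headers[name.strip().lower()] = value.strip()
    return headers


sample_log = """\
< HTTP/1.1 302 Moved Temporarily
< Server: nginx
< Location: https://0--0.uk/
<
< HTTP/1.1 200 OK
< Server: nginx
< Content-Type: text/html;charset=utf-8
< X-Frame-Options: SAMEORIGIN
"""

headers = response_headers(sample_log)
print(headers["server"])
```

Only the final response is kept, so redirect-only headers such as `Location` do not leak into the result. A real classifier would likely also look at cookies and the page content.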
         | simonw wrote:
         | How does that work? How can clickhouse-local run queries
         | against a 129 GB file hosted on S3 without downloading the
         | whole thing?
         | Is it using HTTP range header tricks, like DuckDB does for
         | querying Parquet files?
         | https://duckdb.org/docs/extensions/httpfs.html
         | If so, what's the data.native.zst file format? Is it similar to
         | Parquet?
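For readers unfamiliar with the range-header approach the question refers to, here is a minimal offline sketch. It simulates a server with an in-memory byte string instead of contacting S3, and the helper names `range_header` and `serve_range` are hypothetical. It shows only the general HTTP mechanism and makes no claim about what clickhouse-local actually does.

```python
def range_header(offset, length):
    """Build an HTTP Range header asking for `length` bytes from `offset`."""
    # The end position in a byte range is inclusive.
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def serve_range(blob, headers):
    """Simulate a server answering a single 'bytes=start-end' range request."""
    spec = headers["Range"].split("=", 1)[1]
    start, end = (int(part) for part in spec.split("-"))
    return blob[start:end + 1]


# Stand-in for a large remote file; 1 KiB of repeating byte values.
remote_file = bytes(range(256)) * 4

# Fetch 16 bytes starting at offset 512 without touching the rest.
request = range_header(offset=512, length=16)
chunk = serve_range(remote_file, request)
print(request["Range"], len(chunk))
```

A columnar reader that knows where each block starts can issue requests like this for just the blocks it needs.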
           | zX41ZdbW wrote:
           | Yes, the native format is very similar to Parquet.
           | It works for Parquet as well:
           |     SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits.parquet') LIMIT 1
           | And for CSV or TSV:
           |     SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/tsv/github_events_v3.tsv.xz') LIMIT 1
           | And for ndJSON:
           |     SELECT repo_name, created_at, event_type FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/github_events_*.gz', JSONLines, 'repo_name String, actor_login String, created_at String, event_type String') WHERE actor_login = 'simonw' LIMIT 10
           | Note: the query above is kind of slow. Here is the query from
           | preloaded data - your activity in GitHub issues:
           | https://play.clickhouse.com/play?user=play#U0VMRUNUIGNyZWF0Z...
             | simonw wrote:
             | Another question about that demo.
             | https://clickhouse.com/docs/en/getting-started/example-datas...
             | says "Dataset contains all events on GitHub from 2011 to
             | Dec 6 2020" - but I'm seeing results in there from a
             | couple of hours ago.
             | Do you know if that's continually updated and, if so, is
             | that documented anywhere?
           | cldellow wrote:
           | > How does that work?
           | Disclaimer: I'm not a ClickHouse user, but I have a bit of
           | experience with Parquet.
           | It looks like the native format is (very briefly) described
           | here:
           | https://clickhouse.com/docs/en/interfaces/formats/#native
           | It looks similar at a high level to Parquet: binary, columnar
           | and has metadata that permits requesting a subset of the
           | data.
           | Looking at:
           | > Processed 4.60 thousand rows, 273.86 MB
           | I'd guess it's chunking the rows into groups of ~4,000.
           | The OP must have a nice connection if that completed in 0.5
           | seconds! (Or perhaps the 273.86MB is the uncompressed size,
           | before zstd compression, or perhaps other parts of the
           | session caused that chunk to get cached and were elided
           | from what was pasted into HN.)
           | EDIT: I was curious, so I ran the tool and watched bandwidth
           | on iftop. It uses about 50MB each time I run the query. From
           | this, I conclude: it does not cache things, the 273.86MB is
           | the uncompressed size, and OP has a much better internet
           | connection than me. :)
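The numbers in this comment can be checked with some quick arithmetic. This is a back-of-the-envelope sketch: it combines the OP's reported 0.539 sec with the roughly 50MB transfer measured on a different machine, which is an assumption, and it treats MB as 10^6 bytes.

```python
processed_mb = 273.86  # "Processed ... 273.86 MB" from the pasted session
elapsed_s = 0.539      # "Elapsed: 0.539 sec." from the pasted session
downloaded_mb = 50     # approximate per-run transfer seen in iftop

# Rate at which uncompressed data was processed, in MB/s.
processing_rate = processed_mb / elapsed_s

# Implied ratio between uncompressed and transferred bytes.
compression_ratio = processed_mb / downloaded_mb

# Network speed needed to move ~50MB in the OP's elapsed time, in Mbit/s.
network_rate_mbit = downloaded_mb / elapsed_s * 8

print(f"{processing_rate:.0f} MB/s processed")
print(f"{compression_ratio:.1f}x uncompressed vs transferred")
print(f"{network_rate_mbit:.0f} Mbit/s on the wire")
```

Under these assumptions the transfer would need a link in the high hundreds of Mbit/s, consistent with the guess that the OP has a fast connection.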
       | deterrence wrote:
       | [flagged]
       | kristianp wrote:
       | This raises the question: how much in the way of user telemetry
       | does Chrome send back to Google?
         | tgsovlerkhgsel wrote:
         | By default, a lot. However, they also are (or at least used to
         | be; it seems to be quite outdated now) really good at
         | documenting their telemetry publicly:
         | https://www.google.com/chrome/privacy/whitepaper.html
         | (I haven't checked whether the documentation is
         | complete/accurate, of course.)
       | est wrote:
       | Looks like not a single Chinese site made it into the top 1k. I
       | guess that's reasonable: all Google services are blocked there,
       | so CrUX can't gather any data.
         | themoonisachees wrote:
         | Do Chinese people use Chrome? One would think the download page
         | is blocked as well, so the demographic for Chrome users should
         | be way smaller.
         | Also to consider: China uses in-app browsing a lot, with
         | interactive experiences very similar to websites built right in
         | the bilibili/ali/wechat apps.
           | moffkalast wrote:
           | > in-app browsing
           | But that's also just Chromium, isn't it, much like a PWA?
           | Unless they made something of their own.
         | kristianp wrote:
         | > The CrUX dataset is based on data collected from Google
         | Chrome and is thus biased away from countries with limited
         | Chrome usage (e.g., China).
       | Mortiffer wrote:
       | Thanks for pointing this out. Can definitely put this dataset to
       | use
       (page generated 2022-12-31 23:00 UTC)