[HN Gopher] Cached Chrome Top Million Websites
___________________________________________________________________
Cached Chrome Top Million Websites
Author : edent
Score  : 199 points
Date   : 2022-12-31 13:49 UTC (9 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| cronaday wrote:
| This is _very_ ethically dubious. Google is collecting _raw URLs_
| from Chrome users who turned on history syncing across their
| _own_ devices, then reusing the data and funneling it through
| Stanford. No way Chrome users understand or approve of this.
|
| The paper tries to justify its ethics with Google's privacy
| policy, which is laughable. There are so many papers about how
| meaningless privacy policies are. If Apple or Mozilla did
| anything remotely like this, Hacker News would riot.
|
| Edit: I don't want to be a conspiracy theorist, but this post
| suddenly got a bunch of downvotes at the same time as defensive
| comments from a current Googler and recent ex-Googler. Then one
| of my responses below to a Chrome developer got flagged for no
| obvious reason. Hmm.
| jeffbee wrote:
| Maybe your posts would get better votes if you made any effort
| at all to back up your claim of unethical behavior. You
| provided nothing.
| dang wrote:
| Can you please make your substantive points without breaking
| the site guidelines? You did that here with your last
| paragraph, and worse at
| https://news.ycombinator.com/item?id=34197958.
|
| If you wouldn't mind reviewing
| https://news.ycombinator.com/newsguidelines.html and taking the
| intended spirit of the site more to heart, we'd be grateful.
| soneca wrote:
| > _"If Apple or Mozilla did anything remotely like this,
| Hacker News would riot."_
|
| My perception is that, collectively, HN hates and criticizes
| Google much more than Apple and Mozilla. I mean, _much_ more.
| The accusation in that last sentence sounded bizarre to me.
| dmitriid wrote:
| > I mean, much more.
|
| Because Google is a web advertisement company that dominates
| many large spheres: search, browsers (including standards
| committees), email, mobile (Android has 77% market share) etc.
| All are things that we've come to view as crucial to modern
| life.
|
| And time and again they've shown that they only view that
| dominance as a funnel for ad revenue, data collection, and
| whatever benefits them at this particular moment.
| hericium wrote:
| > My perception is that, collectively, HN hates and
| criticizes Google much more than Apple and Mozilla.
|
| Not the entirety of HN. As I have more than once delicately
| pointed out[1], Mozilla is Google's bitch.
|
| [1] https://news.ycombinator.com/item?id=30732539
| cronaday wrote:
| Just suggesting that prior browser and OS privacy blowups
| involving those companies have been over less worrisome
| things, not that those companies are subject to more or less
| criticism. Looking back on outraged discussions of Mozilla's
| telemetry is kinda quaint in comparison.
| marcosdumay wrote:
| > HN hates and criticizes Google much more than Apple and
| Mozilla.
|
| That is mostly because Apple almost never does something like
| this, and Mozilla literally never does.
| chiefalchemist wrote:
| re: Edit.
|
| I've noticed similar behavior in HN voting. Down vote spikes
| but few if any comments in line with the voting. Not sure if
| it's bots, human-based click farms, or people who just don't
| understand that disagreement is not grounds for down voting.
|
| Perhaps a bit of all three?
| pvg wrote:
| _just don't understand that disagreement is not grounds for
| down voting._
|
| It is perfectly fine on HN and always has been.
| chiefalchemist wrote:
| Nah. I see it differently.
|
| The Guidelines are clear about why we're here and
| expectations. The emphasis is on discussion, learning and
| objectivity. Yes, disagreement is mentioned (i.e., allowed)
| but even that needs to be constructive, yes?
|
| A down vote - with no discussion - well, frankly in the
| context of the Guidelines, is:
|
| 1) Not in the spirit of the guidelines;
| 2) Perhaps redundant to 1, but lazy;
| 3) At best, small-minded and childish;
|
| If people want to pout about reading something they don't
| like, this isn't the place for them.
|
| Yeah, I see who you are. And I'm ok w/ pushing back. That's
| what makes HN what it is ;)
| pvg wrote:
| You see it differently but it's just not accurate, it's
| not in the guidelines super-explicitly so easy to miss.
|
| https://news.ycombinator.com/item?id=22910444
|
| https://news.ycombinator.com/item?id=16131314 and there
| are many many others
|
| _Yeah, I see who you are._
|
| I'm literally a random scold on the internet, I just
| happen to be right about this.
| chiefalchemist wrote:
| I disagree.
|
| I'm not going to explain why.
|
| How does that feel? What value does it add? (Sweet FA,
| eh.)
|
| You're right, you might be right. But that doesn't make it
| right. I get zero satisfaction from context-less down
| votes. I don't do them. I ignore them when I get them
| (i.e., they have zero influence on my HN behavior). If
| I'm changing my mind over some lazy a-hole's click, I'm
| losing. Big time.
|
| I can't imagine why anyone feels any differently. The
| reality is, they are pointless noise. There's not enough
| context to drive anything actionable for anybody.
|
| But while I have your attention: how about a feature
| request: Karma points that consider the discussion below
| a top-parent comment.
| jefftk wrote:
| Google has written publicly about how this system works:
| https://developer.chrome.com/docs/crux/methodology/
| https://www.google.com/chrome/privacy/whitepaper.html#usages...
|
| This includes only listing publicly discoverable pages, only
| including data from users who have turned on "Make searches and
| browsing better (Sends URLs of pages you visit to Google)", and
| only including pages that are visited by a minimum number of
| users.
| blast wrote:
| "Users who have turned on" implies that they opted in. Is
| this behavior opt-in or opt-out?
| cronaday wrote:
| > Google has written publicly about how this system works
|
| If this is news to Hacker News, there is no way that regular
| Chrome users are aware of it. Saying something in a privacy
| policy or on a developer website just can't be enough for
| _analyzing a person's URL data_.
|
| > This includes only listing publicly discoverable pages,
| only including data from users who have turned on "Make
| searches and browsing better (Sends URLs of pages you visit
| to Google)", and only including pages that are visited by a
| minimum number of users.
|
| Since when does aggregating this type of data make it fair
| game? This is _analyzing a person's URL data_ from _their
| own devices_. There has always been a big bright red line for
| browsers touching a user's browsing history. Google crossed
| that line.
|
| Also, I just checked on a fresh Chrome install. The "Make
| searches and browsing better" option is _enabled by default_
| and _buried in Chrome settings_. How is that acceptable
| consent for _analyzing a person's URL data_?
| jasonlotito wrote:
| > If this is news to Hacker News
|
| This isn't news to Hacker News. This might be new to you,
| but that does not mean it's some new information that's
| been hidden.
|
| > there is no way that regular Chrome users are aware of
| it.
|
| There is no way that a regular Chrome user is aware of all
| the features Chrome offers, let alone all the details. Not
| because Google is hiding this from them, but because there
| is just a LOT Chrome does.
|
| I don't even use Chrome, and none of this is new.
|
| It's fine that this is all new to you, but it's not new to
| you because anyone has kept this secret. At this point,
| you've chosen to remain ignorant.
| dang wrote:
| Please don't cross into personal attack.
|
| https://news.ycombinator.com/newsguidelines.html
|
| Edit: you've unfortunately been breaking the site
| guidelines a ton lately. Seriously not cool, and well
| past the line at which we start banning the account.
|
| I don't want to ban you, but if you keep this up, we'll
| have to. If you'd please review
| https://news.ycombinator.com/newsguidelines.html and
| stick to the rules, we'd appreciate it.
| marcosdumay wrote:
| Is it opt-in or opt-out? And if it's opt-in, does it come
| with infinite nagging until you opt in?
|
| I know logging in and syncing your data are "opt-in"
| options that come with infinite nagging (so, actually
| required options). The information that there are
| different levels of syncing is news to me.
| lucb1e wrote:
| > This might be new to you, but that does not mean it's
| some new information that's been hidden.
|
| I downloaded Chrome on a new laptop an hour ago (at my
| employer's request, I'd use Firefox myself) and was
| certainly not aware of this.
|
| This information was not on any screen at any point.
| There was a default-checked checkmark for some general
| statistics sharing which I only noticed after clicking
| download (because it was small and _below_ the download
| button), but I didn't click through to the privacy policy
| to learn more.
|
| Guess I should have read the privacy policy. I'm trying
| to find what it said now, but I can't see it anymore
| because different terms apply to Linux downloads and
| there's no button to download the Windows version.
| Basically, visiting the same page in Firefox on Linux
| (instead of Edge on Windows, which I don't have access to
| atm) gives me different contents and no checkmark.
| cronaday wrote:
| If a user can somehow someway somewhere learn about what
| a company is doing, then it's OK? Really?
| jefftk wrote:
| I could see two main arguments for this not being okay:
|
| * Chrome is secretly collecting data.
|
| * Chrome is doing something users would object to if they
| knew and understood it.
|
| I don't think either of these is the case here: they are
| sharing data about what sites people generally visit in
| an aggregated form that doesn't reveal any individual's
| browsing (what's to object to?) and talk about it in the
| place people would go to learn about what data they
| collect.
| jstummbillig wrote:
| What is it you are proposing? If it were every
| institution's obligation to make sure that all its
| instrumental functions were obvious to every potential
| user and keep any user from engaging with the institution
| under any false assumptions, nothing in our society would
| work.
|
| That is not to say that scrutiny is not important. You
| should certainly be allowed to point at any individual
| function and demand more upfront transparency, over what
| is currently being offered. But be aware of the massive
| additional cognitive load you create, for everyone, when
| you are not just demanding information _availability_,
| but that this information is being _delivered_ to anyone
| it _might_ concern. Any individual's preference to not
| care about a function would have to take the backseat to
| the opinion that they have to at least somewhat consider
| the function before engaging.
|
| Considering how expensive this process is, "Google Chrome
| CrUX" would probably be pretty far down on the list for
| me personally, as "crucial things everyone should
| definitely know about before possibly engaging" goes, but
| to each their own.
| dmitriid wrote:
| > It's fine that this is all new to you, but it's not new
| to you because anyone has kept this secret. At this
| point, you've chosen to remain ignorant.
|
| Ah yes. Blame the user for not understanding yet another
| piece in Google's gargantuan data collecting machinery.
|
| Recent court cases revealed that Google's own employees
| don't know what's tracked and how to turn it off. But I'm
| sure it's only ignorance that keeps users uninformed.
| vachina wrote:
| If crux is what Google is willing to make public, it makes
| one wonder what else is collected and stored for their own
| use (i.e. their moat).
|
| I'm not using Chrome on all my devices.
| addingnumbers wrote:
| > only including data from users who have turned on "Make
| searches and browsing better (Sends URLs of pages you visit
| to Google)"
|
| One big problem there is that we don't know for what
| percentage of users "turned on" is a euphemism for "didn't
| notice."
| kevin_thibedeau wrote:
| Does this setting apply to Android assistant?
| tristor wrote:
| I very much agree with you. This type of data collection MUST
| be opt-in to be ethical, and in Chrome it's enabled by default
| and buried. The VAST majority of users have no idea this is
| even happening. It is grossly unethical and it is obvious that
| it is so, but unsurprisingly folks at Google are happy to do
| things like this given their salaries.
| [deleted]
| pvg wrote:
| _Edit: I don't want to be a conspiracy theorist_
|
| That's not merely a good idea but also
|
| https://news.ycombinator.com/newsguidelines.html
|
| _Please don't post insinuations about astroturfing, shilling,
| bots, brigading, foreign agents and the like. It degrades
| discussion and is usually mistaken. If you're worried about
| abuse, email hn@ycombinator.com and we'll look at the data._
|
| There's also just not writing in the high-dudgeon flamewar
| style which helps with the downvotes.
| dadrian wrote:
| 1. They're not funneling it through Stanford. They're posting
| it publicly, but on BigQuery
| https://developer.chrome.com/docs/crux/
|
| 2. Chrome prompts you to opt-out of metrics collection on
| install.
|
| None of the reasons you've listed for this being ethically
| dubious are true.
| dmitriid wrote:
| This is ethically dubious: "Chrome prompts you to opt-out of
| metrics collection on install.".
|
| Because the _default_ should be "opted out by default, let
| the user opt-in if they so wish"
| cronaday wrote:
| > 1. They
|
| You appear to work on Chrome at Google and have cofounded a
| business with the Stanford person. That seems relevant
| context.
|
| > 1. They're not funneling it through Stanford. They're
| posting it publicly, but on BigQuery
| https://developer.chrome.com/docs/crux/
|
| The paper says there was special data access.
|
| > 2. Chrome prompts you to opt-out of metrics collection on
| install.
|
| An opt out for reusing personal URL data is wholly
| unacceptable.
| dang wrote:
| Please don't cross into personal attack.
|
| https://news.ycombinator.com/newsguidelines.html
| [deleted]
| wirthjason wrote:
| Curious where PornHub and other sites rank. I always hear that
| porn sites are in the top X of all traffic but people don't
| talk about it due to its nature.
|
| I'm always amazed that they have a data science team. It's not
| something many would expect from the porn industry. I certainly
| didn't expect it.
|
| https://www.pornhub.com/insights/2022-year-in-review
| layer8 wrote:
| A quick grep shows that there are almost 2.5K domains at 1M
| rank with "porn" in their name.
|
| The data science teams likely provide a considerable ROI in
| that industry.
| system2 wrote:
| The majority of porn sites do not have the word porn in them
| though. I wish there was a categorization of these.
| Giorgi wrote:
| Does it though? Pretty sure adult results are being filtered
| off by a Google tool named "SafeSearch". It removes anything
| adult from SERP and it is on by default.
| yorwba wrote:
| This appears to have some unintuitive consequences. When I
| searched for "porn" in a cookie-less session just now,
| there were still porn results, but no well-known sites (at
| least I didn't recognize the names). Searching for
| literally "pornhub", the first result is "porhub.com"
| without the "n".
|
| Seems like the "SafeSearch" filter is based on a list of
| "adult domains" instead of the indexed content at the URL.
| mtmail wrote:
| "New Year's Eve kicked holiday ass with a massive -40% drop in
| worldwide traffic from 6pm to Midnight on December 31st." It's
| Dec/31, 1pm in New York right now.
| E39M5S62 wrote:
| If you ignore the content, large-scale adult sites are just
| like any other high traffic (bandwidth, RPS) site out there. A
| lot of planning goes into where their content delivery PoPs
| should be placed.
| mtmail wrote:
| "Pornhub's statisticians make use of Google Analytics to figure
| out the most likely age and gender of visitors. This data is
| anonymized from billions of visits to Pornhub annually, giving
| us one of the richest and most diverse data sets to analyze
| traffic from all around the world."
| oars wrote:
| "Pornhub's statisticians make use of Google Analytics to figure
| out the most likely age and gender of visitors. This data is
| anonymized from billions of visits to Pornhub annually, giving
| us one of the richest and most diverse data sets to analyze
| traffic from all around the world."
| mg wrote:
| Top level domains by popularity:
|
|     grep -oP '\.[a-z]+(?=,)' current.csv | sort | uniq -c | sort -n
|
|     ...
|      15840 .pl
|      17914 .it
|      20182 .de
|      21690 .in
|      27812 .ru
|      29194 .jp
|      30359 .org
|      35741 .br
|      36675 .net
|     406052 .com
|
| .com domains by popularity:
|
|     grep -oP '[a-z0-9-]+\.com(?=,)' current.csv | sort | uniq -c | sort -n
|
|     ...
|      365 tistory.com
|      370 fc2.com
|      408 skipthegames.com
|      489 online.com
|      515 wordpress.com
|      707 uptodown.com
|      880 schoology.com
|     2570 fandom.com
|     2651 instructure.com
|     3244 blogspot.com
| voytec wrote:
| sort -r will reverse the order from most to least popular.
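The two grep pipelines above can also be sketched in Python. This is only a minimal sketch, assuming current.csv has an `origin,rank` header followed by one origin per row (those column names appear in the DuckDB output in this thread); the function name `count_origins` and the sample rows are hypothetical. Like the second grep, the second counter counts how many listed origins end in each .com domain, not visits.

```python
import csv
import io
import re
from collections import Counter

# Same patterns as the grep -oP calls, anchored at the end of the origin
# field instead of using a lookahead for the comma that follows it.
TLD_RE = re.compile(r"\.([a-z]+)$")
COM_RE = re.compile(r"([a-z0-9-]+\.com)$")

# Hypothetical rows, invented for illustration only.
SAMPLE = """origin,rank
https://a.example.com,1000
https://b.example.com,5000
https://www.example.org,1000
https://example.net:8443,10000
"""


def count_origins(csv_text):
    """Return (tld_counts, com_counts) for CSV text with origin,rank rows."""
    tlds, coms = Counter(), Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if not row or row[0] == "origin":  # skip blank lines and the header
            continue
        origin = row[0]
        tld = TLD_RE.search(origin)
        if tld:  # origins with an explicit port do not match, as with grep
            tlds[tld.group(1)] += 1
        com = COM_RE.search(origin)
        if com:
            coms[com.group(1)] += 1
    return tlds, coms


if __name__ == "__main__":
    tlds, coms = count_origins(SAMPLE)
    for name, n in tlds.most_common():  # most common first, like sort -rn
        print(n, "." + name)
    for name, n in coms.most_common():
        print(n, name)
```

To try it on the real list, the text of current.csv could be passed in place of `SAMPLE`; whether that file really carries a header row is an assumption here.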
| layer8 wrote:
| .
| codetrotter wrote:
| uniq -c is in the pipeline because it counts the number of
| uniques
| layer8 wrote:
| Ah, right, missed that.
| slim wrote:
| it's amazing that fandom is number 3 and wikipedia is not even
| there
| csande17 wrote:
| Wikipedia uses a .org domain, so it won't show up on "most
| popular .com domains" lists. (And I think the parent comment
| is searching for domains with lots of subdomains, which is
| why providers like Blogspot and Fandom show up.)
| [deleted]
| azeemba wrote:
| It might be worth updating this comment and explaining your
| second query.
|
| People seem to think it is somehow measuring visits to those
| origins. But it's measuring how many unique subdomains are
| listed for those domains
| kristianp wrote:
| Loading the data into the duckdb cli [0] and doing the first
| query:
|
|     create table current as select * from '202211.csv';
|     select * from current;
|
|     +------------------------------------+---------+
|     |               origin               |  rank   |
|     |              varchar               |  int32  |
|     +------------------------------------+---------+
|     | https://hochi.news                 |    1000 |
|     | https://www.xnxx.xxx               |    1000 |
|     | https://www.wordreference.com      |    1000 |
|     | https://finance.naver.com          |    1000 |
|     | https://www.macys.com              |    1000 |
|     | https://www.xv-videos1.com         |    1000 |
|     | https://fr.xhamster.com            |    1000 |
|     | https://poki.com                   |    1000 |
|     | https://salonboard.com             |    1000 |
|     | https://clgt.one                   |    1000 |
|
|     select tld, count(*) from
|       (select reverse(substr(reverse(origin), 1,
|          position('.' in reverse(origin))-1)) tld from current)
|     group by tld order by count(*) desc;
|
|     +-----------+--------------+
|     |    tld    | count_star() |
|     |  varchar  |    int64     |
|     +-----------+--------------+
|     | com       |       406052 |
|     | net       |        36675 |
|     | br        |        35741 |
|     | org       |        30359 |
|     | jp        |        29194 |
|     | ru        |        27812 |
|     | in        |        21690 |
|     | de        |        20182 |
|     | it        |        17914 |
|     | pl        |        15840 |
|     |     *     |      *       |
|     |     *     |      *       |
|     |     *     |      *       |
|     | za:5002   |            1 |
|     | lk:8090   |            1 |
|     | org:1445  |            1 |
|     | co:14443  |            1 |
|     | ar:3016   |            1 |
|     | net:8001  |            1 |
|     | care:9624 |            1 |
|     | au:8443   |            1 |
|     | com:333   |            1 |
|     | edu:9016  |            1 |
|     +-----------+--------------+
|     | 2076 rows (20 shown)     |
|     +--------------------------+
|
| [0] https://duckdb.org/docs/installation/
| [deleted]
| egman_ekki wrote:
| Rather amazing seeing almost abandoned blogspot.com there at
| the top.
|
| Also interesting I haven't heard about half of them. Some are
| nsfw, apparently.
| mometsi wrote:
| These aren't sorted by number of visits, but by the number of
| rows in the list of most visited sites. Essentially which
| sites have the greatest number of frequently visited
| subdomains.
| anonu wrote:
| > grep http: current.csv | wc -l
|
| 54679
|
| So over 5% of the top 1m sites still don't use HTTPS.
| _nhynes wrote:
| "The 5% rule"
| alfu wrote:
| If I am not mistaken, 8310 sites offer http and https:
|
|     grep -o -E "://.*?," current.csv | sort | uniq -c | grep -v "1 ://" | wc -l
|     8310
| philipphutterer wrote:
| How about websites that are browsed http first and then
| redirected? People might browse for a domain without the https
| prefix for convenience (or old links) and the browser defaults
| to http.
| Proven wrote:
| [dead]
| forgotmypw17 wrote:
| [flagged]
| bedatadriven wrote:
| What accessibility challenges does https pose?
| forgotmypw17 wrote:
| The accessibility challenges are all the extra different
| failure modes HTTPS presents, such as client date offset,
| older devices, expired certificates, hostname mismatches,
| and many others.
|
| Security is not the only priority in existence. Sometimes
| people just want to access the information. And when that
| is the case, HTTPS can be a huge impediment.
| judge2020 wrote:
| > Security is not the only priority in existence.
| Sometimes people just want to access the information. And
| when that is the case, HTTPS can be a huge impediment.
|
| I suppose you'd be fine if your government started
| replacing the content of Wikipedia with their own
| propaganda/removing critical information about themselves
| from traffic?
| forgotmypw17 wrote:
| So far they haven't. Meanwhile, millions of people with
| older devices cannot use them to access wikipedia.
| slim wrote:
| you're aware that your government is already doing this,
| right? (your argument is invalid)
| judge2020 wrote:
| how? Edits on Wikipedia are public, including historical
| monthly backups available over bt all the way back to
| 2006, and I can ensure Wikipedia servers are serving it
| correctly by cross-referencing that and the edits. With
| http, any ISP (whose operators all tend to favor
| government cooperation) or switch in the middle could sed
| content to remove or slightly alter known-critical
| content.
| 2Gkashmiri wrote:
| yeah... i haven't gotten a good response why localhost
| should scream "insecure" or why wikipedia should fail
| if my rtc clock is wonky.
|
| i am not denying "security from snoops while paying with
| credit cards" and all that banking shit or messaging.
| heck, email is sent over the clear but we are told to use
| https to connect to the website (for webmails) using
| https for "security"...
|
| sure sure security is all good and snazzy but i regularly
| come across websites who have had certs expired and the
| website makes it appear as if the sky will fall if i
| click on continue.
|
| then we have ISPs who use DPI (my current ISP, reliance
| jio is doing it from day 1) so what's the point of
| pretending anyway?
| jefftk wrote:
| _> why localhost should scream "insecure"_
|
| Localhost, even with HTTP, is a secure context:
| https://developer.mozilla.org/en-US/docs/Web/Security/Secure...
|
| What tool is screaming at you that localhost is insecure?
| 2Gkashmiri wrote:
| browsers. padlock icon is crossed out
| jefftk wrote:
| I just tested this and don't see that. I compared
| http://neverssl.com to running "python3 -m http.server"
| and visiting http://localhost:8000
|
| * Chrome: "Not Secure" on neverssl, "i" in a circle on
| localhost
|
| * Firefox: Padlock with a red line through it on
| neverssl, page icon on localhost
|
| * Safari: "Not Secure" on neverssl, no message on
| localhost
| scrose wrote:
| They may be using a self-signed cert so it's
| https://localhost and the browser is flagging the cert
| rather than localhost itself.
| rileymat2 wrote:
| > email is sent over the clear but we are told to use
| https to connect to the website (for webmails) using
| https for "security"
|
| This is not true, most is encrypted in transit. It is not
| end to end, because your email service stores them
| (perhaps encrypted perhaps not).
|
| Edit: https://transparencyreport.google.com/safer-email/overview?h...
|
| You can see 84% of outbound is encrypted. This probably
| is generally a good proxy for the state of email tls
| transport.
| zX41ZdbW wrote:
| I have prepared a nice report: the rank of the websites in
| groups 1..10, 11..100, ... the percentage of TLS and an example
| of non-TLS website:
|
| https://play.clickhouse.com/play?user=play#U0VMRUNUIGZsb29yK...
|
|     SELECT floor(log10(rank)) AS r, count() AS total,
|         sum(log LIKE '%TLS%') AS tls,
|         round(tls / total, 2) AS ratio,
|         anyIf(domain, log NOT LIKE '%TLS%')
|     FROM minicrawl
|     WHERE log LIKE '%Content-Length:%'
|     GROUP BY r ORDER BY r
|
|     +-r-+---total-+-----tls-+-ratio-+-anyIf(domain, notLike(log, '%TLS%'))-+
|     | 0 |       6 |       6 |     1 |                                      |
|     | 1 |      61 |      58 |  0.95 | baidu.com                            |
|     | 2 |     599 |     562 |  0.94 | google.cn                            |
|     | 3 |    5591 |    5057 |   0.9 | volganet.ru                          |
|     | 4 |   51279 |   44291 |  0.86 | furbo.co                             |
|     | 5 |  476181 |  361910 |  0.76 | funygold.com                         |
|     | 6 | 3797023 | 2927052 |  0.77 | funyo.vip                            |
|     +---+---------+---------+-------+--------------------------------------+
|
|     7 rows in set. Elapsed: 0.844 sec. Processed 7.59 million rows,
|     43.74 GB (8.99 million rows/s., 51.83 GB/s.)
| kedmi wrote:
| Excuse my ignorance, what CLI tool did you use to execute
| this query? Thanks!
| [deleted]
| zX41ZdbW wrote:
| clickhouse-client
|
| Download it as:
|
|     curl https://clickhouse.com/ | sh
|
| Connect to the demo service:
|
|     clickhouse-client --host play.clickhouse.com --user play --secure
| zX41ZdbW wrote:
| If you are interested in the research on technologies used on the
| Internet, I recommend playing with the "Minicrawl" dataset.
|
| It contains data about ~7 million top websites, and for every
| website, it also contains:
| - the full content of the main page;
| - the verbose output of curl, containing various timing info; the
|   HTTP headers, protocol info...
|
| Using this dataset, you can build a service similar to
| https://builtwith.com/ for your research.
|
| Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw...
| (129 GB compressed, ~1 TB uncompressed).
|
| Description:
| https://github.com/ClickHouse/ClickHouse/issues/18842
|
| You can easily try it with clickhouse-local without downloading:
|
|     $ curl https://clickhouse.com/ | sh
|     $ ./clickhouse local
|     ClickHouse local version 22.13.1.294 (official build).
|
|     milovidov-desktop :) DESCRIBE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
|
|     DESCRIBE TABLE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
|
|     Query id: 6746232f-7f5f-4c5a-ac68-d749d949a2dc
|
|     +-name----+-type---+-default_type-+-default_expression-+-comment-+-codec_expression-+-ttl_expression-+
|     | rank    | UInt32 |              |                    |         |                  |                |
|     | domain  | String |              |                    |         |                  |                |
|     | log     | String |              |                    |         |                  |                |
|     | content | String |              |                    |         |                  |                |
|     +---------+--------+--------------+--------------------+---------+------------------+----------------+
|
|     4 rows in set. Elapsed: 1.390 sec.
|
|     milovidov-desktop :) SELECT rank, domain, log, substringUTF8(content, 1, 100) FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst') LIMIT 1 FORMAT Vertical
|
|     SELECT
|         rank,
|         domain,
|         log,
|         substringUTF8(content, 1, 100)
|     FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
|     LIMIT 1
|     FORMAT Vertical
|
|     Query id: 8dba6976-0bf6-4ce8-a0f1-aa579c828175
|
|     Row 1:
|     ------
|     rank: 1907977
|     domain: 0--0.uk
|     log:
|     * Trying 213.32.47.30:80...
|     * Connected to 0--0.uk (213.32.47.30) port 80 (#0)
|     > GET / HTTP/1.1
|     > Host: 0--0.uk
|     > Accept: */*
|     > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
|     >
|     * Mark bundle as not supporting multiuse
|     < HTTP/1.1 302 Moved Temporarily
|     < Server: nginx
|     < Date: Sun, 29 May 2022 06:27:14 GMT
|     < Content-Type: text/html
|     < Content-Length: 154
|     < Connection: keep-alive
|     < Location: https://0--0.uk/
|     <
|     * Ignoring the response-body
|     { [154 bytes data]
|     * Connection #0 to host 0--0.uk left intact
|     * Issue another request to this URL: 'https://0--0.uk/'
|     * Trying 213.32.47.30:443...
|     * Connected to 0--0.uk (213.32.47.30) port 443 (#1)
|     * ALPN, offering h2
|     * ALPN, offering http/1.1
|     * CAfile: /etc/ssl/certs/ca-certificates.crt
|     * CApath: /etc/ssl/certs
|     * TLSv1.0 (OUT), TLS header, Certificate Status (22):
|     } [5 bytes data]
|     * TLSv1.3 (OUT), TLS handshake, Client hello (1):
|     } [512 bytes data]
|     * TLSv1.2 (IN), TLS header, Certificate Status (22):
|     { [5 bytes data]
|     * TLSv1.3 (IN), TLS handshake, Server hello (2):
|     { [108 bytes data]
|     * TLSv1.2 (IN), TLS header, Certificate Status (22):
|     { [5 bytes data]
|     * TLSv1.2 (IN), TLS handshake, Certificate (11):
|     { [4150 bytes data]
|     * TLSv1.2 (IN), TLS header, Certificate Status (22):
|     { [5 bytes data]
|     * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
|     { [333 bytes data]
|     * TLSv1.2 (IN), TLS header, Certificate Status (22):
|     { [5 bytes data]
|     * TLSv1.2 (IN), TLS handshake, Server finished (14):
|     { [4 bytes data]
|     * TLSv1.2 (OUT), TLS header, Certificate Status (22):
|     } [5 bytes data]
|     * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
|     } [70 bytes data]
|     * TLSv1.2 (OUT), TLS header, Finished (20):
|     } [5 bytes data]
|     * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
|     } [1 bytes data]
|     * TLSv1.2 (OUT), TLS header, Certificate Status (22):
|     } [5 bytes data]
|     * TLSv1.2 (OUT), TLS handshake, Finished (20):
|     } [16 bytes data]
|     * TLSv1.2 (IN), TLS header, Finished (20):
|     { [5 bytes data]
|     * TLSv1.2 (IN), TLS header, Certificate Status (22):
|     { [5 bytes data]
|     * TLSv1.2 (IN), TLS handshake, Finished (20):
|     { [16 bytes data]
|     * SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
|     * ALPN, server accepted to use http/1.1
|     * Server certificate:
|     * subject: CN=mail.htservices.co.uk
|     * start date: May 15 18:36:37 2022 GMT
|     * expire date: Aug 13 18:36:36 2022 GMT
|     * subjectAltName: host "0--0.uk" matched cert's "0--0.uk"
|     * issuer: C=US; O=Let's Encrypt; CN=R3
|     * SSL certificate verify ok.
|     * TLSv1.2 (OUT), TLS header, Supplemental data (23):
|     } [5 bytes data]
|     > GET / HTTP/1.1
|     > Host: 0--0.uk
|     > Accept: */*
|     > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
|     >
|     * TLSv1.2 (IN), TLS header, Supplemental data (23):
|     { [5 bytes data]
|     * Mark bundle as not supporting multiuse
|     < HTTP/1.1 200 OK
|     < Server: nginx
|     < Date: Sun, 29 May 2022 06:27:15 GMT
|     < Content-Type: text/html;charset=utf-8
|     < Transfer-Encoding: chunked
|     < Connection: keep-alive
|     < X-Frame-Options: SAMEORIGIN
|     < Expires: -1
|     < Cache-Control: no-store, no-cache, must-revalidate, max-age=0
|     < Pragma: no-cache
|     < Content-Language: en-US
|     < Set-Cookie: ZM_TEST=true;Secure
|     < Set-Cookie: ZM_LOGIN_CSRF=b2dda010-d795-4759-a9c3-80349f3b46ed;Secure;HttpOnly
|     < Vary: User-Agent
|     < X-UA-Compatible: IE=edge
|     < Vary: Accept-Encoding, User-Agent
|     <
|     { [13068 bytes data]
|     * Connection #1 to host 0--0.uk left intact
|     substringUTF8(content, 1, 100): <!DOCTYPE html> <!-- set this class so CSS definitions that now use REM size, would work relative to
|
|     1 row in set. Elapsed: 0.539 sec. Processed 4.60 thousand rows,
|     273.86 MB (8.54 thousand rows/s., 508.28 MB/s.)
| simonw wrote:
| How does that work? How can clickhouse-local run queries
| against a 129 GB file hosted on S3 without downloading the
| whole thing?
|
| Is it using HTTP range header tricks, like DuckDB does for
| querying Parquet files?
| https://duckdb.org/docs/extensions/httpfs.html
|
| If so, what's the data.native.zst file format? Is it similar to
| Parquet?
| zX41ZdbW wrote:
| Yes, the native format is very similar to Parquet.
|
| It works for Parquet as well:
|
|     SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits.parquet') LIMIT 1
|
| And for CSV or TSV:
|
|     SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/tsv/github_events_v3.tsv.xz') LIMIT 1
|
| And for ndJSON:
|
|     SELECT repo_name, created_at, event_type
|     FROM s3('https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/github_events_*.gz',
|         JSONLines,
|         'repo_name String, actor_login String, created_at String, event_type String')
|     WHERE actor_login = 'simonw' LIMIT 10
|
| Note: the query above is kind of slow. Here is the query from
| preloaded data - your activity in GitHub issues:
|
| https://play.clickhouse.com/play?user=play#U0VMRUNUIGNyZWF0Z...
| simonw wrote:
| Another question about that demo.
|
| https://clickhouse.com/docs/en/getting-started/example-datas...
| says "Dataset contains all events on GitHub from
| 2011 to Dec 6 2020" - but I'm seeing results in there from
| a couple of hours ago.
|
| Do you know if that's continually updated and, if so, is
| that documented anywhere?
| cldellow wrote:
| > How does that work?
|
| Disclaimer: I'm not a Clickhouse user, but I have a bit of
| experience with Parquet.
|
| It looks like the native format is (very briefly) described
| here:
| https://clickhouse.com/docs/en/interfaces/formats/#native
|
| It looks similar at a high level to Parquet: binary, columnar
| and has metadata that permits requesting a subset of the
| data.
|
| Looking at:
|
| > Processed 4.60 thousand rows, 273.86 MB
|
| I'd guess it's chunking the rows into groups of ~4,000.
|
| The OP must have a nice connection if that completed in 0.5
| seconds! (Or perhaps the 273.86MB is the uncompressed size
| after zstd decompression, or perhaps there were other parts of
| the session that caused that chunk to get cached, and it was
| elided from what was pasted in to HN.)
|
| EDIT: I was curious, so I ran the tool and watched bandwidth
| on iftop. It uses about 50MB each time I run the query. From
| this, I conclude: it does not cache things, the 273.86MB is
| the uncompressed size, and OP has a much better internet
| connection than me. :)
| deterrence wrote:
| [flagged]
| kristianp wrote:
| This raises the question: how much in the way of user telemetry
| does Chrome send back to google?
| tgsovlerkhgsel wrote:
| By default, a lot. However, they also are (or at least used to
| be, it seems to be quite outdated now) really good at
| documenting their telemetry publicly:
| https://www.google.com/chrome/privacy/whitepaper.html
|
| (I haven't checked whether the documentation is
| complete/accurate, of course.)
| est wrote:
| Looks like not a single Chinese site made it into the top 1k. I
| guess it's reasonable because all Google services are blocked
| there, so CrUX can't gather any data.
| themoonisachees wrote:
| Do Chinese people use chrome? One would think the download page
| is blocked as well, so the demographic for chrome users should
| be way smaller.
|
| Also to consider: China uses in-app browsing a lot, with
| interactive experiences very similar to websites built right in
| the bilibili/ali/wechat apps.
| moffkalast wrote:
| > in-app browsing
|
| But that's also just chromium isn't it, much like a PWA?
| Unless they made something of their own.
| kristianp wrote:
| > The CrUX dataset is based on data collected from Google
| Chrome and is thus biased away from countries with limited
| Chrome usage (e.g., China).
| Mortiffer wrote:
| Thanks for pointing this out. Can definitely put this dataset to
| use
___________________________________________________________________
(page generated 2022-12-31 23:00 UTC)