[HN Gopher] How Bear does analytics with CSS ___________________________________________________________________ How Bear does analytics with CSS Author : todsacerdoti Score : 292 points Date : 2023-11-01 08:08 UTC (14 hours ago) (HTM) web link (herman.bearblog.dev) (TXT) w3m dump (herman.bearblog.dev) | user20231101 wrote: | Smart approach! | nannal wrote: | > And not just the bad ones, like Google Analytics. Even Fathom | and Plausible analytics struggle with logging activity on | adblocked browsers. | | I believe that's because they're trying to live in what amounts to a | toxic wasteland. Users like us are done with the whole concept, | and as such I assume that if CSS analytics becomes popular, | attempts will be made to bypass it too. | account-5 wrote: | Makes me nostalgic for uMatrix, which could block the loading | of CSS too. | momentary wrote: | Is uMatrix not in vogue any more? It's still my go-to tool! | account-5 wrote: | It's not actively developed anymore, so I've been using | uBlock's advanced options, which are good but not as good as | uMatrix was. | its-summertime wrote: | ||somesite.example^$css | | would work in uBlock | account-5 wrote: | I didn't know this. But with uMatrix you could block by default on | all websites and then whitelist those you wanted it for. At | least that's the way I used it and uBlock's advanced-user | features. | berkes wrote: | Why? | | I manually unblocked Piwik/Matomo, Plausible and Fathom | from uBlock. I don't see any harm in what and how these track. | And they do give the people behind the site valuable | information "to improve the service". | | E.g. Plausible collects less information on me than the common | nginx or Apache logs do. For me, as a blogger, it's important to | see when a post gets on HN, is linked from somewhere, and what | kinds of content are valued and which are ignored. So that I | can blog about stuff you actually want to read and spread it | through channels so that you are actually aware of it.
| morelisp wrote: | You're just saying a smaller-scale version of "as a publisher | it's important for me to collect data on my audience to | optimize my advertising revenue." The adtech companies take | the shit for being the visible 10% but publishers are | consistently the ones pressuring for more collection. | ordersofmag wrote: | I'm a website 'publisher' for a non-profit that has zero | advertising on our site. Our entire purpose for collecting | analytics is to make the site work better for our users. | Really. Folks like us may not be in the majority but it's | worth keeping in mind that "analytics = ad revenue | optimization" is over-generalizing. | morelisp wrote: | I'm sure your stated 13 years of data is absolutely | critical to optimize your page load times. | majewsky wrote: | Can you give some examples of changes that you made | specifically to make the site work better for users, and | how those were guided by analytics? I usually just do | user interviews because building analytics feels like | summoning a compliance nightmare for little actual | impact. | arp242 wrote: | I've decided to either stop working or keep working on | some things based on the fact that I did or didn't get | any traffic for it. I've become aware some pages were | linked on Hacker News, Lobsters, or other sites, and | reading the discussion I've been able to improve some | things in the article. | | And also just knowing some people read what you write is | nice. There is nothing wrong with having some validation | (as long as you don't obsess over it) and it's a basic | human need. | | This is just for a blog; for a product knowing "how many | people actually use this?" is useful. I suspect that for | some things the number is literally 0, but it can be hard | to know for sure. | | User interviews are great, but it's time-consuming to do | well and especially for small teams this is not always | doable. 
It's also hard to catch things that are useful | for just a small fraction of your users. I.e. "it's | useful for 5%" means you need to do a lot of user | interviews ( _and_ hope they don't forget to mention | it!) | HuwFulcher wrote: | How horrifying that someone who does writing, potentially as | their income, would seek to protect that revenue stream. | | Services like Plausible give you the bare minimum to | understand what is viewed most. If you have a website that | you want people to visit then it's a pretty basic | requirement that you'll want to see what people are | interested in. | | When you start "personalising" the experience based on some | tracking, that's when it becomes a problem. | peoplefromibiza wrote: | > a pretty basic requirement that you'll want to see what | people are interested in. | | not really | | it should be what you are competent and proficient at | | people will come because they like what you do, not | because you do the things they like (sounds like the same | thing, but it isn't) | | there are many proxies to know what they like if you want | to plan what to publish and when and for how long; | website visits are one of the least interesting. | | a lot of websites such as this one get a lot of visits | that drive no revenue at all.
| | OTOH there are websites that receive a small number of | visits, but make revenue based on the number of people | subscribing to the content (the textbook example is OF, where | people can get from a handful of subscribers what | others earn from hundreds of thousands of views on YT or | the like) | | so basically monitoring your revenues works better than | constantly optimizing for views; in the latter case you | are optimizing for the wrong thing | | I know a lot of people who sell online that do not use | analytics at all, except for coarse-grained ones like | number of subscriptions, number of items sold, how many | emails they receive about something they published, or | messages from social platforms etc. | | that's been true in my experience through almost 30 years | of interacting with and helping publish creative content | online and offline (books, records, etc) | HuwFulcher wrote: | > people will come because they like what you do, not | because you do the things they like (sounds like the same | thing, but it isn't) | | This isn't true for all channels. The current state of | search requires you to adapt your content to what people | are looking for. Social channels are as you've said. | | It doesn't matter how you want to slice it. Understanding | how many people are coming to your website, from where, | and what they're looking at is valuable. | | I agree the "end metric" is whatever actually drives the | revenue. But the number of people coming to a website can | help tune that. | cpill wrote: | Emails received or messages on social media are just | another analytic, filling the same need as knowing | page hits. And somehow these people are analytics | junkies all the same, just not mainlining page hits. You're | unconvincing in the argument that "analytics are not | needed". | marban wrote: | Plausible still works if you reverse-proxy the script and the | event URL through your own /randompath.
| chrismorgan wrote: | This approach is no harder to block than the JavaScript | approaches: you're just blocking requests to certain URL | patterns. | nannal wrote: | That approach would work until analytics gets mixed in with | actual styles, and then you're trying to use a website without | CSS. | chrismorgan wrote: | You're blocking the _image_, not the CSS. Here's a rule to | catch it at present: ||bearblog.dev/hit/ | | This is the shortest it can be written with certainty of no | false positives, but you can do things like making the URL | pattern more specific (e.g. _/hit/*/_) or adding the | _image_ option (append _$image_) or just removing the | ||bearblog.dev domain filter if it spreads to other domains | as well (there probably aren't enough false positives to | worry about). | | I find it also worth noting that _all_ of these techniques | are pretty easily circumventable by technical means, by | blending content and tracking/ads/whatever. In case of | all-out war, content blockers _will_ lose. It's just that | no one has seen fit to escalate that far (and in some cases | there are legal limitations, potentially on both sides of | the fight). | macNchz wrote: | > In case of all-out war, content blockers will lose. | It's just that no one has seen fit to escalate that far | (and in some cases there are legal limitations, | potentially on both sides of the fight). | | The Chrome Manifest v3 and Web Environment Integrity | proposals are arguably some of the clearest steps in that | direction, a long-term strategy being slow-played to | limit pushback. | ben_w wrote: | The bit of the web that feels to me like a toxic wasteland is | all the adverts; the tracking is a much more subtle issue, | where the damage is the long-term potential of having a digital | twin that can be experimented on to find how best to manipulate | me. | | I'm not sure how many people actually fear that.
Might get | responses ranging from "yes, and it's creepy" to "don't be daft, | that's just sci-fi". | input_sh wrote: | Nothing's gonna block your webserver's access.log fed into an | analytics service. | | If anything, you're gonna get numbers that are inflated, because | it's all but impossible to dismiss all of the bot traffic just by | looking at user agents. | victorbjorklund wrote: | This does make sense! Might try it for my own analytics solution. | Can anyone think of a downside of this vs. JS? | berkes wrote: | I can think of many "downsides", but whether those matter or are | actually upsides really depends on your use-case and | perspective. | | * You cannot (easily) track interaction events (esp. relevant | for SPAs, but also things like "user highlighted X" or "user | typed Y, then backspaced, then typed Z") | | * You cannot track timings between events (e.g. how long a user | is on the page) | | * You cannot track data such as screen sizes, agents, etc. | | * You cannot track errors and exceptions. | Wouter33 wrote: | Nice implementation! Just a heads-up: hashing the IP like that is | still considered tracking under GDPR and requires a privacy | banner in the EU. | thih9 wrote: | Can you explain why or link a source? I'd like to learn the | details. | fizzbuzz-rs wrote: | Likely because the hash of an IP can easily be reversed, as | there are only ~2^32 IPv4 addresses. | openplatypus wrote: | It is not just that. Having the user IP and such a hashing | approach, you can re-identify past sessions. | thih9 wrote: | What if my hashing function has a high likelihood of | collisions? | firtoz wrote: | Then you cannot trust the analytics | thih9 wrote: | Do you trust analytics that doesn't use JS? Or relies on | mobile users to scroll the page before counting a hit? | | It's all a heuristic, and even with high-collision | hashing, analytics would provide some additional insight. | rjmunro wrote: | You can estimate the actual numbers based on the | collision rate.
| | Analytics is not about absolute accuracy; it's about | measuring differences: things like which pages are most | popular, did traffic grow when you ran a PR campaign, etc. | dsies wrote: | https://gdpr-info.eu/art-4-gdpr/ paragraph 1: | | > 'personal data' means any information relating to an | identified or identifiable natural person ('data subject'); | an identifiable natural person is one who can be identified, | directly or indirectly, in particular by reference to an | identifier such as a name, an identification number, location | data, an online identifier or to one or more factors specific | to the physical, physiological, genetic, mental, economic, | cultural or social identity of that natural person; | thih9 wrote: | This does not reference hashing, which can be an | irreversible and destructive operation. As such, it can | remove the "relating" part - i.e. you'll no longer be able | to use the information to relate it to an identifiable | natural person. | | In this context, if I define a hashing function that e.g. | sums all IP address octets, what then? | jvdvegt wrote: | A hash (whether MD5 or some SHA) of an IPv4 address is easily | reversed. | | Summing octets is non-reversible, so it seems like a good | 'hash' to me (but note: you'll get a lot of collisions). | And of course, IANAL. | dsies wrote: | I was answering your request for a source. | | The linked article talks about _identification numbers_ | that can be used to link to a person. I am not a lawyer, but | the article specifically refers to one person. | | By that logic, if the hash you generate cannot be linked | to exactly one specific person/request, you're in the | clear. I think ;) | openplatypus wrote: | Correct. This is a flawed hashing implementation, as it allows | for re-identification. | | Having that IP and the user's timezone, you can generate the same hash | and trace back the user. This is hardly anonymous hashing.
| | Wide Angle Analytics adds a daily, transient salt to each IP hash, | which is never logged, thus generating a truly anonymous hash | that prevents re-identification. | thih9 wrote: | What if my hashing function is really destructive and has a | high likelihood of collisions? | hk__2 wrote: | > What if my hashing function is really destructive and has | high likelihood of collisions? | | If it's so destructive that it's impossible to track users, | it's useless for you. If not, you need a privacy banner. | thih9 wrote: | A high-collision hash would be useful for me on my low- | traffic page, and I'd enjoy not having to display a cookie | banner. | | Also: https://news.ycombinator.com/item?id=38096235 | victorbjorklund wrote: | Probably should be "salted hashes might be considered PII". It | has not been tried by the EU court, and the law is not 100% clear. | It might be. It might not be. | e38383 wrote: | If the data gets stored in this way (hash of IP[0]) for a long | time, I'm with you. But if you only store the data for 24 hours, | it might still count as temporary storage and should be | "anonymized" enough. | | IMO (and I'm not a lawyer): if you store ip+site for 24 hours | and after that only store "region" (maybe country or state) and | site, this should be GDPR compliant. | | [0] it should use sha256 or similar, and not md5 | donohoe wrote: | Actually no. It's very likely this is fine. Context is | important. | | Not a lawyer, but I discussed this previously with lawyers when | building a GDPR framework a while back. | sleepyhead wrote: | Context is irrelevant. What is relevant is whether a value, | for example a hash, can be tied to a specific person in | some way. | donohoe wrote: | I'm really not going to argue here. | | I've been told this directly by lawyers who specialize in | GDPR and CCPA etc. I will take their word over yours. | | If you are a lawyer with direct expertise in this area then | I'm willing to listen.
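Whatever the legal reading, the technical point made in this thread, that an IP hash with a known or guessable salt (such as the date) can simply be brute-forced, is easy to demonstrate. A toy sketch in Python; the function names and the /24 sample range are mine, and the full 2^32 space takes only minutes on a modern laptop:

```python
import hashlib

def anonymize(ip, day):
    # The scheme under discussion: hash(ip + date), with the date acting as the "salt"
    return hashlib.sha256((ip + day).encode()).hexdigest()

def reverse_hash(target, day, prefix="203.0.113."):
    # Brute force: with the date known, just try every candidate IP.
    # A single /24 here for speed; a real attacker enumerates all of IPv4.
    for last in range(256):
        candidate = f"{prefix}{last}"
        if anonymize(candidate, day) == target:
            return candidate
    return None
```

For example, `reverse_hash(anonymize("203.0.113.42", "2023-11-01"), "2023-11-01")` recovers the original address, which is exactly the re-identification problem being described.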
| mcny wrote: | On the topic of analytics, how do you store them? | | Let's say I have an e-commerce website with products I want to | sell. In addition to analytics, I decide to log a select few | actions myself, such as visits to the product detail page while logged | in. So I want to store things like user id, product id, | timestamp, etc. | | How do I actually store this? My naive approach is to stick it in | a table. The DBA yelled at me and asked how long I need the data. I | said at least a month. They said ok, and I think they moved all | older data to a different table (set up a job for it?) | | How do real people store these logs? How long do you keep them? | ludwigvan wrote: | ClickHouse | jon-wood wrote: | Unless you're at huge volume you can totally do this in a | Postgres table. Even if you are, you can partition that table by | date (or whatever other attributes make sense) so that you | don't have to deal with massive indexes. | | I once did this, and we didn't need to even think about | partitioning until we hit a billion rows or so. (But partition | sooner than that; it wasn't a pleasant experience) | n_e wrote: | An analytics database is better (ClickHouse, BigQuery...). | | They can do aggregations much faster and can deal with | sparse/many columns (the "paid" event has an "amount" | attribute, the "page_view" event has a "url" attribute...) | ordersofmag wrote: | We've got 13 years' worth of data stored in MySQL (5 million | visitors/year). It's a pain to query there, so we keep a copy in | ClickHouse as well (which is a joy to query). | mcny wrote: | I only track visits to a product detail page so far. | Basically, some basic metadata about the user (logged-in | only), some metadata about the product, and basic "auditing" | columns -- created by, created date, modified by, modified | date (although why I have modified by and modified date makes | no sense to me; I don't anticipate ever editing these, | they're only there for "standardization".
I don't like it, but | I can only fight so many battles at a time). | | I am approaching 1.5 million rows in under two months. | Thankfully, my DBA is kind, generous, and infinitely patient. | | ClickHouse looks like a good approach. I'll have to look into | that. | | > select count(*) from trackproductview; | | > 1498745 | | > select top 1 createddate from TrackProductView order by | createddate asc; | | > 2023-08-18 11:31:04.000 | | What is the maximum number of rows in a ClickHouse table? Is | there such a limit? | victorbjorklund wrote: | I use Postgres with TimescaleDB. Works unless your e-commerce | is amazon.com. The great thing with TimescaleDB is that they take | care of creating materialized views with the aggregates you | care about (like product views per hour etc.), and you can even | choose to "throw away" the events themselves and just keep the | aggregations (to avoid getting a huge DB if you have a lot of | events). | p4bl0 wrote: | The :hover pseudo-class could be applied and unapplied multiple | times for a single page load. This can certainly be mitigated | using cache-related HTTP headers, but then if the same page is | visited by the same person a second time coming from the same | referrer, the analytics endpoint won't be loaded. | | But maybe I'm not aware that browsers guarantee that "images" | loaded using url() in CSS will be (re)loaded exactly once per | page? | kevincox wrote: | I'm not sure about `url()` in CSS, but `<img>` tags are | guaranteed to only be loaded once per URL per page. I would | assume that `url()` works the same. | | This bit me when I tried to make a page that reloads an image as | a form of monitoring. However, the URL interestingly includes the | fragment (after the #) even though it isn't sent to the server. | So I managed to work around this by appending #1, #2, #3... to | the image URL. | | https://gitlab.com/kevincox/image-monitor/-/blob/e916fcf2f9a...
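The "throw away the events, keep the aggregations" idea mentioned for TimescaleDB can be sketched storage-agnostically as a periodic roll-up job. A minimal sketch in plain Python; the event shape and the function name are illustrative, not from TimescaleDB or ClickHouse:

```python
from collections import Counter

def rollup(events):
    """Collapse raw view events into per-(day, product) counts,
    after which the raw rows can safely be deleted."""
    counts = Counter()
    for e in events:
        day = e["timestamp"][:10]  # ISO timestamps: keep just the date part
        counts[(day, e["product_id"])] += 1
    return counts

# Hypothetical raw events as they might be logged before the roll-up.
events = [
    {"timestamp": "2023-08-18T11:31:04", "product_id": 7, "user_id": 1},
    {"timestamp": "2023-08-18T12:02:10", "product_id": 7, "user_id": 2},
    {"timestamp": "2023-08-19T09:15:00", "product_id": 9, "user_id": 1},
]
daily = rollup(events)
```

The same shape works as a nightly job against any table: aggregate yesterday's partition into a small summary table, then drop the partition, which is roughly what partition-by-date setups and TimescaleDB's continuous aggregates automate.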
| alabhyajindal wrote: | Wow, I didn't know you could trigger a URL endpoint with CSS! | dontlaugh wrote: | Why not just get this info from the HTTP server? | victorbjorklund wrote: | Hard if you run serverless | dontlaugh wrote: | There's still a server somewhere, and it can log URLs and IPs. | tmikaeld wrote: | Not if it's statically generated HTML/CSS. | | And the real benefit of this trick is separating users from | bots. | berkes wrote: | And even if there are many servers (a CDN or distributed | caching) you can collect and merge these. | victorbjorklund wrote: | Tell me how to collect the logs for static sites on | Cloudflare Pages (not Functions. The Pages sites) | berkes wrote: | Cloudflare Pages are running on servers. These servers | (can, quite certainly will) have logs. | | That you cannot access the logs because you don't own the | servers doesn't mean there aren't any servers that have | logs. | victorbjorklund wrote: | Yes, no one has argued that Cloudflare Pages aren't using | servers. But it is "hard" to track using logs if you are | a Cloudflare customer. Guess the only way would be to hack | into Cloudflare itself and access my logs that way. But | that is "hard" (because yes, theoretically it is possible, I | know). And not a realistic alternative. | victorbjorklund wrote: | Of course. But you can't access it. You can't get logs for | static sites on Cloudflare Pages. | Spivak wrote: | Huh? You can get logs just fine from your ALBs and API | Gateways. | hk__2 wrote: | > Why not just get this info from the HTTP server? | | This is explained in the blog post: | | > There's always the option of just parsing server logs, which | gives a rough indication of the kinds of traffic accessing the | server. Unfortunately all server traffic is generally seen as | equal. Technically bots "should" have a user-agent that | identifies them as a bot, but few identify that since they're | trying to scrape information as a "person" using a browser.
In essence, just using server logs for analytics gives a skewed | perspective of traffic, since a lot of it is search-engine | crawlers and scrapers (and now GPT-based parsers). | dontlaugh wrote: | Don't bots now load an entire browser, including simulated | user interaction, to the point where there's no difference? | janosdebugs wrote: | Not for the most part; it's still very expensive. Even if they | do, they don't simulate mouse movement. | spiderfarmer wrote: | All bots | jackjeff wrote: | The whole anonymization of IP addresses by just hashing the date | and IP is just security theater. | | Cryptographic hashes are designed to be fast. You can do 6 | billion MD5 hashes in a second on a MacBook (M1 Pro) via hashcat, | and there are only 4 billion IPv4 addresses. So you can brute force | the entire range and find the IP address. Basically reverse the | hash. | | And that's true even if they used something secure like SHA-256 | instead of broken MD5. | berkes wrote: | Maybe they use a secret salt or rotating salt? The example code | doesn't, so I'm afraid you are right. But one addition and it | can be made reasonably secure. | | I am afraid, however, that this security theater is enough to | pass many laws, regulations and such on PII. | ktta wrote: | Not if they use a password hash like Argon2 or scrypt | ale42 wrote: | But that's very heavy to compute at scale... | isodev wrote: | True, but also it's a blogging platform - does it really | have that kind of scale to be concerned with? | ale42 wrote: | Probably not; I was mainly thinking of whether that kind of | solution could be adopted at a scale like Google | Analytics. | __alexs wrote: | Even then it is theatre, because if you know the IP address | you want to check, it's trivial to see if there's a match. | chrismorgan wrote: | And _this_ is why such a hash will still be considered | personal data under legislation like the GDPR. | TekMol wrote: | That is easy to fix though. Just use a temporary salt.
| | Pseudo code:
|
|     if salt.day < today():
|         salt = {day: today(), salt: random()}
|     ip_hash = sha256(ip + salt.salt)
|
| __alexs wrote: | Assuming you don't store the salts, this produces a value | that is useless for anything but counting something like DAU. | Which you could equally just do by counting them all and | deleting all the data at the end of the day, or by using a | cardinality estimator like HLL. | TekMol wrote: | DAU in regard to a given page. | | Have you read the article? That is what the author's goal | seems to be. | | He wants to prevent multiple requests to the same page by | the same IP being counted multiple times. | tatersolid wrote: | Is that more efficiently done with an appropriate caching | header on the page as it is served? | | Cache-Control: private, max-age=86400 | | This prevents repeat requests for normal browsers from | hitting the server. | dvdkon wrote: | That same uselessness for long-term identification of users | is what makes this approach compliant with laws regulating | use of PII, since what you have after a small time window | isn't actually PII (unless correlated with another dataset, | but that's always the case). | SamBam wrote: | That's precisely all that OP is storing in the original | article. | | They're just getting a list of hashes per day, and | associated client info. They have no idea if the same user | visits them on multiple days, because the hashes will be | different. | kevincox wrote: | Of course, if you have multiple servers or may reboot, you need | to store the salt somewhere. If you are going to bother | storing the salt and cleaning it up after the day is over, it | may be just as easy to clean the hashes at the end of the day | (and keep the total count), which is equivalent. This should | work unless you want to keep individual counts around for | something like seeing the distribution of requests per IP or | similar.
But in that case you could just replace the hashes | with random values at the end of the day to fully anonymize | them, since you no longer need to increment them. | Etheryte wrote: | For context, this problem also came up in a discussion about | Storybook doing something similar in their telemetry [0], and | with zero optimization it takes around two hours to calculate | the salted hashes for every IPv4 address on my home laptop. | | [0] https://news.ycombinator.com/item?id=37596757 | hnreport wrote: | This is the type of comment that reinforces not even trying to | learn or outsource security. | | You'll never know enough. | petesergeant wrote: | I think the opposite? I'm a dev with a bit of an interest in | security, and this immediately jumped out at me from the | story; knowing enough security to discard bad ideas is | useful. | WhyNotHugo wrote: | Aside from it being technically trivial to get an IP back from | its hash, the EU data protection agency made it very clear that | "hashing PII does not count as anonymising PII". | | Even if you hash somebody's full name, you can later answer the | question "does this hash match this specific full name?". | Being able to answer this question implies that the | anonymisation process is reversible. | kevincox wrote: | I think the word "reversible" here is being stretched a bit. | There is a significant difference between being able to list | every name that has used your service and being able to check | if a particular name has used your service. (Of course these | can be effectively the same in cases where you can list all | possible inputs, such as hashed IPv4 addresses.) | | That doesn't mean that hashing is enough for pure anonymity, | but used properly, hashes are definitely a step above | something fully reversible (like encryption with a common | key). | SamBam wrote: | I'm not sure the distinction is meaningful.
If the police | demand your logs to find out whether a certain IP address | visited in the past year, they'd be able to find that out | pretty quickly given what's stored. So how is privacy being | respected? | pluto_modadic wrote: | if it fulfills the same function, does it matter? | | if you have an ad ID for a person, say example@example.com, | and you want to deduplicate it, | | if you provide them with the names, the company that buys | the data can still "blend" it with data they know, if they | know how the hash was generated... and effectively get back | that person's email, or IP, or phone number, or at least | get a good hunch that the closest match is such and such | person, with uncanny certainty. | | de-anonymization of big data is trivial in basically every | case where the system was written by an advertising company | instead of by a truly privacy-focused business. | | if it were really a non-reversible hash, it would be evenly | distributed, not predictable, and basically useless for | advertising, because it wouldn't preserve locality. It | needs to allow for finding duplicates... so the person you | give the hash to can abuse that fact. | bayindirh wrote: | We're members of some EU projects, and they share a common | help desk. To serve as a knowledge base, the tickets are | kept, but all PII is anonymized after 2 years AFAIK. | | What they do is pretty simple. They overwrite the data fields | with the text "<Anonymized>". No hashes, no identifiers, | nothing. Everything is gone. Plain and simple. | spookie wrote: | KISS. That's the best way to go about it. | jefftk wrote: | It depends. For example, if each day you generate a random | nonce and use it to salt that day's PII (and don't store the | nonce), then you cannot later determine (a) did person A visit | on day N, or (b) is visitor X on day N the same as visitor Y | on day N+1.
But you can still determine how many distinct | visitors you had on day N, and answer questions about within- | day usage patterns. | ilrwbwrkhv wrote: | Yes, but if the business is not in the EU, they don't need to | care one bit about GDPR or the EU. | troupo wrote: | If they target residents of the EU, they must care. | | _Edit:_ | | This is a different bear: | | Also, Bear claims to be GDPR compliant: | https://bear.app/faq/bear-is-gdpr-compliant/ | TylerE wrote: | Is an IPv4 address really classed as PII? Sounds a bit | insane. | beardog wrote: | It can be used to track you across the web, get a general | geographic area, and with the right connections one | can get the ISP subscriber's address. Given that PII is | anything that can be used to identify a person, I think it | qualifies, despite it being difficult for a rando to tie an | IP to a person. | | Additionally, in the case of IPv6 it can be tied to a | specific device more often. One cannot rely on IPv6 privacy | extensions to sufficiently help there. | rtsil wrote: | That's compounded by the increasing use of static IPs, or | at least extremely long-lasting dynamic IPs at some ISPs. | alkonaut wrote: | Hashes should be salted. If you salt, you are fine; if you | don't, you aren't. | | Whether the salt can be kept indefinitely or is rotated | regularly etc. is just an implementation detail, but the key | with salting hashes for analytics is that the salt never | leaves the client. | | As explained in the article, there seems to be no salt (or | rather, the current date seems to be used as a salt, but that's | not a random salt and can easily be guessed by anyone who | wants to ask "did IP x.y.z.w visit on date yy-mm-dd?"). | | It's pretty easy to reason about these things if you look from | the perspective of an attacker. How would you figure out | anything about a specific person given the data? If you can't, | then the data is probably OK to store. | piaste wrote: | > Hashes should be salted.
If you salt, you are fine; if you | don't, you aren't. | | > Whether the salt can be kept indefinitely or is rotated | regularly etc. is just an implementation detail, but the key | with salting hashes for analytics is that the salt never | leaves the client. | | I think I'm missing something. | | If the salt is known to the server, then it's useless for | this scenario. Because given a known salt, you can generate | the hashes for every IP address + that salt very quickly. | (Salting passwords works because the space for passwords is | big, so rainbow tables are expensive to generate.) | | If the salt is unknown to the server, i.e. generated by the | client and 'never leaves the client'... then why bother with | hashes? Just have the client generate a UUID directly instead | of a salt. | rkangel wrote: | Without a salt, you can generate the hash for every IP | address _once_, and then permanently have a hash->IP | lookup (effectively a rainbow table). If you have a salt, | then you need to do it for each database entry, which does | make it computationally more expensive. | tptacek wrote: | People are obsessed with this attack from the 1970s, but | in practice password cracking rigs just brute force the | hashes, and that has been the practice since my career | started in the 1990s when people used `crack`, into the | 2000s with `jtr`, and today with `hashcat` or whatever it | is the cool kids use now. "Rainbow tables" don't matter. | If you're discussing the expense of attacking your scheme | with or without rainbow tables, you've already lost. | jonahx wrote: | > If you're discussing the expense of attacking your | scheme with or without rainbow tables, you've already | lost. | | Can you elaborate on this or link to some info | elaborating on what you mean? I'd like to learn about it. | alkonaut wrote: | > > _the salt never leaves the client_ | | > I think I'm missing something. | | ...
| | > If the salt is known to the server, | | That's what you were missing, yes. | SamBam wrote: | Did you miss the second half, where GP asked why the | client doesn't just send up a UUID instead of generating | their own salt and hash? | arp242 wrote: | > why bother with hashes? Just have the client generate a | UUID directly instead of a salt. | | The reason for all this bonanza is that the ePrivacy | directive requires a cookie banner, making exceptions | only for data that is "strictly necessary in order to | provide a [..] service explicitly requested by the | subscriber or user". | | In the end, you only have a "pinky promise" that someone | isn't doing more processing on the server end, so in | reality it doesn't matter much, especially if the cookie | lifetime is short (hours or even minutes). Actually, a | cookie or other (short-lived!) client-side ID is probably | better for everyone if it weren't for the cookie banners. | TylerE wrote: | ALL of the faff around cookies is the biggest security | theater of the past 40 years. I remember hearing the | fear-mongering in the very early 2000s about cookies in | the mainstream media - it was self-evidently a farce | then, and a farce now. | throwaway290 wrote: | Isn't the data in this case part of the "strictly necessary" | data (the IP address)? That's all that gets collected by that | magic CSS + server, no? | arp242 wrote: | The ePrivacy directive only applies to information stored on | the client side (such as cookies). | darken wrote: | Salts are generally stored with the hash, and are only really | intended to prevent "rainbow table" attacks (i.e. use of | precomputed hash tables). Though a predictable and matching | salt per entry does mean you can attack all the hashes for a | timestamp per hash attempt. | | That being said, the previous responder's point still stands | that you can brute force the salted IPs at about a second per | IP with the colocated salt. Using multiple hash | iterations (e.g. 1000x; i.e.
"stretching") is how you'd meaningfully | increase computational complexity, but still not in a way | that makes use of the general "can't be practically reversed" | hash guarantees. | alkonaut wrote: | As I said, the key to hashing PII for telemetry is that the | client does the hashing on the client side and never | transmits the salt. This isn't a login system or | similar. There is no "validation" of the hash. The hash is | just a unique marker for a user that doesn't contain any | PII. | SamBam wrote: | How does the client generate the same salt every time | they visit the page, without using cookies? | donkeyd wrote: | Use localstorage! | | Kidding, of course. I don't think there's a way to track | users across sessions without storing something and | requiring a 'cookie notification'. Which is kind of the | point of all these laws. | alkonaut wrote: | Storing a salt with 24h expiry would be the same thing as | the solution in the article. It would be better from a | privacy perspective because the IP would then not be | transmitted in a reversible way. | | If I wouldn't ask for permission to send hash(ip + date), | then I'd sure not ask permission if I instead stored a | random salt for each new 24h and sent hash(ip + | todays_salt). | | This is effectively a cookie, and it's not strictly | necessary if it's stats only. So I think on the server | side I'd just invent some reason why it's necessary for | the product itself too, and make the telemetry just an | added bonus. | alkonaut wrote: | If you can use JS it's easy. For example | localStorage.setItem("salt", Math.random()). Without JS | it's hard, I think. I don't know why this author doesn't | want to use JS, perhaps out of respect for his visitors, | but then I think it's worse to send PII over the wire (and | an IP hashed in the way he describes is PII). | SamBam wrote: | EU's consent requirements don't distinguish between | cookies and localStorage, as far as I understand.
And a | salt that is only used for analytics would not count as | "strictly necessary" data, so I think you'd have to put | up a consent popup. Which is precisely the kind of thing | a solution like this is trying to avoid. | alkonaut wrote: | Indeed, but as I wrote in another reply: it doesn't | matter. It's even worse to send PII over the wire. Using | the date as the salt (as he does) just means it's | reversible PII - a.k.a. PII! | | Presumably these are stored on the server side to | identify returning visitors - so instead of storing a | random number for 24 hours on the client, you now have | PII stored on the server. So basically there is no way to | do this that doesn't require consent. | | The only way to do it is to make the information required | for some necessary function, and then let the analytics | piggyback on it. | SamBam wrote: | I think I agree with you there. But again, the idea of a | "salt" is then overcomplicating things. It's exactly the | same to have the client generate a UUID and just send | that up, no salting or hashing required. | alkonaut wrote: | Yup, for only identifying a system that's easier. If this | is all the telemetry is ever planned to do, then that's | all you need. The benefit of having a local hash function | is when you want to transmit multiple IDs for data. E.g. | in a word processor you might transmit | hash(salt+username) on start and hash(salt+filename) when | opening a document, and so on. That way you can send | identifiers for things that are sensitive or private, like | file names, in a standardized way, and you don't need to | keep track of N generated UUIDs for N use cases. | | On the telemetry server you get e.g. | | Function "print" used by user 123 document 345. Using | that you can do things like answering how many times an | average document is printed, or how many times per year an | average user uses the print function. | robertlagrant wrote: | IP address is "non-sensitive PII"[0].
It's pretty hard to | identify someone from an IP address. Hashing and then | deleting every day is very reasonable. | | [0] https://www.ibm.com/topics/pii | sysop073 wrote: | What's the point in hashing the IP + salt then? Just let | each client generate a random nonce and use that as the | key. | tptacek wrote: | Salting a standard cryptographic hash (like SHA2) doesn't do | anything meaningful to slow a brute force attack. This | problem is the reason we have password KDFs like scrypt. | | (I don't care about this Bear analytics thing at all, and | just clicked the comment thread to see if it was the Bear I | thought it was; I do care about people's misconceptions about | hashing.) | alkonaut wrote: | What do you mean by "brute force" in the context of | reversing PII that has been obscured by a one-way hash? My | IP number passed through SHA1 with a salt (a salt I | generated and stored safely on my end) is | 6FF6BA399B75F5698CEEDB2B1716C46D12C28DF5. Since this is all | that would be sent over the wire for analytics, this is the | only information an attacker will have available. | | The only thing you can brute force from that is _some_ IP | and _some salt_ such that SHA1(IP+Salt) = | 6FF6BA399B75F5698CEEDB2B1716C46D12C28DF5. But you'll find | millions of such IPs. Perhaps all possible IPs will work | with _some_ salt and give that hash. It's not revealing | my IP even if you manage to find a match. | infinityio wrote: | If you also explicitly mentioned the salt used (as Bear | appears to have done?), this just becomes a matter of | testing 4 billion options and seeing which matches | alkonaut wrote: | I think it's just unsalted in the example code. Or you | could argue that the date is kind of used as a salt. But | the point was that salting + hashing is fine for PII in | telemetry if and only if the salt stays on the client. It | might be difficult to do without JS though.
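The brute-force point running through this subthread can be made concrete with a minimal sketch (hypothetical Python, not Bear's actual code; the documentation-range IP, the scanned /16, and the date are all made up): an unsalted hash(ip + date) is undone by plain enumeration, since the date is public and the IPv4 space is tiny by hashing standards.

```python
import hashlib

def daily_visitor_id(ip: str, day: str) -> str:
    # The scheme under discussion: an unsalted hash of IP + date.
    return hashlib.sha256((ip + day).encode()).hexdigest()

def recover_ip(target: str, day: str):
    # An attacker who knows the date (it isn't secret) just enumerates
    # candidate addresses; full IPv4 is only ~4.3 billion hashes.
    # A single /16 is scanned here to keep the demo fast.
    for a in range(256):
        for b in range(256):
            ip = f"203.0.{a}.{b}"
            if daily_visitor_id(ip, day) == target:
                return ip
    return None

day = "2023-11-01"
h = daily_visitor_id("203.0.113.7", day)  # looks anonymous...
print(recover_ip(h, day))                 # ...but enumeration gives the IP back
```

This is also why tptacek points at password KDFs: swapping `hashlib.sha256` for something deliberately slow like `hashlib.scrypt` raises the per-guess cost, though against a 32-bit input space even that only buys so much.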
| michaelmior wrote: | > Salting a standard cryptographic hash (like SHA2) doesn't | do anything meaningful to slow a brute force attack. | | Sure, but it does at least prevent the use of rainbow | tables. Arguably not relevant in this scenario, but it | doesn't mean that salting does nothing. Rainbow tables can | speed up attacks by many orders of magnitude. Salting may | not prevent each individual password from being brute | forced, but for most attackers, it probably will prevent | your entire database from being compromised due to the | amount of computation required. | tptacek wrote: | Rainbow tables don't matter. If you're discussing the | strength of your scheme with or without rainbow tables, | you have already lost. | | https://news.ycombinator.com/item?id=38098188 | robertlagrant wrote: | That's just a link where you claim the same thing. What's | your actual rationale? Do you think salting is pointless? | dspillett wrote: | _> Cryptographic hashes are designed to be fast._ | | Not _really_. They are designed to be fast _enough_ and even | then only as a secondary priority. | | _> You can do 6 billion ... hashes /second on [commodity | hardware] ... there's only 4 billion ipv4 addresses. So you can | brute force the entire range_ | | This is harder if you use a salt not known to the attacker. | Per-entry salts can help even more, though that isn't relevant | to IPv4 addresses in a web/app analytics context because after | the attempt at anonymisation you want to still be able to tell | that two addresses were the same. | | _> And that's true even if they used something secure like | SHA-256 instead of broken MD5_ | | Relying purely on the computation complexity of one hash | operation, even one not yet broken, is not safe given how easy | temporary access to mass CPU/GPU power is these days. This can | be mitigated somewhat by running many rounds of the hash with a | non-global salt - which is what good key derivation processes | do for instance. 
Of course you need to increase the number of | rounds over time to keep up with the rate of growth in | processing availability, to keep undoing your hash more hassle | than it is worth. | | But yeah, a single unsalted hash (or a hash with a salt the | attacker knows) on an IP address is not going to stop anyone who | wants to work out what that address is. | krsdcbl wrote: | Don't forget that md5 is comparatively slow & there are far | faster options for hashing nowadays: | | https://jolynch.github.io/posts/use_fast_data_algorithms/ | SAI_Peregrinus wrote: | A "salt not known to the attacker" is a "key" to a keyed hash | function or message authentication code. A salt isn't a | secret, though it's not usually published openly. | marcosdumay wrote: | > only as a secondary priority | | That's not a reasonable way to say it. It's literally the | second priority, and heavily weighed when deciding which | algorithms to adopt. | | > This is harder if you use a salt not known to the attacker. | | The "attacker" here is the server owner. So if you use a | random salt and throw it away, you are good; anything | resembling the way people use salts in practice is not fine. | HermanMartinus wrote: | Author here. I commented down below, but it's probably more | relevant in this thread. | | For a bit of clarity around the IP address hashes: the only use | they have in this context is preventing duplicate hits in a day | (making each page view unique by default). At the end of each | day a worker job scrubs the IP hash, which is by then | irrelevant. | myfonj wrote: | Have you considered serving an actual small transparent image | with caching headers set to expire at midnight? | freitasm wrote: | "The only downside to this method is if there are multiple reads | from the same IP address but on separate devices, it will still | only be seen as one read. And I'm okay with that since it | constitutes such a minor fragment of traffic."
| | Many ISPs are now using CG-NAT, so this approach would miscount | thousands of visitors seemingly coming from a single IP address. | tmikaeld wrote: | Only if all of them use the exact same user agent | platform/browser. | | (It would be better if he used a hash of the raw user agent | string) | freitasm wrote: | UAs aren't unique these days. | colesantiago wrote: | How would one block this from tracking you? | | I think we would either need to send fake data to these analytics | tools deliberately, like https://adnauseam.io/ | | Or now include CSS as a spy tracker that needs to be blocked. | Kiro wrote: | I don't see how this is more intrusive for privacy than what | you can already get from access logs. | colesantiago wrote: | It is still tracking you, so it needs to be blocked. | Kiro wrote: | So are access logs. How are you going to block those? | colesantiago wrote: | I never said anything about access logs, I specifically | mentioned this CSS trick that will become popular for ad | companies to track people. | | For this, one would need to block the endpoint or deliberately | send obfuscated data in protest. | | Should you _want_ to cover access logs and other forms of | tracking as well, then sending excessive, random obfuscated | data with AdNauseam would also help here. | | https://adnauseam.io/ | jokethrowaway wrote: | I sure hope you're being sarcastic here and illustrating | the ridiculousness of privacy extremists (who, btw, ruined | the web, thanks to a few idiot politicians in the EU). | | If not, what's wrong with a service knowing you're | accessing it? How can they serve a page without knowing | you're getting a page? | callalex wrote: | Ruined the web? It sure seems like the web still works | from my perspective. What has been ruined for you? | matrss wrote: | If it is not, then it must be unnecessary, since you could get | the same information from the access logs already.
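tmikaeld's suggestion above can be sketched as a hypothetical variant of the daily hash (not Bear's actual code; the addresses and user-agent strings are made up): folding the raw User-Agent into the key means two CG-NAT users behind one IP are conflated only when their UA strings also match. As freitasm notes, UAs aren't unique, so this reduces rather than eliminates the undercount.

```python
import hashlib

def daily_fingerprint(ip: str, user_agent: str, day: str) -> str:
    # Key on IP + raw UA + date instead of IP + date alone, so users
    # sharing a CG-NAT egress IP are split apart when their UAs differ.
    return hashlib.sha256(f"{ip}|{user_agent}|{day}".encode()).hexdigest()

shared_ip = "198.51.100.9"  # one CG-NAT egress address, many users behind it
a = daily_fingerprint(shared_ip, "Mozilla/5.0 (iPhone)", "2023-11-01")
b = daily_fingerprint(shared_ip, "Mozilla/5.0 (X11; Linux x86_64)", "2023-11-01")
print(a != b)  # distinct browsers behind one IP now count separately
```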
| its-summertime wrote: | ||/hit/*$image | | In your favorite ad blocker | meiraleal wrote: | Interesting approach, but what about mobile users? | welpo wrote: | From the article: | | > Now, when a person hovers their cursor over the page (or | scrolls on mobile) it triggers body:hover which calls the URL | for the post hit | cantSpellSober wrote: | It _doesn't_ do that though. | | > The :hover pseudo-class is problematic on touchscreens. | Depending on the browser, the :hover pseudo-class might never | match | | https://developer.mozilla.org/en-US/docs/Web/CSS/:hover | | Don't take my word for it. Trying it in mobile emulators will | have the same result. | rzmmm wrote: | > Now, when a person hovers their cursor over the page (or | scrolls on mobile)... | | I can imagine many cases where a real human user doesn't scroll | the page on a mobile platform. I like the CSS approach but I'm | not sure it's better than doing some bot filtering on the | server logs. | freitzzz wrote: | I attempted to do this back at the start of this year, but lost | motivation building the web UI. My trick is not CSS but simply | loading fake images with <img> tags: | | https://github.com/nolytics | openplatypus wrote: | The CSS tracker is as useful as server log-based analytics. If | that is the information you need, cool. | | But JS trackers are so much more. Time spent on the website, | scroll depth, screen sizes, some limited, compliant, and yet | useful unique sessions: those things cannot be achieved without | some (simple) JS. | | Server side, JS, CSS... no one size fits all. | | Wide Angle Analytics has strong privacy, DNT support, an opt-out | mechanism, EU cloud, compliance documentation, and full process | adherence. It employs non-reversible, short-lived sessions that | still give you good tracking. Combine it with a custom domain or | first-party API calls and you get near 100% data accuracy.
| croes wrote: | If it's a US company, then an EU cloud doesn't matter regarding | data protection for EU citizens. | | The Cloud Act rendered that worthless. | openplatypus wrote: | Wide Angle Analytics is a German company operating everything | on an EU cloud (EU owners, EU location). | EspressoGPT wrote: | You probably could even analyze screen sizes by doing the same | thing but with CSS media queries. | TekMol wrote: | The CSS tracker is as useful as server log-based analytics. | | It is not. Have you read the article? | | The whole point of the CSS approach is to weed out user agents | which do not trigger mouse hover events on the body. You can't | see that from server logs. | jokethrowaway wrote: | Lovely technique and probably more than adequate for most uses. | | My scraping bots use an instance of Chrome and therefore trigger | hover as well, but you'll cut out the less sophisticated bots. | | This is because of protection systems: if I try to scrape my | target website with just code I get insta-banned / "captcha'd". | jerbear4328 wrote: | Are you sure? Even if you run a headless browser, you might not | be triggering the hover event, unless you specifically tell it | to or your framework simulates a virtual mouse that triggers | mouse events and CSS. | | You totally could be triggering it, but not every bot will, | even the fancy ones. | fatih-erikli wrote: | This has been known as a "pixel tracker" for decades. | cantSpellSober wrote: | Used in emails as well. Loading a 1x1 transparent <img> is a | more sure thing than triggering a hover event, but ad-blockers | often block those | t0astbread wrote: | Occasionally I've seen people fail and add the pixel as an | attachment instead. | blacksmith_tb wrote: | True, though doing it in CSS does have a couple of interesting | aspects; using :hover would filter out bots that didn't use a | full-on webdriver (most bots, that is).
I would think that | using an @import with 'supports' for an empty-ish .css file | would be better in some ways (since adblockers are awfully good | at spotting 1px transparent tracking pixels, but less likely to | block .css files to avoid breaking layouts), but that wouldn't | have the clever :hover benefits. | chrismorgan wrote: | I'd like to see a comparison of the server log information with | the hit endpoint information: my feeling is that the reasons for | separating it don't really hold water, and that the initial | request server logs could fairly easily be filtered to acceptable | quality levels, obviating the subsequent request. | | The basic server logs include declared bots, undeclared bots | pretending to use browsers, undeclared bots actually using | browsers, and humans. | | The hit endpoint logs will exclude almost all declared bots, | almost all undeclared bots pretending to use browsers, and some | humans, but will retain a few undeclared bots that search for and | load subresources, and almost all humans. About undeclared bots | that actually use browsers, I'm uncertain as I haven't inspected | how they are typically driven and what their initial mouse cursor | state is: if it's placed within the document it'll trigger, but | if it's not controlled it'll probably be outside the document. ( | _Edit:_ actually, I hadn't considered that bearblog caps the body | element's width and uses margin, so if the mouse cursor is not in | the main column it won't trigger. My feeling is that this will | get rid of almost all undeclared bots using browsers, but | _significantly_ undercount users with large screens.) | | But my experience is that reasonably simple heuristics do a | pretty good job of filtering out the bots the hit endpoint also | excludes. | | * Declared bots: the filtration technique can be ported as-is. 
| | * Undeclared bots pretending to use browsers: that's a | behavioural matter, but when I did a _little_ probing of this | some years ago, I found that a great many of them were using | unrealistic user-agent strings, either visibly wonky or | impossible or just corresponding to browsers more than a year old | (which almost no real users are using). I suspect you could get | rid of the vast majority of them reasonably easily, though it | might require occasional maintenance (you could do things like | estimate the browser's age based on their current version number | and release cadence, with the caveat that it may slowly drift and | should be checked every few years) and will certainly exclude a | very few humans. | | * Undeclared bots actually using browsers: this depends on the | unknown I declared, whether they position their mice in the | document area. But my suspicion is that these simply aren't worth | worrying about because they're not enough to notably skew things. | Actually using browsers is _expensive_ , people avoid it where | possible. | | And on the matter of humans, it's worth clarifying that the hit | endpoint is _worse_ in some ways, and honestly quite risky: | | * Some humans will use environments that _can't_ trigger the | extra hit request (e.g. text-mode browsers, or using some service | that fetches and presents content in a different way); | | * Some humans will behave in ways that _don't_ trigger the extra | hit request (e.g. keyboard-only with no mouse movement, or | loading then going offline); | | * Some humans will block the extra hit request; and if you upset | the wrong people or potentially even become too popular, it'll | make its way into a popular content blocker list and significant | fractions of your human base will block it. _This_ , in my | opinion, is the biggest risk. | | * There's also the risk that at some point browsers might | prefetch such resources to minimise the privacy leak. 
(Some email | clients have done this at times, and browsers have wrestled from | time to time with related privacy leaks, which have led to the | hobbling of what properties :visited can affect, and other | mitigations of clickjacking. I think it _conceivable_ that such a | thing could be changed, though I doubt it will happen and there | would be plenty of notice if it ever did.) | | But there's a deeper question to it: _if_ you don't exclude some | bots; or _if_ the URL pattern gets on a popular content filter | list: does it matter? Does it skew the ratios of your results | significantly? (Absolute numbers have never been particularly | meaningful or comparable between services or sources: you can | only meaningfully compare numbers from within a source.) My | feeling is that after filtering out most of the bots in fairly | straightforward ways, the data that remains is likely to be of | similar enough quality to the hit endpoint technique: both will | be overcounting in some areas and undercounting in others, but I | expect both to be Good Enough, at which point I prefer the | simplicity of not having a separate endpoint. | | (I think I've presented a fairly balanced view of the facts and | the risks of both approaches, and invite correction in any point. | Understand also that I've never tried doing this kind of analysis | in any _detail_ , and what examination and such I have done was | almost all 5-8 years ago, so there's a distinct possibility that | my feelings are just way off base.) | p4bl0 wrote: | I have a genuine question that I fear might be interpreted as a | dismissive opinion but I'm actually interested in the answer: | what's the goal of collecting analytics data in the case of | personal blogs in a non-commercial context such as what Bearblog | seems to be? | Veen wrote: | Curiosity? I like to know if anyone is reading what I write. | It's also useful to know what people are interested in. 
Even | personal bloggers may want to tailor content to their audience. | It's good to know that 500 people have read an article about | one topic, but only 3 people read one about a different topic. | mrweasel wrote: | For the curiosity, one solution I've been pondering, but | never gotten around to implementing, is just logging the | country of origin for a request, rather than the entire IP. | | IPs are useful in case of attack, but you could limit | yourself to simply logging subnets. It's a little more | aggressive to block a subnet, or an entire ISP, but it seems | like a good tradeoff. | taurusnoises wrote: | I can speak to this from the writer's perspective as someone | who has been actively blogging since c. 2000 and has been | consistently (very) interested in my "stats" the entire time. | | The primary reason I care about analytics is to see if posts | are getting read, which on the surface (and in some ways) is | for reasons of vanity, but is actually about writer-reader | engagement. I'm genuinely interested in what my readers | resonate with, because I want to give them more of that. The | "that" could be topic, tone, length, who knows. It helps me | hone my material specifically for my readers. Ultimately, I | could write about a dozen different things in two dozen | different ways. Obviously, I do what I like, but I refine it to | resonate with my audience. | | In this sense, analytics are kind of a way for me to get to | know my audience. With blogs that had high engagement, | analytics gave me a sort of fuzzy character description of who | my readers were. As with above, I got to see what they liked, | but also when they liked it. Were they reading first thing in | the morning? Were they lunchtime readers? Were they late at | night readers? This helped me choose (or feel better about) | posting at certain times. Of course, all of this was fuzzy | intel, but I found it really helped me engage with my | readership more actively.
| hennell wrote: | Feedback loops. Contrary to what a lot of people seem to think, | analytics is not just about advertising or selling data, it's | about analysing site and content performance. Sure, that can be | used (and abused) for advertising, but it's also essential if | you want any feedback about what you're doing. | | You might get no monetary value from having 12 people read the | site or 12,000, but from a personal perspective it's nice to | know what people want to read about from you, so you can | feel like the time you spent writing was well spent, and | adjust, if you wish, toward things that are more popular. | colesantiago wrote: | If you want to send obfuscated data on purpose to prevent this | dark-pattern behaviour from spreading, I recommend AdNauseam. | | (not the creator, just a regular user of this great tool) | | We need more tools that send random, fake data to analytics | providers, which renders the analytics useless to them, in | protest of tracking. | | If there are any more like AdNauseam, I would love to know. | | https://adnauseam.io/ | myfonj wrote: | Seems clever and all, but `body:hover` will most probably | completely miss all "keyboard-only" users and users with user | agents (assistive technologies) that do not use pointer devices. | | Yes, these are marginal groups perhaps, but it is always a super | bad sign to see them excluded in any way. | | I am not sure (I doubt it) there is a 100% reliable way to detect | that "a real user is reading this article (and issue an HTTP | request)" from baseline CSS in every single user agent out there | (some of them might not support CSS at all, after all, or have | loading of any kind of decorative images from CSS disabled).
| | There are modern selectors that could help, like | :root:focus-within (requiring that the user actually focus | something interactive there, which again is not guaranteed to | trigger such a selector in all agents), and/or bleeding-edge | scroll-linked animations (`@scroll-timeline`). But again, braille | readers will probably remain left out. | demondemidi wrote: | Keyboard only users? All 10 of them? ;) | bayindirh wrote: | Well, with me, it's probably 11. | | Joking aside, I love to read websites with keyboards, esp. if | I'm reading blogs. So it's possible that sometimes my | pointer is out there somewhere to prevent distraction. | myfonj wrote: | I think there might be more than ten [1] blind folks using | computers out there, most of them not using pointing devices | at all, or not in a way that would produce "hover". | | [1] was it base ten, right? | zichy wrote: | Think about screen readers. | vivekd wrote: | I'm a keyboard user when on my computer (qutebrowser), but I | think your sentiments are correct: the number of keyboard-only | users is probably much, much smaller than the number of | people using Adblock. So OP's method is likely to produce | more accurate analytics than a JavaScript-only design. | | OP just thought of a creative, effective, and probably faster, | more code-efficient way to do analytics. I love it, thanks OP | for sharing it. | paulddraper wrote: | https://www.youtube.com/watch?v=lKie-vgUGdI | qingcharles wrote: | Marginal? Surely this affects 50%+ of user-agents, i.e. phones | and tablets which don't support :hover? (without a mouse being | plugged in) | myfonj wrote: | I think most mobile browsers emit the "hover" state whenever | you tap / drag / swipe over something in the page. The "active" | state is even more reliable IMO. But yes, you are right that it | is problematic. Quoting the MDN page about ":hover" [1]: | | > Note: The :hover pseudo-class is problematic on | touchscreens.
Depending on the browser, the :hover pseudo- | class might never match, match only for a moment after | touching an element, or continue to match even after the user | has stopped touching and until the user touches another | element. Web developers should make sure that content is | accessible on devices with limited or non-existent hovering | capabilities. | | [1] https://developer.mozilla.org/en-US/docs/Web/CSS/:hover | callalex wrote: | I really wish modern touchscreens spent the extra few cents | to support hover. Samsung devices from the ~2012 era all | supported detection of fingers hovering near the screen. I | suspect it's terrible patent laws holding back this | technology, like most technologies that aren't headline | features. | gizmo wrote: | > Even Fathom and Plausible analytics struggle with logging | activity on adblocked browsers. | | The simple solution is to respect the basic wishes of those who | do not want to be tracked. This is a "struggle" only because | website operators don't want to hear no. | reustle wrote: | As much I agree with respecting folks wishes to not be tracked, | most of these cases are not about "tracking". | | It's usually website hosts just wanting to know how many folks | are passing through. If a visitor doesn't even want to | contribute to incrementing a private visit counter by +1, then | maybe don't bother visiting. | gizmo wrote: | If it was just about a simple count the host could just `wc | -l access.log`. Clearly website hosts are not satisfied with | that, and so they ignore DO_NOT_TRACK and disrespectfully try | to circumvent privacy extensions. | jakelazaroff wrote: | Is there a meaningful difference between recording "this IP | address made a request on this date" and "this IP address | made a request on this date after hovering their cursor | over the page body"? How is your suggestion more acceptable | than what the blog describes? 
| gizmo wrote: | Going out of your way to specifically track people who | indicate they don't want to be tracked is worse. | vivekd wrote: | Google Cloud, AWS VPS, and many hosting services | collect and provide this info by default. Chances are | most websites do this, including the one you are using | now. HN does IP bans, meaning they must access visitor | IPs. | | Why aren't people starting their protest against the | website they're currently using, instead of at OP? | gizmo wrote: | We all know that tracking is ubiquitous on the web. This | blogpost however discusses technology that specifically | helps with tracking people who don't want to be tracked. | I responded that an alternative approach is to just not. | That's not hypocritical. | joshmanders wrote: | Again, you don't answer the question: what's the | difference between an image pixel or javascript logging | that you visited the site vs nginx/apache logging that you | visited the site? | | Being upset that OP used an image or javascript instead | of grepping `access.log` makes absolutely no sense. The | same data is shown there. | gizmo wrote: | It's rude to tell people how they feel, and it's rude to | assert a post makes "absolutely no sense" while at the | same time demanding a response. | | One difference is intent. When you build an analytics | system you have an obligation to let people opt out. | Access logs serve many legitimate purposes, and yes, they | can also be used to track people, but that is not why | access logs exist. This difference is also reflected in | law. Using access logs for security purposes is always | allowed, but using that same data for marketing purposes | may require an opt-in or disclosure. | jakelazaroff wrote: | My point is that your `wc -l access.log` solution will | _also_ track people who send the Do Not Track header | unless you specifically prevent it from doing so.
In | fact, you could implement the _exact same system_ | described in the blog post by replacing the Python code | run on every request with an aggregation of the access | log. So what is the pragmatic difference between the two? | gizmo wrote: | Even the GDPR makes this distinction. Access logs (with | IP addresses) are fine if you use them for technical | purposes (monitor for 404, 500 errors) but if you use | access logs for marketing purposes you need users to | opt-in, because IP addresses are considered PII by law. And | if you don't log IPs then you can't track uniques. | Tracking and logging are not the same thing. | arp242 wrote: | > If it was just about a simple count the host could just | `wc -l access.log` | | That doesn't really work because a huge amount of traffic | comes from 1) bots, 2) prefetches and other things that | shouldn't be counted, 3) the same person loading the page 5 | times, visiting every page on the site, etc. In short, these | numbers will be wildly wrong (and in my experience "how | wrong" can also differ quite a bit per site and over time, | depending on factors that are not very obvious). | | What people want is a simple break-down of useful things | like which entry pages are used, where people came from (as | in: "did my post get on the frontpage of HN?") | | I don't see how anyone's privacy or anything is violated | with that. You can object to that of course. You can also | object to people wearing a red shirt or a baseball cap. At | some point objections become unreasonable. | anonymouse008 wrote: | I don't know how I feel about this overall. I think we took | some rules from the physical world that we liked and | discarded others, so we've ended up with a cognitively | dissonant space. | | For example, if you walked into my coffee shop, I would be | able to lay eyes on you and count your visits for the week. | I could also observe where you sit and how long you stay.
If I were to | better serve you with these data points, by reserving your | table before you arrive with your order ready, you'd probably | welcome my attention to detail. However, if I were to see that | you pulled about x watts a month from my outlets, then suddenly | locked up the outlets for a fee, you'd rightfully wish to never | be observed again. | | So what I'm getting at is, the issues with tracking appear to | be with the perverse assholes vs. the benevolent shopkeeps of | the tracking. | | To wrap up this thought: what's happening now though is a | stalker is following us into every store, watching our every | move. In physical space, we'd have this person arrested and | assigned a restraining order with severe consequences. However, | instead of holding those creeps accountable, we've punished the | small businesses that just want to serve us. | | -- | | I don't know how I feel about this or really what to do. | croniev wrote: | The coffee shop reserving my place and having my order ready | before I arrive sounds nice - but is it not an unnecessary | luxury that I would not miss had I never even thought of its | possibility? I never asked for it, I was ready to stand in | line for my order, and the tracking of my behavior resulted | in a pleasant surprise, not a feature I was hoping for. If I | really wanted my order to be ready when I arrive, then I | would provide the information to you, not expect that you | observe me to figure it out. | | My point is that I don't get why the small businesses should | have the right to track me to offer me better services that I | never even asked for. Sure, it's nice, but it's not worth | deregulating tracking and allowing all the evil corps to | track me too. | joshmanders wrote: | Here's a better analogy using the coffee shop: | | You walk into your favorite coffee shop, order your | favorite coffee, every day. But for privacy reasons | the coffee shop owner is unaware of anything.
Doesn't even | track inventory, just orders whatever whenever. | | One day you walk in and now you can't get your favorite | coffee... Because the owner decided to remove that item | from the menu. You get mad, "Where's my favorite coffee?" | the barista says "owner removed it from menu" and you get | even more upset "Why? Don't you know I come in here every | day and order the same thing?!" | | Nope, because you don't want any amount of tracking | whatsoever, knowing any type of habits from visitors is | wrong! | | But in this scenario you deem the owner knowing that you | order that coffee every day ensures that it never leaves | the menu, so you actually do like tracking. | majewsky wrote: | The coffee shop analogy falls apart after a few seconds | because tracking in real life does not scale the same way | that tracking in the digital space scales. If you wanted to | track movements in a coffee shop as detailed as you can on | websites or applications with heavy tracking, you would need | to have a dozen people with clipboards strewn about the | place, at which point it would feel justifiably dystopian. | The only difference on a website is that the | clipboard-bearing surveillers are not as readily apparent. | lcnPylGDnU4H9OF wrote: | > you would need to have a dozen people with clipboards | strewn about the place | | Assuming you live in the US, next time you're in a grocery | store, count how many cameras you can spot. Then consider: | these cameras could possibly have facial recognition | software; these cameras could possibly have object | recognition software; these cameras could possibly have | software that tracks eye movements to see where people are | looking. | | Then wonder: do they have cameras in the parking lot? Maybe | those cameras can read license plates to know which | vehicles are coming and going.
Any time I see any sort of | news about information that can be retrieved from a photo, | I assume that it will be done by many cameras at >1 Hertz | in a handful of years. | arp242 wrote: | I don't think it's very important that people "can" do | this; the only thing that matters is if they actually "are" | doing it. | al_borland wrote: | I think that's the point. It's the level of detail of | tracking online that's the problem. If a website just wants | to know someone showed up, that's one thing. If a site | wants to know that I specifically showed up, and dig in to | find out who I specifically am, and what I'm into so they | can target me... that's too much. | | Current website tracking is like the coffee shop owner | hiring a private investigator to dig into the personal lives | of everyone who walks in the door so they can suggest the | right coffee and custom cup without having to ask. They | could not do that and just let someone pick their own | cup... or give them a generic one. I'd like that better. If | clipboards in coffee shops are dystopian, so is current web | tracking, and we should feel the same about it. | | I think Bear strikes a good balance. It lets authors know | someone is reading, but it's not keeping profiles on users | to target them with manipulative advertising or some kind | of curated reading list. | OhMeadhbh wrote: | I have, unfortunately, become cynical in my old age. Don't take | this the wrong way, but... | | <cynical_statement> The purpose of the web is to distribute | ads. The "struggle" is with people who think we made this | infrastructure so you could share recipes with your | grandmother. </cynical_statement> | gizmo wrote: | No matter how bad the web gets, it can still get worse. | Things can always get worse. That's why I'm not a cynic. Even | when the fight is hopeless --and I don't believe it is-- | delaying the inevitable is still worthwhile.
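Stepping back from the analogy debate, the mechanism the thread is discussing is simple to state: the page's stylesheet requests a resource only on human interaction, and the server counts those requests. Here is a minimal Python sketch of the hit-recording side; the `/hit/` endpoint idea, the in-memory storage, and all names are illustrative assumptions, not Bear's actual implementation (the author describes hashed IPs that a daily worker clears).

```python
import hashlib
from datetime import date

# Sketch of the server side of CSS-triggered analytics. The page's CSS
# would contain something like
#     body:hover { border-image: url("/hit/my-post"); }
# so the browser requests /hit/my-post only when a human hovers or taps.
# Storage here is in-memory and illustrative only.
seen_today = set()  # flushed daily, per the author's description
hit_counts = {}     # slug -> total unique daily hits

def record_hit(slug, ip):
    """Count at most one hit per (IP, slug) per day, storing only a hash."""
    key = hashlib.sha256(f"{ip}:{slug}:{date.today()}".encode()).hexdigest()
    if key in seen_today:
        return False  # same visitor already counted today
    seen_today.add(key)
    hit_counts[slug] = hit_counts.get(slug, 0) + 1
    return True

record_hit("my-post", "203.0.113.7")
record_hit("my-post", "203.0.113.7")  # duplicate view today: ignored
record_hit("my-post", "198.51.100.2")
print(hit_counts)  # {'my-post': 2}
```

As commenters note below, keying on IP undercounts shared addresses (VPNs, CGNAT, corporate networks), which is the trade-off of this dedup scheme.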
| al_borland wrote: | The infrastructure was put in place for people to freely | share information. The ads only came once people started | spending their time online and that's where the eyeballs | were. | | The interstate highway system in the US wasn't built with the | intent of advertising to people, it was to move people and | goods around (and maybe provide a means to move around the | military on the ground when needed). Once there were a lot of | eyes flying down the interstate, the billboard was used to | advertise to those people. | | The same thing happened with the newspaper, magazines, radio, | TV, and YouTube. The technology comes first and the ads come | with popularity and as a means to keep it low cost. We're | seeing that now with Netflix as well. I'm actually a little | surprised that books don't come with ads throughout them... | maybe the longevity of the medium makes ads impractical. | digging wrote: | Hm, I don't think that's the _purpose_ of the web. It's just | the most common use case. | fdaslkjlkjklj wrote: | Looks like a clever way to do analytics. Would be neat to see how | it compares with just munging the server logs since you're only | looking at page views basically. | | re the hashing issue, it looks interesting but adding more | entropy with other client headers and using a stronger hash algo | should be fine. | HermanMartinus wrote: | Hey, author here. For a bit of clarity around IP address | hashes: the only use they have in this context is preventing | duplicate hits in a day (making each page view unique by | default). At the end of each day there is a worker job that | empties them out while retaining the hit info. | | I've added an edit to the essay for clarity. | dantiberian wrote: | You should add this as a reply to the top comment as well. | bosch_mind wrote: | If 10 users share an IP on a shared VPN around the globe and | hit your site, you only count that as 1? What about corporate | networks, etc?
IP is a bad indicator | Culonavirus wrote: | It's a bad indicator especially since people who would | otherwise not use VPN apparently started using this: | https://support.apple.com/en-us/102602 | Galanwe wrote: | Not even mentioning CGNAT. | fishtoaster wrote: | The idea of using CSS-triggered requests for analytics was really | cool to me when I first encountered it. | | One guy on twitter (no longer available) used it for mouse | tracking: overlay an invisible grid of squares on the page, each | with a unique background image triggered on hover. Each | background image sends a specific request to the server, which | interprets it! | | For fun one summer, I extended that idea to create a JS-free "css | only async web chat": https://github.com/kkuchta/css-only-chat | joewils wrote: | This is really clever and fun. I'm curious how the results | compare to something "old school" like AWStats? | | https://github.com/eldy/awstats | stmblast wrote: | Bearblog is AWESOME!! | | I use it for my personal stuff and it's great. No hassle, just | paste your markdown in and you're good to go. ___________________________________________________________________ (page generated 2023-11-01 23:00 UTC)