[HN Gopher] How Bear does analytics with CSS
       ___________________________________________________________________
        
       How Bear does analytics with CSS
        
       Author : todsacerdoti
       Score  : 292 points
       Date   : 2023-11-01 08:08 UTC (14 hours ago)
        
 (HTM) web link (herman.bearblog.dev)
 (TXT) w3m dump (herman.bearblog.dev)
        
       | user20231101 wrote:
       | Smart approach!
        
       | nannal wrote:
       | > And not just the bad ones, like Google Analytics. Even Fathom
       | and Plausible analytics struggle with logging activity on
       | adblocked browsers.
       | 
        | I believe that's because they're trying to live in what
        | amounts to a toxic wasteland. Users like us are done with the
        | whole concept, and as such I assume that if CSS analytics
        | becomes popular, attempts will be made to bypass that too.
        
         | account-5 wrote:
          | Reminds me of uMatrix, which could block the loading of CSS
          | too.
        
           | momentary wrote:
            | Is uMatrix not in vogue any more? It's still my go-to tool!
        
             | account-5 wrote:
              | It's not actively developed anymore, so I've been using
              | uBlock's advanced options, which are good but not as good
              | as uMatrix was.
        
           | its-summertime wrote:
           | ||somesite.example^$css
           | 
           | would work in ublock
        
             | account-5 wrote:
              | I didn't know this. But with uMatrix you could block by
              | default on all websites and then whitelist those you
              | wanted it for. At least that's the way I used it and
              | uBlock's advanced user features.
        
         | berkes wrote:
         | Why?
         | 
          | I manually unblocked Piwik/Matomo, Plausible, and Fathom in
          | uBlock. I don't see any harm in what and how these track. And
          | they do give the people behind the site valuable information
          | "to improve the service".
         | 
          | e.g. Plausible collects less information on me than the
          | common nginx or Apache logs do. For me, as a blogger, it's
          | important to see when a post gets on HN or is linked from
          | somewhere, and what kinds of content are valued and which are
          | ignored. So that I can blog about stuff you actually want to
          | read and spread it through channels so that you are actually
          | aware of it.
        
           | morelisp wrote:
           | You're just saying a smaller-scale version of "as a publisher
           | it's important for me to collect data on my audience to
           | optimize my advertising revenue." The adtech companies take
           | the shit for being the visible 10% but publishers are
           | consistently the ones pressuring for more collection.
        
             | ordersofmag wrote:
             | I'm a website 'publisher' for a non-profit that has zero
             | advertising on our site. Our entire purpose for collecting
             | analytics is to make the site work better for our users.
             | Really. Folks like us may not be in the majority but it's
             | worth keeping in mind that "analytics = ad revenue
             | optimization" is over-generalizing.
        
               | morelisp wrote:
               | I'm sure your stated 13 years of data is absolutely
               | critical to optimize your page load times.
        
               | majewsky wrote:
               | Can you give some examples of changes that you made
               | specifically to make the site work better for users, and
               | how those were guided by analytics? I usually just do
               | user interviews because building analytics feels like
               | summoning a compliance nightmare for little actual
               | impact.
        
               | arp242 wrote:
               | I've decided to either stop working or keep working on
               | some things based on the fact that I did or didn't get
               | any traffic for it. I've become aware some pages were
               | linked on Hacker News, Lobsters, or other sites, and
               | reading the discussion I've been able to improve some
               | things in the article.
               | 
               | And also just knowing some people read what you write is
               | nice. There is nothing wrong with having some validation
               | (as long as you don't obsess over it) and it's a basic
               | human need.
               | 
               | This is just for a blog; for a product knowing "how many
               | people actually use this?" is useful. I suspect that for
               | some things the number is literally 0, but it can be hard
               | to know for sure.
               | 
               | User interviews are great, but it's time-consuming to do
               | well and especially for small teams this is not always
               | doable. It's also hard to catch things that are useful
                | for just a small fraction of your users. i.e. "it's
                | useful for 5%" means you need to do a lot of user
                | interviews ( _and_ hope they don't forget to mention
                | it!)
        
             | HuwFulcher wrote:
              | How horrifying that someone who potentially earns their
              | income from writing would seek to protect that revenue
              | stream.
             | 
             | Services like Plausible give you the bare minimum to
             | understand what is viewed most. If you have a website that
             | you want people to visit then it's a pretty basic
             | requirement that you'll want to see what people are
             | interested in.
             | 
             | When you start "personalising" the experience based on some
             | tracking that's when it becomes a problem.
        
               | peoplefromibiza wrote:
               | > a pretty basic requirement that you'll want to see what
               | people are interested in.
               | 
               | not really
               | 
               | it should be what you are competent and proficient at
               | 
               | people will come because they like what you do, not
               | because you do the things they like (sounds like the same
               | thing, but it isn't)
               | 
               | there are many proxies to know what they like if you want
               | to plan what to publish and when and for how long,
               | website visits are one of the less interesting.
               | 
               | a lot of websites such as this one get a lot of visits
               | that drive no revenue at all.
               | 
                | OTOH there are websites that receive a small number of
                | visits but make revenue based on the number of people
                | subscribing to the content (the textbook example is OF,
                | where people can get from a handful of subscribers what
                | others earn from hundreds of thousands of views on YT
                | or the like)
               | 
               | so basically monitoring your revenues works better than
               | constantly optimizing for views, in the latter case you
               | are optimizing for the wrong thing
               | 
                | I know a lot of people who sell online that do not use
                | analytics at all, except for coarse-grained ones like
                | number of subscriptions, number of items sold, how many
                | emails they receive about something they published, or
                | messages from social platforms, etc.
               | 
                | that's been true in my experience through almost 30
                | years of interacting with and helping publish creative
                | content online and offline (books, records, etc)
        
               | HuwFulcher wrote:
               | > people will come because they like what you do, not
               | because you do the things they like (sounds like the same
               | thing, but it isn't)
               | 
               | This isn't true for all channels. The current state of
               | search requires you to adapt your content to what people
               | are looking for. Social channels are as you've said.
               | 
               | It doesn't matter how you want to slice it. Understanding
               | how many people are coming to your website, from where
               | and what they're looking at is valuable.
               | 
               | I agree the "end metric" is whatever actually drives the
               | revenue. But number of people coming to a website can
               | help tune that.
        
               | cpill wrote:
                | emails received or messages on social media are just
                | another analytic, filling the same need as knowing page
                | hits. and somehow these people are analytics junkies
                | all the same, just not mainlining page hits. you're
                | unconvincing in the argument that "analytics are not
                | needed"
        
         | marban wrote:
         | Plausible still works if you reverse-proxy the script and the
         | event url through your own /randompath.
        
         | chrismorgan wrote:
         | This approach is no harder to block than the JavaScript
         | approaches: you're just blocking requests to certain URL
         | patterns.
        
           | nannal wrote:
           | That approach would work until analytics gets mixed in with
           | actual styles and then you're trying to use a website without
           | CSS.
        
             | chrismorgan wrote:
              | You're blocking the _image_, not the CSS. Here's a rule
              | to catch it at present:
              | 
              |     ||bearblog.dev/hit/
              | 
              | This is the shortest it can be written with certainty of
              | no false positives, but you can do things like making the
              | URL pattern more specific (e.g. _/hit/*/_), adding the
              | _image_ option (append _$image_), or just removing the
              | ||bearblog.dev domain filter if it spread to other
              | domains as well (there probably aren't enough false
              | positives to worry about).
             | 
              | I find it also worth noting that _all_ of these
              | techniques are pretty easily circumventable by technical
              | means, by blending content and tracking/ads/whatever. In
              | case of all-out war, content blockers _will_ lose. It's
              | just that no one has seen fit to escalate that far (and
              | in some cases there are legal limitations, potentially on
              | both sides of the fight).
        
               | macNchz wrote:
               | > In case of all-out war, content blockers will lose.
               | It's just that no one has seen fit to escalate that far
               | (and in some cases there are legal limitations,
               | potentially on both sides of the fight).
               | 
               | The Chrome Manifest v3 and Web Environment Integrity
               | proposals are arguably some of the clearest steps in that
               | direction, a long term strategy being slow-played to
               | limit pushback.
        
         | ben_w wrote:
         | The bit of the web that feels to me like a toxic wasteland is
         | all the adverts; the tracking is a much more subtle issue,
         | where the damage is the long-term potential of having a digital
         | twin that can be experimented on to find how best to manipulate
         | me.
         | 
         | I'm not sure how many people actually fear that. Might get
         | responses from "yes, and it's creepy" to "don't be daft that's
         | just SciFi".
        
         | input_sh wrote:
         | Nothing's gonna block your webserver's access.log fed into an
         | analytics service.
         | 
          | If anything, you're gonna get numbers that are inflated,
          | because it's nigh impossible to dismiss all of the bot
          | traffic just by looking at user agents.
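          | 
          | For what it's worth, the "fed into an analytics service"
          | part can be a few lines of scripting. A minimal sketch in
          | Python (combined log format assumed; it only skips bots that
          | honestly say so in their user agent, so counts stay
          | inflated):
          | 
          |     import re
          |     from collections import Counter
          |     
          |     hits = Counter()
          |     line_re = re.compile(
          |         r'"(?:GET|POST) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')
          |     with open("access.log") as f:
          |         for line in f:
          |             m = line_re.search(line)
          |             # skip only self-declared bots
          |             if m and "bot" not in m.group(2).lower():
          |                 hits[m.group(1)] += 1
          |     print(hits.most_common(10))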
        
       | victorbjorklund wrote:
        | This does make sense! Might try it for my own analytics
        | solution. Can anyone think of a downside of this vs JS?
        
         | berkes wrote:
         | I can think of many "downsides" but whether those matter or are
         | actually upsides really depends on your use-case and
         | perspective.
         | 
          | * You cannot (easily) track interaction events (esp. relevant
          | for SPAs, but also things like "user highlighted X" or "user
          | typed Y, then backspaced, then typed Z")
         | 
         | * You cannot track timings between events (e.g. how long a user
         | is on the page)
         | 
         | * You cannot track data such as screen-sizes, agents, etc.
         | 
         | * You cannot track errors and exceptions.
        
       | Wouter33 wrote:
        | Nice implementation! Just a heads-up: hashing the IP like that
        | is still considered tracking under GDPR and requires a privacy
        | banner in the EU.
        
         | thih9 wrote:
         | Can you explain why or link a source? I'd like to learn the
         | details.
        
           | fizzbuzz-rs wrote:
           | Likely because the hash of an IP can easily be reversed as
           | there are only ~2^32 IPv4 addresses.
        
             | openplatypus wrote:
              | It is not just that. If you have the user's IP and know
              | the hashing approach, you can re-identify past sessions.
        
             | thih9 wrote:
              | What if my hashing function has a high likelihood of
              | collisions?
        
               | firtoz wrote:
               | Then you cannot trust the analytics
        
               | thih9 wrote:
               | Do you trust analytics that doesn't use JS? Or relies on
               | mobile users to scroll the page before counting a hit?
               | 
               | It's all a heuristic and even with high collision
               | hashing, analytics would provide some additional insight.
        
               | rjmunro wrote:
               | You can estimate the actual numbers based on the
               | collision rate.
               | 
               | Analytics is not about absolute accuracy, it's about
               | measuring differences; things like which pages are most
               | popular, did traffic grow when you ran a PR campaign etc.
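                | 
                | A sketch of that correction, assuming the hash maps
                | visitors uniformly into H buckets: invert the expected
                | occupancy d = H * (1 - (1 - 1/H)**n) to estimate the
                | true count n from d observed distinct hashes.
                | 
                |     import math
                |     
                |     def estimate_visitors(d: int, H: int) -> float:
                |         # breaks down as d approaches H (log of ~0)
                |         return math.log(1 - d / H) / math.log(1 - 1 / H)
                |     
                |     print(estimate_visitors(900, 1024))  # ~2160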
        
           | dsies wrote:
           | https://gdpr-info.eu/art-4-gdpr/ paragraph 1:
           | 
           | > 'personal data' means any information relating to an
           | identified or identifiable natural person ('data subject');
           | an identifiable natural person is one who can be identified,
           | directly or indirectly, in particular by reference to an
           | identifier such as a name, an identification number, location
           | data, an online identifier or to one or more factors specific
           | to the physical, physiological, genetic, mental, economic,
           | cultural or social identity of that natural person;
        
             | thih9 wrote:
             | This does not reference hashing, which can be an
             | irreversible and destructive operation. As such, it can
             | remove the "relating" part - i.e. you'll no longer be able
             | to use the information to relate it to an identifiable
             | natural person.
             | 
              | In this context, if I define a hashing function that e.g.
              | sums all IP address octets, what then?
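              | 
              | To make the octet-sum idea concrete (a sketch): all 2^32
              | IPv4 addresses get squeezed into just 1021 possible sums
              | (0..1020), so any one sum matches millions of addresses
              | and reversal is hopeless.
              | 
              |     def octet_sum(ip: str) -> int:
              |         return sum(int(part) for part in ip.split("."))
              |     
              |     print(octet_sum("192.0.2.1"))  # 195
              |     print(octet_sum("1.192.0.2"))  # 195 -- a collision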
        
               | jvdvegt wrote:
                | A hash (whether MD5 or some SHA) of an IPv4 address is
                | easily reversed.
                | 
                | Summing octets is non-reversible, so it seems like a
                | good 'hash' to me (but note: you'll get a lot of
                | collisions). And of course, IANAL.
        
               | dsies wrote:
               | I was answering your request for a source.
               | 
                | The linked article talks about _identification numbers_
                | that can be used to identify a person. I am not a
                | lawyer, but the article specifically refers to one
                | person.
               | 
               | By that logic, if the hash you generate cannot be linked
               | to exactly one, specific person/request - you're in the
               | clear. I think ;)
        
         | openplatypus wrote:
          | Correct. This is a flawed hashing implementation, as it
          | allows for re-identification.
          | 
          | If you have the IP and the user's timezone, you can generate
          | the same hash and trace the user back. This is hardly
          | anonymous hashing.
          | 
          | Wide Angle Analytics adds a daily, transient salt to each IP
          | hash, which is never logged, thus generating a truly
          | anonymous hash that prevents re-identification.
        
           | thih9 wrote:
            | What if my hashing function is really destructive and has a
            | high likelihood of collisions?
        
             | hk__2 wrote:
             | > What if my hashing function is really destructive and has
             | high likelihood of collisions?
             | 
             | If it's so destructive that it's impossible to track users,
             | it's useless for you. If not, you need a privacy banner.
        
               | thih9 wrote:
               | A high collision hash would be useful for me on my low
               | traffic page and I'd enjoy not having to display a cookie
               | banner.
               | 
               | Also: https://news.ycombinator.com/item?id=38096235
        
         | victorbjorklund wrote:
         | Probably should be "salted hashes might be considered PII". It
         | has not be tried by the EU court and the law is not 100% clear.
         | It might be. It might not be.
        
         | e38383 wrote:
          | If the data gets stored in this way (hash of IP[0]) for a
          | long time, I'm with you. But if you only store the data for
          | 24 hours, it might still count as temporary storage and
          | should be "anonymized" enough.
          | 
          | IMO (and I'm not a lawyer): if you store ip+site for 24 hours
          | and after that only store "region" (maybe country or state)
          | and site, this should be GDPR compliant.
         | 
         | [0] it should use sha256 or similar and not md5
        
         | donohoe wrote:
          | Actually no. It's very likely this is fine. Context is
          | important.
          | 
          | Not a lawyer, but I discussed this previously with lawyers
          | when building a GDPR framework a while back.
        
           | sleepyhead wrote:
           | Context is irrelevant. What is relevant is whether a value,
           | for example a hash, can be identified to a specific person in
           | some way.
        
             | donohoe wrote:
             | I'm really not going to argue here.
             | 
             | I've been told this directly by lawyers who specialize in
             | GDPR and CCPA etc. I will take their word over yours.
             | 
             | If you are a lawyer with direct expertise in this area then
             | I'm willing to listen.
        
       | mcny wrote:
       | On the topic of analytics, how do you store them?
       | 
       | Let's say I have an e-commerce website, with products I want to
       | sell. In addition to analytics, I decide to log a select few
       | actions myself such as visits to product detail page while logged
       | in. So I want to store things like user id, product id,
       | timestamp, etc.
       | 
       | How do I actually store this? My naive approach is to stick it in
       | a table. The DBA yelled at me and asked how long I need data. I
       | said at least a month. They said ok and I think they moved all
       | older data to a different table (set up a job for it?)
       | 
       | How do real people store these logs? How long do you keep them?
        
         | ludwigvan wrote:
         | ClickHouse
        
         | jon-wood wrote:
         | Unless you're at huge volume you can totally do this in a
         | Postgres table. Even if you are you can partition that table by
         | date (or whatever other attributes make sense) so that you
         | don't have to deal with massive indexes.
         | 
         | I once did this, and we didn't need to even think about
         | partitioning until we hit a billion rows or so. (But partition
         | sooner than that, it wasn't a pleasant experience)
        
         | n_e wrote:
         | An analytics database is better (clickhouse, bigquery...).
         | 
         | They can do aggregations much faster and can deal with
         | sparse/many columns (the "paid" event has an "amount"
         | attribute, the "page_view" event has an "url" attribute...)
        
         | ordersofmag wrote:
         | We've got 13 years worth of data stored in mysql (5 million
         | visitor/year). It's a pain to query there so we keep a copy in
         | clickhouse as well (which is a joy to query).
        
           | mcny wrote:
           | I only track visits to a product detail page so far.
            | Basically, some basic metadata about the user (logged in
            | only), some metadata about the product, and basic
            | "auditing" columns -- created by, created date, modified
            | by, modified date (although why I have modified by and
            | modified date makes no sense to me; I don't anticipate ever
            | editing these, they're only there for "standardization". I
            | don't like it but I can only fight so many battles at a
            | time).
           | 
           | I am approaching 1.5 million rows in under two months.
           | Thankfully, my DBA is kind, generous, and infinitely patient.
           | 
           | Clickhouse looks like a good approach. I'll have to look into
           | that.
           | 
           | > select count(*) from trackproductview;
           | 
           | > 1498745
           | 
           | > select top 1 createddate from TrackProductView order by
           | createddate asc;
           | 
           | > 2023-08-18 11:31:04.000
           | 
            | what is the maximum number of rows in a ClickHouse table?
            | Is there such a limit?
        
         | victorbjorklund wrote:
          | I use Postgres with TimescaleDB. Works unless your e-commerce
          | is amazon.com. The great thing with TimescaleDB is that it
          | takes care of creating materialized views with the aggregates
          | you care about (like product views per hour etc.), and you
          | can even choose to "throw away" the events themselves and
          | just keep the aggregations (to avoid getting a huge db if you
          | have a lot of events).
        
       | p4bl0 wrote:
        | The :hover pseudo-class could be applied and unapplied multiple
        | times for a single page load. This can certainly be mitigated
        | using cache-related HTTP headers, but then if the same page is
        | visited by the same person a second time coming from the same
        | referrer, the analytics endpoint won't be loaded.
       | 
       | But maybe I'm not aware that browsers guarantee that "images"
       | loaded using url() in CSS will be (re)loaded exactly once per
       | page?
        
         | kevincox wrote:
          | I'm not sure about `url()` in CSS, but `<img>` tags are
          | guaranteed to only be loaded once per URL per page. I would
          | assume that `url()` works the same.
          | 
          | This bit me when I tried to make a page that reloads an image
          | as a form of monitoring. Interestingly, however, the URL
          | includes the fragment (after the #) even though it isn't sent
          | to the server. So I managed to work around this by appending
          | #1, #2, #3... to the image URL.
         | 
         | https://gitlab.com/kevincox/image-monitor/-/blob/e916fcf2f9a...
        
       | alabhyajindal wrote:
       | Wow, I didn't know you could trigger a URL endpoint with CSS!
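        | 
        | Something like this, if I understand the post right (selector
        | and endpoint path illustrative): the browser only fetches the
        | border-image once the rule matches, i.e. once a real cursor
        | hovers over the page.
        | 
        |     body:hover {
        |         border-image: url("https://example.com/hit/some-page/");
        |     }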
        
       | dontlaugh wrote:
       | Why not just get this info from the HTTP server?
        
         | victorbjorklund wrote:
         | Hard if you run serverless
        
           | dontlaugh wrote:
           | There's still a server somewhere and it can log URLs and IPs.
        
             | tmikaeld wrote:
             | Not if it's static generated html/css.
             | 
             | And the real benefit of this trick is separating users from
             | bots.
        
             | berkes wrote:
             | And even if there are many servers (a CDN or distributed
             | caching) you can collect and merge these.
        
               | victorbjorklund wrote:
               | Tell me how to collect the logs for static sites on
               | Cloudflare Pages (not functions. The Pages sites)
        
               | berkes wrote:
               | Cloudflare Pages are running on servers. These servers
               | (can, quite certainly will) have logs.
               | 
               | That you cannot access the logs because you don't own the
               | servers doesn't mean there aren't any servers that have
               | logs.
        
               | victorbjorklund wrote:
                | Yes, no one has argued that Cloudflare Pages aren't
                | using servers. But it is "hard" to track using logs if
                | you are a Cloudflare customer. I guess the only way
                | would be to hack into Cloudflare itself and access my
                | logs that way. But that is "hard" (because yes,
                | theoretically it is possible, I know). And not a
                | realistic alternative.
        
             | victorbjorklund wrote:
             | Of course. But you can't access it. You can't get logs for
             | static sites on Cloudflare Pages.
        
           | Spivak wrote:
           | Huh? You can get logs just fine from your ALB's and API
           | Gateways.
        
         | hk__2 wrote:
         | > Why not just get this info from the HTTP server?
         | 
         | This is explained in the blog post:
         | 
         | > There's always the option of just parsing server logs, which
         | gives a rough indication of the kinds of traffic accessing the
         | server. Unfortunately all server traffic is generally seen as
         | equal. Technically bots "should" have a user-agent that
         | identifies them as a bot, but few identify that since they're
         | trying to scrape information as a "person" using a browser. In
         | essence, just using server logs for analytics gives a skewed
         | perspective to traffic since a lot of it are search-engine
         | crawlers and scrapers (and now GPT-based parsers).
        
           | dontlaugh wrote:
           | Don't bots now load an entire browser including simulated
           | user interaction, to the point where there's no difference?
        
             | janosdebugs wrote:
              | Not for the most part, it's still very expensive. And
              | even then, they don't simulate mouse movement.
        
         | spiderfarmer wrote:
         | All bots
        
       | jackjeff wrote:
        | The whole anonymization of IP addresses by just hashing the
        | date and IP is just security theater.
        | 
        | Cryptographic hashes are designed to be fast. You can do 6
        | billion MD5 hashes in a second on a MacBook (M1 Pro) via
        | hashcat, and there are only 4 billion IPv4 addresses. So you
        | can brute force the entire range and find the IP address.
        | Basically reverse the hash.
        | 
        | And that's true even if they used something secure like SHA-256
        | instead of broken MD5.
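        | 
        | A toy version of that brute force in Python (hashcat on a GPU
        | is orders of magnitude faster, but the principle is the same;
        | this assumes the salt - here the date - is known):
        | 
        |     import hashlib
        |     from itertools import product
        |     
        |     def reverse_ip_hash(target, salt="2023-11-01"):
        |         # enumerate all ~4.3 billion IPv4 addresses
        |         for a, b, c, d in product(range(256), repeat=4):
        |             ip = f"{a}.{b}.{c}.{d}"
        |             guess = hashlib.md5((salt + ip).encode())
        |             if guess.hexdigest() == target:
        |                 return ip
        |         return None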
        
         | berkes wrote:
          | Maybe they use a secret salt or rotating salt? The example
          | code doesn't, so I'm afraid you are right. But with one
          | addition it can be made reasonably secure.
          | 
          | I am afraid, however, that this security theater is enough to
          | pass many laws, regulations and such on PII.
        
         | ktta wrote:
         | Not if they use a password hash like Argon2 or scrypt
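          | 
          | A sketch of what that looks like with the scrypt that ships
          | in Python's stdlib (parameters illustrative): each guess now
          | costs real CPU and memory, so enumerating all 2^32 IPs gets
          | expensive.
          | 
          |     import hashlib, os
          |     
          |     def slow_ip_hash(ip: str, salt: bytes) -> bytes:
          |         # n=2**14, r=8 costs ~16 MiB of memory per guess
          |         return hashlib.scrypt(ip.encode(), salt=salt,
          |                               n=2**14, r=8, p=1, dklen=32)
          |     
          |     digest = slow_ip_hash("203.0.113.7", os.urandom(16))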
        
           | ale42 wrote:
           | But that's very heavy to compute at scale...
        
             | isodev wrote:
             | True, but also it's a blogging platform - does it really
             | have that kind of scale to be concerned with?
        
               | ale42 wrote:
               | Probably not, I was mainly thinking if that kind of
               | solution was to be adopted at a scale like Google
               | Analytics.
        
           | __alexs wrote:
           | Even then it is theatre because if you know the IP address
           | you want to check it's trivial to see if there's a match.
        
             | chrismorgan wrote:
             | And _this_ is why such a hash will still be considered
             | personal data under legislation like GDPR.
        
         | TekMol wrote:
          | That is easy to fix though. Just use a temporary salt.
          | 
          | Pseudo code:
          | 
          |     if salt.day < today():
          |         salt = {day: today(), salt: random()}
          |     ip_hash = sha256(ip + salt.salt)
        
           | __alexs wrote:
           | Assuming you don't store the salts, this produces a value
           | that is useless for anything but counting something like DAU.
           | Which you could equally just do by counting them all and
           | deleting all the data at the end of the day, or using a
           | cardinality estimator like HLL.
        
             | TekMol wrote:
              | DAU with regard to a given page.
              | 
              | Have you read the article? That is what the author's goal
              | seems to be.
              | 
              | He wants to prevent multiple requests to the same page by
              | the same IP being counted multiple times.
        
               | tatersolid wrote:
               | Is that more efficiently done with an appropriate caching
               | header on the page as it is served?
               | 
               | Cache-Control: private, max-age=86400
               | 
               | This prevents repeat requests for normal browsers from
               | hitting the server.
        
             | dvdkon wrote:
             | That same uselessness for long-term identification of users
             | is what makes this approach compliant with laws regulating
             | use of PII, since what you have after a small time window
             | isn't actually PII (unless correlated with another dataset,
             | but that's always the case).
        
             | SamBam wrote:
              | That's precisely all that OP is storing in the original
              | article.
              | 
              | They're just getting a list of hashes per day, and
              | associated client info. They have no idea if the same
              | user visits them on multiple days, because the hashes
              | will be different.
        
           | kevincox wrote:
            | Of course, if you have multiple servers or may reboot, you
            | need to store the salt somewhere. If you are going to
            | bother storing the salt and cleaning it up after the day is
            | over, it may be just as easy to clean the hashes at the end
            | of the day (and keep the total count), which is equivalent.
            | This should work unless you want to keep individual counts
            | around for something like seeing the distribution of
            | requests per IP or similar. But in that case you could just
            | replace the hashes with random values at the end of the day
            | to fully anonymize them, since you no longer need to
            | increment them.
        
         | Etheryte wrote:
         | For context, this problem also came up in a discussion about
         | Storybook doing something similar in their telemetry [0] and
         | with zero optimization it takes around two hours to calculate
         | the salted hashes for every IPv4 on my home laptop.
         | 
         | [0] https://news.ycombinator.com/item?id=37596757
        
         | hnreport wrote:
          | This is the type of comment that reinforces not even trying
          | to learn or outsource security.
          | 
          | You'll never know enough.
        
           | petesergeant wrote:
           | I think the opposite? I'm a dev with a bit of an interest in
           | security, and this immediately jumped out at me from the
           | story; knowing enough security to discard bad ideas is
           | useful.
        
         | WhyNotHugo wrote:
         | Aside from it being technically trivial to get an IP back from
         | its hash, the EU data protection agency made it very clear that
         | "hashing PII does not count as anonymising PII".
         | 
          | Even if you hash somebody's full name, you can later answer
          | the question "does this hash match this specific full
          | name?". Being able to answer this question implies that the
          | anonymisation process is reversible.
        
           | kevincox wrote:
           | I think the word "reversible" here is being stretched a bit.
           | There is a significant difference between being able to list
           | every name that has used your service and being able to check
           | if a particular name has used your service. (Of course these
           | can be effectively the same in cases where you can list all
           | possible inputs such as hashed IPv4 addresses.)
           | 
           | That doesn't mean that hashing is enough for pure anonymity,
           | but used properly hashes are definitely a step above
           | something fully reversible (like encryption with a common
           | key).
        
             | SamBam wrote:
             | I'm not sure the distinction is meaningful. If the police
             | demand your logs to find out whether a certain IP address
             | visited in the past year, they'd be able to find that out
             | pretty quickly given what's stored. So how is privacy being
             | respected?
        
             | pluto_modadic wrote:
             | if it fulfills the same function, does it matter?
             | 
              | if you have an ad ID for a person, say
              | example@example.com, and you want to deduplicate it, and
              | you provide them with the names, the company that buys
              | the data can still "blend" it with data they know, if
              | they know how the hash was generated... and effectively
              | get back that person's email, or IP, or phone number, or
              | at least get a good hunch that the closest match is
              | such-and-such person with uncanny certainty
             | 
             | de-anonymization of big data is trivial in basically every
             | case that was written by an advertising company, instead of
             | written by a truly privacy focused business.
             | 
             | if it were really a non-reversible hash, it would be evenly
             | distributed, not predictable, and basically useless for
             | advertising, because it wouldn't preserve locality. It
             | needs to allow for finding duplicates... so the person you
              | give the hash to can abuse that fact.
        
           | bayindirh wrote:
           | We're members of some EU projects, and they share a common
           | help desk. To serve as a knowledge base, the tickets are
           | kept, but all PII is anonymized after 2 years AFAIK.
           | 
           | What they do is pretty simple. They overwrite the data fields
           | with the text "<Anonymized>". No hashes, no identifiers,
           | nothing. Everything is gone. Plain and simple.
        
             | spookie wrote:
              | KISS. That's the best way to go about it.
        
           | jefftk wrote:
           | It depends. For example, if each day you generate a random
           | nonce and use it to salt that day's PII (and don't store the
           | nonce) then you cannot later determine (a) did person A visit
           | on day N or (b) is visitor X on day N the same as visitor Y
           | on day N+1. But you can still determine how many distinct
           | visitors you had on day N, and answer questions about within-
           | day usage patterns.
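            | 
            | A minimal sketch of that scheme (names illustrative):
            | rotate the nonce at midnight and never persist it, so
            | yesterday's hashes can no longer be linked to anything.
            | 
            |     import hashlib, secrets
            |     from datetime import date
            |     
            |     _nonce, _nonce_day = None, None
            |     
            |     def daily_visitor_id(ip: str) -> str:
            |         global _nonce, _nonce_day
            |         if _nonce_day != date.today():
            |             # rotate; the old nonce is gone for good
            |             _nonce = secrets.token_bytes(16)
            |             _nonce_day = date.today()
            |         digest = hashlib.sha256(_nonce + ip.encode())
            |         return digest.hexdigest()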
        
           | ilrwbwrkhv wrote:
           | Yes but if the business is not in the EU they don't need to
           | care one bit about GDPR or EU.
        
             | troupo wrote:
              | If they target residents of the EU, they must care.
              | 
              |  _Edit:_ Bear (bear.app) claims to be GDPR compliant:
              | https://bear.app/faq/bear-is-gdpr-compliant/ - but that
              | is a different Bear.
        
           | TylerE wrote:
            | Is an IPv4 address really classed as PII? Sounds a bit
            | insane.
        
             | beardog wrote:
              | It can be used to track you across the web, to get a
              | general geographic area, and, if you have the right
              | connections, to get the ISP subscriber's address. Given
              | that PII is anything that can be used to identify a
              | person, I think it qualifies, despite it being difficult
              | for a rando to tie an IP to a person.
             | 
             | Additionally in the case of ipv6 it can be tied to a
             | specific device more often. One cannot rely on ipv6 privacy
             | extensions to sufficiently help there.
        
               | rtsil wrote:
               | That's compounded by the increasing use of static IPs, or
               | at least extremely long-lasting dynamic IPs in some ISPs.
        
         | alkonaut wrote:
         | Hashes should be salted. If you salt, you are fine, if you
         | don't you aren't.
         | 
         | Whether the salt can be kept indefinitely, or is rotated
         | regularly etc is just an implementation detail, but the key
         | with salting hashes for analytics is that the salt never leaves
         | the client.
         | 
          | As explained in the article, there seems to be no salt (or
          | rather, the current date seems to be used as a salt, but
          | that's not a random salt and can easily be guessed by anyone
          | who wants to ask "did IP x.y.z.w visit on date yy-mm-dd?").
         | 
          | It's pretty easy to reason about these things if you look
          | from the perspective of an attacker. How would you figure out
          | anything about a specific person given the data? If you
          | can't, then the data is probably OK to store.
        
           | piaste wrote:
           | > Hashes should be salted. If you salt, you are fine, if you
           | don't you aren't.
           | 
           | > Whether the salt can be kept indefinitely, or is rotated
           | regularly etc is just an implementation detail, but the key
           | with salting hashes for analytics is that the salt never
           | leaves the client.
           | 
           | I think I'm missing something.
           | 
           | If the salt is known to the server, then it's useless for
           | this scenario. Because given a known salt, you can generate
           | the hashes for every IP address + that salt very quickly.
           | (Salting passwords works because the space for passwords is
           | big, so rainbow tables are expensive to generate.)
           | 
           | If the salt is unknown to the server, i.e. generated by the
           | client and 'never leaves the client'... then why bother with
           | hashes? Just have the client generate a UUID directly instead
           | of a salt.
        
             | rkangel wrote:
              | Without a salt, you can generate the hash for every IP
              | address _once_, and then permanently have a hash->IP
              | lookup (effectively a rainbow table). If you have a salt,
              | then you need to do it for each database entry, which
              | does make it computationally more expensive.
        
               | tptacek wrote:
               | People are obsessed with this attack from the 1970s, but
               | in practice password cracking rigs just brute force the
               | hashes, and that has been the practice since my career
               | started in the 1990s and people used `crack`, into the
               | 2000s and `jtr`, and today with `hashcat` or whatever it
               | is the cool kids use now. "Rainbow tables" don't matter.
               | If you're discussing the expense of attacking your scheme
               | with or without rainbow tables, you've already lost.
        
               | jonahx wrote:
               | > If you're discussing the expense of attacking your
               | scheme with or without rainbow tables, you've already
               | lost.
               | 
               | Can you elaborate on this or link to some info
               | elaborating what you mean? I'd like to learn about it.
        
             | alkonaut wrote:
             | > > _the salt never leaves the client_
             | 
             | > I think I'm missing something.
             | 
             | ...
             | 
             | > If the salt is known to the server,
             | 
                | That's what you were missing, yes.
        
               | SamBam wrote:
               | Did you miss the second half where GP asked why the
               | client doesn't just send up a UUID, instead of generating
               | their own salt and hash?
        
             | arp242 wrote:
             | > why bother with hashes? Just have the client generate a
             | UUID directly instead of a salt.
             | 
              | The reason for all this bonanza is that the ePrivacy
              | directive requires a cookie banner, making exceptions
              | only for data that is "strictly necessary in order to
              | provide a [..] service explicitly requested by the
              | subscriber or user".
             | 
              | In the end, you only have a "pinky promise" that someone
              | isn't doing more processing on the server end, so in
              | reality it doesn't matter much, especially if the cookie
              | lifetime is short (hours or even minutes). Actually, a
              | cookie or other (short-lived!) client-side ID is probably
              | better for everyone if it weren't for the cookie banners.
        
               | TylerE wrote:
                | ALL of the faff around cookies is the biggest security
                | theater of the past 40 years. I remember hearing the
                | fear-mongering in the very early 2000s about cookies in
                | the mainstream media - it was self-evidently a farce
                | then, and it's a farce now.
        
               | throwaway290 wrote:
                | Isn't the data in this case part of the "strictly
                | necessary" data (the IP address)? That's all that gets
                | collected by that magic CSS + server, no?
        
               | arp242 wrote:
               | ePrivacy directive only applies to information stored on
               | the client side (such as cookies).
        
           | darken wrote:
            | Salts are generally stored with the hash, and are only
            | really intended to prevent "rainbow table" attacks (i.e.
            | use of precomputed hash tables). Though a predictable,
            | shared salt per entry does mean you can attack all the
            | hashes for a timestamp with each hash attempt.
           | 
           | That being said, the previous responder's point still stands
           | that you can brute force the salted IPs at about a second per
           | IP with the colocated salt. Using multiple hash iterations
           | (e.g. 1000x; i.e. "stretching") is how you'd meaningfully
           | increase computational complexity, but still not in a way
           | that makes use of the general "can't be practically reversed"
           | hash guarantees.
        
             | alkonaut wrote:
              | As I said, the key to hashing PII for telemetry is that
              | the client does the hashing on the client side and never
              | transmits the salt. This isn't a login system or similar;
              | there is no "validation" of the hash. All the hash is is
              | a unique marker for a user that doesn't contain any PII.
        
               | SamBam wrote:
               | How does the client generate the same salt every time
               | they visit the page, without using cookies?
        
               | donkeyd wrote:
               | Use localstorage!
               | 
               | Kidding, of course. I don't think there's a way to track
               | users across sessions, without storing something and
               | requiring a 'cookie notification'. Which is kind of the
               | point of all these laws.
        
               | alkonaut wrote:
                | Storing a salt with 24h expiry would be the same thing
                | as the solution in the article. It would be better from
                | a privacy perspective, because the IP would then not be
                | transmitted in a reversible way.
               | 
               | If I hadn't asked for permission to send hash(ip + date)
               | then I'd sure not ask permission if I instead stored a
               | random salt for each new 24h and sent the hash(ip +
               | todays_salt).
               | 
               | This is effectively a cookie and it's not strictly
               | necessary if it's stats only. So I think on the server
               | side I'd just invent some reason why it's necessary for
               | the product itself too, and make the telemetry just an
               | added bonus.
        
               | alkonaut wrote:
                | If you can use JS it's easy. For example,
                | localStorage.setItem("salt", Math.random()). Without JS
                | it's hard, I think. I don't know why the author doesn't
                | want to use JS, perhaps out of respect for his
                | visitors, but then I think it's worse to send PII over
                | the wire (and an IP hashed in the way he describes is
                | PII).
        
               | SamBam wrote:
                | The EU's consent requirements don't distinguish between
                | cookies and localStorage, as far as I understand. And a
                | salt that is only used for analytics would not count as
                | "strictly necessary" data, so I think you'd have to put
                | up a consent popup. Which is precisely the kind of
                | thing a solution like this is trying to avoid.
        
               | alkonaut wrote:
                | Indeed, but as I wrote in another reply: it doesn't
                | matter. It's even worse to send PII over the wire.
                | Using the date as the salt (as he does) just means it's
                | reversible PII - a.k.a. PII!
                | 
                | Presumably these are stored on the server side to
                | identify returning visitors - so instead of storing a
                | random number for 24 hours on the client, you now have
                | PII stored on the server. So basically there is no way
                | to do this that doesn't require consent.
               | 
               | The only way to do it is to make the information required
               | for some necessary function, and then let the analytics
               | piggyback on it
        
               | SamBam wrote:
                | I think I agree with you there. But again, the idea of
                | a "salt" is then overcomplicating things. It's exactly
                | the same to have the client generate a UUID and just
                | send that up, no salting or hashing required.
        
               | alkonaut wrote:
                | Yup, for only identifying a system that's easier. If
                | this is all the telemetry is ever planned to do, then
                | that's all you need. The benefit of having a local hash
                | function is when you want to transmit multiple IDs for
                | data. E.g. in a word processor you might transmit
                | hash(salt+username) on start and hash(salt+filename)
                | when opening a document, and so on. That way you can
                | send identifiers for things that are sensitive or
                | private, like file names, in a standardized way, and
                | you don't need to keep track of N generated GUIDs for N
                | use cases.
               | 
                | On the telemetry server you get e.g.
                | 
                | Function "print" used by user 123, document 345. Using
                | that, you can answer things like how many times an
                | average document is printed or how many times per year
                | an average user uses the print function.
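                | 
                | A sketch of that client side (names illustrative): one
                | local secret salt, reused to hash any sensitive
                | identifier before it leaves the machine.
                | 
                |     import hashlib, secrets
                |     
                |     # generated once, stored locally, never sent
                |     salt = secrets.token_hex(16)
                |     
                |     def telemetry_id(value: str) -> str:
                |         digest = hashlib.sha256((salt + value).encode())
                |         return digest.hexdigest()
                |     
                |     user = telemetry_id("alice")
                |     doc = telemetry_id("report.docx")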
        
               | robertlagrant wrote:
               | IP address is "non-sensitive PII"[0]. It's pretty hard to
               | identify someone from an IP address. Hashing and then
               | deleting every day is very reasonable.
               | 
               | [0] https://www.ibm.com/topics/pii
        
               | sysop073 wrote:
                | What's the point in hashing the IP + salt then? Just
                | let each client generate a random nonce and use that as
                | the key.
        
           | tptacek wrote:
           | Salting a standard cryptographic hash (like SHA2) doesn't do
           | anything meaningful to slow a brute force attack. This
           | problem is the reason we have password KDFs like scrypt.
           | 
           | (I don't care about this Bear analytics thing at all, and
           | just clicked the comment thread to see if it was the Bear I
           | thought it was; I do care about people's misconceptions about
           | hashing.)
        
             | alkonaut wrote:
             | What do you mean by "brute force" in the context of
             | reversing PII that has been obscured by a one way hash? My
             | IP number passed through SHA1 with a salt (a salt I
             | generated and stored safely on my end) is
             | 6FF6BA399B75F5698CEEDB2B1716C46D12C28DF5 Since this is all
             | that would be sent over the wire for analytics, this is the
             | only information an attacker will have available.
             | 
              | The only thing you can brute force from that is _some_
              | IP and _some_ salt such that SHA1(IP+salt) =
              | 6FF6BA399B75F5698CEEDB2B1716C46D12C28DF5. But you'll find
              | millions of such IPs. Perhaps all possible IPs will work
              | with _some_ salt and give that hash. It's not revealing
              | my IP even if you manage to find a match.
        
               | infinityio wrote:
               | If you also explicitly mentioned the salt used (as bear
               | appear to have done?), this just becomes a matter of
               | testing 4 billion options and seeing which matches
        
               | alkonaut wrote:
               | I think it's just unsalted in the example code. Or you
               | could argue that the date is kind of used as a salt. But
               | the point was that salting + hashing is fine for PII in
               | telemetry if and only if the salt stays on the client. It
               | might be difficult to do without JS though.
        
             | michaelmior wrote:
             | > Salting a standard cryptographic hash (like SHA2) doesn't
             | do anything meaningful to slow a brute force attack.
             | 
             | Sure, but it does at least prevent the use of rainbow
             | tables. Arguably not relevant in this scenario, but it
             | doesn't mean that salting does nothing. Rainbow tables can
             | speed up attacks by many orders of magnitude. Salting may
             | not prevent each individual password from being brute
             | forced, but for most attackers, it probably will prevent
             | your entire database from being compromised due to the
             | amount of computation required.
        
               | tptacek wrote:
               | Rainbow tables don't matter. If you're discussing the
               | strength of your scheme with or without rainbow tables,
               | you have already lost.
               | 
               | https://news.ycombinator.com/item?id=38098188
        
               | robertlagrant wrote:
               | That's just a link where you claim the same thing. What's
               | your actual rationale? Do you think salting is pointless?
        
         | dspillett wrote:
         | _> Cryptographic hashes are designed to be fast._
         | 
         | Not _really_. They are designed to be fast _enough_ and even
         | then only as a secondary priority.
         | 
          |  _> You can do 6 billion ... hashes/second on [commodity
          | hardware] ... there's only 4 billion ipv4 addresses. So you
          | can brute force the entire range_
         | 
         | This is harder if you use a salt not known to the attacker.
         | Per-entry salts can help even more, though that isn't relevant
         | to IPv4 addresses in a web/app analytics context because after
         | the attempt at anonymisation you want to still be able to tell
         | that two addresses were the same.
         | 
         |  _> And that's true even if they used something secure like
         | SHA-256 instead of broken MD5_
         | 
          | Relying purely on the computational complexity of one hash
         | operation, even one not yet broken, is not safe given how easy
         | temporary access to mass CPU/GPU power is these days. This can
         | be mitigated somewhat by running many rounds of the hash with a
         | non-global salt - which is what good key derivation processes
         | do for instance. Of course you need to increase the number of
         | rounds over time to keep up with the rate of growth in
         | processing availability, to keep undoing your hash more hassle
         | than it is worth.
         | 
         | But yeah, a single unsalted hash (or a hash with a salt the
         | attacker knows) on IP address is not going to stop anyone who
         | wants to work out what that address is.
        
           | krsdcbl wrote:
            | Don't forget that md5 is comparatively slow & there are way
            | faster options for hashing nowadays:
           | 
           | https://jolynch.github.io/posts/use_fast_data_algorithms/
        
           | SAI_Peregrinus wrote:
           | A "salt not known to the attacker" is a "key" to a keyed hash
           | function or message authentication code. A salt isn't a
           | secret, though it's not usually published openly.
        
           | marcosdumay wrote:
           | > only as a secondary priority
           | 
           | That's not a reasonable way to put it. It's literally the
           | second priority, and heavily weighed when deciding which
           | algorithms to adopt.
           | 
           | > This is harder if you use a salt not known to the attacker.
           | 
           | The "attacker" here is the sever owner. So if you use a
           | random salt and throw it away, you are good, anything
           | resembling the way people use salt on practice is not fine.
        
         | HermanMartinus wrote:
         | Author here. I commented down below, but it's probably more
         | relevant in this thread.
         | 
         | For a bit of clarity around IP address hashes: the only use
         | they have in this context is preventing duplicate hits in a day
         | (making each page view unique by default). At the end of each
         | day there is a worker job that scrubs the IP hash, which is
         | by then irrelevant.
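         | 
         | In rough pseudo-Python the flow looks like this (a
         | simplified sketch, not the actual Bear code):
         | 
         |     import hashlib
         |     from datetime import date
         | 
         |     counts: dict[str, int] = {}
         |     seen_today: set[str] = set()  # a DB table in practice
         | 
         |     def record_hit(ip: str, page: str) -> None:
         |         key = f"{ip}:{page}:{date.today()}"
         |         h = hashlib.sha256(key.encode()).hexdigest()
         |         if h not in seen_today:  # unique per IP per day
         |             seen_today.add(h)
         |             counts[page] = counts.get(page, 0) + 1
         | 
         |     def nightly_scrub() -> None:
         |         # worker job: counts stay, hashes are dropped
         |         seen_today.clear()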
        
           | myfonj wrote:
           | Have you considered serving an actual small transparent
           | image with caching headers set to expire at midnight?
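           | 
           | Something like this on the server side (a hypothetical
           | Python sketch; the exact caching policy would need
           | testing):
           | 
           |     from datetime import datetime, timedelta, timezone
           |     from email.utils import format_datetime
           | 
           |     def pixel_headers() -> dict[str, str]:
           |         now = datetime.now(timezone.utc)
           |         midnight = (now + timedelta(days=1)).replace(
           |             hour=0, minute=0, second=0, microsecond=0)
           |         ttl = int((midnight - now).total_seconds())
           |         # the browser re-fetches at most once per day,
           |         # so the dedupe happens client-side and no IP
           |         # hash needs to be kept at all
           |         return {
           |             "Content-Type": "image/gif",
           |             "Expires": format_datetime(midnight,
           |                                        usegmt=True),
           |             "Cache-Control": f"public, max-age={ttl}",
           |         }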
        
       | freitasm wrote:
       | "The only downside to this method is if there are multiple reads
       | from the same IP address but on separate devices, it will still
       | only be seen as one read. And I'm okay with that since it
       | constitutes such a minor fragment of traffic."
       | 
       | Many ISPs are now using CG-NAT so this approach would miscount
       | thousands of visitors seemingly coming from a single IP address.
        
         | tmikaeld wrote:
         | Only if all of them use the exact same user agent
         | (platform/browser).
         | 
         | (It would be better if he used a hash of the raw user agent
         | string)
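         | 
         | E.g. (an illustrative sketch only):
         | 
         |     import hashlib
         | 
         |     def visitor_key(ip: str, user_agent: str) -> str:
         |         # two CG-NAT users behind one IP now only collide
         |         # when their raw UA strings are also identical
         |         data = f"{ip}|{user_agent}".encode()
         |         return hashlib.sha256(data).hexdigest()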
        
           | freitasm wrote:
           | UAs aren't unique these days.
        
       | colesantiago wrote:
       | How would one block this from tracking you?
       | 
       | I think we would either need to deliberately send fake data
       | to these analytics tools, like https://adnauseam.io/ does,
       | 
       | or start treating CSS as a spy tracker that needs to be
       | blocked.
        
         | Kiro wrote:
         | I don't see how this is more intrusive for privacy than what
         | you can already get from access logs.
        
           | colesantiago wrote:
           | It is still tracking you so it needs to be blocked.
        
             | Kiro wrote:
             | So are access logs. How are you going to block those?
        
               | colesantiago wrote:
                | I never said anything about access logs; I
                | specifically mentioned this CSS trick, which will
                | become popular with ad companies for tracking people.
                | 
                | For this, one would need to block the endpoint or
                | deliberately send obfuscated data in protest.
                | 
                | Should you _want_ to cover access logs too, sending
                | excessive, random obfuscated data with AdNauseam
                | would also help here.
               | 
               | https://adnauseam.io/
        
             | jokethrowaway wrote:
             | I sure hope you're being sarcastic here and illustrating
             | the ridiculousness of privacy extremists (who, btw, ruined
             | the web, thanks to a few idiot politicians in the EU).
             | 
             | If not, what's wrong with a service knowing you're
             | accessing it? How can they serve a page without knowing
             | you're getting a page?
        
               | callalex wrote:
               | Ruined the web? It sure seems like the web still works
               | from my perspective. What has been ruined for you?
        
           | matrss wrote:
           | If it is not then it must be unnecessary, since you could get
           | the same information from the access logs already.
        
         | its-summertime wrote:
         | ||/hit/*$image
         | 
         | In your favorite ad blocker
        
       | meiraleal wrote:
       | Interesting approach but what about mobile users?
        
         | welpo wrote:
         | From the article:
         | 
         | > Now, when a person hovers their cursor over the page (or
         | scrolls on mobile) it triggers body:hover which calls the URL
         | for the post hit
        
           | cantSpellSober wrote:
           | It _doesn't_ do that though.
           | 
           | > The :hover pseudo-class is problematic on touchscreens.
           | Depending on the browser, the :hover pseudo-class might never
           | match
           | 
           | https://developer.mozilla.org/en-US/docs/Web/CSS/:hover
           | 
           | Don't take my word for it. Trying it in mobile emulators will
           | have the same result.
        
       | rzmmm wrote:
       | > Now, when a person hovers their cursor over the page (or
       | scrolls on mobile)...
       | 
       | I can imagine many cases where a real human user doesn't
       | scroll the page on a mobile platform. I like the CSS
       | approach, but I'm not sure it's better than doing some bot
       | filtering on the server logs.
        
       | freitzzz wrote:
       | I attempted to do this back at the start of this year, but lost
       | motivation building the web ui. My trick is not CSS but simply
       | loading fake images with <img> tags:
       | 
       | https://github.com/nolytics
        
       | openplatypus wrote:
       | The CSS tracker is as useful as server log-based analytics. If
       | that is the information you need, cool.
       | 
       | But JS trackers offer so much more. Time spent on the
       | website, scroll depth, screen sizes, limited-but-compliant
       | yet useful unique sessions: those things cannot be achieved
       | without some (simple) JS.
       | 
       | Server side, JS, CSS... No one size fits all.
       | 
       | Wide Angle Analytics has strong privacy, DNT support, an opt-out
       | mechanism, EU cloud, compliance documentation, and full process
       | adherence. Employs non-reversible short-lived sessions that still
       | give you good tracking. Combine it with a custom domain or
       | first-party API calls and you get near-100% data accuracy.
        
         | croes wrote:
         | If it's a US company, then an EU cloud doesn't matter as
         | far as data protection for EU citizens goes.
         | 
         | The Cloud Act rendered that worthless.
        
           | openplatypus wrote:
           | Wide Angle Analytics is German company operating everything
           | on EU cloud (EU owners, EU location).
        
         | EspressoGPT wrote:
         | You could probably even analyze screen sizes by doing the
         | same thing with CSS media queries.
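         | 
         | A sketch of the idea (hypothetical; the bucket widths are
         | arbitrary, with Python generating the CSS):
         | 
         |     # one min-width rule per bucket; the last matching rule
         |     # wins, so only that bucket's URL is fetched and the
         |     # server log reveals a coarse screen width
         |     BUCKETS = [0, 600, 1200]
         | 
         |     def tracking_css(page: str) -> str:
         |         rules = []
         |         for lo in BUCKETS:
         |             rules.append(
         |                 f"@media (min-width:{lo}px) {{ body {{"
         |                 f" background: url('/hit/{page}/w{lo}')"
         |                 f" }} }}"
         |             )
         |         return "\n".join(rules)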
        
         | TekMol wrote:
         | > The CSS tracker is as useful as server log-based
         | analytics.
         | 
         | It is not. Have you read the article?
         | 
         | The whole point of the CSS approach is to weed out user
         | agents which don't trigger hover events on the body. You
         | can't see that from server logs.
        
       | jokethrowaway wrote:
       | Lovely technique and probably more than adequate for most uses.
       | 
       | My scraping bots use an instance of chrome and therefore trigger
       | hover as well, but you'll cut out the less sophisticated bots.
       | 
       | This is because of protection systems: if I try to scrape my
       | target website with plain code, I just get insta-banned or
       | "captcha'd".
        
         | jerbear4328 wrote:
         | Are you sure? Even if you run a headless browser, you might not
         | be triggering the hover event, unless you specifically tell it
         | to or your framework simulates a virtual mouse that triggers
         | mouse events and CSS.
         | 
         | You totally could be triggering it, but not every bot will,
         | even the fancy ones.
        
       | fatih-erikli wrote:
       | This has been known as a "pixel tracker" for decades.
        
         | cantSpellSober wrote:
         | Used in emails as well. Loading a 1x1 transparent <img> is a
         | more sure thing than triggering a hover event, but ad-blockers
         | often block those
        
           | t0astbread wrote:
           | Occasionally I've seen people fail and add the pixel as an
           | attachment instead.
        
         | blacksmith_tb wrote:
         | True, though doing it in CSS does have a couple of
         | interesting aspects: using :hover would filter out bots that
         | didn't use a full-on webdriver (most bots, that is). I would
         | think that using an @import with 'supports' for an empty-ish
         | .css file would be better in some ways (adblockers are
         | awfully good at spotting 1px transparent tracking pixels,
         | but less likely to block .css files, to avoid breaking
         | layouts), but that wouldn't have the clever :hover benefits.
        
       | chrismorgan wrote:
       | I'd like to see a comparison of the server log information with
       | the hit endpoint information: my feeling is that the reasons for
       | separating it don't really hold water, and that the initial
       | request server logs could fairly easily be filtered to acceptable
       | quality levels, obviating the subsequent request.
       | 
       | The basic server logs include declared bots, undeclared bots
       | pretending to use browsers, undeclared bots actually using
       | browsers, and humans.
       | 
       | The hit endpoint logs will exclude almost all declared bots,
       | almost all undeclared bots pretending to use browsers, and some
       | humans, but will retain a few undeclared bots that search for and
       | load subresources, and almost all humans. About undeclared bots
       | that actually use browsers, I'm uncertain as I haven't inspected
       | how they are typically driven and what their initial mouse cursor
       | state is: if it's placed within the document it'll trigger, but
       | if it's not controlled it'll probably be outside the document. (
       | _Edit:_ actually, I hadn't considered that bearblog caps the body
       | element's width and uses margin, so if the mouse cursor is not in
       | the main column it won't trigger. My feeling is that this will
       | get rid of almost all undeclared bots using browsers, but
       | _significantly_ undercount users with large screens.)
       | 
       | But my experience is that reasonably simple heuristics do a
       | pretty good job of filtering out the bots the hit endpoint also
       | excludes.
       | 
       | * Declared bots: the filtration technique can be ported as-is.
       | 
       | * Undeclared bots pretending to use browsers: that's a
       | behavioural matter, but when I did a _little_ probing of this
       | some years ago, I found that a great many of them were using
       | unrealistic user-agent strings, either visibly wonky or
       | impossible or just corresponding to browsers more than a year old
       | (which almost no real users are using). I suspect you could get
       | rid of the vast majority of them reasonably easily, though it
       | might require occasional maintenance (you could do things like
       | estimate the browser's age from its version number and release
       | cadence, with the caveat that the cadence may slowly drift and
       | should be re-checked every few years; a rough sketch of such a
       | check follows after this list) and will certainly exclude a
       | very few humans.
       | 
       | * Undeclared bots actually using browsers: this depends on
       | the unknown I mentioned above, whether they position their
       | mice in the document area. But my suspicion is that these
       | simply aren't worth
       | worrying about because they're not enough to notably skew things.
       | Actually using browsers is _expensive_; people avoid it where
       | possible.
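       | 
       | The age check could look something like this (a hypothetical
       | sketch; the anchor version/date and the four-week cadence are
       | exactly the bits needing that occasional maintenance):
       | 
       |     import re
       |     from datetime import date, timedelta
       | 
       |     ANCHOR_VERSION, ANCHOR_DATE = 119, date(2023, 10, 31)
       |     CADENCE = timedelta(weeks=4)  # rough release pace
       | 
       |     def looks_stale(ua, max_age=timedelta(days=365)):
       |         # flag UAs claiming a Chrome more than ~a year old
       |         m = re.search(r"Chrome/(\d+)\.", ua)
       |         if not m:
       |             return False  # non-Chrome needs its own rule
       |         behind = ANCHOR_VERSION - int(m.group(1))
       |         est_release = ANCHOR_DATE - behind * CADENCE
       |         return date.today() - est_release > max_age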
       | 
       | And on the matter of humans, it's worth clarifying that the hit
       | endpoint is _worse_ in some ways, and honestly quite risky:
       | 
       | * Some humans will use environments that _can't_ trigger the
       | extra hit request (e.g. text-mode browsers, or using some service
       | that fetches and presents content in a different way);
       | 
       | * Some humans will behave in ways that _don't_ trigger the extra
       | hit request (e.g. keyboard-only with no mouse movement, or
       | loading then going offline);
       | 
       | * Some humans will block the extra hit request; and if you upset
       | the wrong people or potentially even become too popular, it'll
       | make its way into a popular content blocker list and significant
       | fractions of your human base will block it. _This_, in my
       | opinion, is the biggest risk.
       | 
       | * There's also the risk that at some point browsers might
       | prefetch such resources to minimise the privacy leak. (Some email
       | clients have done this at times, and browsers have wrestled from
       | time to time with related privacy leaks, which have led to the
       | hobbling of what properties :visited can affect, and other
       | mitigations of clickjacking. I think it _conceivable_ that such a
       | thing could be changed, though I doubt it will happen and there
       | would be plenty of notice if it ever did.)
       | 
       | But there's a deeper question to it: _if_ you don't exclude some
       | bots; or _if_ the URL pattern gets on a popular content filter
       | list: does it matter? Does it skew the ratios of your results
       | significantly? (Absolute numbers have never been particularly
       | meaningful or comparable between services or sources: you can
       | only meaningfully compare numbers from within a source.) My
       | feeling is that after filtering out most of the bots in fairly
       | straightforward ways, the data that remains is likely to be of
       | similar enough quality to the hit endpoint technique: both will
       | be overcounting in some areas and undercounting in others, but I
       | expect both to be Good Enough, at which point I prefer the
       | simplicity of not having a separate endpoint.
       | 
       | (I think I've presented a fairly balanced view of the facts and
       | the risks of both approaches, and invite correction in any point.
       | Understand also that I've never tried doing this kind of analysis
       | in any _detail_, and what examination and such I have done was
       | almost all 5-8 years ago, so there's a distinct possibility that
       | my feelings are just way off base.)
        
       | p4bl0 wrote:
       | I have a genuine question that I fear might be interpreted as a
       | dismissive opinion but I'm actually interested in the answer:
       | what's the goal of collecting analytics data in the case of
       | personal blogs in a non-commercial context such as what Bearblog
       | seems to be?
        
         | Veen wrote:
         | Curiosity? I like to know if anyone is reading what I write.
         | It's also useful to know what people are interested in. Even
         | personal bloggers may want to tailor content to their audience.
         | It's good to know that 500 people have read an article about
         | one topic, but only 3 people read one about a different topic.
        
           | mrweasel wrote:
           | For the curiosity part, one solution I've been pondering,
           | but never gotten around to implementing, is just logging
           | the country of origin for a request rather than the
           | entire IP.
           | 
           | IPs are useful in case of attack, but you could limit
           | yourself to simply logging subnets. It's a little more
           | aggressive to block a subnet, or an entire ISP, but it
           | seems like a good tradeoff.
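           | 
           | The subnet variant is nearly a one-liner with the stdlib
           | (a sketch; IPv4-only for brevity):
           | 
           |     import ipaddress
           | 
           |     def subnet_only(ip: str) -> str:
           |         # keep the /24 network, drop the host part
           |         net = ipaddress.ip_network(f"{ip}/24",
           |                                    strict=False)
           |         return str(net)  # e.g. "203.0.113.0/24"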
        
         | taurusnoises wrote:
         | I can speak to this from the writer's perspective as someone
         | who has been actively blogging since c. 2000 and has been
         | consistently (very) interested in my "stats" the entire time.
         | 
         | The primary reason I care about analytics is to see if posts
         | are getting read, which on the surface (and in some ways) is
         | for reasons of vanity, but is actually about writer-reader
         | engagement. I'm genuinely interested in what my readers
         | resonate with, because I want to give them more of that. The
         | "that" could be topical, tonal, length, who knows. It helps me
         | hone my material specifically for my readers. Ultimately, I
         | could write about a dozen different things in two dozen
         | different ways. Obviously, I do what I like, but I refine it to
         | resonate with my audience.
         | 
         | In this sense, analytics are kind of a way for me to get to
         | know my audience. With blogs that had high engagement,
         | analytics gave me a sort of fuzzy character description of who
         | my readers were. As with above, I got to see what they liked,
         | but also when they liked it. Were they reading first thing
         | in the morning? Were they lunchtime readers? Were they
         | late-at-night readers? This helped me choose (or feel better
         | about)
         | posting at certain times. Of course, all of this was fuzzy
         | intel, but I found it really helped me engage with my
         | readership more actively.
        
         | hennell wrote:
         | Feedback loops. Contrary to what a lot of people seem to think,
         | analytics is not just about advertising or selling data, it's
         | about analysing site and content performance. Sure that can be
         | used (and abused) for advertising, but it's also essential if
         | you want any feedback about what you're doing.
         | 
         | You might get no monetary value from having 12 people read
         | the site, or 12,000, but from a personal perspective it's
         | nice to know what people want to read from you, so you can
         | feel the time you spent writing was well spent, and adjust
         | toward more popular topics if you wish.
        
       | colesantiago wrote:
       | If you want to send obfuscated data on purpose to prevent this
       | dark pattern behaviour from spreading I recommend Adnauseam.
       | 
       | (not the creator, just a regular user of this great tool)
       | 
       | We need more tools that send random, fake data to analytics
       | providers which renders the analytics useless to them in protest
       | of tracking.
       | 
       | If there are any more like Adnauseam, I would love to know.
       | 
       | https://adnauseam.io/
        
       | myfonj wrote:
       | Seems clever and all, but `body:hover` will most probably
       | completely miss all "keyboard-only" users and users with user
       | agents (assistive technologies) that do not use pointer devices.
       | 
       | Yes, these are perhaps marginal groups, but it is always a
       | super bad sign to see them excluded in any way.
       | 
       | I am not sure (I doubt) there is a 100% reliable way to detect
       | "a real user is reading this article" (and issue an HTTP
       | request) from baseline CSS in every single user agent out
       | there (some of them might not support CSS at all, after all,
       | or have loading of any kind of decorative images from CSS
       | disabled).
       | 
       | There are modern selectors that could help, like :root:focus-
       | within (requiring that the user actually focus something
       | interactive, which again is not guaranteed to trigger the
       | selector in all agents), and/or bleeding-edge scroll-linked
       | animations (`@scroll-timeline`). But again, braille readers
       | will probably remain left out.
        
         | demondemidi wrote:
         | Keyboard only users? All 10 of them? ;)
        
           | bayindirh wrote:
           | Well with me, it's probably 11.
           | 
           | Joking aside, I love to read websites with keyboards,
           | esp. if I'm reading blogs. So it's possible that
           | sometimes my pointer is parked out of the way somewhere
           | to prevent distraction.
        
           | myfonj wrote:
           | I think there might be more than ten [1] blind folks using
           | computers out there, most of them not using pointing devices
           | at all or not in a way that would produce "hover".
           | 
           | [1] that was base ten, right?
        
           | zichy wrote:
           | Think about screen readers.
        
           | vivekd wrote:
           | I'm a keyboard user when on my computer (qutebrowser), but
           | I think your sentiment is correct: the number of keyboard-
           | only users is probably much, much smaller than the number
           | of people using Adblock. So OP's method is likely to
           | produce more accurate analytics than a JavaScript-only
           | design.
           | 
           | OP just thought of a creative, effective, and probably
           | faster, more code-efficient way to do analytics. I love
           | it. Thanks, OP, for sharing it.
        
           | paulddraper wrote:
           | https://www.youtube.com/watch?v=lKie-vgUGdI
        
         | qingcharles wrote:
         | Marginal? Surely this affects 50%+ of user-agents, i.e. phones
         | and tablets which don't support :hover? (without a mouse being
         | plugged in)
        
           | myfonj wrote:
           | I think most mobile browsers emit "hover" state whenever you
           | tap / drag / swipe over something in the page. "active" state
           | is even more reliable IMO. But yes, you are right that it is
           | problematic. Quoting MDN page about ":hover" [1]:
           | 
           | > Note: The :hover pseudo-class is problematic on
           | touchscreens. Depending on the browser, the :hover pseudo-
           | class might never match, match only for a moment after
           | touching an element, or continue to match even after the user
           | has stopped touching and until the user touches another
           | element. Web developers should make sure that content is
           | accessible on devices with limited or non-existent hovering
           | capabilities.
           | 
           | [1] https://developer.mozilla.org/en-US/docs/Web/CSS/:hover
        
           | callalex wrote:
           | I really wish modern touchscreens spent the extra few cents
           | to support hover. Samsung devices from the ~2012 era all
           | supported detection of fingers hovering near the screen. I
           | suspect it's terrible patent laws holding back this
           | technology, like most technologies that aren't headline
           | features.
        
       | gizmo wrote:
       | > Even Fathom and Plausible analytics struggle with logging
       | activity on adblocked browsers.
       | 
       | The simple solution is to respect the basic wishes of those who
       | do not want to be tracked. This is a "struggle" only because
       | website operators don't want to hear no.
        
         | reustle wrote:
         | As much as I agree with respecting folks' wishes to not be
         | tracked, most of these cases are not about "tracking".
         | 
         | It's usually website hosts just wanting to know how many folks
         | are passing through. If a visitor doesn't even want to
         | contribute to incrementing a private visit counter by +1, then
         | maybe don't bother visiting.
        
           | gizmo wrote:
           | If it was just about a simple count the host could just `wc
           | -l access.log`. Clearly website hosts are not satisfied with
           | that, and so they ignore DO_NOT_TRACK and disrespectfully try
           | to circumvent privacy extensions.
        
             | jakelazaroff wrote:
             | Is there a meaningful difference between recording "this IP
             | address made a request on this date" and "this IP address
             | made a request on this date after hovering their cursor
             | over the page body"? How is your suggestion more acceptable
             | than what the blog describes?
        
               | gizmo wrote:
               | Going out of your way to specifically track people who
               | indicate they don't want to be tracked is worse.
        
               | vivekd wrote:
                | Google Cloud and AWS VPS and many hosting services
                | collect and provide this info by default. Chances are
                | most websites do this, including the one you are
                | using now. HN does IP bans, meaning they must access
                | visitor IPs.
                | 
                | Why aren't people starting their protest against the
                | website they're currently using, instead of at OP?
        
               | gizmo wrote:
               | We all know that tracking is ubiquitous on the web. This
               | blogpost however discusses technology that specifically
               | helps with tracking people who don't want to be tracked.
               | I responded that an alternative approach is to just not.
               | That's not hypocritical.
        
               | joshmanders wrote:
                | Again, you don't answer the question: what's the
                | difference between an image pixel or JavaScript
                | logging that you visited the site vs nginx/apache
                | logging that you visited the site?
                | 
                | Being upset that OP used an image or JavaScript
                | instead of grepping `access.log` makes absolutely no
                | sense. The same data is shown there.
        
               | gizmo wrote:
               | It's rude to tell people how they feel and it's rude to
               | assert a post makes "absolutely no sense" while at the
               | same time demanding a response.
               | 
               | One difference is intent. When you build an analytics
               | system you have an obligation to let people opt out.
               | Access logs serve many legitimate purposes, and yes, they
               | can also be used to track people, but that is not why
               | access logs exist. This difference is also reflected in
               | law. Using access logs for security purposes is always
               | allowed but using that same data for marketing purposes
               | may require an opt-in or disclosure.
        
               | jakelazaroff wrote:
               | My point is that your `wc -l access.log` solution will
               | _also_ track people who send the Do Not Track header
               | unless you specifically prevent it from doing so. In
               | fact, you could implement the _exact same system_
               | described in the blog post by replacing the Python code
               | run on every request with an aggregation of the access
               | log. So what is the pragmatic difference between the two?
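               | 
               | To make it concrete, the access-log version could be
               | (a hypothetical sketch, assuming a common-log-format
               | access log):
               | 
               |     import hashlib
               | 
               |     def uniques(log_path: str) -> int:
               |         seen = set()
               |         with open(log_path) as log:
               |             for line in log:
               |                 ip = line.split()[0]  # first field
               |                 h = hashlib.sha256(ip.encode())
               |                 seen.add(h.hexdigest())
               |         return len(seen)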
        
               | gizmo wrote:
               | Even the GDPR makes this distinction. Access logs (with
               | IP addresses) are fine if you use them for technical
               | purposes (monitor for 404, 500 errors) but if you use
               | access logs for marketing purposes you need users to opt-
               | in, because IP addresses are considered PII by law. And
               | if you don't log IPs then you can't track uniques.
               | Tracking and logging are not the same thing.
        
             | arp242 wrote:
             | > If it was just about a simple count the host could just
             | `wc -l access.log`
             | 
             | That doesn't really work, because a huge amount of
             | traffic is from 1) bots, 2) prefetches and other things
             | that shouldn't
             | be counted, 3) the same person loading the page 5 times,
             | visiting every page on the site, etc. In short, these
             | numbers will be wildly wrong (and in my experience "how
             | wrong" can also differ quite a bit per site and over time,
             | depending on factors that are not very obvious).
             | 
             | What people want is a simple break-down of useful things
             | like which entry pages are used, where people came from (as
             | in: "did my post get on the frontpage of HN?")
             | 
             | I don't see how anyone's privacy or anything is violated
             | with that. You can object to that of course. You can also
             | object to people wearing a red shirt or a baseball cap. At
             | some point objections become unreasonable.
        
         | anonymouse008 wrote:
         | I don't know how I feel about this overall. I think we took
         | some rules from the physical world that we liked and
         | discarded others, such that we've ended up with a
         | cognitively dissonant space.
         | 
         | For example, if you walked into my coffee shop, I would be
         | able to lay eyes on you and count your visits for the week.
         | I could also observe where you sit and how long you stay. If
         | I were to better serve you with these data points, by
         | reserving your table before you arrive with your order
         | ready, you'd probably welcome my attention to detail.
         | However, if I were to see that you pulled about x watts a
         | month from my outlets, then suddenly locked up the outlets
         | behind a fee, you'd rightfully wish to never be observed
         | again.
         | 
         | So what I'm getting at is: the issue with tracking appears
         | to be the perverse assholes doing it, versus the benevolent
         | shopkeepers.
         | 
         | To wrap up this thought: what's happening now though is a
         | stalker is following us into every store, watching our every
         | move. In physical space, we'd have this person arrested and
         | assigned a restraining order with severe consequences. However,
         | instead of holding those creeps accountable, we've punished the
         | small businesses that just want to serve us.
         | 
         | --
         | 
         | I don't know how I feel about this or really what to do.
        
           | croniev wrote:
           | The coffee shop reserving my place and having my order
           | ready before I arrive sounds nice - but is it not an
           | unnecessary luxury, one that I would not miss had I never
           | even thought of its possibility? I never asked for it; I
           | was ready to stand in line for my order, and the tracking
           | of my behavior resulted in a pleasant surprise, not a
           | feature I was hoping for. If I really wanted my order to
           | be ready when I arrive, then I would provide that
           | information to you, not expect you to observe me to figure
           | it out.
           | 
           | My point is that I don't get why small businesses should
           | have the right to track me to offer better services that I
           | never even asked for. Sure, it's nice, but it's not worth
           | deregulating tracking and allowing all the evil corps to
           | track me too.
        
             | joshmanders wrote:
             | Here's a better analogy using the coffee shop:
             | 
             | You walk into your favorite coffee shop, order your
             | favorite coffee, every day. But because of privacy reasons
             | the coffeeshop owner is unaware of anything. Doesn't even
             | track inventory, just orders whatever whenever.
             | 
             | One day you walk in and now you can't get your favorite
             | coffee... Because the owner decided to remove that item
             | from the menu. You get mad, "Where's my favorite coffee?"
             | the barista says "owner removed it from menu" and you get
             | even more upset "Why? Don't you know I come in here every
             | day and order the same thing?!"
             | 
             | Nope, because you don't want any amount of tracking
             | whatsoever; knowing any kind of visitor habit is wrong!
             | 
             | But in this scenario, the owner knowing that you order
             | that coffee every day is what ensures it never leaves
             | the menu. So you actually do like tracking.
        
           | majewsky wrote:
           | The coffee shop analogy falls apart after a few seconds
           | because tracking in real life does not scale the same way
           | that tracking in the digital space scales. If you wanted to
           | track movements in a coffee shop as detailed as you can on
           | websites or applications with heavy tracking, you would need
           | to have a dozen people with clipboards strewn about the
           | place, at which point it would feel justifiably dystopian.
           | The only difference on a website is that the clipboard-
           | bearing surveillers are not as readily apparent.
        
             | lcnPylGDnU4H9OF wrote:
             | > you would need to have a dozen people with clipboards
             | strewn about the place
             | 
             | Assuming you live in the US, next time you're in a grocery
             | store, count how many cameras you can spot. Then consider:
             | these cameras could possibly have facial recognition
             | software; these cameras could possibly have object
             | recognition software; these cameras could possibly have
             | software that tracks eye movements to see where people are
             | looking.
             | 
             | Then wonder: do they have cameras in the parking lot? Maybe
             | those cameras can read license plates to know which
             | vehicles are coming and going. Any time I see any sort of
             | news about information that can be retrieved from a photo,
             | I assume that it will be done by many cameras at >1 Hertz
             | in a handful of years.
        
             | arp242 wrote:
             | I don't think it's very important that people "can" do
             | this; the only thing that matters is whether they
             | actually "are" doing it.
        
             | al_borland wrote:
             | I think that's the point. It's the level of detail of
             | tracking online that's the problem. If a website just wants
             | to know someone showed up, that's one thing. If a site
             | wants to know that I specifically showed up, and dig in to
             | find out who I specifically am, and what I'm into so they
             | can target me... that's too much.
             | 
             | Current website tracking is like the coffee shop owner
             | hiring a private investigator to dig into the personal lives
             | of everyone who walks in the door so they can suggest the
             | right coffee and custom cup without having to ask. They
             | could not do that and just let someone pick their own
             | cup... or give them a generic one. I'd like that better. If
             | clipboards in coffee shops are dystopian, so is current web
             | tracking, and we should feel the same about it.
             | 
             | I think Bear strikes a good balance. It lets authors know
             | someone is reading, but it's not keeping profiles on users
             | to target them with manipulative advertising or some kind
             | of curated reading list.
        
         | OhMeadhbh wrote:
         | I have, unfortunately, become cynical in my old age. Don't take
         | this the wrong way, but...
         | 
         | <cynical_statement> The purpose of the web is to distribute
         | ads. The "struggle" is with people who think we made this
         | infrastructure so you could share recipes with your grand-
         | mother. </cynical_statement>
        
           | gizmo wrote:
           | No matter how bad the web gets, it can still get worse.
           | Things can always get worse. That's why I'm not a cynic. Even
           | when the fight is hopeless --and I don't believe it is--
           | delaying the inevitable is still worthwhile.
        
           | al_borland wrote:
           | The infrastructure was put in place for people to freely
           | share information. The ads only came once people started
           | spending their time online and that's where the eyeballs
           | were.
           | 
           | The interstate highway system in the US wasn't built with the
           | intent of advertising to people, it was to move people and
           | goods around (and maybe provide a means to move around the
           | military on the ground when needed). Once there were a lot of
           | eyes flying down the interstate, the billboard was used to
           | advertise to those people.
           | 
           | The same thing happened with the newspaper, magazines, radio,
           | TV, and YouTube. The technology comes first and the ads come
           | with popularity and as a means to keep it low cost. We're
           | seeing that now with Netflix as well. I'm actually a little
           | surprised that books don't come with ads throughout them...
           | maybe the longevity of the medium makes ads impractical.
        
           | digging wrote:
           | Hm, I don't think that's the _purpose_ of the web. It's
           | just the most common use case.
        
       | fdaslkjlkjklj wrote:
       | Looks like a clever way to do analytics. Would be neat to see how
       | it compares with just munging the server logs since you're only
       | looking at page views basically.
       | 
       | Re the hashing issue: it looks interesting, but adding more
       | entropy from other client headers and using a stronger hash
       | algo should be fine.
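       | 
       | e.g. something like this (a sketch; the header choice and key
       | handling are assumptions):
       | 
       |     import hashlib, os
       | 
       |     KEY = os.urandom(32)  # per-deployment secret
       | 
       |     def visitor_id(ip: str, ua: str, lang: str) -> str:
       |         # keyed BLAKE2 over several client headers: more
       |         # entropy than a bare IP, and nothing to brute
       |         # force without the key
       |         data = f"{ip}|{ua}|{lang}".encode()
       |         return hashlib.blake2b(data, key=KEY,
       |                                digest_size=16).hexdigest()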
        
       | HermanMartinus wrote:
       | Hey, author here. For a bit of clarity around IP address
       | hashes: the only use they have in this context is preventing
       | duplicate hits in a day (making each page view unique by
       | default). At the end of each day there is a worker job that
       | empties them out while retaining the hit info.
       | 
       | I've added an edit to the essay for clarity.
        
         | dantiberian wrote:
         | You should add this as a reply to the top comment as well.
        
         | bosch_mind wrote:
         | If 10 users share an IP on a shared VPN around the globe and
         | hit your site, you only count that as 1? What about corporate
         | networks, etc? IP is a bad indicator
        
           | Culonavirus wrote:
           | It's a bad indicator especially since people who would
           | otherwise not use VPN apparently started using this:
           | https://support.apple.com/en-us/102602
        
           | Galanwe wrote:
           | Not even mentioning CGNAT.
        
       | fishtoaster wrote:
       | The idea of using CSS-triggered requests for analytics was really
       | cool to me when I first encountered it.
       | 
       | One guy on twitter (no longer available) used it for mouse
       | tracking: overlay an invisible grid of squares on the page, each
       | with a unique background image triggered on hover. Each
       | background image sends a specific request to the server, which
       | interprets it!
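       | 
       | Generating that grid is mechanical; roughly (a hypothetical
       | sketch of the idea, not his actual code):
       | 
       |     # one hover rule per cell: the first hover over a cell
       |     # fires a uniquely named request the server can log
       |     def grid_css(rows: int, cols: int) -> str:
       |         rules = []
       |         for r in range(rows):
       |             for c in range(cols):
       |                 rules.append(
       |                     f"#cell-{r}-{c}:hover {{ background:"
       |                     f" url('/track/{r}/{c}') }}"
       |                 )
       |         return "\n".join(rules)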
       | 
       | For fun one summer, I extended that idea to create a JS-free "css
       | only async web chat": https://github.com/kkuchta/css-only-chat
        
       | joewils wrote:
       | This is really clever and fun. I'm curious how the results
       | compare to something "old school" like AWStats?
       | 
       | https://github.com/eldy/awstats
        
       | stmblast wrote:
       | Bearblog is AWESOME!!
       | 
       | I use it for my personal stuff and it's great. No hassle, just
       | paste your markdown in and you're good to go.
        
       ___________________________________________________________________
       (page generated 2023-11-01 23:00 UTC)