[HN Gopher] Shadow traffic: site visits that are not captured by...
___________________________________________________________________
Shadow traffic: site visits that are not captured by typical
analytics providers

Author : ahstilde
Score  : 53 points
Date   : 2020-08-18 17:16 UTC (5 hours ago)

(HTM) web link (blog.parse.ly)
(TXT) w3m dump (blog.parse.ly)

| ChuckMcM wrote:
| Okay, the cynic in me wants to write "New Age web designers are stumped by lack of analytics while still refusing to look at their HTTP server log data."
|
| I remember when the _ONLY_ analytics were those you could derive by analyzing your HTTP logs, which have useful information in them: the source IP address (which can be geo-tagged), a bunch of HTTP headers (which are full of information too), and a timestamp that tells you when the request came in. Not to mention session cookies, which take zero JavaScript to implement.
|
| I've been retooling my site slowly to _only_ use these analytics (less the cookies) because I value people's privacy while browsing as much as my own. During the transition I've been comparing what I can pull out of the logs vs. what Google Analytics gives me. Sure, Google can do wonders, especially if the person is coming from a browser where they are logged into Google. But, as the article points out, they miss everyone running NoScript and/or other privacy enhancers like Privacy Badger from the EFF.
|
| I don't feel like I'm going to miss the Google added insights.
| butterfi wrote:
| The last time I processed my own web logs I used Urchin. What are the good choices these days for log processing?
| ChuckMcM wrote:
| I don't know; I just use Perl to extract the data and feed it into InfluxDB. Then I pull data sets from InfluxDB into NumPy and process them however I want.
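A minimal sketch of the log-first workflow ChuckMcM and butterfi are discussing, using only the standard library. The regex assumes the common Apache/nginx "combined" log format, and the sample line is invented for illustration:

```python
import re
from collections import Counter

# Apache/nginx "combined" log format (assumed layout):
# IP ident user [timestamp] "METHOD path HTTP/x" status bytes "referrer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return the named fields of one access-log line, or None."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def summarize(lines):
    """Count successful page views per path and collect client IPs."""
    views, ips = Counter(), set()
    for line in lines:
        hit = parse_line(line)
        if hit and hit["status"].startswith("2"):
            views[hit["path"]] += 1
            ips.add(hit["ip"])
    return views, ips

# Invented sample line for demonstration.
sample = [
    '203.0.113.7 - - [18/Aug/2020:17:16:01 +0000] '
    '"GET /post/shadow-traffic HTTP/1.1" 200 5123 '
    '"https://news.ycombinator.com/" "Mozilla/5.0"',
]
views, ips = summarize(sample)
print(views.most_common(1), len(ips))
```

From here the per-path counters can be fed into a time-series store like InfluxDB, as ChuckMcM mentions, or the raw logs handed to a tool like GoAccess.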
| GrantZvolsky wrote:
| I recently decided not to use GA on my site in favour of server logs for all the reasons you mention, plus one that I see as the most important: it would require me to annoy users with a cookie consent banner.
| neonate wrote:
| This is pretty funny, buried in the middle of the article:
|
| > Option 2 - Server-Side Tracking
| ChuckMcM wrote:
| With advice that this is so technically complex, don't even look there.
|
| These guys can write JavaScript that animates a web page so that it looks like a turning page, and it's "too technical" to pull data out of a server log?
| DevX101 wrote:
| I recognized this a couple of years ago at a startup I work with. Comparing Google Analytics numbers to validated event logs, the numbers were off by ~20-30%. Surely there must be a quick workaround, I thought; there's no way there's an entire multi-billion-dollar industry of 3rd-party analytics software giving bogus numbers to websites?! But that's indeed the case. I immediately made it a top priority to build out an in-house analytics platform where event logs were sent via the API and thus didn't get blocked.
|
| And for those saying relative direction is all that matters, I guarantee you the behavior of users with an ad blocker installed is very different from those who can't be bothered or don't know how.
| tiffanyh wrote:
| For clarity, you'd recommend web server logs over client-side analytics?
|
| If so, what open-source web server analytics tool do you believe is best? E.g. https://goaccess.io
| sharkweek wrote:
| Let me tell you what was fun: trying to explain to a former boss why Facebook ads showed one number for the amount of traffic sent, Google Analytics showed a different number for traffic from those ads, and then our server logs showed an entirely different number!
| paulchap wrote:
| Funnily enough, I got a warning telling me uBlock prevented this page from loading...
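A tiny sketch of the comparison DevX101 describes: once you trust a server-side count, the client-side gap falls out directly. The figures below are illustrative, in the ~20-30% range mentioned, not taken from any real site:

```python
def shadow_share(server_hits: int, client_hits: int) -> float:
    """Fraction of visits present in the server logs but missing
    from client-side analytics (blocked scripts, NoScript, etc.)."""
    if server_hits <= 0:
        raise ValueError("server_hits must be positive")
    return max(0.0, (server_hits - client_hits) / server_hits)

# Illustrative figures only.
print(f"{shadow_share(1000, 750):.0%}")  # prints 25%
```

As DevX101 notes, this share is not a constant correction factor you can apply once: the blocked users behave differently, so the gap shifts with your audience mix.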
| billyhoffman wrote:
| I get wanting to create valuable thought-leadership content, but this is the worst example of bad product marketing:
|
| 1. Present a new concept to readers (shadow visitors).
|
| 2. Show how this concept is scary and bad for your business (your analytics are off by 20%!!!).
|
| 3. Present 2 options, which by the way are free, but immediately shit all over them (Server logs! But that's hard and complicated! Edge logs! But getting those is hard!).
|
| 4. Present your company's product as option 3, which, surprisingly, has no downsides and isn't shit upon.
|
| 5. Profit.
|
| What disingenuous garbage. You should be ashamed, Parse.ly.
|
| The right way to do this is to do steps 1 and 2. Then show in detail step 3: how to solve the problem with easy options, ideally with free and open software. It's OK to show edge cases, corner cases, or just sheer scale issues that make these options challenging.
|
| The difference is that good product marketing pieces show people how to solve the problem and offer a solution to do that at scale or in an automated/hosted way so the customer doesn't have to deal with it.
|
| If your product marketing content's message is "you are screwed unless you buy our product," you are doing it wrong.
| dddddaviddddd wrote:
| I had an article on the front page of Hacker News last year that had about 17,000 real visits, as determined by analysing my server log files. I was also using Google Analytics at the time, which told me I had 10,000 visitors (of which only 7 were using Firefox!).
|
| Obviously there's a gap between what trackers say and reality, bigger for some demographics than for others.
| ta17711771 wrote:
| > of which only 7 were using Firefox
|
| Were _reporting_ using Firefox.
|
| Also, not surprising; Firefox security leaves a lot to be desired.
| marcinzm wrote:
| > Also, not surprising, Firefox security leaves a lot to be desired.
|
| Huh?
| Blocking Google Analytics tracking is a positive, not a negative, regarding security.
| vlovich123 wrote:
| He's talking about user-agent spoofing.
| inetknght wrote:
| I'm not so sure. It shouldn't matter whatsoever what user agent I'm using. In fact, not sending the user-agent field would be massively better, if only it weren't itself a unique datapoint.
| marcinzm wrote:
| Still not sure how that ties into Firefox having worse security. User-agent spoofing is a privacy item, and a positive one.
| [deleted]
| gentleman11 wrote:
| Obvious question: how do you filter out bot traffic with server-side logs? What percent of visitors are bots anyway?
| bleepblorp wrote:
| Most legitimate bots identify themselves with specific user-agent strings.
|
| Script-kiddie attack bots are generally fairly obvious, as they hammer away at things like /wp-login.php for days on end regardless of what error codes the server returns.
|
| Most other bots are pretty evident just by looking at access patterns. Just identify their IPs and drop them from your analytics.
| compumike wrote:
| There are off-the-shelf open-source libraries for this that are pretty decent and are kept up to date by the community. For example, you can just do browser.is_bot? after you install https://github.com/fnando/browser#bots
| ratsbane wrote:
| Watching the bot traffic is an interesting exercise in itself. The trick is not to filter it out; it's to identify it (to the greatest degree possible).
| toast0 wrote:
| User-Agent for nice bots and client ASN for naughty bots will get you pretty far. Fake Chrome from residential IPs is hard to detect, though.
| pkaye wrote:
| What does parse.ly do differently to account for this discrepancy in the analytics?
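The user-agent and probe-path heuristics bleepblorp describes above can be sketched in a few lines. The UA substrings and path list below are illustrative, not an exhaustive bot signature set:

```python
import re

# Heuristics from the thread: well-behaved bots identify themselves
# in the User-Agent string; attack bots hammer known probe paths.
BOT_UA = re.compile(r"bot|crawl|spider|slurp|curl|wget|python-requests", re.I)
PROBE_PATHS = {"/wp-login.php", "/xmlrpc.php", "/.env"}

def looks_like_bot(user_agent: str, path: str) -> bool:
    """Flag a hit as automated by UA substring or probe path."""
    return bool(BOT_UA.search(user_agent or "")) or path in PROBE_PATHS

hits = [
    ("Mozilla/5.0 (Windows NT 10.0) Chrome/84.0", "/blog/post-1"),
    ("Googlebot/2.1 (+http://www.google.com/bot.html)", "/blog/post-1"),
    ("Mozilla/5.0", "/wp-login.php"),
]
human = [h for h in hits if not looks_like_bot(*h)]
print(len(human))  # prints 1
```

As toast0 points out, this catches the polite and the obvious bots; fake Chrome from residential IPs needs heavier signals (ASN, access patterns) that simple per-line rules can't express.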
| ghgr wrote:
| The other side of the coin is the inflated traffic statistics that include all kinds of bots & crawlers with spoofed user agents; for low-traffic sites like niche personal blogs with a custom domain, they can be 90+% of the server-side logged visits.
|
| How proud was the 15-year-old me with my first .com domain, having over 100 visitors per day. Little did I know that the actual number of visitors was much, much less than that.
| indymike wrote:
| Server-side tracking is useful for logging HTTP requests. Client-side tracking is useful for logging user interactions. There used to be a small difference between server and client due to caches and user settings... but modern apps (i.e. React, Vue, Angular) often load only one page, and then all interaction is managed by client-side code, so often client-side tracking is the only thing that works.
| jklinger410 wrote:
| This is pretty much a Parse.ly ad. I think the HN crowd is pretty aware that ad blockers, VPNs, etc. can break analytics.
| acdha wrote:
| I think you're too quick to dismiss it. It's one thing to know that it exists and another to recognize that it's somewhere between 20-40% of your total traffic, especially unprompted. I've had many conversations where people assumed their traffic numbers were real until this point was raised, at which point everyone metaphorically slapped their foreheads and realized that they had forgotten to take this into account.
| dylz wrote:
| Another quite fun thing is that this depends on your site vertical and demographic: on some of the esports-related properties I've worked on in the past, adblock rates (that also block analytics) may exceed 80%. I have seen 90% before (along with per-device differences, like suddenly thinking almost all of your traffic is mobile, because all desktop users are blocking client-side analytics).
| throwaway287391 wrote:
| When does this actually matter though?
| Isn't growth (or shrinkage) what you'd normally really care about (e.g. this month we had 10% more DAUs than last month)? I suppose if you changed something to attract tons of new users who disproportionately use AdBlock (for example) this becomes an issue, as it wouldn't show up in your metrics, but is that sort of thing common?
|
| I suppose if nothing else it's good to know so you can immediately beef up your numbers +20% in your slide deck for VCs.
| acdha wrote:
| What if you're doing anything which doesn't involve logins -- public information, advertising, etc. -- where users don't otherwise trigger something like account creation / logins?
|
| What if you're trying to get stats about people who don't convert or otherwise give you a signal that they're using the site?
|
| I've run into sites where things like signup or checkout were blocked behind an analytics tracker (Adobe used to recommend running theirs in a synchronous, navigation-blocking mode), which meant that any problem with that service was completely invisible unless users contacted you to complain.
|
| I also remember people wondering why Firefox users stopped using their site when the release which enabled tracking protection by default shipped.
| throwaway287391 wrote:
| Good points, thanks!
| inetknght wrote:
| I agree about this being a Parse.ly ad. However, I also think there's a large minority of HN users who _aren't_ aware of just how broken client-side analytics are, or for how many of their users.
| choeger wrote:
| You should use server-side tracking.
|
| There is no reason to design your website in a way that makes your legitimate analysis use cases depend on client-side computations.
|
| If server-side tracking looks too complex for you, you might want to reevaluate the balance of technical knowledge in your enterprise.
| XCSme wrote:
| What if your site is a SPA?
| You would not know, for example, the time spent on site, what pages are visited, where exactly users leave, or whether there are client-side errors, right?
| igneo676 wrote:
| Use a hybrid approach:
|
| 1. Server-side analytics
|
| 2. Client-side analytics for information akin to what you're asking about that server-side misses
|
| 3. Crash analytics for client-side errors
|
| As mentioned, though, you're only getting partial info from some of those options. It also gives you a chance to decide which of these you _really_ need and hopefully eliminate anything you don't.
| wlll wrote:
| Then now you have two problems ;)
|
| I guess your option here is to collect metrics in JS and hope that whatever reason 20% of visitors don't show up in Google Analytics isn't also preventing them from using your site.
| acdha wrote:
| If you're using a SPA you need to build that instrumentation in to match the native behaviour, along with robust error handling using something like https://github.com/getsentry/sentry/ so you can tell when your code is broken client-side, where you would otherwise have no visibility.
|
| This is much less likely to be blocked if you self-host it -- breaking requests to your own server will break the app, whereas blocking common cross-site tracking services is popular because there are few drawbacks for the user.
| XCSme wrote:
| You can self-host Sentry?
| acdha wrote:
| Yes, it's pretty easy to run the open-source app in your favorite container runtime:
|
| https://hub.docker.com/_/sentry/
| XCSme wrote:
| Wow, I didn't know that. I remember using it at my last company; we always kept receiving quota warnings, and their higher plans were really expensive.
| NetToolKit wrote:
| > If Server-Side tracking looks too complex for you, you might want to reevaluate the balance of technical knowledge in your enterprise.
|
| Shameless self-plug: if server-side analytics is too complicated for you, consider using the tool that we just launched to help with that (other functionality is included as well): https://www.nettoolkit.com/gatekeeper/about
| snowwrestler wrote:
| Server-side tracking is what everyone started off with. There is a reason client-side analytics won in the marketplace: they just have a better balance of advantages to disadvantages.
| bzb4 wrote:
| Client-side analytics won because setting them up involves copying and pasting some code into your HTML.
| alexchamberlain wrote:
| What are those advantages and disadvantages?
| aeyes wrote:
| Try analyzing a TB of logs per day when all you really want is aggregated statistics. Or really any non-trivial amount of data; even 1 GB is a problem.
|
| If you want to know click paths, you'll need additional data in your logs.
|
| If you want to know how much time the average user spent reading an article on your blog, you are probably out of luck using logs.
| snowwrestler wrote:
| The advantage is that you can measure anything that happens in the browser. As a product owner, what you really care about is the experience you give your visitor/customer, and that happens in the browser, not at the server. This advantage has become stronger over time as sites have used more JavaScript.
|
| One disadvantage is that if your visitor/customer has JavaScript turned off, you get no data. This was a concern in the early days of client-side analytics, but not really any more.
|
| A more modern disadvantage is that ad blockers might prevent your analytics script from running. However, this is only a problem for client-side analytics packages that are hosted by ad companies, like Google Analytics. It's not a problem with the concept of client-side analytics in general.
|
| EDIT to add:
|
| Another advantage is that only measuring things in the browser makes it a lot easier to exclude non-browser traffic like bots and spiders from your reports.
|
| That's also a disadvantage, because you can miss server-only events like "hot-linked" images or PDF downloads straight from Google. On balance, though, we care a lot less today about hot-linked files than we care about excluding automated traffic.
|
| And in my experience, culturally, client-side packages were a huge help in getting management off of pointless vanity metrics like "hit counts" and caring more about human metrics like visits and time.
| LunaSea wrote:
| Some things can't be measured on the server side.
|
| The % of a video watched, for example, would be broken on the server side due to player buffering.
| inetknght wrote:
| > _% of a video watched_
|
| I would consider that to be private information regardless of the reason. Why should e.g. YouTube know where I stopped in the video?
|
| > _would be broken on the server side due to player buffering._
|
| Would it, really?
|
| You don't think that "this much of the video was buffered and _possibly_ displayed" is useful information?
|
| You don't think that "60 seconds of a 10-minute video was buffered and _possibly displayed_, and another 30 seconds of buffer was requested every 30 seconds for 5 minutes" is useful information?
|
| You don't think that you can determine that the user stopped watching the video after between four-and-a-half and five-and-a-half minutes of the video had played?
| choeger wrote:
| You cannot reliably measure that on the client side, either. But if you assume a "normal" use case anyway, you could easily track the percentage of the video stream requested by a client. That should approximate the percentage watched. Of course, you need a certain control over your infrastructure for that.
| If you outsourced everything to a CDN and thus have no clue what happens to your videos, well, you probably weren't that interested in the data in the first place, were you?
| thinkloop wrote:
| Does hosting client-side tracking on your own domain circumvent all the problems? How come that hasn't become the standard and killed 3rd-party trackers? If it's a question of having to manage an analytics platform, can't that still be deferred to a 3rd party, but through your own subdomain?
| pixelmonkey wrote:
| I'm one of Parse.ly's co-founders. This post was written by one of our product managers about a project and investigation we've been doing for the past few months. It first got on my team's radar when I posted this set of tweets back in 2019:
|
| https://twitter.com/amontalenti/status/1165262620959617025
|
| Specifically: I noticed a huge difference between the metrics we were reporting on my blog post in Parse.ly and the metrics being reported by my personal blog's Cloudflare CDN (caching the content).
|
| Ironically enough, this traffic was all coming from HN, and the post was itself about modern JavaScript[1].
|
| Since then, we've also been hearing from a lot of customers about various scenarios where traffic is either under-counted or mis-counted. For example, something that has been tripping us up lately is that our Twitter integration relies (partially) upon the official t.co link shortener[2], and yet, due to modern browser rules related to the W3C Referrer Policy[3], the t.co link's path segment is often not transmitted to the analytics provider, and thus the source tweet for traffic cannot be easily ascertained.
|
| I firmly believe in privacy and analytics without compromise[4], so the team is trying to come up with ways to at least quantify shadow traffic at an aggregate level and to ensure legitimate user privacy interests are honored, while making sure they don't break legitimate privacy-safe first-party analytics use cases.
|
| As a developer, something that concerned me recently was realizing that Sentry, the open-source error-tracking tool with a SaaS reporting frontend and a JavaScript SDK[5], gets blocked in many conservative browser privacy setups. Though the interest in user privacy is legitimate, I think we can all agree it'd be better for site/app operators to know when certain browsers are hitting JavaScript stack traces.
|
| [1]: https://news.ycombinator.com/item?id=20785616
|
| [2]: https://help.twitter.com/en/using-twitter/url-shortener
|
| [3]: https://www.w3.org/TR/referrer-policy/
|
| [4]: https://blog.parse.ly/post/3394/analytics-privacy-without-co...
|
| [5]: https://sentry.io/for/javascript/
| ThePhysicist wrote:
| Absolute numbers tend to be overrated in analytics. Often relative numbers, like the number of conversions per tracked user, matter more. Also, if your product is targeted at privacy-savvy individuals like developers, who often use blockers, you might be better off using server-side tracking. That seems to have become a lost art, though, especially since many sites use CDNs that hide a lot of visits for cacheable content.
| thekyle wrote:
| If you use CloudFront (AWS), they have the option for server-side tracking built in. You just tell them which S3 bucket to dump the logs into and you get the raw HTTP requests with timestamps. I personally use a service called s3stat which takes those dumps and turns them into pretty graphs.
| [deleted]
___________________________________________________________________
(page generated 2020-08-18 23:02 UTC)