[HN Gopher] Shadow traffic: site visits that are not captured by...
       ___________________________________________________________________
        
       Shadow traffic: site visits that are not captured by typical
       analytics providers
        
       Author : ahstilde
       Score  : 53 points
       Date   : 2020-08-18 17:16 UTC (5 hours ago)
        
 (HTM) web link (blog.parse.ly)
 (TXT) w3m dump (blog.parse.ly)
        
       | ChuckMcM wrote:
       | Okay, the cynic in me wants to write "New Age web designers are
       | stumped by lack of analytics while still refusing to look at
       | their HTTP server log data."
       | 
        | I remember when the _ONLY_ analytics were those you could derive
        | by analyzing your HTTP logs, which have plenty of useful
        | information in them: the source IP address (which can be geo-
        | tagged), a bunch of HTTP headers (which are full of information
        | too), and a timestamp that tells you when each request came in.
        | Not to mention session cookies, which take zero JavaScript to
        | implement.
       | 
       | I've been retooling my site slowly to _only_ use these analytics
        | (less the cookies) because I value people's privacy while
       | browsing as much as my own. During the transition I've been
       | comparing what I can pull out of the logs vs what Google's
       | analytics gives me. Sure, Google can do wonders, especially if
       | the person is coming from a browser where they are logged into
       | Google. But, as the article points out, they miss everyone
       | running noscript and/or other privacy enhancers like Privacy
       | Badger from EFF.
       | 
       | I don't feel like I'm going to miss the Google added insights.
        
         | butterfi wrote:
         | The last time I processed my own web logs I used Urchin. What
         | are the good choices these days for log processing?
        
           | ChuckMcM wrote:
           | I don't know, I just use perl to extract the data and feed it
           | into influx. Then I pull data sets from influx into numpy and
            | process it however I want.
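ChuckMcM's Perl-to-InfluxDB pipeline is one way to do it; below is a minimal sketch of the same extract-and-aggregate idea in Python, assuming the common Apache/nginx Combined Log Format (the regex group names are illustrative):

```python
import re
from collections import Counter

# Regex for the Apache/nginx Combined Log Format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Extract the fields a log-based analytics pass needs."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

def hits_per_path(lines):
    """Count successful requests per path, the simplest 'analytics'."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["status"].startswith("2"):
            counts[rec["path"]] += 1
    return counts
```

From records like these you can derive most of the basics: geo-tag the IP, bucket the timestamps, and group by referrer or user agent.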
        
         | GrantZvolsky wrote:
         | I recently decided not to use GA on my site in favour of server
         | logs for all the reasons you mention plus one that I see as the
         | most important: it would require me to annoy users with a
         | cookie consent banner.
        
         | neonate wrote:
         | This is pretty funny, buried in the middle of the article:
         | 
         | > Option 2 - Server-Side Tracking
        
           | ChuckMcM wrote:
            | Complete with advice that it's so technically complex you
            | shouldn't even look there.
           | 
            | These guys can write JavaScript that animates a web page so
            | that it looks like a turning page, and it's "too technical"
            | to pull data out of a server log?
        
       | DevX101 wrote:
        | I recognized this a couple of years ago at a startup I worked
        | with. Comparing Google Analytics numbers to validated event logs,
        | the numbers were off by ~20-30%. Surely there must be a quick
        | workaround, I thought; there's no way there's an entire multi-
        | billion $ industry of 3rd party analytics software giving bogus
        | numbers to websites?! But that's indeed the case. I immediately
        | made it a top priority to build out an in-house analytics
        | platform where event logs were sent via the API and thus didn't
        | get blocked.
       | 
       | And for those saying relative direction is all that matters, I
       | guarantee you the behavior of users with adblocker installed is
       | very different from those who can't be bothered or don't know
       | how.
        
         | tiffanyh wrote:
         | For clarity, so you'd recommend web server logs over client
         | side analytics?
         | 
          | If so, what open source web server analytics tool do you
          | believe is best? E.g. https://goaccess.io
        
         | sharkweek wrote:
         | Let me tell you what was fun - trying to explain to a former
         | boss why Facebook ads showed one number for amount of traffic
         | sent, Google Analytics showed a different number for traffic
         | from those ads, then lastly our server logs showing an entirely
         | different number!
        
       | paulchap wrote:
       | Funnily enough, I got a warning telling me uBlock prevented this
       | page from loading...
        
       | billyhoffman wrote:
        | I get wanting to create valuable thought leadership content, but
        | this is the worst example of bad product marketing:
       | 
       | 1- present a new concept to readers (shadow visitors)
       | 
       | 2- show how this concept is scary and bad for your business (your
       | analytics is off by 20%!!!)
       | 
       | 3- present 2 options, which by the way are free, but immediately
       | shit all over them (Server logs! But that's hard and complicated!
       | Edge logs! But getting these are hard!)
       | 
        | 4- present your company's product as option 3, which,
        | surprisingly, has no downsides and isn't shit upon
       | 
       | 5- profit
       | 
       | What disingenuous garbage. You should be ashamed Parse.ly.
       | 
        | The right way to do this is do steps 1 and 2. Then show in detail
        | step 3: how to solve the problem with easy options, ideally with
        | free and open software. It's ok to show edge cases, corner cases,
        | or just sheer scale issues that make these options challenging.
       | 
       | The difference is that good product marketing pieces show people
       | how to solve the problem and offer a solution to do that at scale
        | or in an automated/hosted way so the customer doesn't have to deal
       | with it.
       | 
        | If your product marketing content's message is "you are screwed
        | unless you buy our product," you are doing it wrong.
        
       | dddddaviddddd wrote:
       | I had an article on the front page of Hacker News last year that
       | had about 17,000 real visits, as determined by analysing my
       | server log files. I was also using Google Analytics at the time,
       | which told me I had 10,000 visitors (of which only 7 were using
       | Firefox!).
       | 
       | Obviously there's a gap between what trackers say and reality,
       | bigger for some demographics than for others.
        
         | ta17711771 wrote:
         | > of which only 7 were using Firefox
         | 
         | Were _reporting_ using Firefox.
         | 
         | Also, not surprising, Firefox security leaves a lot to be
         | desired.
        
           | marcinzm wrote:
           | >Also, not surprising, Firefox security leaves a lot to be
           | desired.
           | 
           | Huh? Blocking google analytics tracking is a positive, not a
           | negative regarding security.
        
             | vlovich123 wrote:
             | He's talking about user agent spoofing.
        
               | inetknght wrote:
                | I'm not so sure. It shouldn't matter whatsoever what user
                | agent I'm using. In fact, not sending the user agent
                | field at all would be massively better, if only its
                | absence weren't itself a unique datapoint.
        
               | marcinzm wrote:
                | Still not sure how that ties into Firefox having worse
                | security. User agent spoofing is a privacy measure, and a
                | positive one at that.
        
           | [deleted]
        
       | gentleman11 wrote:
       | Obvious question: how do you filter out bot traffic with server
       | side logs? What percent of visitors are bots anyway?
        
         | bleepblorp wrote:
         | Most legitimate bots identify themselves with specific user
         | agent strings.
         | 
         | Script kiddie attack bots are generally fairly obvious as they
         | hammer away at things like /wp-login.php for days on end
         | regardless of what error codes the server returns.
         | 
         | Most other bots are pretty evident just by looking at access
         | patterns. Just identify their IPs and drop them from your
         | analytics.
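These heuristics (self-identifying user agents, scripts hammering /wp-login.php) can be sketched as a cheap first-pass filter over log records; the marker and path lists below are illustrative, not exhaustive:

```python
# Substrings that identify most well-behaved crawlers (illustrative list).
BOT_UA_MARKERS = ("bot", "crawler", "spider", "slurp", "curl", "wget")

# Paths that real visitors almost never request, but attack scripts hammer.
PROBE_PATHS = ("/wp-login.php", "/xmlrpc.php", "/phpmyadmin")

def looks_like_bot(user_agent, path):
    """Cheap first-pass bot check for an access-log record."""
    ua = (user_agent or "").lower()
    if any(marker in ua for marker in BOT_UA_MARKERS):
        return True
    if path in PROBE_PATHS:
        return True
    return False
```

Records flagged this way can be dropped from the analytics pass while still being kept in the raw logs for inspection.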
        
         | compumike wrote:
         | There are off-the-shelf open-source libraries for this that are
         | pretty decent and are kept up-to-date by the community. For
         | example, you can just do browser.is_bot? after you install
         | https://github.com/fnando/browser#bots
        
         | ratsbane wrote:
         | Watching the bot traffic is an interesting exercise in itself.
         | The trick is not to filter it out; it's to identify it (to the
         | greatest degree possible.)
        
         | toast0 wrote:
         | User-Agent for nice bots, and client ASN for naughty bots will
         | get you pretty far. Fake chrome from residential IPs is hard to
         | detect though.
        
       | pkaye wrote:
       | What does parse.ly do differently to account for this discrepancy
       | in the analytics?
        
       | ghgr wrote:
        | The other side of the coin is the inflated traffic statistics
        | that include all kinds of bots & crawlers with spoofed user
        | agents; for low-traffic sites like niche personal blogs with a
        | custom domain, these can be 90+% of the server-side logged
        | visits.
       | 
        | How proud 15-year-old me was with my first .com domain, getting
        | over 100 visitors per day. Little did I know that the actual
        | number of visitors was much, much lower.
        
       | indymike wrote:
       | Server side tracking is useful for logging http requests. Client
       | side tracking is useful for logging user interactions. Used to be
       | there would be a small difference between server and client due
       | to caches and user settings... but... Modern apps (i.e. React,
       | Vue, Angular) often only load one page, and then all interaction
       | is managed by client side code, so often client side tracking is
       | the only thing that works.
        
       | jklinger410 wrote:
        | This is pretty much a Parse.ly ad. I think the HN crowd is pretty
        | aware that ad blockers, VPNs, etc. can break analytics.
        
         | acdha wrote:
         | I think you're too quick to dismiss it. It's one thing to know
         | that it exists and another to recognize that it's somewhere
         | between 20-40% of your total traffic, especially unprompted.
          | I've had many conversations where people assumed their traffic
          | numbers were real until this was raised, at which point
          | everyone metaphorically slapped their foreheads and realized
          | that they had forgotten to take it into account.
        
           | dylz wrote:
           | another quite fun thing is that this depends on your site
           | vertical and demographic: on some of the esports-related
           | properties i've worked on in the past, adblock rates (that
           | also block analytics) may exceed 80%. i have seen 90% before
           | (along with per-device differences, like you suddenly think
           | almost all of your traffic is mobile - because all desktop
           | users are blocking client side analytics)
        
           | throwaway287391 wrote:
           | When does this actually matter though? Isn't growth (or
           | shrinkage) what you'd normally really care about (e.g. this
           | month we had 10% more DAUs than last month)? I suppose if you
           | changed something to attract tons of new users who
           | disproportionately use AdBlock (for example) this becomes an
           | issue as it wouldn't show up in your metrics, but is that
           | sort of thing common?
           | 
           | I suppose if nothing else it's good to know so you can
           | immediately beef up your numbers +20% in your slide deck for
           | VCs.
        
             | acdha wrote:
             | What if you're doing anything which doesn't involve logins
             | -- public information, advertising, etc. - where users
             | don't otherwise trigger something like account creation /
             | logins?
             | 
             | What if you're trying to get stats about people who don't
             | convert or otherwise give you a signal that they're using
             | the site?
             | 
             | I've run into sites where things like signup or checkout
             | are blocked behind an analytics tracker (Adobe used to
             | recommend running theirs in a synchronous navigation-
             | blocking mode) which meant that any problem with that
             | service was completely invisible unless they contacted you
             | to complain.
             | 
             | I also remember people wondering why Firefox users stopped
             | using their site when they shipped the release which
             | enabled tracking protection by default.
        
               | throwaway287391 wrote:
               | Good points, thanks!
        
         | inetknght wrote:
          | I agree about this being a Parse.ly ad. However, I also think
          | there's a large minority of HN users who _aren't_ aware of
          | just how broken client-side analytics are, or for how many of
          | their users.
        
       | choeger wrote:
       | You should use Server-Side tracking.
       | 
       | There is no reason to design your website in a way that makes
       | your legitimate analysis use cases depend on Client-Side
       | computations.
       | 
       | If Server-Side tracking looks too complex for you, you might want
       | to reevaluate the balance of technical knowledge in your
       | enterprise.
        
         | XCSme wrote:
         | What if your site is a SPA? You would not know, for example,
         | the time spent on site, what pages are visited, where exactly
         | users leave, if there are client-side errors, right?
        
           | igneo676 wrote:
           | Use a hybrid approach:
           | 
           | 1. Server side analytics
           | 
           | 2. Client side analytics for information akin to what you're
           | asking that Server side misses
           | 
           | 3. Crash analytics for client side errors
           | 
           | As mentioned though, you're only getting partial info from
           | some of those options. It also gives you a chance to decide
           | which of these you _really_ need and hopefully eliminate
           | anything you don't
        
           | wlll wrote:
            | Then you have two problems ;)
            | 
            | I guess your option here is to collect metrics in JS and
            | hope that whatever reason causes 20% of visitors not to show
            | up in Google Analytics isn't also preventing them from using
            | your site.
        
           | acdha wrote:
           | If you're using a SPA you need to build that instrumentation
           | in to match the native behaviour, along with robust error
           | handling using something like
           | https://github.com/getsentry/sentry/ so you can tell when
           | your code is broken client-side where you would otherwise not
           | have visibility.
           | 
           | This is much less likely to be blocked if you self-host it --
           | breaking requests to your server will break the app, whereas
           | blocking common cross-site tracking services is popular
           | because there are few drawbacks for the user.
        
             | XCSme wrote:
             | You can self-host sentry?
        
               | acdha wrote:
               | Yes - it's pretty easy to run the open source app in your
               | favorite container runtime:
               | 
               | https://hub.docker.com/_/sentry/
        
               | XCSme wrote:
               | Wow, I didn't know that. I remember using it at my last
               | company and we always kept receiving quota warnings, and
               | their higher plans were really expensive.
        
         | NetToolKit wrote:
         | > If Server-Side tracking looks too complex for you, you might
         | want to reevaluate the balance of technical knowledge in your
         | enterprise.
         | 
         | Shameless self-plug: if server-side analytics is too
         | complicated for you, consider using the tool that we just
         | launched to help with that (other functionality gets included
         | as well): https://www.nettoolkit.com/gatekeeper/about
        
         | snowwrestler wrote:
         | Server-side tracking is what everyone started off with. There
         | is a reason client-side analytics won in the marketplace. They
         | just have a better balance of advantages to disadvantages.
        
           | bzb4 wrote:
            | Client-side analytics won because setting them up involves
            | copying and pasting some code into your HTML.
        
           | alexchamberlain wrote:
           | What are those advantages and disadvantages?
        
             | aeyes wrote:
             | Try analyzing a TB of logs per day when all you really want
             | is aggregated statistics. Or really any non-trivial amount
             | of data, even 1GB is a problem.
             | 
             | If you want to know click paths you'll need additional data
             | in your logs.
             | 
             | If you want to know how much time the average user spent
             | reading an article on your blog, you are probably out of
             | luck using logs.
        
             | snowwrestler wrote:
             | The advantage is that you can measure anything that happens
             | in the browser. As a product owner, what you really care
             | about is the experience you give your visitor/customer, and
             | that happens in the browser, not at the server. This
             | advantage has become stronger over time as sites have used
             | more javascript.
             | 
             | One disadvantage is that if your visitor/customer has
             | javascript turned off, you get no data. This was a concern
             | in the early days of client-side analytics, but not really
             | any more.
             | 
             | A more modern disadvantage is that ad blockers might
             | prevent your analytics script from running. However, this
             | is only a problem for client-side analytics packages that
             | are hosted by ad companies, like Google Analytics. It's not
             | a problem with the concept of client-side analytics in
             | general.
             | 
             | EDIT to add:
             | 
             | Another advantage is that only measuring things in the
             | browser makes it a lot easier to exclude non-browser
             | traffic like bots and spiders from your reports.
             | 
             | That's also a disadvantage because you can miss server-only
             | events like "hot-linked" images or PDF downloads straight
             | from Google. On balance, though, we care a lot less today
             | about hot-linked files than we care about excluding
             | automated traffic.
             | 
             | And in my experience, culturally, client-side packages were
             | a huge help in getting management off of pointless vanity
             | metrics like "hit counts" and caring more about human
             | metrics like visits and time.
        
         | LunaSea wrote:
         | Some things can't be measured on the server side.
         | 
         | % of a video watched for example would be broken on the server
         | side due to player buffering.
        
           | inetknght wrote:
           | > _% of a video watched_
           | 
           | I would consider that to be private information regardless of
           | the reason. Why should eg Youtube know where I stopped in the
           | video?
           | 
           | > _would be broken on the server side due to player
           | buffering._
           | 
           | Would it, really?
           | 
            | You don't think that "this much of the video was buffered
            | and _possibly_ displayed" is useful information?
            | 
            | You don't think that "60 seconds of a 10 minute video was
            | buffered and _possibly displayed_, and another 30 seconds of
            | buffer was requested every 30 seconds for 5 minutes" is
            | useful information?
           | 
           | You don't think that you can determine that the user stopped
           | watching the video after between four-and-a-half to five-and-
           | a-half minutes of the video had played?
        
           | choeger wrote:
           | You cannot reliably measure that on the client-side, either.
           | But if you assume a "normal" use case anyways, you could
           | easily track the percentage of the video stream requested by
           | a client. That should approximate percentage watched. Of
           | course you need to have certain control over your
           | infrastructure for the matter. If you outsourced everything
           | to a CDN and thus have no clue what happens to your videos,
           | well, you probably weren't that interested in the data in the
           | first place, weren't you?
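This approximation can be sketched in a few lines, assuming you log the byte ranges served per client/session and know the file's total size (the function name is my own):

```python
def percent_requested(ranges, total_bytes):
    """Approximate %-watched as the fraction of distinct bytes served.

    `ranges` is a list of (start, end) byte offsets (inclusive) pulled
    from Range-request log entries for one client/session.
    """
    if total_bytes <= 0:
        return 0.0
    # Merge overlapping/adjacent ranges so re-buffered segments
    # aren't double-counted.
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    covered = sum(end - start + 1 for start, end in merged)
    return 100.0 * covered / total_bytes
```

As the comment notes, this only approximates watching: bytes buffered are not necessarily bytes displayed.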
        
       | thinkloop wrote:
       | Does hosting client-side tracking on your own domain circumvent
       | all the problems? How come that hasn't become the standard and
       | killed 3rd-party trackers? If it's a question of having to manage
       | an analytics platform, can't that still be deferred to a 3rd-
       | party but through your own subdomain?
        
       | pixelmonkey wrote:
       | I'm one of Parse.ly's co-founders. This post was written by one
       | of our product managers about a project and investigation we've
       | been doing for the past few months. It first got on my team's
       | radar when I posted this set of tweets back in 2019:
       | 
       | https://twitter.com/amontalenti/status/1165262620959617025
       | 
       | Specifically: I noticed a huge difference between the metrics we
       | were reporting on my blog post in Parse.ly, and the metrics being
       | reported by my personal blog's Cloudflare CDN (caching the
       | content).
       | 
       | Ironically enough, this traffic was all coming from HN and the
       | post was itself about modern JavaScript[1].
       | 
       | Since then, we've also been hearing from a lot of customers about
       | various scenarios where traffic is either under-counted or mis-
       | counted. For example, something that has been tripping us up
       | lately is that our Twitter integration relies (partially) upon
       | the official t.co link shortener[2], and yet, due to modern
       | browser rules related to W3C Referrer Policy[3], the t.co link's
       | path segment is often not transmitted to the analytics provider,
       | and thus the source tweet for traffic cannot be easily
       | ascertained.
       | 
       | I firmly believe in privacy and analytics without compromise[4],
       | so the team is trying to come up with ways to at least quantify
       | shadow traffic at an aggregate level, and to ensure legitimate
       | user privacy interests are honored, while making sure they don't
       | break legitimate privacy-safe first-party analytics use cases.
       | 
       | As a developer, something that concerned me recently was
       | realizing that Sentry, the open source error tracking tool with a
       | SaaS reporting frontend and a JavaScript SDK[5], gets blocked in
        | many conservative browser privacy setups. Though the interest in
        | user privacy is legitimate, I think we can all agree it'd be
        | better for site/app operators to know when certain browsers are
        | hitting JavaScript stack traces.
       | 
       | [1]: https://news.ycombinator.com/item?id=20785616
       | 
       | [2]: https://help.twitter.com/en/using-twitter/url-shortener
       | 
       | [3]: https://www.w3.org/TR/referrer-policy/
       | 
       | [4]: https://blog.parse.ly/post/3394/analytics-privacy-without-
       | co...
       | 
       | [5]: https://sentry.io/for/javascript/
        
       | ThePhysicist wrote:
       | Absolute numbers tend to be overrated in analytics. Often
       | relative numbers, like the number of conversions per tracked user
       | matter more. Also, if your product is targeted at privacy-savvy
        | individuals like developers, who often use blockers, you might be
       | better off using server-side tracking. That seems to have become
       | a lost art though, especially since many sites use CDNs that hide
       | a lot of visits for cacheable content.
        
         | thekyle wrote:
          | If you use CloudFront (AWS), they have the option for server-
          | side tracking built-in. You just tell them which S3 bucket to
         | dump the logs into and you get the raw HTTP requests with
         | timestamps. I personally use a service called s3stat which
         | takes those dumps and turns them into pretty graphs.
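Those raw dumps are tab-separated lines with a `#Fields:` directive naming the columns; a minimal generic parser for that W3C-extended format might look like this (a sketch, not s3stat's actual implementation):

```python
def parse_w3c_log(lines):
    """Parse W3C-extended (tab-separated) access logs, such as CloudFront
    standard logs, using the '#Fields:' directive to name each column."""
    fields = None
    for line in lines:
        if line.startswith("#Fields:"):
            # The directive names the columns for the data lines below.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#") or fields is None or not line.strip():
            continue  # skip other directives, blanks, and pre-header lines
        else:
            yield dict(zip(fields, line.rstrip("\n").split("\t")))
```

Each yielded dict maps a field name (e.g. a client-IP or URI-stem column) to its value, so the records can feed the same aggregation you'd run on origin logs.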
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2020-08-18 23:02 UTC)