[HN Gopher] Pkg.jl telemetry should be opt-in
       ___________________________________________________________________
        
       Pkg.jl telemetry should be opt-in
        
       Author : boromi
       Score  : 93 points
       Date   : 2020-07-05 16:24 UTC (6 hours ago)
        
 (HTM) web link (discourse.julialang.org)
 (TXT) w3m dump (discourse.julialang.org)
        
       | parsimo2010 wrote:
       | There are a few concerns I have as an occasional Julia user. When
       | I update my packages is this going to be a silent change, or can
       | we get a notification and a Y/N option to opt out when updating?
       | How visible and easy is it to change this setting after updating
       | if I change my mind?
       | 
       | I don't have a specific concern about the Julia team using my
       | data, but I have general concerns about companies collecting
       | telemetry. Can't they get a rough estimate of active users by
       | counting unique IP addresses over the past X months which doesn't
       | require opting people in to telemetry?
       | 
       | Edit: I think I read the link incorrectly. This person is arguing
       | that users should have to actively opt-in, not that they are
       | opted in automatically. They are arguing for a change that would
       | increase privacy, and I need to opt-out in my current
       | installation. I didn't know I was sending telemetry right now.
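
The server-side estimate asked about above (counting unique IP addresses over the past X months) can be sketched as follows. This is a minimal illustration, not anything the Julia project is known to run: the `(timestamp, ip)` record shape, the function name, and the sample data are all assumptions. Shared (NAT) and dynamic addresses mean the result is only a rough proxy for users.

```python
from datetime import datetime, timedelta

def count_unique_ips(records, now, window_days):
    """Count distinct client IPs seen within the last `window_days` days.

    `records` is an iterable of (timestamp, ip) pairs, a hypothetical
    simplification of a web server access log.
    """
    cutoff = now - timedelta(days=window_days)
    return len({ip for ts, ip in records if ts >= cutoff})

# Made-up sample data for illustration only.
now = datetime(2020, 7, 5)
records = [
    (datetime(2020, 7, 1), "198.51.100.7"),
    (datetime(2020, 7, 2), "198.51.100.7"),   # same address seen again
    (datetime(2020, 6, 20), "203.0.113.9"),
    (datetime(2020, 1, 1), "192.0.2.44"),     # older than a 90-day window
]
print(count_unique_ips(records, now, window_days=90))
```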
        
         | [deleted]
        
         | dnautics wrote:
         | Downthread there's a comment that addresses your point by
         | JohnMylesWhite:
         | 
         | > I think this is the crux of the issue: you're already doing
         | that across the Internet since your IP address is part of many
         | (most?) normal HTTP requests. It's not perfectly uniquely
         | identifiable, but it's not so far away from being that and it's
         | being submitted without even the possibility of opt-out in most
         | cases / for most people.
         | 
         | > So I think the core issue this thread should resolve: would
         | it be better for Julia to just do everything via logging IP
         | addresses? That's what everyone else in OSS is already doing
         | (seemingly without almost any concerns), so perhaps the problem
         | is just that Julia is talking about how to best do things
         | rather than just doing them? That feels quite perverse to me,
         | but it's my big fear after reading this thread.
        
         | nwvg_7257 wrote:
         | You are not sending telemetry right now. That is a feature
         | which will be activated in the upcoming 1.5 release. It will
         | display a notification.
        
           | fiddlerwoaroof wrote:
           | If you make an HTTP request, you are sending "telemetry"
           | information in the form of endpoints, headers and IP
           | information. The server may not track this information, but
           | it's exploitable.
        
             | parsimo2010 wrote:
             | My issue is that I've come to terms with the fact that the
             | IP address of every connection can be tracked server-side.
             | I can use a VPN to get a little anonymity but can't stop a
             | server from logging connections and downloads. But
             | telemetry adds data on top of that, and it seems like a lot
             | of software wants to track me. I'd feel it was okay if I
             | was required to register an account and log in before
             | downloading/updating packages, that's a noticeable action
             | that lets my brain process the idea that I'm able to be
             | tracked. But sending "anonymous" metadata with almost no
             | action on my part rubs me the wrong way. Lots of devs try
             | to optimize things so they are low friction for users, but
             | I think the Julia user base is a little different than
             | normal software and wouldn't mind a little friction if it
             | meant they had better control of their privacy.
        
               | fiddlerwoaroof wrote:
               | As far as I can tell, this isn't adding anything to IP
               | sharing: the package manager just attaches a persistent
               | UUID to every request. In fact, it is more private than
               | IPs because it can't be tied to an ISP or geographical
               | region.
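
The "persistent UUID attached to every request" idea described above can be sketched like this. It is a sketch under stated assumptions, not Pkg.jl's actual implementation: the storage path handling, the function names, and the `X-Client-UUID` header name are all hypothetical.

```python
import tempfile
import uuid
from pathlib import Path

def load_or_create_client_id(path):
    """Return the UUID stored at `path`, creating a random one on first use."""
    path = Path(path)
    if path.exists():
        return path.read_text().strip()
    client_id = str(uuid.uuid4())  # random; not derived from the machine or user
    path.write_text(client_id)
    return client_id

def request_headers(path):
    # "X-Client-UUID" is a hypothetical header name used only for illustration.
    return {"X-Client-UUID": load_or_create_client_id(path)}

# The same identifier is sent on every request until the file is deleted.
with tempfile.TemporaryDirectory() as d:
    id_file = Path(d) / "client_id"
    print(request_headers(id_file) == request_headers(id_file))
```

Deleting the stored file in this sketch yields a fresh identifier on the next request, which is one way such a scheme differs from an IP address.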
        
               | SweetestRug wrote:
               | This strongly resonates with me. As someone who has
               | considered Julia, I would be happy to sign up so that
               | devs could use the information. Opt-out instead of opt-in
               | is a dark pattern; I am uncomfortable seeing it used.
               | Julia needs adopters to grow its community. This move
               | frankly makes me less likely to use the language in the
               | future. I am certain I am not the only person who feels
               | this way.
        
           | KenoFischer wrote:
           | While this is true of the feature mentioned, do note that
           | packages are currently hosted on various third party hosting
           | services that can and do track substantially similar
           | information. In 1.5, we're moving to our own infrastructure
           | for serving packages (which should give better performance
           | and allow things like incremental updates). This thread is
           | about what information gets sent along with those requests.
        
       | mixologic wrote:
       | I feel like many developers fail to understand the difference
       | between the ethos of Free/Libre/Open source _software_ , and the
       | realities of running a networked _service_.
       | 
       | Services are not free (as in beer) - they always take time,
       | money, and labor to provide. A PkgServer.jl is exactly the kind
       | of thing that has to be sustained _somehow_.
       | 
       | It's not possible to use a networked service without exchanging
       | some information with that service, which may or may not be
       | useful for the service providers to collect, so that they can
       | provide a better service (Read: make it cost less)
       | 
       | The idea that one should be entitled to use a service, for free,
       | and at the same time ask that the service does not collect any
       | data, or make it opt-in by default, is akin to demanding free
       | beer that people can optionally pay for.
       | 
       | Caveat: My bias is from being a service provider for a packaging
       | endpoint, a security updates endpoint, and a community CI
       | service. Any telemetry data we can get our hands on to help us
       | make informed decisions about what to support, and what to drop
       | support for, is absolutely invaluable.
        
       | dnautics wrote:
       | It's a fascinating discussion! I don't use Julia much anymore due
       | to a job change, but I hope all language package teams get to
       | read the back and forth.
        
       | systemvoltage wrote:
       | Why telemetry at all? I don't expect a programming language to
       | have telemetry as a default feature.
       | 
       | I want to hammer this rule into everyone regardless of the domain
       | you're working in when it comes to privacy:
       | 
       | - _Explicitly ask the user. Respect their privacy. Explain why
       | you would like to collect data, maybe show past examples of what
       | you've done with the data and don't deploy dark patterns or
       | default behavior._
       | 
       | It is not that hard. No backlash. No problem at all if you ask
       | the user. Sure, that would lead to less than optimal telemetry
       | for the collecting party but there should not be any way around
       | this. Want more data? Incentivize users, maybe give them a free
       | subscription for helping out with the beta testing. Give them a
       | discount. Treat data just like a commodity that costs money to
       | obtain responsibly. Right now, everyone is a data-cartel trying
       | to hoard as much as possible.
       | 
       | Why is this so hard to understand? This is opposite of "level-
       | headed". I usually allow PyCharm to collect telemetry, I allow
       | Apple to use Siri requests for improving it. It is because they
       | do this as respectfully as possible without deceiving the user.
        
         | 3JPLW wrote:
         | I encourage you to read https://julialang.org/legal/data
         | 
         | This is very minimal data that gets sent along with requests
         | that you're already making to a (user-selectable) package
         | server.
        
         | ssivark wrote:
         | > _Why telemetry at all? I don't expect a programming language
         | to have telemetry as a default feature._
         | 
         | That expectation is incorrect if you've ever used a package
         | server or pulled packages from some website including Github
         | (for ANY language). HTTP requests do communicate your IP
         | address, and it is standard practice to store them and use them
         | for analytics.
        
           | systemvoltage wrote:
           | No problem if they do it on the server side. Don't pollute
           | the user space with telemetry without asking.
           | 
           | If I download julia binaries from their website, they can
           | collect IP information if the local laws allow it. Once it is
           | in my possession, it is reprehensible to do anything without
           | asking me first.
        
             | improbable22 wrote:
             | In case this isn't clear, the telemetry being discussed is
             | only doing anything when you ask the package manager to
             | connect to a server, to download things.
        
               | systemvoltage wrote:
               | It is clear. There is a UUID generated on the user-side
               | to identify them.
               | 
               | It's one thing to collect statistics of downloads on the
               | server side and another thing to profile me. It's pretty
               | clear to me.
        
               | umvi wrote:
               | How does a single random number "profile you" ...?
        
               | gnud wrote:
               | With that identifier, an individual can be tracked across
               | different networks. This might well make you
               | identifiable.
               | 
               | (Not that I think Julia does this)
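
The concern above, that a stable identifier lets requests from different networks be linked, can be illustrated with a small sketch. This is a hypothetical analysis over made-up records and not a claim about what the Julia package servers do.

```python
from collections import defaultdict

def networks_per_client(records):
    """Map each client identifier to the set of IP addresses it was seen from."""
    seen = defaultdict(set)
    for client_id, ip in records:
        seen[client_id].add(ip)
    return dict(seen)

# Made-up (identifier, ip) pairs for illustration only.
records = [
    ("client-a", "198.51.100.7"),   # e.g. a home connection
    ("client-a", "203.0.113.9"),    # e.g. an office connection
    ("client-b", "203.0.113.9"),    # different client on the same network
]
linked = networks_per_client(records)
print(sorted(linked["client-a"]))
```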
        
         | umvi wrote:
         | Have any zealous "opt in" folks ever been in a position where
         | they need to somehow obtain statistical information about their
         | user base (to raise funding, to make business decisions, to
         | know what features are most being used, etc)? Opt in is like
         | hard mode and practically worthless, <1% of your user base will
         | opt in.
        
           | rightbyte wrote:
           | If you ask nicely and don't pretick the yes or no box I have
           | a hard time believing <1% would opt-in.
           | 
           | This is just excuses. Usage statistics could be tied to
           | downloads or public source code analysis. No need for
           | tracking.
        
         | wlesieutre wrote:
         | _> I allow Apple to use Siri requests for improving it. It is
         | because they do this as respectfully as possible without
         | deceiving the user._
         | 
         | Let's not forget that despite Apple's otherwise good privacy
         | record, Siri was saving your recordings to be listened to and
         | reviewed by 3rd party contractors without giving you any
         | opportunity to opt out. It was only late last year after their
         | competitors were called out for the same issues that Apple
         | provided an opt-out option.
         | 
         | And given how proactive they were with privacy warnings about
         | donating voicemail transcriptions to improve voicemail
         | accuracy, it was pretty reasonable to think "Surely if they
         | were saving the Siri input recordings and letting people listen
         | to them, they would have warned me about that and asked me if
         | it's OK."
         | 
         | https://www.cnbc.com/2019/10/28/ios-13point2-has-new-siri-pr...
         | 
         | It's possible they still had better privacy protections in
         | place for handling the recordings once they have them on their
         | servers (compared to Amazon and others), but even the contents
         | of voice recordings can be enough to de-anonymize them
         | depending on what you've said.
        
       | m4r35n357 wrote:
       | If you don't like it, write your own code!
        
       | KenoFischer wrote:
       | Hi HN, please note that this is an active discussion thread in
       | the Julia community. You are all more than welcome to chime in,
       | but we do try to keep discussions as productive as possible, so
       | if you do decide to comment, I'd ask that
       | 
       | 1) You familiarize yourself with the actual proposal and the
       | improvements that are currently underway and
       | 
       | 2) Be kind
       | 
       | A number of people have put in an enormous amount of effort to
       | try and get this right - please remember that they are indeed
       | people.
        
         | papaf wrote:
         | Is the telemetry available to users? I gladly opted into
         | Syncthing telemetry after seeing this page:
         | https://data.syncthing.net/
         | 
         | When the data is available to the community, just like the
         | source code, it's a much easier sell.
        
           | KenoFischer wrote:
           | The plan is to make aggregate usage data available publicly
           | and potentially share more detailed usage data with
           | individual package authors. The exact format is TBD since
           | it'll depend on the quality of the data that we get (this is
           | not active yet, except on the preview build). The raw logs
           | will be accessible to core developers with a reasonable need
           | to access (e.g. they're working on the infrastructure or
           | running the analytics), but will not be public.
        
             | j88439h84 wrote:
             | How about deleting the IP addresses within 48 hours like
             | 1.1.1.1 and 8.8.8.8 do?
             | 
             | https://developers.google.com/speed/public-dns/privacy
        
               | staticfloat wrote:
               | We do have a limited retention policy for the package
               | server logs we keep (which include client IP addresses).
               | It's not publicly stated anywhere right now, but one
               | reason why we need to keep IP addresses is for abuse
               | mitigation. We have been hit in the past by users that do
               | things like download large (100MB+) files from our
               | package cache servers multiple times a second for days on
               | end. This is a particularly easy case to catch (since it
               | easily pops to the top of any analysis you'd care to run,
               | across any timespan) but there are more subtle forms that
               | require a longer time window of analysis (e.g. users that
               | download once per hour, all month) that would be lost in
               | the noise without the ability to see what's going on.
               | 
               | This comment is not meant to serve as an official policy,
               | just pointing out one of the reasons why we can't delete
               | IP addresses like 1.1.1.1 and 8.8.8.8 do; because the
               | abuse vectors for a server that serves the community
               | large resources are very different from those of a DNS
               | server.
               | 
               | Most of the "abuse" we see is not malicious in nature,
               | but is instead users that have some kind of very poorly-
               | configured autoinstaller on a cluster. In the case of a
               | catastrophic issue like the one mentioned above, we null-
               | routed the IP address, reached out to the abuse contact
               | for that IP, and worked with the user to architect a
               | better system. Everyone is happy now, and we can continue
               | to provide a high quality service for the community
               | without breaking the bank.
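
The kind of log analysis described above (finding addresses whose downloads dominate a time window) might look roughly like the following. The record shape, byte budget, and names are assumptions for illustration; the records passed in are presumed to be already filtered to the window of interest, and a longer window would be needed to surface the slower patterns mentioned.

```python
from collections import Counter

def heavy_downloaders(records, byte_budget):
    """Return (ip, total_bytes) pairs exceeding `byte_budget`, largest first."""
    totals = Counter()
    for ip, nbytes in records:
        totals[ip] += nbytes
    return [(ip, n) for ip, n in totals.most_common() if n > byte_budget]

MB = 1_000_000
# Made-up (ip, bytes) records for illustration only.
records = (
    [("198.51.100.7", 100 * MB)] * 50    # repeated large downloads
    + [("203.0.113.9", 5 * MB)] * 3      # ordinary usage
)
print(heavy_downloaders(records, byte_budget=1_000 * MB))
```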
        
               | edw wrote:
               | How about hashing IPs? You could still see if someone
               | were on your abuse list if
               | abusers.contains(hashfn(req.addr)).
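
A minimal sketch of the hashed abuse list suggested above, assuming SHA-256 as the hash function (the comment does not name one) and hypothetical names throughout:

```python
import hashlib

def hash_ip(ip):
    """Hex SHA-256 digest of the textual IP address."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()

# Only digests are stored; the address below is a documentation-range example.
abusers = {hash_ip("198.51.100.7")}

def is_blocked(ip):
    """True if the hash of `ip` appears in the stored abuse list."""
    return hash_ip(ip) in abusers

print(is_blocked("198.51.100.7"), is_blocked("203.0.113.9"))
```

The replies in the thread discuss why an unsalted hash like this offers limited anonymity for IPv4.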
        
               | codedokode wrote:
               | Hash of IPv4 address can be easily reverted because there
               | is a limited number of addresses.
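
The point above can be demonstrated with a brute-force search. This sketch assumes an unsalted SHA-256 of the textual address and, to stay quick, enumerates only a /16 (65,536 candidates); the full IPv4 space has 2^32 addresses, which is larger but still finite and enumerable.

```python
import hashlib
import ipaddress

def hash_ip(ip):
    """Hex SHA-256 digest of the textual IP address."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()

def reverse_hash(target, network):
    """Return the address in `network` whose hash equals `target`, or None."""
    for addr in ipaddress.ip_network(network):
        if hash_ip(str(addr)) == target:
            return str(addr)
    return None

# Hash a documentation-range address, then recover it by enumeration.
target = hash_ip("198.51.100.7")
print(reverse_hash(target, "198.51.0.0/16"))
```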
        
               | KenoFischer wrote:
               | Doesn't help for two reasons: 1) If the hash has enough
               | bits to be useful for blocking, it's trivial to reverse
               | 2) Even if it did make the IPs anonymous, we want to be
               | able to email the NOC at whoever is sending the abusive
               | traffic, so they can go investigate
        
               | j88439h84 wrote:
               | > we want to be able to email the NOC at whoever is
               | sending the abusive traffic, so they can go investigate
               | 
               | If you block their traffic with HTTP 429 Too Many
               | Requests, they can email you instead.
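
The HTTP 429 approach suggested above could be sketched as a per-client sliding-window limiter. The class name, limits, and the choice of a sliding window are assumptions for illustration, not a description of any deployed package server.

```python
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per client in any `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)   # client -> timestamps of allowed requests

    def status_for(self, client, now):
        """Return the HTTP status to send: 200 if allowed, 429 if over the limit."""
        q = self.hits[client]
        while q and now - q[0] >= self.window:
            q.popleft()                  # forget requests that left the window
        if len(q) >= self.limit:
            return 429                   # Too Many Requests
        q.append(now)
        return 200

limiter = RateLimiter(limit=2, window=60)
print([limiter.status_for("198.51.100.7", t) for t in (0, 1, 2, 61)])
```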
        
               | KenoFischer wrote:
               | We prefer not to break researchers' workflow because the
               | group next door misconfigured their server. Happens all
               | the time. We only sinkhole IPs if the traffic is
               | malicious or on track to exceed our budget.
        
               | KenoFischer wrote:
               | I don't work on this particular thing, so I can't say
               | precisely what the planned retention period is. I suspect
               | 48 hours is too short, since people do take weekends ;).
               | It'll probably become clear with experience what
               | retention periods work. DNS servers are in a very unique
               | position of course since they essentially get your
               | browsing history.
        
             | ptx wrote:
             | Since the data is not being made public, presumably it is
             | judged to be sensitive to some extent?
             | 
             | So it follows then that users are right to be concerned and
             | would have every reason to not opt in if they were
             | presented with the choice.
        
               | KenoFischer wrote:
               | What's sensitive or not depends very much on what other
               | information the entity doing the analysis has available.
               | Of course raw log records are more sensitive than
               | aggregate data. For example, if somebody is wiretapping
               | your internet connection, then even if the connection is
               | encrypted raw logs would let them draw conclusions from
               | timing. To some extent you're trusting the Julia project
               | (or at least the people who have access) to not
               | clandestinely be in the wiretapping business, but then
               | again you're already trusting it with arbitrary code
               | execution on your machine, so if it were in that
               | business, you'd have bigger problems ;).
               | 
               | In any case that's why it's important to be transparent
               | about what is sent, and for what purpose and who has
               | access, so people can make informed decisions.
               | Ironically, I think people are jumping on the authors of
               | this particular piece of functionality precisely because
               | they tried to be very transparent.
        
               | ptx wrote:
               | Yes, they are transparent about deciding not to offer the
               | user the choice in a straight-forward upfront way (i.e.
               | opt in) because "the vast majority of users will not opt-
               | in". In other words, deciding that what the users want is
               | not as important as the marketing stats.
               | 
               | And, as you say, users trust the developers with access
               | to their systems and data. Deciding unilaterally to
               | sacrifice user privacy to benefit other interests might
               | be seen as a breach of that trust.
        
       | bencollier49 wrote:
       | Wow, if this is done without prompting the user, then it's
       | illegal in the EU and UK. IP addresses are considered PII.
        
         | KenoFischer wrote:
         | As mentioned in the thread, the people who implemented these
         | features obtained appropriate legal advice from lawyers
         | specializing in this area and implemented their
         | recommendations.
        
         | staticfloat wrote:
         | The GDPR explicitly allows for the processing of personal
         | information without consent in the event that such processing
         | is required for ensuring network security and availability, see
         | [1], [2] and [3] for more reading on this. Note that I am not a
         | lawyer, and you should consult a lawyer (as we did) to ensure
         | that all policies fall within GDPR laws.
         | 
         | That is precisely what the logged IP addresses are used for (an
         | example: nginx access logs), and is one of the reasons why we
         | would much rather use a random number generated by the client
         | machine than an IP address; because the bits themselves have no
         | meaning, unlike IP addresses.
         | 
         | As mentioned in the linked thread, NumFocus has worked with a
         | legal team that specializes in this type of law; this plan is
         | all in compliance with the GDPR.
         | 
         | [1] https://gdpr-info.eu/recitals/no-49/ (The actual GDPR text
         | regarding security concerns) [2]
         | https://blogs.akamai.com/2018/08/dispelling-the-myths-surrou...
         | (Akamai legal team confirming that this interpretation of
         | logging IP addresses for security purposes is valid) [3]
         | https://law.stackexchange.com/a/28609 (Stack exchange post
         | pointing out that even more exceptions exist beyond just
         | security)
        
         | philzook wrote:
         | I think the discussion is a bit more nuanced than that. They do
         | not appear to be recording IPs. They directly reference
         | carefully complying with GDPR.
        
           | chrispeel wrote:
           | Yes, IP addresses will be logged
           | https://discourse.julialang.org/t/pkg-jl-telemetry-should-
           | be...
        
       | throwawaw wrote:
       | This is an extraordinarily level-headed and well-reasoned version
       | of the "how much telemetry" conversation, from both "sides". The
       | Julia community comes off looking really good here.
        
         | pwdisswordfish2 wrote:
         | > Moreover, at present, we have no idea how many people use each
         | solver (and on which platform!). Knowing how many people
         | installed which solver would allow us to prioritize support
         | from our finite developer time.
         | 
         | Why not just let users vote on that? The support is for the
         | users, no? Instead the developers want to minimise the amount
         | of time they spend on maintenance based on the number of users
         | who could potentially complain. The reason for this is (as we
         | are about to be told) so they can spend more time working on
         | platforms where they believe commercial solver developers could
         | provide for-profit support services "(or $$)".
         | 
         | > This would also allow us to lobby the commercial solver
         | developers to provide official support (or $$). To quote one
         | company "We'll want to provide official support at some
         | point, but it looks like the scales haven't tilted quite yet."
         | It'd be nice to know whether 100, 1000, 10000, or 100000
         | people per month use their software; that might change their
         | mind.
         | 
         | The truth comes out. Collecting data via "frictionless"
         | telemetry allows someone else, e.g., commercial solver
         | developers, to make money. Nothing wrong with that if we let
         | users know about these intentions, however when developers try
         | to operate under the guise of "free", "non-profit", "open
         | source", etc. while, truthfully, they have commercial motives,
         | then it seems to me they are doing everything they can to avoid
         | tipping users off that this aims to be a commercially-oriented
         | project. Instead of just being transparent about their motives
         | and letting users decide, they want to sneak something by
         | (most) users. The issue raised here is not the collecting of
         | statistics (nothing wrong with that), it is the less
         | transparent, opt-out nature of it: telemetry. Deceptiveness,
         | stealth. The message coming from this discussion is "Don't tip
         | (majority of) users off that we are collecting data." And why
         | is that? Because the developers know this is something most
         | users do not want.
         | 
         | > Finally, if it is opt-in, the vast majority of users will
         | not opt-in. This leaves us no better off than we were before.
         | Opt-out is a good compromise.
         | 
         | The discussion should have ended right here. If providing usage
         | statistics is something that the Julia developers _already
         | know_ the vast majority of users do not want to do, then
         | sneaking it by them via opt-out telemetry is wrong, and it
         | tells us much about the people behind Julia. If users do not
         | want it, and you know that, then why the heck are you doing it
         | anyway? Anyone reading this will know why, but most users will
         | probably never read what we are reading here.
         | 
         | The rest of this discussion devolves into "Everyone else is
         | doing it". The lone dissenter finally gives in to peer
         | pressure.
         | 
         | I remember when using download statistics was enough.
         | Developers still maintained software. No "trade-offs" were
         | needed.
        
           | mlubin wrote:
           | > I remember when using download statistics was enough.
           | 
           | No download statistics are currently available for Julia
           | packages. That's essentially the issue that the Pkg.jl
           | telemetry is trying to address.
        
             | ViralBShah wrote:
             | Julia packages are github repos, where all we get are the
             | traffic stats for the last 2 weeks for clones. It doesn't
             | even provide the number of downloads of released software
             | (the tarballs), or even basic stats that you could get from
             | webserver access logs.
        
           | spenczar5 wrote:
           | > I remember when using download statistics was enough.
           | Developers still maintained software. No "trade-offs" were
           | needed.
           | 
           | Do you remember applying for grants to fund software? It's
           | tough out there, right now. Harder than it once was -
           | software is more expensive, and funding agencies are more
           | critical.
        
       | kanonieer wrote:
       | Telemetry deservedly has a terrible reputation due to its usage
       | in proprietary software. In open source software, it's not a deal
       | breaker for me as I have means to get rid of it.
       | 
       | But given the landscape of privacy issues, I wouldn't vote for an
       | opt-out telemetry in any of the OS projects I'm involved with.
        
       | CyberDildonics wrote:
       | I skimmed the link but still have the same question - is there
       | really a justification for having any telemetry turned on by
       | default? I think most people wouldn't want any network traffic
       | unless they instigated it, let alone unique identifiers and
       | package information.
        
         | ishcheklein wrote:
         | It helps developing and prioritizing features faster. What is
         | so harmful about it? Assuming it's anonymized properly, if no
         | one resells it, if it's explicit (doesn't matter opt-in or opt-
         | out).
        
         | KenoFischer wrote:
         | Note that this is about metadata for package requests, so
         | you're downloading something from a server already. The
         | question is what information is in that request.
        
       | Tarrosion wrote:
       | The back-and-forth in that thread is a great discussion. One
       | thing I hadn't realized is that many other popular languages are
       | already doing something similar. See this post for a bit more
       | detail: https://discourse.julialang.org/t/pkg-jl-telemetry-
       | should-be...
        
       ___________________________________________________________________
       (page generated 2020-07-05 23:00 UTC)