[HN Gopher] Pkg.jl telemetry should be opt-in ___________________________________________________________________ Pkg.jl telemetry should be opt-in Author : boromi Score : 93 points Date : 2020-07-05 16:24 UTC (6 hours ago) (HTM) web link (discourse.julialang.org) (TXT) w3m dump (discourse.julialang.org) | parsimo2010 wrote: | There are a few concerns I have as an occasional Julia user. When | I update my packages, is this going to be a silent change, or can | we get a notification and a Y/N option to opt out when updating? | How visible and easy is it to change this setting after updating | if I change my mind? | | I don't have a specific concern about the Julia team using my | data, but I have general concerns about companies collecting | telemetry. Can't they get a rough estimate of active users by | counting unique IP addresses over the past X months, which doesn't | require opting people in to telemetry? | | Edit: I think I read the link incorrectly. This person is arguing | that users should have to actively opt-in, not that they are | opted in automatically. They are arguing for a change that would | increase privacy, and I need to opt-out in my current | installation. I didn't know I was sending telemetry right now. | [deleted] | dnautics wrote: | Downthread there's a comment that addresses your point by | JohnMylesWhite: | | > I think this is the crux of the issue: you're already doing | that across the Internet since your IP address is part of many | (most?) normal HTTP requests. It's not perfectly uniquely | identifiable, but it's not so far away from being that and it's | being submitted without even the possibility of opt-out in most | cases / for most people. | | > So I think the core issue this thread should resolve: would | it be better for Julia to just do everything via logging IP | addresses? 
That's what everyone else in OSS is already doing | (seemingly without almost any concerns), so perhaps the problem | is just that Julia is talking about how to best do things | rather than just doing them? That feels quite perverse to me, | but it's my big fear after reading this thread. | nwvg_7257 wrote: | You are not sending telemetry right now. That is a feature | which will be activated in the upcoming 1.5 release. It will | display a notification. | fiddlerwoaroof wrote: | If you make an HTTP request, you are sending "telemetry" | information in the form of endpoints, headers and IP | information. The server may not track this information, but | it's exploitable | parsimo2010 wrote: | My issue is that I've come to terms with the fact that the | IP address of every connection can be tracked server side- | I can use a VPN to get a little anonymity but can't stop a | server from logging connections and downloads. But | telemetry adds data on top of that, and it seems like a lot | of software wants to track me. I'd feel it was okay if I | was required to register an account and log in before | downloading/updating packages, that's a noticeable action | that lets my brain process the idea that I'm able to be | tracked. But sending "anonymous" metadata with almost no | action on my part rubs me the wrong way. Lots of devs try | to optimize things so they are low friction for users, but | I think the Julia user base is a little different than | normal software and wouldn't mind a little friction if it | meant they had better control of their privacy. | fiddlerwoaroof wrote: | As far as I can tell, this isn't adding anything to IP | sharing: the package manager just attaches a persistent | UUID to every request. In fact, it is more private than | IPs because it can't be tied to an ISP or geographical | region. | SweetestRug wrote: | This strongly resonates with me. 
As someone who has | considered Julia, I would be happy to sign up so that | devs could use the information. Opt-out instead of opt-in | is a dark pattern; I am uncomfortable seeing it used. | Julia needs adopters to grow its community. This move | frankly makes me less likely to use the language in the | future. I am certain I am not the only person who feels | this way. | KenoFischer wrote: | While this is true of the feature mentioned, do note that | packages are currently hosted on various third party hosting | services that can and do track substantially similar | information. In 1.5, we're moving to our own infrastructure | for serving packages (which should give better performance | and allow things like incremental updates). This thread is | about what information gets sent along with those requests. | mixologic wrote: | I feel like many developers fail to understand the difference | between the ethos of Free/Libre/Open source _software_, and the | realities of running a networked _service_. | | Services are not free (as in beer) - they always take time, | money, and labor to provide. A PkgServer.jl is exactly the kind | of thing that has to be sustained _somehow_. | | It's not possible to use a networked service without exchanging | some information with that service, which may or may not be | useful for the service providers to collect, so that they can | provide a better service (Read: make it cost less). | | The idea that one should be entitled to use a service, for free, | and at the same time ask that the service does not collect any | data, or make it opt-in by default, is akin to demanding free | beer that people can optionally pay for. | | Caveat: My bias is from being a service provider for a packaging | endpoint, a security updates endpoint, and a community CI | service. Any telemetry data we can get our hands on to help us | make informed decisions about what to support, and what to drop | support for, is absolutely invaluable. 
| dnautics wrote: | It's a fascinating discussion! I don't use Julia much anymore due | to a job change, but I hope all language package teams get to read the | back and forth. | systemvoltage wrote: | Why telemetry at all? I don't expect a programming language to | have telemetry as a default feature. | | I want to hammer this rule into everyone regardless of the domain | you're working in when it comes to privacy: | | - _Explicitly ask the user. Respect their privacy. Explain why | you would like to collect data, maybe show past examples of what | you've done with the data and don't deploy dark patterns or | default behavior._ | | It is not that hard. No backlash. No problem at all if you ask | the user. Sure, that would lead to less than optimal telemetry | for the collecting party but there should not be any way around | this. Want more data? Incentivize users, maybe give them a free | subscription for helping out with the beta testing. Give them a | discount. Treat data just like a commodity that costs money to | obtain responsibly. Right now, everyone is a data-cartel trying | to hoard as much as possible. | | Why is this so hard to understand? This is the opposite of "level- | headed". I usually allow PyCharm to collect telemetry, I allow | Apple to use Siri requests for improving it. It is because they | do this as respectfully as possible without deceiving the user. | 3JPLW wrote: | I encourage you to read https://julialang.org/legal/data | | This is very minimal data that gets sent along with requests | that you're already making to a (user-selectable) package | server. | ssivark wrote: | > _Why telemetry at all? I don't expect a programming language | to have telemetry as a default feature._ | | That expectation is incorrect if you've ever used a package | server or pulled packages from some website including GitHub | (for ANY language). HTTP requests do communicate your IP | address, and it is standard practice to store them and use them | for analytics. 
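The server-side analytics mentioned above (counting distinct client addresses from stored request logs) can be sketched roughly as follows. This is only an illustration, not how any Julia package server actually works: it assumes access logs in Common Log Format, and the function name, the sample lines, and the `/registry` path are all made up for the example.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the start of a Common Log Format line: client address, two
# ignored fields, then the bracketed timestamp. (Assumed log format.)
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def unique_clients_per_month(lines):
    """Return {"YYYY-MM": number of distinct client addresses seen}."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        address, stamp = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        seen[when.strftime("%Y-%m")].add(address)
    return {month: len(addresses) for month, addresses in seen.items()}


# Hypothetical sample lines using documentation-only address ranges.
SAMPLE = [
    '192.0.2.10 - - [30/Jun/2020:23:59:00 +0000] "GET /registry HTTP/1.1" 200 512',
    '192.0.2.10 - - [01/Jul/2020:10:00:00 +0000] "GET /registry HTTP/1.1" 200 512',
    '192.0.2.10 - - [02/Jul/2020:10:00:00 +0000] "GET /registry HTTP/1.1" 200 512',
    '198.51.100.7 - - [03/Jul/2020:11:30:00 +0000] "GET /registry HTTP/1.1" 200 512',
    'not a log line',
]

if __name__ == "__main__":
    print(unique_clients_per_month(SAMPLE))
```

A count like this is only a rough proxy for users: several people behind one shared address collapse into a single entry, and one person moving between networks shows up more than once.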
| systemvoltage wrote: | No problem if they do it on the server side. Don't pollute | the user space with telemetry without asking. | | If I download julia binaries from their website, they can | collect IP information if the local laws allow it. Once it is | in my possession, it is reprehensible to do anything without | asking me first. | improbable22 wrote: | In case this isn't clear, the telemetry being discussed is | only doing anything when you ask the package manager to | connect to a server, to download things. | systemvoltage wrote: | It is clear. There is a UUID generated on the user-side | to identify them. | | It's one thing to collect statistics of downloads on the | server side and another thing to profile me. It's pretty | clear to me. | umvi wrote: | How does a single random number "profile you" ...? | gnud wrote: | With that identifier, an individual can be tracked across | different networks. This might well make you | identifiable. | | (Not that I think Julia does this) | umvi wrote: | Have any zealous "opt in" folks ever been in a position where | they need to somehow obtain statistical information about their | user base (to raise funding, to make business decisions, to | know what features are most being used, etc)? Opt in is like | hard mode and practically worthless, <1% of your user base will | opt in. | rightbyte wrote: | If you ask nicely and don't pretick the yes or no box I have | a hard time believing <1% would opt-in. | | This is just excuses. Usage statistics could be tied to | downloads or public source code analysis. No need for | tracking. | wlesieutre wrote: | _> I allow Apple to use Siri requests for improving it. It is | because they do this as respectfully as possible without | deceiving the user._ | | Let's not forget that despite Apple's otherwise good privacy | record, Siri was saving your recordings to be listened to and | reviewed by 3rd party contractors without giving you any | opportunity to opt out. 
It was only late last year after their | competitors were called out for the same issues that Apple | provided an opt-out option. | | And given how proactive they were with privacy warnings about | donating voicemail transcriptions to improve voicemail | accuracy, it was pretty reasonable to think "Surely if they | were saving the Siri input recordings and letting people listen | to them, they would have warned me about that and asked me if | it's OK." | | https://www.cnbc.com/2019/10/28/ios-13point2-has-new-siri-pr... | | It's possible they still had better privacy protections in | place for handling the recordings once they have them on their | servers (compared to Amazon and others), but even the contents | of voice recordings can be enough to de-anonymize them | depending on what you've said. | m4r35n357 wrote: | If you don't like it, write your own code! | KenoFischer wrote: | Hi HN, please note that this is an active discussion thread in | the Julia community. You are all more than welcome to chime in, | but we do try to keep discussions as productive as possible, so | if you do decide to comment, I'd ask that | | 1) You familiarize yourself with the actual proposal and the | improvements that are currently underway and | | 2) Be kind | | A number of people have put in an enormous amount of effort to | try and get this right - please remember that they are indeed | people. | papaf wrote: | Is the telemetry available to users? I gladly opted into | Syncthing telemetry after seeing this page: | https://data.syncthing.net/ | | When the data is available to the community, just like the | source code, it's a much easier sell. | KenoFischer wrote: | The plan is to make aggregate usage data available publicly | and potentially share more detailed usage data with | individual package authors. The exact format is TBD since | it'll depend on the quality of the data that we get (this is | not active yet, except on the preview build). 
The raw logs | will be accessible to core developers with a reasonable need | to access (e.g. they're working on the infrastructure or | running the analytics), but will not be public. | j88439h84 wrote: | How about deleting the IP addresses within 48 hours like | 1.1.1.1 and 8.8.8.8 do? | | https://developers.google.com/speed/public-dns/privacy | staticfloat wrote: | We do have a limited retention policy for the package | server logs we keep (which include client IP addresses). | It's not publicly stated anywhere right now, but one | reason why we need to keep IP addresses is for abuse | mitigation. We have been hit in the past by users that do | things like download large (100MB+) files from our | package cache servers multiple times a second for days on | end. This is a particularly easy case to catch (since it | easily pops to the top of any analysis you'd care to run, | across any timespan) but there are more subtle forms that | require a longer time window of analysis (e.g. users that | download once per hour, all month) that would be lost in | the noise without the ability to see what's going on. | | This comment is not meant to serve as an official policy, | just pointing out one of the reasons why we can't delete | IP addresses like 1.1.1.1 and 8.8.8.8 do; because the | abuse vectors for a server that serves the community | large resources is very different from that of a DNS | server. | | Most of the "abuse" we see is not malicious in nature, | but is instead users that have some kind of very poorly- | configured autoinstaller on a cluster. In the case of a | catastrophic issue like the one mentioned above, we null- | routed the IP address, reached out to the abuse contact | for that IP, and worked with the user to architect a | better system. Everyone is happy now, and we can continue | to provide a high quality service for the community | without breaking the bank. | edw wrote: | How about hashing IPs? 
You could still see if someone | were on your abuse list if | abusers.contains(hashfn(req.addr)). | codedokode wrote: | A hash of an IPv4 address can be easily reversed because there | is a limited number of addresses. | KenoFischer wrote: | Doesn't help for two reasons: 1) If the hash has enough | bits to be useful for blocking, it's trivial to reverse | 2) Even if it did make the IPs anonymous, we want to be | able to email the NOC at whoever is sending the abusive | traffic, so they can go investigate | j88439h84 wrote: | > we want to be able to email the NOC at whoever is | sending the abusive traffic, so they can go investigate | | If you block their traffic with HTTP 429 Too Many | Requests, they can email you instead. | KenoFischer wrote: | We prefer not to break researchers' workflow because the | group next door misconfigured their server. Happens all | the time. We only sinkhole IPs if the traffic is | malicious or on track to exceed our budget. | KenoFischer wrote: | I don't work on this particular thing, so I can't say | precisely what the planned retention period is. I suspect | 48 hours is too short, since people do take weekends ;). | It'll probably become clear with experience what | retention periods work. DNS servers are in a very unique | position of course since they essentially get your | browsing history. | ptx wrote: | Since the data is not being made public, presumably it is | judged to be sensitive to some extent? | | So it follows then that users are right to be concerned and | would have every reason to not opt in if they were | presented with the choice. | KenoFischer wrote: | What's sensitive or not depends very much on what other | information the entity doing the analysis has available. | Of course raw log records are more sensitive than | aggregate data. For example, if somebody is wiretapping | your internet connection, then even if the connection is | encrypted raw logs would let them draw conclusions from | timing. 
To some extent you're trusting the Julia project | (or at least the people who have access) to not | clandestinely be in the wiretapping business, but then | again you're already trusting it with arbitrary code | execution on your machine, so if it were in that | business, you'd have bigger problems ;). | | In any case that's why it's important to be transparent | about what is sent, and for what purpose and who has | access, so people can make informed decisions. | Ironically, I think people are jumping on the authors of | this particular piece of functionality precisely because | they tried to be very transparent. | ptx wrote: | Yes, they are transparent about deciding not to offer the | user the choice in a straight-forward upfront way (i.e. | opt in) because "the vast majority of users will not opt- | in". In other words, deciding that what the users want is | not as important as the marketing stats. | | And, as you say, users trust the developers with access | to their systems and data. Deciding unilaterally to | sacrifice user privacy to benefit other interests might | be seen as a breach of that trust. | bencollier49 wrote: | Wow, if this is done without prompting the user, then it's | illegal in the EU and UK. IP addresses are considered PII. | KenoFischer wrote: | As mentioned in the thread, the people who implemented these | features obtained appropriate legal advice from lawyers | specializing in this area and implemented their | recommendations. | staticfloat wrote: | The GDPR explicitly allows for the processing of personal | information without consent in the event that such processing | is required for ensuring network security and availability, see | [1], [2] and [3] for more reading on this. Note that I am not a | lawyer, and you should consult a lawyer (as we did) to ensure | that all policies fall within GDPR laws. 
| | That is precisely what the logged IP addresses are used for (an | example: nginx access logs), and is one of the reasons why we | would much rather use a random number generated by the client | machine than an IP address; because the bits themselves have no | meaning, unlike IP addresses. | | As mentioned in the linked thread, NumFocus has worked with a | legal team that specializes in this type of law; this plan is | all in compliance with the GDPR. | | [1] https://gdpr-info.eu/recitals/no-49/ (The actual GDPR text | regarding security concerns) | [2] https://blogs.akamai.com/2018/08/dispelling-the-myths-surrou... | (Akamai legal team confirming that this interpretation of | logging IP addresses for security purposes is valid) | [3] https://law.stackexchange.com/a/28609 (Stack Exchange post | pointing out that even more exceptions exist beyond just | security) | philzook wrote: | I think the discussion is a bit more nuanced than that. They do | not appear to be recording IPs. They directly reference | carefully complying with GDPR. | chrispeel wrote: | Yes, IP addresses will be logged | https://discourse.julialang.org/t/pkg-jl-telemetry-should-be... | throwawaw wrote: | This is an extraordinarily level-headed and well-reasoned version | of the "how much telemetry" conversation, from both "sides". The | Julia community comes off looking really good here. | pwdisswordfish2 wrote: | > Moreover, at present, we have no idea how many people use each | solver (and on which platform!). Knowing how many people | installed which solver would allow us to prioritize | support from our finite developer time. | | Why not just let users vote on that? The support is for the | users, no? Instead the developers want to minimise the amount | of time they spend on maintenance based on the number of users | who could potentially complain. 
The reason for this is (as we | are about to be told) so they can spend more time working on | platforms where they believe commercial solver developers could | provide for-profit support services "(or $$)". | | > This would also allow us to lobby the commercial solver | developers to provide official support (or $$). To quote one | company "We'll want to provide official support at some | point, but it looks like the scales haven't tilted quite yet." | It'd be nice to know whether 100, 1000, 10000, or 100000 | people per month use their software; that might change their | mind. | | The truth comes out. Collecting data via "frictionless" | telemetry allows someone else, e.g., commercial solver | developers, to make money. Nothing wrong with that if we let | users know about these intentions; however, when developers try | to operate under the guise of "free", "non-profit", "open | source", etc. while, truthfully, they have commercial motives, | then it seems to me they are doing everything they can to avoid | tipping users off that this aims to be a commercially-oriented | project. Instead of just being transparent about their motives | and letting users decide, they want to sneak something by | (most) users. The issue raised here is not the collecting of | statistics (nothing wrong with that), it is the less | transparent, opt-out nature of it: telemetry. Deceptiveness, | stealth. The message coming from this discussion is "Don't tip | (majority of) users off that we are collecting data." And why | is that? Because the developers know this is something most | users do not want. | | > Finally, if it is opt-in, | the vast majority of users will not opt-in. This leaves us no | better off than we were before. Opt-out is a good | compromise. | | The discussion should have ended right here. 
If providing usage | statistics is something that the Julia developers _already | know_ the vast majority of users do not want to do, then | sneaking it by them via opt-out telemetry is wrong, and it | tells us much about the people behind Julia. If users do not | want it, and you know that, then why the heck are you doing it | anyway? Anyone reading this will know why, but most users will | probably never read what we are reading here. | | The rest of this discussion devolves into "Everyone else is | doing it". The lone dissenter finally gives in to peer | pressure. | | I remember when using download statistics was enough. | Developers still maintained software. No "trade-offs" were | needed. | mlubin wrote: | > I remember when using download statistics was enough. | | No download statistics are currently available for Julia | packages. That's essentially the issue that the Pkg.jl | telemetry is trying to address. | ViralBShah wrote: | Julia packages are github repos, where all we get are the | traffic stats for the last 2 weeks for clones. It doesn't | even provide the number of downloads of released software | (the tarballs), or even basic stats that you could get from | webserver access logs. | spenczar5 wrote: | > I remember when using download statistics was enough. | Developers still maintained software. No "trade-offs" were | needed. | | Do you remember applying for grants to fund software? It's | tough out there, right now. Harder than it once was - | software is more expensive, and funding agencies are more | critical. | kanonieer wrote: | Telemetry deservedly has a terrible reputation due to its usage | in proprietary software. In open source software, it's not a deal | breaker for me as I have means to get rid of it. | | But given the landscape of privacy issues, I wouldn't vote for an | opt-out telemetry in any of the OS projects I'm involved with. 
| CyberDildonics wrote: | I skimmed the link but still have the same question - is there | really a justification for having any telemetry turned on by | default? I think most people wouldn't want any network traffic | unless they instigated it, let alone unique identifiers and | package information. | ishcheklein wrote: | It helps with developing and prioritizing features faster. What is | so harmful about it? Assuming it's anonymized properly, if no | one resells it, if it's explicit (doesn't matter opt-in or opt- | out). | KenoFischer wrote: | Note that this is about metadata for package requests, so | you're downloading something from a server already. The | question is what information is in that request. | Tarrosion wrote: | The back-and-forth in that thread is a great discussion. One | thing I hadn't realized is that many other popular languages are | already doing something similar. See this post for a bit more | detail: https://discourse.julialang.org/t/pkg-jl-telemetry-should-be... ___________________________________________________________________ (page generated 2020-07-05 23:00 UTC)