[HN Gopher] Big data may not know your name, but it knows everyt...
___________________________________________________________________
 
Big data may not know your name, but it knows everything else
 
Author : nemoniac
Score  : 16 points
Date   : 2021-12-30 09:40 UTC (13 hours ago)
 
(HTM) web link (www.wired.com)
(TXT) w3m dump (www.wired.com)
 
| aledalgrande wrote:
| To me it is crazy that selling that data is even legal.

| throwaway_465 wrote:
| Enjoy:
|
| FinanceIQ by AnalyticsIQ - Consumer Finance Data USA - 241M Individuals
| https://datarade.ai/data-products/financeiq
|
| Individual Consumer Data
| https://datarade.ai/search?utf8=%E2%9C%93&category=individua...

| agsnu wrote:
| If you're interested in this topic, I recommend the chapter on Inference Control in Ross Anderson's excellent book "Security Engineering". It's one of the chapters that is freely available on his web site:
| https://www.cl.cam.ac.uk/~rja14/Papers/SEv3-ch11-7sep.pdf

| specialist wrote:
| Two tangential "yes and" points:
|
| 1)
|
| I'm not smart enough to understand differential privacy.
|
| So my noob mental model is: fuzz the data to create hash collisions. Differential privacy's heuristics guide the effort - like how much source data and how much fuzz you need to get X% certainty of "privacy", meaning the likelihood someone could reverse the hash to recover the source identity.
|
| BUT: this is entirely moot if the original (now fuzzed) data set can be correlated with another data set.
|
| 2)
|
| All PII should be encrypted at rest, at the field level.
|
| I really wish Wayner's Translucent Databases were better known. TLDR: Wayner shows clever ways of using salt+hash to protect identity, just like how properly protected password files are salted and hashed.
|
| Again, entirely moot if the protected data is correlated with another data set.
|
| http://wayner.org/node/46
|
| https://www.amazon.com/Translucent-Databases-Peter-Wayner/dp...
|
| Bonus point 3)
|
| The privacy "fix" is to extend property rights to all personal data.
|
| My data is me. I own it. If someone's using my data, for any reason, I want my cut.
|
| Pay me.
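A minimal sketch of the salt+hash idea from Translucent Databases summarized in the comment above (Python; the identifier format and field names are made up purely for illustration):

    import hashlib
    import os

    def protect(identifier: str, salt: bytes) -> str:
        # Store only the salted hash of the identifying field, never the raw value.
        return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()

    # The salt is kept away from the analyst-facing copy of the data.
    salt = os.urandom(16)

    row = {
        "subject": protect("1985-01-23-1234", salt),  # hypothetical national ID
        "cholesterol_mmol_l": 5.2,
        "visit_week": 12,
    }

    # Whoever holds only `row` cannot invert "subject"; whoever also holds the
    # raw identifier and the salt can recompute the hash and re-link the record.

As the comment itself stresses, this only pseudonymizes one field: if the remaining columns can be correlated with another data set, the protection is moot.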
| lrem wrote:
| At Google we have a bunch of researchers on anonymity, and that whole thing is _hard_. I vaguely remember supporting, a couple of years ago, a pipeline where logs stripped of "all PII" came in at one end and aggregated data came out of the middle... into an anonymity verifier, which then redirected much of it into /dev/null, because some technique was known to be able to de-anonymise it. And the research on differential privacy has advanced quite a bit in the meantime.

| geoduck14 wrote:
| Did that stuff work?

| dang wrote:
| We've merged https://news.ycombinator.com/item?id=29734713 into this thread since the previous submission wasn't to the original source. That's why some comment timestamps are older.

| MrDresden wrote:
| Anecdotally, the only time I've seen a truly anonymized database was in a European genetics research company, mainly due to the rightly high amount of regulation required in the medical field.
|
| There was a whole separate legal entity, with its own board, that did the phenotype measurement gathering and stored the data in a big database on premises. The link between those measurements and the individual's personally identifiable record was then stored in a separate air-gapped database with cryptographic locks implemented (on the data and on physical access to the server), so accessing the data required the physical presence of the privacy officers of each of the two companies (the measurement lab and the research lab) and, finally, what I found at the time to be the unique move: a representative from the state-run privacy watchdog.
|
| To be able to backtrack the data to a person, there was always going to be a need to go through the watchdog. Technically required, not just legally mandated.
|
| All of the measurement data stored in the database came from very restricted input fields in the custom software that was built on prem (no long-form text input fields, for instance, where identifying data could be entered accidentally), and a lot of thought was put into the design of the UI to limit the possibility that anyone could put identifiable data into the record.
|
| For instance, numerical ranges for a specific phenotype were all prefilled in a dropdown, so as to keep user key input to a minimum. Much of the data also came from direct connections to the medical equipment (I wrote a serial connector for a Humphrey medical eye scanner that parsed the results straight into the software, skipping the human element altogether).
|
| This didn't make for the nicest-looking software (endless dropdowns and scales), but it fulfilled its goals of functionality and privacy perfectly.
|
| The measurement data would then go through many automatic filters and further anonymizing processes before being delivered through a dedicated network pipeline (configured through the local ISP to be unidirectional) to the research lab.
|
| Is this guaranteed to never leak any private information? No, nothing is 100%. This comes damn near close to it, but of course would not work in most other normal business situations.
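A rough sketch of the split-storage design described above: measurements keyed by a random pseudonym, with the pseudonym-to-person link held in a separate, encrypted store (Python with the third-party `cryptography` package; the two in-memory dicts stand in for the two physically separate databases, and everything else is illustrative rather than the actual system):

    import uuid
    from cryptography.fernet import Fernet

    link_key = Fernet.generate_key()   # held by the privacy officers, not the researchers
    link_cipher = Fernet(link_key)

    measurements_db = {}    # research-facing: phenotype data only, keyed by pseudonym
    identity_links_db = {}  # separate, access-controlled: pseudonym -> encrypted identity

    def store(person_id: str, phenotype: dict) -> str:
        pseudonym = str(uuid.uuid4())
        measurements_db[pseudonym] = phenotype                 # nothing identifying here
        identity_links_db[pseudonym] = link_cipher.encrypt(person_id.encode())
        return pseudonym

    def backtrack(pseudonym: str) -> str:
        # Only possible for whoever holds both the link database and the key.
        return link_cipher.decrypt(identity_links_db[pseudonym]).decode()

The point of the anecdote is that the real-world equivalents of `link_key` and `identity_links_db` lived behind physical locks and a watchdog, not in the same process as in this toy example.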
| Hendrikto wrote:
| > To be able to backtrack the data to a person, there was always going to be a need to go through the watchdog.
|
| The assumption in the deanonymization literature is that this data is unavailable. So no, you don't need to go through any watchdog.

| MrDresden wrote:
| Yes they had to, in case the person giving the data had opted to be notified about some severe medical condition or other revelations that might show up during the analysis process. For those cases, these mappings were kept around, and using them did require going through the watchdog.

| Hendrikto wrote:
| I think you misunderstood my point. There are ways of deanonymizing data just by looking at the data alone. In fact, this is the standard assumption.
|
| This watchdog stuff is nice for the "good" actors, but irrelevant for adversaries.

| MrDresden wrote:
| I did understand the point perfectly. The mechanism was there simply for the _good_ actors to backtrace the data to the matching person; its purpose was never to play a part in making the data more anonymous.
|
| If what you meant to say was the clearer statement that adversaries wouldn't need to do that, then I'd agree with you.
|
| Everything else that was mentioned - the strict processes determining what data could be stored and what, if anything, it exposed about the user, eliminating as much of the human input as possible, and the post-processing of the data before it left the measurement lab - these are the steps that achieved anonymity (as far as everyone believed it had been achieved).

| benreesman wrote:
| One click removed from an original source that is soaked in second-rate adtech crap.
|
| To my dying day I will regret being one of the architects of this insidious mechanism.
|
| The problem with trying to evade, or defeat, or even sidestep this stuff is that latent-representation embeddings break human intuition with their effectiveness.
|
| There was a time when the uniqueness of one's signature could move money.
|
| Those pen twitches are still there to see in what order you click on links.

| kurthr wrote:
| Just wait until they have your unconscious eye-twitches.
|
| Foveated rendering is required for adequate resolution/refresh in VR/AR because of the bandwidths and GPU compute involved (e.g. 6k x 6k x RGB x 2 eyes x 120 Hz ~= 200 Gb/s). Updating only the 20-30 degrees around the eye's focus reduces this by more than 10x, which cuts power/weight and GPU cost dramatically.

| jaytaylor wrote:
| Having a conscience is good, but don't be too hard on yourself, old friend.

| sanxiyn wrote:
| Remember that 33 bits of entropy are enough to identify everyone. It may not be legally so, but any data with 33 bits of entropy is technically PII, and you should treat it as such.

| unixhero wrote:
| What do you mean here? I am asking because this is potentially useful.

| jodrellblank wrote:
| There are ~8,000,000,000 people in the world; that's a ten-digit number, so ten digits is the smallest count of digits that could give a unique number to everyone in the world - nine digits don't have enough possible values. If the digit values are based on details about you - e.g. being in the USA sets the second digit to 0/1/2, being in Canada and male sets it to 3, being in Canada and female sets it to 4, the last two digits are your height in inches, etc. - then you don't have to count out the numbers and hand one to everyone; the ten digits become a kind of identifier of their own. 1,445,234,170 narrows down to (a woman in Canada 70 inches tall ...) until it only matches one person. There are lots of people of the same height, so perhaps it won't quite identify a single person, but it will be close. Maybe one or two more digits is enough to tie-break and reduce it to one person.
|
| Almost anything will do as a tie-break between two people - married, wears glasses, keeps snakes, once visited whitehouse.gov, walked past a Bluetooth advertising beacon on Main Street, San Francisco. Starting from 8 billion people and making some yes/no tie-breaks that split people into two groups - a divide-and-conquer approach: split the group in two, split in two again, cheerful/miserable, speaks Spanish yes/no, once owned a dog yes/no, once had a Google account yes/no, once took a photo at a wedding yes/no, ever had a tooth filling yes/no, moved house more than once in a year yes/no, ever broke a bone yes/no, has a Steam account yes/no - anything which divides people will "eventually" winnow 8 billion down to 1 person, and then you have a set of tie-breaks with enough information in them to uniquely identify individual people.
|
| I say "eventually": if you can find tie-breaks that split the groups perfectly in half each time, then you only need 33 of them to reduce 8 billion down to 1. This is all another way of saying counting in binary: a 33-bit binary number such as 101001011010100101101010010110101 can be read as 33 yes/no tie-breaks, and 33 bits are enough to count up past 8 billion. It's 2^33 - two possible values in each position, 33 times.
|
| That means any collection of data about people which gets to 33 bits of information about each person is getting close to being enough data to risk uniquely identifying people. If you end up gathering how quickly someone clicks a cookie banner, that has some information hiding in it about how familiar they are with cookie banners and how physically able they are; that starts to divide people into groups. If you gather data about their web browser, that tells you what OS they run, what version, how up to date it is - those divide people into buckets. What time they opened your email with a marketing advert in it gives a clue to their timezone and work hours. Not very precise, but it only needs 33 bits before it approaches enough to start identifying individual people. Gather megabytes of this stuff about each person, and identities fall out - the person who searched X is the same person who stood for longer by the advert beacon and supports X political candidate and lives in this area and probably has an income in X range and ... can only be Joe Bloggs.
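A back-of-the-envelope version of the 33-bits argument (Python; the bucket counts are invented, and the attributes are assumed independent and roughly uniform, which real-world data is not):

    import math

    world_population = 8_000_000_000
    print(math.ceil(math.log2(world_population)))  # 33 -> 33 perfect yes/no splits suffice

    # Each attribute contributes roughly log2(number of equally likely buckets) bits.
    attributes = {
        "country": math.log2(195),
        "birthday (day + month)": math.log2(366),
        "browser + version bucket": math.log2(500),
        "height in cm bucket": math.log2(60),
        "owns a dog": math.log2(2),
    }
    print(round(sum(attributes.values()), 1))  # ~32 bits: already close to unique

Real attributes are correlated and unevenly distributed, so each carries fewer bits than this in practice - but megabytes of signals per person, as the comment says, is far more than enough.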
| unixhero wrote:
| Jaw-dropping. So this is the "33 bits" I've heard people throw around. Thank you so much for elaborating in such a detailed and insightful way.

| konschubert wrote:
| That makes no sense, sorry.
|
| Ok, 2^33 > world population, but that doesn't mean that the string "Hello world" is PII.

| halhen wrote:
| That depends on the encoding, does it not? The binary sequence equal to ASCII "Hello world" might well be PII under many different encodings. By accident, of course, but nevertheless 33 bits of information would be enough.

| stillicidious wrote:
| 33 bits of entropy, not just 33 bits

| [deleted]

| AstralStorm wrote:
| Unless someone is actually called Hello World. Or perhaps Bobby Tables. ;)

| lb1lf wrote:
| -Anecdotally, at a former employer we had an annual questionnaire used to estimate how content we were.
|
| The results, we were assured, would only be used in aggregate after having been anonymized.
|
| I laughed quite hard when the results came back - the 'Engineer, female, 40-49 yrs, $SITE' in the office next door wasn't as amused. All her responses had been printed in the report. Sample size: 1.

| sqrt17 wrote:
| At our (fairly large) company, you can query by team (and maybe job role), but it will hide responses where the sample size is smaller than a set number (I think 8 or 10).
|
| So yes, it can be done, but people have to actually care about it.
|
| The cautionary tale about k-anonymity (from Aaron Swartz's book, I think) is when the behavior of aggregates is also something that should be kept private - the example was that the morning run of an army base in a foreign country was revealed because enough people did it with their smartwatches on that it formed a neat cluster.
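A minimal sketch of the small-cell suppression sqrt17 describes (Python; the grouping keys and the threshold are illustrative). As a later comment points out, a minimum group size on its own is not a robust defence, and it is not full k-anonymity either, but it does prevent the "sample size: 1" report above:

    from collections import defaultdict

    def aggregate_with_suppression(responses, k=8):
        """Average scores per (team, role), hiding any group smaller than k."""
        groups = defaultdict(list)
        for r in responses:
            groups[(r["team"], r["role"])].append(r["score"])
        return {
            key: (sum(scores) / len(scores)) if len(scores) >= k else "suppressed (n < %d)" % k
            for key, scores in groups.items()
        }

    responses = [
        {"team": "Platform", "role": "Engineer", "score": 4},
        {"team": "Platform", "role": "Engineer", "score": 2},
        {"team": "Site B",   "role": "Engineer", "score": 1},   # a group of one
    ]
    print(aggregate_with_suppression(responses, k=2))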
| pfraze wrote:
| Isn't location data particularly easy to de-anonymize? I remember reading some research showing that, because people tend to be so consistent in their movements, you could de-anonymize most people in a dataset from just a handful of location samples taken through the day.

| teraku wrote:
| In Germany (I think in all of the EU), a dataset can only be published if the sample size is at least 7.

| williamtrask wrote:
| Just FWIW, sample size isn't a robust defence against this kind of attack. Check out Differential Privacy.

| ocdtrekkie wrote:
| More details on the Strava run incident: https://www.bbc.com/news/technology-42853072

| AlexTWithBeard wrote:
| Can confirm.
|
| I used to have a team, and at some point they all had to submit their feedback on my performance. The answers were then fed back to me unattributed, but it was pretty obvious who wrote what.

| rotten wrote:
| Never answer those honestly. More than half the time they aren't there to "help management understand how to do better", but rather to purge people who aren't happy. "We don't need employees who don't love it here."

| lb1lf wrote:
| -Oh, we were already deemed beyond redemption - we'd been a small company, quite successful in our narrow niche, only to be bought up by $MEGACORP.
|
| That was a culture clash. Big time.
|
| Nevertheless, I thought the same thing you did and filled in my questionnaire so that my answers created a nice, symmetrical pattern - it looked almost like a pattern for an arts & crafts project...

| vorpalhex wrote:
| I always say I am unhappy and would take another offer in a heartbeat, and so far that strategy has worked well for me - I usually get offered a good salary bump and bonus every year. Obviously YMMV.

| AstralStorm wrote:
| Even without sample size, having these aggregated results makes it very easy to predict who picked what with a modicum of extra information. (Even a silly binned personality type.)

| float4 wrote:
| Two things I'd like to say here.
|
| 1. All anonymisation algorithms (k-anonymity, l-diversity, t-closeness, e-differential privacy, (e,d)-differential privacy, etc.) have, as you can see, at least one parameter that states _to what degree_ the data has been anonymised. This parameter should not be kept secret, as it tells entities that are part of a dataset how well, and in what way, their privacy is being preserved. Take something like k-anonymity: the k tells you that every equivalence class in the dataset has a size >= k, i.e. for every entity in the dataset, there are at least k-1 other entities that look identical on the quasi-identifying attributes. There are a lot of things wrong with k-anonymity, but at least it's transparent. Tech companies, however, just state in their Privacy Policies that "[they] care a lot about your privacy and will therefore anonymise your data", without specifying _how_ they do that.
|
| 2. Sharing anonymised data with other organisations (this is called Privacy Preserving Data Publishing, or PPDP) is virtually always a bad idea if you care about privacy, because of the privacy-utility tradeoff: you either have data with sufficient utility, or you have data with sufficient privacy preservation, but you can't have both. You either publish/share useless data, or you publish/share data that does not preserve privacy well. You can decide for yourself whether companies care more about privacy or utility.
|
| Luckily, there's an alternative to PPDP: Privacy Preserving Data Mining (PPDM). With PPDM, data analysts submit statistical queries (queries that only return aggregate information) to the owner of the original, non-anonymised dataset. The owner runs the queries and returns the results to the analyst. Obviously, one could still infer the full dataset by submitting a sufficient number of specific queries (this is called a reconstruction attack). That's why a privacy mechanism is introduced, e.g. epsilon-differential privacy. With e-differential privacy, you essentially _guarantee_ that no query result depends significantly on any one specific entity, which puts a provable bound on what reconstruction attacks can learn.
|
| The problem with PPDM is that you can't sell your high-utility "anonymised" datasets, which sucks if you're a big boi data broker.
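A toy sketch of the PPDM-plus-differential-privacy setup float4 describes: the data owner answers a counting query with Laplace noise instead of releasing the data (Python, standard library only; the dataset, query, and epsilon are all invented for illustration):

    import random

    def dp_count(records, predicate, epsilon=0.5):
        """Noisy count of records matching `predicate`.

        A count has sensitivity 1, so Laplace noise with scale 1/epsilon gives
        epsilon-differential privacy: the released number barely depends on any
        single person's row, which is what blunts reconstruction attacks.
        """
        true_count = sum(1 for r in records if predicate(r))
        # Difference of two exponentials ~ Laplace(0, 1/epsilon).
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

    people = [
        {"age": 34, "smoker": True},
        {"age": 51, "smoker": False},
        {"age": 29, "smoker": True},
    ]
    print(dp_count(people, lambda r: r["smoker"]))

A real deployment would also track a cumulative privacy budget across queries; this sketch only shows the per-query mechanism.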
| motohagiography wrote:
| Important concepts. The key thing that has changed in privacy in the last couple of years is that de-identified data has recently been made into a legal concept instead of a technical one: you do a re-identification risk assessment (not a very mature methodology yet), figure out who is accountable in the event of a breach, label the data as de-identified, and include the recipients' obligation to protect it in the data sharing agreement.
|
| The effect on data sharing has been notable, because nobody wants to hold risk, whereas previously "de-identification" schemes (and even encryption) made risk and obligation evaporate, as they magically transformed the data from sensitive to less sensitive using encryption or data masking. Privacy Preserving Data Publishing is sympathetic magic from a technical perspective, as it just obfuscates the data ownership/custodianship and accountability.
|
| FHE is the only candidate technology I am aware of that meets this need, and DBAs, whose job it is to manage these issues, are notoriously insufficiently skilled to produce even a synthesized test data set from a data model, let alone implement privacy-preserving query schemes like differential privacy. What I learned from working on the issue with institutions was that nobody really cared about the data subjects; they cared about avoiding accountability, which seems natural, but only if you remove altruism and social responsibility altogether. You can't rely on managers to respect privacy as an abstract value or principle.
|
| Whether you have a technical or a policy control is really the crux of security vs. privacy. As technologists we mostly have a cryptographic, information-theoretic understanding of data and identification, but the privacy side is really about responsibilities around collection, use, disclosure, and retention. Privacy really is a legal concept, and you can kick the can down the road with security tools, but the reason someone wants to pay you for your privacy tool is that you are telling them you are taking on breach risk on their behalf by providing a tool. The people using privacy tools aren't using them because they preserve privacy; they use them because they're a magic feather that absolves them of responsibility. It's a different understanding of tools.
|
| However, it does imply a market opportunity for a crappy snakeoil freemium privacy product that says it implements the aforementioned techniques but barely does anything at all, and just allows organizations to say they are using it. Their problem isn't cryptographic; it just has to be sophisticated enough that non-technical managers can't be held accountable for reasoning about it, and they're using a tool, so they are compliant. I wonder what the "whitebox cryptography" people are doing these days...
| [deleted]

| amelius wrote:
| Can advertisers be legally forced to use these mathematical techniques?
___________________________________________________________________
(page generated 2021-12-30 23:00 UTC)