[HN Gopher] Big data may not know your name, but it knows everything else
       ___________________________________________________________________
        
       Big data may not know your name, but it knows everything else
        
       Author : nemoniac
       Score  : 16 points
       Date   : 2021-12-30 09:40 UTC (13 hours ago)
        
 (HTM) web link (www.wired.com)
 (TXT) w3m dump (www.wired.com)
        
       | aledalgrande wrote:
       | To me it is crazy that selling that data is even legal.
        
         | throwaway_465 wrote:
         | Enjoy:
         | 
         | FinanceIQ by AnalyticsIQ - Consumer Finance Data USA - 241M
         | Individuals https://datarade.ai/data-products/financeiq
         | 
         | Individual Consumer Data
         | https://datarade.ai/search?utf8=%E2%9C%93&category=individua...
        
       | agsnu wrote:
       | If you're interested in this topic, I recommend the chapter on
       | Inference Control in Ross Anderson's excellent book "Security
       | Engineering". It's one of the ones which is freely available on
       | his web site
       | https://www.cl.cam.ac.uk/~rja14/Papers/SEv3-ch11-7sep.pdf
        
       | specialist wrote:
       | Two tangential "yes and" points:
       | 
       | 1)
       | 
       | I'm not smart enough to understand differential privacy.
       | 
       | So my noob mental model is: fuzz the data to create hash
       | collisions. Differential privacy's heuristics guide the
       | effort: how much source data and how much fuzz you need to
       | get X% certainty of "privacy", meaning the likelihood that
       | someone could reverse the hash and recover the source
       | identity.
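       | 
       | A toy sketch of the "fuzz" part, using randomized response
       | (one standard differentially private mechanism; the epsilon
       | value and the yes/no framing are just examples I picked):
       | 
       |     import math, random
       | 
       |     def randomized_response(true_bit, epsilon=math.log(3)):
       |         # Answer truthfully with probability
       |         # e^eps / (e^eps + 1), otherwise flip the bit.
       |         # Each individual record stays deniable.
       |         p = math.exp(epsilon) / (math.exp(epsilon) + 1)
       |         if random.random() < p:
       |             return true_bit
       |         return 1 - true_bit
       | 
       |     def estimate_rate(reports, epsilon=math.log(3)):
       |         # Unbiased estimate of the true "yes" rate from
       |         # the noisy reports.
       |         p = math.exp(epsilon) / (math.exp(epsilon) + 1)
       |         observed = sum(reports) / len(reports)
       |         return (observed + p - 1) / (2 * p - 1)
       | 
       | With epsilon = ln(3) each answer is reported truthfully 3/4
       | of the time, yet the aggregate "yes" rate can still be
       | estimated.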
       | 
       | BUT: This is entirely moot if the original (now fuzzed) data
       | set can be correlated with another data set.
       | 
       | 2)
       | 
       | All PII should be encrypted at rest, at the field level.
       | 
       | I really wish Wayner's Translucent Databases were better
       | known. TLDR: Wayner shows clever ways of using salt+hash to
       | protect identity, just as properly protected password files
       | are salted and hashed.
       | 
       | Again, entirely moot if protected data is correlated with another
       | data set.
       | 
       | http://wayner.org/node/46
       | 
       | https://www.amazon.com/Translucent-Databases-Peter-Wayner/dp...
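       | 
       | A minimal sketch of the salt+hash idea at the field level
       | (not Wayner's exact scheme; the column names and the use of
       | a keyed hash here are just illustrative):
       | 
       |     import hashlib, hmac, os
       | 
       |     # Per-table secret salt/key, stored outside the database.
       |     SALT = os.urandom(16)
       | 
       |     def protect(pii_value: str) -> str:
       |         # Store this digest instead of the raw identifier.
       |         # Exact-match joins and counts still work, but the
       |         # stored value alone doesn't reveal the identity.
       |         return hmac.new(SALT, pii_value.encode(),
       |                         hashlib.sha256).hexdigest()
       | 
       |     row = {"patient": protect("Jane Doe"),
       |            "glucose_mg_dl": 104}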
       | 
       | Bonus point 3)
       | 
       | The privacy "fix" is to extend property rights to all personal
       | data.
       | 
       | My data is me. I own it. If someone's using my data, for any
       | reason, I want my cut.
       | 
       | Pay me.
        
       | lrem wrote:
       | At Google we have a bunch of researchers on anonymity, and
       | that whole thing is _hard_. I vaguely remember supporting, a
       | couple of years ago, a pipeline where logs stripped of "all
       | PII" came in at one end and aggregated data came out of the
       | middle... into an anonymity verifier, which then redirected
       | much of it to /dev/null, because some technique was known to
       | de-anonymise it. And the research on differential privacy
       | has advanced quite a bit in the meantime.
        
         | geoduck14 wrote:
         | Did that stuff work?
        
       | dang wrote:
       | We've merged https://news.ycombinator.com/item?id=29734713 into
       | this thread since the previous submission wasn't to the original
       | source. That's why some comment timestamps are older.
        
       | MrDresden wrote:
       | Anecdotally, the only time I've seen a truly anonymized
       | database was at a European genetics research company, mainly
       | because of the (rightly) heavy regulation of the medical
       | field.
       | 
       | There was a whole separate legal entity, with its own board,
       | that gathered the phenotype measurements and stored the data
       | in a big on-premise database. The link between those
       | measurements and the individual's personally identifiable
       | record was stored in a separate, airgapped database with
       | cryptographic locks on both the data and physical access to
       | the server, so accessing the data took the physical presence
       | of the privacy officers of each of the two companies (the
       | measurement lab and the research lab) and, what I found at
       | the time to be the unique move, a representative from the
       | state-run privacy watchdog.
       | 
       | To trace the data back to a person, there was always going
       | to be a need to go through the watchdog - technically
       | required, not just legally mandated.
       | 
       | All of the measurement data that was stored in the database came
       | from very restricted input fields in the custom software that was
       | made on prem (no long form text input fields for instance, where
       | identifying data could be put in accidentally), and there was a
       | lot of thought put into the design of the UI to limit the
       | possibility that anyone could put identifiable data into the
       | record.
       | 
       | For instance, the numerical ranges for a specific phenotype
       | were all prefilled in a dropdown, so as to keep user key
       | input to a minimum. Much of the data also came from direct
       | connections to the medical equipment (I wrote a serial
       | connector for a Humphrey medical eye scanner that parsed the
       | results straight into the software, skipping the human
       | element altogether).
       | 
       | This didn't make for the nicest looking software (endless
       | dropdowns and scales), but it fulfilled its functionality and
       | privacy goals perfectly.
       | 
       | The measurement data would then go through many automatic filters
       | and further anonymizing processes before being delivered through
       | a dedicated network pipeline (configured through the local
       | ISP to be unidirectional) to the research lab.
       | 
       | Is this guaranteed to never leak any private information? No,
       | nothing is 100%. But it comes damn close, though of course it
       | would not work in most other normal business situations.
        
         | Hendrikto wrote:
         | > To be able to backtrack the data to a person, there was
         | always going to be a need to go through the watchdog.
         | 
         | The assumption in the deanonymization literature is that this
         | data is unavailable. So no, you don't need to go through any
         | watchdog.
        
           | MrDresden wrote:
           | Yes, they had to, in case the person providing the data
           | had opted to be notified about a severe medical condition
           | or other findings that showed up during the analysis
           | process. For those cases the mappings were kept around,
           | and using them did require going through the watchdog.
        
             | Hendrikto wrote:
             | I think you misunderstood my point. There are ways of
             | deanonymizing data, just by looking at the data alone. In
             | fact, this is the standard assumption.
             | 
             | This watchdog stuff is nice for the "good" actors, but
             | irrelevant for adversaries.
        
               | MrDresden wrote:
               | I did understand the point perfectly. The mechanism
               | was there simply for the _good_ actors to trace the
               | data back to the matching person; its purpose was
               | never to play a part in making the data more
               | anonymous.
               | 
               | If what you meant to say was the clearer statement
               | that adversaries wouldn't need to do that, then I'd
               | have agreed with you.
               | 
               | Everything else that was mentioned - the strict
               | processes for determining what data could be stored
               | and what, if anything, it exposed about the user,
               | eliminating as much of the human input as possible,
               | and the post-processing of the data before it left
               | the measurement lab - those are the steps that
               | achieved anonymity (as far as everyone believed it
               | had been achieved).
        
       | benreesman wrote:
       | One click removed from an original source that is soaked in
       | second-rate adtech crap.
       | 
       | To my dying day I will regret being one of the architects of this
       | insidious mechanism.
       | 
       | The problem with trying to evade, or defeat, or even sidestep
       | this stuff is that latent-rep embeddings break human intuition in
       | their effectiveness.
       | 
       | There was a time when the uniqueness of one's signature could
       | move money.
       | 
       | Those pen twitches are still there to see in what order you click
       | on links.
        
         | kurthr wrote:
         | Just wait until they have your unconscious eye-twitches.
         | 
         | Foveated rendering is required for adequate resolution and
         | refresh in VR/AR, given the bandwidths and GPU calculations
         | involved (e.g. 6k x 6k x RGB x 2 eyes x 120 Hz = ~200
         | Gb/s). Updating full detail only in the 20-30 degrees
         | around the eye's focus reduces this by >10x, which cuts
         | power/weight and GPU cost dramatically.
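         | 
         | Rough arithmetic behind those numbers (8 bits per channel
         | and the ~10x foveation saving are ballpark assumptions):
         | 
         |     # Full-resolution stereo stream:
         |     frame_bits = 6000 * 6000 * 3 * 8  # one 6k x 6k RGB frame
         |     full = frame_bits * 2 * 120       # two eyes at 120 Hz
         |     print(full / 1e9)                 # ~207 Gb/s
         | 
         |     # Foveated: full detail only near the gaze point.
         |     print(full / 10 / 1e9)            # ~21 Gb/s at ~10x less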
        
         | jaytaylor wrote:
         | Having a conscience is good, but don't be too hard on
         | yourself, old friend.
        
       | sanxiyn wrote:
       | Remember that 33 bits of entropy are enough to identify everyone.
       | It may not be legally so, but any data with 33 bits of entropy is
       | technically PII, and you should treat it as such.
        
         | unixhero wrote:
         | What do you mean here? I am asking because this is potentially
         | useful.
        
           | jodrellblank wrote:
           | There are ~8,000,000,000 people in the world; that's a
           | ten-digit number, so ten digits is the smallest length of
           | number that could count out a unique number for everyone
           | in the world - nine digits don't have enough possible
           | values. If the digit values are based on details about
           | you - e.g. being in the USA sets the second digit to
           | 0/1/2, being in Canada and male sets it to 3, being in
           | Canada and female sets it to 4, the last two digits are
           | your height in inches, and so on - then you don't have to
           | count out the numbers and give one to everyone; the ten
           | digits become a kind of identifier of their own.
           | 1,445,234,170 narrows down to (a woman in Canada, 70
           | inches tall, ...) until it only matches one person. There
           | are lots of people of the same height, so perhaps it
           | won't quite identify a single person, but it will be
           | close; maybe one or two more digits are enough to
           | tiebreak and reduce it to one person.
           | 
           | Almost anything will do as a tiebreak between two people
           | - married, wears glasses, keeps snakes, once visited
           | whitehouse.gov, walked past a Bluetooth advertising
           | beacon on Main Street San Francisco. Start from 8 billion
           | people and make yes/no tiebreaks that split them into two
           | groups - a divide and conquer approach: split the group
           | in two, then in two again; cheerful/miserable, speaks
           | Spanish yes/no, once owned a dog yes/no, once had a
           | Google account yes/no, once took a photo at a wedding
           | yes/no, ever had a tooth filling yes/no, moved house more
           | than once in a year yes/no, ever broke a bone yes/no, has
           | a Steam account yes/no. With anything that divides people
           | you will "eventually" winnow 8 billion down to 1 person,
           | and the set of tiebreaks holds enough information to
           | uniquely identify individual people.
           | 
           | I say "eventually"; if you can find tiebreaks that split
           | the groups perfectly in half each time, then you only
           | need 33 of them to reduce 8 billion down to 1. This is
           | all another way of saying "counting in binary":
           | 101001011010100101101010010110101 is a 33-bit binary
           | number; it can be seen as 33 yes/no tiebreaks, and it's
           | long enough to count up past 8 billion. It's 2^33 - two
           | possible values in each position, 33 times.
           | 
           | That means any collection of data about people which
           | reaches 33 bits of information about each person is
           | getting close to being enough to uniquely identify
           | people. If you end up gathering how quickly someone
           | clicks a cookie banner, that has some information hidden
           | in it about
           | how familiar they are with cookie banners and how physically
           | able they are, that starts to divide people into groups. If
           | you gather data about their web browser, that tells you what
           | OS they run, what version, how up to date it is, those divide
           | people into buckets. What time they opened your email with a
           | marketing advert in it gives a clue to their timezone and
           | work hours. Not very precise, but it only needs 33 bits
           | before it approaches enough to start identifying individual
           | people. Gather megabytes of this stuff about each person, and
           | identities fall out - the person who searched X is the same
           | person who stood for longer by the advert beacon and supports
           | X political candidate and lives in this area and probably has
           | an income in X range and ... can only be Joe Bloggs.
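           | 
           | The arithmetic, if anyone wants to check it:
           | 
           |     import math
           | 
           |     population = 8_000_000_000
           |     # bits needed to give everyone a distinct label
           |     print(math.ceil(math.log2(population)))  # 33
           |     print(2 ** 33)           # 8,589,934,592 possible IDs
           | 
           |     # each attribute contributes log2(number of groups it
           |     # splits people into) bits, e.g. timezone alone:
           |     print(math.log2(24))     # ~4.6 bits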
        
             | unixhero wrote:
             | Jawdropping. So this is the "33 bits" I've heard people
             | throw around. Thank you so much for elaborating in such
             | a detailed and insightful way.
        
         | konschubert wrote:
         | That makes no sense, sorry.
         | 
         | Ok, 2^33 > world population, but that doesn't mean that the
         | string "Hello world" is PII.
        
           | halhen wrote:
           | That depends on the encoding, does it not? The binary
           | sequence equal to ASCII "Hello world" might well be PII with
           | many different encodings. By accident, of course, but
           | nevertheless 33 bits of information would be enough.
        
           | stillicidious wrote:
           | 33 bits of entropy, not just 33 bits
        
             | [deleted]
        
           | AstralStorm wrote:
           | Unless someone is actually called Hello World. Or perhaps
           | Bobby Tables. ;)
        
       | lb1lf wrote:
       | -Anecdotally, at a former employer we had an annual questionnaire
       | used to estimate how content we were.
       | 
       | The results, we were assured, would only be used in aggregate
       | after having been anonymized.
       | 
       | I laughed quite hard when the results were back - the 'Engineer,
       | female, 40-49yrs, $SITE' in the office next door wasn't as
       | amused. All her responses had been printed in the report. Sample
       | size: 1.
        
         | sqrt17 wrote:
         | At our (fairly large) company, you can query by team (and maybe
         | job role) but it will hide responses where the sample size is
         | smaller than a set number (I think 8 or 10).
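         | 
         | Roughly this kind of filter, before anyone can see the
         | results (survey_rows and the field names are placeholders,
         | and k is whatever threshold HR picked):
         | 
         |     from collections import Counter
         | 
         |     def suppress_small_groups(rows, key, k=8):
         |         # Drop every group with fewer than k responses
         |         # before it can be queried or displayed.
         |         counts = Counter(key(r) for r in rows)
         |         return [r for r in rows if counts[key(r)] >= k]
         | 
         |     visible = suppress_small_groups(
         |         survey_rows, key=lambda r: (r["team"], r["role"]))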
         | 
         | So yes it can be done but people have to actually care about
         | it.
         | 
         | The cautionary tale about k-anonymity (from Aaron Swartz's
         | book, I think) is when the behavior of the aggregate is
         | itself something that should be kept private - the example
         | was that the morning run at an army base in a foreign
         | country was revealed because enough people ran it with
         | their smartwatches on that it formed a neat cluster.
        
           | pfraze wrote:
           | Isn't location data particularly easy to de-anonymize? I
           | remember reading some research that because people tend to be
           | so consistent with their location, you could deanonymize
           | most people in a dataset with 3 random location samples
           | taken through the day.
        
           | teraku wrote:
           | In Germany (and I think all of the EU), a dataset can
           | only be published if the sample size is at least 7.
        
             | williamtrask wrote:
             | Just FWIW, sample size isn't a robust defence against
             | this kind of attack. Check out Differential Privacy.
        
           | ocdtrekkie wrote:
           | More details on the Strava Run incident:
           | https://www.bbc.com/news/technology-42853072
        
         | AlexTWithBeard wrote:
         | Can confirm.
         | 
         | I used to have a team and at some point they all had to submit
         | their feedback on my performance. The answers were then fed
         | back to me unattributed, but it was pretty obvious who wrote
         | what.
        
         | rotten wrote:
         | Never answer those honestly. More than half the time they
         | aren't there to "help management understand how to do
         | better" but rather to purge people who aren't happy. "We
         | don't need employees who don't love it here."
        
           | lb1lf wrote:
           | -Oh, we were already deemed beyond redemption - we'd been a
           | small company, quite successful in our narrow niche, only to
           | be bought up by $MEGACORP.
           | 
           | That was a culture clash. Big time.
           | 
           | Nevertheless, I thought the same thing you did and filled in
           | my questionnaire so that my answers created a nice,
           | symmetrical pattern - it looked almost like a pattern for an
           | arts&crafts project...
        
           | vorpalhex wrote:
           | I always say I am unhappy and would take another offer in
           | a heartbeat, and so far that strategy has worked well for
           | me - I usually get offered a good salary bump and bonus
           | every year. Obviously YMMV.
        
         | AstralStorm wrote:
         | Even without tiny sample sizes, having these aggregated
         | results makes it very easy to predict who picked what with
         | a modicum of extra information (even a silly binned
         | personality type).
        
       | float4 wrote:
       | Two things I'd like to say here
       | 
       | 1. All anonymisation algorithms (k-anonymity, l-divergence,
       | t-closeness, e-differential privacy, (e,d)-differential privacy,
       | etc.) have, as you can see, at least one parameter that states
       | _to what degree_ the data has been anonymised. This parameter
       | should not be kept secret, as it tells entities that are part of
       | a dataset how well, and in what way, their privacy is being
       | preserved. Take something like k-anonymity: the k tells you that
       | every equivalence class in the dataset has a size  >= k, i.e. for
       | every entity in the dataset, there are at least k-1 other
       | identical entities in the dataset. There are a lot of things
       | wrong with k-anonymity, but at least it's transparent. Tech
       | companies however just state in their Privacy Policies that
       | "[they] care a lot about your privacy and will therefore
       | anonymise your data", without specifying _how_ they do that.
       | 
       | 2. Sharing anonymised data with other organisations (this is
       | called Privacy Preserving Data Publishing, or PPDP) is virtually
       | always a bad idea if you care about privacy, because there is
       | something called the privacy-utility tradeoff: you either have
       | data with sufficient utility, or you have data with sufficient
       | privacy preservation, but you can't have both. You either
       | publish/share useless data, or you publish/share data that does
       | not preserve privacy well. You can decide for yourself whether
       | companies care more about privacy or utility.
       | 
       | Luckily, there's an alternative to PPDP: Privacy Preserving Data
       | Mining (PPDM). With PPDM, data analysts can submit statistical
       | queries (queries that only return aggregate information) to the
       | owner of the original, non-anonymised dataset. The owner will run
       | the queries, and return the result to the data analyst.
       | Obviously, one can still infer the full dataset as long as they
       | submit a sufficient number of specific queries (this is called a
       | Reconstruction Attack). That's why a privacy mechanism is
       | introduced, e.g. epsilon-differential privacy. With
       | e-differential privacy, you essentially _guarantee_ that no query
       | result depends significantly on one specific entity. This makes
       | reconstruction attacks impossible.
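       | 
       | A minimal sketch of that kind of query interface (Laplace
       | mechanism on a counting query; the epsilon here is an
       | arbitrary example):
       | 
       |     import random
       | 
       |     def dp_count(records, predicate, epsilon=0.5):
       |         # A count changes by at most 1 when one person is
       |         # added or removed (sensitivity 1), so Laplace noise
       |         # with scale 1/epsilon gives epsilon-DP per query.
       |         true_count = sum(1 for r in records if predicate(r))
       |         noise = (random.expovariate(epsilon)
       |                  - random.expovariate(epsilon))
       |         return true_count + noise
       | 
       |     # The analyst only ever sees noisy aggregates:
       |     ages = [23, 37, 41, 58, 62, 29]
       |     print(dp_count(ages, lambda a: a >= 40))
       | 
       | Each additional query spends more of the privacy budget,
       | which is what keeps the reconstruction attack described above
       | from working.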
       | 
       | The problem with PPDM is that you can't sell your high-utility
       | "anonymised" datasets, which sucks if you're a big boi data
       | broker.
        
         | motohagiography wrote:
         | Important concepts. The key thing that has changed in
         | privacy in the last couple of years is that de-identified
         | data has recently been made into a legal concept instead of
         | a technical one, whereby
         | you do a re-identification risk assessment (not a very mature
         | methodology in place yet), figure out who is accountable in the
         | event of a breach, label the data as de-identified, and include
         | the obligation of the recipients to protect it in the data
         | sharing agreement.
         | 
         | The effect on data sharing has been notable because nobody
         | wants to hold risk, where previously "de-identification"
         | schemes (and even encryption) made their risk and obligation
         | evaporate as it magically transformed the data from sensitive
         | to less sensitive using encryption or data masking. Privacy
         | Preserving Data Publishing is sympathetic magic from a
         | technical perspective, as it just obfuscates the data
         | ownership/custodianship and accountability.
         | 
         | FHE is the only candidate technology I am aware of that meets
         | this need, and DBAs, whose jobs are to manage these issues, are
         | notoriously insufficiently skilled to produce even a
         | synthesized test data set from a data model, let alone
         | implement privacy preserving query schemes like differential
         | privacy. What I learned from working on the issue with
         | institutions was nobody really cared about the data subjects,
         | they cared about avoiding accountability, which seems natural,
         | but only if you remove altruism and social responsibility
         | altogether. You can't rely on managers to respect privacy as an
         | abstract value or principle.
         | 
         | Whether you have a technical or policy control is really at the
         | crux of security vs. privacy, where as technologists we mostly
         | have a cryptographic/information theoretic understanding of
         | data and identification, but the privacy side is really about
         | responsibilities around collection, use, disclosure, and
         | retention. Privacy really is a legal concept, and you can kick
         | the can down the road with security tools, but the reason
         | someone wants to pay you for your privacy tool is that you are
         | telling them you are taking on breach risk on their behalf by
         | providing a tool. The people using privacy tools aren't using
         | them because they preserve privacy, they use them because it's
         | a magic feather that absolves them of responsibility. It's a
         | different understanding of tools.
         | 
         | However, it does imply a market opportunity for a crappy
         | snakeoil freemium privacy product that says it implements
         | the aforementioned techniques but barely does anything at
         | all, and
         | just allows organizations to say they are using it. Their
         | problem isn't cryptographic, it just has to be sophisticated
         | enough that non-technical managers can't be held accountable
         | for reasoning about it, and they're using a tool so they are
         | compliant. I wonder what the "whitebox cryptography" people are
         | doing these days...
        
         | [deleted]
        
         | amelius wrote:
         | Can advertisers be legally forced to use these mathematical
         | techniques?
        
       ___________________________________________________________________
       (page generated 2021-12-30 23:00 UTC)