[HN Gopher] The Problem with Perceptual Hashes
___________________________________________________________________

The Problem with Perceptual Hashes

Author : rivo
Score  : 171 points
Date   : 2021-08-06 19:29 UTC (3 hours ago)

(HTM) web link (rentafounder.com)
(TXT) w3m dump (rentafounder.com)

| stickfigure wrote:
| I've also implemented perceptual hashing algorithms for use in the real world. The article is correct: there really is no way to eliminate false positives while still catching minor changes (say, resizing, cropping, or watermarking).
|
| I'm sure I'm not the only person with naked pictures of my wife. Do you really want a false positive to result in your intimate moments getting shared around some outsourced boiler room for laughs?
| jjtheblunt wrote:
| Why would other people have a naked picture of your wife?
| planb wrote:
| I fully agree with you. But while scrolling to the next comment, a question came to my mind: Would it really bother me if some person who does not know my name, has never met me in real life and never will, is looking at my pictures without me ever knowing about it? To be honest, I'm not sure if I'd care. Because for all I know, that might be happening right now...
| zxcvbn4038 wrote:
| Rookie mistake.
|
| Three rules to live by:
|
| 1) Always pay your taxes
|
| 2) Don't talk to the police
|
| 3) Don't take photographs with your clothes off
| jimmygrapes wrote:
| I might amend #2 a bit to read "Be friends with the police", as that has historically been more beneficial to those who are.
| vineyardmike wrote:
| > Do you really want a false positive to result in your intimate moments getting shared around some outsourced boiler room for laughs?
|
| These people also have no incentive to find you innocent for innocent photos. If they err on the side of a false negative, they might find themselves at the wrong end of a criminal search ("why didn't you catch this?"), but if they false-positive they at worst ruin a random person's life.
| jdavis703 wrote:
| Even still, this has to go to the FBI or other law enforcement agency, then it's passed on to a prosecutor, and finally a jury will evaluate. I have a tough time believing that false positives would slip through that many layers.
|
| That isn't to say CSAM scanning or any other type of dragnet is OK. But I'm not concerned about a perceptual hash ruining someone's life, just like I'm not concerned about a botched millimeter wave scan ruining someone's life for weapons possession.
| gambiting wrote:
| >> I have a tough time believing that false positives would slip through that many layers.
|
| I don't, not in the slightest. Back in the days when Geek Squad had to report any suspicious images found during routine computer repairs, a guy got reported to the police for having child porn, arrested, fired from his job, and named in the local newspaper as a pedophile, all before the prosecutor was actually persuaded by the defense attorney to look at these "disgusting pictures"... which turned out to be his own grandchildren in a pool. Of course he was immediately released, but not before the damage to his life was done.
|
| >> But I'm not concerned about a perceptual hash ruining someone's life
|
| I'm incredibly concerned about this, I don't see how you can not be.
| zimpenfish wrote:
| > Do you really want a false positive to result in your intimate moments getting shared around some outsourced boiler room for laughs?
|
| You'd have to have several positive matches against the specific hashes of CSAM from NCMEC before they'd be flagged up for human review, right? Which presumably lowers the chance of accidental false positives quite a bit?
| mjlee wrote:
| > I'm sure I'm not the only person with naked pictures of my wife.
|
| I'm not completely convinced that says what you want it to.
| enedil wrote:
| Didn't she possibly have previous partners?
| iratewizard wrote:
| I don't even have nude photos of my wife. The only person who might would be the NSA contractor assigned to watch her.
| websites2023 wrote:
| Presumably she wasn't his wife then. But also people have various arrangements, so I'm not here to shame.
| nine_k wrote:
| Buy a subcompact camera. Never upload such photos to any cloud. Use your local NAS / external disk / your Linux laptop's encrypted hard drive.
|
| Unless you prefer to live dangerously, of course.
| ohazi wrote:
| Consumer NAS boxes like the ones from Synology or QNAP have "we update your box at our whim" cloud software running on them and are effectively subject to the same risks, even if you try to turn off all of the cloud options. I probably wouldn't include a NAS on this list unless you built it yourself.
|
| It looks like you've updated your comment to clarify _Linux_ laptop's encrypted hard drive, and I agree with your line of thinking. Modern Windows and Mac OS are effectively cloud operating systems where more or less anything can be pushed at you at any time.
| derefr wrote:
| With Synology's DSM, at least, there's no "firmware" per se; it's just a regular Linux install that you have sudo(1) privileges on, so you can just SSH in and modify the OS as you please (e.g. removing/disabling the update service).
| cm2187 wrote:
| At least you can deny the NAS access to the WAN by blocking it on the router or not configuring the right gateway.
| marcinzm wrote:
| Given all the zero-day exploits on iOS, I wonder if it's now going to be viable to hack someone's phone and upload child porn to their account. Apple will happily flag the photos and then, likely, get those people arrested. Now they have to, in practice, prove they were hacked, which might be impossible. It will either ruin their reputation or put them in jail for a long time. Given past witch hunts it could be decades before people get exonerated.
| new_realist wrote:
| This is already possible using other services (Google Drive, Gmail, Instagram, etc.) that already scan for CP.
| toxik wrote:
| This is really a difficult problem to solve, I think. However, I think most people who are prosecuted for CP distribution are hoarding it by the terabyte. It's hard to claim that you were unaware of that. A couple of gigabytes though? Plausible. And that's what this CSAM scanner thing is going to find on phones.
| emodendroket wrote:
| A couple gigabytes is a lot of photos... and they'd all be showing up in your camera roll. Maybe possible but stretching the bounds of plausibility.
| danachow wrote:
| A couple gigabytes is enough to ruin someone's day but not a lot to surreptitiously transfer, it's literally seconds. Just backdate them and they may very well go unnoticed.
| [deleted]
| yellow_lead wrote:
| Regarding false positives re: Apple, the Ars Technica article claims
|
| > Apple offers technical details, claims 1-in-1 trillion chance of false positives.
|
| There are two ways to read this, but I'm assuming it means, for each scan, there is a 1-in-1 trillion chance of a false positive. Apple has over 1 billion devices. Assuming ten scans per device per day, you would reach one trillion scans in ~100 days. Okay, but not all the devices will be on the latest iOS, not all are active, etc, etc. But this is all under the assumption those numbers are accurate. I imagine reality will be much worse. And I don't think the police will be very understanding. Maybe you will get off, but you'll be in huge debt from your legal defense. Or maybe you'll be in jail, because the police threw the book at you.
| klodolph wrote:
| > Even at a Hamming Distance threshold of 0, that is, when both hashes are identical, I don't see how Apple can avoid tons of collisions...
|
| You'd want to look at the particular perceptual hash implementation. There is no reason to expect, without knowing the hash function, that you would end up with tons of collisions at distance 0.
| SavantIdiot wrote:
| This article covers three methods, all of which just look for alterations of a source image to find a fast match (in fact, that's the paper referenced). It is still a "squint to see if it is similar" test. I was under the impression there were more sophisticated methods that looked for _types_ of images, not just altered known images. Am I misunderstanding?
| jbmsf wrote:
| I am fairly ignorant of this space. Do any of the standard methods use multiple hash functions vs just one?
| jdavis703 wrote:
| Yes, I worked on such a product. Users had several hashing algorithms they could choose from, and the ability to create custom ones if they wanted.
| heavyset_go wrote:
| I've built products that utilize different phash algorithms at once, and it's entirely possible, and quite common, to get false positives across hashing algorithms.
| lordnacho wrote:
| Why wouldn't the algo check that one image has a face while the other doesn't? That would remove this particular false positive, though I'm not sure what new ones it might cause.
| PUSH_AX wrote:
| Because where do you draw the line with classifying arbitrary features in the images? The concept is that it should work with an image of anything.
| rustybolt wrote:
| > an Apple employee will then look at your (flagged) pictures.
|
| This means that there will be people paid to look at child pornography and probably a lot of private nude pictures as well.
| emodendroket wrote:
| And what do you think the content moderation teams employed by Facebook, YouTube, et al. do all day?
| mattigames wrote:
| Yeah, we obviously needed one more company doing it as well, and I'm sure having more positions in the job market which pretty much could be described as "Get paid to watch pedophilia all day long" will not backfire in any way.
| emodendroket wrote:
| You could say there are harmful effects of these jobs, but probably not in the sense you're thinking.
| https://www.wired.com/2014/10/content-moderation/
| josephcsible wrote:
| They look at content that people actively and explicitly chose to share with wider audiences.
| [deleted]
| Spivak wrote:
| Yep! I guess this announcement is when everyone is collectively finding out how this has, apparently quietly, worked for years.
|
| It's a "killing floor" type job where you're limited in how long you're allowed to do it in a lifetime.
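To make the Hamming-distance discussion above concrete, here is a minimal difference-hash (dHash) sketch in Python with Pillow. It is not Apple's NeuralHash or any specific vendor's algorithm; the file names, the 64-bit hash size, and the distance threshold of 10 are illustrative assumptions only.

    from PIL import Image

    def dhash(path, hash_size=8):
        # Shrink to (hash_size+1) x hash_size grayscale pixels and record
        # whether each pixel is brighter than its right-hand neighbour.
        img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
        px = list(img.getdata())
        bits = 0
        for row in range(hash_size):
            for col in range(hash_size):
                i = row * (hash_size + 1) + col
                bits = (bits << 1) | (px[i] > px[i + 1])
        return bits  # 64-bit perceptual hash

    def hamming(a, b):
        return bin(a ^ b).count("1")

    # Two unrelated photos reduced to 64 bits can land close together;
    # a nonzero threshold trades false negatives for false positives.
    d = hamming(dhash("photo_a.jpg"), dhash("photo_b.jpg"))
    print("distance:", d, "match" if d <= 10 else "no match")

Even a distance of 0 here only says that the 64 surviving bits agree, not that the underlying images are related, which is why the choice of hash function and threshold matters so much in the comments above.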
| varjag wrote:
| There are people who are paid to do that already, just generally not in corporate employment.
| pkulak wrote:
| Apple, with all those Apple == Privacy billboards plastered everywhere, is going to have a full-time staff of people with the job of looking through its customers' private photos.
| mattigames wrote:
| I'm sure that's the dream position for most pedophiles: watching child porn fully legally and being paid for it, plus being on record as someone who helps destroy it; and given that CP will exist for as long as human beings do, there will be no shortage no matter how much they help capture other pedophiles.
| ivalm wrote:
| I am not exactly buying the premise here: if you train a CNN on useful semantic categories, then the representations it generates will be semantically meaningful (so the error shown in the blog wouldn't occur).
|
| I dislike the general idea of iCloud having back doors, but I don't think the criticism in this blog is entirely valid.
|
| Edit: it was pointed out that Apple doesn't have a semantically meaningful classifier, so the blog post's criticism is valid.
| jeffbee wrote:
| I agree the article is a straw-man argument and is not addressing the system that Apple actually describes.
| SpicyLemonZest wrote:
| Apple's description of the training process (https://www.apple.com/child-safety/pdf/CSAM_Detection_Techni...) sounds like they're just training it to recognize some representative perturbations, not useful semantic categories.
| ivalm wrote:
| Ok, good point, thanks.
| ajklsdhfniuwehf wrote:
| WhatsApp and other apps place pictures from group chats in folders deep in your iOS gallery.
|
| Swatting will be a problem all over again.... wait, did it ever stop being a problem?
| karmakaze wrote:
| It really all comes down to whether Apple has, and is willing to maintain, the effort of human evaluation prior to taking action on the potentially false positives:
|
| > According to Apple, a low number of positives (false or not) will not trigger an account to be flagged. But again, at these numbers, I believe you will still get too many situations where an account has multiple photos triggered as a false positive. (Apple says that probability is "1 in 1 trillion" but it is unclear how they arrived at such an estimate.) These cases will be manually reviewed.
|
| At scale, even human classification of cases which ought to be clear will fail, with a reviewer accidentally clicking 'not ok' when they saw something they thought was 'ok'. It will be interesting to see what happens then.
| jdavis703 wrote:
| Then law enforcement, a prosecutor and a jury would get involved. Hopefully law enforcement would be the first and final stage if it was merely the case that a person pressed "ok" by accident.
| at_a_remove wrote:
| I do not know as much about perceptual hashing as I would like, but have considered it for a little project of my own.
|
| Still, I know it has been floating around in the wild. I recently came across it on Discord when I attempted to push an ancient image, from the 4chan of old, to a friend, which mysteriously wouldn't send. Saved it as a PNG, no dice. This got me interested. I stripped the EXIF data off of the original JPEG. I resized it slightly. I trimmed some edges. I adjusted colors. I did a one-degree rotation. Only after a reasonably complete combination of those factors would the image make it through. How interesting!
|
| I just don't know how well this little venture of Apple's will scale, and I wonder if it won't end up being easy enough to bypass in a variety of ways. I think the tradeoff will do very little, as stated, but is probably a glorious opportunity for black-suited goons of state agencies across the globe.
|
| We're going to find out in a big, big way soon.
|
| * The image is of the back half of a Sphynx cat atop a CRT. From the angle of the dangle, the presumably cold, man-made feline is draping his unexpectedly large testicles across the similarly man-made device to warm them, suggesting that people create problems and also their solutions, or that, in the Gibsonian sense, the street finds its own uses for things. I assume that the image was blacklisted, although I will allow for the somewhat baffling concept of a highly specialized scrotal-matching neural net that overreached a bit or a byte on species, genus, family, and order.
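The experiment described above can be reproduced with a few Pillow transforms against the same kind of difference hash sketched earlier; the path, the specific edits, and the threshold are again illustrative assumptions, not a claim about Discord's actual filter.

    from PIL import Image, ImageEnhance, ImageOps

    def dhash_img(img, hash_size=8):
        # Same difference hash as the earlier sketch, taking an Image object.
        small = img.convert("L").resize((hash_size + 1, hash_size))
        px = list(small.getdata())
        bits = 0
        for row in range(hash_size):
            for col in range(hash_size):
                i = row * (hash_size + 1) + col
                bits = (bits << 1) | (px[i] > px[i + 1])
        return bits

    def hamming(a, b):
        return bin(a ^ b).count("1")

    original = Image.open("cat_on_crt.jpg")  # placeholder path
    edits = {
        "resized": original.resize((original.width // 2, original.height // 2)),
        "cropped": ImageOps.crop(original, border=8),
        "recolor": ImageEnhance.Color(original).enhance(0.8),
        "rotated": original.rotate(1, expand=False),
    }
    base = dhash_img(original)
    for name, img in edits.items():
        print(name, hamming(base, dhash_img(img)))
    # Each single edit usually stays within a small distance of the
    # original hash; stacking several edits is what finally pushes it
    # past a matching threshold, consistent with the account above.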
| judge2020 wrote:
| AFAIK Discord's NSFW filter is not a perceptual hash, nor does it use the NCMEC database (although that might indeed be in the pipeline elsewhere); instead it uses an ML classifier (I'm certain it doesn't use perceptual hashes, as Discord doesn't have a catalogue of NSFW image hashes to compare against). I've guessed it's either open_nsfw[0] or Google's Cloud Vision, since the rest of Discord's infrastructure uses Google Cloud VMs. There's a web demo available of this API[1]; Discord probably pulls the safe-search classifications for determining NSFW.
|
| 0: https://github.com/yahoo/open_nsfw
|
| 1: https://cloud.google.com/vision#section-2
| a_t48 wrote:
| Adding your friend as a "friend" on Discord should disable the filter.
| ttul wrote:
| Apple would not be so naive as to roll out a solution to child abuse images that has a high false positive rate. They do test things prior to release...
| bjt wrote:
| I'm guessing you don't remember all the errors in the initial launch of Apple Maps.
| smlss_sftwr wrote:
| ah yes, from the same company that shipped this: https://medium.com/hackernoon/new-macos-high-sierra-vulnerab...
|
| and this: https://www.theverge.com/2017/11/6/16611756/ios-11-bug-lette...
| celeritascelery wrote:
| Test it... how exactly? This is detecting illegal material that they can't use to test against.
| bryanrasmussen wrote:
| Not knowing anything about it, but I suppose various governmental agencies maintain corpora of nasty stuff, and you can say to them: hey, we want to roll out anti-nasty-stuff functionality in our service, therefore we need access to corpora to test. At which point there is probably a pretty involved process that also requires governmental access to make sure things work and are not misused. Otherwise -
|
| how does anyone ever actually fight the nasty stuff? This problem structure of how do I catch examples of A if examples of A are illegal must apply in many places and ways.
| vineyardmike wrote:
| Test it against innocent data sets, then in prod swap it for the opaque gov db of nasty stuff and hope the gov was honest about what is in it :)
|
| They don't need to train a model to detect the actual data set. They need to train a model to follow a pre-defined algo.
| zimpenfish wrote:
| > This is detecting illegal material that they can't use to test against.
|
| But they can, because they're matching the hashes to the ones provided by NCMEC, not directly against CSAM itself (which presumably stays under some kind of lock and key at NCMEC).
|
| Same as you can test whether you get false positives against a bunch of MD5 hashes that Fred provides without knowing the contents of his documents.
| ben_w wrote:
| While I don't have any inside knowledge at all, I would expect a company as big as Apple to be able to ask law enforcement to run Apple's algorithm on data sets Apple themselves don't have access to and report the result.
|
| No idea if they did (or will), but I do expect it's possible.
| zimpenfish wrote:
| > ask law enforcement to run Apple's algorithm on data sets Apple themselves don't have access to
|
| Sounds like that's what they did, since they say they're matching against hashes provided by NCMEC generated from their 200k CSAM corpus.
|
| [edit: Ah, in the PDF someone else linked, "First, Apple receives the NeuralHashes corresponding to known CSAM from the above child-safety organizations."]
| IfOnlyYouKnew wrote:
| They want to avoid false positives, so you would test for that by running it over innocuous photos, anyway.
| [deleted]
| IfOnlyYouKnew wrote:
| Apple's documents said they require multiple hits before anything happens, as the article notes. They can (and have) adjusted that number to any desired balance of false positives to negatives.
|
| How can they say it's 1 in a trillion? You test the algorithm on a bunch of random negatives, see how many positives you get, and do one division and one multiplication. This isn't rocket science.
|
| So, while there are many arguments against this program, this isn't it. It's also somewhat strange to believe that the idea of collisions in hashes of far smaller size than the images they are run on somehow escaped Apple and/or really anyone mildly competent.
| bt1a wrote:
| That would not be a good way to arrive at an accurate estimate. Would you not need dozens of trillions of photos to begin with in order to get an accurate estimate when the occurrence rate is so small?
| KarlKemp wrote:
| What? No...
|
| Or, more accurately: if you need "dozens of trillions", that implies a false positive rate so low it's practically of no concern.
|
| You'd want to look up the Poisson distribution for this. But, to get at this intuitively: say you have a bunch of eggs, some of which may be spoiled. How many would you have to crack open to get a meaningful idea of how many are still fine, and how many are not?
|
| The absolute number depends on the fraction that are off. But independent of that, you'd usually start trusting your sample when you've seen 5 to 10 spoiled ones.
|
| So Apple runs the hash algorithm on random photos. They find 20 false positives in the first million. Given that error rate, how many positives would it take for the average photo collection of 10,000 to be certain, at a 1-in-a-trillion level, that it's not just coincidence?
|
| Throw it into, for example, https://keisan.casio.com/exec/system/1180573179 with lambda = 0.2 (you're expecting one false positive for every 50,000 at the error rate we assumed, or 0.2 for 10,000), and n = 10 (we've found 10 positives in this photo library) to see the chances of that: 2.35x10^-14, or 2.35 in 100 trillion.
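The figure in the comment above can be reproduced with a short Poisson tail calculation. This is a sketch of that arithmetic only; the per-image false-positive rate and the 10-hit threshold are the assumptions from the comment, not Apple's published parameters.

    from math import exp, factorial

    lam = 0.2        # expected false positives in a 10,000-photo library,
                     # assuming a per-image false-positive rate of 1 in 50,000
    threshold = 10   # number of independent hits required before flagging

    # P(X >= threshold) for X ~ Poisson(lam): one minus the CDF at threshold-1
    p_tail = 1 - sum(exp(-lam) * lam**k / factorial(k) for k in range(threshold))
    print(p_tail)    # ~2.35e-14, the "2.35 in 100 trillion" quoted above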
| mrtksn wrote:
| The technical challenges aside, I'm very disturbed that my device will be reporting me to the authorities.
|
| That's very different from authorities taking a sneak peek into my stuff.
|
| That's like the theological concept of always being watched.
|
| It starts with child pornography, but the technology is indifferent to the target; it can be anything.
|
| It's always about the children because we all want to save the children. Soon they will start asking you to save your country. Depending on your location, they will start checking for sins against religion, race, family values, or political activities.
|
| I bet you, after the next election in the US your device will be reporting you for spreading far-right or deep-state lies, depending on who wins.
| baggy_trough wrote:
| Totally agree. This is very sinister indeed. Horrible idea, Apple.
| zionic wrote:
| So what are we going to _do_ about it?
|
| I have a large user base on iOS. Considering a blackout protest.
| drzoltar wrote:
| The other issue with these hashes is non-robustness to adversarial attacks. Simply rotating the image by a few degrees, or slightly translating/shearing it, will move the hash well outside the threshold. The only way to combat this would be to use a face bounding box algorithm to somehow manually realign the image.
| foobarrio wrote:
| In my admittedly limited experience with image hashing, typically you extract some basic feature and transform the image before hashing (e.g. darkest corner in the upper left, or look for verticals/horizontals and align). You also take multiple hashes of the images to handle various crops, black and white vs color. This increases robustness a bit, but overall yes, you can always transform the image in such a way as to come up with a different enough hash. One thing that would be hard to catch is if you do something like a swirl and then the consumers of that content use a plugin or something to "deswirl" the image.
|
| There's also something like the Scale Invariant Feature Transform that would protect against all affine transformations (scale, rotate, translate, skew).
|
| I believe one thing that's done is that whenever any CP is found, the hashes of all images in the "collection" are added to the DB whether or not they actually contain abuse. So if there are any common transforms of existing images, then those also now have their hashes added to the DB. The idea being that a high percentage of hits, even from the benign hashes, means the presence of the same "collection".
| ris wrote:
| I agree with the article in general, except part of the final conclusion:
|
| > The simple fact that image data is reduced to a small number of bits leads to collisions and therefore false positives
|
| Our experience with regular hashes suggests this is not the underlying problem. SHA-256 hashes have 256 bits and still there are _no known_ collisions, even with people deliberately trying to find them. SHA-1 has only 160 bits to play with and it's still hard enough to find collisions. MD5 collisions are easier to find, but at 128 bits people still don't come across them by chance.
|
| I think the actual issue is that perceptual hashes tend to be used with this "nearest neighbour" comparison scheme, which is clearly needed to compensate for the inexactness of the whole problem.
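A minimal sketch of the distinction drawn in the comment above: exact set membership for a cryptographic hash versus thresholded nearest-neighbour matching for a perceptual hash. The database contents, the 64-bit hash size, and the threshold of 10 are illustrative assumptions, not Apple's or NCMEC's parameters.

    import hashlib

    # Cryptographic hash: a single changed byte yields a completely
    # different digest, so matching is a plain exact-membership lookup.
    known_sha256 = {"9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"}  # placeholder digest
    def exact_match(data: bytes) -> bool:
        return hashlib.sha256(data).hexdigest() in known_sha256

    # Perceptual hash: every 64-bit value within a Hamming distance of
    # the threshold counts as a hit, so each database entry effectively
    # claims a whole ball of nearby hash values.
    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    def nn_match(phash: int, db: list, threshold: int = 10) -> bool:
        return any(hamming(phash, h) <= threshold for h in db)

    # With 64 bits and threshold 10, each entry accepts roughly
    # sum(C(64, k) for k <= 10) ~ 1.8e11 of the 1.8e19 possible values,
    # which is where the false-positive exposure comes from.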
| marcinzm wrote:
| > an Apple employee will then look at your (flagged) pictures.
|
| Always fun when unknown strangers get to look at your potentially sensitive photos, with probably no notice given to you.
| judge2020 wrote:
| They already do this for PhotoDNA-matched iCloud Photos (and Google Photos, Flickr, Imgur, etc.); perceptual hashes do not change that.
| version_five wrote:
| I'm not familiar with iPhone picture storage. Are the pictures automatically synced with cloud storage? I would assume (even if I don't like it) that cloud providers may be scanning my data. But I would not expect anyone to be able to see or scan what is stored on my phone.
|
| Incidentally, I work in computer vision and handle proprietary images. I would be violating client agreements if I let anyone else have access to them. This is a concern I've had in the past, e.g. with Office 365 (the gold standard in disregarding privacy), which defaults to sending pictures in Word documents to Microsoft servers for captioning, etc. I use a Mac now for work, but if somehow this snooping applies to computers as well, I can't keep doing so while respecting the privacy of my clients.
|
| I echo the comment on another post: Apple is an entertainment company; I don't know why we all started using their products for business applications.
| Asdrubalini wrote:
| You can disable automatic backups; this way your photos won't ever be uploaded to iCloud.
| abawany wrote:
| By default it is enabled. One has to go through Settings to turn off the default iCloud upload, afaik.
| starkd wrote:
| The method Apple is using looks more like a cryptographic hash. That's entirely different from (and more secure than) a perceptual hash.
|
| From https://www.apple.com/child-safety/
|
| "Before an image is stored in iCloud Photos, an on-device matching process is performed for that image against the known CSAM hashes. This matching process is powered by a cryptographic technology called private set intersection, which determines if there is a match without revealing the result. The device creates a cryptographic safety voucher that encodes the match result along with additional encrypted data about the image. This voucher is uploaded to iCloud Photos along with the image."
|
| Elsewhere, it does explain the use of NeuralHashes, which I take to be the perceptual hash part of it.
|
| I did some work on a similar attempt a while back. I also have a way to store hashes and find similar images. Here's my blog post. I'm currently working on a full site.
|
| http://starkdg.github.io/posts/concise-image-descriptor
___________________________________________________________________
(page generated 2021-08-06 23:00 UTC)