[HN Gopher] Meow Hash (2018) ___________________________________________________________________ Meow Hash (2018) Author : DeathArrow Score : 327 points Date : 2021-10-29 14:27 UTC (8 hours ago) (HTM) web link (mollyrocket.com) (TXT) w3m dump (mollyrocket.com) | Cilvic wrote: | I know +1 is against the rules, but I think this deserves a | positive comment. I enjoyed reading it, thanks for taking the | time and giving back to the ecosystem. | anonova wrote: | A detailed analysis of Meow Hash: https://peter.website/meow-hash-cryptanalysis | | It's not the highest-quality of hash functions (see the SMHasher | benchmarks), but it is fast. A great alternative is XXH3 | (https://cyan4973.github.io/xxHash/), which has seen far more | usage in practice. | thomasahle wrote: | XXH3 has issues too. There are certain inputs, independent of | the seed, on which it collides much more often than it should. | | On the other hand a hash like UMash guarantees low collisions | on any input: | https://engineering.backtrace.io/2020-08-24-umash-fast-enoug... | jdcarter wrote: | I'm using XXHash3 for verifying data integrity of large (10+MB) | blobs. Very fast and appears to work very well in my testing-- | it's never missed any bit error I've thrown at it. | | Aside: when storing hashes, be sure to store the hash type as | well so that you can change it later if needed, e.g. | "xxh3-[hash value]". RFC-6920 also has things to say about | storing hash types and values, although I haven't seen its | format in common use. | DeathArrow wrote: | Useless analysis since the author says it's not a cryptographic | hash but useful as a fast hash for change detection. | | "we wanted a fast, non-cryptographic hash for use in change | detection and deduplication" | | >A great alternative is XXH3 | | Meow Hash is twice as fast. | iratewizard wrote: | Meow hash is also written by people who initially thought | SHA-1 was acceptable for large scale change hashing.
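jdcarter's aside above (store the algorithm name alongside the digest, e.g. "xxh3-[hash value]") can be sketched like this. blake2b from Python's stdlib stands in for whatever fast hash is actually chosen, and the function names here are illustrative, not any particular library's API:

```python
import hashlib

def tagged_digest(data: bytes, algo: str = "blake2b") -> str:
    """Return 'algo-hexdigest' so the algorithm can be swapped later
    without invalidating stored values of older algorithms."""
    return f"{algo}-{hashlib.new(algo, data).hexdigest()}"

def verify(data: bytes, tagged: str) -> bool:
    """Recompute using whichever algorithm the stored tag names."""
    algo, _, expected = tagged.partition("-")
    return hashlib.new(algo, data).hexdigest() == expected
```

Old records keep verifying with their recorded algorithm while new records can use a different one, which is the migration property jdcarter is after.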
| MauranKilom wrote: | If you had made it one section into the analysis, you would | have seen that _at the time_ MeowHash made certain | cryptographic claims that the author set out to disprove. | | The readme has since been updated. I didn't check whether any | algorithmic changes were made on top, but the discussion of | the analysis on github didn't point to a lot of low-hanging | fruit. | [deleted] | LeoPanthera wrote: | It's not useless analysis, because even for non-cryptographic | hashes you want the likelihood of any arbitrary hash to be | roughly equal. A hash function which "prefers" certain | outputs has a far higher probability of collision. | IncRnd wrote: | Don't you think asset planting is an attack against a game's | pipeline? | | The author of the article's page claims the hash is not | cryptographic but actually goes on to make security claims | about the hash. People who do not understand cryptography | should be careful about making such claims. The author appears | to understand this more than your comment demonstrates. | | For example, a claim about change detection is a | cryptographic claim of detecting preimage attacks. In a | threat model, a security professional would determine whether | a first preimage or a second preimage attack is what should | be guarded in attack scenarios. Then, the professional would | help with analysis, determining mitigations, defense in | depth, and prioritization of fixing the vulnerabilities | exposed by how the hash is used. | | A hash cannot be considered standalone. It is the | architecture and use-case where the hash's security | properties are used to determine what security properties of | the application are fulfilled. | | So, if the author is correct, which seems to be the case, | then meowhash should not be used in a production environment | outside of the most simplistic checks.
It seems faster for | its intended use case to simply check for a single bit | difference between two images - no hash required. | null000 wrote: | > it seems faster for its intended use... | | But then you have to store the entire before & after | locally? That's the entire point of using a hash for change | detection. | IncRnd wrote: | > But then you have to store the entire before & after | locally? | | Yes, there is difference-between-two (as you say) and | there is integrity (modification detection). In the case | of comparing new assets in a pipeline to those that were | created earlier, it sounds plausible both copies would be | present. | | > That's the entire point of using a hash for change | detection. | | This is called integrity protection. Change detection is | the incorrect term to use here. Please see what I | referenced earlier for first and second preimage. | ilitirit wrote: | I can vouch for xxHash (it does what it says on the can), but | I'm really curious to hear from those who have experience with | meow hash. | kazinator wrote: | > _Because we have to process hundreds of gigabytes of art assets | to build game packages, we wanted a fast, non-cryptographic hash | for use in change detection and deduplication. We had been using | a cryptographic hash (SHA-1), but it was unnecessarily slowing | things down._ | | That's because you were doing a potentially silly thing: hashing | every byte. | | What does a hash tell you? If two hashes are different, it | confirms that the objects are different. If two hashes are the | same, they could be the same object or they could be different. | | For that purpose, you don't have to hash everything: just hash | the areas where objects are likely to differ. | | If two 15 gigabyte files are likely to have differences in their | first kilobyte, then hashing just the first kilobyte (negligibly | fast) will work just as well as hashing 15 gigabytes. | | If two files are of different length, then they are not the same.
| How about: | | hash(<length> + <first kilobyte> + <last kilobyte>) | | For deduplication, a 32 bit CRC could probably be reasonably | efficient. No matter what hash you use, you have to verify the | collisions. If CRC32 is enough to make false positives rare over | your type of data set, it's good enough. | andruc wrote: | "just hash the areas where objects are likely to differ" | | Correct me if I'm wrong but given these are art assets, | wouldn't "determining where the objects are likely to differ" | itself be an expensive/hard problem? | cb321 wrote: | It really depends. The early bytes probably contain things | like image dimensions which might discriminate a lot - or not | at all. They might contain palette-style color information | which might discriminate tons - or not at all. Most things | with hash functions depend upon distributions of inputs | relative to the function, though. | | I think it would be accurate to say "Determining where they | are likely to differ is not necessarily hard, but it could | be." If it happens to be easy then you can radically change | the scale of the problem. | | File length alone is a weak hash actively maintained by any | file system and so available "for free" (or one stat call). | And it can work surprisingly well, as the grandparent alludes | to. | | EDIT: You actually do not need to even open(2) files that are | not part of "same size clusters", never mind hash their data. | This alone can really squash the problem size over a more | naive approach even if you did feel the need to hash the | whole file. | | EDIT: And in terms of performance, if this data has to | actually be read off a drive then the hash need not go any | faster than the drive IO which is probably much, much slower | than most candidate hash functions. Even PCIe5.0 NVMes are | only like 14 GB/s, SATA buses like 0.75 GB/s, etc. 
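kazinator's hash(<length> + <first kilobyte> + <last kilobyte>) suggestion can be sketched roughly like this. blake2b is only a stdlib stand-in for whatever fast hash you would actually use, and quick_fingerprint is a made-up name:

```python
import hashlib
import os

KB = 1024

def quick_fingerprint(path: str) -> str:
    """Hash the file length plus its first and last kilobyte.
    A cheap change-detection fingerprint, not a full-content digest:
    changes confined to the untouched middle will go unnoticed."""
    size = os.path.getsize(path)
    h = hashlib.blake2b(size.to_bytes(8, "little"))  # length folds in for free
    with open(path, "rb") as f:
        h.update(f.read(KB))            # first kilobyte (or whole file if small)
        if size > KB:
            f.seek(max(size - KB, KB))  # last kilobyte, without re-reading overlap
            h.update(f.read(KB))
    return h.hexdigest()
```

As the surrounding comments note, length-differing files already diverge via the size prefix, so they never need their bytes read at all when grouped by size first.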
| kazinator wrote: | To avoid the hash being entirely oblivious to bytes beyond | a small block at the start, we could take additional | samples of the file at exponentially increasing offsets: | | Hash the first kilobyte. | | Hash the 16th kilobyte | | Hash the 256th kilobyte | | Hash the 4096th kilobyte (4M) | | Hash the 65536th kilobyte (64M) | | Hash the 1048576th kilobyte (1G) | | Then 16G, 256G, ... | | It might be worth including the last kilobyte. (I think I | mentioned that.) If the file format has trailing metadata, | that can help, especially if there is any sort of checksum | in it. | cb321 wrote: | Sure. I agree various sampling strategies are a good | design, as long as they don't add up to much IO. Making | that a "user parameter" to be customizable seems ideal. | What you suggest may make a fine default - if adapted to | varying file sizes somehow. E.g., that Nim dups program I | mentioned elsewhere could just take a sequence of slices | instead of only one..maybe just ignoring out of bounds | slices. { And for size clusters where the size < N, screw | it - just hash the whole thing, probably. ;-) } | cb321 wrote: | This is roughly how | https://github.com/c-blake/cligen/blob/master/examples/dups.... | can work. There is little question that changing the scale of | the problem very much lessens performance sensitivity to the | hash. | pepoluan wrote: | CRC32 is actually very slow on modern CPUs. So if you can use | an algorithm that is both (1) faster _and_ (2) better quality, | why not use that algo instead? | kazinator wrote: | > _CRC32 is actually very slow on modern CPUs_ | | Did you read my whole comment? If we are hashing just a few | kilobytes of a multi-gigabyte file, it almost certainly | doesn't matter. | midnightclubbed wrote: | I wouldn't categorize it as 'very slow', at worst 50% slower | than 128bit XXH3 hash (and faster than the 32bit XXH hash).
| | Modern intel processors have the CRC32C instructions and ARM | has both CRC32 and CRC32C. There is a reluctance in some of | the hash comparisons to use the hardware CRC implementations | for compatibility reasons, but most modern CPUs have them. | | The main argument for using a hardware CRC32 (and to some | extent a software one too) is simplicity. If your use-case | needs decent (but not blinding) performance, then there is | something to be said for a hash implementation that can be | written in ~10 lines of code. You still have to handle the | possibility of collisions but any 32bit hash has that | restriction, which is fine for many uses. | ed25519FUUU wrote: | > _The Meow hash is a high-speed hash function named after the | character Meow in Meow the Infinite._ | | As someone who doesn't watch cartoons I'm constantly surprised at | the influence cartoons and anime are having on modern tech, | especially with regards to computer science. | phtrivier wrote: | In this case, the comic Meow the Infinite is actually done by | the same game shop behind the meow hash. | caeril wrote: | Not sure you can call MR a "game shop". Pretty sure it's just | Casey, with Anna on contract. | arduinomancer wrote: | To add some context, Casey is one of the creators of Meow the | Infinite | alecco wrote: | (2018) | amenghra wrote: | Would be interesting to compare it to siphash. | adamgordonbell wrote: | Found this in the readme: Due to recent | discoveries by Peter Schmidt-Nielsen, we have decided to | reclassify Meow hash 0.5/calico from level 3 to level 1. This | means that we recommend not to use this hash for message | authentication codes, or for hash tables in scenarios where | collision induced denial-of-service attacks are a concern. | For level 3/MAC capabilities consider migrating to SipHash. If | the performance of SipHash is not satisfying, continuing to use | Meow 0.5 for hash tables is better than migrating to another | fast hash.
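The readme's advice quoted above (keyed SipHash for DoS-resistant hash tables) amounts to mixing a secret per-process seed into the bucket computation so an attacker cannot precompute colliding keys. A rough sketch using blake2b's keyed mode as a stdlib stand-in, since Python doesn't expose SipHash directly; the names are illustrative:

```python
import hashlib
import os

# Secret per-process seed: table slots become unpredictable to outsiders,
# which is the property the readme's "level 3" classification is about.
SEED = os.urandom(16)

def keyed_slot(key: bytes, nbuckets: int, seed: bytes = SEED) -> int:
    """Keyed hash -> bucket index. blake2b keyed mode stands in here;
    SipHash is the primitive the readme actually recommends."""
    digest = hashlib.blake2b(key, key=seed, digest_size=8).digest()
    return int.from_bytes(digest, "little") % nbuckets
```

An unkeyed fast hash in the same position lets anyone who knows the function flood one bucket, turning O(1) lookups into O(n), which is the collision-induced denial of service the readme warns about.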
| CodesInChaos wrote: | Siphash has pretty bad performance on large inputs. From what I | remember, even MD5 was faster. | Comevius wrote: | Google made faster Siphash variants, and also HighwayHash, | which is much faster. | | https://github.com/google/highwayhash | kortex wrote: | This is great! Many applications (like dedupe) don't need full | crypto guarantees. | | If you need something fast and crypto secure, I recommend | checking out Blake3/b3sum. I'm just learning about XXH3 in this | thread so I cannot comment on how it stacks up, but I love b3sum | for fast file hashing. | | https://github.com/BLAKE3-team/BLAKE3/tree/master/b3sum | scriptdevil wrote: | I recently used seahash in rust for a similar job. This seems to | fall in the same niche. | buildbot wrote: | This could be super useful for deduplicating large photo | datasets! | | It might be worth changing this link to the github, there's a lot | more info on the hash there. | NKosmatos wrote: | Came here to say the same thing :-) Anyone know of a free(ish) | software based on a very fast hashing algorithm for checking a | large number of images? My intention is to check for bit-rot | and other unforeseen data corruptions between backup disks; I | assume many others must have a similar use case. | bityard wrote: | > This could be super useful for deduplicating large photo | datasets! | | That is indeed the use case mentioned in the article. A hash | like this is useful for deduplicating large _file_ datasets, but | for deduplicating images in particular, you really want a | perceptual hash, because two image files can _look_ identical | but have slightly (or wildly) different byte streams. The | trade-off is that perceptual hashes are notorious for false | positives in large datasets. | kayamon wrote: | I once saw a guy dedupe a folder by batch-converting them to | low quality JPEG thumbnails then sorting by size.
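A minimal perceptual hash of the kind bityard describes is a difference hash: compare neighbouring pixels and keep one bit per comparison, so near-identical images land a small Hamming distance apart regardless of byte-level differences. A toy sketch assuming the image has already been decoded and downscaled to a 9x8 grayscale grid (real code would resize with something like Pillow):

```python
def dhash(pixels, w=9, h=8):
    """Difference hash over a w*h grayscale grid (row-major list of ints).
    Each bit records whether a pixel is brighter than its right neighbour,
    so the result tracks image structure, not exact bytes."""
    bits = 0
    for row in range(h):
        for col in range(w - 1):
            left = pixels[row * w + col]
            right = pixels[row * w + col + 1]
            bits = (bits << 1) | (left > right)
    return bits  # 64-bit integer for the default 9x8 grid

def hamming(a: int, b: int) -> int:
    """Count differing bits; a small distance suggests a near-duplicate."""
    return bin(a ^ b).count("1")
```

This also shows the false-positive trade-off bityard mentions: dissimilar images with similar brightness gradients can land close together, so candidates still need confirmation.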
| buildbot wrote: | Yep, just happy someone was looking into it, I currently use | xxHash for this. | | Perceptual hashing for photos isn't great in my opinion; for | example, I have many raw photos that look very similar, where the | only difference may be a slightly higher shutter speed, or a | correction of the camera angle. I'm guessing many perceptual | hashes would mark these as duplicates. Maybe that's what many | want; it's not what I want personally. | pepoluan wrote: | And what's wrong with xxHash? | | According to xxHash's performance comparison page, xxHash | is faster than MeowHash. | da_big_ghey wrote: | Seahash came out very well last time I looked; any | comparison? https://docs.rs/seahash/4.1.0/seahash/ | pepoluan wrote: | XXHash author last updated the following page on Aug 10 of this | year: | | https://github.com/Cyan4973/xxHash/wiki/Performance-comparis... | wiskinator wrote: | I love this website because I went to the link, didn't know what | a PWA was, went to the root website and _STILL_ don't know what | a PWA is. | | I'm going to try the old trick of saying the wrong thing so | someone corrects me. | | Clearly a PWA is a Pug With Attitude. Is this a Jonathon and | Noodle sponsored app? | PebblesRox wrote: | A PWA is a Progressive Web App. | | https://developer.mozilla.org/en-US/docs/Web/Progressive_web... | bowmessage wrote: | I think you may have commented on the wrong article. | | (Did you mean to comment on "Publish Your PWAs to the iOS App | Store" [0])? | | [0] https://news.ycombinator.com/item?id=29040793 | apavlo wrote: | SMHasher benchmark numbers are here (including Meow Hash): | | https://github.com/rurban/smhasher#readme | CodesInChaos wrote: | The quality column on that page seems dubious, since it lists | test failures for secure cryptographic hashes. | pkhuong wrote: | Cryptographic hashes mostly look at collisions for the whole | output (or precisely truncated outputs).
There can still be | patterns or clumping in short ranges of the output bits for | certain input families, as long as the full outputs differ. | | For example, the identity function is extremely resistant to | collisions. However, it doesn't mix the input bits, so it's | not a great hash function. | lmilcin wrote: | Now the only issue is, I will be an instant subject of ridicule | if my team sees me using something called "Meow Hash". | 0des wrote: | Why? | failrate wrote: | Too bad, there's also the very useful Squirrel3 Noise Hash. | recursive wrote: | Not everyone can do it, but I recommend ignoring ridicule | that's not well founded. | jnwatson wrote: | You better not use a VCS that's also a common insult then. | NelsonMinar wrote: | I wish there was a consensus non-cryptographic hash algorithm. | Something that hashes arbitrary data to 128 or 256 bit keys with | good randomness, few collisions on typical input data, and | universal implementation. | | Most programmers I know reach for SHA-1 (or if we're being | honest, MD5). But a cryptohash is really wasteful if you don't | need the cryptographic properties. | | Last I looked at this Murmurhash was the closest to widely used | with some folks saying Metrohash was better. But that was years | ago. Judging by this discussion xxHash is the new hotness? | bmn__ wrote: | > good randomness, few collisions on typical input data, and | universal implementation | | The relevant standards body NIST has issued SHA-3 in 2015. | | Spread the word that programmers should not reach for outdated | hash algos. <https://valerieaurora.org/hash.html> | tptacek wrote: | The chart on that page has always squicked me out a bit, | because SHA-2 is certainly still "considered strong" by | cryptography engineers. | pepoluan wrote: | xxHash is not really "new", it's been available since 2014. 
At | least, that's the oldest Python bindings available on PyPI: | | https://pypi.org/project/xxhash/0.0.1/ | | Sure, as it developed, xxHash variants appeared. The latest in | the family is XXH3 and it has already been in use at least since | 2019: | | http://fastcompression.blogspot.com/2019/03/presenting-xxh3.... | chronogram wrote: | You want to have XXH128 for that right? 128 bit, portable, | virtually impossible to collide, and only slightly slower than | XXH3 while still way faster than most options. | pspeter3 wrote: | I'm curious too about the differences between Murmurhash and | xxHash. I've been using Murmur in my side projects. | pkhuong wrote: | The production of seed-independent collisions for various | versions of murmurhash (e.g., | http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash...) | motivated siphash. In general, when there's no | positive proof of collision bounds, I would assume (un)lucky | inputs can collide much more often than usual, even when you | change the seed. | | The difference between murmurhash and xxhash in that regard | is that anyone can download code to generate murmurhash | collisions, while code to make xxhash cry is still | unknown/secret. | | XXH3 is also a lot faster for long strings, and I believe | comparable to murmurhash for short inputs. | tromp wrote: | > I wish there was a consensus non-cryptographic hash algorithm | | I think Siphash serves that role pretty well. It's one of the | most secure, if not the most secure, among non-cryptographic hashes, and | quite speedy, especially with SIMD implementations. One need | only consider alternatives if one wants to trade off a bit of | security for even higher speed. | NelsonMinar wrote: | Siphash is more like a crypto hash, right? According to | smhasher its performance is somewhere between a non-crypto | hash like xxHash or Meow and something like SHA-1.
| tromp wrote: | As its Wikipedia page says: | | Although designed for use as a hash function to ensure | security, SipHash is fundamentally different from | cryptographic hash functions like SHA in that it is only | suitable as a message authentication code: a keyed hash | function like HMAC. That is, SHA is designed so that it is | difficult for an attacker to find two messages X and Y such | that SHA(X) = SHA(Y), even though anyone may compute | SHA(X). SipHash instead guarantees that, having seen Xi and | SipHash(Xi, k), an attacker who does not know the key k | cannot find (any information about) k or SipHash(Y, k) for | any message Y [?] {Xi} which they have not seen before. | Traubenfuchs wrote: | > we have benchmarked all the ones we could find | | Those results would increase the value of this article from "very | low" to "medium". | chalst wrote: | The README for xxhash has benchmarks covering fast hashes | including Meow: | | https://github.com/Cyan4973/xxHash/wiki/Performance-comparis... | cb321 wrote: | I have measured (on a Skylake core) the hash from Daniel Lemire, | Owen Kaser, "Faster 64-bit universal hashing using carry-less | multiplications, Journal of Cryptographic Engineering (to | appear)" at t_cycles = (0.00741 +- 0.00098 cycles/byte) * bytes + | (37.43 +- 0.48) which is roughly 135 B/cycle (675 GB/s at 5 GHz) | or over 2x faster than the claim here. Admittedly, that 37 cycle | fixed cost is probably quite a bit higher, and I don't know how | well the bit mixing compares. | | Parenthetically, a time equation (like t_cycles = coef * bytes + | fixedCost) is a more convenient way to summarize performance of | this kind of thing than the constant short string/long string | mumble-mumble. Yes, loop unrolling/etc. can create step-ish | functions and a linear regression is pretty approximate (but my | stderrs are pretty small above), but even so..much less tripping | over words. 
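The time equation cb321 advocates above (t_cycles = coef * bytes + fixedCost) is just an ordinary least-squares fit over per-size minimum timings; a small sketch, with the function name being illustrative:

```python
def fit_time_equation(sizes, cycles):
    """Ordinary least squares for t = coef * bytes + fixed_cost,
    the summary form proposed above instead of separate
    'short string' vs 'long string' throughput figures."""
    n = len(sizes)
    mx = sum(sizes) / n
    my = sum(cycles) / n
    sxx = sum((x - mx) ** 2 for x in sizes)
    sxy = sum((x - mx) * (y - my) for x, y in zip(sizes, cycles))
    coef = sxy / sxx          # cycles per byte (reciprocal = bytes/cycle)
    fixed = my - coef * mx    # per-call setup/finalizer cost in cycles
    return coef, fixed
```

As cb321 notes, feeding this the minimum of many runs per input size keeps branch predictors and caches warm, so the fit models the best-case path rather than memory-system noise.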
| pkhuong wrote: | umash (https://github.com/backtrace-labs/umash) has a similar | PH block structure (and similar universality | guarantees), but was designed for decent bit mixing (enough to | satisfy smhasher, unlike CLHASH, which needs an additional | finalizer) with a lower fixed time cost: 22 cycles for a | one-byte hash on a 2.5 GHz Xeon 8175M. | | I'm not sure how one would use that linear regression. What | kind of hardware offers 675 GB/s of memory bandwidth? 140 | bytes/cycle is easily more than twice the L2 read bandwidth | offered by any COTS chip I'm aware of. There are also warm up | effects past the fixed cost of setup and finalizers that slow | down hashing for short inputs. For what range of input sizes | (and hot/cold cache state) would you say the regression is a | useful model? | cb321 wrote: | I was only doing short < 1024 byte strings in L1 (like URI | lengths-ish). I'm well aware no memory system on typical HW | can deliver that (EDIT: "at large space scale"). The article | itself mentions a similar calculation (but at their slower | scale) which is why I included it. | | And yes, warm up absolutely matters. My numbers are for the | best case path which is among the only easy things to even | _define_, IMO. Worst and average are sooooo much | harder/assumption-riddled. Then you argue a lot about said | assumptions. | | I do the minimum times of many runs in an inner loop and the | regression on those min-time results. So, the min-loops will | warm up everything like branch predictors, etc. I'd suspect | the regression model would hold up to usable L1 cache (e.g. | 30 KiB or so). Beyond that, it's mostly just measuring the | memory system, not the hash function..and I'm not sure what | the point of that is. Separation of concerns would suggest L1 | perf is most relevant to the "hash calculation part". (IMO, | many benchmarks confuse this.) | | And, yeah..modern memory systems suck.
Simple back of the | envelope would suggest some DIMM-oriented workload would be | waiting on the DIMMs like 95% of the time, usually much | higher on Xeons with their low single core BW (maybe 98%!). | The Meow article sort of alludes to that as well with less | detail. Single core Xeon BW is always a huge disappointment. | Forces you to go (very) parallel just to saturate DIMMs | (EDIT: which in turn cannot saturate even a single L1!). | | BTW, it sounds like you know all of this (and maybe more) and | I don't mean to suggest otherwise, but I thought some | numbers/details might assist other passersby. | | { EDIT: Also, the stderr on that slope is ~13%. So, it's | possible 135 is really 128 (1.05x lower), the 2 cache | line/cycle load limit or something like that. This doesn't | mean the 640 GB/s rate number is wrong - it just underscores | how mismatched memory systems are at their various scales. } | | { EDIT: and yes, 128/16 = 8x faster than meow, not 2x. Oops. | } | pbsd wrote: | 1/135 cycles per byte on Skylake is just plain impossible, even | if the hash consisted of simply one xor per 32 bytes of input. | | The lower bound for CLHASH would be the cost of one carryless | multiplication per 16 bytes of input, or in other words 1/16 ~ | 0.0625 cycles per byte, since you can only issue one such | multiplication per cycle. | cb321 wrote: | Feel free to measure it yourself instead of just speculating. | (And as mentioned elsewhere, it is probably 1/128.) { EDIT: | and I never measured meow - it could be faster than 1/16 | cycle/byte but poorly assessed; just be sure to use a min | loop and small data. }
| arduinomancer wrote: | A collision was reported a couple months ago | | https://github.com/cmuratori/meow_hash/issues/80 | Accacin wrote: | The back and forth in that thread is incredibly interesting | too. | cbsmith wrote: | As noted in the post title, the original article is from 2018. I | had either forgotten about Meow hash or just never knew about it, | so I appreciate the post, but I was curious about the timing of | the post. | bmn__ wrote: | > I was curious about the timing of the post. | | cmuratori mentioned meowhash in passing in a recent (i.e. few | days ago) Refterm or optimisation video. It is possible that | user DeathArrow watched this, too, and decided to submit the | project homepage to HN. | cateye wrote: | I really enjoyed the clear and concise way it is explained in | the documentation. | | While I was totally not interested in a hash function whatsoever, | I read the whole thing :) | rurban wrote: | For comparison: https://github.com/rurban/smhasher/ | | It's indeed pretty fast (among the top 5), but not particularly | good. It fails some basic tests. Rather use xxh3, t1ha or | farmhash | daniel-thompson wrote: | Can you elaborate on what it fails?
| https://mollyrocket.com/meowhash claims: | | > It cleanly passes every test in smhasher | pepoluan wrote: | MeowHash page last updated 2018. | | smhasher page last updated several days ago. | [deleted] | oefrha wrote: | Discussed at the time: | https://news.ycombinator.com/item?id=18262627 | dang wrote: | Thanks! Macroexpanded: | | _Meow Hash: A high-speed non-cryptographic hash function_ - | https://news.ycombinator.com/item?id=18262627 - Oct 2018 (68 | comments) | | Also: | | _Cryptanalysis of Meow Hash_ - | https://news.ycombinator.com/item?id=27978878 - July 2021 (9 | comments) | wolf550e wrote: | How does it compare to XXH3? | | https://github.com/Cyan4973/xxHash | | https://github.com/rurban/smhasher#readme says XXH3 is 16GB/s | while meow is 17GB/s. | caeril wrote: | > How does it compare to XXH3? | | XX is a slightly better hashing function, meow is slightly | faster. | | Take your pick. Life is about tradeoffs. | pepoluan wrote: | Well, the page for MeowHash was written in 2018. | | xxHash's Performance Comparison page was last updated in 2021, | and on that page XXH3 is faster than MeowHash: | | https://github.com/Cyan4973/xxHash/wiki/Performance-comparis... | DeathArrow wrote: | i7-9700K can reach up to 4.9 GHz and Meow Hash can hash at | a rate of 16 bytes per cycle. That means 78 GB/s. | | xxHash rate is 31.5 GB/s on that CPU. | wolf550e wrote: | Scroll down, XXH3 is 59 GB/s using AVX2 while meow is 58 | GB/s in their benchmark. | snapetom wrote: | I've been using Hashids (https://hashids.org/). It's been fine | performance-wise and well supported across languages. | andruc wrote: | Care to share your definition of fine, and give at least a | ballpark on the size of the assets you're using it on? | snapetom wrote: | It's a pretty small app for some researchers. Front end is | Angular, backend is an API in Bottle. At most, there's 10 | people generating data, and about 40 people reading data.
| Most times it's 10 users uploading and 5-10 users reading. | The readers are external partners. Hashid is used to obfuscate ids | we have in another system so these external people cannot tie | the data back, due to privacy issues. It generates about 1000 | rows of data across 4 PostgreSQL tables in a given day. | | I like how the algorithm has been well supported across | languages. We have no problems with it on the Python side. An | analyst runs monthly reports using R against the data (so | about 30k records pulled). Numbers are summarized and stats | are generated. The hashing isn't a bottleneck at all. ___________________________________________________________________ (page generated 2021-10-29 23:00 UTC)