[HN Gopher] Reasons to Prefer Blake3 over Sha256 ___________________________________________________________________ Reasons to Prefer Blake3 over Sha256 Author : ementally Score : 168 points Date : 2023-11-13 12:34 UTC (10 hours ago) (HTM) web link (peergos.org) (TXT) w3m dump (peergos.org) | ndsipa_pomu wrote: | At this rate, it's going to take over 700 years before we get | Blake's 7 | benj111 wrote: | I had to scroll disappointingly far down to get to the Blake's | 7 reference. | | Thank you for not disappointing though. | | The down side of that algorithm though is that everything dies | at the end. | dragontamer wrote: | > BLAKE3 is much more efficient (in time and energy) than SHA256, | like 14 times as efficient in typical use cases on typical | platforms. | | [snip] | | > AVX in Intel/AMD, Neon and Scalable Vector Extensions in Arm, | and RISC-V Vector computing in RISC-V. BLAKE3 can take advantage | of all of it. | | Uh huh... AVX/x86 and NEON/ARM you say? | | https://www.felixcloutier.com/x86/sha256rnds2 | | https://developer.arm.com/documentation/ddi0596/2021-12/SIMD... | | If we're talking about vectorized instruction sets like AVX | (Intel/AMD) or NEON (aka: ARM), the advantage is clearly with | SHA256. I don't think Blake3 has any hardware implementation at | all yet. | | Your typical cell phone running ARMv8 / NEON will be more | efficient with the SHA256 instructions than whatever software | routine you need to run Blake3. Dedicated hardware inside the | cores is very difficult to beat on execution speed or efficiency. | | I admit that I haven't run any benchmarks on my own. But I'd be | very surprised if any software routine were comparable to the | dedicated SHA256 instructions found on modern cores. | eatonphil wrote: | From another thread: | | > On my machine with sha extensions, blake3 is about 15% faster | (single threaded in both cases) than sha256. 
 | | https://news.ycombinator.com/item?id=22237387
 | vluft wrote:
 | followup to this now with further blake3 improvements, on a
 | faster machine now but with sha extensions vs single-threaded
 | blake3; blake3 is about 2.5x faster than sha256 now (b3sum
 | 1.5.0 vs openssl 3.0.11). b3sum is about 9x faster than
 | sha256sum from coreutils (GNU, 9.3) which does not use the
 | sha extensions.
 |
 |   Benchmark 1: openssl sha256 /tmp/rand_1G
 |     Time (mean +- s):  576.8 ms +- 3.5 ms  [User: 415.0 ms, System: 161.8 ms]
 |     Range (min ... max): 569.7 ms ... 580.3 ms    10 runs
 |   Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
 |     Time (mean +- s):  228.7 ms +- 3.7 ms  [User: 168.7 ms, System: 59.5 ms]
 |     Range (min ... max): 223.5 ms ... 234.9 ms    13 runs
 |   Benchmark 3: sha256sum /tmp/rand_1G
 |     Time (mean +- s):  2.062 s +- 0.025 s  [User: 1.923 s, System: 0.138 s]
 |     Range (min ... max): 2.046 s ... 2.130 s    10 runs
 |   Summary
 |     b3sum --num-threads 1 /tmp/rand_1G ran
 |       2.52 +- 0.04 times faster than openssl sha256 /tmp/rand_1G
 |       9.02 +- 0.18 times faster than sha256sum /tmp/rand_1G
 | 0xdeafbeef wrote:
 | Interestingly, sha256sum and openssl don't use sha_ni.
 |
 |   iced-cpuid $(which b3sum)
 |     AVX AVX2 AVX512F AVX512VL BMI1 CET_IBT CMOV SSE SSE2
 |     SSE4_1 SSSE3 SYSCALL XSAVE
 |   iced-cpuid $(which openssl)
 |     CET_IBT CMOV SSE SSE2
 |   iced-cpuid $(which sha256sum)
 |     CET_IBT CMOV SSE SSE2
 |
 | Also, sha256sum in my case is a bit faster.
 |
 |   Benchmark 1: openssl sha256 /tmp/rand_1G
 |     Time (mean +- s):  540.0 ms +- 1.1 ms  [User: 406.2 ms, System: 132.0 ms]
 |     Range (min ... max): 538.5 ms ... 542.3 ms    10 runs
 |   Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
 |     Time (mean +- s):  279.6 ms +- 0.8 ms  [User: 213.9 ms, System: 64.4 ms]
 |     Range (min ... max): 278.6 ms ... 281.1 ms    10 runs
 |   Benchmark 3: sha256sum /tmp/rand_1G
 |     Time (mean +- s):  509.0 ms +- 6.3 ms  [User: 386.4 ms, System: 120.5 ms]
 |     Range (min ... max): 504.6 ms ...
524.2 ms    10 runs
 | vluft wrote:
 | not sure that tool is correct; on my openssl it shows the
 | same output as you have there, but not AES-NI, which is
 | definitely enabled and functional.
 |
 | ETA: ahh, you want to do that on libcrypto:
 |
 |   iced-cpuid <...>/libcrypto.so.3:
 |     ADX AES AVX AVX2 AVX512BW AVX512DQ AVX512F AVX512VL
 |     AVX512_IFMA BMI1 BMI2 CET_IBT CLFSH CMOV D3NOW MMX MOVBE
 |     MSR PCLMULQDQ PREFETCHW RDRAND RDSEED RTM SHA SMM SSE
 |     SSE2 SSE3 SSE4_1 SSSE3 SYSCALL TSC VMX XOP XSAVE
 | vluft wrote:
 | further research suggests that GNU coreutils cksum will
 | use libcrypto in some configurations (though not mine); I
 | expect that both your commands above are actually using
 | SHA-NI
 | TacticalCoder wrote:
 | > ...on a faster machine now but with sha extensions vs
 | single-threaded blake3; blake3 is about 2.5x faster than
 | sha256 now
 |
 | But one of the sweet benefits of blake3 is that it is
 | parallelized _by default_. I picked blake3 not because it's
 | 2.5x faster than sha256 when running b3sum with
 | "--num-threads 1" but because, with the default, it's 16x
 | faster than sha256 (on a machine with "only" 8 cores).
 |
 | And Blake3, contrary to some other "parallelizable" hashes,
 | gives the same hash no matter whether you run it with one
 | thread or any number of threads (IIRC there are hashes
 | which have different executables depending on whether you
 | want to run the single-threaded or multi-threaded version
 | of the hash, and they give different checksums).
 |
 | I tried on my machine (which is a bit slower than yours)
 | and I get:
 |
 |   990 ms   openssl sha256
 |   331 ms   b3sum --num-threads 1
 |    70 ms   b3sum
 |
 | That's where the big performance gain is when using Blake3
 | IMO (even though 2.5x faster than a fast sha256 even when
 | single-threaded is already nice).
 | vluft wrote:
 | yup, for comparison, same file as above using all the
 | threads (32) on my system, I get about 45 ms with fully
 | parallel b3 permitted.
It does run into diminishing
 | returns fairly quickly though; unsurprisingly no
 | improvements in perf using hyperthreading, but also
 | improvements drop off pretty fast with more cores.
 |
 |   b3sum --num-threads 16 /tmp/rand_1G ran
 |     1.01 +- 0.02 times faster than b3sum --num-threads 15 /tmp/rand_1G
 |     1.01 +- 0.02 times faster than b3sum --num-threads 14 /tmp/rand_1G
 |     1.03 +- 0.02 times faster than b3sum --num-threads 13 /tmp/rand_1G
 |     1.04 +- 0.02 times faster than b3sum --num-threads 12 /tmp/rand_1G
 |     1.07 +- 0.02 times faster than b3sum --num-threads 11 /tmp/rand_1G
 |     1.10 +- 0.02 times faster than b3sum --num-threads 10 /tmp/rand_1G
 |     1.13 +- 0.02 times faster than b3sum --num-threads 9 /tmp/rand_1G
 |     1.20 +- 0.03 times faster than b3sum --num-threads 8 /tmp/rand_1G
 |     1.27 +- 0.03 times faster than b3sum --num-threads 7 /tmp/rand_1G
 |     1.37 +- 0.02 times faster than b3sum --num-threads 6 /tmp/rand_1G
 |     1.53 +- 0.05 times faster than b3sum --num-threads 5 /tmp/rand_1G
 |     1.72 +- 0.03 times faster than b3sum --num-threads 4 /tmp/rand_1G
 |     2.10 +- 0.04 times faster than b3sum --num-threads 3 /tmp/rand_1G
 |     2.84 +- 0.06 times faster than b3sum --num-threads 2 /tmp/rand_1G
 |     5.03 +- 0.12 times faster than b3sum --num-threads 1 /tmp/rand_1G
 |
 | (over 16 elided from this run as they're all ~= the 16-thread
 | time)
 | oconnor663 wrote:
 | Figure 4 from
 | https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...
 | is a related benchmark. On that particular machine we saw
 | good scaling up to 16 threads, but yeah, somewhere in that
 | neighborhood you start to run into memory bandwidth issues.
 | Which is the problem you want, I guess :)
 | api wrote:
 | I'd be curious to see power consumption. SHA (and AES) are
 | usually available as what amounts to an ASIC built into the
 | processor, while this requires a lot more work to be done
 | with vector instructions.
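The thread-scaling numbers above follow from BLAKE3's tree structure: chunks are hashed independently and only their small digests are combined, so the worker count cannot change the result. A minimal Python sketch of that idea (this is NOT BLAKE3's real chaining; SHA-256 stands in for the compression function, and the 64 KiB chunk size is an arbitrary choice for illustration):

```python
# Toy tree-hashing mode: hash fixed-size chunks independently (so the
# work parallelizes), then hash the concatenation of the leaf digests
# to produce a root digest.
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 64 * 1024  # 64 KiB, arbitrary for this sketch

def tree_hash(data: bytes, workers: int = 1) -> str:
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        leaves = list(pool.map(lambda c: hashlib.sha256(c).digest(), chunks))
    # The root depends only on the leaf digests, so any thread count
    # yields the same hash -- the property TacticalCoder points out.
    return hashlib.sha256(b"".join(leaves)).hexdigest()
```

Because the combining step sees only leaf digests in input order, `tree_hash(data, workers=1)` and `tree_hash(data, workers=8)` agree, which is why b3sum can default to multithreading without changing its output.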
| vluft wrote: | this is less precise than the perf numbers as I don't | really have a way to measure power directly, but with | rerunning the benchmarks above locked to a cpu core, it | boosted ~the same level for all 3 commands (about 5.5ghz), | so should be ~the same power usage. | wmf wrote: | If Blake3 is 2.5x faster then it's going to be roughly 2.5x | less energy. | silotis wrote: | That's not how it works on modern CPUs. Power draw at | "100%" utilization can vary widely depending on what part | of the core is being utilized. The SIMD units are | typically the most power hungry part of the CPU by a | large margin so just because a job finishes in less time | doesn't mean total energy is necessarily lower. | wmf wrote: | I'm assuming that both SHA256 and Blake3 are using | integer SIMD. | api wrote: | In some CPUs that may be true, but in many there are | dedicated SHA instructions that amount to a SHA ASIC in | the CPU. | | AES units are even more common. Most non-tiny CPUs have | them these days. | insanitybit wrote: | > I don't think Blake3 has any hardware implementation at all | yet. | | > https://github.com/BLAKE3-team/BLAKE3 | | > The blake3 Rust crate, which includes optimized | implementations for SSE2, SSE4.1, AVX2, AVX-512, and NEON, with | automatic runtime CPU feature detection on x86. The rayon | feature provides multithreading. | | There aren't blake3 instructions, like some hardware has for | SHA1, but it does use hardware acceleration. | | edit: Re-reading, I think you're saying "If we're going to talk | about hardware acceleration, SHA1 still has the advantage | because of specific instructions" - that is true. | jonhohle wrote: | I just tested the C implementation on a utility I wrote[0] and | at least on macOS where SHA256 is hardware accelerated beyond | just NEON, BLAKE3 ends up being slower than SHA256 from | CommonCrypto (the Apple provided crypto library). BLAKE3 ends | up being 5-10% slower for the same input set. 
 | | As far as I'm aware, Apple does not expose any of the
 | hardware crypto functions, so unless what exists supports
 | BLAKE3 and they add support in CommonCrypto, there's no
 | advantage to using it from a performance perspective.
 |
 | The Rust implementation is multithreaded and ends up beating
 | SHA256 handily, but again, for my use case the C impl is
 | only single-threaded, and the utility assumes a
 | single-threaded hasher with one running on each core.
 | Hashing is the bottleneck for `dedup`, so finding a faster
 | hasher would have a lot of benefits.
 |
 | 0 - https://github.com/ttkb-oss/dedup
 | tromp wrote:
 | For short inputs, Blake3 behaves very similarly to Blake2,
 | on which it is based. From BLAKE's Wikipedia page [1]:
 |
 | BLAKE3 is a single algorithm with many desirable features
 | (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
 | and BLAKE2, which are algorithm families with multiple
 | variants. BLAKE3 has a binary tree structure, so it supports
 | a practically unlimited degree of parallelism (both SIMD and
 | multithreading) given long enough input.
 |
 | [1] https://en.wikipedia.org/wiki/BLAKE_(hash_function)
 | cesarb wrote:
 | While I really like Blake3, for all the reasons mentioned in
 | this article, I have to say it does have one tiny
 | disadvantage over older hashes like SHA-256: its internal
 | state is slightly bigger (due to the tree structure which
 | allows it to be highly parallelizable). This can matter when
 | running on tiny microcontrollers with only a few kilobytes
 | of memory.
 | londons_explore wrote:
 | The internal state is no bigger when hashing small things
 | though, right?
 |
 | I assume most microcontrollers are unlikely to be hashing
 | things much bigger than RAM.
 | oconnor663 wrote:
 | It's hard to give a short answer to that question :)
 |
 | - Yes, if you know your input is short, you can use a
 | smaller state. The limit is roughly a BLAKE2s state plus (32
 | bytes times the log_2 of the number of KiB you need to
 | hash).
 | Section 5.4 of
 | https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...
 | goes into this.
 |
 | - But it's hard to take advantage of this space optimization,
 | because no libraries implement it in practice.
 |
 | - But the reason libraries don't implement it is that almost
 | no one needs it. The max state size is just under 2 KiB,
 | which is small enough even for
 | https://github.com/oconnor663/blake3-6502.
 |
 | - But it would be super easy to implement if we just put the
 | "CV stack" on the heap instead of allocating the whole thing
 | as an array up front.
 |
 | - But the platforms that care about this don't have a heap.
 |
 | @cesarb mentioned _really_ tiny microcontrollers, even
 | tinier than the 6502 maybe. The other place I'd expect to
 | see this optimization is in a full hardware implementation,
 | but those are rare. Most hardware accelerators for hash
 | functions provide the block operation, and they leave it to
 | software to deal with this sort of bookkeeping.
 | gavinhoward wrote:
 | Good, terse article that basically reinforces everything I've
 | seen in my research about cryptographic hashing.
 |
 | Context: I'm building a VCS meant for any size of file,
 | including massive ones. It needs a cryptographic hash for the
 | Merkle tree.
 |
 | I've chosen BLAKE3, and I'm going to use the original
 | implementation because of its speed.
 |
 | However, I'm going to make it easy to change hash algorithms
 | _per commit_, just so I don't run into the situation Git had
 | trying to get rid of SHA1.
 | AdamN wrote:
 | Smart idea doing the hash choice per-commit. Just make sure
 | that somebody putting in an obscure hash doesn't mess up
 | everybody's usage of the repo if they don't have a library to
 | evaluate that hash installed.
 | gavinhoward wrote:
 | I agree.
 |
 | There will be a set of presets of hash function and settings;
 | if BLAKE3 fails, then I'll actually have to add SHA3 or
 | something, with a set of settings, as presets.
 | | The per-commit storage will then be an enum identifying the
 | hash and its settings.
 |
 | This will let me do other things, like letting companies use
 | a 512-bit hash if they expect the repo to be large.
 | agodfrey wrote:
 | Maybe you're already aware, but you glossed over something:
 | Since you're using the hash to locate/identify the content
 | (you mentioned Merkle and git), if you support multiple
 | hash functions you need some assurance that the chance of
 | collisions is low _across all supported hash functions_.
 | For example, two otherwise-identical functions that differ
 | only in the value of their padding bytes (when the input
 | size doesn't match the block size) can't coexist.
 | gavinhoward wrote:
 | You are absolutely right. And yes, I am aware.
 |
 | Location will actually be done by prefixing the hash with
 | the value of the enum for the hash function/settings pair
 | that made the hash.
 | tczMUFlmoNk wrote:
 | Since you seem to have done a fair bit of research in
 | this area, do you have any opinions or thoughts about the
 | Multihash format?
 |
 | https://multiformats.io/multihash/
 |
 | It fills in some of the blanks in your "prefixing the
 | hash with the value of the enum for the hash" step.
 | gavinhoward wrote:
 | The multihash format is an excellent format that I am
 | tempted to use.
 |
 | However, there are two general problems:
 |
 | * The spec is not done, and it doesn't look like much has
 | been done.
 |
 | * While I agree with the FAQ that agreeing on a set of
 | hash functions is possible, the format only allows 256
 | possible hash functions, so it can run out of space,
 | especially since there can be multiple formats of the
 | same function (BLAKE2B and BLAKE2S, for example), and
 | especially since they want to allow non-cryptographic
 | hash functions.
 |
 | Then specifically for my use case:
 |
 | * There is the problem brought up by AdamN [1]: if
 | multihash is supported, an obscure hash may be supported,
 | and that may cause problems.
 | | * As an extension of that, once a hash function and set
 | of settings is supported, it's supported forever, so I
 | want to be more picky, and an enum allows me to restrict
 | what is usable.
 |
 | * By using a 16-bit enum, I have more space to grow.
 |
 | * And finally, by using an enum, if my software
 | encounters a repo with an enum value greater than all of
 | the values it knows, it knows that that repo needs a
 | later version of the software, since I will only _add_
 | enum values.
 |
 | [1]: https://news.ycombinator.com/item?id=38250444
 | FabHK wrote:
 | > letting companies use a 512-bit hash if they expect the
 | repo to be large.
 |
 | A repo would have to have more than 1e32 documents for a
 | one-in-a-trillion chance of a collision with a 256-bit
 | hash. (Total annual world data production is estimated at
 | less than 1e24 bytes.)
 |
 | A 512-bit hash thus seems overkill for almost all purposes.
 |
 | https://en.wikipedia.org/wiki/Birthday_problem
 |
 | https://www.weforum.org/agenda/2021/05/world-data-produced-s...
 | gavinhoward wrote:
 | For normal VCSs, you are absolutely right. And you're
 | actually right for mine, but I decided to redo the math
 | to make sure.
 |
 | My VCS will track things at a finer level than documents.
 | In a C file, it will track individual functions and
 | structs. In a Java file, it will track individual fields
 | and classes. In a Microsoft Word document, it might track
 | individual paragraphs. And in a Blender file, it will
 | track each object, material, texture, etc. individually.
 |
 | Yes, it will handle binary files.
 |
 | Anyway, it will also be designed for non-technical users.
 | To that end, it will hook into the source software and do
 | a "commit" every time the user saves.
 |
 | It will also track individual directories to make renames
 | work.
 |
 | I am a ctrl+s freak, so I save once a minute or more.
 | However, other people are not, so let's assume 10 minutes
 | (for autosave, perhaps).
 | | Now let's assume a monorepo for a company of 100,000
 | people. And let's assume that when they save every 10
 | minutes, they save one object in one file (also tracked)
 | two directories down. That means they create 5 hashes
 | every 10 minutes (the fifth is the top level).
 |
 | Let's assume an effective 6-hour work day.
 |
 | That's 5 objects times 6 times per hour times 6 hours.
 | That's 180 objects a day per person.
 |
 | That's 18,000,000 total objects per day. Times 5 for days
 | in a week, times 50 for work weeks in a year.
 |
 | That's 4.5 billion.
 |
 | Let's multiply that by 40 for 40 years that the repo
 | exists, which includes some of the oldest software.
 |
 | That's 1.8e11 objects. According to [1], a 128-bit hash
 | would not be enough for the error correction on a disk at
 | that point.
 |
 | However, a 256-bit hash would give us 10^31 objects
 | before reaching that point, which gives us 10^20 times 40
 | years of space.
 |
 | Yep, you're absolutely right that 512 bits is overkill. I
 | stand corrected.
 |
 | [1]: https://en.m.wikipedia.org/wiki/Birthday_attack
 | deadbeeves wrote:
 | You're tracking things at the content level? How will you
 | deal with files that are purposely broken, or which cause
 | the parser to take impractical (but finite) times to
 | complete? Also, tracking the history of a class makes
 | sense to some extent, but you say you want to commit
 | every time there's a save. How will you maintain a
 | history when most commits are likely to contain
 | unparseable code and so break the continuity of objects?
 | nextaccountic wrote:
 | In those cases you can just do error recovery in the
 | parser (truncating an erroring function, for example) and
 | then store out-of-band information necessary to
 | reconstruct the original file.
 |
 | This is also necessary to deal with whitespace, for
 | example (if you just reformat the code, you didn't change
 | the AST, but you changed the file).
 | gavinhoward wrote:
 | Good questions.
 | | > How will you deal with files that are purposely broken,
 | or which cause the parser to take impractical (but
 | finite) times to complete?
 |
 | I've never seen a language parser do that, but if I run
 | into a language that does that, I'll probably have my VCS
 | track it at the file level, based on tokens or lines.
 |
 | Dumb languages don't get nice things. :)
 |
 | > How will you maintain a history when most commits are
 | likely to contain unparseable code and so break the
 | continuity of objects?
 |
 | This is less of a problem with binary files (assuming the
 | source software does not have bugs in output), but with
 | source files, you're right that that problem does exist.
 |
 | As of right now, I would take a token-based approach.
 | This approach removes the need for whitespace-only
 | commits, and if I track the tokens right, I should be
 | able to identify which right brace _used_ to end the
 | function until the broken code was saved. Then I would
 | just save the function as broken using that same right
 | brace.
 |
 | For example, say you have this:
 |
 |   int main() { return 0; }
 |
 | My VCS would know that the right brace corresponds to the
 | end of the function.
 |
 | Then you write this:
 |
 |   int main() {
 |       if (global_bool) {
 |           return 0;
 |   }
 |
 | Yes, a dumb system might think that the right brace is
 | for the `if`.
 |
 | However, if you break it down by tokens, the VCS will see
 | that `if (global_bool) {` was added before the return, so
 | it should be able to tell that the right brace still ends
 | the function.
 |
 | I hope that makes sense.
 |
 | Another plausible way to do it (at least in C) would be
 | to look for things that look like declarations. The
 | series of tokens `<type> <name> <left_paren>` is probably
 | a function declaration. Java would be easier; its
 | declarations are more wordy.
 |
 | I still have to prove this is possible, but I think it
 | is.
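The collision estimates traded in the thread above can be sanity-checked with the standard birthday approximation, p ≈ n²/2^(b+1) for n items and a b-bit hash. A quick check in Python (the object counts below are the thread's own estimates, not new measurements):

```python
# Birthday-bound approximation: for n random items hashed to b bits,
# the probability of at least one collision is roughly n^2 / 2^(b+1).
def collision_probability(n: float, bits: int) -> float:
    return n * n / 2.0 ** (bits + 1)

# ~1.8e11 objects over 40 years (the monorepo estimate above) vs 256 bits:
p_repo = collision_probability(1.8e11, 256)   # astronomically small

# ~1e32 documents (FabHK's figure) vs 256 bits:
p_huge = collision_probability(1e32, 256)     # still below one in a trillion
```

Both probabilities come out negligible, consistent with the thread's conclusion that 512 bits is overkill for content addressing.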
 | EdSchouten wrote:
 | What I dislike about BLAKE3 is that they added explicit
 | logic to ensure that identical chunks stored at different
 | offsets result in different Merkle tree nodes (a.k.a. the
 | 'chunk counter').
 |
 | Though this feature is well intended, it makes this hash
 | function hard to use for a storage system where you try to
 | do aggressive data deduplication.
 |
 | Furthermore, on platforms that provide native instructions
 | for SHA hashing, BLAKE3 isn't necessarily faster, and it is
 | certainly more power-hungry.
 | lazide wrote:
 | Huh?
 |
 | The storage system doing this wouldn't use that part of the
 | hash; it would do it itself, so no issues? (Hash chunks,
 | instead of feeding everything in linearly.)
 |
 | Otherwise the hash isn't going to be even remotely safe for
 | most inputs?
 | persnickety wrote:
 | Could you point to how this is implemented and how it can be
 | used? From the sound of it, you're trying to do something
 | like rsync's running-window comparison?
 | EdSchouten wrote:
 | Imagine the case where you're trying to create a storage
 | system for a large number of virtual machine images (e.g.,
 | you're trying to build your own equivalent of AWS Machine
 | Images). There is obviously a lot of duplication between
 | parts of images. And not necessarily at the same offset, but
 | also at different offsets that are n*2^k bytes apart, where
 | 2^k represents the block/sector size.
 |
 | You could consider building this storage system on top of
 | BLAKE3's tree model. Namely, you store blocks as small
 | Merkle trees. And an image is basically a collection of
 | blocks that has a different 'hat' on top of it.
 | Unfortunately, BLAKE3 makes this hard, because the same
 | block will end up having a different Merkle tree node
 | depending on the offset at which it's stored.
 | luoc wrote:
 | You mean something like a CDC algorithm? I know that some
 | backup tools like Restic use this.
 | | https://en.m.wikipedia.org/wiki/Rolling_hash
 | EdSchouten wrote:
 | You can use a CDC algorithm, but if you know that
 | duplication mostly occurs at power-of-two boundaries,
 | there is no need to use that. Deduplicating on a binary
 | tree basis is sufficient.
 | londons_explore wrote:
 | Sounds to me like you are trying to use the innards of a
 | hash algorithm for something for which it was not
 | designed...
 |
 | Either modify the algorithm to your needs, and rename it.
 |
 | Or just use something that's already suitable off the
 | shelf. Plenty of Merkle trees out there.
 | luoc wrote:
 | I think CDC is what you're looking for. Some backup tools
 | like restic use it. See
 | https://en.m.wikipedia.org/wiki/Rolling_hash
 | luoc wrote:
 | Duplicated myself, sry
 | marktangotango wrote:
 | > You could consider building this storage system on top of
 | BLAKE3's tree model.
 |
 | Consider a cryptocurrency PoW that did that without the
 | chunk counter. It'd be trivially exploitable by
 | precalculating all of the tree but the chunk that changed
 | per nonce.
 | prirun wrote:
 | Author of HashBackup here. I don't see how any kind of hash
 | tree would be effective at de-duplicating VM machine
 | images, other than the degenerate case of an exact copy,
 | which is easy to detect with a single file hash.
 |
 | Most OSes use 4K block sizes. To get the best dedup you
 | have to hash every 4K block and look up each one
 | individually in a dedup table. Two VM images could both
 | contain an identical 4GB file, but every 4K block of that
 | file could be stored at different offsets in the VM images.
 | A tree hash wouldn't let you dedup anything but identical
 | sections stored at identical offsets, whereas a dedup table
 | and 4K blocks allows you to dedup the entire file.
 | oconnor663 wrote:
 | We go over some of our reasoning around that in section 7.5
 | of
 | https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...
. An early BLAKE3 prototype actually didn't include the chunk
 | counter
 | (https://github.com/oconnor663/bao/blob/master/docs/spec_0.9....),
 | so I'm definitely sympathetic to the use cases that wish it
 | wasn't there. However, after publication we found out that
 | something like a chunk counter is necessary for the security
 | of the Bao streaming verification tool:
 | https://github.com/oconnor663/bao/issues/41. It could be
 | that there's a design that's the best of both worlds, but
 | I'm not sure.
 | stylepoints wrote:
 | Until it starts coming installed by default on Linux and
 | other major OSes, it won't be mainstream.
 | theamk wrote:
 | Python 3.11 will have it: https://bugs.python.org/issue39298
 | latexr wrote:
 | That says "Resolution: rejected" and Python is currently at
 | 3.12.0. Did the feature land?
 | theamk wrote:
 | Oops, I misread it... seems it was rejected because it was
 | not standard enough...
 |
 | https://github.com/python/cpython/issues/83479#issuecomment-...
 | syonfox wrote:
 | murmur3
 | Snawoot wrote:
 | murmur3 is not a cryptographic hash, so it's not even in the
 | same field.
 | latexr wrote:
 | It bears mentioning that `shasum` is better supported, in
 | that it ships with operating systems (macOS; I guess Linux
 | depends on the distro; don't know about Windows) and is
 | available directly from programming languages (Ruby, Swift,
 | Python, ...) without the need for libraries.
 |
 | Even if BLAKE3 is massively faster, it's not like I ever
 | noticed SHA256's supposed slowness. But I do notice its
 | ubiquity.
 |
 | Based on the article, I would consider switching to BLAKE3
 | immediately where I use checksums. But until it gets wider
 | support (might be easier with a public domain license
 | instead of the current ones) I can't really do it, because
 | I need to do things with minimal dependencies.
 |
 | Best of luck to the BLAKE3 team on making their tool more
 | widely available.
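latexr's ubiquity point is easy to see from Python's standard library: `hashlib` ships SHA-2 and BLAKE2 out of the box, while BLAKE3 needs a third-party package (there is a `blake3` package on PyPI, but it is an extra dependency). A small illustration:

```python
# SHA-256 and BLAKE2 need nothing beyond the standard library;
# BLAKE3 does not ship with CPython and requires an external package.
import hashlib

msg = b"abc"
sha2 = hashlib.sha256(msg).hexdigest()
# BLAKE2b with a 32-byte digest: same output size as SHA-256.
b2 = hashlib.blake2b(msg, digest_size=32).hexdigest()

print(sha2)
# ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
print("blake3" in hashlib.algorithms_available)  # typically False:
# OpenSSL backs hashlib with BLAKE2 but not BLAKE3
```

The SHA-256 output above is the standard FIPS 180 test vector for "abc", which is itself a small example of why ubiquity matters: test vectors and tooling for SHA-256 are everywhere.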
 | upget_tiding wrote:
 | > might be easier with a public domain license instead of
 | the current ones
 |
 | The reference implementation is public domain (CC0) or, at
 | your choice, Apache 2.0:
 |
 | https://github.com/BLAKE3-team/BLAKE3/blob/master/LICENSE
 | zahllos wrote:
 | I agree that if you can, BLAKE3 (or even BLAKE2) are nicer
 | choices than SHA2. However, I would like to add the
 | following comments:
 |
 | * SHA-2 fixes the problems with SHA-1. SHA-1 was a step up
 | over SHA-0 that did not completely resolve flaws in SHA-0's
 | design (SHA-0 was broken very quickly).
 |
 | * JP Aumasson (one of the B3 authors) has said publicly a
 | few times that SHA-2 will never be broken:
 | https://news.ycombinator.com/item?id=13733069 is an
 | indirect source; can't seem to locate a direct one from
 | Xitter (thanks Elon).
 |
 | Thus it does not necessarily follow that SHA-2 is a bad
 | choice because SHA-1 is broken.
 | gavinhoward wrote:
 | All that may be true.
 |
 | However, I don't think we can say for sure whether SHA2
 | will be broken. Cryptography is hard like that.
 |
 | In addition, SHA2 is still vulnerable to length extension
 | attacks, so in a sense, SHA2 _is_ broken, at least when
 | length extension attacks are part of the threat model.
 | Godel_unicode wrote:
 | I don't understand why people use sha256 when sha512 is
 | often significantly faster:
 |
 | https://crypto.stackexchange.com/questions/26336/sha-512-fas...
 | oconnor663 wrote:
 | A couple reasons just on the performance side:
 |
 | - SHA-256 has hardware acceleration on many platforms, but
 | SHA-512 mostly doesn't.
 |
 | - Setting aside hardware acceleration, SHA-256 is faster on
 | 32-bit platforms, like a lot of embedded devices. If you
 | have to choose between "fast on a desktop" vs "fast in
 | embedded", it can make sense to assume that desktops are
 | always fast enough and that your bottlenecks will be in
 | embedded.
 | adrian_b wrote:
 | On older 64-bit CPUs without hardware SHA-256 (i.e. up to
 | the Skylake derivatives), SHA-512 is faster.
 | | Many recent Arm CPUs have hardware SHA-512 (and SHA-3).
 |
 | Intel will add hardware SHA-512 starting with Arrow Lake S,
 | to be launched at the end of 2024 (the successor in desktops
 | of the current Raptor Lake Refresh).
 |
 | Most 64-bit CPUs that have been sold during the last 4
 | years, and many of those older than that, have hardware
 | SHA-256.
 | garblegarble wrote:
 | This may only be applicable to certain CPUs - e.g. sha512
 | is a lot slower on M1:
 |
 |   $ openssl speed sha256 sha512
 |   type     16 bytes    64 bytes    256 bytes    1024 bytes   8192 bytes   16384 bytes
 |   sha256  146206.63k  529723.90k  1347842.65k  2051092.82k  2409324.54k  2446518.95k
 |   sha512   85705.68k  331953.22k   707320.92k  1149420.20k  1406851.34k  1427259.39k
 | nayuki wrote:
 | It's an interesting set of reasons, but I prefer
 | Keccak/SHA-3 over SHA-256, SHA-512, and BLAKE. I trust the
 | standards body and public competition and auditing that took
 | place - more so than a single author trumpeting the virtues
 | of BLAKE.
 | tptacek wrote:
 | I'd probably use a Blake too. But:
 |
 | _SHA256 was based on SHA1 (which is weak). BLAKE was based
 | on ChaCha20, which was based on Salsa20 (which are both
 | strong)._
 |
 | _NIST/NSA have repeatedly signaled lack of confidence in
 | SHA256: first by hastily organising the SHA3 contest in the
 | aftermath of Wang's break of SHA1_
 |
 | No: SHA2 lacks the structure the SHA1 attack relies on
 | (SHA1 has a linear message schedule, which made it possible
 | to work out a differential cryptanalysis attack on it).
 |
 | Blake's own authors keep saying SHA2 is secure (modulo
 | length extension), but people keep writing stuff like this.
 | Blake3 is a good and interesting choice on the real merits!
 | It doesn't need the elbow throw.
 | ianopolous wrote:
 | Would be interesting to hear Zooko's response to this.
 | (Peergos lead here)
 | pclmulqdq wrote:
 | Most people who publicly opine on the Blake vs. SHA2 debate
 | seem to be relatively uninformed on the realities of each
 | one.
 | SHA2 and the Blakes are both usually considered to be
 | secure.
 |
 | The performance arguments most people make are also
 | outdated or specious: the original comparisons of Blake vs
 | SHA2 performance on CPUs were largely done before Intel and
 | AMD had special SHA2 instructions.
 | ianopolous wrote:
 | The author is one of the creators of blake3, Zooko.
 | tptacek wrote:
 | Sorry, I should have been more precise. JP Aumasson is
 | specifically who I'm thinking of; he's made the
 | semi-infamous claim that SHA2 won't be broken in his
 | lifetime. The subtext I gather is that there's just nothing
 | on the horizon that's going to get it. SHA1 we saw coming a
 | ways away!
 | zahllos wrote:
 | Quoting directly from
 | https://nostarch.com/crypto-dictionary under the entry
 | SHA-2:
 |
 | > Unlike SHA-1, SHA-2 algorithms aren't broken and are
 | unlikely to ever be.
 |
 | There's also the fact that NIST themselves deprecated
 | SHA-1 in 2011
 | (https://csrc.nist.gov/news/2017/research-results-on-sha-1-co...
 | not mentioned, but otherwise nice timeline here:
 | https://crypto.stackexchange.com/a/60655), yet SHA-2 is
 | still OK. Wiki has a table of cryptanalysis of SHA-2:
 | https://en.wikipedia.org/wiki/SHA-2#Cryptanalysis_and_valida...
 |
 | The summary is that either you attack a very reduced-round
 | variant and get "practical" complexity for the attack, or
 | you attack a nearly full-round variant and the attack
 | complexity is entirely impractical.
 |
 | So I think your interpretation of the subtext is entirely
 | correct.
 | honzaik wrote:
 | Also, the NSA is currently recommending changing
 | SHA3/Keccak inside Dilithium and Kyber into SHA2-based
 | primitives...
 | https://groups.google.com/a/list.nist.gov/g/pqc-forum/c/SPTp...
 | twiss wrote:
 | For those who didn't click the link, it should be noted that
 | they're suggesting this because it would be easier to deploy
 | (in places that have a SHA-2 implementation but not SHA-3),
 | not for reasons related to security or anything like that.
| Looking at the responses, there's also some disagreement on | whether it would offer equal security for the particular use | case of ML-DSA and ML-KEM (as the final versions of Dilithium | and Kyber standardized by NIST will be called). | hellcow wrote: | > they're suggesting this because it would be easier to | deploy (in places that have a SHA-2 implementation but not | SHA-3), not for reasons related to security | | That's a bit absurd, right? Sure, the NSA didn't overtly | say, "we propose you use SHA-2 because we can break it." | That doesn't mean it's secure against them. | | We can't take their stated justification for supporting | one algorithm over another at face value, because the NSA lies. Their very | _purpose_ as an organization is to defeat encryption, and | one tactic is to encourage the industry to use something | they can defeat while reassuring people it's secure. We | need to look at their recommendations with a lot of | suspicion and assume an ulterior motive. | tptacek wrote: | The article uses inferred NSA preferences as | justification to avoid SHA2. Can't have it both ways. | pbsd wrote: | While there is more confidence now in the security of SHA-2, or | rather in the lack of transference of the SHA-1 approach to SHA-2, | this was not the case in 2005-2006 when NIST decided to hold | the SHA-3 competition. See for example the report on Session 4 | of the 2005 NIST workshop on hash functions [1]. | | [1] https://csrc.nist.gov/events/2005/first-cryptographic-hash-w... | omginternets wrote: | What do you mean by "weak" and "strong", here? | nabla9 wrote: | _Weak_ means that a mathematical flaw has been discovered | that makes it inherently insecure, or that it is so simple | that modern computer technology makes it possible to crack it | by "brute force". _Strong_ means that it is neither.
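To make the "brute force" case concrete, here is a toy Python sketch (illustrative only; the truncated hash and the counter-based message encoding are invented for the example): with a hash shortened to 32 bits, a collision turns up after only on the order of 2^16 attempts, per the birthday bound.

```python
import hashlib

def weak_hash(data: bytes) -> bytes:
    # A deliberately weak hash for demonstration: SHA-256 truncated to 32 bits.
    return hashlib.sha256(data).digest()[:4]

def find_collision():
    # Birthday search: with a 32-bit output, a collision is expected after
    # roughly 2^16 hashes -- trivial for "modern computer technology".
    seen = {}
    i = 0
    while True:
        msg = i.to_bytes(8, "big")
        digest = weak_hash(msg)
        if digest in seen:
            return seen[digest], msg  # two distinct messages, same digest
        seen[digest] = msg
        i += 1
```

A full-width 256-bit hash pushes the same search to ~2^128 attempts, which is what "strong" buys you.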
| chx wrote: | There are fundamentally two kinds of attacks. The first is | preimage, which itself splits into two: | | In a first-preimage attack, you know a hash value but not the | message that created it, and you want to discover any message | with the known hash value; in a second-preimage attack, you | have a message and you want to find a second message that has | the same hash. Attacks that can find one type of preimage can | often find the other as well. A weak algorithm allows this to | be done in less than 2^(hash length) attempts. | | And then there is collision: two messages which produce the | same hash. A weak algorithm allows this to be done in less | than 2^(half of hash length) attempts. | | Source: https://www.rfc-editor.org/rfc/rfc4270 | RcouF1uZ4gsC wrote: | SHA-256 has the advantage that it is used for Bitcoin. It is the | biggest bug bounty of all time. There are literally billions | riding on the security of SHA-256. | aburan28 wrote: | There has been a mountain of cryptanalysis done on SHA256 with no | major breaks, compared to a much smaller amount of analysis on | blake3. | jrockway wrote: | Fast hash functions are really important, and SHA256 is really | slow. Switching the hash function where you can is enough to | result in user-visible speedups for common hashing use cases: | verifying build artifacts, seeing if on-disk files changed, etc. | I was writing something to produce OCI container images a few | months ago, and the 3x SHA256 required by the spec for layers | actually takes on the order of seconds. (.5s to sha256 a 50MB | file, on my 2019-era Threadripper!) I was shocked to discover | this. (gzip is also very slow, like shockingly slow, but | fortunately the OCI spec lets you use Zstd, which is | significantly faster.) | coppsilgold wrote: | sha256 is not slow on modern hardware.
openssl doesn't have | blake3, but here is blake2:
|       type          16 bytes     64 bytes     256 bytes    1024 bytes   8192 bytes   16384 bytes
|       BLAKE2s256   75697.37k   308777.40k   479373.40k   567875.81k   592687.09k   591254.18k
|       BLAKE2b512   63478.11k   243125.73k   671822.08k   922093.51k  1047833.51k  1048959.57k
|       sha256      129376.82k   416316.32k  1041909.33k  1664480.49k  2018678.67k  2043838.46k
| | This is with the x86 sha256 instructions: sha256msg1, | sha256msg2, sha256rnds2 | dralley wrote: | "modern hardware" deserves some caveats. AMD has supported | those extensions since the original Zen, but Intel CPUs | generally lacked them until only about 2 years ago. | adrian_b wrote: | For many years, starting in 2016, Intel supported | SHA-256 only in their Atom CPUs. | | The reason seems to be that the Atom CPUs were compared in | Geekbench with ARM CPUs, and without hardware SHA the Intel | CPUs would have obtained worse benchmark scores. | | In their big cores, SHA was added in 2019, in Ice Lake | (while Comet Lake still lacked it, being a Skylake | derivative), and since then all newer Intel CPUs have it. | | So except for the Intel Core CPUs, the x86 and ARM CPUs | have had hardware SHA for at least 7 years, while the Intel | Core CPUs have had it for the last 4 years. | richardwhiuk wrote: | If you want a fast hash function (and don't care about its | cryptographic properties), don't use a cryptographic hash | function. | dralley wrote: | BLAKE3 is actually competitive with non-cryptographic hashes | like crc32. | oconnor663 wrote: | To be fair, it _really_ depends on the platform. There's | an argument to be made that platforms where you care about | the difference are specifically the ones where BLAKE3 is | slower (no SIMD, no threads). | adrian_b wrote: | SHA256 is very fast on most modern CPUs, i.e. all AMD Zen, all | Intel Atom since 2016, Intel Core Ice Lake or newer, Armv8 and | Armv9. | | I use both SHA-256 and BLAKE3 every day.
BLAKE3 is faster only | because it is computed by multiple threads using all available | CPU cores. When restricted to a single thread, it is slower on | CPUs with fast hardware SHA-256. | | The extra speed of BLAKE3 is not always desirable. The fact | that it uses all cores can slow down other concurrent | activities, without decreasing the overall execution time of | the application. | | There are cases when the computation of a hash like SHA-256 can | be done as a background concurrent activity, or when the speed | of hashing is limited by the streaming speed of data from the | main memory or from an SSD, so spawning multiple threads does | not gain anything and only gets in the way of other | activities. | | So the right choice between SHA-256 and BLAKE3 depends on the | application. Both can be useful. SHA-256 has the additional | advantage that it needs negligible additional code; only a few | lines are necessary to write a loop that computes the hash | using the hardware instructions. | ur-whale wrote: | One metric that is seldom mentioned for crypto algos is code | complexity. | | I really wish researchers would at least pay lip service to it. | | TEA (an unfortunately somewhat weak symmetric cipher) was a very | nice push in that direction. | | TweetNaCl was another very nice push in that direction, by djb. | | Why care about that metric, you ask? | | Well, here are a couple of reasons:
|       - algo fits in head
|       - algo is short -> cryptanalysis likely easier
|       - algo is short -> less likely to have buggy implementation
|       - algo is short -> side-channel attacks likely easier to analyse
|       - algo fits in a 100 line c++ header -> can be incorporated into anything
|       - algo can be printed on a t-shirt, thereby skirting export control restrictions
|       - algo can easily be implemented on tiny micro-controllers
| | etc ...
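The TEA example above is easy to demonstrate: the whole cipher fits in a dozen lines. Here is a sketch transcribing the standard published 32-round TEA structure into Python (the variable names and the decryption helper are ours, not from the thread):

```python
def tea_encrypt(v, k):
    # Encrypt one 64-bit block v = (y, z) with a 128-bit key k = (k0, k1, k2, k3).
    # Standard 32-round TEA; all arithmetic is mod 2^32 (hence the mask).
    y, z = v
    delta, total, mask = 0x9E3779B9, 0, 0xFFFFFFFF
    for _ in range(32):
        total = (total + delta) & mask
        y = (y + (((z << 4) + k[0]) ^ (z + total) ^ ((z >> 5) + k[1]))) & mask
        z = (z + (((y << 4) + k[2]) ^ (y + total) ^ ((y >> 5) + k[3]))) & mask
    return y, z

def tea_decrypt(v, k):
    # Run the rounds in reverse order to invert tea_encrypt.
    y, z = v
    delta, mask = 0x9E3779B9, 0xFFFFFFFF
    total = (delta * 32) & mask
    for _ in range(32):
        z = (z - (((y << 4) + k[2]) ^ (y + total) ^ ((y >> 5) + k[3]))) & mask
        y = (y - (((z << 4) + k[0]) ^ (z + total) ^ ((z >> 5) + k[1]))) & mask
        total = (total - delta) & mask
    return y, z
```

Small enough to audit in one sitting, which is exactly the "algo fits in head" argument (the cipher's known weaknesses, e.g. its related-key problems, are a separate matter).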
| oconnor663 wrote: | We put a lot of effort into section 5.1.2 of | https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak..., and the | complicated part of BLAKE3 (incrementally building the Merkle | tree) ends up being ~4 lines of code. Let me know what you | think. | rstuart4133 wrote: | > One metric that is seldom mentioned for crypto algos is code | complexity. ... TEA (an unfortunately somewhat weak symmetric | cipher) was a very nice push in that direction. | | Speck is also a push in that direction [0]. Its code looks to | be as complex as TEA's (1/2 a page of C), blindingly fast, yet | as far as I know it has no known attacks despite being subject to a | fair bit of scrutiny. About the only reason I can see for it | being largely ignored is that it was designed by the NSA. | | SHA3 is also a simple algorithm. Downright pretty, in fact. | It's a pity it's so slow. | | [0] https://en.wikipedia.org/wiki/Speck_(cipher) | colmmacc wrote: | It's very hard to see Blake3 getting included in FIPS. Meanwhile, | SHA256 is. That's probably the biggest deciding factor on whether | you want to use it or not. | Retr0id wrote: | Blake3 is a clear winner for large inputs. | | However, for smaller inputs (~1024 bytes and down), the | performance gap between it and everything else (blake2, sha256) | gets much narrower, because you don't get to benefit from the | structural parallelization. | | If you're mostly dealing with small inputs, raw hash throughput | is _probably_ not high on your list of concerns - in the context | of a protocol or application, other costs like IO latency | probably completely dwarf the actual CPU time spent hashing. | | If raw performance is no longer high on your list of priorities, | you care more about other things - ubiquitous and | battle-tested library support (blake3 is still pretty bleeding-edge, in | the grand scheme of things), FIPS compliance (sha256), greater | on-paper security margin (blake2).
Which is all to say, while | blake3 _is_ great, there are still plenty of reasons not to | prefer it for a particular use-case. | LegibleCrimson wrote: | How does the extended output work, and what's the point of | extended output? | | From what I can see, BLAKE3 has 256 bits of security, and | extended output doesn't provide any extra security. In that case, | what's the point of extended output over doing something like | padding with 0-bits, or extending by re-hashing the previous | output and appending it to the previous output (e.g., for 1024 | bits, doing h(m) . h(h(m)) . h(h(h(m))) . h(h(h(h(m)))))? Either | way, you get 256 bits of security. | | Is it just because the design of the hash makes it simple to do, | so it's offered as a consistent option for arbitrary output | sizes where needed, or is there some greater purpose that I'm | missing? | oconnor663 wrote: | > From what I can see, BLAKE3 has 256 bits of security, and | extended output doesn't provide any extra security. | | 128 bits of collision resistance, but otherwise correct. As a | result of that we usually just call it 128 bits across the | board, but yes, in an HMAC-like use case you would generally | expect 256 bits of security from the 256-bit output. Extended | outputs don't change that, because the internal chaining values | are 256 bits even when the output is larger. | | > extending by re-hashing the previous output and appending it | to the previous output | | It's not quite that simple, because you don't want later parts | of your output to be predictable from earlier parts (which | might be published, depending on the use case). You also want | it to be parallelizable. | | You could compute H(m) as a "pre-hash" and then make an | extended output something like H(H(m)|1)|H(H(m)|2)|... That's | basically what BLAKE3 is doing on the inside.
The advantage of | having the algorithm do it for you is that 1) it's an "off the | shelf" feature that doesn't require users to roll their own | crypto and 2) it's slightly faster when the input is short, | because you don't have to spend an extra block operation | computing the pre-hash. | | > what's the point of extended output? | | It's kind of niche, but for example Ed25519 needs a 512 bit | hash output internally to "stretch" its secret seed into two | 256-bit keys. You could also use a BLAKE3 output reader as a | stream cipher or a random byte generator. (These sorts of use | cases are why it's nice not to make the caller tell you the | output length in advance.) | 15155 wrote: | Keccak is my preference. Keccak is substantially easier to | implement in hardware: fewer operations and no carry propagation | delay issue because there's no addition. ___________________________________________________________________ (page generated 2023-11-13 23:00 UTC)
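The counter-based construction sketched in the thread above (H(H(m)|1) | H(H(m)|2) | ...) can be written out in a few lines. Here is an illustrative Python version using SHA-256 as a stand-in for the underlying hash; it shows the general idea only, not BLAKE3's real output function:

```python
import hashlib

def extended_output(message: bytes, out_len: int) -> bytes:
    # Pre-hash the message once, then derive each 32-byte output block from
    # (pre-hash || block counter). Later blocks can't be predicted from earlier
    # published blocks without the pre-hash, and each block is independently
    # computable, so output generation parallelizes.
    prehash = hashlib.sha256(message).digest()
    n_blocks = (out_len + 31) // 32
    blocks = [
        hashlib.sha256(prehash + counter.to_bytes(8, "big")).digest()
        for counter in range(1, n_blocks + 1)
    ]
    return b"".join(blocks)[:out_len]
```

One convenient property of this shape (shared with BLAKE3's real extended output): a shorter output is a prefix of a longer one, so the caller doesn't have to commit to a length in advance.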