[HN Gopher] Reasons to Prefer Blake3 over Sha256
       ___________________________________________________________________
        
       Reasons to Prefer Blake3 over Sha256
        
       Author : ementally
       Score  : 168 points
       Date   : 2023-11-13 12:34 UTC (10 hours ago)
        
 (HTM) web link (peergos.org)
 (TXT) w3m dump (peergos.org)
        
       | ndsipa_pomu wrote:
       | At this rate, it's going to take over 700 years before we get
       | Blake's 7
        
         | benj111 wrote:
         | I had to scroll disappointingly far down to get to the Blake's
         | 7 reference.
         | 
         | Thank you for not disappointing though.
         | 
         | The down side of that algorithm though is that everything dies
         | at the end.
        
       | dragontamer wrote:
       | > BLAKE3 is much more efficient (in time and energy) than SHA256,
       | like 14 times as efficient in typical use cases on typical
       | platforms.
       | 
       | [snip]
       | 
       | > AVX in Intel/AMD, Neon and Scalable Vector Extensions in Arm,
       | and RISC-V Vector computing in RISC-V. BLAKE3 can take advantage
       | of all of it.
       | 
       | Uh huh... AVX/x86 and NEON/ARM you say?
       | 
       | https://www.felixcloutier.com/x86/sha256rnds2
       | 
       | https://developer.arm.com/documentation/ddi0596/2021-12/SIMD...
       | 
       | If we're talking about vectorized instruction sets like AVX
       | (Intel/AMD) or NEON (aka: ARM), the advantage is clearly with
       | SHA256. I don't think Blake3 has any hardware implementation at
       | all yet.
       | 
       | Your typical cell phone running ARMv8 / NEON will be more
       | efficient with the SHA256 instructions than whatever software
       | routine you need to run Blake3. Dedicated hardware inside the
       | cores is very difficult to beat on execution speed or efficiency.
       | 
       | I admit that I haven't run any benchmarks on my own. But I'd be
       | very surprised if any software routine were comparable to the
       | dedicated SHA256 instructions found on modern cores.
        
         | eatonphil wrote:
         | From another thread:
         | 
         | > On my machine with sha extensions, blake3 is about 15% faster
         | (single threaded in both cases) than sha256.
         | 
         | https://news.ycombinator.com/item?id=22237387
        
           | vluft wrote:
            | followup to this now with further blake3 improvements, on a
            | faster machine, comparing sha256 with sha extensions vs
            | single-threaded blake3; blake3 is about 2.5x faster than
            | sha256 now (b3sum 1.5.0 vs openssl 3.0.11). b3sum is about
            | 9x faster than sha256sum from coreutils (GNU, 9.3), which
            | does not use the sha extensions.
            | 
            |     Benchmark 1: openssl sha256 /tmp/rand_1G
            |       Time (mean +- s):    576.8 ms +-  3.5 ms
            |       [User: 415.0 ms, System: 161.8 ms]
            |       Range (min ... max): 569.7 ms ... 580.3 ms    10 runs
            | 
            |     Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
            |       Time (mean +- s):    228.7 ms +-  3.7 ms
            |       [User: 168.7 ms, System: 59.5 ms]
            |       Range (min ... max): 223.5 ms ... 234.9 ms    13 runs
            | 
            |     Benchmark 3: sha256sum /tmp/rand_1G
            |       Time (mean +- s):    2.062 s +-  0.025 s
            |       [User: 1.923 s, System: 0.138 s]
            |       Range (min ... max): 2.046 s ...  2.130 s    10 runs
            | 
            |     Summary
            |       b3sum --num-threads 1 /tmp/rand_1G ran
            |         2.52 +- 0.04 times faster than openssl sha256 /tmp/rand_1G
            |         9.02 +- 0.18 times faster than sha256sum /tmp/rand_1G
        
              | 0xdeafbeef wrote:
              | Interestingly, sha256sum and openssl don't use sha_ni.
              | 
              |     iced-cpuid $(which b3sum)
              |       AVX AVX2 AVX512F AVX512VL BMI1 CET_IBT CMOV SSE SSE2
              |       SSE4_1 SSSE3 SYSCALL XSAVE
              | 
              |     iced-cpuid $(which openssl)
              |       CET_IBT CMOV SSE SSE2
              | 
              |     iced-cpuid $(which sha256sum)
              |       CET_IBT CMOV SSE SSE2
              | 
              | Also, sha256sum in my case is a bit faster:
              | 
              |     Benchmark 1: openssl sha256 /tmp/rand_1G
              |       Time (mean +- s):    540.0 ms +-  1.1 ms
              |       [User: 406.2 ms, System: 132.0 ms]
              |       Range (min ... max): 538.5 ms ... 542.3 ms    10 runs
              | 
              |     Benchmark 2: b3sum --num-threads 1 /tmp/rand_1G
              |       Time (mean +- s):    279.6 ms +-  0.8 ms
              |       [User: 213.9 ms, System: 64.4 ms]
              |       Range (min ... max): 278.6 ms ... 281.1 ms    10 runs
              | 
              |     Benchmark 3: sha256sum /tmp/rand_1G
              |       Time (mean +- s):    509.0 ms +-  6.3 ms
              |       [User: 386.4 ms, System: 120.5 ms]
              |       Range (min ... max): 504.6 ms ... 524.2 ms    10 runs
        
                | vluft wrote:
                | not sure that tool is correct; on my openssl it shows the
                | same output as you have there, but not aes-ni, which is
                | definitely enabled and functional.
                | 
                | ETA: ahh, you want to run that on libcrypto:
                | 
                |     iced-cpuid <...>/libcrypto.so.3:
                |       ADX AES AVX AVX2 AVX512BW AVX512DQ AVX512F AVX512VL
                |       AVX512_IFMA BMI1 BMI2 CET_IBT CLFSH CMOV D3NOW MMX
                |       MOVBE MSR PCLMULQDQ PREFETCHW RDRAND RDSEED RTM SHA
                |       SMM SSE SSE2 SSE3 SSE4_1 SSSE3 SYSCALL TSC VMX XOP
                |       XSAVE
        
               | vluft wrote:
               | further research suggests that GNU coreutils cksum will
               | use libcrypto in some configurations (though not mine); I
                | expect that both of your commands above are actually
               | using sha-ni
        
             | TacticalCoder wrote:
             | > ...on a faster machine now but with sha extensions vs
             | single-threaded blake3; blake3 is about 2.5x faster than
             | sha256 now
             | 
              | But one of the sweet benefits of blake3 is that it is
              | parallelized _by default_. I picked blake3 not because it's
              | 2.5x faster than sha256 when running b3sum with
              | "--num-threads 1" but because, with the default, it's 16x
              | faster than sha256 (on a machine with "only" 8 cores).
              | 
              | And Blake3, contrary to some other "parallelizable"
              | hashes, gives the same hash no matter whether you run it
              | with one thread or any number of threads (IIRC there are
              | hashes which have different executables depending on
              | whether you want the single-threaded or multi-threaded
              | version of the hash, and they give different checksums).
              | 
              | I tried on my machine (which is a bit slower than yours)
              | and I get:
              | 
              |     990 ms  openssl sha256
              |     331 ms  b3sum --num-threads 1
              |      70 ms  b3sum
              | 
              | That's where the big performance gain is when using Blake3
              | IMO (even though 2.5x faster than a fast sha256 even when
              | single-threaded is already nice).
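              | 
              | For library users, a minimal Rust sketch of the same
              | comparison (assuming the blake3 crate with its optional
              | "rayon" cargo feature; update_rayon is the multi-threaded
              | path):
              | 
              |     use blake3::Hasher;
              |     
              |     fn main() {
              |         let data = vec![0u8; 1 << 20]; // stand-in for a big file
              |     
              |         // Single-threaded hash.
              |         let h1 = blake3::hash(&data);
              |     
              |         // Multi-threaded hash over the same data.
              |         let mut hasher = Hasher::new();
              |         hasher.update_rayon(&data);
              |         let h2 = hasher.finalize();
              |     
              |         // Same tree structure, same result, regardless of
              |         // thread count.
              |         assert_eq!(h1, h2);
              |     }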
        
                | vluft wrote:
                | yup, for comparison, same file as above using all the
                | threads (32) on my system, I get about 45 ms with fully
                | parallel b3. It does run into diminishing returns fairly
                | quickly though; unsurprisingly no improvement in perf
                | from hyperthreading, but improvements also drop off
                | pretty fast with more cores.
                | 
                |   b3sum --num-threads 16 /tmp/rand_1G ran
                |     1.01 +- 0.02 times faster than b3sum --num-threads 15 /tmp/rand_1G
                |     1.01 +- 0.02 times faster than b3sum --num-threads 14 /tmp/rand_1G
                |     1.03 +- 0.02 times faster than b3sum --num-threads 13 /tmp/rand_1G
                |     1.04 +- 0.02 times faster than b3sum --num-threads 12 /tmp/rand_1G
                |     1.07 +- 0.02 times faster than b3sum --num-threads 11 /tmp/rand_1G
                |     1.10 +- 0.02 times faster than b3sum --num-threads 10 /tmp/rand_1G
                |     1.13 +- 0.02 times faster than b3sum --num-threads 9 /tmp/rand_1G
                |     1.20 +- 0.03 times faster than b3sum --num-threads 8 /tmp/rand_1G
                |     1.27 +- 0.03 times faster than b3sum --num-threads 7 /tmp/rand_1G
                |     1.37 +- 0.02 times faster than b3sum --num-threads 6 /tmp/rand_1G
                |     1.53 +- 0.05 times faster than b3sum --num-threads 5 /tmp/rand_1G
                |     1.72 +- 0.03 times faster than b3sum --num-threads 4 /tmp/rand_1G
                |     2.10 +- 0.04 times faster than b3sum --num-threads 3 /tmp/rand_1G
                |     2.84 +- 0.06 times faster than b3sum --num-threads 2 /tmp/rand_1G
                |     5.03 +- 0.12 times faster than b3sum --num-threads 1 /tmp/rand_1G
                | 
                | (runs with over 16 threads elided, as they're all ~= the
                | 16-thread time)
        
               | oconnor663 wrote:
               | Figure 4 from https://github.com/BLAKE3-team/BLAKE3-specs
               | /blob/master/blak... is a related benchmark. On that
               | particular machine we saw good scaling up to 16 threads,
               | but yeah somewhere in that neighborhood you start to run
               | into memory bandwidth issues. Which is the problem you
               | want I guess :)
        
           | api wrote:
           | I'd be curious to see power consumption. SHA (and AES) are
           | usually available as what amounts to an ASIC built into the
           | processor, while this requires a lot more work to be done
           | with vector instructions.
        
             | vluft wrote:
             | this is less precise than the perf numbers as I don't
              | really have a way to measure power directly, but
              | re-running the benchmarks above locked to a CPU core, it
              | boosted to ~the same level for all 3 commands (about 5.5
              | GHz), so it should be ~the same power usage.
        
             | wmf wrote:
        | If Blake3 is 2.5x faster, then it's going to use roughly 2.5x
        | less energy.
        
               | silotis wrote:
               | That's not how it works on modern CPUs. Power draw at
               | "100%" utilization can vary widely depending on what part
               | of the core is being utilized. The SIMD units are
               | typically the most power hungry part of the CPU by a
               | large margin so just because a job finishes in less time
               | doesn't mean total energy is necessarily lower.
        
               | wmf wrote:
               | I'm assuming that both SHA256 and Blake3 are using
               | integer SIMD.
        
               | api wrote:
               | In some CPUs that may be true, but in many there are
               | dedicated SHA instructions that amount to a SHA ASIC in
               | the CPU.
               | 
               | AES units are even more common. Most non-tiny CPUs have
               | them these days.
        
         | insanitybit wrote:
         | > I don't think Blake3 has any hardware implementation at all
         | yet.
         | 
         | > https://github.com/BLAKE3-team/BLAKE3
         | 
         | > The blake3 Rust crate, which includes optimized
         | implementations for SSE2, SSE4.1, AVX2, AVX-512, and NEON, with
         | automatic runtime CPU feature detection on x86. The rayon
         | feature provides multithreading.
         | 
          | There aren't blake3 instructions the way some hardware has
          | dedicated SHA-1 and SHA-256 instructions, but it does use
          | hardware acceleration in the form of SIMD.
          | 
          | edit: Re-reading, I think you're saying "If we're going to talk
          | about hardware acceleration, SHA256 still has the advantage
          | because of dedicated instructions" - that is true.
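          | 
          | A minimal sketch with the blake3 Rust crate; the SIMD dispatch
          | mentioned above happens inside the crate at runtime, so the
          | caller just hashes:
          | 
          |     fn main() {
          |         // One-shot API; the crate picks the best available
          |         // backend (SSE4.1/AVX2/AVX-512/NEON) for this CPU.
          |         let hash = blake3::hash(b"hello world");
          |         println!("{}", hash.to_hex());
          |     }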
        
         | jonhohle wrote:
         | I just tested the C implementation on a utility I wrote[0] and
         | at least on macOS where SHA256 is hardware accelerated beyond
         | just NEON, BLAKE3 ends up being slower than SHA256 from
         | CommonCrypto (the Apple provided crypto library). BLAKE3 ends
         | up being 5-10% slower for the same input set.
         | 
         | As far as I'm aware, Apple does not expose any of the hardware
         | crypto functions, so unless what exists supports BLAKE3 and
         | they add support in CommonCrypto, there's no advantage to using
         | it from a performance perspective.
         | 
         | The rust implementation is multithreaded and ends up beating
         | SHA256 handily, but again, for my use case the C impl is only
         | single threaded, and the utility assumes a single threaded
         | hasher with one running on each core. Hashing is the bottleneck
         | for `dedup`, so finding a faster hasher would have a lot of
         | benefits.
         | 
         | 0 - https://github.com/ttkb-oss/dedup
        
       | tromp wrote:
        | For short inputs, Blake3 behaves very similarly to Blake2, on
        | which it is based. From BLAKE's Wikipedia page [1]:
       | 
       | BLAKE3 is a single algorithm with many desirable features
       | (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE and
       | BLAKE2, which are algorithm families with multiple variants.
       | BLAKE3 has a binary tree structure, so it supports a practically
       | unlimited degree of parallelism (both SIMD and multithreading)
       | given long enough input.
       | 
       | [1] https://en.wikipedia.org/wiki/BLAKE_(hash_function)
        
       | cesarb wrote:
       | While I really like Blake3, for all reasons mentioned in this
       | article, I have to say it does have one tiny disadvantage over
       | older hashes like SHA-256: its internal state is slightly bigger
       | (due to the tree structure which allows it to be highly
       | parallelizable). This can matter when running on tiny
       | microcontrollers with only a few kilobytes of memory.
        
         | londons_explore wrote:
         | The internal state is no bigger when hashing small things
         | though right?
         | 
         | I assume most microcontrollers are unlikely to be hashing
         | things much bigger than RAM.
        
           | oconnor663 wrote:
           | It's hard to give a short answer to that question :)
           | 
           | - Yes, if you know your input is short, you can use a smaller
           | state. The limit is roughly a BLAKE2s state plus (32 bytes
           | times the log_2 of the number of KiB you need to hash).
           | Section 5.4 of https://github.com/BLAKE3-team/BLAKE3-specs/bl
           | ob/master/blak... goes into this.
           | 
           | - But it's hard to take advantage of this space optimization,
           | because no libraries implement it in practice.
           | 
           | - But the reason libraries don't implement it is that almost
           | no one needs it. The max state size is just under 2 KiB,
           | which is small enough even for
           | https://github.com/oconnor663/blake3-6502.
           | 
           | - But it would be super easy to implement if we just put the
           | "CV stack" on the heap instead of allocating the whole thing
           | as an array up front.
           | 
           | - But the platforms that care about this don't have a heap.
           | 
            | @cesarb mentioned _really_ tiny microcontrollers, even
            | tinier than the 6502 maybe. The other place I'd expect to
            | see this optimization is in a full hardware implementation,
           | but those are rare. Most hardware accelerators for hash
           | functions provide the block operation, and they leave it to
           | software to deal with this sort of bookkeeping.
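            | 
            | For a rough sense of where "just under 2 KiB" comes from:
            | the reference implementation assumes inputs of at most 2^64
            | bytes, i.e. 2^54 chunks of 1 KiB, so the CV stack tops out
            | around 54 entries of 32 bytes each (1728 bytes), plus the
            | BLAKE2s-like chunk state on top of that.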
        
       | gavinhoward wrote:
       | Good, terse article that basically reinforces everything I've
       | seen in my research about cryptographic hashing.
       | 
       | Context: I'm building a VCS meant for any size of file, including
       | massive ones. It needs a cryptographic hash for the Merkle Tree.
       | 
       | I've chosen BLAKE3, and I'm going to use the original
       | implementation because of its speed.
       | 
        | However, I'm going to make it easy to change hash algorithms
        | _per commit_, just so I don't run into the situation Git had
        | trying to get rid of SHA1.
        
         | AdamN wrote:
          | Smart idea doing the hash choice per-commit. Just make sure
          | that somebody using an obscure hash doesn't break the repo for
          | everybody who doesn't have a library installed that can
          | evaluate that hash.
        
           | gavinhoward wrote:
           | I agree.
           | 
           | There will be a set of presets of hash function and settings;
           | if BLAKE3 fails, then I'll actually have to add SHA3 or
           | something, with a set of settings, as presets.
           | 
           | The per-commit storage will then be an enum identifying the
           | hash and its settings.
           | 
           | This will let me do other things, like letting companies use
           | a 512-bit hash if they expect the repo to be large.
        
             | agodfrey wrote:
             | Maybe you're already aware, but you glossed over something:
              | Since you're using the hash to locate/identify the content
             | (you mentioned Merkle and git), if you support multiple
             | hash functions you need some assurance that the chance of
             | collisions is low _across all supported hash functions_.
             | For example two identical functions that differ only in the
             | value of their padding bytes (when the input size doesn't
             | match the block size) can't coexist.
        
               | gavinhoward wrote:
               | You are absolutely right. And yes, I am aware.
               | 
               | Location will actually be done by prefixing the hash with
               | the value of the enum for the hash function/settings pair
               | that made the hash.
        
               | tczMUFlmoNk wrote:
               | Since you seem to have done a fair bit of research in
               | this area, do you have any opinions or thoughts about the
               | Multihash format?
               | 
               | https://multiformats.io/multihash/
               | 
               | It fills in some of the blanks in your "prefixing the
               | hash with the value of the enum for the hash" step.
        
               | gavinhoward wrote:
               | The multihash format is an excellent format that I am
               | tempted to use.
               | 
                | However, there are two general problems:
                | 
                | * The spec is not finished, and it doesn't look like much
                | work is being done on it.
               | 
               | * While I agree with the FAQ that agreeing on a set of
               | hash functions is possible, the format only allows 256
               | possible hash functions, so it can run out of space,
               | especially since there can be multiple formats of the
               | same function (BLAKE2B and BLAKE2S, for example), and
               | especially since they want to allow non-cryptographic
               | hash functions.
               | 
               | Then specifically for my use case:
               | 
               | * There is the problem brought up by AdamN [1]: if
               | multihash is supported, an obscure hash may be supported,
               | and that may cause problems.
               | 
               | * As an extension of that, once a hash function and set
               | of settings is supported, it's supported forever, so I
               | want to be more picky, and an enum allows me to restrict
               | what is usable.
               | 
               | * By using a 16-bit enum, I have more space to grow.
               | 
               | * And finally, by using an enum, if my software
               | encounters a repo with a enum value greater than all of
               | the values it knows, it knows that that repo needs a
               | later version of the software, since I will only _add_
               | enum values.
               | 
               | [1]: https://news.ycombinator.com/item?id=38250444
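                | 
                | A hypothetical Rust sketch of that enum-prefix scheme
                | (names and the single preset are invented for
                | illustration):
                | 
                |     /// Append-only registry of hash presets: values are
                |     /// only ever added, never reused or removed.
                |     #[repr(u16)]
                |     #[derive(Clone, Copy)]
                |     enum HashPreset {
                |         Blake3_256 = 0,
                |         // future presets get the next values...
                |     }
                |     
                |     /// An object ID is the 16-bit preset tag followed
                |     /// by the digest that preset produced.
                |     fn object_id(preset: HashPreset, digest: &[u8]) -> Vec<u8> {
                |         let mut id = (preset as u16).to_be_bytes().to_vec();
                |         id.extend_from_slice(digest);
                |         id
                |     }
                |     
                |     /// A reader that sees a tag above its known maximum
                |     /// knows the repo needs a newer software version.
                |     fn needs_newer_version(tag: u16) -> bool {
                |         tag > HashPreset::Blake3_256 as u16
                |     }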
        
             | FabHK wrote:
             | > letting companies use a 512-bit hash if they expect the
             | repo to be large.
             | 
             | A repo would have to have more than 1e32 documents for a
             | one in a trillion chance of a collision with a 256 bit
             | hash. (Total annual world data production is estimated at
             | less than 1e24 bytes.)
             | 
             | A 512 bit hash thus seems overkill for almost all purposes.
             | 
             | https://en.wikipedia.org/wiki/Birthday_problem
             | 
             | https://www.weforum.org/agenda/2021/05/world-data-
             | produced-s...
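              | 
              | Back-of-envelope with the usual birthday approximation
              | p ~ n^2 / 2^(b+1): with b = 256 and p = 1e-12, that gives
              | n ~ sqrt(1e-12 * 2^257) ~ 5e32 documents, which is where
              | the 1e32 order of magnitude comes from.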
        
               | gavinhoward wrote:
               | For normal VCS's, you are absolutely right. And you're
               | actually right for mine, but I decided to redo the math
               | to make sure.
               | 
               | My VCS will track things at a finer level than documents.
               | In a C file, it will track individual functions and
               | structs. In a Java file, it will track individual fields
               | and classes. In a Microsoft Word document, it might track
               | individual paragraphs. And in a Blender file, it will
               | track each object, material, texture, etc. individually.
               | 
               | Yes, it will handle binary files.
               | 
               | Anyway, it will also be designed for non-technical users.
               | To that end, it will hook into the source software and do
               | a "commit" every time the user saves.
               | 
               | It will also track individual directories to make renames
               | work.
               | 
               | I am a ctrl+s freak, so I save once a minute or more.
               | However, other people are not, so let's assume 10 minutes
               | (for autosave, perhaps).
               | 
               | Now let's assume a monorepo for a company of 100,000
               | people. And let's assume that when they save every 10
               | minutes, they save one object in one file (also tracked)
               | two directories down. That means they create 5 hashes
               | every 10 minutes (the fifth is the top level).
               | 
               | Let's assume an effective 6-hour work day.
               | 
               | That's 5 objects times 6 times per hour times 6 hours.
               | That's 180 objects a day per person.
               | 
               | That's 18,000,000 total objects per day. Times 5 for days
               | in a week, times 50 for work weeks in a year.
               | 
               | That's 4.5 billion.
               | 
               | Let's multiply that by 40 for 40 years that the repo
               | exists, which includes some of the oldest software.
               | 
               | That's 1.8e11 objects. According to [1], a 128-bit hash
               | would not be enough for the error correction on a disk at
               | that point.
               | 
                | However, a 256-bit hash would give us 10^31 objects
               | before reaching that point, which gives us 10^20 times 40
               | years of space.
               | 
               | Yep, you're absolutely right that 512 bits is overkill. I
               | stand corrected.
               | 
               | [1]: https://en.m.wikipedia.org/wiki/Birthday_attack
        
               | deadbeeves wrote:
               | You're tracking things at the content level? How will you
               | deal with files that are purposely broken, or which cause
               | the parser to take impractical (but finite) times to
               | complete? Also, tracking the history of a class makes
               | sense to some extent, but you say you want to commit
               | every time there's a save. How will you maintain a
               | history when most commits are likely to contain
               | unparseable code and so break the continuity of objects?
        
               | nextaccountic wrote:
                | In those cases you can just do error recovery in the
                | parser (truncating an erroring function, for example) and
                | then store out-of-band the information necessary to
                | reconstruct the original file.
                | 
                | This is also necessary to deal with whitespace, for
                | example (if you just reformat the code, you didn't change
                | the AST, but you changed the file).
        
               | gavinhoward wrote:
               | Good questions.
               | 
               | > How will you deal with files that are purposely broken,
               | or which cause the parser to take impractical (but
               | finite) times to complete?
               | 
               | I've never seen a language parser do that, but if I run
               | into a language that does that, I'll probably have my VCS
               | track it at the file level, based on tokens or lines.
               | 
               | Dumb languages don't get nice things. :)
               | 
               | > How will you maintain a history when most commits are
               | likely to contain unparseable code and so break the
               | continuity of objects?
               | 
               | This is less of a problem with binary files (assuming the
               | source software does not have bugs in output), but with
               | source files, you're right that that problem does exist.
               | 
               | As of right now, I would do a token-based approach. This
               | approach removes the need for whitespace-only commits,
               | and if I track the tokens right, I should be able to
               | identify which right brace _used_ to end the function
               | until the broken code was saved. Then I would just save
               | the function as broken using that same right brace.
               | 
                | For example, say you have this:
                | 
                |     int main() {
                |         return 0;
                |     }
                | 
                | My VCS would know that the right brace corresponds to the
                | end of the function.
                | 
                | Then you write this:
                | 
                |     int main() {
                |         if (global_bool) {
                |         return 0;
                |     }
               | 
               | Yes, a dumb system might think that the right brace is
               | for the `if`.
               | 
               | However, if you break it down by tokens, the VCS will see
               | that `if (global_bool) {` were added before the return,
               | so it should be able to tell that the right brace still
               | ends the function.
               | 
               | I hope that makes sense.
               | 
               | Another plausible way to do it (at least in C) would be
               | to look for things that look like declarations. The
               | series of tokens `<type> <name> <left_paren>` is probably
               | a function declaration. Java would be easier; its
               | declarations are more wordy.
               | 
               | I still have to prove this is possible, but I think it
               | is.
        
       | EdSchouten wrote:
       | What I dislike about BLAKE3 is that they added explicit logic to
       | ensure that identical chunks stored at different offsets result
       | in different Merkle tree nodes (a.k.a. the 'chunk counter').
       | 
        | Though this feature is well-intentioned, it makes this hash
        | function hard to use for a storage system where you try to do
        | aggressive data deduplication.
        | 
        | Furthermore, on platforms that provide native instructions for
        | SHA hashing, BLAKE3 isn't necessarily faster, and it is
        | certainly more power-hungry.
        
         | lazide wrote:
         | Huh?
         | 
         | The storage system doing this wouldn't use that part of the
         | hash, it would do it itself so no issues? (Hash chunks, instead
         | of feeding everything in linearly)
         | 
         | Otherwise the hash isn't going to be even remotely safe for
         | most inputs?
        
         | persnickety wrote:
         | Could you point to how this is implemented and how it can be
         | used? From the sound of it, you're trying to do something like
         | rsync's running-window comparison?
        
           | EdSchouten wrote:
           | Imagine the case where you're trying to create a storage
           | system for a large number of virtual machine images (e.g.,
           | you're trying to build your own equivalent of AWS Machine
           | Images). There is obviously a lot of duplication between
           | parts of images. And not necessarily at the same offset, but
           | also at different offsets that are n*2^k bytes apart, where
           | 2^k represents the block/sector size.
           | 
           | You could consider building this storage system on top of
            | BLAKE3's tree model: you store blocks as small Merkle
            | trees, and an image is basically a collection of blocks
            | with a different 'hat' on top of it. Unfortunately, BLAKE3
            | makes this hard, because the same block will end up having
            | a different Merkle tree node depending on the offset at
            | which it's stored.
        
             | luoc wrote:
             | You mean something like a CDC algorithm? I know that some
              | backup tools like Restic use this.
             | 
             | https://en.m.wikipedia.org/wiki/Rolling_hash
        
               | EdSchouten wrote:
                | You can use a CDC algorithm, but if you know that
                | duplication mostly occurs at power-of-two boundaries,
                | there is no need for that. Deduplicating on a binary
                | tree basis is sufficient.
        
             | londons_explore wrote:
             | Sounds to me like you are trying to use the innards of a
             | hash algorithm for something for which it was not
             | designed...
             | 
             | Either modify the algorithm to your needs, and rename it.
             | 
             | Or just use something thats already suitable off-the-shelf.
             | Plenty of merkle-trees out there.
        
             | luoc wrote:
              | I think CDC is what you're looking for. Some backup tools
              | like Restic use it. See
             | https://en.m.wikipedia.org/wiki/Rolling_hash
        
               | luoc wrote:
               | Duplicated myself, sry
        
             | marktangotango wrote:
             | > You could consider building this storage system on top of
             | BLAKE3's tree model.
             | 
              | Consider a cryptocurrency PoW that did that without the
              | chunk counter. It'd be trivially exploitable by
              | precalculating all of the tree but the chunk that changes
              | per nonce.
        
             | prirun wrote:
             | Author of HashBackup here. I don't see how any kind of hash
             | tree would be effective at de-duplicating VM machine
             | images, other than the degenerate case of an exact copy,
             | which is easy to detect with a single file hash.
             | 
             | Most OSes use 4K block sizes. To get the best dedup you
             | have to hash every 4K block and lookup each one
             | individually in a dedup table. Two VM images could both
             | contain an identical 4GB file, but every 4K block of that
             | file could be stored at different offsets in the VM images.
             | A tree hash wouldn't let you dedup anything but identical
             | sections stored at identical offsets, whereas a dedup table
             | and 4K blocks allows you to dedup the entire file.
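              | 
              | A minimal Rust sketch of that dedup-table approach
              | (dedup_blocks is an invented helper; BLAKE3 is used here
              | only as an ordinary content hash):
              | 
              |     use std::collections::HashMap;
              |     
              |     /// Store every distinct 4 KiB block once and describe
              |     /// an image as a list of block hashes, so identical
              |     /// blocks dedup at any offset.
              |     fn dedup_blocks(
              |         image: &[u8],
              |         store: &mut HashMap<[u8; 32], Vec<u8>>,
              |     ) -> Vec<[u8; 32]> {
              |         image
              |             .chunks(4096)
              |             .map(|block| {
              |                 let key = *blake3::hash(block).as_bytes();
              |                 store.entry(key).or_insert_with(|| block.to_vec());
              |                 key
              |             })
              |             .collect()
              |     }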
        
         | oconnor663 wrote:
         | We go over some of our reasoning around that in section 7.5 of 
         | https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blak...
         | . An early BLAKE3 prototype actually didn't include the chunk
         | counter (https://github.com/oconnor663/bao/blob/master/docs/spe
         | c_0.9....), so I'm definitely sympathetic to the use cases that
         | wish it wasn't there. However, after publication we found out
         | that something like a chunk counter is necessary for the
         | security of the Bao streaming verification tool:
         | https://github.com/oconnor663/bao/issues/41. It could be that
         | there's a design that's the best of both worlds, but I'm not
         | sure.
        
       | stylepoints wrote:
        | Until it comes installed by default on Linux and other major
        | OSes, it won't be mainstream.
        
         | theamk wrote:
         | Python 3.11 will have it https://bugs.python.org/issue39298
        
           | latexr wrote:
           | That says "Resolution: rejected" and Python is currently at
           | 3.12.0. Did the feature land?
        
             | theamk wrote:
             | oops I misread it.. seems it was rejected because it was
             | not standard enough...
             | 
             | https://github.com/python/cpython/issues/83479#issuecomment
             | -...
        
       | syonfox wrote:
       | murmur3
        
         | Snawoot wrote:
         | murmur3 is not a cryptographic hash, so it's not even in the
         | same field.
        
       | latexr wrote:
       | It bears mentioning `shasum` is better supported in that it ships
       | with operating systems (macOS, I guess Linux depends on the
       | distro, don't know about Windows) and is available directly from
       | programming languages (Ruby, Swift, Python, ...) without the need
       | for libraries.
       | 
       | Even if BLAKE3 is massively faster, it's not like I ever noticed
       | SHA256's supposed slowness. But I do notice its ubiquity.
       | 
       | Based on the article, I would consider switching to BLAKE3
       | immediately where I use checksums. But until it gets wider
       | support (might be easier with a public domain license instead of
       | the current ones) I can't really do it because I need to do
       | things with minimal dependencies.
       | 
       | Best of luck to the BLAKE3 team on making their tool more widely
       | available.
        
         | upget_tiding wrote:
         | > might be easier with a public domain license instead of the
         | current ones
         | 
          | The reference implementation is public domain (CC0) or, at
          | your choice, Apache 2.0:
         | 
         | https://github.com/BLAKE3-team/BLAKE3/blob/master/LICENSE
        
       | zahllos wrote:
       | I agree that if you can, BLAKE3 (or even BLAKE2) are nicer
       | choices than SHA2. However I would like to add the following
       | comments:
       | 
       | * SHA-2 fixes the problems with SHA-1. SHA-1 was a step up over
       | SHA-0 that did not completely resolve flaws in SHA-0's design
       | (SHA-0 was broken very quickly).
       | 
       | * JP Aumasson (one of the B3 authors) has said publicly a few
       | times SHA-2 will never be broken:
       | https://news.ycombinator.com/item?id=13733069 is an indirect
       | source, can't seem to locate a direct one from Xitter (thanks
       | Elon).
       | 
       | Thus it does not necessarily follow that SHA-2 is a bad choice
       | because SHA-1 is broken.
        
         | gavinhoward wrote:
         | All that may be true.
         | 
         | However, I don't think we can say for sure if SHA2 will be
         | broken. Cryptography is hard like that.
         | 
         | In addition, SHA2 is still vulnerable to length extension
         | attacks, so in a sense, SHA2 _is_ broken, at least when length
         | extension attacks are part of the threat model.
        
       | Godel_unicode wrote:
       | I don't understand why people use sha256 when sha512 is often
       | significantly faster:
       | 
       | https://crypto.stackexchange.com/questions/26336/sha-512-fas...
        
         | oconnor663 wrote:
         | A couple reasons just on the performance side:
         | 
         | - SHA-256 has hardware acceleration on many platforms, but
         | SHA-512 mostly doesn't.
         | 
         | - Setting aside hardware acceleration, SHA-256 is faster on
         | 32-bit platforms, like a lot of embedded devices. If you have
         | to choose between "fast on a desktop" vs "fast in embedded", it
         | can make sense to assume that desktops are always fast enough
         | and that your bottlenecks will be in embedded.
        
           | adrian_b wrote:
           | On older 64-bit CPUs without hardware SHA-256 (i.e. up to the
           | Skylake derivatives), SHA-512 is faster.
           | 
           | Many recent Arm CPUs have hardware SHA-512 (and SHA-3).
           | 
           | Intel will add hardware SHA-512 starting with Arrow Lake S,
           | to be launched at the end of 2024 (the successor in desktops
           | of the current Raptor Lake Refresh).
           | 
           | Most 64-bit CPUs that have been sold during the last 4 years
           | and many of those older than that have hardware SHA-256.
        
         | garblegarble wrote:
          | This may only be applicable to certain CPUs - e.g. sha512 is
          | a lot slower on M1:
          | 
          |     $ openssl speed sha256 sha512
          |     type        16 bytes    64 bytes    256 bytes    1024 bytes   8192 bytes   16384 bytes
          |     sha256    146206.63k  529723.90k  1347842.65k  2051092.82k  2409324.54k  2446518.95k
          |     sha512     85705.68k  331953.22k   707320.92k  1149420.20k  1406851.34k  1427259.39k
        
       | nayuki wrote:
       | It's an interesting set of reasons, but I prefer Keccak/SHA-3
       | over SHA-256, SHA-512, and BLAKE. I trust the standards body and
       | public competition and auditing that took place - more so than a
       | single author trumpeting the virtues of BLAKE.
        
       | tptacek wrote:
       | I'd probably use a Blake too. But:
       | 
       |  _SHA256 was based on SHA1 (which is weak). BLAKE was based on
       | ChaCha20, which was based on Salsa20 (which are both strong)._
       | 
        |  _NIST/NSA have repeatedly signaled lack of confidence in
       | SHA256: first by hastily organising the SHA3 contest in the
       | aftermath of Wang's break of SHA1_
       | 
        | No: SHA2 lacks the structure the SHA1 attack relies on (SHA1
       | has a linear message schedule, which made it possible to work out
       | a differential cryptanalysis attack on it).
       | 
       | Blake's own authors keep saying SHA2 is secure (modulo length
       | extension), but people keep writing stuff like this. Blake3 is a
       | good and interesting choice on the real merits! It doesn't need
       | the elbow throw.
        
         | ianopolous wrote:
         | Would be interesting to hear Zooko's response to this. (Peergos
         | lead here)
        
         | pclmulqdq wrote:
         | Most people who publicly opine on the Blake vs. SHA2 debate
         | seem to be relatively uninformed on the realities of each one.
         | SHA2 and the Blakes are both usually considered to be secure.
         | 
         | The performance arguments most people make are also outdated or
         | specious: the original comparisons of Blake vs SHA2 performance
         | on CPUs were largely done before Intel and AMD had special SHA2
         | instructions.
        
           | ianopolous wrote:
           | The author is one of the creators of blake3, Zooko.
        
             | tptacek wrote:
             | Sorry, I should have been more precise. JP Aumasson is
             | specifically who I'm thinking of; he's made the semi-
             | infamous claim that SHA2 won't be broken in his lifetime.
             | The subtext I gather is that there's just nothing on the
             | horizon that's going to get it. SHA1 we saw coming a ways
             | away!
        
               | zahllos wrote:
               | Quoting directly from https://nostarch.com/crypto-
               | dictionary under the entry SHA-2:
               | 
               | > Unlike SHA-1, SHA-2 algorithms aren't broken and are
               | unlikely to ever be.
               | 
               | There's also the fact NIST themselves deprecated SHA-1 in
               | 2011 (https://csrc.nist.gov/news/2017/research-results-
               | on-sha-1-co... not mentioned, but otherwise nice timeline
               | here: https://crypto.stackexchange.com/a/60655), yet
               | SHA-2 is still OK. Wiki has a table of cryptanalysis of
               | SHA-2: https://en.wikipedia.org/wiki/SHA-2#Cryptanalysis_
               | and_valida...
               | 
                | The summary is that either you attack a very reduced
                | round variant and get "practical" complexity for the
                | attack, or you attack an almost full-round variant and
                | the attack is entirely impractical.
               | 
               | So I think your interpretation of the subtext is entirely
               | correct.
        
         | honzaik wrote:
          | Also, the NSA is currently recommending replacing SHA3/Keccak
          | inside Dilithium and Kyber with SHA2-based primitives...
         | https://groups.google.com/a/list.nist.gov/g/pqc-forum/c/SPTp...
        
           | twiss wrote:
           | For those who didn't click the link, it should be noted that
           | they're suggesting this because it would be easier to deploy
           | (in places that have a SHA-2 implementation but not SHA-3),
           | not for reasons related to security or anything like that.
           | Looking at the responses, there's also some disagreement on
           | whether it would offer equal security for the particular use
           | case of ML-DSA and ML-KEM (as the final version of Dilithium
           | and Kyber standardized by NIST will be called).
        
             | hellcow wrote:
             | > they're suggesting this because it would be easier to
             | deploy (in places that have a SHA-2 implementation but not
             | SHA-3), not for reasons related to security
             | 
             | That's a bit absurd, right? Sure, the NSA didn't overtly
             | say, "we propose you use SHA-2 because we can break it."
             | That doesn't mean it's secure against them.
             | 
             | We can't look at their stated justification for supporting
             | one algorithm over another because the NSA lies. Their very
             | _purpose_ as an organization is to defeat encryption, and
             | one tactic is to encourage the industry to use something
             | they can defeat while reassuring people it's secure. We
             | need to look at their recommendations with a lot of
             | suspicion and assume an ulterior motive.
        
               | tptacek wrote:
               | The article uses inferred NSA preferences as
               | justification to avoid SHA2. Can't have it both ways.
        
         | pbsd wrote:
         | While there is more confidence now on the security of SHA-2, or
         | rather the lack of transference of the SHA-1 approach to SHA-2,
         | this was not the case in 2005-2006 when NIST decided to hold
         | the SHA-3 competition. See for example the report on Session 4
         | of the 2005 NIST workshop on hash functions [1].
         | 
         | [1] https://csrc.nist.gov/events/2005/first-cryptographic-
         | hash-w...
        
         | omginternets wrote:
         | What do you mean by "weak" and "strong", here?
        
           | nabla9 wrote:
           | _Weak_ means that a mathematical flaw has been discovered
           | that makes it inherently insecure, or that it is so simple
            | that modern computer technology makes it possible to crack
            | by "brute force". _Strong_ means that it's neither.
        
           | chx wrote:
            | There are fundamentally two kinds of attacks: preimage
            | attacks, which split into two, and collision attacks.
           | 
           | In a first-preimage attack, you know a hash value but not the
           | message that created it, and you want to discover any message
           | with the known hash value; in the second-preimage attack, you
           | have a message and you want to find a second message that has
           | the same hash. Attacks that can find one type of preimage can
           | often find the other as well. A weak algorithm allows this to
           | be done in less than 2^(hash length) attempts.
           | 
           | And then there is collision: two messages which produce the
           | same hash. A weak algorithm allows this to be done in less
           | than 2^(half of hash length) attempts.
           | 
           | Source: https://www.rfc-editor.org/rfc/rfc4270
        
       | RcouF1uZ4gsC wrote:
        | SHA-256 has the advantage that it is used for Bitcoin. It is the
        | biggest bug bounty of all time. There are literally billions
        | riding on the security of SHA-256.
        
       | aburan28 wrote:
        | There has been a mountain of cryptanalysis done on SHA256 with no
        | major breaks, compared to a much smaller amount of analysis on
        | blake3.
        
       | jrockway wrote:
       | Fast hash functions are really important, and SHA256 is really
       | slow. Switching the hash function where you can is enough to
        | result in user-visible speedups for common hashing use cases:
       | verifying build artifacts, seeing if on-disk files changed, etc.
       | I was writing something to produce OCI container images a few
       | months ago, and the 3x SHA256 required by the spec for layers
       | actually takes on the order of seconds. (.5s to sha256 a 50MB
       | file, on my 2019-era Threadripper!) I was shocked to discover
       | this. (gzip is also very slow, like shockingly slow, but
       | fortunately the OCI spec lets you use Zstd, which is
       | significantly faster.)
        
         | coppsilgold wrote:
          | sha256 is not slow on modern hardware. openssl doesn't have
          | blake3, but here is blake2:
          | 
          |     type          16 bytes    64 bytes    256 bytes    1024 bytes   8192 bytes   16384 bytes
          |     BLAKE2s256   75697.37k  308777.40k   479373.40k   567875.81k   592687.09k   591254.18k
          |     BLAKE2b512   63478.11k  243125.73k   671822.08k   922093.51k  1047833.51k  1048959.57k
          |     sha256      129376.82k  416316.32k  1041909.33k  1664480.49k  2018678.67k  2043838.46k
          | 
          | This is with the x86 sha256 instructions: sha256msg1,
          | sha256msg2, sha256rnds2.
        
           | dralley wrote:
           | "modern hardware" deserves some caveats. AMD has supported
           | those extensions since the original Zen, but Intel CPUs
           | generally lacked them until only about 2 years ago.
        
             | adrian_b wrote:
             | For many years, starting in 2016, Intel has supported
             | SHA-256 only in their Atom CPUs.
             | 
             | The reason seems to be that the Atom CPUs were compared in
             | Geekbench with ARM CPUs, and without hardware SHA the Intel
              | CPUs would have obtained worse benchmark scores.
             | 
             | In their big cores, SHA has been added in 2019, in Ice Lake
             | (while Comet Lake still lacked it, being a Skylake
             | derivative), and since then all newer Intel CPUs have it.
             | 
             | So except for the Intel Core CPUs, the x86 and ARM CPUs
             | have had hardware SHA for at least 7 years, while the Intel
             | Core CPUs have had it for the last 4 years.
        
         | richardwhiuk wrote:
          | If you want a fast hash function (and don't care about its
         | cryptographic properties), don't use a cryptographic hash
         | function.
        
           | dralley wrote:
           | BLAKE3 is actually competitive with non-cryptographic hashes
           | like crc32.
        
             | oconnor663 wrote:
              | To be fair, it _really_ depends on the platform. There's
             | an argument to be made that platforms where you care about
             | the difference are specifically the ones where BLAKE3 is
             | slower (no SIMD, no threads).
        
         | adrian_b wrote:
         | SHA256 is very fast on most modern CPUs, i.e. all AMD Zen, all
         | Intel Atom since 2016, Intel Core Ice Lake or newer, Armv8 and
         | Armv9.
         | 
         | I use every day both SHA-256 and BLAKE3. BLAKE3 is faster only
         | because it is computed by multiple threads using all available
         | CPU cores. When restricted to a single thread, it is slower on
         | CPUs with fast hardware SHA-256.
         | 
         | The extra speed of BLAKE3 is not always desirable. The fact
         | that it uses all cores can slow down other concurrent
         | activities, without decreasing the overall execution time of
         | the application.
         | 
         | There are cases when the computation of a hash like SHA-256 can
         | be done as a background concurrent activity, or when the speed
         | of hashing is limited by the streaming speed of data from the
         | main memory or from a SSD, so spawning multiple threads does
         | not gain anything and it only gets in the way of other
         | activities.
         | 
         | So the right choice between SHA-256 and BLAKE3 depends on the
         | application. Both can be useful. SHA-256 has the additional
         | advantage that it needs negligible additional code, only a few
         | lines are necessary to write a loop that computes the hash
         | using the hardware instructions.
        
       | ur-whale wrote:
       | One metric that is seldom mentioned for crypto algos is code
       | complexity.
       | 
       | I really wish researchers would at least pay lip service to it.
       | 
       | TEA (an unfortunately somewhat weak symmetric cipher) was a very
       | nice push in that direction.
       | 
       | TweetNaCl was another very nice push in that direction by djb
       | 
        | Why care about that metric, you ask?
        | 
        | Well, here are a couple of reasons:
        | 
        | - algo fits in head
        | - algo is short -> cryptanalysis likely easier
        | - algo is short -> less likely to have a buggy implementation
        | - algo is short -> side-channel attacks likely easier to
        |   analyse
        | - algo fits in a 100 line c++ header -> can be incorporated
        |   into anything
        | - algo can be printed on a t-shirt, thereby skirting export
        |   control restrictions
        | - algo can easily be implemented on tiny micro-controllers
        | 
        | etc ...
        
         | oconnor663 wrote:
         | We put a lot of effort into section 5.1.2 of https://github.com
         | /BLAKE3-team/BLAKE3-specs/blob/master/blak..., and the
         | complicated part of BLAKE3 (incrementally building the Merkle
         | tree) ends up being ~4 lines of code. Let me know what you
         | think.
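          | 
          | For the curious, a rough sketch of what that loop looks like
          | (paraphrased; parent_cv stands in for the parent-node
          | compression, which is omitted here):
          | 
          |     fn parent_cv(_left: &[u8; 32], _right: &[u8; 32]) -> [u8; 32] {
          |         unimplemented!("parent-node compression, omitted")
          |     }
          |     
          |     // After finishing a chunk, merge completed subtrees while
          |     // the total chunk count is even, then push the result.
          |     // The stack depth stays logarithmic in the input length.
          |     fn add_chunk_cv(stack: &mut Vec<[u8; 32]>, mut cv: [u8; 32],
          |                     mut total_chunks: u64) {
          |         while total_chunks & 1 == 0 {
          |             let left = stack.pop().expect("left subtree present");
          |             cv = parent_cv(&left, &cv);
          |             total_chunks >>= 1;
          |         }
          |         stack.push(cv);
          |     }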
        
         | rstuart4133 wrote:
         | > One metric that is seldom mentioned for crypto algos is code
         | complexity. ... TEA (an unfortunately somewhat weak symmetric
         | cipher) was a very nice push in that direction.
         | 
          | Speck is also a push in that direction [0]. Its code looks to
          | be as complex as TEA's (half a page of C), it is blindingly
          | fast, yet as far as I know it has no known attacks despite
          | being subject to a fair bit of scrutiny. About the only
          | reason I can see for it being largely ignored is that it was
          | designed by the NSA.
         | 
         | SHA3 is also a simple algorithm. Downright pretty, in fact.
         | It's a pity it's so slow.
         | 
         | [0] https://en.wikipedia.org/wiki/Speck_(cipher)
        
       | colmmacc wrote:
       | It's very hard to see Blake3 getting included in FIPS. Meanwhile,
       | SHA256 is. That's probably the biggest deciding factor on whether
       | you want to use it or not.
        
       | Retr0id wrote:
       | Blake3 is a clear winner for large inputs.
       | 
       | However, for smaller inputs (~1024 bytes and down), the
       | performance gap between it and everything else (blake2, sha256)
       | gets much narrower, because you don't get to benefit from the
       | structural parallelization.
       | 
       | If you're mostly dealing with small inputs, raw hash throughput
       | is _probably_ not high on your list of concerns - In the context
       | of a protocol or application, other costs like IO latency
       | probably completely dwarf the actual CPU time spent hashing.
       | 
       | If raw performance is no longer high on your list of priorities,
       | you care more about the other things - ubiquitous and battle-
       | tested library support (blake3 is still pretty bleeding-edge, in
       | the grand scheme of things), FIPS compliance (sha256), greater
       | on-paper security margin (blake2). Which is all to say, while
       | blake3 _is_ great, there are still plenty of reasons not to
       | prefer it for a particular use-case.
        
       | LegibleCrimson wrote:
       | How does the extended output work, and what's the point of
       | extended output?
       | 
       | From what I can see, BLAKE3 has 256 bits of security, and
       | extended output doesn't provide any extra security. In this case,
       | what's the point of extended output over doing something like
       | padding with 0-bits or extending by re-hashing the previous
       | output and appending it to the previous output (eg, for 1024
       | bits, doing h(m) . h(h(m)) . h(h(h(m))) . h(h(h(h(m))))). Either
       | way, you get 256 bits of security.
       | 
       | Is it just because the design of the hash makes it simple to do,
       | so it's just offered as a consistent option for arbitrary output
       | sizes where needed, or is there some greater purpose that I'm
       | missing?
        
         | oconnor663 wrote:
         | > From what I can see, BLAKE3 has 256 bits of security, and
         | extended output doesn't provide any extra security.
         | 
         | 128 bits of collision resistance but otherwise correct. As a
         | result of that we usually just call it 128 bits across the
         | board, but yes in an HMAC-like use case you would generally
         | expect 256 bits of security from the 256 bit output. Extended
         | outputs don't change that, because the internal chaining values
         | are 256 bits even when the output is larger.
         | 
         | > extending by re-hashing the previous output and appending it
         | to the previous output
         | 
         | It's not quite that simple, because you don't want later parts
         | of your output to be predictable from earlier parts (which
         | might be published, depending on the use case). You also want
         | it to be parallelizable.
         | 
         | You could compute H(m) as a "pre-hash" and then make an
         | extended output something like H(H(m)|1)|H(H(m)|2)|... That's
          | basically what BLAKE3 is doing on the inside. The advantage of
         | having the algorithm do it for you is that 1) it's an "off the
         | shelf" feature that doesn't require users to roll their own
         | crypto and 2) it's slightly faster when the input is short,
         | because you don't have to spend an extra block operation
         | computing the pre-hash.
         | 
         | > what's the point of extended output?
         | 
         | It's kind of niche, but for example Ed25519 needs a 512 bit
         | hash output internally to "stretch" its secret seed into two
         | 256-bit keys. You could also use a BLAKE3 output reader as a
         | stream cipher or a random byte generator. (These sorts of use
         | cases are why it's nice not to make the caller tell you the
         | output length in advance.)
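          | 
          | A short sketch of the output-reader API from the Rust crate,
          | e.g. for the Ed25519-style "stretch a seed into 512 bits"
          | case mentioned above:
          | 
          |     use blake3::Hasher;
          |     
          |     fn main() {
          |         let mut hasher = Hasher::new();
          |         hasher.update(b"secret seed");
          |     
          |         // Extended output is read lazily, so the caller never
          |         // has to declare a length up front.
          |         let mut reader = hasher.finalize_xof();
          |         let mut okm = [0u8; 64]; // e.g. two 256-bit keys
          |         reader.fill(&mut okm);
          |     }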
        
       | 15155 wrote:
       | Keccak is my preference. Keccak is substantially easier to
       | implement in hardware: fewer operations and no carry propagation
       | delay issue because there's no addition.
        
       ___________________________________________________________________
       (page generated 2023-11-13 23:00 UTC)