[HN Gopher] Using Bloom filters to efficiently synchronise hash ...
       ___________________________________________________________________
        
       Using Bloom filters to efficiently synchronise hash graphs
        
       Author : ignoramous
       Score  : 64 points
       Date   : 2020-12-02 16:39 UTC (1 day ago)
        
 (HTM) web link (martin.kleppmann.com)
 (TXT) w3m dump (martin.kleppmann.com)
        
       | wiml wrote:
       | This is nice. How does it compare to something like a polynomial
       | based representation like in [1] (used by the old keyserver
       | gossip network)? I find Bloom filters more intuitive to reason
       | about, but [1] does mention them as prior work and seems to think
       | its scheme outperforms them.
       | 
       | [1] http://dx.doi.org/10.1109/TIT.2003.815784
       | http://ipsit.bu.edu/documents/ieee-it3-web.pdf
        
         | thomasahle wrote:
         | I think the approaches based on invertible bloom filters are
         | generally more efficient.
         | 
         | https://dl.acm.org/doi/abs/10.1145/2043164.2018462 (What's the
         | difference?: efficient set reconciliation without prior
         | context) 2011
         | 
         | https://link.springer.com/article/10.1007/s00446-017-0316-0
         | (Simple Multi-Party Set Reconciliation) 2015
         | 
         | They both briefly mention the Minsky paper.
        
           | nullc wrote:
           | > I think the approaches based on invertible bloom filters
           | are generally more efficient.
           | 
           | More computationally efficient for sure, but their
           | communications efficiency is only good asymptotically.
           | 
           | For "small" cases such as where the set difference has a few
           | hundred or thousand entries and especially where the set
           | members are small (e.g. <=64-bit hashes) there is a pretty
           | large constant factor needed to get a low failure rate from
           | invertible bloom filters.
           | 
            | E.g. for a set difference of size 128 we measured that you
            | need 32x communications overhead to get a 1 in 10,000 failure
            | rate, and 8x overhead for 2048 items. The issue is that IBLT
            | decodability depends on the non-existence of cycles in a
            | random graph, and this only holds for sufficiently large
            | graphs.
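            | 
            | A toy peeling decoder makes that failure mode concrete
            | (illustrative only -- fixed table size, three cells per
            | item, simplified hashing; not from any production codebase):
            | 
            |   #include <stdint.h>
            |   #include <stdio.h>
            |   #include <string.h>
            |   
            |   #define CELLS 48  /* generously oversized vs. the diff */
            |   
            |   typedef struct {
            |       int32_t count;     /* +1 per A item, -1 per B item */
            |       uint64_t keysum;   /* XOR of keys in this cell */
            |       uint64_t checksum; /* XOR of per-key checksums */
            |   } Cell;
            |   
            |   static uint64_t mix(uint64_t x, uint64_t salt) {
            |       x += salt; x ^= x >> 33;
            |       x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
            |       x *= 0xc4ceb9fe1a85ec53ULL; x ^= x >> 33;
            |       return x;
            |   }
            |   
            |   static void toggle(Cell *t, uint64_t key, int s) {
            |       for (uint64_t i = 0; i < 3; i++) {
            |           Cell *c = &t[mix(key, i) % CELLS];
            |           c->count += s;
            |           c->keysum ^= key;
            |           c->checksum ^= mix(key, 99);
            |       }
            |   }
            |   
            |   /* Peel "pure" cells (exactly one item left) until
            |    * stuck.  If the leftover items form a cycle, no cell
            |    * ever becomes pure again and decoding fails. */
            |   static int peel(Cell *t) {
            |       for (int again = 1; again;) {
            |           again = 0;
            |           for (size_t i = 0; i < CELLS; i++) {
            |               int c = t[i].count;
            |               if ((c == 1 || c == -1) &&
            |                   t[i].checksum == mix(t[i].keysum, 99)) {
            |                   printf("%016llx only on side %c\n",
            |                       (unsigned long long)t[i].keysum,
            |                       c == 1 ? 'A' : 'B');
            |                   toggle(t, t[i].keysum, -c);
            |                   again = 1;
            |               }
            |           }
            |       }
            |       for (size_t i = 0; i < CELLS; i++)
            |           if (t[i].count) return 0;  /* cycle survived */
            |       return 1;
            |   }
            |   
            |   int main(void) {
            |       Cell t[CELLS]; memset(t, 0, sizeof t);
            |       /* A adds, B subtracts; shared items cancel, so the
            |        * table retains only the symmetric difference. */
            |       for (uint64_t k = 1; k <= 20; k++)
            |           toggle(t, mix(k, 7), +1);
            |       for (uint64_t k = 5; k <= 24; k++)
            |           toggle(t, mix(k, 7), -1);
            |       puts(peel(t) ? "decoded" : "failed (cycle)");
            |       return 0;
            |   }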
           | 
            | For applications (such as Bitcoin transaction relay or -- I
            | think -- SCM commits) which frequently synchronize a small
            | number of items, these communications overheads take that
            | approach entirely out of the running. Fortunately, at these
            | sizes the quadratic computation cost of communications-
            | efficient approaches is not a big deal.
           | 
            | Interestingly, it is completely unambiguous that IBLT and
            | dense-code-based reconciliation can be combined (sort of the
           | dual to Raptor codes, which combine sparse fountain codes and
           | dense RS-like codes for block erasure recovery)-- but it's an
           | open question if the combination can capture the advantages
           | of both approaches (linear time decode and communications
           | efficiency) for set reconciliation as is the case for block
           | erasure codes.
        
       | nullc wrote:
        | It's possible to be dramatically more communication-efficient
        | than Bloom filters when synchronizing sets.
       | 
       | Bloom filters require O(my_set_size) bandwidth, with a good
       | constant factor. This work optimizes it somewhat by excluding
       | members that existed at the time of last synchronization, but it
       | still wastes bandwidth when those commits came in from another
       | source.
       | 
       | It's possible to instead realize
       | O(symmetric_difference_between_sets) when synchronizing a pair.
       | Applied to this kind of usage the bandwidth required would be
       | reduced by whatever redundancy factor (e.g. number of different
       | repos you'd learn the same commit from).
       | 
        | This can be accomplished via error-correcting codes; for a
        | high-performance industrial implementation I contributed to,
        | see: https://github.com/sipa/minisketch/ (I tried submitting
        | this to HN a while back, but no joy--).
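        | 
        | A round trip looks roughly like this (loosely following the
        | example in the minisketch README; the 12-bit field and
        | capacity-4 sketch are just example parameters):
        | 
        |   #include <stdint.h>
        |   #include <stdio.h>
        |   #include <minisketch.h>
        |   
        |   int main(void) {
        |       /* Alice: 12-bit elements, room for 4 differences. */
        |       minisketch *a = minisketch_create(12, 0, 4);
        |       for (int i = 3000; i < 3010; ++i)
        |           minisketch_add_uint64(a, i);
        |       unsigned char buf[6];  /* 4 * 12 bits = 6 bytes */
        |       minisketch_serialize(a, buf);
        |       minisketch_destroy(a);
        |       /* ...buf (6 bytes!) is all Alice has to send... */
        |   
        |       /* Bob: his own sketch, with Alice's merged in. */
        |       minisketch *b = minisketch_create(12, 0, 4);
        |       for (int i = 3002; i < 3012; ++i)
        |           minisketch_add_uint64(b, i);
        |       a = minisketch_create(12, 0, 4);
        |       minisketch_deserialize(a, buf);
        |       minisketch_merge(b, a);
        |   
        |       /* Recovers the symdiff {3000, 3001, 3010, 3011}. */
        |       uint64_t diff[4];
        |       ssize_t n = minisketch_decode(b, 4, diff);
        |       for (ssize_t i = 0; i < n; ++i)
        |           printf("%llu is in only one of the sets\n",
        |                  (unsigned long long)diff[i]);
        |       minisketch_destroy(a);
        |       minisketch_destroy(b);
        |       return 0;
        |   }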
       | 
       | For our paper on applying this to Bitcoin transaction relay to
       | render rumouring bandwidth largely independent of peer count:
       | 
       | https://arxiv.org/abs/1905.10518
       | 
       | [Similar to the motivations of the post, we consider immunity to
       | attackers to be critical-- and most prior efficient gossip work
       | is extremely non-robust.]
       | 
       | For the specific case of synchronizing the blockchain itself,
       | these fancy techniques aren't needed because there is only a
       | single best chain 'list'-- there is only one relevant tip-- and
        | it can be efficiently synchronized with bisection. Bitcoin sends
        | up to 100 hashes at once (though usually more like 30), the first
        | few dense around the best tip and the rest exponentially spaced,
        | to find the highest point of commonality extremely efficiently.
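        | 
        | The spacing is roughly the following (a simplified sketch of
        | Bitcoin's "block locator"; the real implementation differs in
        | details):
        | 
        |   #include <stdio.h>
        |   #include <stddef.h>
        |   
        |   /* Heights to include, given our best tip: step back by 1
        |    * for ~10 entries, then double the step each time, and
        |    * always end with the genesis block at height 0. */
        |   static size_t locator(long tip, long *out, size_t max) {
        |       size_t n = 0;
        |       long step = 1;
        |       for (long h = tip; h > 0 && n + 1 < max; h -= step) {
        |           out[n++] = h;
        |           if (n >= 10) step *= 2;  /* go exponential */
        |       }
        |       out[n++] = 0;                /* genesis */
        |       return n;
        |   }
        |   
        |   int main(void) {
        |       long heights[101];
        |       size_t n = locator(650000, heights, 101);
        |       /* ~30 entries for a 650k-block chain */
        |       for (size_t i = 0; i < n; i++)
        |           printf("%ld\n", heights[i]);
        |       return 0;
        |   }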
        
         | dochtman wrote:
         | This sounds very interesting, so I tried reading some of the
         | math in
          | https://github.com/sipa/minisketch/blob/master/doc/math.md (I'm
          | not great at this kind of math).
         | 
         | In my understanding, a Bloom filter, while being O(set_size),
         | nonetheless has a small multiplier for the linear factor (in
         | the OP I think they mentioned using 10 bits). As I read the
         | PinSketch setup, it looks like it would actually need the full
         | bit size of the set members, so for SHA-256 you need the full
         | 256 bits per set member. So while PinSketch would only need
         | O(symmetric_difference_between_sets), the element size factor
         | is ~25 times larger?
         | 
          | And also, in the example from the minisketch readme, it looks
          | like A still needs to send the full sketch (not just the
          | symdiff) in the first message, right?
         | 
         | In other words, there are a bunch of trade-offs here that might
         | very much depend on your application?
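          | 
          | A back-of-the-envelope comparison (my own numbers, just to
          | guess where the crossover sits):
          | 
          |   #include <stdio.h>
          |   
          |   int main(void) {
          |       double n = 100000;      /* my set size */
          |       double bloom = 10 * n;  /* ~10 bits/element */
          |       /* one full 256-bit entry per difference */
          |       for (double d = 100; d <= 100000; d *= 10)
          |           printf("symdiff %6.0f: bloom %7.0f bits,"
          |                  " sketch %8.0f bits\n",
          |                  d, bloom, 256 * d);
          |       return 0;
          |   }
          | 
          | which would break even around symdiff = n/25.6, i.e. once the
          | sets differ by more than ~4%?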
        
           | nullc wrote:
            | You can make the hashes as small as you like-- subject to
            | the limitation that you'll get false positives if you make
            | them too small-- the same problem that small hashes have in
            | Bloom filters.
           | 
            | Bloom filters also have poor communications efficiency for
            | small numbers of entries: you have to oversize the filter a
            | lot. This can be ameliorated by entropy-coding the oversized
            | Bloom filter, but once you're going that far you're better
            | off using a Golomb-coded set or something more efficient.
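            | 
            | (For concreteness, a Golomb-coded set stores the deltas
            | between sorted hashes with Golomb-Rice codes. A toy encoder,
            | printing bits as characters, with an arbitrary Rice
            | parameter:)
            | 
            |   #include <stdint.h>
            |   #include <stdio.h>
            |   
            |   /* Rice code: quotient in unary, then k-bit remainder.
            |    * A real GCS (e.g. BIP 158's block filters) hashes
            |    * items into a small range, sorts them, and encodes
            |    * the successive deltas like this. */
            |   static void rice(uint64_t delta, unsigned k) {
            |       for (uint64_t q = delta >> k; q > 0; q--)
            |           putchar('1');          /* unary quotient */
            |       putchar('0');              /* stop bit */
            |       for (int b = (int)k - 1; b >= 0; b--)
            |           putchar('0' + (int)((delta >> b) & 1));
            |   }
            |   
            |   int main(void) {
            |       /* pretend: items already hashed and sorted */
            |       uint64_t v[] = {190, 372, 1041, 1802, 2003};
            |       uint64_t prev = 0;
            |       for (int i = 0; i < 5; i++) {
            |           rice(v[i] - prev, 8);  /* small deltas */
            |           putchar(' ');
            |           prev = v[i];
            |       }
            |       putchar('\n');
            |       return 0;
            |   }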
           | 
            | I think the only case where an approximate set-filter
            | approach will be more efficient is where you want to
            | communicate a small set one way to someone whose set is
            | _much_ larger. The issue there is that the sketch needs one
            | entry per item it removes, while a Bloom filter can be
            | smaller (assuming the difference is big enough to overcome
            | the Bloom overheads).
           | 
            | Information-theoretically it should be possible to decode a
            | minisketch 'beyond the limit' using knowledge of the
            | specific items in the local set... but the only algorithm I
            | know for this takes exponential time (brute-force remove
            | items from your set and retry the reconstruction). This
            | problem is mathematically pretty close to list-input
            | decoding an error-correcting code, so efficient algorithms
            | may be possible-- but so far there hasn't been any research
            | to that end.
           | 
            | [You can, of course, combine a filter with efficient
            | reconciliation in cases where a set size difference is
            | expected. The filter's FP rate just needs to be low enough
            | to get the receiver's set size close to the sender's.]
           | 
            | > it looks like A still needs to send the full sketch
           | 
           | The sketch is exactly the same size as the symdiff it can
           | recover.
           | 
            | An interesting challenge is that you likely don't know in
            | advance how big the symdiff is... If you're too conservative
            | you waste bandwidth, but never more bandwidth than you'd use
            | sending the whole set. You can reduce your waste at the
            | expense of round-trips, in an obvious manner: fetching a
            | little more of the sketch each time until you have enough.
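            | 
            | In code the loop is just the following -- fetch_syndromes
            | and try_decode are hypothetical stand-ins here, not the
            | minisketch API:
            | 
            |   #include <stdbool.h>
            |   #include <stdio.h>
            |   #include <stddef.h>
            |   
            |   /* Stubs: each syndrome of a BCH-style sketch recovers
            |    * one more difference, and syndromes already sent
            |    * never need to be resent. */
            |   static size_t sent = 0;
            |   static void fetch_syndromes(size_t upto) {
            |       printf("fetch syndromes %zu..%zu\n", sent, upto - 1);
            |       sent = upto;
            |   }
            |   static bool try_decode(size_t cap) {
            |       return cap >= 13;  /* pretend true symdiff is 13 */
            |   }
            |   
            |   int main(void) {
            |       size_t cap = 8;  /* initial symdiff guess */
            |       for (;;) {
            |           fetch_syndromes(cap);  /* only the new part */
            |           if (try_decode(cap)) break;
            |           cap *= 2;  /* guessed low: one more round */
            |       }
            |       printf("done after %zu syndromes\n", sent);
            |       return 0;
            |   }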
           | 
            | In the Erlay paper (the Bitcoin application of this stuff),
            | we give a scheme to bisect when the difference is bigger than
           | expected without wasting much bandwidth. The bisection loses
           | a little communications efficiency but avoids quadratic
           | computation costs which become problematic for big set
           | differences.
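            | 
            | (The piece that makes bisection cheap, assuming sketches
            | whose serializations combine by XOR, as BCH syndromes do --
            | the bytes below are placeholders, not real sketches:)
            | 
            |   #include <stdio.h>
            |   #include <stddef.h>
            |   
            |   int main(void) {
            |       /* full-set sketch the peer already sent */
            |       unsigned char full[6]  = {0x1f, 0x80, 0x33,
            |                                 0x0a, 0x91, 0x44};
            |       /* follow-up sketch of just half its set */
            |       unsigned char half1[6] = {0x07, 0x22, 0x30,
            |                                 0x0a, 0x10, 0x04};
            |       unsigned char half2[6];
            |       /* the other half's sketch then comes for free */
            |       for (size_t i = 0; i < 6; i++)
            |           half2[i] = full[i] ^ half1[i];
            |       for (size_t i = 0; i < 6; i++)
            |           printf("%02x", half2[i]);
            |       putchar('\n');
            |       return 0;
            |   }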
        
       ___________________________________________________________________
       (page generated 2020-12-03 23:00 UTC)