[HN Gopher] Using Bloom filters to efficiently synchronise hash ...
___________________________________________________________________

Using Bloom filters to efficiently synchronise hash graphs

Author : ignoramous
Score  : 64 points
Date   : 2020-12-02 16:39 UTC (1 day ago)

(HTM) web link (martin.kleppmann.com)
(TXT) w3m dump (martin.kleppmann.com)

| wiml wrote:
| This is nice. How does it compare to something like a
| polynomial-based representation like in [1] (used by the old
| keyserver gossip network)? I find Bloom filters more intuitive to
| reason about, but [1] does mention them as prior work and seems
| to think its scheme outperforms them.
|
| [1] http://dx.doi.org/10.1109/TIT.2003.815784
| http://ipsit.bu.edu/documents/ieee-it3-web.pdf

| thomasahle wrote:
| I think the approaches based on invertible Bloom filters are
| generally more efficient.
|
| https://dl.acm.org/doi/abs/10.1145/2043164.2018462 (What's the
| difference?: efficient set reconciliation without prior context)
| 2011
|
| https://link.springer.com/article/10.1007/s00446-017-0316-0
| (Simple Multi-Party Set Reconciliation) 2015
|
| They both briefly mention the Minsky paper.

| nullc wrote:
| > I think the approaches based on invertible Bloom filters are
| > generally more efficient.
|
| More computationally efficient for sure, but their
| communications efficiency is only good asymptotically.
|
| For "small" cases, where the set difference has a few hundred or
| a few thousand entries and especially where the set members are
| small (e.g. <=64-bit hashes), there is a pretty large constant
| factor needed to get a low failure rate from invertible Bloom
| filters.
|
| E.g. for a set difference of size 128 we measured that you need
| 32x communications overhead to get a 1-in-10,000 failure rate,
| and 8x overhead for 2048 items. The issue is that IBLT
| decodability depends on the non-existence of cycles in a random
| graph, and this only holds for sufficiently large graphs.
|
| For applications (such as Bitcoin transaction relay or -- I
| think -- SCM commits) which frequently synchronize a small
| number of items, these communications overheads take that
| approach entirely out of the running. Fortunately, at these
| sizes the quadratic computation cost of the communications-
| efficient approaches is not a big deal.
|
| Interestingly, it is completely unambiguous that IBLT and
| dense-code-based reconciliation can be combined (sort of the
| dual of Raptor codes, which combine sparse fountain codes and
| dense RS-like codes for block erasure recovery) -- but it's an
| open question whether the combination can capture the advantages
| of both approaches (linear-time decode and communications
| efficiency) for set reconciliation, as is the case for block
| erasure codes.

| nullc wrote:
| It's possible to be dramatically more communication-efficient
| than Bloom filters when synchronizing sets.
|
| Bloom filters require O(my_set_size) bandwidth, with a good
| constant factor. This work optimizes that somewhat by excluding
| members that existed at the time of the last synchronization,
| but it still wastes bandwidth when those commits came in from
| another source.
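|
| (To make the cost concrete, here's a toy of the filter side --
| purely illustrative, with made-up parameters and a stand-in hash
| mixer, not the post's actual code. The filter is a fixed
| FILTER_BITS/8 bytes on the wire however few commits actually
| differ, and a false positive means the responder silently skips
| a commit you're missing:)
|
|   /* Toy Bloom-filter sync step (illustrative only).  A sends
|    * `filter` built from its commit hashes; B replies with every
|    * hash of its own that does not test positive. */
|   #include <stdint.h>
|   #include <stdio.h>
|
|   #define FILTER_BITS 10240 /* ~10 bits/entry for ~1024 entries */
|   #define NUM_HASHES  7
|
|   /* splitmix64-style finalizer standing in for k independent
|    * hash functions (an assumption of this sketch). */
|   static uint64_t mix(uint64_t x, uint64_t seed) {
|       x += (seed + 1) * 0x9e3779b97f4a7c15ULL;
|       x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
|       x ^= x >> 27; x *= 0x94d049bb133111ebULL;
|       return x ^ (x >> 31);
|   }
|
|   static void bloom_add(uint8_t *f, uint64_t item) {
|       for (int i = 0; i < NUM_HASHES; ++i) {
|           uint64_t b = mix(item, i) % FILTER_BITS;
|           f[b >> 3] |= (uint8_t)(1u << (b & 7));
|       }
|   }
|
|   static int bloom_maybe_has(const uint8_t *f, uint64_t item) {
|       for (int i = 0; i < NUM_HASHES; ++i) {
|           uint64_t b = mix(item, i) % FILTER_BITS;
|           if (!(f[b >> 3] & (1u << (b & 7))))
|               return 0; /* definitely not in A's set */
|       }
|       return 1; /* probably in A's set -- may be a false positive */
|   }
|
|   int main(void) {
|       static uint8_t filter[FILTER_BITS / 8]; /* what A sends */
|       for (uint64_t h = 1; h <= 1000; ++h) bloom_add(filter, h);
|       /* B: reply with anything that doesn't appear to be in A's
|        * set -- here 1001..1005, modulo false positives. */
|       for (uint64_t h = 995; h <= 1005; ++h)
|           if (!bloom_maybe_has(filter, h))
|               printf("send %llu\n", (unsigned long long)h);
|       return 0;
|   }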
|
| It's possible to instead realize
| O(symmetric_difference_between_sets) when synchronizing a pair.
| Applied to this kind of usage, the bandwidth required would be
| reduced by whatever redundancy factor you have (e.g. the number
| of different repos you'd learn the same commit from).
|
| This can be accomplished via error-correcting codes; for a
| high-performance industrial implementation I contributed to,
| see: https://github.com/sipa/minisketch/ (I tried submitting
| this to HN a while back, but no joy).
|
| For our paper on applying this to Bitcoin transaction relay to
| render rumouring bandwidth largely independent of peer count,
| see:
|
| https://arxiv.org/abs/1905.10518
|
| [Similar to the motivations of the post, we consider immunity to
| attackers to be critical -- and most prior efficient gossip work
| is extremely non-robust.]
|
| For the specific case of synchronizing the blockchain itself,
| these fancy techniques aren't needed because there is only a
| single best-chain 'list' -- there is only one relevant tip --
| and it can be efficiently synchronized with bisection. Bitcoin
| sends up to 100 hashes at once (though usually more like 30),
| the first few dense around the best tip and the rest
| exponentially spaced, to extremely efficiently find the highest
| point of commonality.

| dochtman wrote:
| This sounds very interesting, so I tried reading some of the
| math in
| https://github.com/sipa/minisketch/blob/master/doc/math.md (I'm
| not great at this kind of math).
|
| In my understanding, a Bloom filter, while being O(set_size),
| nonetheless has a small multiplier for the linear factor (in the
| OP I think they mentioned using 10 bits per entry). As I read
| the PinSketch setup, it looks like it would actually need the
| full bit size of the set members, so for SHA-256 you need the
| full 256 bits per set member. So while PinSketch would only need
| O(symmetric_difference_between_sets), the element-size factor is
| ~25 times larger?
|
| And also, in the example from the minisketch readme, it looks
| like A still needs to send the full sketch (not just the
| symdiff) in the first message, right?
|
| In other words, there are a bunch of trade-offs here that might
| very much depend on your application?

| nullc wrote:
| You can make the hashes as small as you like -- subject to the
| limitation that you'll get false positives if you make them too
| small -- the same problem that small hashes have in Bloom
| filters.
|
| Bloom filters also have poor communications efficiency for small
| numbers of entries: you have to oversize the filter a lot. This
| can be ameliorated by entropy-coding the oversized Bloom filter,
| but once you're going that far you're better off using a
| Golomb-coded set or something more efficient.
|
| I think the only case where an approximate set-filter approach
| will be more efficient is where you want to communicate a small
| set one way to someone whose set is _much_ larger. The issue
| there is that the sketch needs one entry per item it removes,
| while a Bloom filter can be smaller (assuming the difference is
| big enough to overcome the Bloom overheads).
|
| Information-theoretically it should be possible to decode a
| minisketch 'beyond the limit' using knowledge of the specific
| items in the local set. ... but the only algorithm I know for
| this takes exponential time (brute-force remove items from your
| set and retry the reconstruction). This problem is
| mathematically pretty close to list-input decoding an
| error-correcting code, so efficient algorithms may be possible
| -- but so far there hasn't been any research to that end.
|
| [You can, of course, combine a filter with efficient
| reconciliation in cases where large set-size differences are
| expected. The filter's FP rate just needs to be low enough to
| get the receiver's set size close to the sender's.]
|
| > it looks like A still needs to send the full sketch
|
| The sketch is exactly the same size as the symdiff it can
| recover.
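|
| (Condensed from the example in the minisketch README -- quoted
| from memory, so check the repo for the exact API. With 12-bit
| elements and capacity for 4 differences, the serialized sketch
| is 6 bytes regardless of set size, and decoding returns exactly
| the elements that are on one side but not the other:)
|
|   #include <assert.h>
|   #include <stdint.h>
|   #include <stdio.h>
|   #include <stdlib.h>
|   #include <minisketch.h>
|
|   int main(void) {
|       /* Alice: 12-bit elements, capacity to recover 4 diffs. */
|       minisketch *sketch_a = minisketch_create(12, 0, 4);
|       for (int i = 3000; i < 3010; ++i)
|           minisketch_add_uint64(sketch_a, i);
|
|       size_t sersize = minisketch_serialized_size(sketch_a);
|       assert(sersize == 12 * 4 / 8); /* 6 bytes: capacity-sized */
|       unsigned char *buffer_a = malloc(sersize);
|       minisketch_serialize(sketch_a, buffer_a); /* send to Bob */
|       minisketch_destroy(sketch_a);
|
|       /* Bob: build his own sketch and merge in Alice's. */
|       minisketch *sketch_b = minisketch_create(12, 0, 4);
|       for (int i = 3002; i < 3012; ++i)
|           minisketch_add_uint64(sketch_b, i);
|       sketch_a = minisketch_create(12, 0, 4);
|       minisketch_deserialize(sketch_a, buffer_a);
|       free(buffer_a);
|       minisketch_merge(sketch_b, sketch_a); /* = the symdiff */
|
|       uint64_t differences[4];
|       ssize_t num = minisketch_decode(sketch_b, 4, differences);
|       if (num < 0) {
|           printf("more than 4 differences!\n");
|       } else {
|           /* prints 3000, 3001, 3010, 3011 in some order */
|           for (ssize_t i = 0; i < num; ++i)
|               printf("%llu is in only one set\n",
|                      (unsigned long long)differences[i]);
|       }
|       minisketch_destroy(sketch_a);
|       minisketch_destroy(sketch_b);
|       return 0;
|   }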
|
| An interesting challenge is that you likely don't know in
| advance how big the symdiff is... If you're too conservative you
| waste bandwidth, but never more bandwidth than you'd use sending
| the whole set. You can reduce your waste at the expense of
| round-trips, in the obvious manner: fetch a little more of the
| sketch each time until you have enough.
|
| In the Erlay paper (the Bitcoin application of this stuff), we
| give a scheme to bisect when the difference is bigger than
| expected, without wasting much bandwidth. The bisection loses a
| little communications efficiency but avoids the quadratic
| computation cost, which becomes problematic for big set
| differences.
___________________________________________________________________
(page generated 2020-12-03 23:00 UTC)