DESIGN - dedup - deduplicating backup program
 (HTM) git clone git://bitreich.org/dedup/ git://enlrupgkhuxnvlhsf6lc3fziv5h2hhfrinws65d7roiv6bfj7d652fid.onion/dedup/
Design notes
============

There are three main abstractions in the design of dedup:

  - The chunker interface
  - The snapshot layer
  - The block layer

The block layer
---------------

To the outside world, the block layer is just an abstraction for
dealing with variable-length blocks.  All blocks are referenced by
their hash.

The block layer is arranged into a stack of layers.  From top to
bottom these are as follows:

  - The generic layer
  - The compression layer
  - The encryption layer
  - The storage layer

The generic layer is the one that client code interfaces with.  It is
the top-level entry point to the block layer.  The generic layer
calculates the hash of the block and passes the block down to the
compression layer.

The compression layer prepends a compression descriptor to the
block and then compresses the block using snappy or lz4.  Compression
can be disabled, in which case a special descriptor is prepended and
the data is passed uncompressed to the encryption layer.

The encryption layer prepends an encryption descriptor to the block
and then encrypts and authenticates the block using XChaCha20 and
Poly1305.  Encryption can be disabled, in which case the layer acts
as a bypass with a special type of encryption descriptor.  The block
is then passed to the storage layer.

The storage layer prepends a storage descriptor and appends the
descriptor and the data to a single backing file.

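The stacking described above can be sketched as follows.  This is a
minimal illustration, not dedup's actual code: the one-byte
descriptors, their values and the names put_block and prepend are
hypothetical, hashing is omitted, and the compression and encryption
layers are shown only in their bypass modes so that the example needs
no external libraries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical one-byte descriptors; the real on-disk formats are
 * not described in these notes. */
enum { COMP_NONE = 0x00, ENC_NONE = 0x00, STORE_DESC = 0x01 };

/* Prepend a one-byte descriptor to buf[0..len) in place.
 * The caller guarantees room for one extra byte. */
static size_t
prepend(uint8_t *buf, size_t len, uint8_t desc)
{
	memmove(buf + 1, buf, len);
	buf[0] = desc;
	return len + 1;
}

/* Compression layer in bypass mode: descriptor only, data untouched. */
static size_t
compress_layer(uint8_t *buf, size_t len)
{
	return prepend(buf, len, COMP_NONE);
}

/* Encryption layer in bypass mode. */
static size_t
encrypt_layer(uint8_t *buf, size_t len)
{
	return prepend(buf, len, ENC_NONE);
}

/* Storage layer: prepend its descriptor and append the record to the
 * backing store (a memory buffer stands in for the backing file). */
static size_t
storage_layer(uint8_t *buf, size_t len, uint8_t *store, size_t *used,
    size_t cap)
{
	len = prepend(buf, len, STORE_DESC);
	assert(*used + len <= cap);
	memcpy(store + *used, buf, len);
	*used += len;
	return len;
}

/* Generic layer entry point: pushes one block down the stack and
 * returns the number of bytes appended to the store. */
static size_t
put_block(const uint8_t *data, size_t len, uint8_t *store, size_t *used,
    size_t cap)
{
	uint8_t buf[64];

	assert(len + 3 <= sizeof(buf));	/* room for three descriptors */
	memcpy(buf, data, len);
	len = compress_layer(buf, len);
	len = encrypt_layer(buf, len);
	return storage_layer(buf, len, store, used, cap);
}
```

Because each layer prepends to what it receives, the record in the
store reads from the outside in: storage descriptor, encryption
descriptor, compression descriptor, then the data.
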
The snapshot layer
------------------

The snapshot abstraction is currently very simple.  A snapshot is
a file under $repo/archive/<name>.  The contents of the file are the
block hashes of the data stored in the snapshot.

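One way to picture such a file is as a flat sequence of fixed-size
hashes.  The sketch below assumes a hypothetical 32-byte hash length
and hypothetical helper names; the actual hash size and snapshot file
layout used by dedup are not specified in these notes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define HASH_LEN 32	/* assumed hash size, for illustration only */

/* Append one block hash to an open snapshot file.
 * Returns 0 on success, -1 on a short write. */
static int
snap_append(FILE *fp, const uint8_t hash[HASH_LEN])
{
	assert(fp != NULL);
	return fwrite(hash, HASH_LEN, 1, fp) == 1 ? 0 : -1;
}

/* Count the block hashes in a snapshot by reading it from the start.
 * Returns -1 if the file cannot be rewound or ends in a partial hash. */
static long
snap_count(FILE *fp)
{
	uint8_t hash[HASH_LEN];
	long n = 0;
	size_t r;

	assert(fp != NULL);
	if (fseek(fp, 0, SEEK_SET) != 0)
		return -1;
	while ((r = fread(hash, 1, HASH_LEN, fp)) == HASH_LEN)
		n++;
	return r == 0 ? n : -1;
}
```
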
The chunker interface
---------------------

The chunker emits variable-length blocks.  The minimum block size is
512KB, the maximum block size is 8MB and the average block size is
2MB.  These configuration parameters can be modified by editing
config.h, but they can be tricky to tune properly.

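As an illustration of how such limits usually interact with a rolling
fingerprint in content-defined chunking, the sketch below cuts a block
when the low bits of the fingerprint are zero, but never before the
minimum size and always at the maximum.  The macro and function names
and the mask-based test are assumptions made for the example, not
necessarily how dedup or its config.h expresses them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes taken from the text; the macro names are hypothetical. */
#define BLKSIZE_MIN	((size_t)512 * 1024)
#define BLKSIZE_MAX	((size_t)8 * 1024 * 1024)
#define BLKSIZE_AVG	((size_t)2 * 1024 * 1024)

/* With a uniformly distributed fingerprint, requiring log2(AVG) low
 * bits to be zero matches about once every AVG bytes (assumed
 * scheme). */
#define HASH_MASK	((uint32_t)(BLKSIZE_AVG - 1))

/* Return 1 if a block of blklen bytes should end at the current
 * position, given the rolling fingerprint fp at that position. */
static int
is_boundary(size_t blklen, uint32_t fp)
{
	assert(blklen <= BLKSIZE_MAX);
	if (blklen < BLKSIZE_MIN)
		return 0;	/* too short: ignore fingerprint matches */
	if (blklen == BLKSIZE_MAX)
		return 1;	/* forced cut at the maximum size */
	return (fp & HASH_MASK) == 0;
}
```

Because matches below the minimum size are ignored, the mean block
length under this scheme lands somewhat above the nominal 2MB implied
by the mask, which is one reason such parameters are tricky to tune.
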
The buzhash[0] rolling hash algorithm is used to fingerprint the input
stream.

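For readers unfamiliar with buzhash, a minimal 32-bit version is
sketched below.  The window length, the function names and the way
the table is filled (a xorshift32 generator used as a placeholder)
are assumptions for the example; they are not taken from the dedup
source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINSIZE 32	/* assumed window length in bytes */

static uint32_t table[256];

/* Rotate x left by n bits; safe for any n, including multiples of 32. */
static uint32_t
rotl32(uint32_t x, unsigned n)
{
	n &= 31;
	return (x << n) | (x >> ((32 - n) & 31));
}

/* Fill the table with placeholder pseudo-random values (xorshift32).
 * A real implementation would ship a fixed, well-mixed table. */
static void
buz_init_table(void)
{
	uint32_t s = 2463534242u;
	int i;

	for (i = 0; i < 256; i++) {
		s ^= s << 13;
		s ^= s >> 17;
		s ^= s << 5;
		table[i] = s;
	}
}

/* Hash one full window of WINSIZE bytes from scratch. */
static uint32_t
buz_hash(const uint8_t *win)
{
	uint32_t h = 0;
	size_t i;

	assert(win != NULL);
	for (i = 0; i < WINSIZE; i++)
		h = rotl32(h, 1) ^ table[win[i]];
	return h;
}

/* Slide the window by one byte: `out` leaves, `in` enters. */
static uint32_t
buz_roll(uint32_t h, uint8_t out, uint8_t in)
{
	return rotl32(h, 1) ^ rotl32(table[out], WINSIZE) ^ table[in];
}
```

Rolling costs two table lookups, two rotations and two XORs per input
byte regardless of the window length, which is what makes the scheme
practical for fingerprinting a whole input stream.
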
When encryption is enabled, a random seed is generated and stored
encrypted in the repository state file.  The seed is XOR-ed with the
buzhash initial state table to mitigate length fingerprinting
attacks.

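A minimal sketch of that XOR step, assuming a 32-bit seed and a
256-entry table of 32-bit words.  The names and widths are
hypothetical, and seed generation and encrypted storage are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_LEN 256

/* XOR a per-repository seed into every entry of the buzhash table. */
static void
buz_seed_table(uint32_t tbl[TABLE_LEN], uint32_t seed)
{
	size_t i;

	assert(tbl != NULL);
	for (i = 0; i < TABLE_LEN; i++)
		tbl[i] ^= seed;
}
```
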
[0] http://www.serve.net/buz/Notes.1st.year/HTML/C6/rand.012.html