Title: How to check your data integrity?
       Author: Solène
       Date: 17 March 2017
       Tags: unix security
       Description: 
       
       Today, the topic is data degradation, also called bit rot, bit
       rotting or damaged files. It's when your data gets corrupted over
       time, due to a disk fault or some unknown reason.
       
       # What is data degradation ? #
       
       I shamelessly paste one line from Wikipedia: "*Data degradation is the
       gradual corruption of computer data due to an accumulation of
       non-critical failures in a data storage device. The phenomenon is also
       known as data decay or data rot.*".
       
       [Data degradation on
       Wikipedia](https://en.wikipedia.org/wiki/Data_degradation)
       
       So, how do we know we encountered bit rot ?
       
           bit rot = (checksum changed) && NOT (modification time changed)
       
       While updating a file could be mistaken for bit rot, there is a
       difference:
       
           update = (checksum changed) && (modification time changed)
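
       To make the rule concrete, here is a minimal sketch in shell (this
       is not one of the tools below, and the "file.saved" database
       format is my own assumption):

           #!/bin/sh
           # sketch: detect bit rot on the file given as argument
           # the saved line is assumed to be "checksum mtime", created with:
           #   echo "$(sha256 -q file) $(stat -f %m file)" > file.saved
           file="$1"
           read -r old_sum old_mtime < "$file.saved"
           new_sum=$(sha256 -q "$file")      # sha256(1) from the base system
           new_mtime=$(stat -f %m "$file")   # BSD stat, mtime as epoch
           if [ "$new_sum" != "$old_sum" ] && [ "$new_mtime" = "$old_mtime" ]; then
               echo "bit rot detected on $file"
           fi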
       
       # How to check if we encounter bitrot ? #
       
       There is no way you can prevent bitrot. But there are some ways to
       detect it, so you can restore a corrupted file from a backup, or
       repair it with the right tool (you can't repair a file with a hammer,
       except if it's some kind of HammerFS ! :D )
       
       In the following, I will describe software I found to check (or
       even repair) bitrot. If you know other tools which are not in this
       list, I would be happy to hear about them, please mail me.
       
       In the following examples, I will use this method to generate bitrot
       on a file:
       
           % touch -d "2017-03-16T21:04:00" my_data/some_file_that_will_be_corrupted
           % generate_checksum_database_with_tool
           % echo "a" >> my_data/some_file_that_will_be_corrupted
           % touch -d "2017-03-16T21:04:00" my_data/some_file_that_will_be_corrupted
           % start_tool_for_checking
       
       We generate the checksum database, then we alter a file by adding
       an "a" at the end of it and we restore the modification and access
       time of the file. Then, we start the tool to check for data
       corruption.
       
       The first **touch** is only for convenience: we could get the
       modification time with the **stat** command and pass the same
       value to **touch** after modifying the file.
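
       For example, with the BSD **stat** command (the file name is only
       an illustration):

           % mtime=$(stat -f "%Sm" -t "%Y-%m-%dT%H:%M:%S" my_data/some_file)
           % echo "a" >> my_data/some_file
           % touch -d "$mtime" my_data/some_file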
       
       ## bitrot ##
       
       This is a Python script and it's **very** easy to use. It will
       scan a directory and create a database with the checksum of the
       files and their modification date.
       
       **Initialization usage:**
       
           % cd /home/my_data/
           % bitrot
           Finished. 199.41 MiB of data read. 0 errors found.
           189 entries in the database, 189 new, 0 updated, 0 renamed, 0 missing.
           Updating bitrot.sha512... done.
           % echo $?
           0
       
       **Verify usage (case OK):**
       
           % cd /home/my_data/
           % bitrot
           Checking bitrot.db integrity... ok.
           Finished. 199.41 MiB of data read. 0 errors found.
           189 entries in the database, 0 new, 0 updated, 0 renamed, 0 missing.
           % echo $?
           0
       
       Exit status is 0, so our data are not damaged.
       
       **Verify usage (case Error):**
       
       
           % cd /home/my_data/
           % bitrot
           Checking bitrot.db integrity... ok.
           error: SHA1 mismatch for ./sometextfile.txt: expected 17b4d7bf382057dc3344ea230a595064b579396f, got db4a8d7e27bb9ad02982c0686cab327b146ba80d. Last good hash checked on 2017-03-16 21:04:39.
           Finished. 199.41 MiB of data read. 1 errors found.
           189 entries in the database, 0 new, 0 updated, 0 renamed, 0 missing.
           error: There were 1 errors found.
           % echo $?
           1
       
       As the exit status is not 0 when a check fails, it's easy to write
       a script running every day/week/month.
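
       Here is a minimal sketch of such a script for cron (the paths and
       the mail recipient are assumptions):

           #!/bin/sh
           # run bitrot on the directory, mail the report if it found errors
           cd /home/my_data || exit 1
           if ! bitrot > /tmp/bitrot.log 2>&1; then
               mail -s "bitrot errors on $(hostname)" root < /tmp/bitrot.log
           fi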
       
       [Github page](https://github.com/ambv/bitrot/)
       
       bitrot is available in OpenBSD ports in sysutils/bitrot since the
       6.1 release.
       
       
       
       ## par2cmdline ##
       
       This tool works with PAR2 archives (see below for more information
       about what PAR is) and, from them, it will be able to check your
       data integrity **AND** repair it.
       
       While it has some pros, like being able to repair data, the con is
       that it's not very easy to use. I would use it for checking the
       integrity of long term archives that won't change. The main
       drawback comes from the PAR specifications: the archives are
       created from a file list, so if you have a directory with your
       files and you add new files, you will need to recompute ALL the
       PAR archives because the file list changed, or create new PAR
       archives only for the new files, but that will make the verify
       process more complicated. It doesn't seem suitable to create new
       archives for every bunch of files added to the directory.
       
       PAR2 lets you choose the percentage of a file you will be able to
       repair; by default it will create the archives to be able to
       repair up to 5% of each file. That means you don't need a whole
       backup of the files (not having one would still be a bad idea) and
       only approximately an extra 5% of your data to store.
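
       For example, to ask for 10% redundancy instead of the default 5%
       (same archive name as in the examples below):

           % par2 create -r10 -a integrity_archive -R my_data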
       
       **Create usage:**
       
           % cd /home/
           % par2 create -a integrity_archive -R my_data
           Skipping 0 byte file: /home/my_data/empty_file
       
           Source file count: 17
           Source block count: 2000
           Redundancy: 5%
           Recovery block count: 100
           Recovery file count: 7
       
           [text cut here]
           Opening: my_data/[....]
       
           Computing Reed Solomon matrix.
           Constructing: done.
           Wrote 381200 bytes to disk
           Writing recovery packets
           Writing verification packets
           Done
       
           % echo $?
           0
       
           % ls
           integrity_archive.par2
           integrity_archive.vol000+01.par2
           integrity_archive.vol001+02.par2
           integrity_archive.vol003+04.par2
           integrity_archive.vol007+08.par2
           integrity_archive.vol015+16.par2
           integrity_archive.vol031+32.par2
           integrity_archive.vol063+37.par2
           my_data
       
       **Verify usage (OK):**
       
           % par2 verify integrity_archive.par2 
           Loading "integrity_archive.par2".
           Loaded 36 new packets
           Loading "integrity_archive.vol000+01.par2".
           Loaded 1 new packets including 1 recovery blocks
           Loading "integrity_archive.vol001+02.par2".
           Loaded 2 new packets including 2 recovery blocks
           Loading "integrity_archive.vol003+04.par2".
           Loaded 4 new packets including 4 recovery blocks
           Loading "integrity_archive.vol007+08.par2".
           Loaded 8 new packets including 8 recovery blocks
           Loading "integrity_archive.vol015+16.par2".
           Loaded 16 new packets including 16 recovery blocks
           Loading "integrity_archive.vol031+32.par2".
           Loaded 32 new packets including 32 recovery blocks
           Loading "integrity_archive.vol063+37.par2".
           Loaded 37 new packets including 37 recovery blocks
           Loading "integrity_archive.par2".
           No new packets found
       
           The block size used was 3812 bytes.
           There are a total of 2000 data blocks.
           The total size of the data files is 7595275 bytes.
       
       
           [...cut here...]
           Target: "my_data/....." - found.
       
           % echo $?
           0
       
       **Verify usage (with error):**
       
           % par2 verify integrity_archive.par2
           Loading "integrity_archive.par2".
           Loaded 36 new packets
           Loading "integrity_archive.vol000+01.par2".
           Loaded 1 new packets including 1 recovery blocks
           Loading "integrity_archive.vol001+02.par2".
           Loaded 2 new packets including 2 recovery blocks
           Loading "integrity_archive.vol003+04.par2".
           Loaded 4 new packets including 4 recovery blocks
           Loading "integrity_archive.vol007+08.par2".
           Loaded 8 new packets including 8 recovery blocks
           Loading "integrity_archive.vol015+16.par2".
           Loaded 16 new packets including 16 recovery blocks
           Loading "integrity_archive.vol031+32.par2".
           Loaded 32 new packets including 32 recovery blocks
           Loading "integrity_archive.vol063+37.par2".
           Loaded 37 new packets including 37 recovery blocks
           Loading "integrity_archive.par2".
           No new packets found
       
           The block size used was 3812 bytes.
           There are a total of 2000 data blocks.
           The total size of the data files is 7595275 bytes.
       
       
           [...cut here...]
           Target: "my_data/....." - found.
           Target: "my_data/Ebooks/Lovecraft/Quete Onirique de Kadath
       l'Inconnue.epub" - damaged. Found 95 of 95 data blocks.
       
       
           1 file(s) exist but are damaged.
           16 file(s) are ok.
           You have 2000 out of 2000 data blocks available.
           You have 100 recovery blocks available.
           Repair is possible.
           You have an excess of 100 recovery blocks.
           None of the recovery blocks will be used for the repair.
       
           % echo $?
           1
       
       
           % par2 repair integrity_archive.par2
           Loading "integrity_archive.par2".
           Loaded 36 new packets
           Loading "integrity_archive.vol000+01.par2".
           Loaded 1 new packets including 1 recovery blocks
           Loading "integrity_archive.vol001+02.par2".
           Loaded 2 new packets including 2 recovery blocks
           Loading "integrity_archive.vol003+04.par2".
           Loaded 4 new packets including 4 recovery blocks
           Loading "integrity_archive.vol007+08.par2".
           Loaded 8 new packets including 8 recovery blocks
           Loading "integrity_archive.vol015+16.par2".
           Loaded 16 new packets including 16 recovery blocks
           Loading "integrity_archive.vol031+32.par2".
           Loaded 32 new packets including 32 recovery blocks
           Loading "integrity_archive.vol063+37.par2".
           Loaded 37 new packets including 37 recovery blocks
           Loading "integrity_archive.par2".
           No new packets found
       
           The block size used was 3812 bytes.
           There are a total of 2000 data blocks.
           The total size of the data files is 7595275 bytes.
       
       
           [...cut here...]
           Target: "my_data/....." - found.
           Target: "my_data/Ebooks/Lovecraft/Quete Onirique de Kadath
       l'Inconnue.epub" - damaged. Found 95 of 95 data blocks.
       
       
           1 file(s) exist but are damaged.
           16 file(s) are ok.
           You have 2000 out of 2000 data blocks available.
           You have 100 recovery blocks available.
           Repair is possible.
           You have an excess of 100 recovery blocks.
           None of the recovery blocks will be used for the repair.
       
       
       
       l'Inconnue.epub" - found.
       
       
           0
       
       par2cmdline is only one implementation; other tools working with
       PAR archives exist. They should all be able to work with the same
       PAR files.
       
       [Parchive on Wikipedia](https://en.wikipedia.org/wiki/Parchive)
       
       [Github page](https://github.com/Parchive/par2cmdline)
       
       par2cmdline is available in OpenBSD ports in archivers/par2cmdline.
       
       If you find a way to add new files to existing archives, please mail
       me.
       
       ## mtree ##
       
       One can write a little script using **mtree** (in the base system
       of OpenBSD and FreeBSD) which will create a file with the checksum
       of every file in the specified directories. If the mtree output
       differs since the last run, we can send a mail with the
       difference. This process is used in the base install of OpenBSD
       for /etc and some other files, to warn you if they changed.
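
       Here is a minimal sketch of such a script, assuming mtree from the
       base system and hypothetical paths:

           #!/bin/sh
           # compare the directory against the previous specification and
           # mail the differences, then regenerate the specification
           SPEC=/var/db/my_data.mtree
           if [ -f "$SPEC" ]; then
               mtree -f "$SPEC" -p /home/my_data > /tmp/mtree.out || \
                   mail -s "mtree mismatch in /home/my_data" root < /tmp/mtree.out
           fi
           mtree -c -K sha256digest -p /home/my_data > "$SPEC"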
       
       While it's suited for directories like /etc, in my opinion this is
       not the best tool for doing integrity checks.
       
       ## ZFS ##
       
       I would like to talk about ZFS and data integrity because this is
       where ZFS is very good. If you are using ZFS, you may not need any
       other software to take care of your data. When you write a file,
       ZFS will also store its checksum as metadata. By default, the
       "checksum" option is activated on datasets, but you may want to
       disable it for better performance.
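
       You can look at or set this property per dataset (the dataset name
       is only an illustration):

           # zfs get checksum tank/home
           # zfs set checksum=on tank/home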
       
       There is a command to ask ZFS to check the integrity of the
       files. Warning: scrub is very I/O intensive and can take from
       hours to days or even weeks to complete, depending on your CPU,
       disks and the amount of data to scrub:
       
           # zpool scrub zpool
       
       The scrub command will recompute the checksum of every file on the
       ZFS pool; if something is wrong, it will try to repair it if
       possible. A repair is possible in the following cases:

       If you have multiple disks like raid-Z or raid-1 (mirror), ZFS
       will look on the different disks for a non-corrupted version of
       the file; if it finds one, it will restore it on the disk(s) where
       it's corrupted.
       
       If you have set the ZFS option "copies" to 2 or 3 (1 is the
       default), each file of the dataset will be allocated 2 or 3 times
       on the disk, so take care if you want to use it on a dataset
       containing heavy files ! If ZFS finds that a version of a file is
       corrupted, it will check the other copies of it and try to restore
       the corrupted file if possible.
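
       Setting this option looks like this (hypothetical dataset name):

           # zfs set copies=2 tank/home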
       
       You can see the percentage of the pool already scrubbed with

           # zpool status zpool

       and the scrub can be stopped with

           # zpool scrub -s zpool
       
       
       ## BTRFS ##

       Like ZFS, BTRFS is able to scrub its data, report bit rot, and
       repair it if the data is available on another disk.
       
       To start a scrub, run:
       
           btrfs scrub start /
       
       You can check progress using:
       
           btrfs scrub status /
       
       It's possible to use `btrfs scrub cancel /` to stop a scrub and
       resume it later with `btrfs scrub resume /`; however, btrfs tries
       its best to scrub the data without affecting the responsiveness of
       the system too much.
       
       
       ## AIDE ##
       
       Its name is an acronym for "Advanced Intrusion Detection
       Environment"; it's a complicated piece of software which can be
       used to check for bitrot. I would not recommend using it if you
       only need bitrot detection.
       
       Here are a few hints if you want to use it to check your file
       integrity:
       
       **/etc/aide.conf**
       
           /home/my_data/ R
           # Rule definition
           All=m+s+i+md5
           report_summarize_changes=yes
       
       (R is for recursion). The "All" line lists the checks we do on
       each file. For bitrot checking, we want to check the modification
       time, size, checksum and inode of the files. The
       `report_summarize_changes` option displays a list of changes if
       something is wrong.
       
       This is the most basic config file you can have. Then you will
       have to run **aide** to create the database, and later run it
       again to create a new database and compare the two. It doesn't
       update its database itself: you will have to move the old database
       away and tell it where to find the older database.
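
       Here is a minimal sketch of the workflow, assuming the database
       paths declared in aide.conf (the paths are hypothetical):

           % aide --init                  # writes a fresh database (aide.db.new)
           % mv /var/db/aide.db.new /var/db/aide.db
           % aide --check                 # compares the files against aide.db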
       
       
       # My use case #
       
       I have different kinds of data. On one side, I have static data
       like pictures, clips and music, things that won't change over
       time; on the other side I have my mails, documents and folders
       where the content changes regularly (creation, deletion,
       modification). I can afford a backup of 100% of my data with a few
       days of backup history, so I'm not interested in file repairing.
       
       I want to be warned quickly if a file gets corrupted, so I can
       still find it in my backup history, as I don't keep every version
       of my files for too long. I chose to go with the Python tool
       **bitrot**: it's very easy to use and it doesn't become a mess
       with my folders getting updated often.
       
       I would go with par2cmdline if I were not able to back up all my
       data. Having 5% or 10% of redundancy for my files *should* be
       enough to restore them in case of corruption without taking too
       much space.