Question

I have a large number of files that should never change, including RAW photographs and video files. I'm worried about silent bit rot.

I do have backups to restore lost/corrupted files, but comparing current files against backups is not practical (for example, video files are on digital tapes). Also, my backup software does not provide functionality for this.

Is there software that scans a list of folders, stores reliable checksums, and can later check that selection for added, removed, or modified (corrupted) files?

There are about 3 TB in 21 million files (a large portion of which are very small files), so memory consumption matters. It should run on Linux, and preferably on OS X too.

Explanation / Answer

I started using AIDE:

AIDE (Advanced Intrusion Detection Environment) is a file and directory integrity checker.

It tracks added, removed, and modified files, as well as file attributes. It supports a variety of checksum algorithms, including sha256 and sha512.

On Ubuntu, the aide package is available from the base repository (apt-get install aide). On OS X, compiling from source failed with mysterious errors, but installing with MacPorts succeeded:

sudo port install aide

With MacPorts, an example configuration file is installed at /opt/local/etc/aide.conf. Running it is simple:

aide --init # Initializes the database - calculates checksums
aide --check # Checks files against the database
aide --update # Checks files against the database, and updates the database
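
For the use case in the question, a configuration might look roughly like the sketch below. The paths (/var/lib/aide, /data/photos) and the rule name Archive are assumptions for illustration, not taken from the shipped example file. Newer AIDE releases, as far as I know, prefer database_in over database, so check the aide.conf man page for your version.

```
# Hypothetical paths; adjust to your own layout.
# --check reads "database"; --init and --update write "database_out".
database=file:/var/lib/aide/aide.db
database_out=file:/var/lib/aide/aide.db.new

# Custom rule (the name is arbitrary): file size plus two independent checksums.
Archive = s+sha256+sha512

# Exclude temporary files, and select the archive tree for checking.
!/data/photos/.*\.tmp$
/data/photos Archive
```

Note that aide --init writes the new database to the database_out location. It has to be moved or copied to the database location before aide --check can read it, and the same applies after aide --update.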

All data is stored in a plaintext file (which is itself vulnerable to corruption, but keeping a copy is easy), so switching to another tool later should be straightforward.

Positive things:

Fast
Supports multiple strong checksum algorithms. Use of md5 is discouraged: practical collision attacks exist, so it cannot be trusted against intentional modification.
Easy to run from cron
In brief testing, no issues so far: it correctly detected all changes (to content and to the configured file attributes), as well as added and removed files.
Supports complex exclude rules: for example, there is no point in checksumming temporary files or any file that is expected to change.
Calculates multiple checksums (configurable). This gives reasonably good guarantees for the future: even if one hashing algorithm is compromised, the integrity database remains useful, even against intentional modification (as opposed to bit rot).
Checksums are stored in plaintext, and headers include field definitions. This is useful if the configuration file is lost, or if the database is parsed by another program.
Easy to store the configuration file and checksum database alongside the data on each disk, CD, or folder tree. That way all configuration options travel with the data, and it is easy to rerun the integrity check later.
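
For unattended runs from cron, it helps to turn the exit status of aide --check into a one-line summary. The sketch below assumes the bitmask described in the aide manual page (1 for new files, 2 for removed, 4 for changed, and 14 or above for errors). The function name describe_aide_status is made up for this example, and the status values should be verified against the manual for your version.

```shell
#!/bin/sh
# Translate the exit status of `aide --check` into a one-line summary.
# Assumption: status is a bitmask (1 = new, 2 = removed, 4 = changed),
# and values of 14 or above signal an aide error rather than differences.
describe_aide_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no differences"
        return 0
    fi
    if [ "$status" -ge 14 ]; then
        echo "aide error (exit status $status)"
        return 0
    fi
    summary=""
    if [ $((status & 1)) -ne 0 ]; then summary="$summary new"; fi
    if [ $((status & 2)) -ne 0 ]; then summary="$summary removed"; fi
    if [ $((status & 4)) -ne 0 ]; then summary="$summary changed"; fi
    echo "differences:$summary"
}
```

In a cron script, this would be called right after the check, for example by running aide --check with its output redirected to a log file and then calling describe_aide_status "$?" on the next line.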

Negative points:

Configuration means editing the config file in a text editor; there is no nice UI. Similarly, check output goes straight to the terminal.
At the time of writing, the latest release is from 2010, but on the other hand it is feature complete, so there is no need for constant updates.
Checksum database integrity is not automatically validated. Fortunately, doing that separately is easy: record a checksum with sha1sum checksums.db > checksums.db.sha1sum, and verify it later with sha1sum -c checksums.db.sha1sum.
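
Protecting the database itself can be wrapped in two small helpers, sketched below. The function names seal_db and verify_db are hypothetical, and sha256sum is swapped in for sha1sum here. The sketch assumes GNU coreutils; on OS X, shasum -a 256 is the closest stock equivalent.

```shell
#!/bin/sh
# Record a checksum of the aide database next to it.
# Run this right after `aide --init` or `aide --update`.
seal_db() {
    sha256sum "$1" > "$1.sha256"
}

# Verify the database against the recorded checksum.
# Exits 0 if the database is unchanged, non-zero if it differs or is missing.
verify_db() {
    sha256sum -c "$1.sha256"
}
```

Typical use would be seal_db checksums.db after every database update, and verify_db checksums.db && aide --check before each check, so that a corrupted database is caught before its contents are trusted. The checksum file records the path exactly as it was given, so verify from the same directory or use absolute paths.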