Introduction

This is currently a work in progress, but it does work and I thought people might be interested.

The other day I had an idea for an improvement to my snapshotting system. I'm currently using an rsync snapshot system (see, for example, here), and it is very good. Rather than rotating the snapshots, as in the above link, I just create a new one each night. I now have lots, going back several months, but I've got plenty of disk space and they don't take up too much.

There are, however, a couple of 'limitations' to this system, or rather, improvements which could be made:-

  • If a file is moved between snapshots a new copy must be made
  • (related) if the same file exists in different places it will take up unnecessary extra disk space

A solution to this problem would be Venti, the content-addressed archival storage system from Plan 9. This is clearly very cool. If anyone knows of an implementation of this for Linux I'd love to hear about it.

Still, without implementing a full Venti system, we can get quite close to it using a simple Perl script or two.

Downloads

There are 3 scripts here (2 perl scripts and a shell script).

To use them, put them all in your PATH, chmod +x them, and then:-

  • Create a snapshots directory
  • In the snapshots directory create a directory called 'hashes'
  • cd into the directory you wish to snapshot
  • type 'snapshot SNAPDIR' where SNAPDIR is the snapshot directory you just created

e.g. 'snapshot ~/snapshots'

To run these you will need the perl module 'Digest::SHA1' from CPAN.

Principle

We have a working directory. We also have a snapshot directory, where snapshots are kept. The simplest way to make a snapshot would be a recursive copy of the working directory into a new directory inside the snapshot directory. In that case, if the working directory is typically X MB in size, the snapshots will take up N*X MB in total, where N is the number of snapshots. This is clearly not very good. What we want is to store only 1 copy of any given file. Each snapshot is stored under its own timestamped directory: SNAPDIR/YYYY/MM/DD--HH-MM-SS/...
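The directory layout above can be sketched in a few lines. This is an illustration in Python rather than the Perl used by the actual scripts, and the function name snapshot_path is made up for the example:

```python
from datetime import datetime
from pathlib import Path


def snapshot_path(snapdir, when):
    """Return SNAPDIR/YYYY/MM/DD--HH-MM-SS for the given datetime."""
    return (
        Path(snapdir)
        / when.strftime("%Y")
        / when.strftime("%m")
        / when.strftime("%d--%H-%M-%S")
    )


# A snapshot taken now would go under this directory:
print(snapshot_path("snapshots", datetime.now()))
```

Because the names sort chronologically, a plain 'ls' of SNAPDIR lists the snapshots in the order they were taken.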

To make sure we only store 1 copy of a given file (or rather, 1 copy of a given file content), we don't do a straight copy into the snapshots directory. Instead we first read in the contents of each file and take a hash of it (using SHA1 in this case). We then look in the 'hashes' directory in the snapshots directory to see if there is a file named according to that hash. If there is, we just (hard) link that file to the destination file in the snapshot we are creating. If there isn't, we create one, copying the content of the file into it, and THEN link that to the destination file.
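A minimal sketch of that hash-then-link step, written in Python for illustration (the real scripts are Perl and may differ in detail). The function names and the temporary '.tmp' file are assumptions of this sketch, and it relies on the 'hashes' directory and the snapshot living on the same filesystem, since hard links cannot cross filesystems:

```python
import hashlib
import os
import shutil


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_file(source, dest, hashes_dir):
    """Make dest a hard link to the hash file holding source's content."""
    digest = sha1_of_file(source)
    hash_file = os.path.join(hashes_dir, digest)
    if not os.path.exists(hash_file):
        # First time this content is seen: copy it into the hash store.
        # Copy to a temporary name first so an interrupted copy never
        # leaves a truncated file under the final hash name.
        tmp = hash_file + ".tmp"
        shutil.copyfile(source, tmp)
        os.replace(tmp, hash_file)
    parent = os.path.dirname(dest)
    if parent:
        os.makedirs(parent, exist_ok=True)
    os.link(hash_file, dest)  # hard link, no second copy of the data
    return digest
```

Two files with the same content, wherever they live and whatever they are called, end up as links to the same file in 'hashes'.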

In this way the snapshot directories will only ever have one copy of a given file, even if it moves, or its name changes. There is a slight problem with this if there are 2 files with different permissions or owners but the same content. Both these files will be hard linked to the same hash file, and because hard links share a single inode they also share one mode and one owner, so at least 1 of the files cannot keep its own. To get around this problem a file called 'CATALOG' is created at the root of each snapshot, which contains a list of all the files along with their owners, mode, hash and other things. This is also useful for comparing different snapshots to see what changed. We can then have another program for restoring things from a snapshot which reconstructs the permissions and owners as required. I haven't written that script yet!
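Since the restore script does not exist yet, the following is only a guess at what its core might look like, sketched in Python. It assumes the mode, uid and gid for a file have already been read from CATALOG; the function name restore_file and its arguments are hypothetical:

```python
import os
import shutil


def restore_file(snapshot_file, target, mode, uid=None, gid=None):
    """Copy one file out of a snapshot and reapply its recorded metadata."""
    parent = os.path.dirname(target)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # Copy the bytes rather than linking, so the restored file gets its
    # own inode and changing its mode cannot affect the hash file.
    shutil.copyfile(snapshot_file, target)
    os.chmod(target, mode)
    if uid is not None and gid is not None:
        # Changing the owner normally requires root.
        os.chown(target, uid, gid)
```

Restoring by copy means two files that share a hash file in the snapshot can get different permissions back.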

These snapshots can be browsed and opened just like any other file on the file system.

Implementation

The 'catalog' script reads a list of files from STDIN and writes out that list of files along with modes, owners, SHA1 digests and other information (from the perl 'stat' function). This then gets piped into the 'snap' script, which reads a list of files from STDIN along with their hashes etc and copies them into a new snapshot. The initial list of files can be generated by 'find .'. 'snapshot' is a very simple shell script which pipes the output of 'find' into these 2 perl scripts (one after the other).
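To make the first stage of that pipeline concrete, here is a rough Python sketch of a catalog-style filter. The real 'catalog' script is Perl and its output format is not described here, so the tab-separated layout (mode, uid, gid, size, mtime, SHA1, path) and the '-' placeholder for things that are not regular files are assumptions of this sketch:

```python
import hashlib
import os
import stat
import sys


def catalog_line(path):
    """Return one tab-separated catalog record for path."""
    info = os.lstat(path)
    if stat.S_ISREG(info.st_mode):
        digest = hashlib.sha1()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        sha1 = digest.hexdigest()
    else:
        sha1 = "-"  # directories, symlinks etc. have no content hash here
    fields = [
        format(stat.S_IMODE(info.st_mode), "04o"),
        str(info.st_uid),
        str(info.st_gid),
        str(info.st_size),
        str(int(info.st_mtime)),
        sha1,
        path,
    ]
    return "\t".join(fields)


def catalog(paths, out=sys.stdout):
    """Read one path per line and write one catalog record per path."""
    for raw in paths:
        path = raw.rstrip("\n")
        if path:
            out.write(catalog_line(path) + "\n")
```

Calling catalog(sys.stdin) at the bottom of the file would turn it into a filter that can sit after 'find .' in a pipe.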

Notes

I use 'find' to generate the file list to save me having to scan directories myself in perl. This makes the implementation easier but also makes the system quite flexible - it's possible to only select certain files for the snapshot.

Another useful thing to have would be a script to replace a file not in the snapshot directory with a link to a hash file in the snapshot hashes directory. This would only work if that file were never modified (so it should be marked read only). That would save space if a snapshot were ever taken of that file.
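One way such a script could work is sketched below in Python. It is an illustration only: the name link_into_store and the '.linktmp' suffix are made up, and it assumes the file and the 'hashes' directory are on the same filesystem so that hard links are possible:

```python
import hashlib
import os
import stat


def link_into_store(path, hashes_dir):
    """Replace path with a hard link to its hash file and make it read-only."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    name = digest.hexdigest()
    hash_file = os.path.join(hashes_dir, name)
    if not os.path.exists(hash_file):
        # Content not stored yet: the file itself becomes the hash file.
        os.link(path, hash_file)
    elif not os.path.samefile(path, hash_file):
        # Content already stored: swap the file for a link to it.
        # Link under a temporary name, then rename over the original,
        # so the path never disappears part way through.
        tmp = path + ".linktmp"
        os.link(hash_file, tmp)
        os.replace(tmp, path)
    # Drop all write bits. The hash file shares the inode, so it becomes
    # read-only as well.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~0o222)
    return name
```

A later snapshot of that file then finds its hash already present in 'hashes' and only has to add one more link.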