At Six Feet Up, we have been using Bacula to manage the backups across our infrastructure. It has worked mostly well for years, but Bacula can be hard to debug, and strange issues with virtual tape volumes filling up can lead to backup failures. Also, because our typical workload is backing up a few very large files for each customer, it can be quite inefficient in the amount of storage it uses for incremental backups.
Bacula’s limitations have pushed us to look for an alternative system that would handle our needs more gracefully. Since we are heavy users of ZFS (Z File System) for all internal storage, we looked into using some of the native capabilities of our file system of choice. More specifically: the combination of snapshots and replication.
Over time, we moved all of our FreeBSD Jail (a lightweight container) infrastructure to ZFS and now manage its backups entirely with dataset snapshots replicated back to our storage servers, using zfsnap and zxfer. zfsnap takes a snapshot every day for a week and every week for a month, automatically discarding old ones so you don't accumulate excess snapshots. All of our current Jail servers’ datasets get snapshotted, then each snapshot gets replicated back to the storage servers. It's very fast.
Here is an example straight from one of our production Jail servers:
$ grep zfsnap /etc/periodic.conf
daily_zfsnap_enable="YES"
daily_zfsnap_recursive_fs="zroot/jails"
daily_zfsnap_delete_enable="YES"
weekly_zfsnap_enable="YES"
weekly_zfsnap_recursive_fs="zroot/jails"
weekly_zfsnap_delete_enable="YES"
Optionally, you can also take snapshots hourly and monthly. Each snapshot is stored with a specific TTL (time to live) that tells zfsnap when to clean up old snapshots. In our example, we use the default TTLs: 1 week for daily snapshots and 1 month for weekly snapshots.
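You can also set a custom TTL by invoking zfsnap directly rather than relying on the periodic defaults. A minimal sketch, assuming zfsnap 2.x command syntax (the dataset is reused from the examples above; the snapshot name in the comment is illustrative):

```shell
# Take a recursive snapshot of zroot/jails with a 3-week TTL;
# zfsnap encodes the TTL in the snapshot name itself, e.g.
#   zroot/jails@2024-01-15_00.00.00--3w
zfsnap snapshot -a 3w -r zroot/jails

# Destroy any snapshots under zroot/jails whose TTL has expired
zfsnap destroy -r zroot/jails
```

Because the TTL lives in the snapshot name, the destroy pass needs no external state to decide what is expired.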
The other half of the solution is moving these snapshots to a remote server and/or location. zxfer is a transfer script that ensures all snapshots get transferred back in a timely fashion: it finds every snapshot the storage server doesn't have yet and transfers it over automatically.
$ sudo crontab -l
@daily /usr/local/sbin/zxfer -dFkv -g 376 -T root@storagehost1 -R zroot/jails storage/jailhost1
The example above will recursively transfer (-R) all datasets under zroot/jails, delete stale snapshots on the target (-d), store the original filesystem properties in a file (-k), force a rollback of the filesystem on the target before transferring (-F) and protect grandfather snapshots older than 376 days (-g). The zxfer man page has some really good examples that clarify this further.
Additionally, these tools back up at the block level, so if a 4k block of a 10GB file changes, only that 4k gets backed up; we don't have to back up the whole 10GB. That is a lot more efficient on space and faster to transfer.
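Under the hood, zxfer builds on ZFS's native incremental replication. A rough sketch of the equivalent manual operation (hostnames and dataset names reused from the examples above; the snapshot names are hypothetical, and this is not exactly how zxfer invokes it):

```shell
# Full send of the first snapshot to seed the backup target
zfs send zroot/jails@2024-01-14 | \
    ssh root@storagehost1 zfs receive -F storage/jailhost1/jails

# Later, send only the blocks that changed between the two snapshots (-i);
# a 4k change in a 10GB file ships roughly 4k, not 10GB
zfs send -i zroot/jails@2024-01-14 zroot/jails@2024-01-15 | \
    ssh root@storagehost1 zfs receive storage/jailhost1/jails
```

zxfer's job is essentially to automate this: pick the latest snapshot both sides share, then stream every newer snapshot incrementally.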
If you are eager to learn more about this topic, I have written another post about using ZFS in conjunction with PostgreSQL.
I'd love to get any feedback on this process and hear if anyone else is doing something similar in their infrastructure.