DISCREMENTAL (hard disk based incremental backups)

This program (a set of scripts) is yet another rsync based backup solution.
It uses rsync with the --link-dest option to create hard links to files that
have not changed rather than creating another copy of the same file. Using
this technique, each snapshot is a full backup even though the backup process
used is incremental (backing up only the files that changed since the last
backup). The snapshots of files are arranged by date and time.

PHILOSOPHY

Tapes suck. If you are looking into this type of program you probably already
know this. I don't need to defend the rationale for not using tapes. That
said, you may still use tape and other media with this solution.

Like other disk based backup programs, this program is just a wrapper around
rsync. Unlike many of those other programs, it keeps that as the central
goal--to make it easy and efficient to use the features of rsync and manage
disk based backups.

This philosophy also makes the program easy to maintain. It does not use a
compiled language. Why should a wrapper around rsync need to be compiled? It
also minimizes the amount of code that must be maintained by using proven and
common subsystems such as cron and ssh.

Each backup is atomic in that it is run directly from cron and configured by
the user. This allows maximum flexibility when running pre and post actions
without needing yet another markup language and/or configuration subsystem.

One other tenet followed during development of these scripts is unit testing
of the code. There are regression tests that can be (and were) run after
making changes to the code. This allows the system to be verified after small
and major API changes with a higher degree of certainty.

IDEAL BACKUP SOLUTION

+ only backs up changed files
+ can be tested
+ is easily maintainable (scripted rather than compiled)
+ uses minimal hard disk space
+ has centralized configuration
+ requires minimal configuration on the client
  + using 'out of the box' systems
  + no extra software needed on client (uses sshd, ssh dsa keys, and/or
    rsyncd)
+ allows encrypted backups
+ allows files from the backup server to be backed up
  + to itself
  + to other systems (secondary) efficiently
+ allows for multiple sets of backups from different sources
+ allows efficient selective backups
  + user does not always want to back up the entire filesystem/system
+ allows backing up entire system
+ allows excluding of files and directories
+ allows backups to cross filesystems or not based on user configuration
+ allows for different options and settings for each set of backups
+ allows a backup to be run on demand
+ allows backups at different times without creating yet another subsystem
  for scheduling (in short--uses cron)
- expires backups automatically based on admin configuration
+ registers banks/vaults with some central config
+ notifies admin when backup does not work
+ prevents a new backup when old one is still running on same data
+ indexes files for fast searching
+ can report the size of an archive or subset of an archive on demand
- allows users to retrieve files
  - requires knowing which users have access to which files
  - requires authentication

The plus (+) entries are completed. The minus (-) entries are not finished
yet.

OTHER CONSIDERATIONS

These requirements make it difficult, but not impossible, to include backups
of files on Windows systems.

TERMINOLOGY

I like the terminology that dirvish uses. This is based on it.

A box is an individual snapshot made by rsync that will share files with
other snapshots by the use of hard links. The naming of the boxes is based on
the FORMAT parameter and the date command.

A vault is a collection of boxes that share files.
The vault will also hold metadata and logs.

A bank is a directory that holds vaults and metadata. In addition to vaults,
the bank should contain a template vault and a bank log.

A recursive (secondary or slave) vault is one which efficiently backs up
another vault.

Review
  BANK - a directory that holds vaults
  VAULT - a collection of backups that have hard links to some of the same
          files
  BOX - a snapshot/backup
  RECURSIVE/SECONDARY/SLAVE VAULT - a vault that backs up other vaults

RECURSIVE VAULTS

A recursive (secondary or slave) vault is one which efficiently backs up
another vault. This is useful when secondary backup servers are used that do
not have enough storage to back up the entire primary backup server. The
recursive feature tells the vault to sync the last completed backup on the
source server over to the destination. On update, it also uses the boxes it
already has to create hard links, just as the source server does. This allows
the secondary server to expire its own backups at a different rate than the
primary server. For example, the primary server may keep backups for 1 month
while the secondary server keeps only 2 backups.

NOTE: It doesn't make sense to keep fewer than 2 backups in a recursive vault
because the process of expiring the data would delete it all. It would then
have to copy all that data again on the next backup.

To get a better understanding of why this feature is needed, let's assume for
a moment that disk space is unlimited. A good rule of thumb is to keep 3 sets
of backups. We will use 1 primary backup server and 2 secondary servers.

The first thing we might do to back up the backups is to rsync the bank from
the primary to the secondary server. We use the archive and hardlink options
to rsync to keep permissions and not duplicate hard linked files.

  rsync -aH user@primary:/bank /bank

There are a few problems with this approach. First, we may not have the space
for the entire bank on the slave servers.
Secondly, as the backups grow this process is going to take longer and longer
to complete.

We can try to make the process more efficient by building lists of files to
include or exclude which are used with rsync.

  rsync -aH --exclude-from=file_with_vaults_and_boxes user@primary:/bank /bank
  rsync -aH --include-from=file_with_vaults_and_boxes user@primary:/bank /bank

This approach works a little better, but how are we going to build these
lists of files? This could become very complicated very quickly when all we
really want is to sync the latest backup from the primary server over to the
secondary. After all, we already have most of the files needed in place. Only
the new files will need to be transferred over.

The recursive option (-S, for secondary) can be used when setting up a vault
on a secondary server using createvault. This option tells the code that it
should treat the source as a vault. It will look for the last parameter and
sync only that box.

If you do NOT use the -S option, the entire vault may be synced to the
secondary server. Since you can sync a vault with rsync, it is pointless to
create a vault just to put the entire contents of another vault inside it.

CONFIGURATION

* Tweak defaults in /etc/discremental.conf or ~/.discremental
  Read the examples and notes in discremental.conf for more information.

* Create a bank (a or b--not both)
  a. Use the createbank script to create a bank and register it with the
     system

       createbank /path/to/bank

  b. Create a bank by copying the bank template. It may be located in a
     document directory or in with the source for this utility.

       cp -r bank.template /path/to/my/bank

     Now put the bank in /etc/discremental.conf or ~/.discremental

       echo "BANK=/path/to/my/bank" >> /etc/discremental.conf

* If using a network, set up connections from server to client(s)
  a. if using ssh, set up passwordless public/private key authentication
  b. if using the rsync daemon, set up rsyncd.conf on the clients

* Create a vault and initial box for each system (or for each backup set that
  will share hard links to files)

    createvault /path/to/my/bank/vault /file /folder1 /folder2

* Schedule cron jobs that use updatevault to create boxes (snapshots) in a
  vault

    echo "30 1 * * * root updatevault /path/to/my/bank/vault" >> /etc/cron.d/discremental

COMMANDS

  createbank
  createvault files . . .
  updatevault
  expirevault
  dsize
  hsize

The size commands--by default--show how much real disk space is being used.
Since hard links are used, this number may be smaller than you expect. If you
want to know how much disk space would be used on a tape drive, or by copying
the files somewhere else without hard links, pass the proper argument (-l).
Comparing the two outputs tells you how much space is being saved over
traditional backup methods.

OTHER RSYNC BASED SOLUTIONS

These are listed in OTHER_SOLUTIONS.txt
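
The difference between the two numbers reported by the size commands can be
reproduced with plain du. This is a rough sketch only: it does not call dsize
or hsize, whose implementation is not shown here, and the directory names are
made up. It assumes a du that accepts -l to count hard linked files every
time they appear (GNU, BSD, and BusyBox du do).

```shell
#!/bin/sh
# Sketch: real versus "no hard links" disk usage of two snapshots.
# box1 and box2 are hypothetical stand-ins for boxes in a vault.
set -eu
work=$(mktemp -d)
mkdir "$work/box1" "$work/box2"

# A 1 MiB file of random data in the first snapshot.
dd if=/dev/urandom of="$work/box1/data" bs=1024 count=1024 2>/dev/null

# The second snapshot shares the same file through a hard link.
ln "$work/box1/data" "$work/box2/data"

# Default du counts each hard linked file once (real usage).
real=$(du -sk "$work" | cut -f1)

# With -l every link is counted, as if each box were a separate copy.
apparent=$(du -skl "$work" | cut -f1)

echo "real usage:         ${real}K"
echo "without hard links: ${apparent}K"

rm -rf "$work"
```

The second number is roughly double the first here because the only large
file is shared by both snapshots.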