Ok, here’s another shout out to the guys and gals in Mountain View, CA. For anyone who’s ever had processing scripts fail because someone on the data entry side of the house didn’t use caps, misplaced a decimal point, mixed up the letters of an acronym or misspelled siphonophore (the most obvious example), Google Refine might be a godsend.
Data needs to be backed up, period. How it’s done is up for discussion so long as it’s done regularly and consistently. In this article I’m going to talk about rsync, my preferred technique for copying files from point A to point B, and how to use rsync in a data backup script.
The easiest way to copy files is to use the operating system’s built-in copy program. Every OS has one. Depending on the OS the copy program may have several options that alter how it behaves, but for the most part they all work the same. Couple the copy command with a scripting language and a scheduler and you have everything you need to set up automated file backups. There is one catch. Copy scripts only work well for smaller datasets, where the copy time for a directory of files ranges from seconds to just a few minutes. For larger datasets, where copying a complete folder can take the better part of an hour or more, there is reason to incorporate some efficiency.
The easiest way to reduce the backup time is to copy only the new files, thus eliminating the time wasted by copying previously backed up files again and again. However, what about the files that are not new but have changed since the last backup? What is really needed is a way to copy only the changes. So how do we script this?
What is rsync, and how can it help?
As if in answer to our data management prayers come Andrew Tridgell and rsync: an open source (read: free) command-line client/server that automatically identifies new and modified files and transfers only the changes. The way rsync works is pretty slick. By default it compares file size and modification time on both ends to identify which files are new or have changed. For files that have changed, rsync uses a rolling checksum to find the blocks the two copies already share and copies just the internal changes, improving efficiency dramatically. For more information on how rsync works please click here. What’s really great about rsync is that it provides the backup simplicity of copying a whole directory plus the time-saving efficiency of copying only the changes.
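To give a feel for the rolling checksum, here is a small sketch in POSIX shell. It follows the weak checksum described in the rsync technical report (two running sums, each kept modulo 65536), but it is only an illustration and not rsync's actual code. The function names bytes_of, weak_sum and roll are made up for this example, and the input is assumed to be plain ASCII.

```shell
M=65536   # both running sums are kept modulo 2^16

# Print the byte values of a string as decimal numbers.
bytes_of() {
    printf '%s' "$1" | od -An -v -tu1
}

# weak_sum STRING: compute both sums from scratch and print "a b".
weak_sum() (
    n=${#1}; a=0; b=0; i=0
    for x in $(bytes_of "$1"); do
        a=$(( (a + x) % M ))              # plain sum of the bytes
        b=$(( (b + (n - i) * x) % M ))    # position-weighted sum
        i=$(( i + 1 ))
    done
    echo "$a $b"
)

# roll A B N OLD NEW: slide an N-byte window one byte to the right,
# dropping character OLD and taking in character NEW, without rescanning.
roll() (
    old=$(( $(bytes_of "$4") ))
    new=$(( $(bytes_of "$5") ))
    a=$(( ($1 - old + new + M) % M ))
    b=$(( ($2 - $3 * old + a + $3 * M) % M ))
    echo "$a $b"
)

weak_sum "abcd"        # prints 394 980
roll 394 980 4 a e     # prints 398 990 (window slides to "bcde")
weak_sum "bcde"        # prints 398 990 (same answer, computed the slow way)
```

The point of the roll step is that moving the window costs a couple of additions no matter how large the block is, which is what makes it practical to look for matching blocks at every byte offset of a changed file.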
Rsync is a very powerful copy utility and has many options that dictate how it behaves. For example, there are flags for ensuring file owner, group, and modification times are preserved. There is a flag for tunneling the transfer through a secure shell (SSH) tunnel. There are flags for copying only the files in a directory whose names match (or don’t match) a wildcard pattern. The list of options is long and I highly recommend reading about them in the manual.
Rsync has been ported to all the major operating systems. Linux users should be able to find a pre-packaged version of rsync for most distros. For instance, Debian/Ubuntu users can install rsync via the following command:
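As a rough illustration of how those flags combine, the sketch below copies only the .hex files from a directory tree, preserving ownership and times (-a) and tunneling through SSH (-e ssh). The function name copy_hex_files, the RSYNC override variable, and the paths in the usage line are all hypothetical; the options themselves (-a, -e, --include, --exclude) are standard rsync options.

```shell
# RSYNC is a made-up override for this sketch: set RSYNC="echo rsync"
# to preview the command line instead of running it.
RSYNC=${RSYNC:-rsync}

# copy_hex_files SRC DEST
#   --include='*/'    descend into every subdirectory
#   --include='*.hex' keep files ending in .hex
#   --exclude='*'     skip everything else
copy_hex_files() {
    $RSYNC -a -e ssh --include='*/' --include='*.hex' --exclude='*' "$1" "$2"
}

# Example (hypothetical paths and account):
#   copy_hex_files /data/ user@host:/backup/
```

The order of the include and exclude rules matters, since rsync applies the first rule that matches each file.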
sudo apt-get install rsync
Mac users will find that rsync comes as part of OS X (tested on OS X 10.6.6); however, it is an older version. It is recommended that users install an updated version using the Darwin Ports (now MacPorts) or Fink package management systems. The command to install rsync using Fink is:
sudo fink install rsync
Windows users will need to download and install cwRsync.
Rsync can be run in two ways: as a client and as a server. When run as a client rsync acts much like a command-line copy program. Command-line arguments are used to control its behavior. By default the rsync client connects to an rsync server on port 873. This port may be blocked on more secure networks or when trying to transfer files over the Internet. As an alternative the rsync client can transfer data through an SSH tunnel using the more commonly open port 22. More on that a little bit later.
When rsync is run as a server, it listens for incoming rsync clients. Once a client connects, the server services the transfer requests. There is a different set of configuration options when rsync is run as a server. Rather than taking all of its configuration from command-line arguments, an rsync server also uses a configuration file (rsyncd.conf) to define its options and behavior. Among the many things defined in the configuration file are the names and locations of rsync share points (the rsync manual calls them modules). Rsync share points are equivalent to shared folders in Windows. A share point provides a shortcut to a directory on the server’s file system, making it more easily available to a connecting client.
Rather than go through all the available scenarios, I’m only going to describe the ways I use rsync on the ship I work on.
In my shipboard environment I have a workstation dedicated to the operation of the CTD and XBT (SHIP_CTD) and a shipboard data repository (SHIP_WAREHOUSE). Both computers are on the same subnet. The SHIP_CTD workstation is the rsync client. It is running Windows XP Pro. The SHIP_WAREHOUSE server is the rsync server. It is running Debian Linux version 5. I have set up an rsync share point on SHIP_WAREHOUSE called “databackup”. Here is the rsyncd.conf file:
max connections = 5
log file = /var/log/rsync.log
timeout = 300
[databackup]
comment = Data Backup Folder on SHIP_WAREHOUSE
path = /mnt/RAIDDisk/data
read only = no
list = yes
uid = nobody
gid = nogroup
hosts allow = 127.0.0.0/8 192.168.1.0/24
To start the rsync server on SHIP_WAREHOUSE I run the following command at startup. This command assumes the rsyncd.conf file is located in /etc:
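The command itself did not survive in this copy of the article. As a hedged stand-in, an rsync server is normally started with the --daemon option, which reads /etc/rsyncd.conf unless --config points somewhere else. The wrapper below is only a sketch: start_rsyncd and the RSYNC override variable are made-up names, not something from the original setup.

```shell
# RSYNC is a made-up override for this sketch: set RSYNC="echo rsync"
# to preview the command line instead of running it.
RSYNC=${RSYNC:-rsync}

# start_rsyncd [CONFIG]: start rsync in daemon (server) mode.
# Defaults to /etc/rsyncd.conf, the location assumed in the article.
start_rsyncd() {
    conf=${1:-/etc/rsyncd.conf}
    if [ ! -r "$conf" ]; then
        echo "cannot read rsync config: $conf" >&2
        return 1
    fi
    $RSYNC --daemon --config="$conf"
}

# Example: start_rsyncd            (uses /etc/rsyncd.conf)
```

Checking that the config file is readable first gives a clearer message than letting the daemon fail on its own during boot.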
To push all the data files from the c:\data\ folder on SHIP_CTD to the “databackup” rsync share point on SHIP_WAREHOUSE (192.168.1.42), I run the following commands from SHIP_CTD:
SET CYGWIN=nontsec
SET "PATH=%PATH%;c:\Program Files\cwRsync\bin"
rsync -a /cygdrive/c/data/ rsync://192.168.1.42:/databackup/
Here’s the breakdown:
- SET CYGWIN=nontsec – setting this variable ensures the rsync command doesn’t mess up any local file permissions.
- SET "PATH=%PATH%;c:\Program Files\cwRsync\bin" – adds the directory where the rsync executable lives to the execution path.
- rsync – the rsync executable.
- -a – the archive flag. This tells rsync to recursively copy all files and subdirectories from the source location to the destination location. It also tells rsync to preserve as much file information as possible, such as modification time, permissions, etc.
- /cygdrive/c/data/ – the source location, c:\data\. Because cwRsync runs in the cygwin UNIX environment you have to use cygwin’s directory notation: /cygdrive/ = the Windows filesystem, c/data/ = c:\data\.
- rsync://192.168.1.42 – the rsync server running on 192.168.1.42 (SHIP_WAREHOUSE).
- :/databackup/ – the destination location, an rsync share point called “databackup”.
The first time you run the command it performs just like a copy command, copying each file from the source to the destination. If you run the command a second time you should notice it runs much faster. That’s because rsync figured out the difference between the source and the destination and only copied the changes.
In addition to backing up files around the boat, I also regularly copy files back to the beach using the vessel’s VSAT satellite connection. The files travel over the satellite and across the Internet to a shore-based data repository (SHORE_WAREHOUSE). Firewall policies on shore and on the vessel prevent using straight rsync. This is where SSH comes into play. Using the -e ssh command-line argument I can transfer the data through an SSH tunnel. SSH uses port 22, which is much more commonly open than the rsync port and in my case was not blocked by the two firewalls.
Using SSH requires that I provide the username for a valid user account on the server. UNIX only: the destination directory must be owned by the remote user. It also requires that when the rsync client calls the ssh client I provide a password for the given username. To get around providing a password please refer to the OceanDataRat article on SSH Public Key Authentication. On the plus side, tunneling with SSH eliminates the need to run an rsync server; the SSH server does the job for us. The downside is that instead of being able to use the rsync share points, we must explicitly give the full destination path.
Here’s how the previous example changes:
SET CYGWIN=nontsec
SET "PATH=%PATH%;c:\Program Files\cwRsync\bin"
SET "HOME=%HOMEDRIVE%\Documents and Settings\Administrator"
rsync -a -e ssh /cygdrive/c/data/ username@192.168.1.42:/mnt/RAIDDisk/databackup/
The main differences are the addition of the -e ssh argument, the HOME environment variable, and the destination, which is now written as username@host:/full/path instead of an rsync:// share point (username is a placeholder for a valid account on the server). For more information on why the HOME variable must be set and how to eliminate the need for a password please refer to the OceanDataRat article on SSH Public Key Authentication.
For those of us who are paranoid about knowing whether or not files were backed up, rsync has a solution: the -i flag. The -i flag turns on the audit trail, providing in-depth information about how rsync treated each file and what errors were encountered. You can save this audit trail by redirecting the output of the rsync command to a log file.
The output of the audit trail looks as follows:
<f......... CTD/SBE911/EX1004L2_CAST02_20100523.hex
<f......... CTD/SBE911/EX1004L2_CAST02_20100523.hdr
<f......... CTD/SBE911/EX1004L2_CAST02_20100523.bl
<f+++++++++ CTD/SBE911/EX1004L2_CAST03_20100527.hex
<f+++++++++ CTD/SBE911/EX1004L2_CAST03_20100527.hdr
<f+++++++++ CTD/SBE911/EX1004L2_CAST03_20100527.bl
The files with the <f......... prefix are files that already existed at the remote server and did not require updating. The files with the <f+++++++++ prefix are files that did not exist at the remote server and were transferred. Please read the rsync manual for the full description of what each character of the 11-character prefix means.
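If you save the audit trail to a log, a few lines of shell can summarise it. The sketch below counts file entries whose 11-character prefix ends in +++++++++ (newly created on the server) separately from all other file entries. The function name summarize_itemized and the log file name in the usage line are made up for this example, and it assumes one itemized entry per line as in the sample output above.

```shell
# summarize_itemized: read rsync -i output on standard input and
# print how many file entries were new versus anything else.
summarize_itemized() {
    awk '
        length($0) > 12 {
            prefix = substr($0, 1, 11)          # the itemized-change flags
            if (substr(prefix, 2, 1) != "f")    # second character: f = regular file
                next
            if (substr(prefix, 3) == "+++++++++")
                created++
            else
                other++
        }
        END { printf "new=%d other=%d\n", created, other }
    '
}

# Example (hypothetical log file name):
#   summarize_itemized < backup.log
```

Directories and other non-file entries are skipped, so the counts only reflect regular files.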
When I use custom scheduled jobs to do these kinds of important tasks I like to know if the script ran successfully or not. The return value of the rsync command can quickly indicate if the transfer encountered any problems. I like to use this return code to trigger a system notification like a Growl message. Please refer to the OceanDataRat article on Growl notification for more information on integrating Growl notifications into scripts.
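As a small sketch of what checking the return value can look like, the function below turns a few of the exit values listed in the rsync manual into readable messages. The function name describe_rsync_exit is made up, only a handful of codes are covered, and the notification step (Growl or otherwise) is left out.

```shell
# describe_rsync_exit CODE: print a short description of an rsync exit value.
describe_rsync_exit() {
    case "$1" in
        0)  echo "success" ;;
        23) echo "partial transfer due to error" ;;
        24) echo "partial transfer due to vanished source files" ;;
        30) echo "timeout in data send/receive" ;;
        *)  echo "rsync failed with exit code $1 (see EXIT VALUES in the manual)" ;;
    esac
}

# Example: run rsync, then report on its exit code.
#   rsync -a /data/ rsync://192.168.1.42/databackup/
#   describe_rsync_exit $?
```

Anything other than 0 is worth a notification; codes 23 and 24 usually mean most of the data made it across but some files need a second look.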
With the basic building blocks in place it’s time to pull everything together and build a script we can pass to the Windows scheduler. Here is my final script. Unlike most of my scripts this one’s for Windows only. It handles all of the options discussed, including SSH tunneling, logging and Growl notifications. You will need to tweak the script variables for your particular setup, and install cwRsync and Growl on your local system and an SSH server on your remote system. You will also need to set up your SSH public/private keys. I’ve included a lot of documentation in the script that I hope makes it easy to tweak for your particular setup, as well as to disable some of the bells and whistles like SSH, logging and notifications.
Once you’ve modified the script for your particular needs, run it a couple of times from the command prompt to ensure it’s performing as expected. When satisfied, add it to the Windows scheduler and you should be well on your way towards a more automated data management setup. In the end, I just hope it helps.
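The Windows script itself is not reproduced in this copy of the article. For readers on Linux or OS X, the sketch below shows the same general shape in POSIX shell: run rsync with the archive and audit-trail flags, append everything to a log, and hand the exit code back to the caller so a scheduler or notifier can act on it. This is not the author's script. The function name run_backup, the RSYNC override variable, and the paths in the usage line are all made up for illustration.

```shell
# RSYNC is a made-up override for this sketch: set RSYNC="echo rsync"
# to preview the command line instead of running it.
RSYNC=${RSYNC:-rsync}

# run_backup SRC DEST LOGFILE
# Appends the rsync audit trail (-i) to LOGFILE and returns rsync's exit code.
run_backup() {
    src=$1
    dest=$2
    log=$3
    {
        echo "=== backup started: $(date) ==="
        $RSYNC -a -i "$src" "$dest"
        rc=$?
        echo "=== backup finished with exit code $rc ==="
    } >> "$log" 2>&1
    return "$rc"
}

# Example (hypothetical paths), suitable for a cron job:
#   run_backup /data/ rsync://192.168.1.42/databackup/ /var/log/databackup.log \
#       || echo "backup failed, check the log" >&2
```

To tunnel through SSH instead of talking to an rsync server, add -e ssh to the rsync line and give the destination as user@host:/full/path.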