Google Refine… pretty awesome.

Ok, here’s another shout out to the guys and gals in Mountain View, CA.  For anyone who’s ever had processing scripts fail because someone on the data entry side of the house didn’t use caps, misplaced a decimal point, mixed up the letters of an acronym or misspelled siphonophore (the most obvious example), Google Refine might be a godsend.

Basically it’s a web-based spreadsheet program in the spirit of the Google Docs Spreadsheet program but with a couple of serious tweaks.

  1. With one click Google Refine lists each distinct value in a column along with the number of rows in which it appears.  Here’s an example: Let’s say you have an event logger for recording ROV observations.  On 999 rows the user enters “fish”, on 123 rows “fsh”, on 12 rows “FSH” and on 54 rows “Fish”.  Using Google Refine you would see all these variations at a glance and could quickly make that column read “fish” in every row.
  2. For numerical data Google Refine can perform some basic statistical analysis to find outliers that may have been caused by misplacing a decimal or entering text instead of numbers.
  3. Although it’s web-based, Google Refine runs as a local application.  That means that, unlike Google Docs, you don’t have to upload your data to Google and it works without an Internet connection.  Both are very important since… you might be working with proprietary data, on a ship, with a 128kB internet connection.  It’s also cross-platform (Windows, Linux and OS X).
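To make the first point concrete, here is a rough Python sketch of the idea behind that value listing.  It is not Google Refine’s code or API, just the same counting trick applied to the ROV example; the function names facet_counts and merge_variants are made up for illustration.

```python
from collections import Counter


def facet_counts(values):
    """Count how many rows hold each distinct value, most common first."""
    return Counter(values).most_common()


def merge_variants(values, variants, canonical):
    """Replace every value listed in `variants` with `canonical`."""
    variants = set(variants)
    return [canonical if value in variants else value for value in values]


# The ROV event-logger column from the example above.
column = ["fish"] * 999 + ["fsh"] * 123 + ["FSH"] * 12 + ["Fish"] * 54

for value, rows in facet_counts(column):
    print(f"{value!r}: {rows} rows")

# Collapse the three variants into the spelling we actually want.
cleaned = merge_variants(column, {"fsh", "FSH", "Fish"}, "fish")
```

Seeing the counts first is the useful part: the rare spellings stand out immediately, and the merge is then a one-liner.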

Here’s a screencast that will probably do a better job explaining than I ever could:

 Thanks to Eric Martin from MBARI for pointing me to this.

I hope this helps.

Want to talk about this some more? Please post your questions in the Forums

Posted in Post Processing, Software

Heading to RVTEC Conference in New Orleans, LA.

Next week is one of my favorite conferences, the annual UNOLS RVTEC meeting.  Not to downplay the importance of science but this is where ship operators, marine techs and us datarats get to see what each other is up to and talk shop.  This year it’s in New Orleans, LA.  I’ll be going on behalf of the NOAA Office of Ocean Exploration and Research and the Okeanos Explorer Program but I’d welcome some nuts and bolts datarat discussions over a brew afterwards.

See you there!
- Webb

Meeting Website

Posted in General

Using RSS to Monitor Data Transfers.

I got this idea from a colleague down at Stennis Space Center about a year ago.  He said “Wouldn’t it be nice if we could know when data arrives on the server the same way we get notified about online news articles?”  The light bulb went on and pretty much exploded.  And why try to replicate the functionality?  Just use the same technology to publish data transfers to the web.  The technology I’m referring to is Really Simple Syndication (RSS), a dirt-simple way to publish information that allows anyone to subscribe to receive news updates on all sorts of platforms (browsers, news readers, email clients).

What is RSS?

The last line of the previous paragraph pretty much sums up what RSS does.  How it works is, as the name implies, real simple.  An RSS feed is just an XML-based text file hosted on a web server.  The file must adhere to the RSS specification, and because that format is standardized, all kinds of programs have been written to interpret and display RSS articles.

Here’s the basic layout of an RSS file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<atom:link href="http://tethys.gso.uri.edu/data/rss.xml" rel="self"
 type="application/rss+xml" />
<title>EX Data Transfer RSS Feed</title>
<link>http://tethys.gso.uri.edu/data/rss.xml</link>
<description>This RSS feed provides updates to when files are synced to the
 Shore-based Redistribution Server</description>
<language>en-us</language>
<copyright>Copyright (C) 2011 Okeanos Explorer Program</copyright>
<item>
<title>EX1106 Data Upload Update - Tue, 27 Sep 2011 13:30:57 UTC</title>
<description>
<![CDATA[
<p>Added new file: EX1106/CTD/XBT/EX1106_XBT94_110927.EDF</p>
<p>   13 files updated in ./EX1106/SCSData/NAV</p>
]]>
</description>
<pubDate>Tue, 27 Sep 2011 13:30:57 UTC</pubDate>
</item>
</channel>
</rss>

The breakdown:

  • Line 1 is the XML declaration.  It identifies the file as XML and states its character encoding; keep it as the very first line.
  • Line 2 is the main container; everything within the <rss> container is interpreted as part of the RSS feed.
  • Line 3 is the container for an RSS channel, think TV channels.  RSS 2.0 allows exactly one channel per feed.
  • Lines 4-5 point back to the feed’s own URL using an element borrowed from Atom, an alternative syndication format.  Feed validators recommend including it.
  • Line 6 is the title of the RSS feed.
  • Line 7 is the channel link.  RSS 2.0 intends this to be the URL of the website the feed belongs to; here it simply points at the feed itself.
  • Lines 8-9 are the description for the feed, a.k.a. what the RSS feed is propagating.
  • Line 10 is the language of the feed.
  • Line 11 is the copyright info.
  • Line 12 is the opening tag for an item (article).
  • Line 13 is the title of the item (article).
  • Line 14 is the opening tag for the meat of the article.
  • Lines 15-18 are the meat of the article.  This RSS feed uses HTML-style tags for formatting the text.  This is not required but for the RSS feed I set up it made things easier.  To use HTML-style formatting add “<![CDATA[” at the beginning and “]]>” at the end of the text block.
  • Line 19 is the closing tag for the description.
  • Line 20 is the publishing date/time.  RSS expects the RFC 822 date layout shown here (day name, day, month, year, time, time zone).
  • Line 21 is the closing tag for the item (article).  At this point additional items can be added.
  • Lines 22-23 are the closing tags for the channel and the rss feed.
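As a rough illustration of lines 12-21, here is a small Python sketch that assembles one <item> block.  It is not the bash script linked further down; the function name rss_item is made up, and the sketch writes the time zone as GMT (one of the zone names RFC 822 defines) where the sample feed above writes UTC.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from xml.sax.saxutils import escape


def rss_item(title, html_body, when):
    """Return one <item> block as text, with the HTML body wrapped in CDATA."""
    if "]]>" in html_body:
        raise ValueError("body must not contain the CDATA terminator ']]>'")
    # format_datetime produces the RFC 822 style date that RSS expects.
    pub_date = format_datetime(when, usegmt=True)
    return (
        "<item>\n"
        f"<title>{escape(title)}</title>\n"
        "<description>\n"
        f"<![CDATA[\n{html_body}\n]]>\n"
        "</description>\n"
        f"<pubDate>{pub_date}</pubDate>\n"
        "</item>\n"
    )


example = rss_item(
    "EX1106 Data Upload Update",
    "<p>Added new file: EX1106/CTD/XBT/EX1106_XBT94_110927.EDF</p>",
    datetime(2011, 9, 27, 13, 30, 57, tzinfo=timezone.utc),
)
print(example)
```

The title is escaped because it sits in plain XML, while the body is left alone because CDATA tells the parser not to interpret it.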

Now back to the original problem…

The Okeanos Explorer transfers all collected data (sans raw multibeam data and high-definition video) to shore via satellite every hour.  The collection, cataloging, checksum generation and upload all happen auto-magically.

The participants on shore are dependent on this data flow to stay in the know of the ship’s findings as well as being able to actively participate in the exploration.  This data dependency created one of the most asked questions… “is it (data) there yet?”

Enter RSS.  By creating an RSS feed based on the successful transfer of data from the ship to shore, shore-side participants are almost instantly informed of when new data has arrived.

How I did it:

For each hourly transfer to shore (via rsync) there is a corresponding log file.  The log file is created by rsync using the “-i” flag.  This produces a list of the files in the source directory and how each file was interpreted (i.e. as a new file, an updated file or unchanged).  I include the Cruise ID and the date/time of the transfer in the log file name (e.g. EX1104_Transfer_to_Shore_20110810T093000Z).  My script uses these to populate the <title> and <pubDate> fields in the RSS article.
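Pulling the Cruise ID and timestamp back out of a log file name like that might look like the following Python sketch.  The regular expression is an assumption based on the single example name above (an optional file extension is tolerated), and the title wording is copied from the sample feed; the real bash script may do this differently.

```python
import re
from datetime import datetime, timezone

# Assumed layout: <cruise id>_Transfer_to_Shore_<YYYYMMDD>T<HHMMSS>Z[.ext]
LOG_NAME = re.compile(
    r"^(?P<cruise>[^_]+)_Transfer_to_Shore_(?P<stamp>\d{8}T\d{6})Z(?:\.\w+)?$"
)


def parse_log_name(name):
    """Return (cruise_id, transfer_time_utc) taken from a log file name."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name!r}")
    when = datetime.strptime(match["stamp"], "%Y%m%dT%H%M%S")
    return match["cruise"], when.replace(tzinfo=timezone.utc)


def article_title(cruise, when):
    """Build a title in the style used by the sample feed."""
    return f"{cruise} Data Upload Update - {when:%a, %d %b %Y %H:%M:%S} UTC"


cruise, when = parse_log_name("EX1104_Transfer_to_Shore_20110810T093000Z")
print(article_title(cruise, when))
```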

After a successful data transfer I upload the corresponding rsync log files to a specific directory on the shore-based server.  Once a file arrives on the shore-side server a bash script processes the log file into an RSS article (<item></item>) and adds the article to the beginning of the RSS feed and presto, within seconds of the data arriving, the users are made aware.  After the log file is processed it is moved to a backup directory so that it is not processed again.

In order to minimize the length of each article I only show what’s new and what has changed.  For new files I list the file name and full path. For updated files I list the directory name and number of files that were updated.
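Here is a hedged Python sketch of that summarizing step.  It assumes the log lines follow rsync’s itemized layout (an 11-character flag field, a space, then the path), treats a “+” in the flags as a new file, treats all dots as unchanged, and counts anything else as an update.  The function name summarize_itemized is made up and this is not the bash script itself.

```python
from collections import Counter
import posixpath


def summarize_itemized(lines):
    """Split `rsync -i` lines into new files and per-directory update counts."""
    new_files = []
    updated_dirs = Counter()
    for line in lines:
        line = line.rstrip("\n")
        flags, path = line[:11], line[12:]
        # Only regular files that were sent (<) or received (>) are of interest.
        if len(flags) < 11 or flags[0] not in "<>" or flags[1] != "f" or not path:
            continue
        attributes = flags[2:]
        if "+" in attributes:
            new_files.append(path)  # the file did not exist at the destination
        elif attributes.strip(". "):
            updated_dirs[posixpath.dirname(path)] += 1  # something changed
        # Otherwise the file is listed but unchanged, so it is left out.
    return new_files, dict(updated_dirs)


log = [
    "<f" + "." * 9 + " CTD/SBE911/EX1004L2_CAST02_20100523.hex",
    "<f" + "+" * 9 + " CTD/SBE911/EX1004L2_CAST03_20100527.hex",
    "<f.st...... SCSData/NAV/file_a.Raw",
    "<f.st...... SCSData/NAV/file_b.Raw",
    "cd" + "+" * 9 + " CTD/XBT/",
]
new_files, updated = summarize_itemized(log)
for path in new_files:
    print(f"Added new file: {path}")
for directory, count in updated.items():
    print(f"{count} files updated in ./{directory}")
```

Reporting updates per directory keeps an hourly article short even when hundreds of navigation files change at once.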

Once the RSS file is created I save it on the Okeanos Explorer’s shore-side web server for the shore-side team to see.  Take a look.

Caveats

Satellite communications at sea can be flaky due to faulty equipment, tracking issues and weather.  This causes the data transfers to periodically fail.  As part of the Okeanos Explorer’s hourly data transfer scripts, the rsync command is called repeatedly until the entire transfer completes or it has been tried five times, whichever comes first.  Each rsync call produces a new rsync log file.  At the end of a successful transfer, all of the logs are sent to shore.  To account for this I wrote a script that batch processes any and all rsync log files within a directory.
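The retry logic can be sketched in a few lines of Python.  This is only an illustration of “try up to five times, stop on the first success”, not the ship’s actual transfer script; the runner argument is a made-up hook that lets the loop be exercised without a satellite link.

```python
import subprocess


def run_with_retries(command, max_attempts=5, runner=subprocess.call):
    """Run `command` until it exits with 0 or `max_attempts` is reached.

    Returns the list of exit codes, one per attempt, so the caller can
    tell how many tries were needed and whether the last one succeeded.
    """
    codes = []
    for _ in range(max_attempts):
        code = runner(command)
        codes.append(code)
        if code == 0:  # rsync exits with 0 when the transfer completed
            break
    return codes


# Simulate a link that drops twice before the transfer gets through.
outcomes = iter([12, 30, 0])
attempts = run_with_retries(["rsync", "..."], runner=lambda command: next(outcomes))
print(attempts)
```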

The Code

Here is the bash script I use to process the rsync log files: download.

Here is the root RSS file that the articles are added to: download.  I used this file only the first time I ran the script.  To add articles to an existing RSS feed you need to run the script against the most recent version of the feed.

Here’s the script I used to batch process a directory of rsync log files: download.

Both of the scripts are heavily documented so if you have any issues running them please take a look at the comments.

I hope this helps.

Want to talk about this some more? Please post your questions in the Forums.

Posted in Data Management

Consolidated Web-based management of backup scripts running on remote computers.

Here’s a technique I developed two years ago to streamline the process of consolidating datasets collected on multiple computers.  Using some batch scripts, PHP code and some open source tools I created a simple web-based management system for controlling my regular data backup tasks.

Problem: The data collection landscape on the Okeanos Explorer is not the simplest.  There are separate workstations for each of the major collection systems (CTD/XBT, SCS, Multibeam, EK60, etc).  At the beginning of each cruise the ship’s survey techs create new folders on each of the collection workstations for that system’s datatype (e.g. SBE SeaSave, SIS, SCS).  For size reasons (e.g. multibeam data) some of the collection workstations store their collected data on a shared drive (i.e. NetApps storage array).  In these cases the folder is created on the shared drive.  Storing data on the shared drive is the exception rather than standard practice, to avoid depending on network resources.  Typically data remains on the collection workstation to improve the performance of additional product development (e.g. creating maps, calculating SVPs, plotting SST, etc) or for data comparison.

The ship needed a unified data management solution that met the following requirements:

  1. All the collected data needed to be consolidated consistently, on a cruise-by-cruise basis, to a single point.
  2. The solution had to be flexible enough to accommodate an ever-changing list of collection points.
  3. It needed to be simple enough that it could be managed with a minimum amount of the survey tech’s time.
  4. It needed to be platform independent.
  5. It needed to enforce data management policies (i.e. naming conventions) where possible.

Solution: The first thing that was needed was a consolidated collection point, something the Okeanos Explorer calls the shipboard data warehouse (Warehouse).  The Warehouse has enough storage for all datasets (sans raw High Definition (HD) video and raw multibeam data) for an entire field season (~800GB).  The hardware is reasonably fault-tolerant: Dell PowerEdge 2950 Server, rack-mount, dual NICs, dual power supplies, 8 hot-swappable 150GB SAS drives connected to a hardware RAID controller.  The server is running Debian 6 (Linux).

We used rsync batch scripts called as scheduled tasks (Linux/Mac used BASH scripts and cron jobs) to transfer the data from the collection computer to the Warehouse (Refer to Using RSYNC to Efficiently Backup Data).  The rsync jobs run every hour.  We use rsync’s --include/--exclude arguments to enforce naming conventions.  The rsync jobs are tailored such that the data from each collection point is copied to the standardized directory location on the Warehouse regardless of the original directory name.  A new directory structure is created on the Warehouse for each cruise that contains the cruise id (e.g. EX1104).  Error checks are performed in the scripts wherever possible.  Any errors as well as successes are reported to the collection workstation as Growl notifications (Refer to Using GrowlNotify to Send System-wide Notifications From Scripts).

Having all the data in one place was extremely useful to ship’s crew and science alike.  This prompted us to make the consolidated datasets publicly available (read-only) via FTP, SMB and HTTP.  To quell security concerns we moved the Warehouse to the visitor’s network and altered the backup scripts to use SSH tunneling for the transfers (Refer to Setting Up SSH Public Key Authentication).

All of the backup scripts behave based on a centralized configuration file.  The configuration file contains all of the local and remote (on the Warehouse) directory names.  The configuration file lives on the Warehouse and is accessed by the collection workstations via http using the wget utility (wget for Windows).  Once the file is downloaded, the variables are loaded into the shell environment, the local copy of the file is immediately deleted (for security) and the script does its job.  When the script completes (success or fail) the variables are erased (again for security).
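For readers who would rather see the idea in code, here is a minimal Python sketch of fetching and loading such a configuration file.  The KEY=VALUE file layout and the variable names in the example are assumptions (the real format is defined by the scripts linked at the end of this article), and instead of wget plus a temporary file the sketch reads the HTTP response straight into memory, so there is no local copy to delete.

```python
from urllib.request import urlopen


def parse_config(text):
    """Turn KEY=VALUE lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def fetch_config(url, timeout=30):
    """Download the master configuration and parse it without touching disk."""
    with urlopen(url, timeout=timeout) as response:
        return parse_config(response.read().decode("utf-8"))


# Hypothetical contents, for illustration only.
sample = """
# master backup configuration
CRUISE_ID=EX1104
CTD_LOCAL_DIR="c:\\data\\ctd"
CTD_REMOTE_DIR=CTD
"""
settings = parse_config(sample)
print(settings["CRUISE_ID"])
```

Keeping the settings in a dictionary that goes out of scope when the script ends plays the same role as erasing the shell variables.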

Figure: Dataflow diagram showing the flow of configuration files and data.

Management of the configuration file is web-based.  A secure website (via .htaccess) running on the Warehouse contains a web form for altering the master configuration file.  At the beginning of each cruise the survey tech updates the directory information as required and hits a “save” button.  The next time the scripts run the new variables will be applied.

Figure: Web-based configuration file control.


Additional bells and whistles: While in port the ship’s collection systems are turned off.  The scheduled tasks used for the automatic backups, however, are not.  There are two ways to handle this.

  1. Go around to each of the collection workstations and disable all the scheduled tasks.
  2. Use the central configuration file to disable the backup scripts.

We went with the latter approach.  A variable in the central configuration file serves as the master switch that will prevent the script from reaching the rsync command.

The configuration file also contains the cruise ID which is used as part of the ship’s naming convention and is the name of the top-level cruise directory.
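A sketch of how a backup script might use those two settings is below.  It is written in Python for illustration and the key names BACKUPS_ENABLED, WAREHOUSE_ROOT and CRUISE_ID are hypothetical; the sample script linked below uses its own variable names.

```python
import posixpath


def plan_backup(settings, dataset_dir):
    """Return the Warehouse destination path, or None if backups are switched off."""
    # The master switch: anything other than "yes" stops before rsync is reached.
    if settings.get("BACKUPS_ENABLED", "no").lower() != "yes":
        return None
    # The cruise ID names the top-level cruise directory on the Warehouse.
    return posixpath.join(settings["WAREHOUSE_ROOT"], settings["CRUISE_ID"], dataset_dir)


at_sea = {"BACKUPS_ENABLED": "yes", "WAREHOUSE_ROOT": "/mnt/RAIDDisk/data", "CRUISE_ID": "EX1104"}
in_port = dict(at_sea, BACKUPS_ENABLED="no")

print(plan_backup(at_sea, "CTD"))
print(plan_backup(in_port, "CTD"))
```

Defaulting to “off” when the switch is missing is a deliberate choice in this sketch: a half-edited configuration file then stops the backups rather than sending data to the wrong cruise directory.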

The Code:

Here’s the code for the website, big thanks to friend and honorary datarat from MBARI, Eric Martin: download.  Unzip this into the document root folder for your website (i.e. /var/www).  You will need to open index.php and set the $batFile variable for your particular installation.

Here’s a sample backup script that uses the method described in this article: download.  You will need to change the HOMEPATH, CONFIG_FILENAME and CONFIG_URL variables for your particular installation.

I hope this helps.

Want to talk about this some more? Please post your questions in the Forums.

Posted in Datalogging

Cable Management via Paperclip

Here’s a trick I use to help keep Mac workstations tidy.

Problem: How to manage the mouse cable connected to the keyboards on Mac workstations.

Solution: Use a paper clip as a reusable wire tie.  If you take your time it doesn’t have to look tacky.

  1. Acquire a paper clip… shouldn’t be too difficult.
  2. Straighten the paperclip.
  3. Use needle nose pliers to fold the ends of the clip back on themselves.  You only need about 1/16″.
  4. Bundle the mouse cable into a ~3″ loop, leaving 3″ of unbundled cable on both sides.
  5. Wrap the paper clip in a spiral pattern around the bundle.
  6. Show your supervisor just how resourceful/OCD you are.


Posted in General

Using GrowlNotify to Send System-Wide Notifications from Scripts.

I do a lot of scripting and use a lot of automated tasks.  While I like that I have the computers working for me (instead of the other way around) it’s nice to be able to monitor how things are going, when jobs complete and most importantly when things go awry.  There are many ways to monitor jobs including automatic emails, log files, RSS feeds (I hope to do an article on this soon) and my favorite method: Growl notifications.

Growl is a notification framework.  It gives other programs a way to submit messages to a queue which Growl sends to the user as elegant and uniform notifications.  Each notification includes three pieces of information: a title, a message and an icon.  By default the messages appear somewhere on the screen (usually bottom-right) and after 5 seconds disappear.  Notifications can be set as “sticky”, meaning the notification will not disappear after 5 seconds.  If more than 1 message comes in at a time the notifications get stacked vertically up the right side of the screen.

Growl was originally developed just for Mac and included plug-ins for several of OS X’s built-in programs including iTunes, Mail and Safari.  It also included a command-line interface (CLI) program called growlnotify.  Growlnotify allowed growl messages to be triggered from the terminal app or from within shell scripts.  That opened the door for elegant notifications from script-based scheduled tasks.  The Windows community saw the usefulness in this unified notification engine and soon after both growl and growlnotify were ported to Windows (available here).  This opened the door for triggering growl notifications from batch files and powershell scripts.

Usage

Mac Users

Here’s a simple example of how to call growlnotify from a Mac bash script.  In this example the growl notification informs the user that a file transfer has completed and the exact time of the event.

GROWLDATE=`date +"%A, %h %d %Y %H:%M:%S %Z"`
growlnotify -s -a terminal -t "Transfer Complete" -m "${GROWLDATE}"

Here’s the breakdown:

  • GROWLDATE=`date +"%A, %h %d %Y %H:%M:%S %Z"` – This takes the current system date/time and reformats it to look like “Wednesday, Feb 16 2011 21:06:28 EST”.  Note the capital %M for minutes; a lowercase %m would print the month number instead.
  • growlnotify – the name of the CLI program that triggers the growl message
  • -s – set the notification as sticky
  • -a terminal – set the notification icon the same as the icon used by Terminal.app
  • -t "Transfer Complete" – set the title of the notification
  • -m "${GROWLDATE}" – set the notification message to the date string created earlier

Windows Users

Same scenario as the Mac example but for a Windows batch file.

SET "GROWLDATE=%date:~-4,4%/%date:~-10,2%/%date:~7,2% %time:~-11,2%:%time:~-8,2%:%time:~-5,2%"
growlnotify.com /s:true /i:"c:\growlicons\success.png" /t:"Transfer Complete" "%GROWLDATE%"

Here’s the breakdown:

  • SET "GROWLDATE=%date:~-4,4%/%date:~-10,2%/%date:~7,2% %time:~-11,2%:%time:~-8,2%:%time:~-5,2%" – This takes the current system date/time and reformats it to look like “2011/02/16 21:06:28”.  The substring offsets assume the US date layout (e.g. Wed 02/16/2011), so adjust them if your regional settings differ.
  • growlnotify.com – the name of the CLI program that triggers the growl message
  • /s:true – set the notification as sticky
  • /i:"c:\growlicons\success.png" – set the notification icon to the image provided
  • /t:"Transfer Complete" – set the title of the notification
  • "%GROWLDATE%" – set the notification message to the date string created earlier

Recommended Practices

  • When developing a new script I put in a lot of notifications.  As I gain confidence in the script I disable them.
  • I make all error-related notifications sticky.  This way I won’t miss one if I’m away from my desk for several hours (I gotta sleep at some point, right?)
  • I use different icons for notifying errors and successes.  I make sure the icons are in stark contrast to each other, which helps quickly identify problems.

Here are the Success and Fail images I typically use with growlnotify

Success

Fail

I hope this helps.

Want to talk about this some more? Please post your questions in the Forums.

Posted in General

Using Rsync to Efficiently Backup Data

Overview

Data needs to be backed up, period.  How it’s done is up for discussion so long as it’s done regularly and consistently.  In this article I’m going to talk about rsync, my preferred technique for copying files from point A to point B, and how to use rsync in a data backup script.

The easiest way to copy files is to use the operating system’s built-in copy program.  Every OS has a file copy program that copies files from one place to another.  Depending on the OS the copy program may have several options that alter how the program behaves but for the most part they all work the same.  Couple the copy command with a scripting language and a scheduler and you have everything you need to set up automated file backups.  There is one catch.  Copy scripts only work well for smaller datasets where the copy time for a directory of files ranges from seconds to just a few minutes.  For larger datasets where copying a complete folder can take the better part of an hour or more there is reason to incorporate some efficiency.

The easiest way to reduce the backup time is to copy only the new files, thus eliminating the time wasted by copying previously backed up files again and again.  However, what about the files that are not new but have changed since the last backup?  What is really needed is a way to copy only the changes.  So how do we script this?

What is rsync, how can it help

As if in answer to our data management prayers come Andrew Tridgell and rsync, an open source (read: free) command-line client/server that automatically identifies any new and/or modified files and transfers only the changes.  The way rsync works is pretty slick.  By default it spots new or changed files by comparing file sizes and modification times on the two sides.  For files that have changed, rsync uses a rolling checksum to work out which parts of the file differ and copies just those internal changes, improving efficiency dramatically.  For more information on how rsync works please click here.  What’s really great about rsync is it provides the backup simplicity of copying a whole directory plus the time-saving efficiency of copying only the changes.

Rsync is a very powerful copy utility and has many options that dictate how it behaves.  For example, there are flags for ensuring file owner, group, and modification times are preserved.  There is a flag for tunneling the transfer through a secure shell (SSH) tunnel.  There are flags for only copying files whose names match a wildcard pattern.  The list of options is long and I highly recommend reading about them in the manual.
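The “rolling” part is what makes the checksum cheap enough to slide along a file one byte at a time.  Below is a simplified Python sketch of the weak rolling checksum described in the rsync technical report; real rsync pairs it with a stronger hash to confirm matches and is written in C, so treat this purely as an illustration (the function names are made up).

```python
M = 1 << 16  # both halves of the checksum are kept modulo 2**16


def weak_checksum(block):
    """Compute the two checksum halves (a, b) for a block of bytes from scratch."""
    a = sum(block) % M
    # Earlier bytes carry more weight: the first is multiplied by len(block), the last by 1.
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop `out_byte` from the front, add `in_byte` at the end."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


data = b"The quick brown fox jumps over the lazy dog"
size = 8

# Roll across the data and confirm every step matches a fresh computation.
a, b = weak_checksum(data[:size])
matches = True
for start in range(1, len(data) - size + 1):
    a, b = roll(a, b, data[start - 1], data[start + size - 1], size)
    matches = matches and (a, b) == weak_checksum(data[start:start + size])
print(matches)
```

Each slide costs a couple of additions instead of re-reading the whole block, which is why rsync can afford to test every byte offset of a large file for blocks the other side already has.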

Installing rsync

Rsync has been ported to all the major operating systems.  Linux users should be able to find a pre-packaged version of rsync for most distros.  For instance, Debian/Ubuntu users can install rsync via the following command:

sudo apt-get install rsync

Mac users will find that rsync comes as part of OS X (tested on OS X 10.6.6) however it is an older version.  It is recommended that users install an updated version using the Darwin Ports or Fink package management systems.  The command to install rsync using Fink is:

sudo fink install rsync

Windows users will need to download and install cwrsync.

Using rsync

Rsync can be run in two ways: as a client and as a server.  When run as a client rsync acts much like a command-line copy program.  Command-line arguments are used to control its behavior.  When connecting to an rsync server the client uses port 873 by default.  This port may be blocked by more secure networks or when trying to transfer files over the Internet.  As an alternative the rsync client can transfer data through an SSH tunnel using the more popular port 22.  More on that a little bit later.

When rsync is run as a server, it listens for incoming rsync clients.  Once a client connects to the server, the server services the transfer requests.  There is a different set of configuration options when rsync is run as a server.  Rather than pass all of the configuration data via command-line arguments, an rsync server also uses a configuration file (rsyncd.conf) to define its options and behavior.  Among the many settings defined in the configuration file are the names and locations of the rsync share points.  Rsync share points are equivalent to the shared folders in Windows.  The share points provide a shortcut to a directory on the server’s file system, making it more easily available to a connecting client.

Rather than go through all the available scenarios, I’m only going to describe the ways I use rsync on the ship I work on.

In my shipboard environment I have a workstation dedicated to the operation of the CTD and XBT (SHIP_CTD) and a shipboard data repository (SHIP_WAREHOUSE).  Both computers are on the same subnet.  The SHIP_CTD workstation is the rsync client.  It is running Windows XP Pro.  The SHIP_WAREHOUSE server is the rsync server.  It is running Debian Linux version 5.  I have set up an rsync share point on SHIP_WAREHOUSE called “databackup”.  Here is the rsyncd.conf file:

max connections = 5
log file = /var/log/rsync.log
timeout = 300
[databackup]
comment = Data Backup Folder on SHIP_WAREHOUSE
path = /mnt/RAIDDisk/data
read only = no
list = yes
uid = nobody
gid = nogroup
hosts allow = 127.0.0.0/8 192.168.1.0/24

To start the rsync server on SHIP_WAREHOUSE I run the following command at startup, this command assumes the rsyncd.conf file is located at /etc:

rsync --daemon

To push all the data files from the c:\data\ folder on SHIP_CTD to the “databackup” rsync share point on SHIP_WAREHOUSE (192.168.1.42), I run the following commands from SHIP_CTD:

SET CYGWIN=nontsec
SET "PATH=%PATH%;c:\Program Files\cwRsync\bin"
rsync -a /cygdrive/c/data/ rsync://192.168.1.42/databackup/

Here’s the breakdown:

  • SET CYGWIN=nontsec – Setting this variable ensures the rsync command doesn’t mess up any local file permissions.
  • SET "PATH=%PATH%;c:\Program Files\cwRsync\bin" – adds the directory where the rsync executable lives to the execution path
  • rsync – the rsync executable
  • -a – the archive flag.  This tells rsync to recursively copy all files and subdirectories from the source location to the destination location.  It also tells rsync to preserve as much file information as possible such as modification time, permissions, etc
  • /cygdrive/c/data/ – the source location c:\data\.  Because cwRsync uses the cygwin UNIX environment you have to use cygwin’s directory notation.  /cygdrive/ = the Windows filesystem, c/data/ = c:\data\.
  • rsync://192.168.1.42 – the rsync server running on 192.168.1.42 (SHIP_WAREHOUSE)
  • /databackup/ – the destination location, an rsync share point called “databackup”

The first time you run the command it performs just like a copy command, copying each file from the source to the destination.  If you run the command a second time you should notice it runs much faster.  That’s because rsync figured out the difference between the source and the destination and only copied the changes.

Adding Security

In addition to backing up files around the boat, I also regularly copy files back to the beach using the vessel’s VSAT satellite connection.  The files travel over the satellite, and across the internet to a shore-based data repository (SHORE_WAREHOUSE).  Firewall policies on shore and on the vessel prevent using straight rsync.  This is where SSH comes into play.  Using the -e ssh command-line argument I can transfer the data through an SSH tunnel.  SSH uses port 22, which is much more commonly open than the rsync port and in my case was not blocked by the two firewalls.

Using SSH requires that I provide the username for a valid user account on the server.  UNIX Only: the destination directory must be owned by the remote user.  It also requires that when the rsync client calls the ssh client I provide a password for the given username.  To get around providing a password please refer to the OceanDataRat article on SSH Public Key Authentication.  On the plus side, tunneling with SSH eliminates the need to run an rsync server; the SSH server does the job for us.  The downside is that instead of being able to use the rsync share points, we must explicitly give the full destination path.

Here’s how the previous example changes (datarat is a placeholder for your user account on the server):

SET CYGWIN=nontsec
SET "PATH=%PATH%;c:\Program Files\cwRsync\bin"
SET "HOME=%HOMEDRIVE%\Documents and Settings\Administrator"
rsync -a -e ssh /cygdrive/c/data/ datarat@192.168.1.42:/mnt/RAIDDisk/databackup/

The main differences are the addition of the -e ssh argument, a destination written as user@host:/full/path instead of an rsync:// URL, and setting the HOME environment variable.  For more information on why the HOME variable must be set and how to eliminate the need for a password please refer to the OceanDataRat article on SSH Public Key Authentication.

Adding Logging

For those of us that are paranoid about knowing whether or not files were backed up, rsync has a solution: the -i flag.  The -i flag turns on the audit trail, providing in-depth information about how rsync treated each file and what errors were encountered.  You can save this audit trail by redirecting the output from the rsync command to a log file.

The output of the audit trail looks as follows:

<f......... CTD/SBE911/EX1004L2_CAST02_20100523.hex
<f......... CTD/SBE911/EX1004L2_CAST02_20100523.hdr
<f......... CTD/SBE911/EX1004L2_CAST02_20100523.bl
<f+++++++++ CTD/SBE911/EX1004L2_CAST03_20100527.hex
<f+++++++++ CTD/SBE911/EX1004L2_CAST03_20100527.hdr
<f+++++++++ CTD/SBE911/EX1004L2_CAST03_20100527.bl

The files with the <f......... prefix are files that already existed at the remote server and did not require updating.  The files with the <f+++++++++ prefix are files that did not exist at the remote server and were transferred.  Please read the rsync manual for the full description of what the 11 character prefix means.

Adding Notification

When I use custom scheduled jobs to do these kinds of important tasks I like to know if the script ran successfully or not.  The return value of the rsync command can quickly indicate if the transfer encountered any problems.  I like to use this return code to trigger a system notification like a Growl message.  Please refer to the OceanDataRat article on Growl notification for more information on integrating Growl notifications into scripts.
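As a small illustration of that idea, here is a Python sketch that turns an rsync return code into a Growl notification.  The only fact it relies on is that rsync exits with 0 on success; the function names and message wording are made up, and the growlnotify flags are the Mac ones (-s, -t, -m) from the Growl article.

```python
def notification_for(return_code, job_name="Data backup"):
    """Describe the notification to raise for a finished rsync run."""
    if return_code == 0:
        return {
            "title": f"{job_name} complete",
            "message": "rsync finished without errors",
            "sticky": False,
        }
    # Failures are made sticky so they are still on screen hours later.
    return {
        "title": f"{job_name} FAILED",
        "message": f"rsync exited with code {return_code}",
        "sticky": True,
    }


def growl_command(note):
    """Build the growlnotify argument list for a notification."""
    command = ["growlnotify", "-t", note["title"], "-m", note["message"]]
    if note["sticky"]:
        command.insert(1, "-s")
    return command


print(growl_command(notification_for(0)))
print(growl_command(notification_for(23)))
```

The argument list can be handed to subprocess.call once Growl and growlnotify are installed on the workstation.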

Automate it!

With the basic building blocks in place it’s time to pull everything together and build a script we can pass to Windows scheduler.  Here is my final script.  Unlike most of my scripts this one’s for Windows only.  It handles all of the options discussed including ssh tunneling, logging and Growl notifications.  You will need to tweak the script variables for your particular setup, install cwRsync and Growl on your local system and an SSH server on your remote system.  You will also need to set up your SSH public/private keys.  I’ve included a lot of documentation in the script that I hope makes it easy to tweak the script for your particular setup as well as to disable some of the bells and whistles like SSH, logging and notifications.

Once you’ve modified the script for your particular needs, run it a couple of times from the command prompt to ensure it’s performing as expected.  When satisfied, add it to Windows scheduler and you should be well on your way towards a more automated data management setup.  In the end, I just hope it helps.

Want to talk about what was discussed here? Please go to the forums.

Posted in Datalogging, Software, Storage

Setting up SSH Public Key Authentication

SSHI’m a big fan of scripting and secure communication, which together can be tricky.  As an example: how to write a script that connects two computers?   Under normal circumstances this requires a password therefore two ugly solution are to provide a password every time you run the script (not practical for scheduled tasks) or include the password in the script in clear text (defeats the point of having a password).  There is a third option, SSH with public key authentication.

SSH with public key authentication allows a secure, encrypted connection to be established between two computers without having to provide a password.  Instead of a password, authentication relies on a pair of digital keys.  The keys are paired, meaning both are required for authentication.  The “public key” resides on the remote server.  The “private key” resides on any local machine that needs to connect to the server, and it never leaves that machine.  When a client connects, it uses its private key to prove its identity and the server checks whether that proof pairs with one of the public keys it holds.  If it’s a match, the server acknowledges the client’s connection request is genuine and the connection is made, without a password.  If the private key doesn’t pair with any of the public keys on the server, the authentication method reverts back to requiring a password.  This Wikipedia article does a good job explaining how public key authentication works and I encourage anyone with concerns about authenticating without needing a password to take a read.

Some Initial Planning

Public Key Authentication doesn’t really connect two computers, it connects a user on the local computer to a user on the remote computer.  Before proceeding further you need to map out how this connection is going to be made.  On the boat I work on I have the same user account on all the machines that I use just for data management (i.e. datarat).  All but one of these machines act as clients.  Only my shipboard data warehouse (SHIP_WAREHOUSE) is an SSH server.  This is not the only way to do it.  What’s important is that you map out who needs to talk to whom and which computers are going to be the servers and which are going to be the clients.

Installing SSH

The next step is to install an SSH client and an SSH server on the local and remote systems respectively.

For Linux users: there’s a good chance this has already been done.  To verify an SSH client is installed type:

ssh -V

This should return a one-line response like: OpenSSH_5.2p1, OpenSSL 0.9.8l 5 Nov 2009, indicating an SSH client is installed and its version.

To verify an SSH server is running type:

ps -A | grep sshd

This should return a response like: 2423 ?         00:00 sshd, indicating the SSH server is running and its process ID.

If the system does not return these responses please refer to the documentation for your Linux distro on installing ssh and sshd.

For Mac users (tested through OS X 10.6.6): SSH is installed by default but the server may not be enabled.  To enable the SSH server, go to System Preferences –> Sharing.  Check the “Remote Login” checkbox.

For Windows users: SSH will need to be installed.  There are several commercial and open source (read: free) options.  My preferred choice is Copssh and the remainder of this article will be based on this implementation.  Please refer to the Copssh website for installation instructions.  When installing Copssh as an SSH server it is important to enable the local user you plan to authenticate to from an SSH client.  This is done from the Copssh control panel.

Creating the keys

The next step is to create the public and private keys.  I prefer to do this from the client machine.  Depending on the OS the command varies.

For Linux, Unix and Mac users: Log in to the computer using the user account you plan to authenticate from.  From a terminal window type the command:

ssh-keygen -t rsa -N ''

This will create the ~/.ssh directory.  Inside that directory you should find the private key (id_rsa) and the public key (id_rsa.pub).

For Windows users ( Copssh installed ): Login to Windows using the local user account you plan to authenticate from.  Open a command prompt ( Start –> run –> cmd.exe ). This should start you in the local user’s home directory (i.e. C:\Documents and Settings\<username> ).  Create a sub-directory called “.ssh” by typing:

mkdir .ssh

Create the keys by typing:

"c:\Program Files\ICW\Bin\ssh-keygen.exe" -t rsa -N '' -f .ssh\id_rsa

Verify the keys were created by looking inside the .ssh directory.  You should find the private key (id_rsa) and the public key (id_rsa.pub).

Transferring the Public Key to the Remote Server

Now we need to install the public key we just created on the remote server.  The way public keys are installed is simple.  There is a file on the remote server called /home/<username>/.ssh/authorized_keys (Unix) or c:\Documents and Settings\<username>\.ssh\authorized_keys (Windows).  Each line in the authorized_keys file corresponds to a public key.  By default the authorized_keys file is NOT created for each user.  It must be created when the first key is installed.  Subsequent keys are added to the end of the file.  It is important to remember that each key in the authorized_keys file takes exactly one line.  When installing keys from a Windows machine to a Unix machine make sure carriage returns are not introduced.
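For second and later keys, where the file must be appended to rather than created, something along these lines could be run on a Unix server once the .pub file has been copied over. This is only a sketch: install_pubkey is a name I made up, it assumes the .pub file holds a single key, and the chmod at the end reflects the common sshd behaviour of refusing key files that other users can write to.

```shell
#!/bin/bash
# install_pubkey <public key file> <authorized_keys file>
# Appends the key as a single line, stripping any carriage returns picked up
# on Windows, and skips the append if that exact line is already present.
install_pubkey() {
    local pubkey_file=$1 auth_file=$2 key
    key=$(tr -d '\r' < "$pubkey_file")
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    if ! grep -qxF -- "$key" "$auth_file"; then
        printf '%s\n' "$key" >> "$auth_file"
    fi
    # sshd is commonly configured to ignore key files that others can write to.
    chmod 600 "$auth_file"
}

# Example call (the source path is a placeholder):
# install_pubkey /tmp/id_rsa.pub "$HOME/.ssh/authorized_keys"
```

Because the function checks for an existing identical line first, running it twice with the same key leaves only one copy in the file.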

If this is the first key being installed on the server there is a shortcut for transferring the public key and creating the authorized_keys file.  Secure Copy (scp) is part of the SSH client install.  Secure copy is similar to regular copy but it allows users to copy files between two machines using SSH.  Here’s how to use scp to copy the id_rsa.pub file from a local Windows-based machine and save it as the authorized_keys file on a remote Unix/Linux/Mac-based server:

"c:\Program Files\ICW\Bin\scp.exe" .\.ssh\id_rsa.pub <remote user>@<remote server>:.ssh/authorized_keys

This command will prompt for the remote user’s password.

Verify it Works

If all went well you should be able to log in using SSH and NOT NEED A PASSWORD.  Here is the command for connecting a Windows-based client to a Unix-based server:

"c:\Program Files\ICW\Bin\ssh.exe" -i .ssh\id_rsa <remote user>@<remote server>

Windows-based clients require the -i <filename> argument (which tells ssh which private key file to use) because Windows and Copssh handle home directories differently.

Troubleshooting

If you’ve followed these instructions and you are still prompted for a password the problem is usually caused by one of four errors: can’t find the private key, can’t find the public key, wrong local user, wrong remote user.  Verify that on the local machine (Windows-based) the local user’s home directory has a .ssh\id_rsa file.  Verify that on the remote machine (Unix-based) the remote user’s home directory has a .ssh/authorized_keys file.  Verify that you are connecting from the correct local user account to the correct remote server and the correct remote user account.

In all the examples the client is assumed to be a Windows-based machine and the server is a Unix-based machine.  If you are trying something different you may need to modify the steps.  If you need help please post your questions in the forums.

Conclusion

While it might not be obvious, Public Key Authentication opens up a world of scripting options including secure file transfers and remote script execution.  You can expect future articles on OceanDataRat to refer back to this article.  In the end I just hope this helps.

Want to talk about the article more?  Please go to the forums.


Bulk Geo-Tagging of Images Using SCS Timestamped NMEA GGA, HDT and ROV Data

The idea of Geo-Tagging or Geo-Referencing images is straightforward: embed time and position information into an image file so that you know when and where the image was taken without having to keep track of the information externally.  It is a technique that is widely used among professional and amateur photographers alike to show where their photos were taken.  Technically speaking, geo-tagging is the process of populating storage bins in the image file’s metadata header with GPS time, longitude, latitude, heading (true or magnetic), and altitude (above or below sea level).  This metadata header is only present on PSD, JPEG and TIFF image types so the techniques discussed here will only work on those files.

Why Geo-tag?

For many ships, data is primarily made up of ASCII data received on a network or serial port (i.e. GPS, Gyro, Met Sensors, etc) or specialized binary data (ADCP, Multibeam, SBP, etc).  However on an increasing number of ships (i.e. the NOAA Ship Okeanos Explorer, the USCG Icebreaker Healy, and the E/V Nautilus) images from fixed cameras that were once novel are becoming standard datasets.  For the same reason we add a timestamp to our serial data strings to provide them temporal context, we should be adding something to this image dataset so that we can figure out when and where it came from.

There are 3 major standards in the image metadata world: EXIF, XMP and IPTC.  All the standards have ways to embed copyright, creator, creation time and a description but only EXIF goes beyond that and handles, among other things, image orientation (landscape/portrait) and GPS location.  Thus to Geo-tag the images we need only worry about populating the EXIF metadata bins.

A Real-World Scenario:

During one cruise aboard the NOAA Ship Okeanos Explorer the ROV team saved ~3400 high-resolution JPEG images.  The scientists who participated on the cruise needed to know where these images were taken (in x,y,z and heading).

The NOAA Ship Okeanos Explorer records High-Definition (HD) video from their two Remotely Operated Vehicles (ROVs) and four ship-mounted HD pan/tilt/zoom (PTZ) cameras.  The Okeanos does not use Blu-ray discs or HDCAM tape media to store video but instead records all video as files onto dual 42TB RAID arrays.  Each filename includes the date and time of the first frame as well as the camera source and a brief description.  The date/time portion is pulled from a dedicated GPS-sync’d SMPTE time code generator and automatically added to the filenames by the video recording system.  The camera source and description are generated manually by the video operator.  An example of this file naming syntax: 20100711_05h21m23s02_ROVHD_CRABS.mov.  From these HD video files, the ROV video team selects individual frames and saves them as high-resolution JPEG images.  The image filenames match the original video filename syntax but the date/time portions of the filenames are manually corrected to correspond with the timecode of the individual video frame, i.e. 20100711_05h24m53s19_ROVHD_CRABS.jpg.

The lat,lon positions of the ROVs are generated as a NMEA0183 GGA string.  The ROV heading, depth, pitch, roll, altitude, etc. are generated as a proprietary NMEA-formatted sentence created by the ROV control software vendor.  All of this data is logged using the NOAA-developed Shipboard Computing System (SCS).  SCS prepends each received line of data with a date and time (“mm/dd/yyyy,hh:mm:ss.sss”, UTC timezone).

Here are some samples of what the raw data looks like:

08/01/2010,00:20:41.900,$GPGGA,002040.60,0441.78672,N,12650.79987,E,0,00,0.0,0.00,M,,,,*0B
08/01/2010,00:20:42.010,$GPGGA,002040.60,0441.78672,N,12650.79987,E,0,00,0.0,0.00,M,,,,*0B
08/01/2010,00:20:42.119,$GPGGA,002040.60,0441.78672,N,12650.79987,E,0,00,0.0,0.00,M,,,,*0B
08/01/2010,00:20:42.260,$GPGGA,002040.60,0441.78672,N,12650.79987,E,0,00,0.0,0.00,M,,,,*0B

08/01/2010,00:16:21.922,$PGSSRVR,359.8,0.9,0.7,0.0,m,0.0,m,1,0.0,0.0*57
08/01/2010,00:16:22.422,$PGSSRVR,359.4,1.6,0.5,0.0,m,0.0,m,1,0.0,0.0*57
08/01/2010,00:16:22.922,$PGSSRVR,359.0,1.2,0.1,0.0,m,0.0,m,1,0.0,0.0*53
08/01/2010,00:16:23.437,$PGSSRVR,358.3,0.8,0.6,0.0,m,0.0,m,1,0.0,0.0*5D

Required Tools

To read/write the metadata I needed a command-line tool that I could use as part of a script.  After Google-ing “EXIF command line tool” I found exiv2, a simple, open-source EXIF metadata tool, perfect!

I also needed to do some simple file querying and row formatting. For this I relied on my good friends grep and awk. I also needed to use the BASH shell scripting language to stitch everything together. My solution was developed on the Mac OS X platform but ultimately it was ported to work on a Linux-based server.   At the time of this writing, I haven’t looked to see if this will work on Windows using CYGWIN but I don’t see a reason why not.

The Solution

UPDATED 2011/01/12: After using this script I realized many web-gallery sites such as Piwigo and Gallery3 require that the Exif.Photo.DateTimeOriginal bin also be set for images to be correctly sorted chronologically.  I’ve uploaded a new version of the final script that sets this bin to the same time as SCSTIMESTAMP.

The image filenames alone provided a lot of information such as date, time and camera source as well as a brief description.  Using that time stamp I could search the position and attitude data but before I could perform the file query I first had to convert the timestamp format in the file name to match the SCS format.  For simplicity’s sake I did not worry about the decimal second data.  The following awk command handled the conversion:

echo "20100711_05h24m53s19_ROVHD_CRABS.jpg" | \
awk -F"_" ' {printf "%s/%s/%s,%s:%s:%s\n",  substr($1,5,2),substr($1,7,2),substr($1,1,4),substr($2,1,2),substr($2,4,2),substr($2,7,2)}'

Now that I had the date/time information from the filename in the same format as the SCS time stamp I used grep to search the GGA and ROV data files.  I used the -m1 argument to return only the first row that matched the timestamp.  Again, I chose to ignore the partial second information:

SCSTIMESTAMP="07/11/2010,05:24:53"
grep -m1 ${SCSTIMESTAMP} ./SCSData/ROV_GPGGA_07112010.Raw
grep -m1 ${SCSTIMESTAMP} ./SCSData/ROV_PGSSRVR_07112010.Raw

Running the two queries yielded the following two results respectively.

07/11/2010,05:24:53.027,$GPGGA,052451.93,0251.00232,N,12503.55104,E,0,00,0.0,0.00,M,,,,*06
07/11/2010,05:24:53.449,$PGSSRVR,109.7,5.8,2.8,422.2,m,0.0,m,7,0.0,0.0*56

The next step was to embed this data in the image file using exiv2.  I found exiv2 to be a little funny to use.  First off I had to know the exact  bin name and expected datatype (i.e. ASCII, Rational Number, Byte, etc).  Please refer to the exiv2 tag reference for all the metadata bin names and datatypes.  Second quirk, to store a rational number, say 154.23, you have to enter it as 15423/100.  The last quirk is how data is passed to the exiv2 command using the command-line.  There are two methods: you can populate individual bins using the bin name and value as command-line arguments or you can pass a script file to exiv2 and populate multiple bins at once.  I found the latter method easier to script.
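The rational-number quirk is easy to wrap in a small helper. The sketch below is my own illustration (to_rational is a made-up name, and it only handles non-negative values): it multiplies the value by the denominator you choose and rounds to the nearest integer.

```shell
#!/bin/bash
# to_rational <value> <denominator>
# Prints a non-negative decimal value as an integer fraction, e.g. 154.23 with
# a denominator of 100 becomes 15423/100. Adding 0.5 before truncating rounds
# to the nearest integer, which guards against floating-point results such as
# 15422.999999.
to_rational() {
    awk -v v="$1" -v d="$2" 'BEGIN { printf "%d/%d\n", int(v * d + 0.5), d }'
}

to_rational 154.23 100   # prints 15423/100
to_rational 422.2 10     # prints 4222/10
```

The denominator sets the precision that survives: 100 keeps two decimal places, 1000 keeps three, and so on.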

Once I understood how to use exiv2 to populate metadata I had to determine exactly which bins I needed to populate in order for programs like iPhoto and Picasa3 to recognize an image as geo-tagged.  My conclusion was that at the most basic level I needed the GPS version, GPS Time, Latitude Reference, Latitude, Longitude Reference, Longitude, Altitude Reference and Altitude.  Here is the list of the metadata bin names, datatypes and descriptions of how I planned to populate them:

Exif.GPSInfo.GPSVersionID Multi-Byte – must be set to: “02 00 00 00”
Exif.GPSInfo.GPSTimeStamp Multi-Rational – hours, minutes, seconds: “6/1 21/1 42967/1000” = 6:21:42.967
Exif.GPSInfo.GPSLatitudeRef Ascii – N (North) or S (South)
Exif.GPSInfo.GPSLatitude Multi-Rational – degrees, minutes, seconds: “5/1 4/1 439/10” = 5 deg, 4 min, 43.9 sec
Exif.GPSInfo.GPSLongitudeRef Ascii – W (West) or E (East)
Exif.GPSInfo.GPSLongitude Multi-Rational – degrees, minutes, seconds: “126/1 39/1 17956/1000” = 126 deg, 39 min, 17.956 sec
Exif.GPSInfo.GPSAltitudeRef Byte – 00 (Above Sea Level) or 01 (Below Sea Level)
Exif.GPSInfo.GPSAltitude Rational – meters from sea level: “170041/100” = 1700.41

Optionally I could populate heading using the following bins:

Exif.GPSInfo.GPSImgDirectionRef Ascii – T (True) or M (Magnetic)
Exif.GPSInfo.GPSImgDirection Rational – degrees: “2352/10” = 235.2

Getting back to my example, I used awk to dissect the GGA string and build an exiv2 script file for the GPSVersionID, GPSTimeStamp, GPSLatitudeRef, GPSLatitude, GPSLongitudeRef, and GPSLongitude bins.  In cases where the data value was a floating point number I multiplied the value by a power of 10 to make it an integer and used that same power of 10 as the denominator.  I also had to convert the decimal minutes part of the lat and long data to minutes and seconds by multiplying the fractional part of the decimal minutes by 60.  Here were the results:

# Single quotes stop the shell from treating $GPGGA as a variable.
GGASTRING='07/11/2010,05:24:53.027,$GPGGA,052451.93,0251.00232,N,12503.55104,E,0,00,0.0,0.00,M,,,,*06'
# One printf per exiv2 command keeps each output line and its arguments together.
echo "${GGASTRING}" | awk -F"," '{
printf "set Exif.GPSInfo.GPSVersionID Byte 02 00 00 00\n"
printf "set Exif.GPSInfo.GPSTimeStamp Rational %i/1 %i/1 %i/1000\n", substr($4,1,2), substr($4,3,2), substr($4,5,6)*1000
printf "set Exif.GPSInfo.GPSLatitudeRef Ascii %s\n", $6
printf "set Exif.GPSInfo.GPSLatitude Rational %i/1 %i/1 %i/100000\n", substr($5,1,2), substr($5,3,2), substr($5,6,5)*60
printf "set Exif.GPSInfo.GPSLongitudeRef Ascii %s\n", $8
printf "set Exif.GPSInfo.GPSLongitude Rational %i/1 %i/1 %i/100000\n", substr($7,1,3), substr($7,4,2), substr($7,7,5)*60
}' > exiv2GGAScript.txt

The output exiv2GGAScript.txt script looked like:

set Exif.GPSInfo.GPSVersionID Byte 02 00 00 00
set Exif.GPSInfo.GPSTimeStamp Rational 5/1 24/1 51930/1000
set Exif.GPSInfo.GPSLatitudeRef Ascii N
set Exif.GPSInfo.GPSLatitude Rational 2/1 51/1 13920/100000
set Exif.GPSInfo.GPSLongitudeRef Ascii E
set Exif.GPSInfo.GPSLongitude Rational 125/1 3/1 3306240/100000
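The substr calls above depend on the GGA fields having fixed widths (five decimal places of minutes). As a hedged alternative, here is a sketch that does the same degrees/minutes/seconds conversion with arithmetic instead, so the number of decimal places in the input doesn’t matter. The function name nmea_to_dms is my own invention, and the sketch makes no attempt to handle seconds that round up to exactly 60.

```shell
#!/bin/bash
# nmea_to_dms <ddmm.mmmmm | dddmm.mmmmm>
# Converts an NMEA latitude or longitude field into the three rationals
# (degrees, minutes, seconds) used by the exiv2 script. Seconds are kept to
# five decimal places by using a denominator of 100000.
nmea_to_dms() {
    awk -v v="$1" 'BEGIN {
        deg = int(v / 100)                  # leading digits are whole degrees
        min = int(v - deg * 100)            # next two digits are whole minutes
        sec = (v - deg * 100 - min) * 60    # fractional minutes -> seconds
        printf "%d/1 %d/1 %d/100000\n", deg, min, int(sec * 100000 + 0.5)
    }'
}

nmea_to_dms 0251.00232    # prints 2/1 51/1 13920/100000
nmea_to_dms 12503.55104   # prints 125/1 3/1 3306240/100000
```

For the example fix the arithmetic is 0.00232 min * 60 = 0.1392 sec for latitude and 0.55104 min * 60 = 33.0624 sec for longitude, which matches the values in the script file above.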

I used awk again to dissect the PGSSRVR string and build an exiv2 script file for the GPSImgDirectionRef, GPSImgDirection, GPSAltitudeRef and GPSAltitude bins.  Because these values are floating point numbers I multiplied each one by 100 to make it an integer and used 100 as the denominator.  Here were the results:

# Single quotes stop the shell from treating $PGSSRVR as a variable.
PGSSRVRSTRING='07/11/2010,05:24:53.449,$PGSSRVR,109.7,5.8,2.8,422.2,m,0.0,m,7,0.0,0.0*56'
# Field 4 is the heading and field 7 is the depth, both scaled by 100.
echo "${PGSSRVRSTRING}" | awk -F"," '{
printf "set Exif.GPSInfo.GPSImgDirectionRef Ascii T\n"
printf "set Exif.GPSInfo.GPSImgDirection Rational %i/100\n", $4*100
printf "set Exif.GPSInfo.GPSAltitudeRef Byte 01\n"
printf "set Exif.GPSInfo.GPSAltitude Rational %i/100\n", $7*100
}' > exiv2PGSSRVRScript.txt

The output exiv2PGSSRVRScript.txt script looked like:

set Exif.GPSInfo.GPSImgDirectionRef Ascii T
set Exif.GPSInfo.GPSImgDirection Rational 10970/100
set Exif.GPSInfo.GPSAltitudeRef Byte 01
set Exif.GPSInfo.GPSAltitude Rational 42220/100

Rather than call the exiv2 command once with each script, I chose to concatenate the two scripts into a single script and call the exiv2 command just once.

cat exiv2GGAScript.txt exiv2PGSSRVRScript.txt > exiv2Script.txt
exiv2 -k -m exiv2Script.txt 20100711_05h24m53s19_ROVHD_CRABS.jpg

To verify the metadata was successfully written I used the exiv2 command with the -pt argument.

exiv2 -pt 20100711_05h24m53s19_ROVHD_CRABS.jpg
Exif.Image.GPSTag                            Long        1  26
Exif.GPSInfo.GPSVersionID                    Byte        4  2.0.0.0
Exif.GPSInfo.GPSLatitudeRef                  Ascii       2  North
Exif.GPSInfo.GPSLatitude                     Rational    3  2deg 51' 0.139"
Exif.GPSInfo.GPSLongitudeRef                 Ascii       2  East
Exif.GPSInfo.GPSLongitude                    Rational    3  125deg 3' 33.062"
Exif.GPSInfo.GPSAltitudeRef                  Byte        1  Below sea level
Exif.GPSInfo.GPSAltitude                     Rational    1  422.2 m
Exif.GPSInfo.GPSTimeStamp                    Rational    3  05:24:51.9
Exif.GPSInfo.GPSImgDirectionRef              Ascii       2  True direction
Exif.GPSInfo.GPSImgDirection                 Rational    1  10970/100

Mac users can verify the metadata was written by opening the image with Preview, clicking Tools -> Show Inspector and selecting the GPS tab.

Preview Inspector Showing Geographic Location

And again using Picasa3.

Picasa3 Showing Geographic Location

So at this point I had a method for finding the right data, formatting it properly and embedding it into an image.  All that was left was to script it so that I could GeoTag all ~3400 images at once.
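Before letting anything loose on thousands of images, the lookup step can be tried on its own. The following is a simplified dry-run sketch, not the final script: filename_to_scs and list_fixes are names I made up for illustration, it only handles .jpg files, and it does nothing but print which GGA row each image would be matched to.

```shell
#!/bin/bash
# filename_to_scs <image filename>
# Turns 20100711_05h24m53s19_ROVHD_CRABS.jpg into 07/11/2010,05:24:53
filename_to_scs() {
    basename "$1" | awk -F"_" '{
        printf "%s/%s/%s,%s:%s:%s\n",
            substr($1,5,2), substr($1,7,2), substr($1,1,4),
            substr($2,1,2), substr($2,4,2), substr($2,7,2)
    }'
}

# list_fixes <gga file> <image directory>
# Dry run: prints each image name followed by the first GGA row logged during
# the same second, or "NO FIX" when nothing matches. No image is modified.
list_fixes() {
    local gga_file=$1 image_dir=$2 image stamp row
    for image in "$image_dir"/*.jpg; do
        [ -e "$image" ] || continue
        stamp=$(filename_to_scs "$image")
        row=$(grep -m1 "^${stamp}" "$gga_file" || true)
        printf '%s %s\n' "$(basename "$image")" "${row:-NO FIX}"
    done
}

# Example call (paths are placeholders):
# list_fixes ./SCSData/ROV_GPGGA_07112010.Raw ./images
```

Printing "NO FIX" rather than skipping silently makes it easy to spot images whose timestamps fall outside the logged data.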

Automating it!

So what I wanted was a script that does all the steps for me and does it for an entire directory of images.  I also wanted to add some additional functionality such as:

  • Ability to only do GPS coordinates no heading or depth (if I were tagging an image from a digital camera or  frame capture from one of the ship-mounted HD PTZ cameras)
  • Ability to pull the timestamp from the modification time of the file instead of the filename (if I were using a real-time frame capture system this would be the way to go)
  • Ability to embed heading data based on the NMEA0183 HDT string (if I wanted to pull heading off a NMEA-compatible Gyroscope)
  • Ability to create a copy of the image and modify the copy, not the original
  • Ability to pass an additional exiv2 script to further populate the image metadata. (I’m thinking about adding things like Expedition Name, Dive Number, Cruise ID, Copyright Credit, etc)
  • Ability to properly handle TIFF and JPG images
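
For the modification-time option in the list above, the timestamp could be derived roughly like this. It is a sketch under assumptions: mtime_to_scs is a made-up name, and it relies on the -r FILE option of GNU date as found on Linux. I haven’t checked how the date command that ships with OS X behaves here.

```shell
#!/bin/bash
# mtime_to_scs <file>
# Prints the file's last-modification time in UTC using the SCS layout
# (mm/dd/yyyy,hh:mm:ss). Relies on GNU date's -r FILE option.
mtime_to_scs() {
    date -u -r "$1" '+%m/%d/%Y,%H:%M:%S'
}

# Example call (the file name is a placeholder):
# mtime_to_scs ./images/frame_capture.jpg
```

The result can be fed to grep in the same way as a timestamp pulled from the filename, provided the computer that wrote the file had its clock synced to UTC.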

So here is the script.

This was a long one in development but I’m happy with the results.  Without any options the script will try to geo-tag all images matching the YYYYMMDD_HHhMMmSSsFF_CAMERA-SOURCE_DESCRIPTION.jpg format.  By default these images will be located at sea level.  If no GPS data is found for the given timestamp the script will move on to the next image.  Use the -v flag to increase the verbose messaging.  Use the -t flag to perform a dry-run (images will not be modified).  Use the -m flag to search for position data based on the file’s last modification time.  Use the -F jpg|tif|tiff argument to specify the image file type.  The argument must be “jpg”, “tiff” or “tif”.  Use the -R <PGSSRVR file> argument to add depth and heading information based on the PGSSRVR string format.  Use the -H <GPHDT file> argument to add heading information based on the NMEA0183 HDT string format.  If both -R and -H are specified, the HDT data will override the heading data from the PGSSRVR data.  Use the -M <EXIV2 Script> argument to populate additional EXIF metadata bins.  This file must use the same exiv2 script format as described above.  I used the following script when I was testing this feature:

UPDATED 2011/01/12: After using the script with different web gallery programs I now use a more comprehensive auxiliary EXIV2 script; the following section has been updated.
set Exif.Image.Artist Ascii "NOAA Okeanos Explorer Program"
set Exif.Image.Copyright Ascii "INDEX-SATAL 2010, Okeanos Explorer Program, NOAA"
set Iptc.Application2.Copyright String "INDEX-SATAL 2010, Okeanos Explorer Program, NOAA"
set Exif.Image.ImageDescription Ascii "This will be the first time scientists use a remotely operated vehicle (ROV) to get even a glimpse of deepwater biodiversity in the waters of the Sangihe Talaud Region. We expect to make discoveries that will advance our understanding of undersea ecosystems, particularly those associated with submarine volcanoes and hydrothermal vents. http://oceanexplorer.noaa.gov"
set Exif.Photo.UserComment "This will be the first time scientists use a remotely operated vehicle (ROV) to get even a glimpse of deepwater biodiversity in the waters of the Sangihe Talaud Region. We expect to make discoveries that will advance our understanding of undersea ecosystems, particularly those associated with submarine volcanoes and hydrothermal vents. http://oceanexplorer.noaa.gov"
set Iptc.Application2.Caption String "This will be the first time scientists use a remotely operated vehicle (ROV) to get even a glimpse of deepwater biodiversity in the waters of the Sangihe Talaud Region. We expect to make discoveries that will advance our understanding of undersea ecosystems, particularly those associated with submarine volcanoes and hydrothermal vents. http://oceanexplorer.noaa.gov"
set Iptc.Application2.Keywords String "ROV, INDEX-SATAL 2010, Indonesia"

Use the -O <output directory> argument to specify an output directory for the geo-tagged images.  When this flag is used, each image is first copied to the output directory and then that copied image is geo-tagged.  The original file is not modified at all and the copied file retains the same timestamps and permissions as the original.

The only two required arguments are the <gga file> and the <image directory>.  Without these the script will exit immediately.

The script writes the filename and position data to the stdout which is convenient for finding out what position data was embedded in each image and finding any images where position information could not be found.  I typically redirect this output to a logfile so that I can go back at a later time and access this information.

The full usage statement is:

Usage: ODR_GeoTag.sh: [-vtm] [-F jpg|tif|tiff ] [-R <PGSSRVR file>] [-H <HDT file>] [-M <EXIV2 script>] [-O <output directory>] <gga file> <image directory>
	-v verbose
	-t test only
	-m Geotag based on file modification timestamp, not filename
	-F jpg|tif|tiff Set the image type to embed, the only valid options are jpg, tif, and tiff
	-R <PGSSRVR file> Use PGSSRVR file to populate depth and heading metadata bins
	-H <HDT file> Use HDT file to populate heading metadata bins
	-M <EXIV2 Script> Use EXIV2 Script to populate additional metadata bins
	-O <output dir> directory to save the output files
	<gga file> the navigation file to use
	<image directory> the directory containing the image files to geotag

Enjoy, and I hope this helps.

Want to talk about what was discussed here? Please go to the forums.


Convert SCS timestamped NMEA GGA data to KML

Overview

For better or worse Google Earth is becoming the de facto standard for geospatial visualization.  I’m guessing this is due to the amazingly powerful and beautiful yet intuitive user interface (giving credit where credit is due).  Regardless of the reasoning, for the near future Google Earth is going to be how most people prefer to show off their GIS data and in the interest of being good datarats we should try to figure out how to accommodate our scientists.

Here’s the real-world scenario I’m facing:  I’ve got a science party member who wants to be able to display the entire cruise track in Google Earth.  This requires taking the recorded GPS data and producing a Google Earth-formatted (.kml) file.

So how to attack the problem?  Since all we’re really doing is converting data from one format to another let’s take a look at our starting and end points and see just how much trouble we’re really in.  The starting point is the ship’s GPS.  This sensor spits out standard NMEA0183 GGA messages that are logged by the ship’s datalogger to a file.  In my scenario the GPS sensor is a POS/MV and the datalogger is NOAA’s Shipboard Computing System (SCS).  That means I’m receiving data values at 2Hz and SCS is automatically prepending each reading with a date and time (mm/dd/yyyy,hh:mm:ss.sss).  The final saved data looks something like the following:

07/22/2010,07:30:54.744,$GPGGA,073054.518,0207.20460,N,12539.85843,E,2,11,0.8,5.42,M,,,10,0025*33
07/22/2010,07:30:55.260,$GPGGA,073055.018,0207.20573,N,12539.85840,E,2,11,0.8,5.35,M,,,10,0025*37
07/22/2010,07:30:55.744,$GPGGA,073055.518,0207.20685,N,12539.85839,E,2,11,0.8,5.27,M,,,9,0025*0D
07/22/2010,07:30:56.260,$GPGGA,073056.018,0207.20799,N,12539.85838,E,2,11,0.8,5.19,M,,,9,0025*0B

Google Earth uses an XML-based file format called Keyhole Markup Language (KML) to display geospatial data.  The name Keyhole is a carry over from Keyhole Inc., the GIS company Google acquired in 2004 to create Google Earth.  As a sidebar, the name Keyhole is itself a nod to the military’s KH (“Key Hole”) series of eye-in-the-sky reconnaissance satellites, such as the KH-11 (it all makes sense now, right?).  Here’s an example of a KML file:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Folder>
<name>Tracks</name>
<Folder>
<name>Points</name>
<Placemark>
<TimeStamp><when>2010-07-22T07:30:54.017Z</when></TimeStamp>
<Point>
<coordinates>125.664308,2.120058,5.460000</coordinates>
</Point>
</Placemark>
<Placemark>
<TimeStamp><when>2010-07-22T07:31:54.017Z</when></TimeStamp>
<Point>
<coordinates>125.664272,2.122292,5.290000</coordinates>
</Point>
</Placemark>
</Folder>
</Folder>
</kml>

Although XML is very powerful in its ability to gracefully handle multiple varieties of GIS information, translating from comma-delimited to XML is a bit of a PITA (pain-in-the-ass).  As a design philosophy when facing a challenge such as this I prefer to do a little research to find out if someone else has already solved a similar problem.  I find it’s quicker to hack than to reinvent.  Quick… to the Internet!

Tools

After Googl-ing “Translate NMEA to KML command-line” and clicking a couple of links I came across a slick little utility, GPSBabel.  GPSBabel can translate between seemingly every file format related to storing GPS data including NMEA and KML.  Perfect.  A quick read through the online documentation confirms that it should fit the bill and best of all GPSBabel is FREE!

In addition to GPSBabel we need to do some simple file querying and row formatting.  For this I’ll rely on my good friends grep and awk.  I’ll also need to use the BASH shell scripting language to stitch everything together.  For my solution I’ll be developing on the Mac OS X platform but ultimately porting the solution over to a Linux-based server.  I haven’t looked to see if this will work on Windows using CYGWIN but I don’t see a reason why not.

Solution

After reading the man pages for GPSBabel I discovered the correct way to use the command using just the default settings:

gpsbabel -i NMEA -f <input file> -o KML -F <output file>

GPSBabel requires that the input file be properly formatted NMEA0183 GGA.  This means that before we can proceed we will need to strip out the date and time stamp that SCS added.  To do this I use AWK to print each column of data starting at column 3:

awk -F, '{ printf "%s", $3 } { for ( i = 4; i <= NF; i++ ) printf ",%s", $i} {printf "\n"}' <input file> > <output file>

The output should look like:

$GPGGA,073054.518,0207.20460,N,12539.85843,E,2,11,0.8,5.42,M,,,10,0025*33
$GPGGA,073055.018,0207.20573,N,12539.85840,E,2,11,0.8,5.35,M,,,10,0025*37
$GPGGA,073055.518,0207.20685,N,12539.85839,E,2,11,0.8,5.27,M,,,9,0025*0D
$GPGGA,073056.018,0207.20799,N,12539.85838,E,2,11,0.8,5.19,M,,,9,0025*0B

As I mentioned earlier my GPS samples position at 2Hz.  While this sampling rate is required for operating a multibeam system or a dynamic positioning system, it’s simply too much for displaying a multi-day track line in Google Earth; a 5-minute interval should be more than sufficient.  To resample the data we could employ a mathematical resampling algorithm that would interpolate the data to a 5-minute interval, or take a shortcut and just grab one row out of every 600 (2 records/second * 60 seconds/minute * 5 minutes = 600 records per 5 minutes).  To do that I’ll use AWK again.  The following command grabs every 600th line of data starting at line 1 of the input file and stores those lines in the output file:

CORRECTION: Turns out starting on the first line is dangerous if your datalogging program introduces >.5 second delay between when the data is produced and when it is time stamped as you could end up with the following situation:

07/26/2010,00:00:00.174,$GPGGA,235959.734,0216.12878,N,12449.07267,E,2,09,1.1,7.33,M,,,10,0025*3A

This is a problem because the SCS time is from the day after when the data was produced (SCS Time: 00:00:00.174, GPS Time: 23:59:59.734, from the previous day).  There’s no easy way to account for this so what I’m going to do is perform the resampling operation starting at the second line.  The corrected AWK command is:

awk 'NR%600==2' <input file> > <output file>
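
To see exactly which rows a modulo filter like this keeps, it can be tried on numbered dummy records. This is just a demonstration sketch; every_nth is a name I made up, and seq stands in for a real data file.

```shell
#!/bin/bash
# every_nth <n> <first>
# Passes through record number <first> and every <n>th record after it.
# <first> is expected to be between 1 and <n>.
every_nth() {
    awk -v n="$1" -v first="$2" 'NR % n == first % n'
}

# Stand-in for a real data file: 1800 numbered records (15 minutes at 2Hz).
seq 1 1800 | every_nth 600 2    # prints 2, 602 and 1202, one per line
```

The interval is tied to the logging rate, so a sensor logged at 1Hz would need 300 instead of 600 to get the same 5-minute spacing.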

When I run the gpsbabel command I immediately get an error:

nmea: No date found within track (all points dropped)!
nmea: Please use option "date" to preset a valid date for thoose tracks.

So I head back to the man pages and look at the options for the nmea format.  Turns out I need to give GPSBabel the date the data was recorded using the “date=<yyyymmdd>” option.  Time is provided internally by the second column of the GGA message.  So we can manually look at each data file and translate the date into the required format or… use the following command to grab the date field from the first line of the raw SCS file and translate it for use:

head -n1 <input file> | awk -F'[,/]' '{print $3$1$2}'
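Here is that date translation run against the sample record shown earlier, as a quick sanity check.  The scs_date function name is just something I made up for the example, and it reads from standard input rather than a file so it can be tried without a data file.

```shell
# Print the date of the first record on stdin as yyyymmdd.
# (hypothetical helper name, not part of SCS or GPSBabel)
scs_date() {
    # Splitting on "," and "/" makes MM, DD and YYYY fields 1, 2 and 3.
    head -n1 | awk -F'[,/]' '{ print $3 $1 $2 }'
}

line='07/26/2010,00:00:00.174,$GPGGA,235959.734,0216.12878,N,12449.07267,E,2,09,1.1,7.33,M,,,10,0025*3A'
printf '%s\n' "$line" | scs_date    # prints 20100726
```

The field separator is quoted here so the shell never tries to treat the brackets as a filename pattern.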

Let’s try this again:

gpsbabel -i NMEA,date=<yyyymmdd> -f <input file> -o KML -F <output file>

Success… well maybe, we at least have a KML output file.

Importing it into Google Earth yields a trackline with a timeline.  I can even hit play and watch as the marker retraces where the ship’s been cruising for the last month.  Sweet, but what’s up with those numeric labels that keep decreasing?  That’s just a little annoying and I need to remove them, so it’s back to the GPSBabel man pages.

After trying a bunch of stuff I finally tweaked my command to produce a KML file just the way I want it.  I can still see my little ship move along its trackline, but there are no labels distracting me.  As a bonus I’ve reduced the output filesize by 97.5%!  As I mentioned earlier, XML is very powerful but it is not simple and it is not concise.  The command I ultimately ended up using was:

gpsbabel -i NMEA,date=<yyyymmdd> -f <input file> -o KML,trackdata=0,track=1,labels=0 -F <output file>

UPDATED 2011-01-28: Processing interpreted GGA data (e.g. from an ROV)

Recently someone I’ve worked with tried using this procedure to convert GGA data from an ROV dive.  The process failed very quietly and generated a KML file with no data points.  Here’s why and here’s how to fix it.  I’ve also updated the final script so that it now handles this scenario, albeit you will need to install the NMEACheckSumGen program described below.

The Cause

To the best of my knowledge GPS doesn’t work deep underwater.  The signals from the satellites just can’t penetrate kilometers of water.  That said, Science-Class ROV operators still need to know where their vehicles are, both relative to the ship and in absolute terms on the surface of the planet.  For determining ROV position relative to the vessel, the majority of ROV operators use an Ultra-Short Baseline (USBL) acoustic navigation system.  Simply put, a USBL uses sound to determine the range and bearing from a transducer mounted on the ship’s hull to a transducer on an underwater vehicle.  This gives the operators the X and Y distances (in feet or meters) between the ship’s transducer and the vehicle’s transducer.  To get the absolute location of the vehicle in lat/lon, another process (e.g. Hypack) combines the USBL data with the ship’s GPS position, taking into account the distance between the ship’s GPS antenna and the ship’s USBL transducer.  Here’s where the problem begins…

This last process produces a NMEA GGA data stream that represents the lat/lon position of the vehicle.  Now if you were to dig deep into the NMEA GGA spec you would find that the 6th GGA data field corresponds to the GPS quality type.  The three values that matter here are: 1=GPS fix, 2=Differential GPS fix, 0=invalid.  Because the calculated position is neither a real GPS fix nor a diff. GPS fix, it’s logged as type 0, invalid.

So the 6th field has a 0 in it, so what?  Well, GPSBabel, the utility I use for translating GGA into KML, is a bit anal-retentive and will only process quality type 1 and type 2 data.  This is an easy fix: use awk to change the field from 0 to 1.  Unfortunately this creates another problem, the NMEA checksum.  The NMEA checksum is an XOR of all the characters between the “$” and the “*” of the sentence, and it is used to verify the data was transmitted correctly.  The idea is that whatever system uses the GPS data can calculate its own checksum upon receiving the data.  If the transmitted checksum and calculated checksum don’t match then something went wrong during transmission and the fix should be interpreted as bad.  Again, here is where GPSBabel is a bit anal-retentive and will only process fixes that have the correct checksum.
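To make the checksum problem concrete, here is a rough shell sketch of the check a receiver performs.  It is not GPSBabel’s code and not my C program; the function names nmea_checksum and nmea_valid are made up for this example, and the GGA sentence is a commonly quoted sample rather than data from the ship.  It also assumes the checksum is written in upper-case hex with no trailing carriage return.

```shell
# Print the two-digit hex XOR of every byte between "$" and "*".
nmea_checksum() {
    body=${1#\$}          # drop the leading "$"
    body=${body%%\**}     # drop "*" and the transmitted checksum
    sum=0
    # od prints each byte of the body as an unsigned decimal number
    for byte in $(printf '%s' "$body" | od -An -v -tu1); do
        sum=$(( sum ^ byte ))
    done
    printf '%02X\n' "$sum"
}

# Succeed only if the transmitted checksum matches the computed one.
nmea_valid() {
    sent=${1##*\*}        # text after the last "*"
    [ "$sent" = "$(nmea_checksum "$1")" ]
}

# Quality 0 with a matching checksum, then the same sentence with the
# quality field edited to 1 but the old checksum left in place.
original='$GPGGA,123519,4807.038,N,01131.000,E,0,08,0.9,545.4,M,46.9,M,,*46'
edited='$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*46'

nmea_valid "$original" && echo "original: checksum matches"
nmea_valid "$edited"   || echo "edited: stale checksum, should be $(nmea_checksum "$edited")"
```

Changing a single character changes the XOR, which is why editing the quality field alone leaves a sentence that a strict parser will reject.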

The Solution

The solution to the first problem of the quality type = 0 is an easy fix using awk.  Here’s the command, remember this command is for an SCS-timestamped GGA file so we need to change the 9th field instead of the 7th:

awk -F, 'BEGIN{ OFS="," } {print $1, $2, $3, $4, $5, $6, $7, $8, "1", $10, $11, $12, $13, $14, $15, $16, $17}' <input file> > <output file>
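A slightly more compact variation, if you prefer it, is to assign to the 9th field directly and let awk rebuild the line.  This is a sketch rather than what the final script does: the fix_quality name is hypothetical, the sample record is made up (including its placeholder checksum), and unlike the command above it only touches rows whose quality type is actually 0.

```shell
# Rewrite quality type 0 as 1 in SCS-timestamped GGA rows read from stdin.
# (hypothetical helper; field 9 is the quality type once the SCS date and
# time columns are counted)
fix_quality() {
    awk -F, 'BEGIN { OFS = "," } $9 == "0" { $9 = "1" } { print }'
}

# Made-up record for illustration; "*00" is a placeholder, not a real checksum.
row='01/28/2011,12:00:00.000,$GPGGA,120000.00,0216.12878,N,12449.07267,E,0,09,1.1,7.33,M,,,,*00'
printf '%s\n' "$row" | fix_quality
```

Either way the checksum at the end of the line is left untouched, so it still has to be regenerated afterwards.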

The solution to the second problem is not as straightforward.  After searching the internet for a while I was able to piece together enough C code to build a simple program that will read an input NMEA file, calculate the correct checksum and spit out a valid data stream.  The program is so simple it will actually work for any valid NMEA data stream, not just GGA.  Please remember this if you find yourself in a similar bind with other NMEA data.  Here’s the source code: main.c.  Usage instructions are in the code’s header comment.  On Mac and Linux systems, you can compile the code using the following command:

gcc -o ./NMEACheckSumGen ./main.c

This command will create an executable program called NMEACheckSumGen.  Copy the resulting executable to somewhere on your system’s path (e.g. /usr/local/bin) and you’ll be able to call it from anywhere.

Automating it!

So now we have our rough procedure for getting from start to finish, time to put it all together.

I want to be able to simply pass an SCS-timestamped GGA file to a program and have KML come out.  The script should resample the data file to whatever interval I set, figure out what date string to pass to GPSBabel and reformat the raw datafile to a NMEA0183 compliant GGA format automatically.  The program should also be able to handle interpreted GGA data such as ROV position.

So here is the script.

In addition to the requirements specified previously, the script also accounts for raw files that include data from more than one day.  In that case the script will parse out the data for a single day and build a new KML file for that date.  This process repeats for each date discovered in the raw data.  This is useful when the data logging program doesn’t start a new file exactly at the day boundary, or if you want to concatenate all the raw data from a cruise into a single file and run the script once at the end.

Depending on the amount of navigation data to convert you may want to change the default resampling interval.  The default resample interval is 600 (1 sample/5 minutes for 2Hz data) but can be changed using the -R flag.  The script can add a user-specified prefix to the output KML files using the -P option.  This is useful if you want to store KML files from more than one platform in the same directory (e.g. a ship and an ROV).  The script has two required arguments.  The first is the input data file.  The second is the directory to store the resulting KML file(s).  Make sure you have write permission to the output directory before you run the script.
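The per-day split can be sketched in a few lines of awk.  This is only an illustration of the idea, not an excerpt from the script: the split_by_date name and the yyyymmdd.raw output naming are assumptions I picked for the example.

```shell
# Split an SCS-timestamped file into one file per date.
# $1 = input file, $2 = existing output directory
# (hypothetical helper; output files are named <yyyymmdd>.raw)
split_by_date() {
    awk -F, -v dir="$2" '{
        split($1, d, "/")                     # MM/DD/YYYY -> d[1], d[2], d[3]
        out = dir "/" d[3] d[1] d[2] ".raw"
        print >> out                          # append the row to that day
        close(out)                            # avoid piling up open files
    }' "$1"
}
```

Each per-day file could then go through the same resample, date and GPSBabel steps described earlier.  Because the rows are appended, rerunning the sketch into the same directory would duplicate data, so start with an empty directory.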

Here is the full usage statement:

Usage: ./NMEA_2_KML: [-vf] [-R <interval> ] [-P <prefix>] <input file> <output directory>
     -v turn on verbose messaging
     -f fix interpreted GGA data, requires NMEACheckSumGen
     -R <interval> resample the input data by selecting 1/<interval>
          data row.  Interval must be an integer. Default=600
     -P <prefix> prefix to add to the output KML files
      <input file> the navigation file to use
      <output directory> the directory to store the KML files

Enjoy, and I hope this helps.

Want to talk about what was discussed here? Please go to the forums.
