Spreadsheets as a data exchange format
Jesse Luehrs (@doyster)
March 27th, 2014
When working with non-technical clients, their preferred means of exchanging structured data is often a spreadsheet. Using a custom tool is not always practical due to cost or training time constraints, and using a type of document that doesn’t have its own standard editor (such as XML or JSON) will generally result in having to deal with malformed files on a regular basis, since these files are often edited by hand.
Excel is the only program for managing structured data that is widely used by both technical and non-technical people, and being able to leverage that structure can make the whole data exchange process much smoother, even though it can be frustrating at times.
Tools
Luckily, Perl comes with a good set of modules for reading and writing various kinds of spreadsheets. I use the following:
- For .xlsx files, I use Spreadsheet::ParseXLSX (which I wrote) for reading, and Excel::Writer::XLSX for writing.
- For .xls files, I use Spreadsheet::ParseExcel for reading, and Spreadsheet::WriteExcel for writing.
- For .csv files, I use Text::CSV_XS.
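To give a rough idea of how these modules fit together, here is a minimal sketch (the file names and single-sheet layout are assumptions for illustration) that reads every cell of a .xlsx file with Spreadsheet::ParseXLSX and copies the values into a new file with Excel::Writer::XLSX:

```perl
use strict;
use warnings;
use Spreadsheet::ParseXLSX;
use Excel::Writer::XLSX;

# Read the first worksheet of an existing file.
# "input.xlsx" and "output.xlsx" are placeholder names for this sketch.
my $parser   = Spreadsheet::ParseXLSX->new;
my $workbook = $parser->parse('input.xlsx')
    or die "Couldn't parse input.xlsx";
my ($sheet)  = $workbook->worksheets;

my ($row_min, $row_max) = $sheet->row_range;
my ($col_min, $col_max) = $sheet->col_range;

# Copy the cell values into a new .xlsx file.
my $out       = Excel::Writer::XLSX->new('output.xlsx');
my $out_sheet = $out->add_worksheet;

for my $row ($row_min .. $row_max) {
    for my $col ($col_min .. $col_max) {
        my $cell = $sheet->get_cell($row, $col);
        next unless defined $cell;
        $out_sheet->write($row, $col, $cell->value);
    }
}

$out->close;
```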
In addition, I have a couple of unreleased scripts (here and here) for turning an Excel file into a set of CSV files and back, since it is occasionally easier to convert to CSV, make a few edits by hand in a text editor, and then convert back to Excel rather than writing a script to do it. CSV files are also diffable, so this can be used to detect changes between different versions of Excel files.
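The scripts themselves aren’t reproduced here, but the core of the Excel-to-CSV direction is small; something along these lines (one output CSV per worksheet, with file names assumed for illustration) is roughly what’s involved:

```perl
use strict;
use warnings;
use Spreadsheet::ParseXLSX;
use Text::CSV_XS;

my $workbook = Spreadsheet::ParseXLSX->new->parse('input.xlsx')
    or die "Couldn't parse input.xlsx";
my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });

# Write one CSV file per worksheet, named after the sheet.
for my $sheet ($workbook->worksheets) {
    my $name = $sheet->get_name;
    open my $fh, '>:encoding(UTF-8)', "$name.csv" or die $!;

    my ($row_min, $row_max) = $sheet->row_range;
    my ($col_min, $col_max) = $sheet->col_range;

    for my $row ($row_min .. $row_max) {
        my @fields = map {
            my $cell = $sheet->get_cell($row, $_);
            defined $cell ? $cell->value : '';
        } $col_min .. $col_max;
        $csv->print($fh, \@fields);
    }
    close $fh;
}
```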
If you need to debug files that are being misparsed, or if you need to implement Excel support in a language that doesn’t yet have a decent implementation, the relevant file format specifications are documented quite copiously (and somewhat impenetrably) here (.xlsx) and here (.xls).
Best practices
Mitigating data errors
Don’t let the structure of the spreadsheet drive your actual data model (or vice versa). It’s likely that the spreadsheet will be created by someone who isn’t familiar with how the system works underneath, and these documents are often structured in a way that emphasizes readability by humans over readability by computers. It’s quite possible for the document to change in ways that make sense from their business perspective, but not from the perspective of how the system works. Keeping a separation between “how the data is delivered to us” and “how the system actually understands the data” (the latter preferably driven by actual requirements documents) will make things run more smoothly.
In addition, these kinds of changes make thorough error checking much more important. In these situations, error checking is probably even more important than testing, because it’s quite difficult to predict the ways in which things can go wrong. Enforcing a very strict document structure and throwing an error as soon as something unexpected is seen is preferable to trying to guess and ending up with malformed data. If you can impress upon the client the need for a strict document structure, even better, but this isn’t always possible.
If you aren’t able to enforce a structure, it’s often better to take a two-stage approach. First, write a strict importer that expects a single, well-defined document structure, with lots of error checking, and then do whatever you have to do to the data you receive to make it fit that structure, if possible. It’s rarely worth the effort to modify the actual parser every time the structure of the document changes, since that may happen quite frequently, and every change to the parser is another opportunity to introduce bugs into it.
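As a sketch of what that two-stage shape can look like (the column names and validation rules below are invented for illustration, not taken from any real project), the strict importer locates columns by header name and dies on anything it doesn’t recognize:

```perl
use strict;
use warnings;

# Stage two: the strict importer. It expects exactly these columns (in any
# order, located by header name) and dies loudly on anything unexpected.
# The column names here are hypothetical.
my @expected = ('SKU', 'Description', 'Price');

sub import_rows {
    my ($header, @rows) = @_;

    # Map each expected header name to the column index where it appears.
    my %col_for;
    for my $i (0 .. $#$header) {
        my $name = $header->[$i];
        die "Unexpected column '$name'\n"
            unless grep { $_ eq $name } @expected;
        $col_for{$name} = $i;
    }
    for my $name (@expected) {
        die "Missing column '$name'\n" unless exists $col_for{$name};
    }

    # Build internal records, validating each field as we go.
    my @records;
    for my $row (@rows) {
        my %record = map { $_ => $row->[ $col_for{$_} ] } @expected;
        die "Price '$record{Price}' is not a number\n"
            unless defined $record{Price}
                && $record{Price} =~ /^\d+(?:\.\d+)?$/;
        push @records, \%record;
    }
    return @records;
}
```

The first stage, massaging whatever was actually delivered into the shape import_rows expects, can then live in a separate, disposable script that changes as often as the client’s documents do.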
Other tricks and things to watch out for
- Working with spreadsheets can be tricky, because they are quite often assembled by hand, and this can introduce all kinds of interesting errors. A common one that I’ve run into is that people tend not to be very careful about leading or trailing whitespace in cells. This should almost certainly be stripped in most situations where you find it. Note that for whatever reason, non-breaking spaces (\xA0) are common, so make sure you’re not only looking for ASCII whitespace (see the sketch after this list).
- Numeric formatting in Excel files can occasionally cause problems. For instance, if you use identifiers that look like large numbers (larger than can fit in a machine integer), applying formatting can turn them into large floating point numbers, since that’s how languages like Perl deal with integer overflow. This will make later exact comparisons fail.
- If the spreadsheet is being generated by hand, people rarely remember exact column orders. Keeping track of which column is associated with which data based on column headers rather than column indices tends to be much more accurate.
- For quick and dirty data manipulation, it is often easier to convert the spreadsheet you received into a CSV file and modify that in a text editor directly, rather than actually writing a script. Using tabs as the separator in the CSV format (sometimes called TSV instead) rather than commas can avoid complexity here, since actual data contains literal commas much more often than it contains literal tabs.
- If your programming environment doesn’t have a good way of generating Excel files, it occasionally works to just generate CSV files (which is pretty trivial to do in any language) and serve them up with a .xls extension. While this isn’t exactly recommended, it can save a lot of effort in some cases. (A similar trick can be used for Word documents by generating them as .rtf files and serving them with a .doc extension.)
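For the whitespace point in particular, a small helper along these lines (the name clean_cell is made up) can be applied to every cell value as it’s read; note the explicit \x{A0} so that non-breaking spaces are caught along with ASCII whitespace:

```perl
use strict;
use warnings;

# Strip leading and trailing whitespace, including non-breaking spaces
# (U+00A0), from a single cell value. Undefined values pass through.
sub clean_cell {
    my ($value) = @_;
    return $value unless defined $value;
    $value =~ s/^[\s\x{A0}]+//;
    $value =~ s/[\s\x{A0}]+$//;
    return $value;
}
```

(For the tab-separated variant mentioned above, Text::CSV_XS’s sep_char option handles tabs without any other changes.)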
Conclusion
Data exchange is a very common task when dealing with clients, and clients themselves typically use Excel. Learning to work within their processes (which typically exist because they actually do work) can be less difficult than you think, and may save both you and the client a lot of time and effort.
Tags: technology perl communication