The Format of Babel or In Text We Trust

Over the past several days I’ve been wrangling with text in various forms and formats. More specifically, I’ve been trying to get various references and documentation into Emacs (more on that some other day). As I went through all the sizes and shapes that text comes in, I could only marvel at the number of seemingly interchangeable yet arbitrarily unique ways text is molded into one form or another.


Take this “eBook” revolution on our hands. The EPUB format used by the iPad and most other e-readers is essentially XHTML wrapped up in a zip file that conveniently carries an “.epub” extension. I’m sure that if this were the 1990s, or if Adobe had its way, it would be some kind of proprietary binary format (like a compiled program) that sends corporate headquarters your device ID and GPS location, ready to transmit your name to the authorities or disable your device at the slightest infringement. I think it’s mostly thanks to the disaster that is PDF, and the lucky circumstance that no one company yet has a lock on digital publishing, that we were able to adopt a relatively decent format like EPUB.
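To make the “XHTML in a zip file” point concrete, here is a minimal Python sketch using only the standard library. It builds a tiny EPUB-like archive in memory and then peeks inside it the way you would with any zip file. The helper names (build_sample_epub, list_xhtml_entries, read_entry) and the file layout are my own illustrative assumptions, and the sample is deliberately incomplete: a real EPUB also needs a container file and a package manifest.

```python
import io
import zipfile


def build_sample_epub():
    """Build a tiny EPUB-like zip in memory (illustrative, not a complete EPUB)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # EPUB archives start with an uncompressed "mimetype" entry.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        # The actual book content is ordinary XHTML (hypothetical path and text).
        zf.writestr("OEBPS/chapter1.xhtml",
                    "<html xmlns='http://www.w3.org/1999/xhtml'>"
                    "<body><p>In text we trust.</p></body></html>",
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


def list_xhtml_entries(epub_bytes):
    """Return the names of the (X)HTML files stored inside the archive."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return [name for name in zf.namelist()
                if name.endswith((".xhtml", ".html"))]


def read_entry(epub_bytes, name):
    """Read one entry from the archive and decode it as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        return zf.read(name).decode("utf-8")


if __name__ == "__main__":
    book = build_sample_epub()
    for entry in list_xhtml_entries(book):
        print(entry)
        print(read_entry(book, entry))
```

Pointing the same two reader functions at the bytes of a real, DRM-free .epub file should list its chapters just as readily, which is exactly what makes the format so approachable.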

Yet the road to get here is littered with many forgotten and esoteric standards alongside a few well-known ones. The formats I encountered these past several days include LaTeX, HTML, XHTML, txt, RTF, SGML, texi, LyX, rhtml, Markdown, Textile, and CHM. All of them are used to document source code, often generated from commenting templates, for some of the more well-known open-source projects. All of them are used in some capacity by prominent branches of software. For example, texi files are used by GNU open-source projects, which compile them into “info” files that can easily be read by Emacs. All the formats are based on either plain text with modest formatting or HTML. Some of the newer projects use their own templating engine with HTML output for easy publishing to websites (and to provide search engine fodder for API documentation). The good thing is that most of the markup is lightweight and it’s essentially text. The formats are usually designed to make navigation easy by stipulating headers or other conventions for cross-linking within the document, so you can jump from an index to the relevant documentation quickly.

The problem is that each of these formats scratches someone’s itch and none of them is completely interchangeable with the others. Usually there’s some kind of intermediary format that gets converted into the final form. PDF is a prominent option for that final form, but once you transform a document into PDF, heaven help you if you misplace the original.

Looking around at some of the more recent HTML-based varieties had me thinking. While these documents are decent as HTML, with minimal, clean design and just enough JavaScript to make navigation easy, they don’t beat having the text in a malleable form at your fingertips. They just can’t compete.

We live in a world where we are surrounded by text, trade in text, yet for the most important pieces we are still too scared to completely let go and resign ourselves to the free flow of ideas and words. We are still coming to terms with the way we are willing to trade text.

One thing I’m glad to see with EPUB is the resurgence of formats like LaTeX. The problem with PDF, PostScript, and any of the Microsoft Office formats (which have opened up quite a bit) is that they mix up presentation and content. A lot of us are probably old enough to remember how buggy Microsoft Word used to be. It was like the Wild West, where you always had to keep one hand on your gun (the save button) before that pesky Microsoft bandit pissed away your day’s work with “Word has crashed. What would you like to do? [Exit] [Cancel]”. Of course, we all know that pressing [Cancel] takes you right back to the dialog. If you’re lucky, you might find a garbled file with fragments of what you thought you typed amidst what could only be communication from alien life forms. All you did was type a few words.

Before Microsoft, standards like LaTeX kept the raw content separate from the presentation. You could add content to your heart’s content and then adjust the format when you were done. You never had to deal with mind-boggling situations like the ones you get with Microsoft Word, where shifting a diagram one pixel to the right suddenly turns the preceding paragraph into a 15pt bold headline in bright blue.

Somewhere deep down inside, people are trying to come back to simplicity as seen with the resurgence of “writing programs” that essentially strip away all the bells and whistles to give us a full screen of nothing but our own text so that we can focus on writing.