Friday, January 6, 2012

The Rituals of the Cult of Cthulhu as Practiced by Bioinformaticians


BEHOLD!
If you don't immediately recognize what the above is, you're in for a treat: that is an overview of the ISO standard for intermodal containers. Intermodal containers are those big steel containers you see at every single respectable port in the world (and at the Port of Oakland. Save for Sonny Barger and Tim Armstrong, there's nothing respectable about Oakland). There's actually far more documentation pertaining to everything from the markings to recommendations on the construction of the twist locks that allow the containers to be mounted on trains, trucks or container ships (ASTM D5728-00, ISO 9897, ISO 14829, ISO 17363, ISO/PAS 17712, ISO 18185-2 and ISO/TS 10891). Unless you ship in tonnes or have Asperger's, all of this might be of little interest to you, but consider this: 90% of all non-bulk cargo comes to your grubby little paws via an intermodal container. That's the world's entire shipping network, containerized.

Christina Hendricks is here because without her presence,
no-one would have read that paragraph about intermodal containers.
Know your audience.
Before containerization, cargo had to be loaded and unloaded from different modes of transport by crews of longshoremen for $5.86 a ton (probably sound drinking money in those days). Intermodal containers could be loaded and unloaded with a crane directly to another mode of transport or to storage for a measly $0.16 a ton. Not only that, those shifty longshoremen couldn't access the cargo within the containers, effectively reducing cargo theft. The stackability of containers also meant that specially built container ships could carry cargo more efficiently. Within decades of their introduction in 1955, intermodal containers had changed the landscape of shipping, removing significant trade barriers from Asia, conveniently sidestepping the issue of differing track gauges between railway lines, and building entire new cities around the needs of containerized ports.


A container ship at the Port of Miami.
There's a cocaine joke in there somewhere.
In fact, there's probably a lot of cocaine jokes in there.
The advantages containerization introduced to shipping are a prime example of standardization done right: a simple standard, adopted across multiple industries, to remove barriers.

Which naturally brings me to protein nomenclature. The basic unit of bioinformatics is our data, our best digital representation of the squishy underbelly of biology. We can choose to represent genes, proteins, molecules and other biological entities, plus any commonalities or interactions between them, as consistent digital representations. Well, that's what we should be doing, and what we pretty much are, except there are a lot of cooks in this particular kitchen.

I swear before Odin, this guy works in my lab.
My personal pet peeve is the protein, one of the mechanisms behind the EVERYTHING that happens in your body. We'll start with a simple one:

MPPSGLRLLL LLLPLLWLLV LTPGRPAAGL STCKTIDMEL VKRKRIEAIR GQILSKLRLA
SPPSQGEVPP GPLPEAVLAL YNSTRDRVAG ESAEPEPEPE ADYYAKEVTR VLMVETHNEI
YDKFKQSTHS IYMFFNTSEL REAVPEPVLL SRAELRLLRL KLKVEQHVEL YQKYSNNSWR
YLSNRLLAPS DSPEWLSFDV TGVVRQWLSR GGEIEGFRLS AHCSCDSRDN TLQVDINGFT
TGRRGDLATI HGMNRPFLLL MATPLERAQH LQSSRHRRAL DTNYCFSSTE KNCCVRQLYI
DFRKDLGWKW IHEPKGYHAN FCLGPCPYIW SLDTQYSKVL ALYNQHNPGA SAAPCCVPQA
LEPLPIVYYV GRKPKVEQLS NMIVRSCKCSL

That's the sequence of amino acids that defines the protein I'm referring to. There are a lot of these sequences. UniProt, my lab's protein sequence database of choice, lists 533,657. A sequence database is like a complicated phone-book. You know what a phone-book is, dammit: it's that slab of paper that props up your computer monitor. This particular phone-book lets you link up protein names with their sequences and an array of other associated data, like publications that reference the protein and rushing yards. Using this sequence to identify the protein in question is ideal for computers, but less so for a researcher, who wants to be able to refer to that sequence in a sentence. So researchers refer to this sequence by its 'name', which is Transforming growth factor beta-1.

Here is where things begin to get unhinged, in a 'Who's on First' kind of way. For 'Transforming growth factor beta-1' I get 136 matches in UniProt. Why 136 matches? Partly because the same name is used across multiple species to describe similar proteins. There's a protein with the same name in human, mouse, carp and a bunch of others. OK, so maybe restricting our search to human proteins is what we need: 42 results, but only one of them shares the same name. Wonderful. We even have a nice, unique ID to refer to it by: P01137. We've gone from Joe Black to Joe Black of 233 Whiskey Ave., New Orleans. Now do this same procedure hundreds more times for a list of proteins, which can number in the thousands, and you will get an idea of why biologists and computer scientists end up at odds with each other. A biologist writing a paper describing his favorite protein will refer to it by his favorite name, not by its unique identifier. The favorite name for P01137 could be any of the following: Transforming growth factor beta-1, TGF-beta-1, TGFB1, etc.
That guy you knew in high school who you only knew as 'Smurf', who everyone else called 'Murph' or 'Steve': try finding him in a phone-book with only that information. When that biologist gives us an Excel spreadsheet full of their favorite names instead of stable IDs, we're getting dozens of Smurfs, Wurzels and Dweezils. It's the kind of thing that results in night terrors.
Night terrors and awesome Metallica source material.
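For the code-minded, the Smurf problem can be sketched in a few lines of Python. This is only a toy, not how UniProt or anyone else actually resolves names: the alias table lives in memory, the normalization rule is an assumption of mine, and the mouse accession is a made-up placeholder rather than a real UniProt ID. Only P01137 and its three names come from the text above.

```python
from collections import defaultdict

# Toy alias table. P01137 and its three names come from the post;
# "PLACEHOLDER_MOUSE" is a hypothetical stand-in, not a real accession.
RECORDS = [
    ("P01137", "human",
     ["Transforming growth factor beta-1", "TGF-beta-1", "TGFB1"]),
    ("PLACEHOLDER_MOUSE", "mouse",
     ["Transforming growth factor beta-1", "Tgfb1"]),
]


def normalize(name):
    """Lower-case a name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_index(records):
    """Map each normalized name to the (accession, species) pairs using it."""
    index = defaultdict(set)
    for accession, species, names in records:
        for name in names:
            index[normalize(name)].add((accession, species))
    return index


def resolve(name, index, species=None):
    """Return, sorted, every accession the name could refer to."""
    hits = index.get(normalize(name), set())
    return sorted(acc for acc, sp in hits if species is None or sp == species)


INDEX = build_index(RECORDS)

print(resolve("tgfb1", INDEX))                   # two candidates: ambiguous
print(resolve("TGFB1", INDEX, species="human"))  # narrowed to one
print(resolve("Smurf", INDEX))                   # nobody by that name
```

The point of returning a list rather than a single ID is that an ambiguous name stays visibly ambiguous instead of silently picking whichever Joe Black came first.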
So, with our biologist now aware of our plight and referring to his data by a unique UniProt ID, we should be out of the woods, right? No. Not by a long shot. UniProt is constantly changing: IDs get retired, new IDs get added, and thousands of records are revised yearly. Your phone book from 1982 hardly helps you look up your friend from Kansas in 2012. Your UniProt ID can only be said to be certain for the snapshot in time in which you downloaded it. In addition, UniProt is not the only sequence repository in the world. There's NCBI's Entrez and RefSeq, Ensembl, and probably half a dozen smaller databases lurking in the swirling, Lovecraftian universe of bioinformatics. Each has its own versioning scheme and documentation. What you're doing now is combining numbers from two different phone-books, one in the English alphabet and one in Cyrillic, with an outdated Old East Slavic-English dictionary between them. Without keeping a keen eye on when you downloaded what from whom, your data will start looking like that shitty Cadillac Johnny Cash built.

Johnny Cash, ladies and gentlemen.
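One cheap way to keep an eye on "when you downloaded what from whom" is to never let an ID travel without that information. The sketch below is my own illustration, not a standard: the `VersionedId` class, its field names and the release labels are all hypothetical, and a real pipeline would want whatever release identifier the database itself publishes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VersionedId:
    """An accession plus the snapshot it was taken from (hypothetical layout)."""
    database: str    # e.g. "UniProt"
    accession: str   # e.g. "P01137"
    release: str     # release label; the values below are illustrative only
    retrieved: date  # the day the record was downloaded


def comparable(a, b):
    """Two IDs are only safely comparable within one database release."""
    return a.database == b.database and a.release == b.release


# Illustrative values: the release labels are assumptions, not checked
# against any real release history.
old = VersionedId("UniProt", "P01137", "release-A", date(2011, 6, 1))
new = VersionedId("UniProt", "P01137", "release-B", date(2012, 1, 6))

print(comparable(new, new))  # same snapshot
print(comparable(old, new))  # different snapshots: re-check before merging
```

The accession is identical in both records, which is exactly the trap: the string matching says nothing about whether the record behind it still does.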
All of this is compounded by common problems like this: even with your version numbers hammered down, when you translate from one database to another, your mapping is not guaranteed to be one-to-one. If you ask Ensembl for UniProt IDs and UniProt for Ensembl IDs for their respective proteins, you won't necessarily get the same answers. There are, after all, a lot of Browns in the phone-book. To top it all off, when you go to buy a physical product, say a DNA microarray from Affymetrix, you'll find that Affymetrix has its own IDs, derived from multiple databases, which you'll have no choice but to translate to, either with the same mystical translations mentioned above, or with a dedicated application such as BLAST.
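The one-to-one problem is at least easy to detect, even if it's miserable to fix. Here's a minimal sketch, assuming you've already pulled each database's mapping table into a plain dictionary of sets. The function names are mine and the IDs (`U1`, `E1` and friends) are placeholders, not real UniProt or Ensembl identifiers.

```python
def not_one_to_one(mapping):
    """Return the source IDs that map to zero or to several targets."""
    return sorted(src for src, targets in mapping.items() if len(targets) != 1)


def failed_round_trips(forward, backward):
    """Return (source, target) pairs the reverse table does not agree with."""
    failures = []
    for src, targets in forward.items():
        for tgt in targets:
            if src not in backward.get(tgt, set()):
                failures.append((src, tgt))
    return sorted(failures)


# Hypothetical toy tables: what "database U" says about "database E",
# and what "database E" says back.
u_to_e = {"U1": {"E1"}, "U2": {"E2", "E3"}, "U3": set()}
e_to_u = {"E1": {"U1"}, "E2": {"U2"}, "E3": {"U9"}}

print(not_one_to_one(u_to_e))              # U2 has two targets, U3 has none
print(failed_round_trips(u_to_e, e_to_u))  # E3 points back at someone else
```

Running both checks before merging two datasets won't tell you which phone-book is right, but it will tell you which entries to go argue about.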

All of this gives us the same kind of extraneous overhead that longshoremen were responsible for in the domain of shipping. You get data from a database, you try to translate it into the format your lab uses, you end up with bizarre results from the original database's translation tables, and then you sacrifice a goat to the Old Gods, fire up BLAST and hope for the best. Then, you do it all over again when someone else asks for your data in their favorite format.

This is why the shipping container appeals to me: a common format would drive a lot of these woes beneath our notice. How did the shipping container come to dominate? There was a lot of wheeling and dealing by a fellow named Malcom McLean, a former trucker with a high-school diploma who eventually founded his own trucking company. McLean had to convince port owners to go with his standard, and that involved a lot of fighting with longshoremen's unions, whose livelihood was directly threatened by his new system. Eventually, through a series of trials and adoption by the US military, the standard hit a critical point, and there was no turning back. If you weren't shipping containers, you were on the outskirts of an industry.

There's a lot of room for Malcom McLeans in the realm of bioinformatics. Until one comes along, I'll be the guy staring wistfully as a cargo train disappears over the horizon.
