Wednesday, January 25, 2012

Teaser: The Map of Rock

I've had a LOT of fun building the Map of Rock over the past month, and with a few pointers from Christophe Viau, I've managed to find a spot for it online, where everyone can have a gander. There's a problem, though: I see a lot of people reading this blog, but no comments. I cannot make bricks without clay, people. So here's the deal: five comments. If five comments show up on this post, you can have the ultra-secret volcano-lair location of the Map of Rock. It's a steal, ladies and gentlemen.

Ten comments, and I'll go through the trouble of inserting all the missing umlauts.

All Hail Our New Database Overlords!

I recently had to do a presentation on biological pathways, or to be more accurate, my ignorance of biological pathways. For all of the insights I got from my audience, I might as well have presented in the form of an interpretive dance.
All your pathways are horse. Hear the blue.
After the awkward silence died down, I did manage to have a chat with some interested parties, and a few ugly truths came out.

If you got a headache contemplating the inconsistencies of protein nomenclature, you will be less than pleased to know that while we may not know which proteins are what, we're busily building interaction and pathway databases based on them. Remember PKACA, the protein I mentioned in an earlier blog post? Of course you do. You've been paying attention. You will recall a couple of databases that store information on PKACA (I keep reading it as 'pee-kaka'. Seriously, it's like there's a five-year-old romping around inside my head.) and they all have different information on it. When I wrote that earlier post, I didn't even blink when I linked to multiple databases, each with different ideas about PKACA. The number of interactions was anywhere from 1 (DIP) to 15 (MINT) to 17 (IntAct). Looking at metabolic pathways, PKACA is a member of either 27 pathways (KEGG), 5 (PID) or 54 (Reactome). Is the entire bioinformatics community just trying to mess with my head? No: these databases fall prey to some real-world constraints. Often, they are curated by individuals. Those individuals have to sort through this monster:


PubMed. Vincenzo Natali could have a field day with this.
PubMed is a database which contains just about every respectable medically-related publication in a searchable format, going as far back as when 'The Practicability of transporting the Negro back to Africa.' was still a respectable article title. Curators have to sort through article after article involving individual proteins. PKACA shows up in 93 articles in PubMed's search (PubMed's search has its own issues, which I'll leave for another day; for now let's just say 93 is a really fuzzy number). A curator has to read those articles and extract interactions from them. Recall that protein names are ambiguous and that the curator can't possibly have knowledge of every one of the 20,242 sequences listed in UniProt for a species like Homo sapiens. The task passes from Herculean to Sisyphean when the definition of each protein has a less than stable identity. It's like tracking a population of Jason Bournes.
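If you want to check my fuzzy number yourself, the lookup is easy to script. Here's a minimal sketch using Biopython's Entrez module (the contact email is a placeholder; the bare term search inherits exactly the kind of fuzzy name matching I'm complaining about):

# Count PubMed hits for 'PKACA' via Biopython's Entrez module.
# A bare term search matches on names, not on actual proteins,
# which is why the count is a really fuzzy number.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address

handle = Entrez.esearch(db="pubmed", term="PKACA")
record = Entrez.read(handle)
handle.close()

print(record["Count"])       # total hits for the term
print(record["IdList"][:5])  # the first few PubMed IDs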

This would not be the first time this has happened in a field of study. Take cartography, which has a history of inconsistencies that would be comical if people's lives hadn't depended on it. From cartography, we have the example of portolan charts, a powerful navigational tool used by navigators from the 13th century until the Renaissance. These charts focused on compass headings instead of going through the whole tedious business of surveying coastlines, allowing sailors to plot a course between different landmarks. This is like that uncle who refuses to use street names and gives you directions like 'take a left when you see the tree beside a bigger tree that looks kind of like a goose'. When you look at a map today, you likely imagine that what you see will be mirrored almost exactly on another map, because both should have been derived from the same data. This is because you have been spoiled by more than a century of photography and accurate reproduction. Even so, you shouldn't believe everything you navigate by.

Google Earth or psilocybin?
Back before modern cartography, errors abounded. Portolan charts varied, their content dependent on the needs of their eventual owners. A Venetian merchant and a Dutch one would have different ports of interest, depending on where they had friendly trading partners. Omissions were a matter of course. Individual portolan charts, as with many other maps, were often derived from the charts before them, not necessarily from observations of the world they were meant to represent. Some charts were patchworks of other charts. This meant that errors were carried down from chart to chart. Strangely enough, the system worked, well enough that some states kept their portolan charts as state secrets, a good indication of their value. The charts were safely in the realm of 'good enough'. What they weren't good enough for was the immensity of things like the Atlantic Ocean, which was kind of bigger than cartographers were used to (portolans were primarily used for inland bodies of water) and required much more detail to navigate successfully. Map-making picked up the ball and ran with it after this point, changing from attempts to document individual routes to attempts to model geographical reality as accurately as possible.

Which brings me back to the points outlined in 'Too many roads not taken', except the criticism isn't only of the direction research takes, but of the way we choose to document it. What we see now is the equivalent of the portolan chart: every map is not necessarily built upon the same data. Certain interactions and pathways are of importance to individual curators, even if only because they happened to be on top of the pile of articles they had to read. Errors are propagated, even in this age, by legacy literature.

So where to now? The good news is that our knowledge is expanding, as are the tools at our disposal. Maybe what we'll end up seeing, as our ability to identify and characterize proteins, interactions and pathways improves, is a kind of convergence. Eventually, given the same data, these databases could end up looking pretty similar, the same way maps do nowadays. The larger-scale studies suggested by Edwards et al. could be part of this solution. I have another, complementary suggestion:

Nuke it from orbit.
PubMed occupies a unique position. To my knowledge (jump in here and tell me I'm wrong on this) there is no other comprehensive database of medical publications with the scope and power of PubMed. PubMed could require annotations for new articles referencing protein-protein interactions, and were it to accept only one particular format, say RefSeq, it would provide a strong incentive for others to follow suit. Remember Malcom McLean? This could be the ruthless monopoly that transforms the industry, much the same way Malcom transformed the global shipping industry with a few key deals. It would be ugly, and many work hours would be wasted on behalf of databases that would suddenly have to pledge allegiance to their new NCBI overlords, but the results would make my job a whole lot easier. And that's what really matters, isn't it?

Friday, January 6, 2012

The Rituals of the Cult of Cthulhu as Practiced by Bioinformaticians


BEHOLD!
If you don't immediately recognize what the above is, you're in for a treat: that is an overview of the ISO standard for intermodal containers. Intermodal containers are those big steel containers you see at every single respectable port in the world (and at the Port of Oakland. Save for Sonny Barger and Tim Armstrong, there's nothing respectable about Oakland). There's actually far more documentation, pertaining to everything from the markings to recommendations on the construction of the twist locks that allow the containers to be mounted on trains, trucks or container ships (ASTM D5728-00, ISO 9897, ISO 14829, ISO 17363, ISO/PAS 17712, ISO 18185-2 and ISO/TS 10891). Unless you ship in tonnes or have Asperger's, all of this might be of little interest to you, but consider this: 90% of all non-bulk cargo comes to your grubby little paws via an intermodal container. That's the world's entire shipping network: containerized.

Christina Hendricks is here because without her presence,
no-one would have read that paragraph about intermodal containers.
Know your audience.
Before containerization, cargo would have to be loaded and unloaded from different modes of transport by crews of longshoremen for $5.86 a ton (probably sound drinking money in those days). Intermodal containers could be loaded and unloaded with a crane directly to another mode of transport or to storage for a measly $0.16 a ton. Not only that, those shifty longshoremen couldn't access the cargo within the crates, effectively reducing cargo theft. The stackability of crates also meant that specially built container ships could carry cargo more efficiently. Within decades after their introduction in 1955, intermodal containers had changed the landscape of shipping, removing significant trade barriers from Asia, conveniently sidestepping the issue of differing track gauges between railway lines, and building entire new cities around the needs of containerized ports.


A container ship at the Port of Miami.
There's a cocaine joke in there somewhere.
In fact, there's probably a lot of cocaine jokes in there.
The advantages containerization introduced to shipping are a prime example of standardization done right: a simple standard, adopted across multiple industries, to remove barriers.

Which naturally brings me to protein nomenclature. The basic unit of bioinformatics is our data, our best digital representation of the squishy underbelly of biology. We can choose to encode genes, proteins, molecules and other biological entities, plus any commonalities or interactions between them, as consistent digital representations. Well, that's what we should be doing, and what we pretty much are doing, except there are a lot of cooks in this particular kitchen.

I swear before Odin, this guy works in my lab.
My personal pet peeve is the protein, one of the mechanisms behind EVERYTHING that happens in your body. We'll start with a simple one:

MPPSGLRLLL LLLPLLWLLV LTPGRPAAGL STCKTIDMEL VKRKRIEAIR GQILSKLRLA
SPPSQGEVPP GPLPEAVLAL YNSTRDRVAG ESAEPEPEPE ADYYAKEVTR VLMVETHNEI
YDKFKQSTHS IYMFFNTSEL REAVPEPVLL SRAELRLLRL KLKVEQHVEL YQKYSNNSWR
YLSNRLLAPS DSPEWLSFDV TGVVRQWLSR GGEIEGFRLS AHCSCDSRDN TLQVDINGFT
TGRRGDLATI HGMNRPFLLL MATPLERAQH LQSSRHRRAL DTNYCFSSTE KNCCVRQLYI
DFRKDLGWKW IHEPKGYHAN FCLGPCPYIW SLDTQYSKVL ALYNQHNPGA SAAPCCVPQA
LEPLPIVYYV GRKPKVEQLS NMIVRSCKCS

That's the sequence of amino acids that defines the protein I'm referring to. There are a lot of these sequences. UniProt, my lab's protein sequence database of choice, lists 533,657. A sequence database is like a complicated phone-book. You know what a phone-book is, dammit: it's that slab of paper that props up your computer monitor. This particular phone-book lets you link up protein names with their sequences and an array of other associated data, like publications that reference the protein and rushing yards. Using this sequence to identify the protein in question is ideal for computers, but less so for a researcher, who wants to be able to refer to that sequence in a sentence. So researchers refer to this sequence by its 'name', which is Transforming growth factor beta-1.

Here is where things begin to get unhinged, in a 'Who's on First' kind of way. For 'Transforming growth factor beta-1' I get 136 matches in UniProt. Why 136 matches? Partly because the same name is used across multiple species to describe similar proteins. There's a protein with the same name in human, mouse, carp and a bunch of others. OK, so maybe restricting our search to human proteins is what we need: 42 results, but only one of them shares the exact name. Wonderful. We even have a nice, unique ID to refer to it by: P01137. We've gone from Joe Black to Joe Black of 233 Whiskey Ave., New Orleans.

Now do this same procedure hundreds more times for a list of proteins, which can number in the thousands, and you will get an idea of why biologists and computer scientists end up at odds with each other. A biologist writing a paper describing his favorite protein will refer to it by his favorite name, not by its unique identifier. The favorite name for P01137 could be any of the following: Transforming growth factor beta-1, TGF-beta-1, TGFB1, etc. That guy you knew in high school who you only knew as 'Smurf', who everyone else called 'Murph' or 'Steve': try finding him in a phone-book with only that information. When that biologist gives us an Excel spreadsheet full of their favorite names instead of stable IDs, we're getting dozens of Smurfs, Wurzels and Dweezils. It's the kind of thing that results in night terrors.
Night terrors and awesome Metallica source material.
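If you want to replay that Joe Black routine at home, here's a rough sketch of the lookup against UniProt's REST API. The endpoint, query syntax and field names are assumptions based on the current API (UniProt has reworked its interface more than once), and 9606 is the NCBI taxon ID for Homo sapiens:

# Narrow a protein name down to stable accessions via UniProt's REST
# API. Endpoint and field names assume the current API.
import requests

URL = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": 'protein_name:"Transforming growth factor beta-1" AND organism_id:9606',
    "fields": "accession,protein_name,organism_name",
    "format": "tsv",
    "size": 50,
}

response = requests.get(URL, params=params)
response.raise_for_status()

rows = response.text.splitlines()
for row in rows[1:]:  # rows[0] is the TSV header
    accession, name, organism = row.split("\t")
    print(accession, name, organism)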
So, with our biologist now aware of our plight and referring to his data by a unique UniProt ID, we should be out of the woods, right? No. Not by a long shot. UniProt is constantly changing: IDs get retired, new IDs get added, and thousands of records are revised yearly. Your phone-book from 1982 hardly helps you look up your friend from Kansas in 2012. Your UniProt ID can only be said to be certain for the snapshot in time in which you downloaded it. In addition, UniProt is not the only sequence repository in the world. There's NCBI's Entrez and RefSeq, Ensembl, and probably half a dozen smaller databases lurking in the swirling, Lovecraftian universe of bioinformatics. Each has its own versioning and documentation. What you're doing now is combining numbers from two different phone-books, one in the English alphabet and one in Cyrillic, with an outdated Old East Slavic-English dictionary between them. Without keeping a keen eye on when you downloaded what from whom, your data will start looking like that shitty Cadillac Johnny Cash built.

Johnny Cash, ladies and gentlemen.
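The only defence I know of is obsessive bookkeeping. Here's a minimal sketch of the idea, using nothing but the Python standard library (the class and field names are my own invention):

# Never store a bare accession: store it with the database it came
# from, that database's release, and the date of your snapshot.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class VersionedID:
    accession: str  # e.g. a UniProt accession
    database: str   # which phone-book it came from
    release: str    # that database's release label, e.g. "2012_01"
    fetched: date   # when the snapshot was taken

tgfb1 = VersionedID("P01137", "UniProt", "2012_01", date(2012, 1, 6))
print(tgfb1)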
All of this is compounded by problems like this: even with your version numbers hammered down, and even when you can translate from one database to another, your mapping is not guaranteed to be one-to-one. If you ask Ensembl for UniProt IDs and UniProt for Ensembl IDs for their respective proteins, you won't necessarily get the same answers. There are, after all, a lot of Browns in the phone-book. To top it all off, when you go to buy a physical product, say a DNA microarray from Affymetrix, you'll find that Affymetrix has its own IDs, derived from multiple databases, which you'll have no choice but to translate to and from, either with the same mystical translations mentioned above, or with a dedicated application such as BLAST.
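To make the asymmetry concrete, here's a toy round-trip check. The tables and IDs below are invented for illustration; real ones would be parsed from each database's cross-reference downloads:

# Demonstrate that cross-database mappings need not survive the round
# trip: mapping an Ensembl ID to UniProt and back can lose the original.
ensembl_to_uniprot = {
    "ENSP00000000001": {"P00001"},
    "ENSP00000000002": {"P00001", "Q00002"},  # one-to-many already
}
uniprot_to_ensembl = {
    "P00001": {"ENSP00000000001"},  # no mention of ENSP00000000002
    "Q00002": set(),
}

for ens_id, uniprot_ids in ensembl_to_uniprot.items():
    for up_id in uniprot_ids:
        round_trip = uniprot_to_ensembl.get(up_id, set())
        if ens_id not in round_trip:
            print(f"asymmetric: {ens_id} -> {up_id} -> {sorted(round_trip)}")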

All of this gives us the same kind of extraneous overhead that longshoremen were responsible for in the domain of shipping. You get data from a database, you try to translate it into the format your lab uses, you end up with bizarre results from the original database's translation tables, and then you sacrifice a goat to the Old Gods, fire up BLAST and hope for the best. Then, you do it all over again when someone else asks for your data in their favorite format.
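Goat aside, the ritual is scriptable. Below is a sketch of the translate-or-BLAST fallback, assuming NCBI BLAST+ is installed and a protein database has been built with makeblastdb; the mapping table and file names are invented:

# Translate an accession via a mapping-table snapshot, falling back to
# a blastp search for the best hit when the table has no entry.
import subprocess

mapping_table = {"P00001": "ENSP00000000001"}  # invented snapshot

def translate(accession, query_fasta, blast_db):
    if accession in mapping_table:
        return mapping_table[accession]
    result = subprocess.run(
        ["blastp", "-query", query_fasta, "-db", blast_db,
         "-outfmt", "6 sseqid evalue", "-max_target_seqs", "1"],
        capture_output=True, text=True, check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0].split("\t")[0] if lines else None  # best hit, or None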

This is why the shipping container appeals to me: a common format would drive a lot of these woes beneath our notice. How did the shipping container come to dominate? There was a lot of wheeling and dealing by a fellow named Malcom McLean, a former trucker with a high-school diploma who eventually founded his own trucking company. McLean had to convince port owners to go with his standard, and that involved a lot of fighting with longshoremen's unions, whose livelihood was directly threatened by his new system. Eventually, through a series of trials and adoption by the US military, the standard hit a critical point, and there was no turning back. If you weren't shipping containers, you were on the outskirts of the industry.

There's a lot of room for Malcom McLeans in the realm of bioinformatics. Until one comes along, I'll be the guy staring wistfully as a cargo train disappears over the horizon.