BEHOLD!
Christina Hendricks is here because without her presence, no one would have read that paragraph about intermodal containers. Know your audience.
A container ship at the Port of Miami. There's a cocaine joke in there somewhere. In fact, there are probably a lot of cocaine jokes in there.
Which naturally brings me to protein nomenclature. The basic unit of bioinformatics is our data, our best digital representation of the squishy underbelly of biology. We can choose to encode genes, proteins, molecules and other biological entities, plus any commonalities or interactions between them, in a consistent digital form. Well, that's what we should be doing, and what we pretty much are, except there are a lot of cooks in this particular kitchen.
I swear before Odin, this guy works in my lab.
MPPSGLRLLL LLLPLLWLLV LTPGRPAAGL STCKTIDMEL VKRKRIEAIR GQILSKLRLA
SPPSQGEVPP GPLPEAVLAL YNSTRDRVAG ESAEPEPEPE ADYYAKEVTR VLMVETHNEI
YDKFKQSTHS IYMFFNTSEL REAVPEPVLL SRAELRLLRL KLKVEQHVEL YQKYSNNSWR
YLSNRLLAPS DSPEWLSFDV TGVVRQWLSR GGEIEGFRLS AHCSCDSRDN TLQVDINGFT
TGRRGDLATI HGMNRPFLLL MATPLERAQH LQSSRHRRAL DTNYCFSSTE KNCCVRQLYI
DFRKDLGWKW IHEPKGYHAN FCLGPCPYIW SLDTQYSKVL ALYNQHNPGA SAAPCCVPQA
LEPLPIVYYV GRKPKVEQLS NMIVRSCKCSL
That's the sequence of amino acids that defines the protein I'm referring to. There are a lot of these sequences: UniProt, my lab's protein sequence database of choice, lists 533,657. A sequence database is like a complicated phone book. You know what a phone book is, dammit: it's that slab of paper that props up your computer monitor. This particular phone book lets you link up protein names with their sequences and an array of other associated data, like publications that reference the protein and rushing yards.
Using this sequence to identify the protein in question is ideal for computers, but less so for a researcher, who wants to be able to refer to that sequence in a sentence. So researchers refer to this sequence by its 'name', which is Transforming growth factor beta-1. Here is where things begin to get unhinged, in a 'Who's on First' kind of way. For 'Transforming growth factor beta-1' I get 136 matches in UniProt. Why 136 matches? Partly because the same name is used across multiple species to describe similar proteins. There's a protein with the same name in human, mouse, carp and a bunch of others. OK, so maybe restricting our search to human proteins is what we need: 42 results, but only one of them shares the same name. Wonderful. We even have a nice, unique ID to refer to it by: P01137. We've gone from Joe Black to Joe Black from 233 Whiskey Ave., New Orleans.
Now do this same procedure hundreds more times for a list of proteins, which can number in the thousands, and you will get an idea of why biologists and computer scientists end up at odds with each other. A biologist writing a paper describing his favorite protein will refer to it by his favorite name, not by its unique identifier. The favorite name for P01137 could be any of the following: Transforming growth factor beta-1, TGF-beta-1, TGFB1, etc.
That guy you knew in high school who you only knew as 'Smurf', who everyone else called 'Murph' or 'Steve': try finding him in a phone book with only that information. When that biologist gives us an Excel spreadsheet full of their favorite names instead of stable IDs, we're getting dozens of Smurfs, Wurzels and Dweezils. It's the kind of thing that results in night terrors.
Night terrors and awesome Metallica source material.
Johnny Cash, ladies and gentlemen.
All of this gives us the same kind of extraneous overhead that longshoremen were responsible for in the domain of shipping. You get data from a database, you try to translate it into the format your lab uses, you end up with bizarre results from the original database's translation tables, and then you sacrifice a goat to the Old Gods, fire up BLAST and hope for the best. Then, you do it all over again when someone else asks for your data in their favorite format.
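One way to shrink that translate-and-repeat loop is to funnel every format through a single canonical record, so each new format needs one reader and one writer instead of a converter to every other format. Below is a minimal Python sketch under stated assumptions: `ProteinRecord`, `from_spaced_block` and `to_fasta` are hypothetical names, and the FASTA header layout is a simplification of my own, not any database's official one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRecord:
    """Hypothetical canonical record: the 'container' every format maps onto."""
    accession: str  # stable ID, e.g. P01137
    name: str       # human-friendly name
    sequence: str   # contiguous one-letter amino-acid codes


def from_spaced_block(accession, name, block):
    """Reader: accept a sequence printed in space-separated groups over several lines."""
    return ProteinRecord(accession, name, "".join(block.split()))


def to_fasta(record, width=60):
    """Writer: emit FASTA text; the header layout here is an assumption."""
    seq = record.sequence
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([f">{record.accession} {record.name}"] + lines)


# First 30 residues of the sequence shown earlier in this post.
rec = from_spaced_block("P01137", "TGFB1", "MPPSGLRLLL LLLPLLWLLV\nLTPGRPAAGL")
fasta = to_fasta(rec, width=20)
```

With N formats, this design needs N readers and N writers rather than a converter for every pair. That is the whole pitch, and it only works if everyone agrees on what goes in the record.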
This is why the shipping container appeals to me: a common format would drive a lot of these woes beneath our notice. How did the shipping container come to dominate? There was a lot of wheeling and dealing by a fellow named Malcom McLean, a former trucker with a high-school diploma who eventually founded his own trucking company. McLean had to convince port owners to go with his standard, and that involved a lot of fighting with longshoremen's unions, whose livelihood was directly threatened by his new system. Eventually, through a series of trials and adoption by the US military, the standard hit a critical point, and there was no turning back. If you weren't shipping containers, you were on the outskirts of an industry.
There's a lot of room for Malcom McLeans in the realm of bioinformatics. Until one comes along, I'll be the guy staring wistfully as a cargo train disappears over the horizon.