Wednesday, January 25, 2012

All Hail Our New Database Overlords!

I recently had to do a presentation on biological pathways, or to be more accurate, my ignorance of biological pathways. For all of the insights I got from my audience, I might as well have presented in the form of an interpretive dance.
All your pathways are horse. Hear the blue.
After the awkward silence died down, I did manage to have a chat with some interested parties, and a few ugly truths came out.

If you got a headache contemplating the inconsistencies of protein nomenclature, you will be less than pleased to know that while we may not know which proteins are what, we're busily building interaction and pathway databases based on them. Remember PKACA, the protein I mentioned in an earlier blog post? Of course you do. You've been paying attention. You will recall a couple of databases that store information on PKACA (I keep reading it as 'pee-kaka'. Seriously, it's like there's a five-year-old romping around inside my head.) and they all have different information on it. When I wrote the earlier blog post, I didn't even blink when I linked to multiple databases, each with different ideas about PKACA. The number of interactions was anything from 1 (DIP) to 15 (MINT) to 17 (IntAct). Looking at metabolic pathways, PKACA is a member of 27 (KEGG), 5 (PID) or 54 (Reactome), depending on whom you ask. Is the entire bioinformatics community just trying to mess with my head? No: these databases fall prey to some real-world constraints. Often, they are curated by individuals. Those individuals have to sort through this monster:


PubMed. Vincenzo Natali could have a field day with this.
PubMed is a database that contains just about every respectable medically related publication in a searchable format, as far back as when 'The Practicability of transporting the Negro back to Africa.' was still a respectable article title. Curators have to sort through article after article involving individual proteins. PKACA shows up in 93 articles in PubMed's search (PubMed's search has its own issues, which I'll leave for another day; for now let's just say 93 is a really fuzzy number). A curator has to read those articles and extract interactions from them. Recall that protein names are ambiguous and that the curator can't possibly have knowledge of every one of the 20,242 sequences listed in UniProt for a species like Homo sapiens. It passes from Herculean to Sisyphean when each protein has a less than stable identity. It's like tracking a population of Jason Bournes.

This would not be the first time this has happened in a field of study. Take cartography, which has a history of inconsistencies that would be comical if people's lives hadn't depended on it. From cartography, we have the example of portolan charts, a powerful tool used by navigators from the 13th century until the Renaissance. These charts focused on compass headings instead of going through the whole tedious business of surveying coastlines, allowing sailors to plot a course between different landmarks. This is like that uncle who refuses to use street names and gives you directions like 'take a left when you see the tree beside a bigger tree that looks kind of like a goose'. When you look at a map today, you likely imagine that what you see will be mirrored almost exactly on another map, because both should have been derived from the same data. This is because you have been spoiled by more than a century of photography and accurate reproduction. Even so, you shouldn't believe everything you navigate by.

Google Earth or psilocybin?
Back before modern cartography, errors abounded. Portolan charts varied, their content dependent on the needs of their eventual owners. A Venetian merchant and a Dutch one would have different ports of interest, depending on where they had friendly trading partners. Omissions would be a matter of course. Individual portolan charts, as with many other maps, were often derived from charts before them, not necessarily from observations of the world they were meant to represent. Some charts were patchworks of other charts. This meant that errors were carried down from chart to chart. Strangely enough, the system worked well enough that some states kept their portolan charts as state secrets, a good indication of their value. The charts were safely in the realm of 'good enough'. What they weren't good enough for was the immensity of things like the Atlantic Ocean, which was kind of bigger than cartographers were used to (portolans were primarily used for inland bodies of water) and required much more detail to navigate successfully. Mapmaking picked up the ball and ran with it after this point, changing from attempts to document individual routes to attempts to model geographical reality as accurately as possible.

Which brings me back to the points outlined in 'Too many roads not taken', except the criticism isn't only of the direction research takes, but of the way we choose to document it. What we see now is the equivalent of the portolan chart: not every map is built upon the same data. Certain interactions and pathways are of importance to individual curators, even if only because they were on top of a pile of articles the curators had to read. Errors are propagated, even in this age, by legacy literature.
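If you wanted to put a number on how portolan-like a set of databases is, one rough option is to compare the partner lists they report for a single protein. Here is a minimal sketch in Python using the Jaccard index (shared entries divided by total distinct entries). Everything in it is hypothetical: the database names, the partner identifiers and the helper functions are placeholders of mine, not real records from DIP, MINT or IntAct. It also quietly assumes every database already uses the same identifier for the same protein, which is exactly the part we don't have.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Shared entries divided by total distinct entries (1.0 means identical)."""
    if not a and not b:
        return 1.0  # two empty lists agree completely
    return len(a & b) / len(a | b)


# Hypothetical partner lists for one protein; the names and IDs are made up.
databases = {
    "db_one": {"P1"},
    "db_two": {"P1", "P2", "P3"},
    "db_three": {"P2", "P3", "P4", "P5"},
}


def pairwise_overlap(dbs: dict) -> dict:
    """Jaccard score for every pair of databases, keyed by the sorted name pair."""
    return {
        (x, y): jaccard(dbs[x], dbs[y])
        for x, y in combinations(sorted(dbs), 2)
    }


overlaps = pairwise_overlap(databases)
for (x, y), score in overlaps.items():
    print(f"{x} vs {y}: {score:.2f}")
```

A score near 1 means two databases are drawing the same coastline; a score near 0 means they might as well be mapping different oceans.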

So where to now? The good news is that our knowledge is expanding, as are the tools at our disposal. Maybe what we'll end up seeing, as our ability to identify and characterize proteins, interactions and pathways improves, is a kind of convergence. Eventually, given the same data, these databases could end up looking pretty similar, the same way maps do nowadays. The larger scale studies suggested by Edwards et al. could be part of this solution. I have another complementary suggestion:

Nuke it from orbit.
PubMed is in a unique position. To my knowledge (jump in here and tell me I'm wrong on this) there is no other comprehensive database of medical publications with the scope and power of PubMed. PubMed could require annotations to be given for new articles referencing protein-protein interactions, and were they to request only one particular format, say RefSeq, it would provide a strong incentive for others to follow suit. Remember Malcom McLean? This could be the ruthless monopoly that would transform the industry, much the same way as Malcom transformed the global shipping industry with a few key deals. It would be ugly, and a lot of work hours would, to some degree, have been wasted on behalf of databases that would suddenly have to pledge their allegiance to their new NCBI overlords, but the results would make my job a whole lot easier. And that's what really matters, isn't it?
