parse, dammit!: 2012

Friday, June 22, 2012

Lost In Translation

Did you remember to take the second left at RefSeq Genomic Accession ID?

I complain ad nauseam about the various competing formats used to identify various biological molecules. Here's a ~~bug#&*% insane~~ helpful map from the people at bioDBnet.

Monday, June 18, 2012

I've been on a little bit of a communication kick lately, having against all reason actually enjoyed presenting at GLBIO. In truth, representing data visually is just a subset of the task of communication. Through visualization, we transfer data from a medium to our brains via the sense of sight. The peculiar custom of transferring data through standing in front of a crowd of strangers and talking for an hour or two is curious, but similar in many ways.

A few months ago one of my co-workers was kind enough to direct me to the work of Edward R. Tufte, a critic and theorist who specializes in visual data representation. I am ashamed to say that until she literally put a book of his in front of my nose, I hadn't seen any of his work. I have since remedied this.

Among his writings, there exists a well-distributed, crabby critique called 'The Cognitive Style of PowerPoint' that I've recently taken to heart. By taking it to heart, I mean that I've used its principles to utterly dismantle the PowerPoint presentation I put on at GLBIO 2012. I redid that presentation in-lab today with the following set of basic rules:

Only use slides for things that slides are good at.

If you ask Tufte what slides are good at, he would tell you 'very little'. He spends a great deal of time discussing the many ways in which PowerPoint slides, and particularly their templates, just plain suck at communication. Data is parceled out in 10 to 20 line chunks, forcing the presenter to unnaturally partition their narrative. PP graphs are low resolution, illustrating only the most blunt of points, and used in places where a simple 'and then X happened' would have sufficed from the presenter. For the most part, a standard template for a PP presentation serves less as a means of communication than as an assistance device to organize the presenter, albeit within the bizarre constraints of 'one slide per topic' regardless of the scope or complexity of that topic.

With all of this in mind, I looked at the slides I had, and I started to pick them apart. Immediately gone were organizational slides. Unless a slide could communicate a concept more effectively than speaking alone, it was scrapped too. Out of 27 original slides, I kept 8, and those were severely cut down and held almost nothing but graphics.

8 slides to represent an hour long portion of my talk.

The results of this cull?

Twofold.

The ugly side of this was that I had underestimated the usefulness of PowerPoint as an organizational device. Stripped of a hierarchy of bullet points, I realized at midnight before I was to present, I had lost my narrative. The mess of bullet points that I transferred to five pages of printout was a poor substitute for a rehearsed talk. I stumbled over what should have been a flowing exploration of graph visualization practices.

Though this resulted in a few uncomfortable moments, it drove home a point Tufte made often in his critique: giving presentations is hard. Having PowerPoint, at best, turns a poor presentation into a boring one. Instead of spending hours of my time culling my slides, I should have invested that time into practicing my narrative and ensuring that the more efficient form of data transfer, the 150 word-per-minute speech, was as well-oiled as it could get.

The second result of my re-formatting was that I saw how effective a good graphic could be. If anything saved the presentation, it was this slide:

This horrible, horrible slide.

This slide was an illustration of bad visualization practices. It comes from this article in Nature, which is, ironically, about improving graph visualization. I have seen many different versions of the same concept repeated over and over again in graph visualizations: people misunderstanding the purpose of graph visualization entirely. Graphs excel at communicating to the viewer information about relationships between objects. In this example, none of the graph edges are remotely traceable, the relationships between the implied complexes are indistinct and there is no evidence to suggest WHY any of the complexes should exist. There is NOTHING achieved by this graph that could not have been done more elegantly with tables of protein names. It is the Michael Bay of graphs.

"...and then the protein complex transforms into a DEATH JET
that shoots FIRE while Megan Fox SWEATS PROVOCATIVELY!"

The effect of showing this graph was immediate: the entire room groaned. They understood, very quickly, the concepts I had been stumbling across in my unpracticed narrative. Including the slide was to my advantage. It saved my ass.

So, two lessons learned.

In closing, I'd like to leave you with something that kept me sane while I experimented with dangerous presentation techniques. The following is the first part of a lecture given by Louie Simmons, a trainer and competitive power-lifter. I listened to this lecture in between editing sessions to remind me that a good presentation doesn't need ANY slides. It needs content and a presenter capable of communicating it.

Do not mock Louie. He will destroy you.

Friday, May 18, 2012

GLBIO 2012 Wrap-Up: I am not Tony Stark

Science conferences as I want them to be.

GLBIO 2012 has come and gone. I came, I ate the food my registration fees paid for, I presented. In case you're curious, the slides and relevant materials I presented are here.

And here is what I learned:

Tutorials need to go easy on figures and formulas. Audience retention was terrible in the tutorials I witnessed, and I suspect part of it was due to presenters translating a paper to PowerPoint too literally. Bioinformatics is a very broad discipline; it's likely that the only person in the audience who understands enough about your paper to follow a presentation on it is you.

Your presentation needs a punchline. Mine, sadly, didn't have one. I spent an hour discussing the ins and outs of a visualization workflow, only to have my example of that workflow... not be spectacular. Not that I think I needed a pyrotechnics display, but having a definite conclusion instead of just stopping dead and thanking your sponsors would have been the icing on the cake.

Pictured: Not the end of my tutorial.

Be open to criticism. I swear some people go to presentations just to be assholes. There's nothing worse than listening to two eggheads prattle on about the third variable someone chose in part 4 of their 18 part analysis for ten minutes. However, sometimes criticism comes from a place of shared interest, which means you meet the coolest people by listening to what they have to say.

Wednesday, May 9, 2012

Garbage In, Garbage Out

In less than a week, I'll be journeying to Ann Arbor for GLBIO to run a tutorial on graph visualization for biologists.

Right now, this is Slide 1:

I may have to dial down the haranguing I had originally intended to occupy 119 of my presentation's 120 minutes. It's not like biologists are dangerous, even in large numbers, but exposing them to my rants about data representation and standardization might drive them back into the wet-lab and away from the computer desk.

Wednesday, May 2, 2012

The Anthropocene

Here's a fantastic example of visualization done right. This video does some panning and dancing (plus the obligatory "we stand at the edge of a new era" speech) around a very well done map of earth's urban centers and transport networks.

More views, and the sources that were used to generate them, are available via Globaia here.

Hat tip: TechEBlog

Friday, April 20, 2012

Where did all the Meathead Stuff Go?

You've no doubt noticed that there is no longer any meathead content on 'parse, dammit!'. All the posts on long-dismantled Soviet super-athlete programs, bar napkin barbell physics and my awful deadlift are nowhere to be found.

There are reasons for this. I carefully studied the information that Google's awesome surveillance apparatus was producing regarding you lot (Kru Pete, you should be ashamed of yourself). I also re-read some of the posts and noted the somewhat jarring jumps between subject matter.

This has led me to the conclusion that though meatheads and scientists share a lot of common ground, the kind of science I get paid to do does not mix elegantly with the land of chalk and iron.

The theoretical center of this Venn diagram can
deadlift poundages that only exist on the complex plane.

So, I've decided, in the interests of coherence and simplicity, to move those posts to a brand new blog. Raise your hands if you know your 1 Rep Max in anything. Anyone with your hand raised can go here now:

Those of you remaining, I've got a new and interesting way to explore biological pathway databases that I'll post just as soon as we're assured that no one can steal it from us.

Thursday, March 29, 2012

The World Map of Metal?

A more literal map, illustrating the concentration of metal bands per capita. The source material is The Metal Archives, which is curated by a crack team of metalheads who ensure 'Bullet For My Valentine' never gets classified as heavy metal, ever. parse, dammit approved!

Unstartlingly, worldwide sales of white and black face paint look almost exactly the same.

Hat tip: Neatorama

Thursday, March 15, 2012

Standing on the Shoulders of Giants

Though technologically advanced, my Map of Rock is anemic in comparison to Pete Frame's Family of Rock.

I bow to You, O God,
Who appears in the astonishing form of Pete Frame

The curation task I took upon myself in mapping the various relationships between musicians is something that Pete has been doing for decades.

For the obligatory bit of math here, the format used by Pete, though called a Family Tree, is not technically a tree. Trees, according to computer science, are connected graphs that have no direction and are acyclic. Pete's trees are constrained by time and thus qualify as directed graphs: every edge points from an earlier lineup to a later one. To illustrate, the Ozzy Osbourne of 1985 can't loop back to join the Elf of 1967. Furthermore, this non-looping quality makes them directed acyclic graphs, a structure that has shown up regularly in my work.
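For the code-inclined, here is a minimal sketch of what "directed and acyclic" means in practice. It uses Kahn's algorithm, which is my choice for illustration, and the band names and years are placeholders rather than anything lifted from Pete's charts.

```python
from collections import defaultdict, deque


def is_dag(edges):
    """Return True if the directed graph given as (source, target) pairs has no cycle.

    Kahn's algorithm: keep removing nodes that have no incoming edges.
    If every node can be removed this way, the graph has no cycle.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1
        nodes.update((source, target))

    ready = deque(node for node in nodes if indegree[node] == 0)
    removed = 0
    while ready:
        node = ready.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return removed == len(nodes)


# Hypothetical lineup nodes labelled (band, year); not real Family of Rock data.
forward_in_time = [
    (("Band A", 1967), ("Band B", 1975)),
    (("Band B", 1975), ("Band C", 1985)),
    (("Band A", 1967), ("Band C", 1985)),
]
# Adding an edge that runs backwards in time closes a loop.
time_travel = forward_in_time + [(("Band C", 1985), ("Band A", 1967))]

print(is_dag(forward_in_time))  # True
print(is_dag(time_travel))      # False
```

As long as every edge runs forward in time, the check passes. One time-travelling edge and it fails.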

I'm both looking forward to, and dreading, trying to add all this data into my own map.

Hat tip to BoingBoing for pointing me to the Family of Rock. You've destroyed what little sleep I could have had over the next decade.

Tuesday, February 28, 2012

I am the God of Hellfire, and I give you...

Fire.
I'm led to believe this was period costume in 1968.

So it took some time, but I got my five comments, and the Gods of Rock have been appeased. So without further ado here's the location of the Map of Rock:

It's behind the freaky heads.

What I'm linking to is Many Eyes, which is an interesting visualization experiment put together by the people at IBM. Christophe Viau was kind enough to point me in its direction. The concept behind Many Eyes is simple: you upload data, and then everyone can have a go at visualizing it. If you've kept a log of the weight you've squatted for every workout in the past ten years (and who hasn't?), you can upload it, and the whole wide internet can turn it into a multitude of charts and diagrams. Here is the data behind my graph. Graphs, sadly, seem to have only one visualization, so you're pretty much stuck with the visualization I linked to above. I'm not going to complain too much though: until I can build a better engine for online visualizations, my readers (all five of them) can have a poke around my graph and see who's been playing with who.

If anyone has any submissions to add to the Map of Rock, or explanations as to how Deep Purple could have turned into the awful Whitesnake, please leave them in the comments.

Friday, February 24, 2012

Rocket Science

This is what passes for swag in the bioinformatics world:

Confused?

My co-worker arrived from a product showcase wearing the above. I appreciate the sentiment only because no-one I've talked to knows what bioinformatics is. My spellchecker still thinks it's not a word.

I disagree with the shirt, because if a bioinformatician designed a rocket, the bolts would all be in metric, the nuts in imperial, everything else in cubits, and the whole mess would probably turn into a pinwheel of flaming death if it ever got off the ground. We're working on that.

Friday, February 10, 2012

The Rumors of my Demise have been Greatly Exaggerated

I haven't abandoned my loyal readers (all three of you); there are more exciting things in the pipeline!

The Map of Rock is going to change format and illustrate some quirks of protein-protein interaction networks. If you haven't pitched in your two cents towards getting the Map of Rock released into the wild, you can still do so here. Get on that, by the way. It's expanded to Iced Earth, thanks to input from a knowledgeable coworker, a development I feel ill at ease with.

I'm counting on my readership to be either too lazy or too well informed
to go listen to Iced Earth.

Also, there's going to be some rumbling at Annex Fitness, the strength and conditioning group I help coach. It will include lots of grainy black and white photographs of the strongmen of yore, and a chunk of relatively obscure Toronto history.

In the meanwhile, I read things:

Why You Need Domain Knowledge

More Ricky Bruch

And something that fell into my lap a month late from the minds at Starting Strength.

Wednesday, January 25, 2012

Teaser: The Map of Rock

I've had a LOT of fun building the Map of Rock over the past month, and with a few pointers from Christophe Viau, I've managed to find a spot for it online, where everyone can have a gander. There's a problem though: I see a lot of people reading this blog, but no comments. I cannot make bricks without clay, people. So here's the deal: five comments. If five comments show up on this post, you can have the ultra-secret volcano-lair location of the Map of Rock. It's a steal, ladies and gentlemen.

Ten comments, and I'll go through the trouble of inserting all the missing umlauts.

All Hail Our New Database Overlords!

I recently had to do a presentation on biological pathways, or to be more accurate, my ignorance of biological pathways. For all of the insights I got from my audience, I might as well have presented in the form of an interpretive dance.

All your pathways are horse. Hear the blue.

After the awkward silence died down, I did manage to have a chat with some interested parties, and a few ugly truths came out.

If you got a headache contemplating the inconsistencies of protein nomenclature, you will be less than pleased to know that while we may not know which proteins are what, we're busily building interaction and pathway databases based on them. Remember PKACA, the protein I mentioned in an earlier blog post? Of course you do. You've been paying attention. You will recall a couple of databases that store information on PKACA (I keep reading it as 'pee-kaka'. Seriously, it's like there's a five-year-old romping around inside my head.) and they all have different information on it. When I wrote the earlier blog post, I didn't even blink when I linked to multiple databases, each with different ideas about PKACA. The number of interactions was anywhere from 1 (DIP) to 15 (MINT) or 17 (IntAct). Looking at metabolic pathways, PKACA is a member of either 27 (KEGG), 5 (PID) or 54 (Reactome). Is the entire bioinformatics community just trying to mess with my head? No: these databases fall prey to some real-world constraints. Often, they are curated by individuals. Those individuals have to sort through this monster:

PubMed. Vincenzo Natali could have a field day with this.

PubMed is a database which contains just about every respectable medically-related publication in a searchable format, as far back as when 'The Practicability of transporting the Negro back to Africa.' was still a respectable article title. Curators have to sort through article after article involving individual proteins. PKACA shows up in 93 articles in PubMed's search (PubMed's search has its own issues, which I'll leave for another day; for now let's just say 93 is a really fuzzy number). A curator has to read those articles and extract interactions from them. Recall that protein names are ambiguous and that the curator can't possibly have knowledge of every one of the 20,242 sequences listed in UniProt for a species like Homo sapiens. It passes from Herculean to Sisyphean when the definition for each protein has a less than stable identity. It's tracking a population of Jason Bournes.

This would not be the first time this has happened in a field of study. Take cartography, which has a history of inconsistencies that would be comical if people's lives hadn't depended on it. From cartography, we have the example of portolan charts, a powerful navigational tool used by nautical navigators from the 13th century until the Renaissance. These charts focused on compass headings instead of going through the whole tedious business of surveying coastlines, allowing sailors to plot a course between different landmarks. This is like that uncle that refuses to use street names and gives you directions like 'take a left when you see the tree beside a bigger tree that looks kind of like a goose'. When you look at a map today, you likely imagine that what you see will be mirrored almost exactly on another map, because both should have been derived from the same data. This is because you have been spoiled by more than a century of photography and accurate reproduction. Even so, you shouldn't believe everything you navigate by.

Google Earth or psilocybin?

Back before modern cartography, errors abounded. Portolan charts varied, their content dependent on the needs of their eventual owners. A Venetian merchant and a Dutch one would have different ports of interest, depending on where they had friendly trading partners. Omissions would be a matter of course. Individual portolan charts, as with many other maps, were often derived from charts before them, not necessarily from observations of the world they were meant to represent. Some charts were patchworks of other charts. This meant that errors were carried down from chart to chart. Strangely enough, the system worked, well enough so that some states kept their portolan charts as state secrets, a good indication of their value. The charts were safely in the realm of 'good enough'. What they weren't good enough for was the immensity of things like the Atlantic Ocean, which was kind of bigger than cartographers were used to (portolans were primarily used for inland bodies of water) and required much more detail to navigate successfully. Map making picked up the ball and ran with it after this point, changing from attempts to document individual routes to attempting to model geographical reality as accurately as possible.

Which brings me back to the points outlined in 'Too many roads not taken', except the criticism isn't only of the direction research takes, but of the way we choose to document it. What we see now is the equivalent of the portolan chart: every map is not necessarily built upon the same data. Certain interactions and pathways are of importance to individual curators, even if only by the reason that they were on top of a pile of articles they had to read. Errors are propagated, even in this age, by legacy literature.

So where to now? The good news is that our knowledge is expanding, as are the tools at our disposal. Maybe what we'll end up seeing, as our ability to identify and characterize proteins, interactions and pathways improves, is a kind of convergence. Eventually, given the same data, these databases could end up looking pretty similar, the same way maps do nowadays. The larger scale studies suggested by Edwards et al. could be part of this solution. I have another complementary suggestion:

Nuke it from orbit.

PubMed has a unique position. To my knowledge (jump in here and tell me I'm wrong on this) there is no other comprehensive database of medical publications with the scope and power of PubMed. PubMed could require annotations to be given for new articles referencing protein-protein interactions, and were they to request only one particular format, say RefSeq, it would provide a strong incentive for others to follow suit. Remember Malcom McLean? This could be the ruthless monopoly that would transform the industry, much the same way as Malcom transformed the global shipping industry with a few key deals. It would be ugly, and a lot of work hours would be wasted to some degree on behalf of databases that would suddenly have to pledge their allegiance to their new NCBI overlords, but the results would make my job a whole lot easier. And that's what really matters, isn't it?

Thursday, January 19, 2012

I Read Things

5 Common Mistakes People Make in the Name of Statistical Analysis

Original xkcd comic here.

Friday, January 6, 2012

The Rituals of the Cult of Cthulhu as Practiced by Bioinformaticians

BEHOLD!

If you don't immediately recognize what the above is, you're in for a treat: that is an overview of the ISO standard for intermodal containers. Intermodal containers are those big steel containers you see at every single respectable port in the world (and at the Port of Oakland. Save for Sonny Barger and Tim Armstrong, there's nothing respectable about Oakland). There's actually far more documentation, covering everything from the markings to recommendations on the construction of the twist locks that allow the containers to be mounted on trains, trucks or container ships (ASTM D5728-00, ISO 9897, ISO 14829, ISO 17363, ISO/PAS 17712, ISO 18185-2 and ISO/TS 10891). Unless you ship in tonnes or have Asperger's, all of this might be of little interest to you, but consider this: 90% of all non-bulk cargo comes to your grubby little paws via an intermodal container. That's the world's entire shipping network, containerized.

Christina Hendricks is here because without her presence,
no-one would have read that paragraph about intermodal containers.
Know your audience.

Before containerization, cargo would have to be loaded and unloaded from different modes of transport by crews of longshoremen for $5.86 a ton (probably sound drinking money in those days). Intermodal containers could be loaded and unloaded with a crane directly to another mode of transport or to storage for a measly $0.16 a ton. Not only that, those shifty longshoremen couldn't access the cargo within the crates, effectively reducing cargo theft. The stackability of crates also meant that specially built container ships could carry cargo more efficiently. Within decades after their introduction in 1955, intermodal containers had changed the landscape of shipping, removing significant trade barriers from Asia, conveniently sidestepping the issue of differing track gauges between railway lines, and building entire new cities around the needs of containerized ports.

A container ship at the Port of Miami.
There's a cocaine joke in there somewhere.
In fact, there's probably a lot of cocaine jokes in there.

The advantages containerization introduced to shipping are a prime example of standardization done right: a simple standard, adopted across multiple industries, to remove barriers.

Which naturally brings me to protein nomenclature. The basic unit of bioinformatics is our data, our best digital representation of the squishy underbelly of biology. We can choose to represent genes, proteins, molecules and other biological entities, plus any commonalities or interactions between them as consistent digital representations. Well, that's what we should be doing, and what we pretty much are, except there are a lot of cooks in this particular kitchen.

I swear before Odin, this guy works in my lab.

My personal pet peeve is the protein, one of the mechanisms behind EVERYTHING that happens in your body. We'll start with a simple one:

MPPSGLRLLL LLLPLLWLLV LTPGRPAAGL STCKTIDMEL VKRKRIEAIR GQILSKLRLA

SPPSQGEVPP GPLPEAVLAL YNSTRDRVAG ESAEPEPEPE ADYYAKEVTR VLMVETHNEI

YDKFKQSTHS IYMFFNTSEL REAVPEPVLL SRAELRLLRL KLKVEQHVEL YQKYSNNSWR

YLSNRLLAPS DSPEWLSFDV TGVVRQWLSR GGEIEGFRLS AHCSCDSRDN TLQVDINGFT

TGRRGDLATI HGMNRPFLLL MATPLERAQH LQSSRHRRAL DTNYCFSSTE KNCCVRQLYI

DFRKDLGWKW IHEPKGYHAN FCLGPCPYIW SLDTQYSKVL ALYNQHNPGA SAAPCCVPQA

LEPLPIVYYV GRKPKVEQLS NMIVRSCKCSL

That's the sequence of amino acids that defines the protein I'm referring to. There are a lot of these sequences. UniProt, my lab's protein sequence database of choice, lists 533,657. A sequence database is like a complicated phone-book. You know what a phone-book is, dammit: it's that slab of paper that props up your computer monitor. This particular phone-book lets you link up protein names with their sequences and an array of other associated data, like publications that reference the protein and rushing yards. Using this sequence to identify the protein in question is ideal for computers, but less so for a researcher, who wants to be able to refer to that sequence in a sentence. So researchers refer to this sequence by its 'name', which is Transforming growth factor beta-1. Here is where things begin to get unhinged, in a 'Who's on First' kind of way. For 'Transforming growth factor beta-1' I get 136 matches in UniProt. Why 136 matches? Partly because the same name is used across multiple species to describe similar proteins. There's a protein with the same name in human, mouse, carp and a bunch of others. OK, so maybe restricting our search to human proteins is what we need: 42 results, but only one of them shares the same name. Wonderful. We even have a nice, unique ID to refer to it by: P01137. We've gone from Joe Black, to Joe Black from 233 Whiskey Ave. New Orleans. Now do this same procedure hundreds more times for a list of proteins, which can number in the thousands, and you will get an idea of why biologists and computer scientists get at odds with each other. A biologist writing a paper describing his favorite protein will refer to it by his favorite name, not by its unique identifier. The favorite name for P01137 could be any of the following: Transforming growth factor beta-1, TGF-beta-1, TGFB1, etc.
That guy you knew in high school who you only knew as 'Smurf', who everyone else called 'Murph' or 'Steve': try finding them in a phone-book with only that information. When that biologist gives us an Excel spreadsheet full of their favorite names instead of stable IDs, we're getting dozens of Smurfs, Wurzels and Dweezils. It's the kind of thing that results in night terrors.

Night terrors and awesome Metallica source material.
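To make the Smurf problem concrete, here is a rough sketch of the lookup a lab ends up writing. P01137 and its three names are the ones from this post. The second record, "MADE_UP_1", is a hypothetical placeholder I've added purely to show a name collision, and the function names are my own rather than anything UniProt provides.

```python
from collections import defaultdict


def normalize(name):
    """Lower-case a name and drop everything that is not a letter or digit."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def build_alias_index(records):
    """Map each normalized name to the set of accessions that use it."""
    index = defaultdict(set)
    for accession, names in records.items():
        for name in names:
            index[normalize(name)].add(accession)
    return index


def resolve(name, index):
    """Return (status, accessions); status is 'unique', 'ambiguous' or 'unknown'."""
    hits = sorted(index.get(normalize(name), ()))
    if not hits:
        return "unknown", hits
    return ("unique" if len(hits) == 1 else "ambiguous"), hits


# P01137 and its names come from the post; MADE_UP_1 is a hypothetical record.
records = {
    "P01137": ["Transforming growth factor beta-1", "TGF-beta-1", "TGFB1"],
    "MADE_UP_1": ["Transforming growth factor beta-1"],
}
index = build_alias_index(records)

print(resolve("tgfb1", index))       # ('unique', ['P01137'])
print(resolve("TGF-Beta-1", index))  # ('unique', ['P01137'])
print(resolve("Transforming growth factor beta-1", index))  # ('ambiguous', ['MADE_UP_1', 'P01137'])
print(resolve("Smurf", index))       # ('unknown', [])
```

The part that matters is the three-way answer. A name that resolves to one ID is fine, a name that resolves to none is at least honest, and a name that resolves to several is the one that quietly ruins your afternoon.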

So, with our biologist now aware of our plight and referring to his data by a unique UniProt ID, we should be out of the woods, right? No. Not by a long shot. UniProt is constantly changing: IDs get retired, new IDs get added, and thousands of records are revised yearly. Your phone book from 1982 hardly helps you look up your friend from Kansas in 2012. Your UniProt ID can only be said to be certain for the snapshot in time in which you downloaded it. In addition, UniProt is not the only sequence repository in the world. There's NCBI's Entrez and RefSeq, Ensembl, and probably half a dozen smaller databases lurking in the swirling, Lovecraftian universe of bioinformatics. Each has its own versioning and documentation. What you're doing now is combining numbers from two different phone-books, one in the English alphabet and one in Cyrillic, with an outdated Old East Slavic-English dictionary between them. Without keeping a keen eye on when you downloaded what from whom, your data will start looking like that shitty Cadillac Johnny Cash built.

Johnny Cash, ladies and gentlemen.
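Keeping an eye on "when you downloaded what from whom" can be as dull as stapling a label to every list of IDs. The sketch below is one way to do it, not a description of any existing tool: the class names, the release labels and the dates are all placeholders I made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Snapshot:
    """Where a batch of identifiers came from, and when."""
    source: str       # e.g. "UniProt"
    release: str      # release label as published by the source
    downloaded: date  # the day you actually fetched it


@dataclass(frozen=True)
class IdList:
    """A set of IDs that remembers which snapshot it belongs to."""
    snapshot: Snapshot
    ids: frozenset


def merge(a, b):
    """Union two ID lists, but only if they come from the same snapshot."""
    if a.snapshot != b.snapshot:
        raise ValueError(
            f"refusing to mix {a.snapshot} with {b.snapshot}; "
            "map one onto the other first"
        )
    return IdList(a.snapshot, a.ids | b.ids)


# Release labels and dates are placeholders, not real release names.
old = Snapshot("UniProt", "release-A", date(2011, 6, 1))
new = Snapshot("UniProt", "release-B", date(2012, 1, 6))

mine = IdList(new, frozenset({"P01137"}))
yours = IdList(old, frozenset({"P01137"}))

try:
    merge(mine, yours)
except ValueError as err:
    print("blocked:", err)
```

It looks fussy, but the fussiness is the point: the same accession from two different releases is treated as two different claims until someone explicitly reconciles them.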

All of this is compounded by common problems like this: even with your version numbers hammered down, when you can translate from one database to another, your mapping is not guaranteed to be one-to-one. If you ask Ensembl for UniProt IDs and UniProt for Ensembl IDs for their respective proteins, you won't necessarily get the same answers. There are, after all, a lot of Browns in the phone-book. To top it all off, when you go to buy a physical product, say a DNA Microarray from Affymetrix, you'll find that Affymetrix has its own IDs, derived from multiple databases, which you'll have no choice but to translate to, either with the same mystical translations mentioned above, or with a dedicated application such as BLAST.
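Here is a small sketch of how you might catch the not-quite-one-to-one cases before they bite. It assumes you already have each database's mapping table loaded as a dictionary. Every identifier below is invented, and the function is my own illustration rather than part of any Ensembl or UniProt client.

```python
def round_trip_mismatches(forward, backward):
    """Return source IDs that do not map back to exactly themselves.

    forward:  dict of source ID -> set of target IDs (database A's table)
    backward: dict of target ID -> set of source IDs (database B's table)
    """
    mismatches = {}
    for source_id, targets in forward.items():
        returned = set()
        for target_id in targets:
            returned |= backward.get(target_id, set())
        if returned != {source_id}:
            mismatches[source_id] = returned
    return mismatches


# All identifiers here are made up for illustration.
a_to_b = {
    "A1": {"B1"},
    "A2": {"B2", "B3"},
    "A3": {"B4"},
}
b_to_a = {
    "B1": {"A1"},
    "B2": {"A2"},
    "B3": {"A2", "A9"},
    # B4 has no entry going back the other way.
}

print(round_trip_mismatches(a_to_b, b_to_a))
```

In this toy data, A1 survives the trip there and back. A2 comes home with a stranger (A9), and A3 doesn't come home at all. Those are the rows to look at by hand before anyone builds a figure on them.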

All of this gives us the same kind of extraneous overhead that longshoremen were responsible for in the domain of shipping. You get data from a database, you try to translate it into the format your lab uses, you end up with bizarre results from the original database's translation tables, and then you sacrifice a goat to the Old Gods, fire up BLAST and hope for the best. Then, you do it all over again when someone else asks for your data in their favorite format.

This is why the shipping container appeals to me: a common format would drive a lot of these woes beneath our notice. How did the shipping container come to dominate? There was a lot of wheeling and dealing by a fellow named Malcom McLean, a former trucker with a high-school diploma who eventually founded his own trucking company. McLean had to convince port owners to go with his standard, and that involved a lot of fighting with longshoremen's unions, whose livelihood was directly threatened by his new system. Eventually, through a series of trials and adoption by the US military, the standard hit a critical point, and there was no turning back. If you weren't shipping containers, you were on the outskirts of an industry.

There's a lot of room for Malcom McLeans in the realm of bioinformatics. Until one comes along, I'll be the guy staring wistfully as a cargo train disappears in the horizon.
