parse, dammit!: December 2011

Tuesday, December 20, 2011

I Read Things

Some things I've been reading:

Backreaction has some fun with the h-index

DataVis lists some bright ideas

xkcd attempts to make visual sense of our money

Friday, December 16, 2011

While bemoaning Led Zeppelin's location on the Map of Rock in my lab, I made note of the fact that where you start will greatly influence what your map ends up looking like. (The map, by the way, has expanded to 150 edges. This additional exploration resulted in the loss of considerable sleep, the depletion of valuable rum stores and the discovery of the Bakerloo Blues Line.) We want to think that whatever's in the center of a picture has a great deal of importance, which is why we had beautiful, detailed and utterly wrong maps like this:

You just don't get factual inaccuracy this pretty anymore.

That is a geocentric model of the universe by Bartolomeu Velho. You know it's classy because it uses text figures instead of lining figures. I have a hard-on for text figures. They do to documents the same thing a good dress does to an escort. I'm going to take an educated guess that the heliocentric model is pretty widely accepted by my readers. Let's not completely discount the geocentric model though. Its picture of the earth itself is pretty much the same as the new one (if you ignore its lack of Australia). In fact, accompanied by some pretty wild orbital gymnastics, the geocentric model would even tell you where our celestial neighbors were in the night sky. Where the two differ deeply is in their idea of where we sit relative to the rest of the universe.

You are here.

Which brings us back to protein interactions. If you're interested in how a protein behaves, one standard approach is to look it up in a database and find out what it interacts with. Take PKACA for example. According to IntAct it has 17 interactors. According to MINT, 15. According to DIP, 1. The word on the street says that there are 1532 of these if you want to amalgamate all the databases supporting the PSICQUIC interface. Thanks, by the way, for generating countless competing standards for what constitutes a protein interaction. I really needed to be that insecure about the most basic level of my research.

So that's one way of looking at a protein. The issue I have with this view is that the protein then becomes a prima donna; everything you see revolves around it.

Another way is to look at metabolic pathways. These reach far deeper into the interactome than just a single degree, but more importantly, they represent a functional unit of the cell. If you're looking at how PKACA actually does stuff, you can see it here, here and... in fact, here's the whole list of 27 places it appears according to KEGG and another 5 from the Pathway Interaction Database. It's no longer the center of its own little world; it's situated.

All snarky quips about competing standards aside, which view do you think represents more usable information about the protein? My money is riding on the second representation right now, which is a major shift in the way I've been designing my products so far. Stay tuned for a prototype, and maybe a downloadable version of the Map of Rock.

Thursday, December 15, 2011

News You Can Abuse

Earlier this fall, I had the pleasure of seeing this product in beta, and now it's officially live:

http://www.pifiq.com/

For those interested, pifiq is a suite of data analysis tools hosted on a secure cloud distributed in the many hardened bunkers and seaborne data centers that I'm sure Pathway Communications either owns or is buying.

A basic account (10 datasets) is free, and the developers crave feedback, so go forth and exploit it towards your own nefarious ends. Go!

Wednesday, December 14, 2011

On Graph Visualization or 'Why are you so ANGRY all the time?'

The problem with carrying a sledgehammer is that regardless of whether or not something needs hammering, you want to have a swing at it. Every door, cinder block and coconut looks like its situation could be improved with your sledgehammer.

I build graph visualization software. Not the graphs you produce with Excel and bore people with in PowerPoint, but the kind that represent relationships between entities. This is graph theory, a branch of mathematics that concerns lots of dots, connected by lots of lines. More formally, I refer to them as nodes and edges. I'm exactly the kind of person that would get annoyed by straying from that.

Researchers use graph theory to map and model all kinds of relationships, from my domain of bioinformatics, to infrastructure, to economics.

To understand why graph theory is so attractive, we can look at something I hold dear, which is music. Musicians are a promiscuous lot; a single musician may play with many different groups throughout his/her life. Dave Grohl is an excellent contemporary example. He has played in Nirvana, Probot, Foo Fighters, Killing Joke, Them Crooked Vultures, Queens of the Stone Age and has likely whored himself around to countless other side projects and collaborations. As an exercise, I slapped together a quick list of musicians and bands that have collaborated. That list looks something like this:

...

Nick Simper - Johnny Kidd & The Pirates

Geezer Butler - Black Sabbath

Ozzy Osbourne - Black Sabbath

Tony Iommi - Black Sabbath

Tony Iommi - Jethro Tull

Lemmy Kilmister - Motorhead

Lemmy Kilmister - Hawkwind

Mikkey Dee - Motorhead

Mikkey Dee - King Diamond

Lemmy Kilmister - Ozzy Osbourne

Lemmy Kilmister - Probot

Josh Homme - Them Crooked Vultures

Josh Homme - Kyuss

Josh Homme - Queens of the Stone Age

Dave Grohl - Them Crooked Vultures

John Paul Jones - Them Crooked Vultures

John Paul Jones - Led Zeppelin

...

... This goes on to take up 92 lines. It's visually uninteresting, but it's helpful. If I were listening to an album, I could consult the list and find out what other bands the musicians have played with. If I like the album, which is often the case with this particular list, chances are that their other work will be of similar quality. (There are some notable exceptions. I'm looking at you, Them Crooked Vultures.)

Enter the graph.

A labelled portion of the previous list, now in shiny graph format.

All of the text boxes are our nodes, and our edges are instances where the two parties have played with each other. Note that this doesn't make the earlier index any less informative. It contains the same data, but now it's arranged in a manner that facilitates the visual identification of paths from node to node. I can very quickly trace my way from Metallica to Journey and eventually from Journey to Nailbomb. Paths like that amuse me to no end.
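To make the list-to-graph step concrete, here is a minimal sketch in Python. It is not the software used to draw the figures. It only parses the excerpt of the list shown above into an adjacency map and traces a path with a breadth-first search. The names `EDGE_LIST`, `build_graph` and `shortest_path` are made up for illustration.

```python
from collections import defaultdict, deque

# The excerpt of the collaboration list shown above, one "A - B" pair per line.
EDGE_LIST = """\
Nick Simper - Johnny Kidd & The Pirates
Geezer Butler - Black Sabbath
Ozzy Osbourne - Black Sabbath
Tony Iommi - Black Sabbath
Tony Iommi - Jethro Tull
Lemmy Kilmister - Motorhead
Lemmy Kilmister - Hawkwind
Mikkey Dee - Motorhead
Mikkey Dee - King Diamond
Lemmy Kilmister - Ozzy Osbourne
Lemmy Kilmister - Probot
Josh Homme - Them Crooked Vultures
Josh Homme - Kyuss
Josh Homme - Queens of the Stone Age
Dave Grohl - Them Crooked Vultures
John Paul Jones - Them Crooked Vultures
John Paul Jones - Led Zeppelin
"""


def build_graph(text):
    """Parse 'A - B' lines into an undirected adjacency map."""
    graph = defaultdict(set)
    for line in text.splitlines():
        if " - " not in line:
            continue  # skip blanks and anything that is not an edge
        a, b = (part.strip() for part in line.split(" - ", 1))
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


graph = build_graph(EDGE_LIST)
path = shortest_path(graph, "Jethro Tull", "King Diamond")
print(" -> ".join(path))
```

Same data as the list, just indexed by node instead of by line. Note that within this excerpt alone there is no path from Led Zeppelin to Motorhead; the connection only shows up in the full 92-line list.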

Zooming out, the whole thing, 76 musicians and bands and 92 collaborations, looks like this:

This also represents half a century of trashed hotel rooms.

So now, this data has a shape. Seductive, isn't it? From this view, certain features become readily apparent. Some nodes are more densely connected than others. There are areas of the graph where different nodes share partners to a large degree. If you try to follow the paths between different nodes, you'll begin to notice that you pass over the same spots over and over again. There is an urge to ascribe importance to these features because they are recognizable. You can even map some of these features to visual attributes, such as node size.

Same graph, with node involvement in shortest paths and degree mapped to node width and height.
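For the curious, the two measures behind that picture can be sketched in a few lines of Python. I'm assuming here that "involvement in shortest paths" means betweenness centrality, computed with Brandes' algorithm on an unweighted graph; the visualization software may well do it differently. The `EDGES` list is only the excerpt shown earlier, so the numbers will not match the full 92-edge graph.

```python
from collections import defaultdict, deque

# Excerpt of the collaboration list, as (musician, band-or-musician) pairs.
EDGES = [
    ("Nick Simper", "Johnny Kidd & The Pirates"),
    ("Geezer Butler", "Black Sabbath"), ("Ozzy Osbourne", "Black Sabbath"),
    ("Tony Iommi", "Black Sabbath"), ("Tony Iommi", "Jethro Tull"),
    ("Lemmy Kilmister", "Motorhead"), ("Lemmy Kilmister", "Hawkwind"),
    ("Mikkey Dee", "Motorhead"), ("Mikkey Dee", "King Diamond"),
    ("Lemmy Kilmister", "Ozzy Osbourne"), ("Lemmy Kilmister", "Probot"),
    ("Josh Homme", "Them Crooked Vultures"), ("Josh Homme", "Kyuss"),
    ("Josh Homme", "Queens of the Stone Age"),
    ("Dave Grohl", "Them Crooked Vultures"),
    ("John Paul Jones", "Them Crooked Vultures"),
    ("John Paul Jones", "Led Zeppelin"),
]


def adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes) for an unweighted,
    undirected graph: how many shortest paths pass through each node."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Breadth-first search from source, counting shortest paths (sigma).
        order = []
        parents = defaultdict(list)
        sigma = dict.fromkeys(graph, 0)
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    parents[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in parents[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every unordered pair was visited from both ends, so halve the totals.
    return {node: value / 2 for node, value in score.items()}


graph = adjacency(EDGES)
degree = {node: len(neighbours) for node, neighbours in graph.items()}
between = betweenness(graph)

# These two numbers are what would be mapped to node width and height.
for node in sorted(graph, key=between.get, reverse=True)[:3]:
    print(f"{node}: degree={degree[node]}, betweenness={between[node]:.0f}")
```

Even on this small excerpt, Lemmy comes out on top on both counts, and Led Zeppelin scores zero for betweenness because it hangs off the graph by a single edge.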

Now we're beginning to see some very attractive nodes. Let's look at three of them:

Top 3 standouts from our analysis. You could do worse.

This looks pretty solid. Ozzy and Lemmy alone could give you a shelf full of records, and Probot, being Dave Grohl's personal wish-fulfillment album comprised of metal stars from his childhood, stands out as a significant kind of thing to own. We're cool with that.

One problem with this is that graph structure may not necessarily denote worth. For example, according to this graph, there's a lesser known blues cover band occupying the less-fashionable south end of this graph, called 'Led Zeppelin'.

If I could illustrate Led Zeppelin and Ozzy as a giant cherub and a giant bat respectively, I would have.

How on earth could this graph treat the Zep so shabbily? There are a few reasons. For one, there's no reason to assume that deep connectivity or involvement in this graph has anything to do with musical talent or popularity. If I had gone through the trouble of tracking down record sales, I could tell you that Ozzy has an estimated 100 million records sold, while Led Zeppelin has anywhere from 200-300 million. Map that data onto node size and the graph would look very different, but then Madonna would show up the same size as Zeppelin, and we'd all quickly realize that sales don't correlate with talent either.

Speaking of Madonna, why isn't she on this graph? Why aren't the Beatles? It has a lot to do with my research methods. I started this graph with Lemmy and Motorhead, and worked out from there, consulting Wikipedia as I went. My interest in the soulless void that is pop music is limited; thus any research in unfavorable directions was actively ignored. Seems like a pretty biased way to research anything, doesn't it? No way would any self-respecting researcher do such a thing in the world of science. Except when they do (Edwards et al. 2011).
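That research method is basically a seeded crawl with a taste filter, and it's easy to show how much the seed and the filter decide what ends up on the map. The sketch below uses a completely made-up toy world (`WORLD`, with placeholder names, not real musicians) and a hypothetical `crawl` function; it illustrates the bias, it is not the process I actually followed.

```python
from collections import deque

# Hypothetical toy "world" of who is linked to whom. Placeholder data only.
WORLD = {
    "metal_a": ["metal_b", "rock_a"],
    "metal_b": ["metal_a", "metal_c"],
    "metal_c": ["metal_b"],
    "rock_a": ["metal_a", "pop_a"],
    "pop_a": ["rock_a", "pop_b"],
    "pop_b": ["pop_a"],
}


def crawl(world, seed, max_edges, interesting=lambda name: True):
    """Breadth-first crawl outward from seed. Stops after max_edges edges
    and never records or follows a node that fails the interesting() test."""
    edges = set()
    seen = {seed}
    queue = deque([seed])
    while queue and len(edges) < max_edges:
        node = queue.popleft()
        for neighbour in world[node]:
            if not interesting(neighbour):
                continue  # research in unfavorable directions is ignored
            edges.add(frozenset((node, neighbour)))
            if len(edges) >= max_edges:
                break
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return edges


def nodes_of(edges):
    """All nodes that appear in a set of edges."""
    return set().union(*edges)


# Same world, two researchers: one starts in metal and skips pop,
# the other starts in pop and runs out of patience after three edges.
metal_map = crawl(WORLD, "metal_a", max_edges=10,
                  interesting=lambda name: not name.startswith("pop"))
pop_map = crawl(WORLD, "pop_b", max_edges=3)

print(sorted(nodes_of(metal_map)))
print(sorted(nodes_of(pop_map)))
```

Both maps are accurate as far as they go, and neither one contains the other's favorite corner of the world.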

Research within the confines of music and sales itself isn't the only thing we'd need for the whole picture. Geography, the ugly beast, ensured a lot of the early relationships on this graph. The fact that the members of Hawkwind, Motörhead, the Yardbirds and Cream were all located around London during the 60s and 70s greatly affects their connectivity and clustering. Or maybe the other way around, as artists would move closer to a city that could support both their musical styles and their drug habits. In fact, your efforts to build a better graph could go as far as tracking down who shared the same drug dealer in the 60s, leading to the discovery of a vitally important rundown flat situated in south-west London. Deciding where to draw the line regarding supporting data is maddening in this regard.

All of this aside, I love graphs. Like the sledgehammer, if I could, I would apply them to doors, cinder blocks and coconuts. The idea that they map things, as opposed to modelling them, is foremost in my mind though, as was cartoonishly illustrated above. It's powerfully intoxicating to imagine that what you're looking at is an actual thing, instead of our limited representation of that thing.

Return of the Living Something Something

Thus ends the long period of time in which I considered blogging a self-serving soapbox. I'm back to blogging after being inspired by a sense of impending doom and some solutions put out by the ever-enthusiastic John Robb in this post: Personal Branding.

To survive not just the impending Zombie Apocalypse, but also the various slings and arrows of outrageous fortune, I should have an online identity, a living document that outlines who it is I am, and what I have to offer. So people who need me can find me. To make me a commodity. This blog is still a soapbox, but it's hopefully dwarfed by the speakeasy it has out back.

So, very briefly, I'm going to try to outline some ground rules.

This will be about data. Then, it will be about interpreting that data. Then it will be about the data again.

I'm going to be operating in a few domains: data representation, strength and conditioning, and resilient networks, with some combatives thrown in. There's a lot these domains share, and I have many questions...
