Wednesday, November 21, 2012

A data hoarder's delight

With big data, I sometimes feel like the character Davies in the masterful spy novel, "The Riddle of the Sands" by Erskine Childers. "We laughed uncomfortably, and Davies compassed a wonderful German phrase to the effect that 'it might come in useful.'"

Some of the "big data" approaches that I have seen are a bit like that. We keep stuff because we can, and because it "might come in useful." For sure there are some very potent use cases. Forbes, in this piece, describes how knowledge about the customer can drive predictive analytics. And valuable it is.

However, the other compelling use cases are a bit harder to find. We can certainly do useful analysis of log files, click-through rates, etc., depending on what has been shown to a customer or "tire kicker." But beyond that the cases are harder to come by. That is, to some extent, why much of the focus has been on the technology and the technology vendors.

There is a pretty significant dilemma here, though. If we wait to capture the data until we know what we need, then we will have to wait again until we have collected enough of it. If we capture all the data we can as it is flying through our systems, and we don't yet know how we might use it, we need to make sure it is kept in some original form and apply schema at the time of use. That makes us quake in our collective boots if we are database designers, DBAs, etc. We can't ensure the integrity. Things change. Where are the constraints?

In my house that is akin to me throwing things I don't know what to do with onto a pile (pieces of mail, camping gear, old garden tools, empty paint cans, bicycle pumps... you get the idea). So when Madame asks for something - say a suitable weight for holding the grill cover down - I can say, "Aha, I have just the right thing. Here, use this old, half-full can of paint." A properly ordered household might have had the paint arranged in a tidy grouping. But actually that primary classification inhibits out-of-the-box use.

Similarly with data. I wonder what the telephone country code for the UK is... Oh yes, my sister lives there; I can look up her number and find it out. Not exactly why I threw her number onto the data pile, but handy nonetheless.
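To make "apply schema at the time of use" concrete, here is a minimal Python sketch of the country-code example. Everything in it is an assumption made for illustration: the records are invented, the names `raw_pile`, `phone_numbers` and `country_code_for` are hypothetical, and the pattern assumes numbers are written with a `+`, a country code, and then a space.

```python
import re

# The "data pile": records kept in whatever form they arrived, no schema imposed.
# All entries are invented for illustration.
raw_pile = [
    "sister: +44 20 7946 0958",
    "grill cover receipt, 2011",
    "dentist: +1 617 555 0123",
]

# The schema lives here, at the point of use, not at the point of capture.
# Assumption: a phone record looks like "label: +<country code> <rest>".
PHONE = re.compile(r"^(?P<label>[^:]+):\s*\+(?P<cc>\d{1,3})\s+(?P<rest>[\d ]+)$")


def phone_numbers(pile):
    """Yield (label, country_code, rest) for records that fit the phone schema."""
    for record in pile:
        match = PHONE.match(record)
        if match:  # anything that does not fit is skipped, not rejected
            yield match["label"], match["cc"], match["rest"]


def country_code_for(pile, label):
    """Answer a question the record was never collected to answer."""
    for found_label, cc, _ in phone_numbers(pile):
        if found_label == label:
            return cc
    return None


print(country_code_for(raw_pile, "sister"))
```

The receipt stays in the pile untouched. It simply does not match this particular reading of the data, which is exactly the integrity worry a DBA would raise: nothing stopped it from getting in.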

So, driven by cheap storage and cheap processing, we can suddenly start to manage big piles of data instead of managing everything neatly.

This thinking model started a while back with the advent of tagging models for email. Think of Outlook vs. Gmail as thinking models. If I have organized my emails the Outlook folders way, then I have to know roughly where to look for the mail - all very well if I am accessing by some primary classification path, but not so handy when asked to provide all the documents containing a word for a legal case. It turns out, at least for me, that I prefer a tag-based model - a flat space like Gmail where I use search as my primary approach, as opposed to a categorization model where I go to an organized set of things.
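The difference between the two thinking models can be sketched in a few lines of Python. This is a toy illustration only, not how Outlook or Gmail actually work; the messages, folder names, tags, and function names are all hypothetical.

```python
# Hypothetical messages. "folder" is the single primary classification;
# "tags" is a flat, overlapping set of labels.
messages = [
    {"folder": "Projects/Alpha", "tags": {"alpha", "contract"},
     "body": "Draft contract attached."},
    {"folder": "Personal", "tags": {"family"},
     "body": "Dinner on Sunday?"},
    {"folder": "Vendors", "tags": {"contract", "billing"},
     "body": "Signed contract and invoice."},
]


def in_folder(msgs, folder):
    """Folder model: you must already know where to look."""
    return [m for m in msgs if m["folder"] == folder]


def with_tag(msgs, tag):
    """Tag model: one message can sit under many labels at once."""
    return [m for m in msgs if tag in m["tags"]]


def search(msgs, word):
    """Flat search: scan everything, wherever it was filed."""
    word = word.lower()
    return [m for m in msgs if word in m["body"].lower()]


# The legal-case request: every message containing a given word.
for message in search(messages, "contract"):
    print(message["folder"], "|", message["body"])
```

With folders alone, answering the legal-case question means opening every folder in turn. With the flat space, it is one pass over the pile.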

There isn't much "big" in these data examples. It is really about the new ways we have to organize and manage the data we collect - and accepting that we can collect more of it. Possibly even collecting every state of every data object, every transaction that changed that state, etc. Oh, and perhaps the update-dominant model of data management that we see today will be replaced with something less destructive.
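As a rough sketch of what "less destructive" might look like, the Python below contrasts an update-in-place store with an append-only one that keeps every state. The class names, keys, and addresses are hypothetical, and this is one possible reading of the idea, not a description of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class UpdateInPlace:
    """Update-dominant model: each write destroys the previous value."""
    state: dict = field(default_factory=dict)

    def set(self, key, value):
        self.state[key] = value  # the old value is gone for good


@dataclass
class AppendOnly:
    """Less destructive model: every change is kept as an event."""
    events: list = field(default_factory=list)

    def set(self, key, value):
        self.events.append((key, value))  # nothing is overwritten

    def current(self, key):
        """Latest value for key, found by reading the log backwards."""
        for k, v in reversed(self.events):
            if k == key:
                return v
        return None

    def history(self, key):
        """Every state the key has ever had, oldest first."""
        return [v for k, v in self.events if k == key]


store = AppendOnly()
store.set("address", "12 Elm St")
store.set("address", "9 Oak Ave")
print(store.current("address"))
print(store.history("address"))
```

The append-only version costs more storage and makes "what is the value now" a little more work, which is the trade that cheap storage and cheap processing make easier to accept.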