Friday, June 8, 2012

In stream and out of band

Big data seems to be popping up everywhere. The focus tends to be on the data, the engines, and all the shiny toys for doing the analysis. However, the tricky part is often getting hold of the slippery stuff in the first place.
In the cryptography world, one of the most useful clues that something big is about to "go down" is traffic analysis. Spikes in traffic activity provide signals to the monitoring systems that further analysis is required. There is useful information in changes in rate of signals over and above the information that may be contained in the message itself.
Deducing information just from the traffic analysis is an imprecise art, but knowing about changes in volume and frequency can help analysts decide whether they should attempt to decrypt the actual messages.
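The volume-and-frequency idea can be made concrete with a small sketch (the names here are hypothetical, not taken from any particular monitoring system): flag a traffic window whose message count jumps well above the running average of earlier windows, without ever looking at message content.

```python
class SpikeDetector:
    """Flag a traffic window whose message count exceeds `threshold`
    times the running average of the windows seen so far."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.history = []  # message counts of earlier windows

    def observe_window(self, count):
        """Record one window's count; return True if it looks like a spike."""
        spike = bool(self.history) and count > self.threshold * (
            sum(self.history) / len(self.history)
        )
        self.history.append(count)
        return spike
```

Feeding it steady counts like 10, 11, 9, 10 raises no alarm; a sudden 50 in the next window does - a signal worth acting on before anyone has decrypted a single message.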
In our systems, this kind of signals intelligence is useful too. We see it in A/B testing. We see it in volume prediction for capacity planning. In other words, if we ignore the traffic data we lose a valuable source of insight into how the business and the technology environments are working.
Much of "big data" is predicated on getting hands (well, machines) on this rich vein of data and performing some detailed analysis.
However there are some challenges:
  • Getting access to it
  • Analyzing it quickly enough, but without impacting its primary purpose
  • Making sense of it - often looking for quite weak signals
That's where the notion of in-stream and out-of-band comes from. You want to grab the information as it flies by (on what, you may ask?), and yet not disturb its throughput rate - or at least not much. The analysis might be quite detailed and time consuming, but the transaction must be allowed to continue normally.
In SOA environments (especially those where web services are used), all of the necessary information is in the message body, so intercepts are straightforward.
Where there is file transfer (e.g. using S/FTP) the situation is trickier, because there are often no good intercept points.
Continuing the cryptography example, traffic intercepts allow the messages to be captured while they flow through apparently undisturbed. Once captured, the frequency/volume is immediately apparent, but the analysis of content may take a while. The frequency/volume data are "in stream"; the actual analysis is "out of band".
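A minimal sketch of that split (hypothetical names, standard library only): the hot path records the cheap in-stream statistics - counts and arrival times - and hands the message body to a background thread for the slow, out-of-band content analysis.

```python
import queue
import threading
import time

class Interceptor:
    """Tee point on a message flow: count in stream, analyze out of band."""

    def __init__(self, analyze):
        self.count = 0           # in-stream statistic: how many messages
        self.arrivals = []       # in-stream statistic: when they flew by
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._drain, args=(analyze,),
                                  daemon=True)
        worker.start()

    def observe(self, message):
        """Hot path: must stay cheap so throughput is barely disturbed."""
        self.count += 1
        self.arrivals.append(time.time())
        self._pending.put(message)   # content analysis happens later
        return message               # the transaction continues normally

    def _drain(self, analyze):
        while True:
            message = self._pending.get()
            analyze(message)         # detailed, possibly slow, analysis
            self._pending.task_done()
```

The frequency/volume numbers are available the instant `observe` returns; the results of `analyze` arrive whenever the background worker gets to them.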

Thursday, June 7, 2012

CAP Theorem, partitions, ambiguity, data trust

This posting was written in response to Eric Brewer's excellent piece entitled

CAP Twelve Years Later: How the "Rules" Have Changed

I have copied the statement of the theorem here to provide some context:

The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:
  • consistency (C) equivalent to having a single up-to-date copy of the data;
  • high availability (A) of that data (for updates); and
  • tolerance to network partitions (P).
The original article is an excellent read. Eric makes his points with crystal clarity.

I have found the CAP theorem and this piece to be very helpful when thinking about tradeoffs in database design - especially of course in distributed systems. It is rather unsettling to trade consistency for anything, but we have of course been doing that for years.

I am interested in your thinking about the topic more broadly - not cases where the partitions are essentially of the same schema, but cases where we have the "same data" and yet, because of a variety of constraints, don't necessarily see the same value for it at a given moment in time.
Here is an example, one we see every day and are quite happy with: managing meetings.
Imagine that you and I are trying to meet. We send each other asynchronous messages suggesting times, with neither of us having insight into the other's calendar. Eventually we agree to meet next Wednesday at 11am at a coffee shop. Now there is a shared datum - the meeting. However, there are (at least) two partitions of that datum: mine and yours. I can tell my system to cancel the meeting, so my view of the state is "canceled", but you don't know that yet. We definitely don't have atomicity in this case. We also don't have consistency at any arbitrary point in time. If I am ill-mannered enough not to tell you that I don't intend to show, the eventually consistent state is that the meeting never took place - even if you went at the appointed hour.
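The example plays out in a few lines of code (a toy sketch; the Calendar class and its message channel are hypothetical): two partitions of the meeting datum, connected only by asynchronous messages, disagree until the cancellation is finally applied.

```python
class Calendar:
    """One partition's view of the shared meeting datum."""

    def __init__(self):
        self.meeting = None
        self.inbox = []            # asynchronous message channel

    def receive(self, state):
        self.inbox.append(state)   # delivered, but not yet applied

    def sync(self):
        while self.inbox:          # apply pending updates, oldest first
            self.meeting = self.inbox.pop(0)

mine, yours = Calendar(), Calendar()

# We agree to meet: both partitions hold the same value.
mine.meeting = yours.meeting = "confirmed"

# I cancel; the notification sits undelivered in your inbox.
mine.meeting = "canceled"
yours.receive("canceled")

# At this moment the partitions disagree: no consistency at an arbitrary time.
assert (mine.meeting, yours.meeting) == ("canceled", "confirmed")

# Once the message is applied, we are eventually consistent.
yours.sync()
assert mine.meeting == yours.meeting == "canceled"
```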

I would argue that almost all the data we deal with is in some sense ambiguous. There is some probability function (usually implicit) that informs one partition about the reliability of the datum. So if, for example, I have a reputation for standing you up, you might attach a low likelihood of accuracy to the meeting datum. That low probability would then offer you the opportunity to check the state of the datum more frequently. So perhaps there is a trust continuum in the data, from a high likelihood of it being wrong to a high likelihood of it being right. As we look at shades of probability we can make appropriate risk management decisions.
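One way to act on that trust continuum (a hypothetical sketch; the function name and numbers are illustrative, not from any real system): scale how often a partition re-checks the shared datum by the probability that it is still right.

```python
def check_interval(trust, base_interval=3600.0, floor=60.0):
    """Seconds until the next re-check of a shared datum.

    `trust` is the (usually implicit) probability that the datum is
    still right; lower trust means checking sooner, down to a floor.
    """
    return max(floor, base_interval * trust)

# A reliable counterpart: re-confirm the meeting a little under hourly.
reliable = check_interval(0.95)   # about 3420 seconds
# Someone with a reputation for standing you up: check every few minutes.
flaky = check_interval(0.05)      # about 180 seconds
```

The mapping from trust to interval is a risk management knob: the cost of a wasted check is traded against the cost of acting on a stale datum.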

I realize of course that this is broader than the area you were exploring initially with CAP, but as we see more on-the-fly analytics, decision making, etc., we will discover the need for some semantics around data synchronization risk. It's not that these issues are new - they assuredly are not. But we have often treated them implicitly, building rules of thumb into our systems, and that approach doesn't really scale.

I would be interested to hear your thoughts.
PS I have cross posted this against the original article as well.