Thursday, June 7, 2012

CAP Theorem, partitions, ambiguity, data trust

This posting was written in response to Eric Brewer's excellent piece entitled

CAP Twelve Years Later: How the "Rules" Have Changed

I have copied the statement of the theorem here to provide some context:

The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:
  • consistency (C) equivalent to having a single up-to-date copy of the data;
  • high availability (A) of that data (for updates); and
  • tolerance to network partitions (P).
The original article is an excellent read. Eric makes his points with crystal clarity.

I have found the CAP theorem and this piece to be very helpful when thinking about tradeoffs in database design - especially of course in distributed systems. It is rather unsettling to trade consistency for anything, but we have of course been doing that for years.

I am interested in your thinking about the topic more broadly - where we don't have partitions that are essentially of the same schema, but cases where we have the "same data" but because of a variety of constraints, we don't necessarily see the same value for it at a moment in time.
An example here. One that we see every day and are quite happy with. That of managing meetings.
Imagine that you and I are trying to meet. We send each other asynchronous messages suggesting times - with neither of us having insight into each other's calendar. Eventually we agree to meet next Wednesday at 11am at a coffee shop. Now there is a shared datum - the meeting. However there are 2 partitions of that datum (at least). Mine and yours. I can tell my system to cancel the meeting. So my knowledge of the state are "canceled", but you don't know that yet. So we definitely don't have atomicity in this case. We also don't have consistency at any arbitrary point in time. If I am ill-mannered enough not to tell you that I don't intend to show, the eventually consistent state is that the meeting never took place - even if you went at the appointed hour.

I would argue that almost all the data we deal with is in some sense ambiguous. There is some probabilty function (usually implicit) that informs one partition about the reliability of the datum. So, if for example I have the reputation for standing you up, you might attach a low likelihood of accuracy to the meeting datum. That low-probability would then offer you the opportunity to check the state of the datum more frequently. So perhaps there is a trust continuum in the data from a high likelihood of it being wrong to a high likelihood of it being right. As we look at shades of probabilty we can make appropriate risk management decisions.

I realize of course that this is broader than the area that you were exploring initially with CAP, but as we see more on the fly analytics, decision making, etc. we will discover the need for some semantics around data synchronization risk. It's not that these issues are new - they assuredly are not. But we have often treated them implicitly, building rules of thumb into our systems, but that approach doesn't really scale.

I would be interested to hear your thoughts.
PS I have cross posted this against the original article as well.