Monday, January 19, 2009

Data ambiguity Part 1

I was having breakfast on Saturday with an old friend. He pointed me to this article by Werner Vogels ( CTO).

This posting by Mr. Vogels is very insightful - along the data replica dimension of distributed data management. This is clearly an important dimension, but it isn't the only one. There is a more general problem of data ambiguity. This isn't necessarily just a database problem, but an overall systems problem.

The basic thought is that when you have two representations of a piece of data that are supposed to have the same value, when do they actually have the same value (and who cares?).

We can imagine the following cases.

  1. The 2 values must always have identical values (to an outside observer). Inside the "transaction" that sets the state of a datum, the values can be mutually inconsistent, but that inconsistency is not manifested to an observer outside of the transaction.

  2. The 2 values will need to be "eventually consistent" - this case is admirably covered by Mr. Vogels.

  3. The 2 values will rarely have identical values, but there are mechanisms for "explaining" the discrepancies.

The first case is almost a default case - yes we would like that please. The second case is a good perspective from a data replication perspective - essentially dealing with a common schema. The third case is the tricky case.

The first case is unattainable at large scale using ACID transactions for replicas of data at Internet scale is simply impractical for performance.

The third case is interesting because of situations where "transactions" can occur against either copy of the data independently and in arbitrary sequences. The communication mechanism between the systems that can update copies of the data may be reliable, or they may be intermittent. That isn't completely the issue.

So, to illustrate this kind of system, let's take a popular application - Quicken. Many people use Quicken to manage their household accounts. The idea is to be able to use Quicken as a kind of front end to bank accounts - but it is only intermittently connected.

At any moment, the balance that Quicken reports and the balance that the bank reports are very likely to be different values. Of course from a data management perspective they are actually different fields, however that subtlety will be lost on the majority of users. Why will the 2 have different values for the balance field? There are lots of reasons, e.g.

  • Transactions have arrived at the bank without being notified to Quicken yet. For example, in an interest bearing account, the interest payment will be automatically added to the balance on the bank's view of the account. Or possibly a paid in check has bounced - the bank will have debited the check amount and (possibly) added a penalty.
  • Transactions are processed in a different sequence in general. When a user writes the checks, there is no guarantee that they will be processed by the bank in the order in which they were written (in fact, the policy varies, e.g. process the biggest checks first if there are many to be processed because that maximizes overdraft charges in the event that an account goes overdrawn).

These reasons boil down to the need to have system autonomy for throughput (imagine having to wait at the bank to process check 101 until check 100 had been processed).

Of course it doesn't matter to us that the systems are rarely fully synchronized, that the "balance" doesn't agree across them - we have accounting methods to help us reconcile. In other words we can accept that everything is OK without caring whether the systems have the same value of the balance.