Tuesday, January 18, 2011

More on what event models can learn from database models

One of my earliest posts was entitled, "What can SOA learn from RDBMS?" In that posting I was arguing for "Event Triggers" that fire when "interesting" business objects change. Of course, "interesting" needs further definition and so for tha matter does "change". However that isn't where I am going in this post.

Historically in the database world, most systems implemented "pessimistic" locking mechanisms. In a pessimistic approach, access (with possible intent to update) to a data resource could not be obtained unless the requestor were able to get exclusive control of the resource. More recent systems employ an "optimistic" approach whereby the data resource is only locked while it is actually being updated. The semantics are quite different, but the intent is the same - makle sure that we don't have conflicting updates.

In general pessimistic locking delivers absolute guarantees, but may cause throughput reductions or excessive resource consumption. Optimistic locking assumes that there will be little contention and therefore doen't consume resources not hold locks "until it has to". In a system with optimistic locks the update requestor may have to deal with the contention behavior (retrying operations or whatever). In the pessimistic scenario, the requesting application can't proceed unless it is safe to do so.

In eventing models, we have similar trade offs that we need to make. No, they aren't lock based, but they do have implications on throughput. These are all focussed around guranteed sequence and guaranteed delivery. Both of these impose throughput limitations on an event based system. Guaranteed sequence isn't really any kind of guarantee however. It is a guarantee that over a certain time frame, messages will arrive in sequence. However this unpacks into not applying events in consuming systems until the guarantee period has expired in some cases.

Why you might ask? Imagine that you have timestamped events coming in, you may not know that an event that comes in does not have an earlier predecessor that has somehow been held up. So you might choose to wait for the appropriate time interval "just to make sure nothing is coming in later" and then apply all events in proper sequence. Set the event sequence window to 5 minutes and in the worst case you are processing 5 minute batches. That doesn't sound very friendly.

So maybe you think about applying sequence numbers to the events. That's fine (well kind of) unless the sequence number calculator is a central shared resource, in which case there may be contention for that. At least however, you can detect that events are out of sequence because if a receiving system perceives a gap in sequence numbers, then it can wait for the "time out period" or the appearance of the missing sequence number(s) whichever happens first and then apply the updates. This is a bit intrusive to the event creators - managing sequence numbers was probably not in their original plan.

In neither of the above scenarios do we get maximal throughput. They are both intrusive and will deliver quite inconsistent throughput. They are rather akin to the pessimistic locking approach taken in database systems.

Looking at it a different way, we can perhaps detect (externally to the applications) that things have happened out of sequence. maybe using an out of band control infrastructure. Thus if a sender identifies when something was sent (using a messaging infrastructure), the receiver's act of receiving will identify when it was received. If the receipts are out of sequence the infrastructure can alert and trigger actions. The assumptions are that:
  • Sequence errors are few and far between
  • It is OK to recognize that the sequence has been violated
  • The infrastructure is capable of re-delivering the events in an appropriate sequence
  • There have been no irrevocable side effects as a result of the out of sequence receipt.
Attempting to detect post-hoc is likely (and I have not done the work to prove this yet) to deliver mmore consistent (and higher) throughput than attempting to guarantee, with a slight increase in risk.

This looks to be a fruitful area of research (or if the research has already been done, a fruitful area for some best practice patterns) as we treat 2011 as "The Year of the Event" as Todd Biske proposes.