Tuesday, February 12, 2013

Operational Data Stores and canonical models

One way of thinking about an operational data store is as a near real time repository of transactional data, organized to support an essentially query-oriented workload against transactions within their operational life. This is particularly handy when you need a perspective on "objects" that live longer than any individual transaction against them. Examples might include large scale supply chain applications and airline reservations - business objects that may well have transactions against them stretching over time.
In both cases, the main "object" (large, business grained) has a life span that could be long - surviving system resets, new versions of the underlying systems, and so on.
Consider the case of an airline reservation: it can have a lifespan of a couple of years. The reservation can be "opened" up to 330 days before its last flight, and (especially where refund processing is involved) it might remain relevant for a year or so beyond that, give or take.
The pure transactional systems (reservations, check-in, etc.) are mostly concerned, in normal operations, with the current transactional view. However, several processes care about the historical view while the reservation is still active. Other processes deal with the history of completed flights going back years - taxes, lawsuits and other requests that can be satisfied from a relatively predictable data warehouse view.
It's the near term stuff that is tricky. We want fast access to the data, the requests might be a bit unpredictable, the transactional systems may have a stream of XML data available when changes happen, and so on.
So how do we manage that near real time or close-in store? How do we manage some kind of standard model without going through a massive data modeling exercise? How do we get value at limited cost? How do we deal with unknown specific requirements, while recognizing the need for some overarching patterns?
Several technologies are springing up in the NoSQL world (MongoDB, hybrid SQL/XML engines, Couchbase, Cassandra, DataStax, Hadoop/Cloudera) which might fit the bill. But are these really ready for prime time and sufficiently performant?
We are also not dealing with very big data in these cases, though the data might become big as we scale out; it is intermediate sized data. For example, in a reservation system for an airline serving 50 million passengers/year (a medium sized airline), the data size of such a store is only of the order of 5TB. It is not as if the system is "creating" tens of MB/second, as one might see in the log files of a large ecommerce site.
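A rough back-of-envelope (my own illustrative assumptions, not figures from any real system) shows where that per-document arithmetic lands, taking the roughly two-year active window described above:

passengers_per_year = 50_000_000
active_window_years = 2                    # reservations stay "live" for roughly two years
store_size_bytes = 5 * 10**12              # ~5 TB
docs_in_store = passengers_per_year * active_window_years
print(store_size_bytes / docs_in_store)    # ~50 KB of canonical XML (plus indexes) per reservation

Tens of kilobytes per document, in the low hundreds of millions of documents, is comfortable territory for mainstream storage engines - which is where the argument below ends up.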
If we intend to use the original XML structures as the "canonical" structure - i.e. the structure that will be passed around and shared by consuming applications - then we need a series of transforms to present the data in ways that are convenient for each consumer.
However, arbitrary search isn't very convenient or efficient against complex XML structures. Relational databases (especially at this fairly modest scale) are very good at searching, but rather slow at joining things up from multiple tables. So we have a bit of a conundrum.
One way might be to use the relational database's capabilities to construct the search we need, and then retrieve the raw XML for those documents that match - in other words, a hybrid approach. That way we don't have to worry too much about searching the XML itself. We do have to worry, however, about ensuring that the right transforms are applied to the XML so we can reshape the returned data, while still knowing that it was derived from the standard model. Enter XSLT. We can thus envisage a multi-part environment in which the search is performed using the relational engine's search criteria, but the real data storage and the returned set come from the XML. The service request would therefore (somehow!) specify the search, and then the shaping transform required as a second parameter.
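As a minimal sketch of how the pieces might hang together - assuming, purely for illustration, a single "reservations" table that carries a few promoted search columns alongside the raw canonical XML, with Python, SQLite and lxml standing in for whatever relational engine and XSLT processor is actually in house (all element and column names below are hypothetical):

import sqlite3
from lxml import etree

# A shaping transform that a consuming application might supply as the second
# parameter of the service request; this one boils a reservation down to a summary.
SUMMARY_XSLT = etree.XSLT(etree.XML(b"""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/reservation">
    <summary>
      <locator><xsl:value-of select="@locator"/></locator>
      <passenger><xsl:value-of select="passenger/name"/></passenger>
    </summary>
  </xsl:template>
</xsl:stylesheet>
"""))

def store_reservation(conn, raw_xml):
    """Promote a few fields into relational search columns; keep the XML whole."""
    doc = etree.XML(raw_xml)   # canonical document as text, no encoding declaration
    conn.execute(
        "INSERT OR REPLACE INTO reservations (locator, origin, travel_date, raw_xml) "
        "VALUES (?, ?, ?, ?)",
        (doc.get("locator"), doc.findtext("segment/origin"),
         doc.findtext("segment/departureDate"), raw_xml),
    )

def find_reservations(conn, origin, travel_date, shaping_xslt=SUMMARY_XSLT):
    """Search with the relational engine, then shape and return the raw XML."""
    rows = conn.execute(
        "SELECT raw_xml FROM reservations WHERE origin = ? AND travel_date = ?",
        (origin, travel_date),
    )
    # The relational side only locates documents; what goes back to the caller is
    # the transformed canonical XML, never the storage schema itself.
    return [str(shaping_xslt(etree.XML(raw))) for (raw,) in rows]

The backing table is nothing more exotic than a locator primary key plus origin, travel_date and raw_xml columns, with indexes on the promoted fields; everything the relational engine never needs to understand stays inside raw_xml.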
It is a bit of a kluge pattern, perhaps, but it achieves some enterprise level objectives:
  • Use existing technologies where possible. Don't go too far out on a limb with all the learning curve and operational complexity of introducing radical technology into a mature organization
  • Don't bump into weird upper bound limits (like the 16MB limit in MongoDB)
  • Don't spend too much time in a death by modeling exercise
  • Most access to the underlying data comes through service calls, so data abstraction is minimized
  • Use technology standards where possible.
  • Rebuild indexes, etc. from the original data when search schema extensions are needed (see the sketch just after this list)
  • Possibly compress the raw XML since it is only required at the last stage of the processing pipe
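To make the "rebuild indexes from original data" point concrete, here is a sketch (same hypothetical table and element names as above) of how a newly required search column could be backfilled entirely from the stored XML, without going back to the source systems:

from lxml import etree

def add_search_column(conn, column, element_path):
    # column and element_path come from our own configuration, never from end users
    conn.execute("ALTER TABLE reservations ADD COLUMN %s TEXT" % column)
    rows = conn.execute("SELECT rowid, raw_xml FROM reservations").fetchall()
    for rowid, raw in rows:                       # in practice, batch rather than fetchall
        value = etree.XML(raw).findtext(element_path)
        conn.execute("UPDATE reservations SET %s = ? WHERE rowid = ?" % column,
                     (value, rowid))
    conn.execute("CREATE INDEX idx_%s ON reservations(%s)" % (column, column))
    conn.commit()

# e.g. add_search_column(conn, "frequent_flyer", "passenger/frequentFlyerNumber")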
It also has some significant disadvantages:
  • Likely to chew up considerable cycles when executing
  • Some management complexity
  • Possible usage anarchy - teams expressing queries that overconsume resources
  • Hard to predict resource consumption
  • Maybe some of the data don't render cleanly this way
  • Must have pretty well defined XML structures
So this pattern gives us pause for thought. Do we need to go down the fancy new technology path to achieve some of our data goals? Perhaps not for this kind of data case. Of course, there are many other data cases where specially optimized technology would be desirable. This just doesn't happen to be one of them.