Thursday, December 29, 2016

Following up on Spraint

This posting isn't just a blog entry; it is a magnum opus. The TL;DR version is "don't send the complete state of the system object downstream if the downstream systems are trying to deliver events". It fits into EA (admittedly on the technical side of things) because it digs into what being event driven means to the enterprise.

In a previous posting I introduced the notion of having to dig through old copies of data to figure out what happened. That post didn't dwell on ways to avoid doing that, so this one will.
The question on the table is a simple one: how can a system inform other systems that an event has occurred? I work mostly in the domain of airlines - reservation systems, operational systems, and tying systems together - so this post will draw from airline reservations for its examples.
In the major reservation systems, a "transaction" can be quite a big thing. A passenger can establish a booking, with a suitable itinerary, have it priced and booked all in a single business transaction. A "transaction" can also be quite a small thing. A passenger can add a passport number to the booking. These are clearly at wildly different levels of granularity. So what's a poor architect to do?

The brute force approach is to change the state of the reservation object (or PNR, in travel industry parlance) and then ship the whole PNR off downstream to be consumed by other applications. Oh, and by the way, a PNR in XML might have a signal to noise ratio of about 10% and it might be as large as 1MB. If a receiving application needs to know what happened to cause the second message to be sent, it has to look at the first one and deduce the difference. Lots of compute energy consumed to figure out what the upstream system already knew. We will refer to this as BF.
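To make that burden concrete, here is a minimal sketch of what every BF consumer ends up writing. The flat dictionaries and field names are invented for illustration (a real PNR is far richer); the point is simply that the consumer diffs the current snapshot against the previous one to recover what the producer already knew.

# Hypothetical illustration of the BF burden: diff two full PNR snapshots
# (simplified here to flat dicts) just to recover what changed.
def diff_pnr(previous: dict, current: dict) -> dict:
    """Return the attributes that differ between two full PNR snapshots."""
    changes = {}
    for key in previous.keys() | current.keys():
        old, new = previous.get(key), current.get(key)
        if old != new:
            changes[key] = {"old": old, "new": new}
    return changes

prior = {"record_locator": "ABC123", "passport": None, "itinerary": "DFW-LHR"}
latest = {"record_locator": "ABC123", "passport": "123456789", "itinerary": "DFW-LHR"}

# The upstream system knew a passport was added; downstream has to rediscover it.
print(diff_pnr(prior, latest))  # {'passport': {'old': None, 'new': '123456789'}}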

Another approach might be for the upstream system to ship the whole PNR with some data somewhere in it telling downstream systems what changed. Still pretty heavyweight, but at least the decoder ring only has to cover the header and doesn't require decoding of the whole PNR. We will refer to this approach as BFD.

A third approach might be for an upstream system to send the whole PNR only for the first transaction, and then send only deltas (actions, perhaps) for subsequent transactions. We will refer to this approach as DO.
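To put the three approaches side by side, here is a hedged sketch of what the payloads might look like. Every field name is invented for illustration, not drawn from any industry schema.

# Invented payload shapes for the three approaches (illustrative only).
full_state = {
    "record_locator": "ABC123",
    "passengers": ["BIRD/CHRIS"],
    "itinerary": ["AA1234 DFW-LHR 01JUL"],
    "passport": "123456789",
}

# BF: the complete state every time; the consumer must work out what changed.
bf_message = {"pnr": full_state}

# BFD: the complete state plus a header that says what changed.
bfd_message = {"changed": ["passport"], "pnr": full_state}

# DO: full state only on the first transaction, then deltas (actions) afterwards.
do_first_message = {"sequence": 1, "snapshot": full_state}
do_delta_message = {"sequence": 2, "action": "ADD_PASSPORT", "value": "123456789"}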

There is also an information reliability aspect to contend with. Because the systems that need to communicate can have a variety of undesirable traits (they might receive data out of sequence, data might be lost somewhere in the network, a downstream system forgot to upgrade the schema,...) we also need an approach that provides sufficient reliability.

So let's look at the needs of a variety of potential consumers.

A Data Warehouse that needs all of the "state" information for each transaction

If the whole architectural approach to the enterprise is based on a collection of data stores (domain oriented operational and data warehouses), then this predominant pattern is for you. But it doesn't necessarily deliver the greatest business agility.

BF Approach

Taking the BF approach, the data warehouse has pretty much what it needs. There are the sequence issues to contend with, but by and large this is the easiest approach for the warehouse. You have complete information at every stage, so it is easy enough just to store the data as it comes in.

Except of course it's a lot of data. And this kind of data storage is often the most expensive storage in the enterprise. So maybe a Change Data Capture (CDC) approach makes sense. So what has actually happened is that a producing system has sent a stateful thing to the data warehouse, and the data warehouse breaks it down to see what changed and stores the changed bits. Hmm, sounds like the upstream system is carefully packaging something only for the data warehouse to unpackage it to deduce what happened. Essentially (continuing with the scatological metaphors) looking for the pony in there.
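For what it's worth, that warehouse-side unpackaging might look something like the sketch below. The attribute names and record shape are invented; the idea is just that each incoming full snapshot is compared with the last one stored and only the changed attributes are kept, stamped with an effective time.

from datetime import datetime, timezone

# Hypothetical CDC step at the warehouse: keep only the changed attributes
# of each incoming full snapshot, stamped with an effective timestamp.
last_known: dict = {}    # record_locator -> last stored snapshot
change_log: list = []    # the "changed bits" the warehouse actually keeps

def capture_changes(snapshot: dict) -> None:
    locator = snapshot["record_locator"]
    previous = last_known.get(locator, {})
    for attribute, value in snapshot.items():
        if previous.get(attribute) != value:
            change_log.append({
                "record_locator": locator,
                "attribute": attribute,
                "value": value,
                "effective_at": datetime.now(timezone.utc).isoformat(),
            })
    last_known[locator] = snapshot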

BFD approach

The BFD approach has the advantages of the BF approach in that the data warehouse is in some senses complete. So no real impact there.

DO approach

The DO approach is the hardest approach for the data warehouse. Since only the initial snapshot arrives in full and everything after that is transaction history, the warehouse has to apply the changes forward to maintain complete state - a kind of reverse CDC. It at least knows exactly what changed, so it is potentially no worse than the forward CDC.
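A sketch of that forward application, again with invented field names: the warehouse keeps the last full snapshot per PNR and folds each incoming delta into it.

# Hypothetical "reverse CDC" under the DO approach: hold the last full state
# and fold each delta into it to keep the warehouse picture complete.
warehouse_state: dict = {}    # record_locator -> current reconstructed state

def apply_message(message: dict) -> None:
    if "snapshot" in message:                   # first transaction: full state
        snapshot = dict(message["snapshot"])
        warehouse_state[snapshot["record_locator"]] = snapshot
    else:                                       # subsequent transactions: deltas only
        pnr = warehouse_state[message["record_locator"]]
        pnr[message["attribute"]] = message["value"]

apply_message({"snapshot": {"record_locator": "ABC123", "passport": None}})
apply_message({"record_locator": "ABC123", "attribute": "passport", "value": "123456789"})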

Downstream systems need to deliver lightweight "something happened" events

This architectural pattern for the enterprise assumes that information can be acted upon as soon as it becomes available. It doesn't mean it has to be, but it could be. Transactional systems typically execute the business transactions (statement of the obvious, I know), but rarely have the scope to deal with the implications. The implications are left to be dealt with by other systems.

BF approach

This is the least convenient approach for systems with an event generation requirement. To figure out what happened (and thus which events to emit), the application must determine the difference between the current message and its predecessor. This can be an expensive operation. It is also inherently unreliable because:
  • The eventing system has to fully process the messages in order so that it can determine state change 
  • The messages may arrive out of sequence
  • It may not be possible to determine that there is a gap in message sequence a priori.
This has a further limiting factor: processing the messages becomes a sequence preserving activity. Such sequence preserving activities are, by nature, governors on throughput.
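To see why, here is a hedged sketch that assumes the producer stamps each message with a per-PNR sequence number (an assumption - not every producer does). Gaps and out-of-order arrivals can be detected, but later messages still have to be buffered until the gap is filled, which is exactly the governor described above.

# Illustrative gap detection, assuming a per-PNR sequence number on each message.
expected: dict = {}    # record_locator -> next expected sequence number
buffered: dict = {}    # record_locator -> {sequence: message} held back for gaps

def receive(message: dict) -> list:
    locator, seq = message["record_locator"], message["sequence"]
    nxt = expected.get(locator, 1)
    if seq != nxt:                               # out of order, or a gap upstream
        buffered.setdefault(locator, {})[seq] = message
        return []                                # nothing can be processed yet
    ready = [message]
    nxt += 1
    pending = buffered.get(locator, {})
    while nxt in pending:                        # drain anything now contiguous
        ready.append(pending.pop(nxt))
        nxt += 1
    expected[locator] = nxt
    return ready                                 # messages safe to process, in order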

BFD approach

In the BFD approach, the downstream, event producing system can identify what changed from the delta information it was given - at least, it says what changed since the previous message. Coupled with data that identifies what the previous values were, it becomes possible to generate events properly. Except, again, for missing messages. Quite complex logic has to be put in place to deal with gaps in sequence when they are detected.

DO approach

In the DO approach, the downstream, event producing system can determine what happened from the transaction history. It doesn't have to wade through full state to figure it out. But there is some need to send the full transaction history with each event, because you can't recreate the history if there are gaps. So this is a bit of a hybrid approach. It is a bit like a banking system, where you have a periodic statement and you can see the individual transactions between statements.
This approach gives a degree of flexibility - allowing for a kind of duality between state and event. But it still feels unsatisfactory.
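A sketch of the banking statement flavour of DO (all names invented): each event carries the actions applied since the last full "statement", so a consumer that spots a gap can rebuild state by replaying that history rather than by diffing full states.

# Hypothetical hybrid consumer: replay the actions since the last full
# snapshot (the "statement") to repair a gap, instead of diffing full states.
def rebuild(statement: dict, history: list) -> dict:
    state = dict(statement)
    for action in history:                 # replay every action since the statement
        state[action["attribute"]] = action["value"]
    return state

statement = {"record_locator": "ABC123", "passport": None}
history = [
    {"attribute": "passport", "value": "123456789"},
    {"attribute": "meal", "value": "VGML"},
]
print(rebuild(statement, history))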

Conclusion

There really isn't a one size fits all approach to information management when you have such diverse temporal use cases. The immediate action systems need it fast and light. The historical reporting systems need it less fast, but in full, fine grained detail. So a poor architect has to think carefully about the relevant patterns, and decide which trade-offs to make.




Wednesday, November 9, 2016

Events and Context, and cauliflower.

I was chatting with a friend yesterday about composability, events, and other distributed computing concepts.
He is quite a fan of doing composition at the point of consumption. Using the idea of eating family style - a group of people, a common table and piles of food. Each diner chooses what to put on their own plate and how to arrange it.

But there is a flaw which we discussed (maybe many flaws), but the one we talked about was what memories exist as well. So for example, when the food arrives, one of the vegetables is cauliflower. I generally like cauliflower (Hmm, how do I know that - where is that information stored?). However, I have had the cauliflower at this restaurant before (a couple of years ago) and it was awful. (Where is that information stored?) The key point here is that when an event occurs (food delivered to the table, say), there is both information that is directly pertinent to the event (what kind of food, when did it arrive, who brought it, what was the temperature) and information that lives in history (I don't like the cauliflower here). We need both sets of data in order for me to have a satisfactory meal.
And that is why in our systems we do have to manage and make available historical state - even when our systems are driven by events.

Wednesday, November 2, 2016

System Spraint

Spraint is a quaint English word for otter droppings. Analysis of otter populations and their dietary habits can be performed by analysis of their spraint.

I see the same kind of analysis being required in systems that send their "spraint" - often in the form of messages downstream for subsequent systems to figure out what happened.

We want events (a passenger checked in), but instead we get a giant message with all of the reservation data and somehow we have to deduce what happened.

Knowing the current state of something doesn't tell us how it got to be in that state. If we are to try to figure out what happened from the state models we have to compare a previous piece of system spraint with the current one and look for differences. That is only sensible if the agent making the change can't or won't tell you what the change was.

In this day and age when we are in a mobile, somewhat reliable, but still constrained network world, it is extremely expensive to send giant messages around the system of systems when all that one of the systems might need is a little piece of knowledge that "something happened".

Being told that something happened, vs having to deduce it makes the job of downstream systems way easier.

This line of thinking is fundamental to the architecture of the enterprise - organizing the enterprise around the business events that can happen, and then having meaningful interpretation and use of those events available instantly should be a goal.

Ask the executive out of whose budget the analysis is coming, "Would you like to be told what happened? Or would you like us to figure it out?" A business taxonomy of events becomes vital.

Would you prefer to know "Passenger Chris Bird just boarded flight 1234", or to be handed a pair of booking records that show that Chris wasn't boarded and then he was?

Sunday, July 26, 2015

Internationalization, Currency, etc.

As we think more globally, companies that have previously allowed transactions in only a single currency realize that they may have to support many. That, generally speaking, pervades everything: from the user experience (How do we choose which currency to show our prices in initially? What language should the home screen display?) to every transaction trying to compute value throughout the corporation. Today's sales? Accounting structures? Refunds? Loyalty equivalencies? The list goes on forever.

Just looking at the UX, it appears that it might be a good idea to use the IP address of the user to determine the language in which to display the opening screens. But that turns out to be as bad an experience as you can imagine.

I am a Brit living in the USA and I travel to Canada. When I am visiting Montreal, and I bring up an ecommerce site, do I really want to see the site en Francais, and the prices in CDN? Not only no, but hell no. That's a kind of throw the device at the wall experience. Je ne parle pas Francais.

I do realize that in the absence of any information, I have just gone to "yourcompany.com" and don't have an account. You have to make some kind of a choice.
What are you to do?

  • Show the splash screen in a local language, with front page offers in the local currency?
  • Show a generic splash screen with flags or some other language/country devices and ask me to choose? (And by implication not show any promotional pricing on the splash screen)
  • Do it in the company's own language. "We are British, Damnit" - But don't expect to transact in Quebec.
And then there is the dialect issue - especially in Spanish. In this interview Cristina Saralegui talks about the difficulty of communicating with Spanish speaking peoples in the different countries of Central America. Listen for where she is talking about the word for "beans" and realize that there is no single standard. But please also be careful: the next piece of the clip deals with where things can go wrong quite offensively. If you were selling beans online, I wonder what you would say?

Coming back to the IP address, many people have ways of masking their IP addresses anyway. Is it commonplace? No, not yet. But more and more people who wish not to be targeted by advertising, social engineering and other distasteful practices are finding ways to keep their IP addresses hidden. Also handy if you are a visitor in a country with extra surveillance and wish to maintain some degree of unsurveilled access.

Why is this even in an architecture post, you might ask? It is here because the issues around currency and online presence are both horizontal (organization and customer base wide) and deep (throughout the very fabric of the enterprise). It isn't just the purview of a marketing department, but gets to the very essence of the company.


Friday, November 28, 2014

Shearing Layers - Part 3

In the previous Shearing Layers post, I was talking quite theoretically about the rate at which stores could change their layouts to suit some time period. E.g. changing seasonal merchandise, etc.
How about we get even more radical. Perhaps we could (and maybe should) rearrange the store by time of day.
The traditional grocery store (at least in the USA) remains relatively stable in layout (produce and deli at the edges, dairy and other cold stuff at the back, frozen food in the middle, other processed items in their own neat aisles). This is well arranged for the weekly shopper. However, for the quick in and out shopper it is not particularly convenient.
Let's do some imagineering here:
  • There is some body of shoppers who would like to go to the store more frequently and buy less stuff at once.
  • This group comes in in the time before dinner so they can buy what they need for dinner.
  • This group is intimidated by having to go all around the store to find what they need.
  • We know what things are bought at which time of day.

Perhaps if there were a way to understand these kinds of habits we could arrange the store layout so it is a bit easier for these people to find what they want for dinner that evening.

Use small temp displays that can be prepared in the mystery area at the back, and have them rolled out to the specialized area. Create an ad campaign that matches the philosophy. Give extra points/rewards or whatever for people buying the dinner time special items. Vary the items a bit by day so you don't see the same can of beans each time.

The bottom line - think about what can be changed easily (simple shearing layer), look deeply into the data you already have (and can get) to see if there are changes that might be beneficial. Make the changes quickly and cheaply (it shouldn't be an expensive operation - if it is, you are at the wrong shearing layer). Measure effectiveness. Rinse and repeat.

Thursday, June 12, 2014

Events, messages and state

I have recently had the need to think about messages, events, and state. I may have finally caught up to where Nigel Green was three years ago!

These three ideas are often bundled together, but really they represent different things. And when we try to treat them homogeneously we run into difficulties.

Dealing first with messages: messages are units of transmission. They provide a mechanism for moving some bits from point A to point B. The arrival of a message is an interesting "event" in the technical domain - useful for capturing statistics, handling billing, etc. But it is not really the business trigger for business functionality.

The events may well be bundled up inside a message. A message may well contain information about multiple events. In the airline reservation world, we might have events that create a new reservation skeleton, add passenger(s) to that reservation skeleton, add itinerary information to that reservation skeleton, etc. In other words, events that indicate lifecycle happenings to the reservation. However, during the reservation business process, the various happenings may all end up being bundled into a single message. Or they might be separate. Remember, the message is about transmission, not functionality.

Then there is state. By state we mean the values of all of the attributes associated with an object of interest. Yeah, I know, a bit vague :-(. So the state of a reservation is the state after the various events have affected the reservation. That state might be notified after just the itinerary has been built, or it might be after the itinerary + passenger + pricing. The state is set at a relatively arbitrary point in time. That's a packaging issue.

In some cases, message size limitations will demand that multiple transmission units (messages) be required to transmit the complete state.
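A hedged sketch of that separation (field names invented): the message is just the transmission envelope, and it may carry one event, several events, or a state snapshot split across several messages.

# Illustrative only: the message is the unit of transmission; the events and
# the state are separate concerns that happen to travel inside it.
single_event_message = {
    "message_id": "m-001",
    "events": [{"type": "RESERVATION_CREATED", "record_locator": "ABC123"}],
}

bundled_events_message = {
    "message_id": "m-002",
    "events": [
        {"type": "PASSENGER_ADDED", "record_locator": "ABC123", "name": "BIRD/CHRIS"},
        {"type": "ITINERARY_ADDED", "record_locator": "ABC123", "segment": "DFW-LHR"},
    ],
}

# State split across two messages because of size limits.
state_messages = [
    {"message_id": "m-003", "part": 1, "of": 2, "state": {"record_locator": "ABC123"}},
    {"message_id": "m-004", "part": 2, "of": 2, "state": {"itinerary": ["DFW-LHR"]}},
]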

Bottom line: EAI-type patterns generally assume some kind of unification between message/state/event, but in reality they are separate concepts and should be handled differently.




Victim of a professional with attitude - a requirements story

Again, a post that isn't so much about architecture - more about miscommunication, but this time with me as the 'user'. It illustrates how the poor 'users' feel when we behave like Scott.

First the back story. In our kitchen we have french doors leading to the back yard (garden for my non-American followers). They have double paned glass and mini-blinds between the panes. So we get insulation and privacy. The whole assembly (doors, threshold, etc.) is in a single unit.

If you think about it there are essentially 4 configurations possible.
  • Doors open outwards, left door is the main door
  • Doors open outwards, right door is the main door
  • Doors open inwards, left door is the main door
  • Doors open inwards, right door is the main door
Of course, the beginning of the issue can be seen in the above definitions. Left/right - from which perspective? In/out - from which perspective?

The first mistake that I made was that I didn't know that most of the time the building codes here specify that french doors open inwards. So I described what I needed to the salesman (let's call him Scott - that is, after all, his real name). I explained that I wanted the left door, facing outwards, to open. So he did the mental gyrations and pointed me to the one he thought I wanted. It wasn't what I wanted - it did have the left door facing out as the opening door, but the whole assembly opened inwards, and it is not reversible.

So I had to return it and order the proper one. Now I am not an expert in the internal naming of sides of doors, conventions in the industry etc. I have a requirement. Open outwards, left door facing outwards must open. So the developer - oops, Scott again translates my requirements to the specification (the order) and asks me to sign off. I naively assumed that the requirements would have been correctly translated to design - silly me. What did I get? Outward opening, right door. And since it was a "special order", I would have to pay again to get the correct one.

There is presumably some assumption about how doors are specified. Is the specification left/right as determined by the direction of opening? Is there some other way? I don't know. That is part of  the technical world of doors, not part of my desire for use.

With all that rigmarole, I came to a few conclusions for us as practitioners:
  • Our users don't know our vocabulary
  • Making users sign off on specifications when they don't know the vocabulary is costly
  • When we then make the users pay twice because of miscommunication we are failing the people that we should be delighting. 
All in all a bit humbling for me - when I assume that the user actually understands my jargon/terminology I am usually wrong. Note to self, when providing a service, don't jump to conclusions in your translation from the world that the user inhabits to the world you inhabit.

Saturday, February 15, 2014

Narcissistic Applications and Architecture

This post comes out of some work that I am doing with a client. Getting to the essence of event processing and what needs to be in place.
As many have observed, the metadata is often as interesting to the enterprise as the actual data. The trouble is that the enterprise doesn't necessarily know ahead of time what may or may not be interesting. Perhaps applications that manage the state of domain objects should tell the world when they have changed the state of a domain object that they manage.
It is only when applications start bragging about what they have done that the enterprise has the ability to draw conclusions that range across the domains.
So while the current state of an interesting domain object may well be locked up in a transactional database somewhere, that the state change occurred could (and should) be made available to any/all interested parties.
Let's think in terms of an intelligent (but fictitious) home environment that we will call the IHE.
Our daily activities in the house include:
  • Using hot water 
  • Turning lights on and off
  • Accessing computers
  • Watching TV
  • Sleeping
  • Opening/closing the refrigerator
  • Cooking
  • Eating
  • Managing the trash
  • Managing the recycling
  • Filling the dishwasher
  • Dressing
  • Doing (or having done) the laundry 
  • Opening/Closing exterior doors
  • ...
I have this feeling that my home bills are too high, so it might be interesting to see if any of my activities are inefficient (I leave lights on sometimes - so we have an event, followed by a negative event), or if some of my activities can be correlated. Perhaps a change in one suggests an opposite direction change in another.

Now if, hypothetically, all my activities resulted in events being notified and somehow analyzed, then perhaps (and this is a big perhaps) I have the opportunity to look at my patterns and make some changes that result in savings in time, energy or general annoyance.

Of course we do the obvious ones. When we sleep, running the dishwasher is a no-brainer. But what about multiple uses of the oven? What about leaving lights on? What about leaving doors/windows open correlated with when the heating/cooling are running.
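Purely as a sketch (the device events and timestamps are invented), correlating two of those streams might be as simple as flagging the interval when an exterior door stayed open while the HVAC was running - exactly the kind of cross-domain conclusion no single device can draw on its own.

from datetime import datetime

# Invented event streams: door open/closed and HVAC on/off.
door_events = [("2016-12-29T18:00", "OPEN"), ("2016-12-29T18:25", "CLOSED")]
hvac_events = [("2016-12-29T17:50", "ON"), ("2016-12-29T19:00", "OFF")]

def overlap(door, hvac):
    """Return how long the door was open while the HVAC was running, if at all."""
    d_open, d_close = (datetime.fromisoformat(t) for t, _ in door)
    h_on, h_off = (datetime.fromisoformat(t) for t, _ in hvac)
    start, end = max(d_open, h_on), min(d_close, h_off)
    return end - start if end > start else None

wasted = overlap(door_events, hvac_events)
if wasted:
    print(f"Door open while HVAC running for {wasted}")   # 0:25:00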

The point is that all of these state changes describe the minutiae of my life, and I don't have the time, nor the energy, to capture them. That detail should be captured at the time it happens - if I am truly interested. It shouldn't wait until after the event, when my recollection is hopelessly flawed.

Johnny Cash on Technical Architecture

Yes, that Johnny Cash aka the man in black. He of the deep voice, great songs, San Quentin concert...
I was having a cup of coffee at a local Starbucks yesterday when a friend showed me an "architectural diagram" of the technology components that a customer of his had shown him. Very proudly, and all based on open source (because they don't want to pay license fees), they had unveiled this masterpiece that had taken several years to build.

Immediately I was reminded of this terrific song....

http://youtu.be/5GhnV-6lqH8

We architects do need to work on ensuring a few things:
  • Don't overdo the technology
  • Open source may be the way to go, but joining disparate things up can get expensive fast
  • Ensure that the pieces can connect (bolts and bolt holes, anyone? 2:05)

Tuesday, August 27, 2013

Shearing Layers - Part 2, Stuff

In the previous post I introduced the notion of shearing layers - taken from Stewart Brand's book "How Buildings Learn". In this one I am going to look at how having better data can possibly affect the "Stuff" layer.
For example, shopping habits on the web site can show buying patterns and trends that could translate to the brick and mortar store. Looking at what people search for together online could give a clue to what they are looking for when they get to a store.
Note that the "could" in the previous observation is very much an imponderable. While shearing the "Stuff" on the web site is really easy (facilitating A/B testing, for example), it is still tricky in the brick and mortar store.
Reorganizing store shelves/layout runs the risk of confusing staff and customers. Things aren't where they were yesterday. Our ingrained habits and expectations no longer work for us. So the risk is definitely there, but there could be some interesting small experiments.
Perhaps it is worth grouping trousers by style/size and not by color. Perhaps it is worth grouping shoes by size, mixing up the brands. Of course that one is very tricky because we buy shoes with our eyes, so we may need to see a floor sample which will be of a single size.
The desires of the store, the desires of the brands and the desires of the customer may well come into opposition.
The online shopping experience can give us a rate of change greater than that in the physical store - delivering data to the store planners and merchandisers that can influence product placement, and the ultimate goal of selling more "Stuff" to the customers.

Sunday, August 25, 2013

Shearing Layers - Part 1 physical buildings

In Stewart Brand's terrific work, "How Buildings Learn", there are some great analogies to what we do in Enterprise Architecture. He expanded on the concept of "shearing layers" introduced by Robert V. O'Neill in his "A hierarchical concept of ecosystems". The primary notion being that we can hierarchically understand our ecosystems better by understanding the different rates of change possible at the different layers.
[Diagram: Brand's shearing layers - Site, Structure, Skin, Services, Space Plan, Stuff.]
The diagram above is reproduced from How Buildings Learn, and represents the parts of a building which change at different rates. It is arranged from outside in with, in this representation, no absolute correlation between the parts and rate of change. 
Using Brand's own explanation, the layers have the following descriptions:

Site

The site is the geographical setting, the urban location, and the legally defined lot whose boundaries and context outlast generations of ephemeral buildings.

Structure

The foundation and load bearing elements are perilous and expensive to change, so people don't. These are the building.

Skin

External surfaces can change more frequently than the structure of the building. Changes in fashion, energy cost, safety, etc. cause changes to be made to the skin. However tradition and preservation orders often inhibit changes to the skin since the skin is very much the aesthetic.

Services

These are the working guts of the building, communications/wiring, plumbing, air handling, people moving. Ineffective services can cause buildings to be demolished early, even if the surrounding structure is still sound.

Space Plan

The space plan represents the interior layout - the placement of walls, ceilings, doors, etc. As buildings are re-purposed, as fashion changes so can the interior quickly be reconfigured.

Stuff

The appurtenances that make the space useful for its intended purpose. Placement of tables, chairs, walls, cubicles, etc.
 
In further articles, I will develop this theme in two directions. First, thinking about how data can affect the way that retail organizations think about their layout and organization (shearing at the Stuff/Space Plan layers of both brick and mortar and web stores). Second, looking at Enterprise Architecture through the lens of shearing layers - by analogy with Brand's writing and thinking.

Saturday, June 8, 2013

Trying to understand IaaS, and other nonsense

There's been something upsetting me about the whole notion of Infrastructure as a Service. It has taken me a while to put my finger on it, but here goes. But first, an analogy with electricity usage and provisioning.
When I flick on the light switch, I am consuming electricity. It doesn't matter to me at the moment of consumption where it is coming from as long as:
  1. It is there when I want it
  2. It comes on instantly
  3. It delivers enough of it to power the bulb
  4. It doesn't cause a breaker to trip
  5. It doesn't cause the bulb to explode
  6. ...
So that's the consumption model. That's independent of the provisioning model - at least as long as those requirements are met.
I could satisfy that need through several mechanisms:
  1. I could have it delivered to my house from a central distribution facility
  2. I could make it myself
  3. I could steal it from a neighbour
  4. ....
Regardless of which provisioning methods I use, I am still consuming the electricity. The lightbulb doesn't care. However, the CFO of the birdhouse does care. Thinking about the service of electricity - it's about how I procure it and pay for it, not how I consume it. Sure, I can add elasticity of demand: it's summer and I am running the air conditioners throughout the house, both ovens are on, and every light... But that is a requirement on how I procure the service, not on how the devices use it.

Similarly in the software defined infrastructure world, the application that is running doesn't really care how the infrastructure it is running on was provisioned. The "as a service" part of IaaS is about the procurement of the environment on which the application runs.

The procurement model can, of course, affect the physical environment of the equipment. Just as delivering electricity to my house requires cable, effects on the landscape, metering, and centralized capacity management, we have to have those kinds of capabilities in our IaaS procurement worlds. No argument there, but at the end of the day it's about how the capability is delivered and paid for, not how it operates, that really matters.

Monday, March 11, 2013

Rate of Change

I have been trying to help operational IT groups understand how important the rate of change of resource consumption is. Finally I came up with an analogy that helps. In the airline industry, it isn't necessarily a bad thing if an aircraft is on the ground. It may not be returning anything to the business, but it isn't necessarily awful. However, the RATE at which it got to the ground is very important. Impact at 300kts is unlikely to be what anyone had in mind. Graceful "impact" at 150kts may be perfectly OK.
So while the lower rate of change is no guarantee that things are good, the higher rate of change means that immediate action will need to be taken.
Likewise in systems, if a disk gets to use 80% of its capacity gradually, there is probably no need to panic. A careful plan will allow operations to ensure that disaster doesn't strike. If it spikes suddenly then there is a definite need to do something quickly before the system locks up or starts losing transactions.
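A minimal sketch of that idea (the thresholds and readings are invented): alert on how fast utilisation is climbing, and on the projected time to full, not just on the level itself.

# Illustrative rate-of-change check on hourly disk utilisation samples (percent).
# The absolute level matters less than how fast it is climbing.
samples = [62.0, 64.0, 71.0, 80.0]           # hypothetical hourly readings

def assess(readings, spike_per_hour=5.0, capacity=100.0):
    rate = readings[-1] - readings[-2]       # percent per hour over the last interval
    if rate > spike_per_hour:
        hours_left = (capacity - readings[-1]) / rate
        return f"act now: +{rate:.1f}%/hour, full in about {hours_left:.1f} hours"
    return "trend is gradual: plan carefully, don't panic"

print(assess(samples))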
Knowing that there will be an issue is even more important than recovering from an issue that has already happened.

Tuesday, February 12, 2013

Operational Data Stores and canonical models

One way of thinking about an operational data store is as a near real time repository of/for transactional data, organized to support an essentially query-oriented workload against transactions within their operational life. This is particularly handy when you need a perspective on "objects" that have a longer life than their individual transactions might allow. Examples might include supply chain apps at a large scale and airline reservations - business objects that may well have transactions against them stretching over time.
In both cases, the main "object" (large, business grained) has a life span that could be long - surviving system resets, versions of the underlying systems, etc.
Considering the case of an airline reservation, it can have a lifespan of a couple of years - the reservation can be "opened" 330 days prior to the last flight, and (especially in the case of refund processing) it might last up to a year or so beyond that. At least, give or take.
The pure transactional systems (reservations, check in, etc.) are most concerned (in normal operations) with the current transactional view. However there are several processes that care about the historical view while the reservation is still active. There are other processes that deal with and care about history of completed flights, going back years. Taxes, lawsuits, and other requests that can be satisfied from a relatively predictable data warehouse view.
It's the near term stuff that is tricky. We want to gain fast access to the data, the requests might be a bit unpredictable, the transactional systems may have a stream of XML data available when changes happen, ...
So how do we manage that near real time or close-in store? How do we manage some kind of standard model without going through a massive data modeling exercise? How do we get value with limited cost? How do we deal with unknown specific requirements (recognizing the need for some overarching patterns)?
Several technologies are springing up in the NoSQL world (MongoDB, hybrid SQL/XML engines, Couchbase, Cassandra, DataStax, Hadoop/Cloudera) which might fit the bill. But are these really ready for prime time and sufficiently performant?
We are also not dealing with very big data in these cases, although the data might become big as we scale out. It is kind of intermediate sized data. For example, in a reservation system for an airline serving 50 million passengers/year (a medium sized airline), the data size of such a store is only of the order of 5TB. It is not like the system is "creating" tens of MB/second as one might see in the log files of a large ecommerce site.
If we intend to use the original XML structures as the "canonical" structure - i.e. the structure that will be passed around and shared by consuming apps, then we need a series of transforms to be able to present the data in ways that are convenient for consuming applications.
However, arbitrary search isn't very convenient or efficient against complex XML structures. Relational databases (especially at this fairly modest scale) are very good at searching, but rather slow at joining things up from multiple tables. So we have a bit of a conundrum.
One way might be to use the RDB capabilities to construct the search capabilities that we need, and then retrieve the raw XML for those XML documents that match. In other words, a hybrid approach. That way we don't have to worry too much about searching the XML itself. We do have to worry, however, about ensuring that the right transforms are applied to the XML so we can reshape the returned data, while still knowing that it was derived from the standard model. Enter XSLT. We can thus envisage a multi part environment in which the search is performed using the relational engine's search criteria, but the real data storage and returned set come from the XML. The service request would therefore (somehow!) specify the search, and then the shaping transform required as a second parameter.
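As a sketch of that multi part environment (using SQLite to stand in for the relational engine and the third-party lxml library for the XSLT step - both assumptions, not a recommendation): the relational index does the searching, the raw XML holds the canonical structure, and a transform reshapes the result for the consumer.

import sqlite3
from lxml import etree   # assumes lxml is available for the XSLT step

# Hypothetical hybrid store: search relationally, return reshaped raw XML.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pnr_index (locator TEXT, last_name TEXT, xml TEXT)")
db.execute(
    "INSERT INTO pnr_index VALUES (?, ?, ?)",
    ("ABC123", "BIRD", "<pnr><locator>ABC123</locator><pax>BIRD/CHRIS</pax></pnr>"),
)

# 1. Search with SQL - the thing the relational engine is good at.
row = db.execute(
    "SELECT xml FROM pnr_index WHERE last_name = ?", ("BIRD",)
).fetchone()

# 2. Reshape the raw XML with a trivial, invented XSLT for this consumer.
transform = etree.XSLT(etree.XML(b"""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/pnr">
    <booking ref="{locator}"><xsl:value-of select="pax"/></booking>
  </xsl:template>
</xsl:stylesheet>"""))
print(str(transform(etree.fromstring(row[0]))))   # a <booking> shaped for the consumer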
It is a bit of a kluge pattern, perhaps, but it achieves some enterprise level objectives:
  • Use existing technologies where possible. Don't go too far out on a limb with all the learning curve and operational complexity of introducing radical technology into a mature organization
  • Don't bump into weird upper bound limits (like the 16MB limit in MongoDB)
  • Don't spend too much time in a death by modeling exercise
  • Most access to the underlying data comes through service calls, so data abstraction is minimized
  • Use technology standards where possible.
  • Rebuild indexes, etc. from original data when search schema extensions are needed
  • Possibly compress the raw XML since it is only required at the last stage of the processing pipe
It also has some significant disadvantages:
  • Likely to chew up considerable cycles when executing
  • Some management complexity
  • Possible usage anarchy - teams expressing queries that overconsume resources
  • Hard to predict resource consumption
  • Maybe some of the data don't render cleanly this way
  • Must have pretty well defined XML structures
So this pattern gives us pause for thought. Do we need to go down the fancy new technology path to achieve some of our data goals? Perhaps not for this kind of data case. Of course there are many other data cases where it would be desirable to have specially optimized technology. This doesn't happen to be one of them.

Friday, November 23, 2012

Cash to Delivery

This post looks at a specific high level process drawn from the "eating out" industry. Its origins are from that fine barbecue establishment in Dallas - the original Sonny Bryan's. The time: 1985. The conversation took place in the shack that was as crowded as ever at lunch time. The method was: place order, pay for order, hang around and wait for order, pick up order, attempt to squeeze oversized bottom into school chairs, devour product. While casually waiting for the order to be prepared, I idly asked my colleague, "How do they match the orders with the people?" It seemed as if the orders always came out in sequence, so, being a systems person, I got to wondering about the nature of the process.

I had clearly paid in a single "transaction". Sonny Bryan's had my money. In return I had a token (numbered receipt) that stated my order content as evidence of what I had paid for. However that transaction was not synchronous with the delivery of the food, nor was the line held up while the food was delivered. Had it been, the place would have emptied rapidly because the allotted time for lunch would have expired.

I, as the customer, think that the transaction is done when I have wiped my mouth for the last time, vacated my seat and thrown away the disposable plates, etc. But the process doesn't work like that.
There are intermediate "transactions". The I paid and got a receipt transaction (claim check, perhaps?). The I claimed my order transaction. The I hung around looking for somewhere to sit transaction. The I threw away the disposables transaction.

Each of these transactions can fail, of course. I can place my order and then discover I can't pay for it. No big deal (from a system perspective, but quite embarrassing from a personal perspective). I could be waiting for my order, and get called away, so my food is left languishing on the counter. Sonny Bryan's could have made my order up incorrectly. I could pick up the wrong order. I could have picked up the order and discovered no place to sit. Finally, I could look for a trash bin and discover that there isn't one available (full or non-existent).

I definitely want to view these as related transactions, not one single overarching transaction (in the system's sense). In reality what I have is a series of largely synchronous activities, buffered by asynchronous behavior between them.

Designing complete systems with a mixture of synchronous and asynchronous activities is a very tricky business indeed. It isn't the "happy path" that is hard; it is the effect of failure at various stages in an asynchronous world that makes it so tough.

Wednesday, November 21, 2012

A data hoarder's delight

With big data, I sometimes feel like the character Davies in the masterful spy novel, "The Riddle of the Sands" by Erskine Childers: "We laughed uncomfortably, and Davies compassed a wonderful German phrase to the effect that 'it might come in useful'"

Some of the "big data" approaches that I have seen are a bit like that. We keep stuff because we can, and because it "might come in useful." For sure there are some very potent use cases. Forbes, in this piece describes how knowledge about the customercan drive predictive analytics. And valuable it is.

However the other compelling use-cases are a bit harder to find. We can certainly do useful analysis of log files, click through rates, etc., depending on what has been shown to a customer or "tire kicker." But beyond that the cases are harder to come by. That is to some extent why much of the focus has been on the technology and technology vendor side. 

There is a pretty significant dilemma here, though. If we wait to capture the data until we know what we need, then we will have to wait until we have sufficient data. If we capture all the data we can as it is flying through our systems, and we don't yet know how we might use it, we need to make sure it is kept in some original form. Apply schema at the time of use. That makes us quake in our collective boots if we are database designers, DBAs, etc. We can't ensure the integrity. Things change. Where are the constraints?... In my house that is akin to me throwing things I don't know what to do with onto a pile (pieces of mail, camping gear, old garden tools, empty paint cans, bicycle pumps,... you get the idea). So when Madame asks for something - say a suitable weight for holding the grill cover down - I can say, "Aha, I have just the right thing. Here, use this old, half full can of paint." A properly ordered household might have had the paint arranged in a tidy grouping. But actually that primary classification inhibits out of the box use.

Similarly with data. I wonder what the telephone country code for the UK is... Oh yes, my sister lives there; I can look up her number and find it out. Not exactly why I threw her number onto the data pile, but handy nonetheless.

So with the drivers of cheap storage and cheap processing, we can suddenly start to manage big piles of data instead of managing things all neatly.

This thinking model started a while back with the advent of tagging models for email - Outlook vs Gmail as thinking models. If I have organized my emails in the Outlook folders way, then I have to know roughly where to look for the mail - all very well if I am accessing by some primary classification path, but not so handy when asked to provide all the documents containing a word for a legal case... It turns out - at least for me - that I prefer a tag based model, a flat space like Gmail where I use search as my primary approach, as opposed to a categorization model where I go to an organized set of things.

There isn't much "big" in these data examples. It is really about the new ways we have to organize and manage the data we collect - and accepting that we can collect more of it. Possibly even collecting every state of every data object, every transaction that changed that state, etc. Oh and perhaps the update dominant model of data management that we see today will be replaced with something less destructive.

Thursday, October 25, 2012

Data ambiguity and tolerance of errors

In the current election "season" in the USA there has been much ado about ensuring that only registered voters are allowed to vote. The Republican Party describes this as ensuring that any attempts at fraud are squelched. The Democratic Party describes this as being an attempt to reduce the likelihood that certain groups (largely Democrat voting) will vote. I certainly don't know which view is correct, and that is not the purpose of this post, but it does inform the thinking.

The "perfect" electoral system would ensure that everyone who has the right to vote can indeed do so, do so only once, and that no one who is not entitled to vote does not. Simple, eh? Not so much! Let me itemize some of the complexities that lead to data ambiguity.
  • Registration to vote has to be completed ahead of time (in many places).
  • The placement of a candidate on the ballot has to be done ahead of time, but write-in candidates are permissible under some circumstances.
  • Voters may vote ahead of time.
  • Voters vote in the precinct to which they are assigned (at least in some places)
  • Voters may mail in their votes (absentee ballots)
Again, these don't appear insurmountable, except that the time element causes some issues. Here are some to think about:
  • What if a person votes ahead of time, and then becomes "ineligible" prior to voting day. Possible causes include death, conviction of a felony, certifiably insane.
  • What if a person moves after registration, but before they vote?
  • What if a candidate becomes unfit after the ballots are printed and before early voting? (Death, conviction of a felony, determination of status - e.g. not a natural born citizen.)
  • What if a candidate becomes unfit after early votes for that candidate have been cast?
  • .....
These are obviously just a few of the issues that might arise, but enough to give pause in thinking about the process. If we really want 100% accuracy we have a significant problem, because we can't undo the history. Now if a voter has become ineligible after casting the vote (early voting, or absentee ballot, or before the closure of the polls if voting on election day), then how could the system determine that? It would be possible to cross reference people who have voted with the death rolls (except of course if someone voted early so they could take their trip to see the Angel Falls, where they were killed by local tribespeople and no one knew until after the election).

On a more serious note, voting systems deliver inherently ambiguous results. Fortunately that ambiguity is tiny, but in ever closer elections it gives those of us who think about systems some things that are very hard to think about. That is, "How do we ensure the integrity of the total process?" and "How good is good enough?"

Actually that thinking should always apply. While we focus on the happy paths (the majority case), we should always be thinking about what the tolerance for error should be. It is, of course, political suicide to say that there is error in the voting system, but rest assured - even without malice, there is plenty of opportunity for errors to creep in.

Tuesday, September 4, 2012

Intension vs Extension

Sometimes I feel really split brained! On the one hand I am thinking about the importance of controlling data, data quality, data schema, etc. On the other hand, I realize I can't! So the DBA in me would like the data to be all orderly and controlled - an intensional view of the data. What the model looks like as defined by the kinds of things.
But then I look outside the confines of a system and realize that at least this human tends to work extensionally. I look at the pile of data and create some kind of reality around it - probably making many leaps of faith, many erroneous deductions, probably drawing erroneous conclusions, positing theories and adding to my own knowledge base.
So a simple fact (you are unable to meet me for a meeting) + the increase in your LinkedIn activity + a TripIt notification that you have flown to SJC will at least give me pause for thought. Perhaps you are job hunting! I don't know, but I might posit that thought in my head and then look for things to confirm or deny it (including phoning you to ask). How do I put that into a schema? How do I decide that is relevant?

I don't. In fact I may never have had the explicit job-hunt "object" or at least never had explicit properties for it, but somehow this coming together of data has led me to think about it.

The point here is, of course, that if we attempt to model everything about our data intensionally we are doomed. We will be modeling for ever. If we don't model the right things intensionally, we are equally doomed.

This is the fundamental dichotomy pervading the SQL/NoSQL movement today. We want to have the control that intensional approaches give us so that we can be accurate and consistent - especially with our transactional data - but we also want the ability to make new discoveries based on the data that we find.

We can't just have a common set of semantics and have everyone expect to agree. In Women, Fire and Dangerous Things, George Lakoff describes some categories that are universal across the human race. Those are to some extent intensional. Then there are all the others that we make up and define newly, refine membership rules, etc. and those are largely extensional.

Friday, June 8, 2012

In stream and out of band

Big data seems to be popping up everywhere. The focus seems to be on the data and the engines and all the shiny toys for doing the analysis. However the tricky part is often getting hold of the slippery stuff in the first place.
In the cryptography world, one of the most useful clues that something big is about to "go down" is traffic analysis. Spikes in traffic activity provide signals to the monitoring systems that further analysis is required. There is useful information in changes in rate of signals over and above the information that may be contained in the message itself.
Deducing information just from the traffic analysis is an imprecise art, but knowing about changes in volume and frequency can help analysts decide whether they should attempt to decrypt the actual messages.
In our systems, this kind of Signal Intelligence is itself useful too. We see it in A/B testing. We see it in prediction about volume for capacity planning. In other words we are losing a valuable source of data about how the business and the technology environments are working if we ignore the traffic data.
Much of "big data" is predicated on getting hands (well machines) on this rich vein of data and performing some detailed analysis.
However there are some challenges:
  • Getting access to it
  • Analyzing it quickly enough, but without impacting its primary purpose.
  • Making sense of it - often looking for quite weak signals
That's where the notion of in-stream and out of band comes from. You want to grab the information as it is flying by (on what? you may ask), and yet not disturb its throughput rate or at least not much. The analysis might be quite detailed and time consuming. But the transaction must be allowed to continue normally.
In SOA environments (especially those where web services are used), all of the necessary information is in the message body so intercepts are straightforward. 
Where there is file transfer (eg using S/FTP) the situation is trickier because there are often no good intercept points.
Continuing the cryptography example, traffic intercepts allow for the capturing of the messages. These messages flow through apparently undisturbed. But having been captured, the frequency/volume is immediately apparent. However the analysis of content may take some while. The frequency/volume data are "in stream" the actual analysis is "out of band".
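One way to picture that split (the plumbing here is entirely invented): the tap records the cheap traffic facts in stream - count, size, timestamp - and hands the full payload to a background worker for the slow content analysis, so the message itself carries on essentially undisturbed.

import queue
import threading
import time

analysis_queue = queue.Queue()                  # out-of-band work waits here
traffic_stats = {"messages": 0, "bytes": 0}     # in-stream facts, available instantly

def tap(message: bytes) -> bytes:
    traffic_stats["messages"] += 1              # in stream: volume and frequency
    traffic_stats["bytes"] += len(message)
    analysis_queue.put(message)                 # out of band: content analysis later
    return message                              # pass the message on undisturbed

def analyst() -> None:
    while True:
        payload = analysis_queue.get()
        time.sleep(0.1)                         # stand-in for slow, detailed analysis
        analysis_queue.task_done()

threading.Thread(target=analyst, daemon=True).start()
tap(b"<soap:Envelope>...</soap:Envelope>")
print(traffic_stats)                            # frequency/volume known immediately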

Thursday, June 7, 2012

CAP Theorem, partitions, ambiguity, data trust


This posting was written in response to Eric Brewer's excellent piece entitled

CAP Twelve Years Later: How the "Rules" Have Changed

I have copied the statement of the theorem here to provide some context:

The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:
  • consistency (C) equivalent to having a single up-to-date copy of the data;
  • high availability (A) of that data (for updates); and
  • tolerance to network partitions (P).
The original article is an excellent read. Eric makes his points with crystal clarity.

Eric,
I have found the CAP theorem and this piece to be very helpful when thinking about tradeoffs in database design - especially of course in distributed systems. It is rather unsettling to trade consistency for anything, but we have of course been doing that for years.

I am interested in your thinking about the topic more broadly - where we don't have partitions that are essentially of the same schema, but cases where we have the "same data" but because of a variety of constraints, we don't necessarily see the same value for it at a moment in time.
An example here. One that we see every day and are quite happy with. That of managing meetings.
Imagine that you and I are trying to meet. We send each other asynchronous messages suggesting times - with neither of us having insight into the other's calendar. Eventually we agree to meet next Wednesday at 11am at a coffee shop. Now there is a shared datum - the meeting. However, there are (at least) two partitions of that datum: mine and yours. I can tell my system to cancel the meeting. So my knowledge of the state is "canceled", but you don't know that yet. So we definitely don't have atomicity in this case. We also don't have consistency at any arbitrary point in time. If I am ill-mannered enough not to tell you that I don't intend to show, the eventually consistent state is that the meeting never took place - even if you went at the appointed hour.

I would argue that almost all the data we deal with is in some sense ambiguous. There is some probability function (usually implicit) that informs one partition about the reliability of the datum. So, if for example I have a reputation for standing you up, you might attach a low likelihood of accuracy to the meeting datum. That low probability would then offer you the opportunity to check the state of the datum more frequently. So perhaps there is a trust continuum in the data, from a high likelihood of it being wrong to a high likelihood of it being right. As we look at shades of probability we can make appropriate risk management decisions.
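A toy sketch of that continuum (the numbers are invented): treat my trust in the datum as a probability, and let a low probability drive a shorter interval between checks.

# Toy risk management: the lower my trust that the shared datum is still
# accurate, the more often I re-check it.
def hours_between_checks(trust: float, max_interval_hours: float = 24.0) -> float:
    """trust is my (subjective) probability that the datum is still accurate."""
    return max(1.0, trust * max_interval_hours)

print(hours_between_checks(trust=0.95))   # reliable colleague: check about daily
print(hours_between_checks(trust=0.30))   # serial no-show: check every few hours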

I realize of course that this is broader than the area that you were exploring initially with CAP, but as we see more on the fly analytics, decision making, etc. we will discover the need for some semantics around data synchronization risk. It's not that these issues are new - they assuredly are not. But we have often treated them implicitly, building rules of thumb into our systems, but that approach doesn't really scale.

I would be interested to hear your thoughts.
PS I have cross posted this against the original article as well.