Friday, November 23, 2012

Cash to Delivery

This post looks at a specific high-level process drawn from the "eating out" industry. Its origins lie in that fine barbecue establishment in Dallas - the original Sonny Bryan's. The time: 1985. The conversation took place in the shack, which was as crowded as ever at lunch time. The method was: place order, pay for order, hang around and wait for order, pick up order, attempt to squeeze oversized bottom into school chairs, devour product. While casually waiting for the order to be prepared, I idly asked my colleague, "How do they match the orders with the people?" It seemed as if the orders always came out in sequence, so, being a systems person, I got to wondering about the nature of the process.

I had clearly paid in a single "transaction". Sonny Bryan's had my money. In return I had a token (numbered receipt) that stated my order content as evidence of what I had paid for. However that transaction was not synchronous with the delivery of the food, nor was the line held up while the food was delivered. Had it been, the place would have emptied rapidly because the allotted time for lunch would have expired.

I, as the customer, think that the transaction is done when I have wiped my mouth for the last time, vacated my seat and thrown away the disposable plates, etc. But the process doesn't work like that.
There are intermediate "transactions": the "I paid and got a receipt" transaction (claim check, perhaps?); the "I claimed my order" transaction; the "I hung around looking for somewhere to sit" transaction; the "I threw away the disposables" transaction.

Each of these transactions can fail, of course. I can place my order and then discover I can't pay for it. No big deal (from a system perspective, but quite embarrassing from a personal perspective). I could be waiting for my order and get called away, so my food is left languishing on the counter. Sonny Bryan's could have made my order up incorrectly. I could pick up the wrong order. I could have picked up the order and discovered no place to sit. Finally, I could look for a trash bin and discover that there isn't one available (full or non-existent).

I definitely want to view these as related transactions, not one single overarching transaction (in the system's sense). In reality what I have is a series of largely synchronous activities, buffered by asynchronous behavior between them.

Designing complete systems with a mixture of synchronous and asynchronous activities is a very tricky business indeed. It isn't the "happy path" that is hard; it is the effect of failure at various stages in an asynchronous world that makes it so tough.
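To make this concrete, here is a minimal sketch (the function names and the queue-based kitchen are my own assumptions, not a description of Sonny Bryan's actual process) of the lunch line as a chain of small synchronous transactions joined by asynchronous hand-offs:

```python
import itertools
import queue
import threading

ticket_numbers = itertools.count(1)
kitchen_queue = queue.Queue()   # asynchronous buffer between "pay" and "prepare"
ready_counter = {}              # ticket -> prepared order, waiting to be claimed

def pay(order):
    """Synchronous transaction: take payment, hand back a numbered receipt."""
    ticket = next(ticket_numbers)
    kitchen_queue.put((ticket, order))   # hand-off is async; the line keeps moving
    return ticket

def kitchen_worker():
    """Runs independently of the queue of customers."""
    while True:
        ticket, order = kitchen_queue.get()
        ready_counter[ticket] = f"plate of {order}"   # may sit here unclaimed
        kitchen_queue.task_done()

def claim(ticket):
    """Synchronous transaction: match receipt to food. Can fail independently of payment."""
    try:
        return ready_counter.pop(ticket)
    except KeyError:
        raise LookupError(f"ticket {ticket} not ready (or already claimed)")

threading.Thread(target=kitchen_worker, daemon=True).start()
t = pay("sliced brisket sandwich")
kitchen_queue.join()      # in real life: hang around until your number is called
print(claim(t))
```

Each step commits or fails on its own; nothing rolls the whole lunch back, which is exactly why the failure cases above need their own handling.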

Wednesday, November 21, 2012

A data hoarder's delight

With big data, I sometimes feel like the character Davies in the masterful spy novel "The Riddle of the Sands" by Erskine Childers: "We laughed uncomfortably, and Davies compassed a wonderful German phrase to the effect that 'it might come in useful'."

Some of the "big data" approaches that I have seen are a bit like that. We keep stuff because we can, and because it "might come in useful." For sure there are some very potent use cases. Forbes, in this piece describes how knowledge about the customercan drive predictive analytics. And valuable it is.

However the other compelling use-cases are a bit harder to find. We can certainly do useful analysis of log files, click through rates, etc., depending on what has been shown to a customer or "tire kicker." But beyond that the cases are harder to come by. That is to some extent why much of the focus has been on the technology and technology vendor side. 

There is a pretty significant dilemma here, though. If we wait to capture the data until we know what we need, then we will have to wait until we have sufficient data. If we capture all the data we can as it is flying through our systems, without yet knowing how we might use it, we need to make sure it is kept in some original form and apply schema at the time of use. That makes us quake in our collective boots if we are database designers, DBAs, etc. We can't ensure the integrity. Things change. Where are the constraints?... In my house that is akin to me throwing things I don't know what to do with onto a pile (pieces of mail, camping gear, old garden tools, empty paint cans, bicycle pumps... you get the idea). So when Madame asks for something - say a suitable weight for holding the grill cover down - I can say, "Aha, I have just the right thing. Here, use this old, half-full can of paint." A properly ordered household might have had the paint arranged in a tidy grouping, but that primary classification inhibits out-of-the-box use.
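In code, "apply schema at the time of use" looks roughly like the sketch below (the event fields are invented for illustration): keep the raw records exactly as they arrived, and impose structure only when a question finally turns up.

```python
import json

# Raw events kept exactly as they arrived - no schema imposed at write time.
raw_events = [
    '{"type": "click", "user": "u1", "page": "/pricing", "ts": "2012-11-20T10:01:00"}',
    '{"type": "call", "user": "u1", "duration_s": 340, "ts": "2012-11-20T10:30:00"}',
    '{"type": "click", "user": "u2", "page": "/home", "ts": "2012-11-20T11:15:00"}',
]

def clicks_per_page(events):
    """Schema applied at read time: only now do we decide that 'page' matters."""
    counts = {}
    for line in events:
        event = json.loads(line)
        if event.get("type") == "click":
            page = event.get("page", "unknown")
            counts[page] = counts.get(page, 0) + 1
    return counts

print(clicks_per_page(raw_events))   # {'/pricing': 1, '/home': 1}
```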

Similarly with data. I wonder what the telephone country code for the UK is... Oh yes, my sister lives there; I can look up her number and find it out. Not exactly why I threw her number onto the data pile, but handy nonetheless.

So with the driver of cheap storage and cheap processing, we can suddenly start to manage big piles of data instead of managing things all neatly.

This thinking model started a while back with the advent of tagging models for email - Outlook vs Gmail as thinking models. If I have organized my emails in the Outlook folders way, then I have to know roughly where to look for the mail - all very well if I am accessing by some primary classification path, but not so handy when asked to provide all the documents containing a word for a legal case... It turns out - at least for me - that I prefer a tag-based model: a flat space like Gmail where I use search as my primary approach, as opposed to a categorization model where I go to an organized set of things.
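A toy sketch of the two models (the folder names and messages are made up): in the folder model I must remember the primary classification path; in the flat, search-first model any word I happen to remember becomes a way in.

```python
# Folder (categorization) model: one primary path per message.
folders = {
    "Projects/Acme": ["Contract draft v2", "Meeting notes 3 May"],
    "Personal/Travel": ["Flight confirmation DFW-LHR"],
}

# Flat (search-first) model: a pile of messages, any attribute is an entry point.
messages = [
    "Contract draft v2",
    "Meeting notes 3 May",
    "Flight confirmation DFW-LHR",
]

def search(term):
    return [m for m in messages if term.lower() in m.lower()]

# Folder model: I must remember the message lives under Projects/Acme.
print(folders["Projects/Acme"])
# Flat model: any word I remember will do.
print(search("contract"))
```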

There isn't much "big" in these data examples. It is really about the new ways we have to organize and manage the data we collect - and accepting that we can collect more of it. Possibly even collecting every state of every data object, every transaction that changed that state, etc. Oh, and perhaps the update-dominant model of data management that we see today will be replaced with something less destructive.

Thursday, October 25, 2012

Data ambiguity and tolerance of errors

In the current election "season" in the USA there has been much ado about ensuring that only registered voters are allowed to vote. The Republican Party describes this as ensuring that any attempts at fraud are squelched. The Democratic Party describes this as being an attempt to reduce the likelihood that certain groups (largely Democrat voting) will vote. I certainly don't know which view is correct, and that is not the purpose of this post, but it does inform the thinking.

The "perfect" electoral system would ensure that everyone who has the right to vote can indeed do so, do so only once, and that no one who is not entitled to vote does not. Simple, eh? Not so much! Let me itemize some of the complexities that lead to data ambiguity.
  • Registration to vote has to be completed ahead of time (in many places).
  • The placement of a candidate on the ballot has to be done ahead of time, but write-in candidates are permissible under some circumstances.
  • Voters may vote ahead of time.
  • Voters vote in the precinct to which they are assigned (at least in some places).
  • Voters may mail in their votes (absentee ballots).
Again, these don't appear insurmountable, except that the time element causes some issues. Here are some to think about:
  • What if a person votes ahead of time, and then becomes "ineligible" prior to voting day? Possible causes include death, conviction of a felony, or being certified insane.
  • What if a person moves after registration, but before they vote?
  • What if a candidate becomes unfit after the ballots are printed and before early voting (death, conviction of a felony, a determination of status - eg not a natural-born citizen)?
  • What if a candidate becomes unfit after early votes for that candidate have been cast?
  • .....
These are obviously just a few of the issues that might arise, but enough to give pause in thinking about the process. If we really want 100% accuracy we have a significant problem, because we can't undo the history. Now if a voter has become ineligible after casting the vote (early voting or absentee ballot, or before the closure of the polls if voting on election day), then how could the system determine that? It would be possible to cross-reference people who have voted with the death rolls (except of course if someone voted early so they could take their trip to see Angel Falls, where they were killed by local tribespeople and no one knew until after the election).
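A small sketch of why this is awkward (the record layout and dates are invented): eligibility is a function of time, and the vote was recorded against eligibility as it was known then, not as it stands when the count happens.

```python
from datetime import date

# Hypothetical records: when each vote was cast, and when (if ever) the
# voter became ineligible. Neither fact references the other directly.
votes = {"alice": date(2012, 10, 20), "bob": date(2012, 11, 6)}
ineligible_from = {"alice": date(2012, 11, 1)}   # e.g. died after voting early

def questionable_votes(votes, ineligible_from, count_date):
    """Votes that were valid when cast but whose voter was ineligible
    by the time the count happened - history we cannot undo."""
    flagged = []
    for voter, cast_on in votes.items():
        cutoff = ineligible_from.get(voter)
        if cutoff is not None and cast_on < cutoff <= count_date:
            flagged.append(voter)
    return flagged

print(questionable_votes(votes, ineligible_from, date(2012, 11, 7)))  # ['alice']
```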

On a more serious note, voting systems deliver inherently ambiguous results. Fortunately that ambiguity is tiny, but in ever closer elections it gives those of us who think about systems some things that are very hard to think about. That is, "How do we ensure the integrity of the total process?" and "How good is good enough?"

Actually that thinking should always apply. While we focus on the happy paths (the majority case), we should always be thinking about what the tolerance for error should be. It is, of course, political suicide to say that there is error in the voting system, but rest assured - even without malice, there is plenty of opportunity for errors to creep in.

Tuesday, September 4, 2012

Intension vs Extension

Sometimes I feel really split brained! On the one hand I am thinking about the importance of controlling data, data quality, data schema, etc. On the other hand, I realize I can't! So the DBA in me would like the data to be all orderly and controlled - an intensional view of the data: what the model looks like, as defined by the kinds of things it contains.
But then I look outside the confines of a system and realize that this human, at least, tends to work extensionally. I look at the pile of data and create some kind of reality around it. Probably making many leaps of faith, many erroneous deductions, drawing erroneous conclusions, positing theories and adding to my own knowledge base.
So a simple fact (you are unable to meet me for a meeting) + the increase in your LinkedIn activity + a TripIt notification that you have flown to SJC will at least give me pause for thought. Perhaps you are job hunting! I don't know, but I might posit that thought in my head and then look for things to confirm or deny it (including phoning you to ask). How do I put that into a schema? How do I decide that it is relevant?

I don't. In fact I may never have had the explicit job-hunt "object" or at least never had explicit properties for it, but somehow this coming together of data has led me to think about it.

The point here is, of course, that if we attempt to model everything about our data intensionally we are doomed. We will be modeling for ever. If we don't model the right things intensionally, we are equally doomed.

This is the fundamental dichotomy pervading the SQL/NoSQL movement today. We want to have the control that intensional approaches give us so that we can be accurate and consistent - especially with our transactional data, but we also want the ability to make new discoveries based on the data that we find.
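A rough sketch of the two postures side by side (the classes, fields and signals are purely illustrative): a fixed, intensional schema for the transactional core, and an extensional pile of observations from which a category I never modeled - "maybe job hunting" - gets posited after the fact.

```python
from dataclasses import dataclass

# Intensional: the kinds of things are declared up front and enforced.
@dataclass
class Payment:
    payer: str
    payee: str
    amount_cents: int   # the DBA in me insists on knowing this is an integer

# Extensional: just a pile of observed facts; categories emerge from looking at it.
observations = [
    {"who": "chris", "what": "declined meeting"},
    {"who": "chris", "what": "linkedin activity up"},
    {"who": "chris", "what": "flew to SJC"},
]

def maybe_job_hunting(facts, who):
    """A category I never modeled: posited from whatever happens to be in the pile."""
    signals = {"declined meeting", "linkedin activity up", "flew to SJC"}
    hits = sum(1 for f in facts if f["who"] == who and f["what"] in signals)
    return hits >= 2

print(Payment("me", "coffee shop", 450))
print(maybe_job_hunting(observations, "chris"))   # True - worth a phone call, nothing more
```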

We can't just have a common set of semantics and have everyone expect to agree. In Women, Fire and Dangerous Things, George Lakoff describes some categories that are universal across the human race. Those are to some extent intensional. Then there are all the others that we make up and define newly, refine membership rules, etc. and those are largely extensional.

Friday, June 8, 2012

In stream and out of band

Big data seems to be popping up everywhere. The focus seems to be on the data and the engines and all the shiny toys for doing the analysis. However the tricky part is often getting hold of the slippery stuff in the first place.
In the cryptography world, one of the most useful clues that something big is about to "go down" is traffic analysis. Spikes in traffic activity provide signals to the monitoring systems that further analysis is required. There is useful information in changes in rate of signals over and above the information that may be contained in the message itself.
Deducing information just from the traffic analysis is an imprecise art, but knowing about changes in volume and frequency can help analysts decide whether they should attempt to decrypt the actual messages.
In our systems, this kind of Signal Intelligence is itself useful too. We see it in A/B testing. We see it in prediction about volume for capacity planning. In other words we are losing a valuable source of data about how the business and the technology environments are working if we ignore the traffic data.
Much of "big data" is predicated on getting hands (well machines) on this rich vein of data and performing some detailed analysis.
However there are some challenges:
  • Getting access to it
  • Analyzing it quickly enough, but without impacting its primary purpose.
  • Making sense of it - often looking for quite weak signals
That's where the notion of in-stream and out of band comes from. You want to grab the information as it is flying by (on what? you may ask), and yet not disturb its throughput rate or at least not much. The analysis might be quite detailed and time consuming. But the transaction must be allowed to continue normally.
In SOA environments (especially those where web services are used), all of the necessary information is in the message body so intercepts are straightforward. 
Where there is file transfer (eg using S/FTP) the situation is trickier because there are often no good intercept points.
Continuing the cryptography example, traffic intercepts allow for the capturing of the messages. These messages flow through apparently undisturbed. But having been captured, the frequency/volume is immediately apparent. However, the analysis of content may take a while. The frequency/volume data are "in stream"; the actual analysis is "out of band".
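A minimal sketch of the pattern (the message shapes and handler names are assumptions): record the cheap traffic facts in-stream as each message passes, and hand the expensive content analysis to a worker running out of band, so the message itself flows on undisturbed.

```python
import queue
import threading
import time

analysis_queue = queue.Queue()
traffic_stats = {"count": 0, "bytes": 0}

def handle_in_stream(message: bytes) -> bytes:
    """Runs in the message path: cheap bookkeeping only, then let it fly on."""
    traffic_stats["count"] += 1
    traffic_stats["bytes"] += len(message)
    analysis_queue.put(message)      # copy handed off; the transaction continues normally
    return message                   # forwarded undisturbed

def analyze_out_of_band():
    """Runs elsewhere: slow, detailed inspection that must not delay the stream."""
    while True:
        message = analysis_queue.get()
        time.sleep(0.1)              # stand-in for expensive content analysis
        analysis_queue.task_done()

threading.Thread(target=analyze_out_of_band, daemon=True).start()

for payload in [b"order:42", b"payment:42", b"ship:42"]:
    handle_in_stream(payload)

print(traffic_stats)                 # volume/frequency known immediately
analysis_queue.join()                # content analysis catches up later
```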

Thursday, June 7, 2012

CAP Theorem, partitions, ambiguity, data trust


This posting was written in response to Eric Brewer's excellent piece entitled

CAP Twelve Years Later: How the "Rules" Have Changed

I have copied the statement of the theorem here to provide some context:

The CAP theorem states that any networked shared-data system can have at most two of three desirable properties:
  • consistency (C) equivalent to having a single up-to-date copy of the data;
  • high availability (A) of that data (for updates); and
  • tolerance to network partitions (P).
The original article is an excellent read. Eric makes his points with crystal clarity.

Eric,
I have found the CAP theorem and this piece to be very helpful when thinking about tradeoffs in database design - especially of course in distributed systems. It is rather unsettling to trade consistency for anything, but we have of course been doing that for years.

I am interested in your thinking about the topic more broadly - not where we have partitions that are essentially of the same schema, but cases where we have the "same data" and yet, because of a variety of constraints, don't necessarily see the same value for it at a moment in time.
An example here. One that we see every day and are quite happy with. That of managing meetings.
Imagine that you and I are trying to meet. We send each other asynchronous messages suggesting times - with neither of us having insight into the other's calendar. Eventually we agree to meet next Wednesday at 11am at a coffee shop. Now there is a shared datum - the meeting. However there are (at least) two partitions of that datum: mine and yours. I can tell my system to cancel the meeting, so my view of the state is "canceled", but you don't know that yet. So we definitely don't have atomicity in this case. We also don't have consistency at any arbitrary point in time. If I am ill-mannered enough not to tell you that I don't intend to show, the eventually consistent state is that the meeting never took place - even if you went at the appointed hour.

I would argue that almost all the data we deal with is in some sense ambiguous. There is some probability function (usually implicit) that informs one partition about the reliability of the datum. So if, for example, I have a reputation for standing you up, you might attach a low likelihood of accuracy to the meeting datum. That low probability would then offer you the opportunity to check the state of the datum more frequently. So perhaps there is a trust continuum in the data, from a high likelihood of it being wrong to a high likelihood of it being right. As we look at shades of probability we can make appropriate risk management decisions.
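As a sketch of that trust continuum (the reliability number and the threshold are invented), each partition could carry an explicit probability alongside its copy of the datum and use it to decide when a re-check is worth the cost:

```python
from dataclasses import dataclass

@dataclass
class SharedDatum:
    description: str
    my_state: str                     # my partition's view
    counterparty_reliability: float   # 0.0 = always stands me up, 1.0 = always shows

def should_reconfirm(datum: SharedDatum, threshold: float = 0.8) -> bool:
    """Risk management, not consistency: re-check when trust in the other
    partition's copy drops below what the decision can tolerate."""
    return datum.counterparty_reliability < threshold

meeting = SharedDatum(
    description="Coffee, Wednesday 11am",
    my_state="confirmed",
    counterparty_reliability=0.6,     # a reputation for no-shows
)

if should_reconfirm(meeting):
    print("Phone ahead before walking to the coffee shop.")
```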

I realize of course that this is broader than the area that you were exploring initially with CAP, but as we see more on the fly analytics, decision making, etc. we will discover the need for some semantics around data synchronization risk. It's not that these issues are new - they assuredly are not. But we have often treated them implicitly, building rules of thumb into our systems, but that approach doesn't really scale.

I would be interested to hear your thoughts.
PS I have cross posted this against the original article as well.

Tuesday, May 15, 2012

The importance of context

I am about to display my programming roots. History alert.
In a far off kingdom computers were made by an all powerful company - called IBM. IBM had the most magnificent Operating System, inventively called "OS". This operating system came in a number of dialects (OS/MFT, OS/MVT - eventually morphing to SVS and MVS before becoming Z/OS). The people marveled. What wondrous naming! But I digress.
To get work done on these behemoths - especially batch work, a special dialect, conjured from an unfettered imagination, was created. This dialect - whose name is uttered in hushed tones was "JCL" or "Job Control Language".
JCL provided the context under which jobs were scheduled, programs executed, and files created or disposed of (disposition processing). The JCL sorcerers were much in demand in the early devops days.
IBM provided a series of utilities for doing useful tasks to files, jobs, etc. But the most cunning, the most fiendish of all was the well-named IEFBR14. Before describing its inner workings in gory detail, we need to step back and look at the JCL some more.
When a program executes in an "OS" environment, it can indicate to the environment that it has been successful or has failed. This is done using a "Return Code". Nothing strange there - at least not on the surface. However, the return code value can be used to control what happens next. For example, if a program is supposed to create a file but somehow aborts, one can, through the magic of JCL, say that the system is to delete the file. If the program is successful, one could tell the system that the file is to be kept, etc.
Genius.
So where was this return code kept? In a general purpose "register" called register 15 (R15 for short). Why there? Because R15 had a use at the beginning of the program and not much thereafter. When a program executes, R15 contains the memory address of the entry point of the program (well, almost, but that's close enough for government work). So the one value one would not expect in R15 was 0. It was thus important to explicitly set R15 to the proper value before the program terminated. Otherwise the return code would be the starting address of the program. Awkward.
Now let's look at the program IEFBR14. Its genius was that it did absolutely nothing. It started and immediately exited. It used to consist of a single machine instruction. That instruction (BR 14) causes the program to terminate (actually, branch to the address held in register 14, which by convention at the end of a program is back to the OS). When the program terminates, disposition - as controlled by JCL - takes over. Since the return code value was random and arbitrary (except that its value was always nonzero and evenly divisible by 4), no execution of IEFBR14 ever executed cleanly. Thus messing up disposition processing.
To end a long story, the size of the program IEFBR14 was doubled - from one instruction to two. First, R15 was cleared to zero, so at least its value was predictable. Then the BR 14 instruction executed. Victory!
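The same lesson shows up wherever a caller acts on an exit code. Here is a loose modern analogy, sketched in Python rather than 360 assembler (the file name and the step are invented): the "disposition" of the file is driven entirely by the return code, so a program that never sets its return code deliberately - the original IEFBR14 - leaves that decision to whatever value happens to be lying around.

```python
import os
import subprocess
import sys

# Disposition processing in miniature: keep the output file if the step
# ended with a zero return code, delete it if it did not.
step = subprocess.run([sys.executable, "-c", "open('report.txt', 'w').write('ok')"])

if step.returncode == 0:
    print("step ended with RC=0: keeping report.txt")
else:
    print(f"step ended with RC={step.returncode}: deleting report.txt")
    if os.path.exists("report.txt"):
        os.remove("report.txt")
```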
The important lesson, however, is that you cannot ignore the context in which your systems (in this case a simple program) execute. The environmentals are key.

Law is the requirements specification for the system of being a resident

Our (US) government uses taxation and tax relief as a way of enacting all sorts of public policy legislation. For example, we have a way of paying medical bills through the use of a tax-relieved account. Sounds great, doesn't it? I put so much aside every month into an account (a Flexible Spending Account or some such TLA). Then when I have to pay for some procedure (like new glasses, eye exams, dental stuff, etc.) that is somehow not covered by my healthcare payment system (aka "Insurance", which it patently isn't), I use money from this tax-relieved account. "Sounds great - I'll have some of that," think the politicians who prepared the bill and sold it to banks, lobbyists, lawyers and eventually the people.

However it doesn't quite work! On every occasion that I have used the account this year I get a letter from the account administrators that essentially says, "I don't believe that you have used this for a legitimate purpose, so please provide suitable documentation".
I bought eyeglasses and lenses; I had my teeth cleaned. The credit card receipts showed where I spent the money, but not on what. So I now have to go through and find the receipts with the actual things I paid for on them, and submit them to the processing company - who presumably have a bunch of employees doing low-value work verifying that I haven't somehow spent the money on something not covered (toothpaste for $185 at the dentist? Glasses cleaner for $350 at the optometrist?).

When our lawmakers specify the "system", they don't seem to take the possibility of fraud into consideration. The initial assumption is all unicorns and rainbows. It is assumed that people won't cheat the system, that the happy path is the only path....

That thinking, however, fails to take into account the inventiveness of part of the population - that part of the population that will attempt to use the system in a way it was not designed to be used, for personal gain. So the cycle seems to be:

  1. Create legislation that makes things look really rosy for the populace, vested interests, lawyers, etc.
  2. Roll the "system" that embodies that legislation out
  3. Be shocked that there is abuse
  4. Place layer upon layer of administrative/bureaucratic overhead to prevent the potential abuse
  5. Ignore fraudsters
  6. Proclaim that jobs have been created
  7. Rinse and repeat.
If we can't be sure that a relatively small system will behave properly - even with iterative development methods, what hope is there for the waterfall approach in the legislative process?

Thursday, May 10, 2012

A rant against 1:1

Every now and again, I get really annoyed with sites that assume you have only one of something. "Please enter your email address" is a common request - except that I have several, and would like to have the opportunity to use any of them as my login id. After all, they are each unique. Tripit.com does it right. Many other sites do it wrong. This posting from Robert Scoble illustrates the kind of muddy thinking: Apple making the assumption that there is one credit card.
Years ago, I used Plaxo. However the geniuses behind that didn't think I might have more than 1 email address, more than 1 set of followers, so I would get suggestions from them to follow people I was already following.
In the world there are very few 1:1 correspondences that are timeless. So any time a system assumes that there is a pair of things that are in absolute 1:1 correspondence, I am mightily suspicious.
There are 2 interesting cases to ponder:
1:1 at a time and 1:1 over time.
1:1 at a time, I get. However there have to be rules/policies/processes or whatever to change to a new one. But even those are suspicious because we may have to account for the zero case. And it usually isn't bilaterally 1:1.
1:1 over time is much harder. If it isn't possible for two things to exist independently of each other (for each one there is always exactly one of the other), then we have to question why they are not combined. By the way there are often good technical reasons, but maybe not so many good business or policy reasons.
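A quick sketch of the difference (field names are my own): the "1:1 at a time" version still has to make room for the others, carry the zero case, and spell out the rule for changing which one is current.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NaiveUser:
    email: str                      # the suspicious 1:1-forever assumption

@dataclass
class User:
    emails: List[str] = field(default_factory=list)
    primary: Optional[str] = None   # 1:1 "at a time" - and the zero case is real

    def add_email(self, address: str) -> None:
        if address not in self.emails:
            self.emails.append(address)
        if self.primary is None:
            self.primary = address

    def make_primary(self, address: str) -> None:
        # the rule/policy/process for changing which one is current
        if address not in self.emails:
            raise ValueError("can only promote an address the user actually has")
        self.primary = address

me = User()
me.add_email("work@example.com")
me.add_email("home@example.com")
me.make_primary("home@example.com")
print(me.primary, me.emails)
```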
So a word to the wise: when someone tells you there is a 1:1 correspondence, they may be talking about a single world view, and you should at least explore the alternatives lest you be trapped in an expensive rethinking process.

Saturday, April 28, 2012

Event Distribution and Event Processing

I have recently been involved in several discussions (sales opportunities perhaps), where the answer seems to be, "We need a CEP engine". Of course if one chooses solutions based on products there's something wrong. And then working with the sales force, I hear, "Customer X wants to buy our CEP engine, you know something about the industry, what use cases should we propose?" When I delicately suggest that nothing they have said so far establishes a CEP need, that the problem is bigger (based on industry knowledge), and that it will require more than the CEP engine, I get the message, "That will drive the price up too much, and anyway we have told them that the CEP engine is the way to go, so we can't change..." So why ask me?

But that isn't the whole point of this post.

There are things that CEP engines are *really* good at. However, distributing events isn't necessarily one of them. Now when it comes to interpreting the events in relationship to each other in a tight time window - now we are talking. When it comes to creating events out of that interpretation, we again have good cases. But that isn't distribution either - that's just notification.

But the nagging question is there. "How does the CEP engine (or indeed any other kind of event processor) get to hear about the events it is monitoring?"

A way of looking at that is in terms of the Event Distribution Network. Now that is serious architecture and infrastructure. Not to speak of some mental gymnastics on behalf of both the business and technology communities.

Conceptually, events are easy things. "Something happened". Of greater trouble is making sure that the knowledge that "something happened" gets to the right place.

The right place might be a CEP engine - we want to see the implications of what happened with a whole bunch of other things that happened. Oh, and do it in Near Real Time (NRT) (Whatever that means!).

But another place might be at the next stage of a business process. "The customer paid their bill, let's ship the goods." In other words, the event is the trigger, with a process call-to-action. These aren't exclusive alternatives.

Of course there can be many things that need to know about the same event.
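A tiny sketch of the distinction (topic and handler names are invented): the event distribution network's only job is to get "something happened" to every party that cares, and a CEP-style correlator is just one subscriber among several.

```python
from collections import defaultdict

subscribers = defaultdict(list)   # topic -> list of handlers

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, event):
    """The event distribution network: fan the fact out to everyone who cares."""
    for handler in subscribers[topic]:
        handler(event)

# One consumer: the next step of a business process.
subscribe("bill.paid", lambda e: print(f"shipping goods for order {e['order']}"))

# Another consumer: a stand-in for a CEP engine correlating events in a time window.
recent = []
def cep_stub(event):
    recent.append(event)
    if len(recent) >= 3:
        print("CEP: three payments in the window - look closer")

subscribe("bill.paid", cep_stub)

for n in (101, 102, 103):
    publish("bill.paid", {"order": n})
```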
So just because you see a customer need that says "event" and you have a product that has the word "event" in its description, don't make the mistake of assuming that one matches the other.

It's as absurd as the trouble compilers have with English in examples like this: "Fruit flies like a banana." "Time flies like an arrow."