Tuesday, July 6, 2010

Controls and Trust

I am appalled, as I look at systems in the various companies that have consulted or employed me, at the lack of system controls in key places. If you are in the data delivery business and you have agreements with your customers, wouldn't you want to know that you are meeting your service level agreements? Or better still, to know when you are not going to (for whatever reason), so you can issue warnings, do something about it, or whatever?

Similarly, when looking at flow-through from one system to another, can you reasonably be assured that everything that was supposed to be processed actually was?

Do you count your cash after going to the ATM? Maybe the machine didn't deliver correctly because a couple of notes were stuck together. Maybe a new software version caused a miscount under some weird circumstances. The ATM is a "black box" to me. That means that at its boundaries I have to decide what my trust relationship with it will be.

So when I have systems which are supposed to communicate in some way (e.g. by passing data) what controls should be in place to make sure everything is properly accounted for? Should a sending system keep a count of what it has sent? Should receiving systems similarly keep track? How do we reconcile? Should the reconciliation be in-band? Should it be out-of-band? Is logging adequate? Do we have to account for the "value" of the transmission as well as just counts? What tolerances matter if we are concerned with value (perhaps one system rounds off the value differently from another so at the end of the day the total value has a discrepancy)?
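To make those questions concrete, here is a minimal sketch in Python of one way an out-of-band reconciliation could work: the sending system publishes a control record with its count and summed value, and the receiving system checks its own tallies against it within a rounding tolerance. The record shapes, field names, and figures are assumptions for illustration, not a prescription.

```python
from decimal import Decimal

# Hypothetical control record published by the sending system out-of-band,
# alongside (but separate from) the data stream it describes.
sender_control = {"record_count": 3, "total_value": Decimal("300.10")}

def reconcile(received_records, control, value_tolerance=Decimal("0.05")):
    """Compare the receiver's own tallies against the sender's control record.

    Counts must match exactly; totals may differ by a small tolerance
    because the two systems may round values differently.
    """
    count = len(received_records)
    total = sum((r["value"] for r in received_records), Decimal("0"))

    issues = []
    if count != control["record_count"]:
        issues.append(f"count mismatch: received {count}, expected {control['record_count']}")
    if abs(total - control["total_value"]) > value_tolerance:
        issues.append(f"value mismatch: received {total}, expected {control['total_value']}")
    return issues  # an empty list means the transmission reconciles

# Made-up received records: one record is missing, so both checks fire.
received = [{"value": Decimal("100.03")}, {"value": Decimal("100.03")}]
for problem in reconcile(received, sender_control):
    print("WARNING:", problem)
```

Whether those warnings go to a log, a dashboard, or an operator paging system is a separate decision; the point is that the check exists at the boundary rather than being assumed away.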

This need for controls is exacerbated by systems that use Events as the primary means of notification. At the individual event level we can indeed count, maintain value, and so on, but often the controls need to be at an aggregate level. One would think that in, for example, an airline boarding system, as long as every boarding event is properly received by the "flight", the system should be in balance. Try telling that to Easyjet. There is a manual control system whereby the Flight Attendants actually count the number of passengers on the plane and attempt to reconcile that with the "expected" number. How the expected number is derived, I have no idea. It could be simply the number of boarding cards collected - but what about electronic boarding? It could be the "system's" view of how many bums there should be on seats. Whatever it is, it doesn't appear to be reliable. Chris Potts (Twitter @chrisdpotts) told me the story of what happens when the count is wrong: they recount, they look for people in the bathrooms, they delay the flight. It's all a mess.
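As an illustration, here is a small Python sketch of an aggregate control of that kind: count the distinct passengers whose boarding events arrived for a flight and compare against an expected number. The event shapes, flight number, and the source of the expected count are all assumptions, and as the story above shows, deciding where that expected number comes from is itself a trust decision.

```python
# Hypothetical boarding events; in a real system these would arrive on an
# event stream, one per passenger scanned at the gate.
boarding_events = [
    {"flight": "EZY123", "passenger_id": "P001"},
    {"flight": "EZY123", "passenger_id": "P002"},
    {"flight": "EZY123", "passenger_id": "P002"},  # duplicate scan, counted once
]

# Where the expected number comes from - boarding cards collected, the
# reservation system's view, the manifest - is itself a control decision.
expected_on_board = {"EZY123": 3}

def check_flight_balance(events, expected):
    """Aggregate-level control: distinct passengers boarded vs. expected."""
    boarded = {}
    for event in events:
        boarded.setdefault(event["flight"], set()).add(event["passenger_id"])
    for flight, expected_count in expected.items():
        actual = len(boarded.get(flight, set()))
        if actual != expected_count:
            print(f"{flight}: expected {expected_count} on board, counted {actual}"
                  " - reconcile before departure")

check_flight_balance(boarding_events, expected_on_board)
# Prints: EZY123: expected 3 on board, counted 2 - reconcile before departure
```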

In the 1960s, when phone phreaking was at its peak, people could make free calls because the control signals (tones) for managing the connection system were carried in the same band of the infrastructure as the call itself. So when a signal tone was detected (and you could get whistles that generated these tones), the system went into a signalling state. By generating the correct sequence of tones you could make free calls. Simple fix - put the controls out of band from what you want to transmit.

In a properly reliable infrastructure, the appropriate controls should be built in from the beginning. Again, you may ask, "What's this got to do with Enterprise Architecture?" I argue that it has a great deal to do with the architecture of the enterprise. Good controls make for good compliance and a high level of confidence in our business practices. Bad controls can make your corporation a star in places you don't want it to be - the front page of the WSJ, anecdotes on the social networks - with a resulting loss of confidence in your organization.

4 comments:

Samples said...

Recently I underpaid my mortgage by $1. Recognizing the error while balancing my checkbook, and before the actual mortgage payment was due, I sent another $1 off to the mortgage company. Unfortunately their systems had already gone to "work", applying the underpayment as extra principal against the previous month's payment and applying the $1 payment as a payment on the current month's payment, eventually making the current month's payment overdue. It took many weeks and phone calls not only to identify the problem but to get it corrected. Had I overpaid my mortgage by $1, my account would have been current and an extra $1 applied to principal, with none of the ensuing people costs. Clearly the controls and trust here were only built in favor of the bank.

Chris Bird said...

To me the story told by "Samples" shows an interesting balance between the Policy/Value/Trust axes. It shows the need, even when you think you have a black box, to "trust but verify". What seems "obvious" to Samples (and to most right-thinking people, I expect) is clearly twisted in his/her example.
As architects/thinkers we are responsible for uncovering the trust boundaries, making them explicit, and making decisions about how "optimistic" to be.
In this venal world, optimism is misplaced a lot of the time.
Just remember we have the best legal/government/political and other systems that money can buy!

Aidan said...

It seems that in the world of technical systems, trust means mistrust. Don't believe what you are told about black box behaviour, because it may well be wrong. On the news this morning it was reported that there has been a vast increase in the level of fraud, with the major frauds being committed by people who join their companies for that very purpose. The level of due diligence for senior staff is lower than for mid-level staff.
However, mistrust has a high cost too. The previous comment could be used as an example: of course banks never trust their customers, and the denial of their own venality is at the root of the credit crunch and its unfolding sequelae. I think what you are saying, Chris, is that certain sorts of checking and reconciliation ought to be routinely implemented in systems, so that systems don't cause cost by undermining human trust.

Chris Bird said...

Aidan,

You said it well. There are definitely many kinds of checking and reconciliation that should be routinely implemented in systems so that human trust is not undermined. I feel another posting coming along however - this time to do with systems of record and systems of reference, the CAP theorem, etc.