Monday, November 30, 2009

Updates Harmful

I have written about this before, I suspect. So forgive me if this is another representation of that resource.

Hanging out with Nigel Green and John Schlesinger is dangerous, so be warned. There be sacred cows being slaughtered in this post.

It all started innocently enough a year or so ago when Nigel and I were discussing the role of databases (record keeping) vs the role of databases (transaction scheduling, message management, etc.). Faulty normalization came into the picture too. Then the harm done by data modeling (where we thought we could model the rules in data). Large scale data modeling efforts require significant investment, and it becomes very hard to see where the return comes from. Then an aha. If we go back a bit in time, the worst office job used to be "filing clerk." An imaginary discussion when someone comes home from work: "Well dear, what did you do at work today?" "We opened a new file for Acme Enterprises. That meant that our index cards were all messed up because we had run out of space for companies starting with A, so we had to rewrite some of those cards, but while we were there we saw that some cards pointed to files that no longer existed, so we removed those - but only after we had added more space for the letter A companies (which we didn't need right now anyway)." "That's nice dear, how about a nice cold beer?"

The point is that filing clerks used to be the lowliest members of the office - and yet in their electronic reincarnation they have acquired really expensive care and feeding. Of course the new clerks are the databases, and the expensive care and feeding is manifested by having a group of thugs (DBAs) who hold everyone to ransom with their talk of normalization, SGA, proper keys, and so on. All things which we did pretty easily with clerks. So what's going wrong? What is normalization for?

Taking normalization first - it is simply for ensuring that we don't get update anomalies: that something which should have the same value regardless of usage actually does have that value. You don't have to have a normalized database to ensure that update anomalies aren't present, although it is a bit easier if you do.

What is going wrong is, in many ways, a harder question. One fundamental thing going wrong is that we use the "filing cabinet" as a scratch pad. So let's return to the physical world for a bit. Imagine a filing cabinet in which we store the active accounts (perhaps bank accounts). When someone wishes to open an account, we give them a whole bunch of forms to fill in; they go off, fill them in, and hand them back to us. We transcribe those forms and do some checking of the data contained. Once we are happy with the data, we can give the stuff to the filing clerk and have the clerk create the new file folder. So where were the forms and the checking? In some kind of "blotter" or case management pile on the clerk's desk. They weren't in the active accounts cabinets. And nor should they be.

Now we go to a computerized system. We enter the data from the completed forms into the system and "poof" - it creates an active account. But actually it is more insidious than that. We go through a series of screens putting in different bits of the account - each leading us to a more perfect account, but we aren't there yet. Eventually the account will be in the active accounts database (but probably with an inactive flag) so that it can at some point be transacted against. This is nuts. We are using the record keeping database (aka the filing cabinet) to manage work in process. This is not a proper separation of duties.

It gets worse. The company decides to "go online". Expensive consultants are hired, golf outings are scheduled, expensive dinners eaten and the "new account workflow" is eventually unveiled. It, too, is a sequence of steps. However, the poor schmuck filling this in has to complete each page of the form before moving on. That means that s/he cannot stop for a break, store a scratchpad version of it, or do it out of sequence because they can't remember their spouse's Social Security number or whatever. The people in charge of the design of the system understand that THE SYSTEM needs accurate record keeping, have heard that "it is ALWAYS better to validate the data at the point of capture" and other platitudes, but forget that at the end of the line there is the poor user. For these kinds of data entry systems (and a whole host of housekeeping systems) we need to store the "process state" separately. Don't use the state of the key entity as a substitute for that. Store where I am in the account opening process, not in the entity that represents the account.
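A minimal sketch of that separation, with all names hypothetical: the in-progress application is its own record and carries the process state, while the account entity only ever exists once the application is complete - no "inactive" flag needed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AccountApplication:
    """Work in process: the scratchpad, not the filing cabinet."""
    applicant: str
    step_completed: int = 0          # process state lives here...
    ssn: Optional[str] = None        # ...and fields may arrive in any order
    address: Optional[str] = None

    def is_complete(self) -> bool:
        return self.ssn is not None and self.address is not None

@dataclass
class Account:
    """Record keeping: created only when the application is done."""
    holder: str
    address: str

def open_account(app: AccountApplication) -> Account:
    assert app.is_complete(), "application still in progress"
    return Account(holder=app.applicant, address=app.address)
```

The user can now wander off mid-application and come back later; only the complete, validated result ever reaches the record keeping side.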

So what got this diatribe really going? The notion that updates are unnatural - and probably harmful. I posit that the reason we do updates is mostly that the most common retrieval need is for the most recent version of something. So it makes sense to have access to the most recent version and update in place. But that isn't always the most expedient behavior. Certainly the most recent value is often the value you need - especially in an operational system. However, more and more systems really need the ability to look back. Even something as simple (looking) as your medical record is not something you want to update. Patient history is key. We don't need to know the current cholesterol level (in isolation); we need its trend. So we don't just update the "cholesterol value" in the patient record. We add a new item for the cholesterol and keep the history. We keep the record sorted in time sequence so we can see the latest. We don't just overwrite the value. Our uses of data are so unpredictable that simply updating the database arbitrarily is going to give us data loss. We don't know in advance how serious that data loss might be. Perhaps it would be better to assume that we will need everything and come up with a scheme that, at some backbone level, ensures that the current view can be reconstructed by replaying the operational events.
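A minimal sketch of that scheme, using a toy append-only event log (all names hypothetical): nothing is ever overwritten, the latest value and the trend are both just replays of the log.

```python
# The backbone: an append-only log of operational events.
events = []

def record(patient, test, value, when):
    """Append a new reading; never update in place."""
    events.append({"patient": patient, "test": test,
                   "value": value, "when": when})

def current_value(patient, test):
    """The 'most recent version' is just the last event, replayed."""
    history = [e for e in events
               if e["patient"] == patient and e["test"] == test]
    history.sort(key=lambda e: e["when"])
    return history[-1]["value"] if history else None

def trend(patient, test):
    """The history is still there, so the trend comes for free."""
    return [(e["when"], e["value"])
            for e in sorted(events, key=lambda e: e["when"])
            if e["patient"] == patient and e["test"] == test]
```

Update-in-place gives you only the first query; the append-only log gives you both, at the cost of storage and a replay.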


Richard Veryard said...

And of course the same is true of "Delete". (This means you shouldn't delete this comment even if you've already made this point somewhere else.)

Chris Bird said...

As always a pithy remark, Richard. Thanks.

Delete is interesting - especially in massively distributed systems. In the grand scheme, of course, things can't be completely deleted (or at least that cannot be guaranteed).

However, ignoring that for a moment, think about how we undo transactions. Properly audited systems allow us to issue "contra" entries to allow the effect of the transaction to be undone. Of course it doesn't undo the transaction - i.e. the fact that the transaction occurred. It is an important trust boundary and separation of concerns to properly distinguish the effect from the existence. By simply issuing delete, we conflate those two concepts and then wonder why we can't understand history.
Organized crime seems to be better at this. Don't write anything down (unless you have to) and kill any distributed copies of the data (literally!)
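A minimal sketch of the contra-entry idea, with a toy append-only journal (all names hypothetical): the effect of a posting is reversed by appending its negation, never by deleting, so the fact that it happened survives.

```python
# Append-only journal of postings; nothing is ever deleted.
ledger = []

def post(account, amount, note=""):
    ledger.append({"account": account, "amount": amount, "note": note})

def contra(entry_index):
    """Undo the EFFECT of an entry by appending its negation.
    The EXISTENCE of the original entry is untouched."""
    original = ledger[entry_index]
    post(original["account"], -original["amount"],
         note=f"contra of entry {entry_index}")

def balance(account):
    """Current effect = replay of all postings."""
    return sum(e["amount"] for e in ledger if e["account"] == account)
```

After a posting and its contra, the balance is back where it started, but the journal still shows both entries - effect undone, history intact.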

dave hollander said...

Of course, this ties into the previous post on compensating transactions and complexity. The inability to backtrack because of update and delete makes compensation so much more "interesting".

Anonymous said...

Personally, I have a BI background, and I feel you have just given the perfect example of why we shouldn't just keep all history.

If the account changes name or address that's info that can be used current and historically. Knowing that there was a version without social secno for the 5min however has very little value in most situations

cbbrowne said...

This evokes, to my mind, the MVCC concept (MultiVersion Concurrency Control), where tuples don't normally get updated, but rather the DBMS just writes new versions.

The one trouble with that is that typically this is done largely invisibly to the user, so that it's mighty difficult to arbitrarily get back to older versions.

Here's an example of how it's sometimes accessible: Time Travel in PostgreSQL. But that's an answer only for one DBMS for some scenarios.

VJ Kumar said...

In Oracle, you can just use a flashback query:

select * from any_table as of timestamp (systimestamp - interval '5' minute);

This is trivial in comparison to the PostgreSQL solution. It is unclear why the other MVCC implementations (PostgreSQL, InnoDB, Firebird, SQL Server >= 2005) do not expose this kind of functionality.

Chris Bird said...

I generally disapprove of anonymous posters, but I will respond to the one above. It isn't axiomatic that "Knowing that there was a version without social secno for the 5min however has very little value in most situations".

It really does depend on why you are thinking about this. For example, discovering that people enter their social security number late in the process can give us some clues as to whether we have designed the process properly.
So while I may not have made that argument properly in the original post, there is more than just the current data value to think about. Sure, from a pure BI perspective the timing of social security number entry probably doesn't matter. However, for process improvement it may. Our data are not single purposed. So throwing values away can be fraught.

Bottom line for me - update in place should not be the kind of no-brainer default method of managing databases. We should look to the broader perspective to decide.

sayap said...

We tried to keep all UPDATE/DELETE history on a few tables, i.e. the transaction time side of bitemporal. So far, from a BI perspective, this has yielded 0 value.

For BI, the other side of bitemporal, i.e. valid time, is much more useful.
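For readers who haven't met the bitemporal distinction, a minimal sketch with toy data (all names hypothetical): valid time says when a fact held in the world; transaction time says when the system recorded it. A query supplies both.

```python
def within(period_from, period_to, t):
    # Half-open period [from, to); None means still open.
    return period_from <= t and (period_to is None or t < period_to)

# One customer's address history, on both time axes.
rows = [
    {"addr": "1 Elm St",  "valid_from": 2007, "valid_to": 2009,
     "tx_from": 2007, "tx_to": None},
    {"addr": "9 Oak Ave", "valid_from": 2009, "valid_to": None,
     "tx_from": 2009, "tx_to": None},
]

def address_as_of(valid_year, tx_year):
    """Where did the customer live in valid_year,
    according to what the system knew in tx_year?"""
    for r in rows:
        if (within(r["valid_from"], r["valid_to"], valid_year)
                and within(r["tx_from"], r["tx_to"], tx_year)):
            return r["addr"]
    return None
```

BI questions ("where did they live in 2008?") run along the valid-time axis; audit questions ("what did we believe in 2008?") run along the transaction-time axis.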

John Schlesinger said...

For me the bigger picture here is that an enterprise application needs to separate completely the read/write (scratchpad) side of the data model from the read-only (historical) side. Chris is well aware of the pioneering work done by the global distribution systems over the last ten years as the ratio of 'looks' to 'books' went from 5 to 1 to 1000 to 1. The same is happening in banking. If a teller transacts, it is 5 to 1. If I do it over the Internet, 50 to 1; if I do it on mobile, 500 to 1; and if the Internet of Things (credit cards, fridges) does it, 5000 to 1. So we are changing our product to have two data models: a transactional one and a reporting one. This is not data warehousing; the data is homogeneous. But the change is profound. The read-only database is a source of time-consistent bulk change. The read-only database is not an interface at all, merely a side effect of doing the record keeping.