domingo, 13 de julho de 2008

The Quest for an ObjectScript Iterator (part 2)

In this series of blog-posts (starting here) I'm exploring the issues of maintaining an abstraction layer between Caché database and your business logic. Caché's tight integration of database and program environment tempts away from this. And, in particular, the functions for accessing and looping through tables pull against it.

In this post, I'll give some background and explanations of how Caché ObjectScript programs see the database. Experienced Caché and MUMPS programmers probably know this already, but new arrivals from other worlds might find it informative.


The Caché database is structured as sparse, multi-dimensional arrays (known as "globals") containing chunks of data in strings. Because the arrays can use strings as indexes (ie. the keys can be strings) meaningful information in a record is usually spread across both the keys and the actual value. But only keys are easy to search on.

This is different from a relational database where all fields are (at least from the SQL writer's perspective) more or less equal.

Let's create an example. A rather simplistic patient record might be stored something like this :


^Patient("general hospital",324542)=john~smith~malaria


where the hospital name is the first key, patient id is the second, and the actual data (first name, last name and disease) is encoded as sub-strings (known as "pieces" in Caché terminology) separated by the ~ character.

Such a database structure makes it easy and very fast, to pull out data if you have all the necessary keys. To get this record from the database into a variable p :


set p = ^Patient("general hospital",324542)


It's also pretty simple to manipulate a subtree. For example there are operations which can copy an entire subtree to another variable.


merge gh = ^Patient("general hospital")


will grab all general hospital patients and put them into a subtree in the variable gh.

You can delete subtrees with

kill ^Patient("old hospital")

etc.

On the other hand, if you want to find all patients who have malaria, you have a slog. You either have to manually run through all the records checking which contain "malaria" in the disease field of the string. Or, if looking-up patients by disease is a common requirement that needs to be fast, you make a second array as a fast, searchable index, that is structured like this.


^PatientDiseaseIndex("malaria","general hospital",324542)


And make sure you keep it in sync.

Iterating through these multi-dimensional arrays is "baroque" to say the least. Caché ObjectScript provides two commands : $order and $query for looping through tables.

$order takes as argument an array descriptor (name and keys), and returns the next key at the same level of hierarchy as the right-most key listed in the array expression.

Huh?

To make that last sentence clearer, here's an example.

Let's suppose we have three patient records :


^Patient("general hospital",324542)=john~smith~malaria
^Patient("general hospital",324549)=martha~jones~measles
^Patient("local clinic",2323)=donna~noble~flu


Calling the $order function like this :


$order(^Patient("general hospital",324542))


will return the value 324549. Why? Because 324549 is the next key at the "patient id" level of the key indexes.

Similarly


$order(^Patient("general hospital"))


will return the string "local clinic", because here we're only giving the top-level key of ^Patient. And the next key after "general hospital" is "local clinic".

Using $order, then, it's possible to loop through each key at a particular level of the hierarchy. It also knows how to find the first key at any level; you simply pass it an empty string. So

$order(^Patient(""))

returns "general hospital", the first top level key. And

$order(^Patient("general hospital",""))

returns 324542, the first second level key below "general hospital".

When the $order runs out of keys at any particular level of a subtree, it returns an empty string.

For example,

$order(^Patient("general hospital",324549))

returns "", which signals to us that there are no patient ids after 324549 in the "general hospital" subtree.

To loop through all records in the table we have to use nested loops. Typically something like this.

set hospital=""
for {
set hospital=$order(^Patient(hospital))
quit:hospital=""
set id=""
for {
set id=$order(^Patient(hospital,id))
quit:id=""
set p = $get(^Patient(hospital,id))
... do something with patient p
}
}



In other words, it's a bloody performance! Especially when you come from the sort of language where you're used to being able to write something like this :

for p in Patient {
... do something with patient p
}


But the problem is far more pernicious than simple verbosity. This code hardwires a great deal of commitment to the particular database structure. Let's suppose we realize at a later date that we really need to add a third key to Patients. For example, our hospital network expands into a neighbouring state and we now need to support a new structure :


^NewPatient(region,hospital,id)


Migrating the existing data is a bit of work. But now every single place in the code that loops through patients looking for records that match some criteria will have to be rewritten as well!

The alternative iterating function $query offers some help. But has its own bizarre qualities.

Like $order, the $query function takes an array name and keys. But it returns a string which contains the full array access expression of the next item regardless of the level of hierarchy. So


$query(^Patient("general hospital",324542))

will return a string containing "^Patient("general hospital",324549)"

This can then be evaled in the next statement. ObjectScript has an @ operator for eval, so we loop through the array like this.


set q=$query(^Patient)
for {
if q '= "" {
set p = @q
... do something with patient p
}
set q=$query(@q)
quit:q=""
}


As before, there's a way to get at the first record - the $query(^Patient), and when $query returns "" we've reached the end.

This is somewhat of an improvement in that we're back to one loop. And it would still work if we moved to a new structure for ^Patient. It's a minor inconvenience that we've got ourselves into a "for" which only tests for the exit condition at the end of the loop body so we need an extra test that q isn't "" for the actual "do something" part.

The bigger concern is that we've now lost our keys. The value of q is going to be something like "^Patient("general hospital",324549)" while the value of p is "martha~jones~measles". If, in the "do something", I want to know what hospital we're talking about I'm going to have to cut up the string q to extract it. That's a bit painful.

In the next part of this series, I'll start showing some "iterator" routines which do successfully hide the structure of a table and yet give access to necessary key information.

Nenhum comentário: