The Cell Programming Language

Design process

Let's now quickly sketch out the design process for relational automata. The first step is to partition the information your application handles into a number of subdomains, each of which will be implemented as a relational automaton, and identify all the dependencies between them, which will be then translated into dependant/dependees relationships between automata. My own preference (and advice) here is to have only a few coarse-grained domains instead of a large number of smaller ones: remember that despite some superficial similarities, relational automata have very little in common with objects and classes (which indeed tend to be very fine-grained in well-designed object-oriented applications), so you should be wary of carrying over design practices developed for a very different paradigm. Instead, the closest analog I can think of in mainstream programming paradigms is the components of the Entity/Component architecture, of which there's usually just a handful in a tipical game. I suggest that you aim for that level of granularity, but of course nothing stops you from experimenting with different approaches.

After that we'll need to identify all entities in the application domain, to create a specific data type for each one of them and to arrange those datatypes hierarchically if their corresponding entities lend themselves to being classified in such a way. I strongly recommend that you design those datatypes so that they only identify the corresponding entities and their types, and use relations to store their attributes, as was done in all the examples that we've seen so far, like PartId, SupplierId, Employee, SalariedEmployee and so on. Using relations instead of records to store attributes of any given type of entity (or, more generally, all information related to them) provides several very substantial advantages. Some of them should be already apparent by now, like the ability to do fast seaches on any attribute or combination of them; to have unique attributes like usernames in Logins or code in Supply; to naturally encode attributes of relationships and not just entities (unit_price and availability in Supply, or purchased in BookMarket); and of course to imperatively update their values, while retaining the benefits of pure functional programming. But we've barely scratched the surface here: we will discuss this topic more extensively in another chapter. Keep also in mind that Cell is developed around the notion that information should be encoded using relations, and that's also true for performance optimization: if you use records to store the attributes of your entities, you'll end up paying a price in terms of performance too.

The next step is to define the data types for the attributes of both entities and relationships, and this is another place where schema design in Cell markedly differs from conventional database design. Here attributes can be of any user-defined type, and those types should be defined in the way that makes the most sense for each specific attribute. These user-defined types can be either application-specific ones like BookCondition in BookMarket or more generic and reusable types like Money (which was used in several examples throughout the chapter) that may even come with their own logic.

There's of course some tension between the two recommendations above: using relations instead of records to store information, and using user-defined types for the arguments of those relations. As an example, which of two schemas shown below should be preferred?

schema StrategyGame {
  enemy_unit(EnemyUnitId)
    position_x : Int,
    position_y : Int,
    ...
}

type Point = point(x: Int, y: Int);

schema StrategyGame {
  enemy_unit(EnemyUnitId)
    position : Point,
    ...
}

In a case like this, I would be inclined to use a single attribute of type Point rather than two integer attributes position_x and position_y. But in that case, why stop there and not, say, merge the first_name and last_name attributes of the Employee entity in a single Name type, as demonstrated below, which is something I would very strongly object to?

type Name = name(
  first_name:  String,
  last_name:   String,
  middle_name: String?
);

schema Workforce {
  // Next unused employee id
  next_employee_id : Nat = 0;

  // Shared attributes of all employees
  employee(Employee)
    name  : Name,
    ssn   : String;

  ...
}

Here are a few points to consider when making that decision:

First of all, remember that the issue arises only if you're dealing with (possibly tagged) non-polymorphic record types. If the attibutes in question are best modeled as a union type, then there isn't really much of a choice.
Is the type you're considering an "interesting" one, that is, one that comes with its own logic? Point does, to some extent. Or is it a "boring" one, like Name, that groups together several somehow related pieces of information, but has no intrinsic behaviour? In the former case, defining a complex type may well be a good idea. Otherwise, it's almost always a terrible one. If it's just plain data, relations are superior to records in every possible way.
Once you group together several attributes you loose the ability to perform fast searches on any one of them. In the case of Workforce and Name, for instance, merging the two simple attributes first_name and last_name into a single composite one would loose you the ability to search employees by first name only, or last name only: you would have to provide the complete value for the name attribute in order to perform an optimized search. The same goes for some integrity constraints: for instance, once a unique attribute is merged with others, it becomes impossible to declaratively enforce its uniqueness.
You also loose the ability to imperatively update the attributes in question individually. Once first_name and last_name are merged, only the whole name attribute can be the target of an imperative assignment, not its individual fields.

That's not an exhaustive list, but it should be enough to make an informed decision in the vast majority of cases.

Once the previous steps have been completed, what's left is more or less normal database design, which conventional database design wisdom applies to, despite a few differences between the two flavors of the relational model. It basically boils down to three main points:

Avoid any form of redundancy at all costs. No piece of information should ever be duplicated, full stop. There's simply no reason whatsoever to do that in Cell.
As far as possible, do not store derived data, that is, data that can be calculated from other data that is already in your schemas. That's every bit as bad as having duplicate data, and for exactly the same reasons. Unfortunately it is, in some cases, difficult to avoid. But it should be done only as a last resort.
Use all the available integrity constraints to ensure the consistency of your data.

The overarching goal of the above guidelines is to avoid inconsistencies in your dataset, which are dangerous because they almost invariably cause you code to misbehave.

A piece of information that is duplicated, that is, stored in multiple locations, opens the door to inconsistencies: if when that particular value changes some but not all locations are updated, you'll end with an inconsistent dataset. Duplicating data is a common practice in low-level programming paradigms like OOP, and when it's done for a reason it's usually done for convenience: it may in many cases save you the trouble of traversing an object graph in search of the information you need, or even building the required access paths into it in the first place. In other cases it may be done for performance, as is for example the case with the practice of denormalizing data in relational databases in order to increase access speed. But I'm not aware of any reason to do that in Cell schemas: the data stored inside a relational automata is always easy to access as there's no object graph to traverse, there aren't artificial barriers like encapsulation either, and the data structures used to implement mutable relation variables don't require any denormalization to improve performance.

Derived information poses similar problems: core and derived data can get out of sync during a buggy update. In this case though the performance issues are unfortunately all too real, and that makes it difficult to avoid storing derived data entirely. But how much of it and what pieces of it exactly you decide to store can actually make a lot of difference. Every extra piece of derived data in your schemas is a trap that you set for yourself, which you have to be careful to avoid during updates. The fewer traps you set the better, so try to reduce the amount of such data to a minimum, and choose it carefully in order to maximize the performance gains and minimize the downsides. The language is meant to eventually provide several forms of memoization that should obviate the need to store derived data in the vast majority of cases, but their implementation is not trivial and it will take some time to get there.

Several other types of inconsistencies can plague your datasets, even in the absence of duplicate or derived data, but a careful use of integrity constraints (keys and foreign keys) can prevent most of them, like required information that might be missing (e.g. missing mandatory attributes); information with the wrong multiplicity (e.g. an attribute that has several values even though it's supposed to have only one); and "dangling" foreign keys. Many (but not all) of the necessary integrity constraints will be declared implicitly when using the available syntactic sugar. Incidentally, another good reason to be very thorough when declaring keys and foreign keys is that the compiler will take advantage of them to optimize the low-level data structures used to implement relations, and omitting them will make your code considerably slower.

Another undesirable consequence of having duplicate or derived data in your schema is that it makes updates more complex. A stored derived relation for instance will have to be updated every time a tuple is inserted, updated or deleted in any of the relations it depends on, and that could well be a lot of places. On top of that, efficiently updating only the part of a derived data structure that has changed is typically a lot more complex that recalculating everything from scratch. So not only your code will in all likehood have more bugs, but you'll also have more work to do, and the extra code you'll have to write is going to be especially complex.

There's also a Cell-specific reason to properly design your schemas, and it's related to orthogonal persistence. Orthogonal persistence requires the ability to restore the state of an automaton using a previously-taken snapshot of the state of another automaton of the same type. But when performing a restore operation you've no guarantees of any sort about the origin of the data you're dealing with: it could have been created manually; or it could have been produced by code that is not part of your application, for example by a script that processed other data; or it could come from an earlier version of your own application; or who knows what else. The bottom line is that in order to make sure that an automaton will never enter an invalid state you cannot rely on the correctness of the code that updates it: the only defence here is to design schemas to be so tight as to rule out as many invalid states as possible.

One point that could never be emphasized enough is that the design of your schemas is the wrong place to be lazy or negligent: by all means cut corners in the actual code, if you think that will give you an edge: code is complex and time-consuming and difficult to write, and even the ugliest hacks can at times be justified. But there's plenty of reasons to try hard to get your data structures right, and no reason not to do it. Unlike code, data structures are both short and easy to design and write, so you're not going to save any effort by not doing that with the due diligence, but you've got a lot to loose, because badly designed data structures will make your code slower, more complex, less reliable and more difficult to read and understand.