Why relations are better than objects

The title of this post is a bit misleading, as the first thing the word "object" brings to mind is OOP. Here instead we'll be using said word in a looser sense, one that includes, but is not specific to OOP. For our purposes, an object is any mutable record that can be referenced and manipulated through a pointer or object id. So everything we will be saying applies not only to object-oriented languages, but also to procedural languages like C or Pascal, and even some impure functional ones.

Some of the benefits conferred by the relational model are just a consequence of the fact that the relational model is value-based, that is, it has no notion of pointers. For a good overview of why it's better to avoid pointers in a high-level language, I recommend that you watch, if you haven't already, this talk by Rich Hickey (the creator of Clojure and Datomic): The Value of Values. He explains it much better than I ever could. We'll take all that for granted, and we'll be focusing here instead on several aspects that are either not discussed in that talk, or that are specific to the relational model.

In what follows, we'll be talking specifically about the particular flavor of the relational model that is used in Cell (which is very different from the one used in relational databases) and we'll assume the reader has a basic familiarity with it, so you'll need to read the introductory example first. It would also be useful, but not strictly necessary, to take a look at the chapter on relational automata.

Redundancy

In the database community it has been conventional wisdom for nearly half a century now (basically since the invention of the relational model) that in designing your database schema you should be careful to avoid any kind of redundancy. That's what database normalization theory is all about. For some unfathomable reason, the same kind of thinking is never (or almost never) applied to software construction, even though it would be as beneficial (possibly even more so) as it is for databases. So, before we start our discussion, it's a good idea to talk a bit about redundancy, and to explain what's so harmful about it.

The simplest form of redundancy is when a given piece of information is stored in more than one place. In a relational database, it may for example appear in multiple rows of the same table if the database is not normalized. In an object model, it may be stored inside different objects, or it may appear in slightly subtler forms (see the works_for() example in the next paragraph).

This type of redundancy causes two main problems. The first one is that if a piece of information is stored in multiple locations, those locations have to be kept in sync: every time you update one of them, you've to update all of them, or you'll put your dataset in an inconsistent state. While consistency is no guarantee of correctness, the former is certainly a prerequisite for the latter. So getting rid of redundancy increases the reliability of your code by avoiding an entire class of bugs.

The second main reason that makes redundancy harmful, one that is more relevant to software construction than database systems, is that having to update a piece of information that is stored in multiple locations requires more work, which makes your code more complex than it should be. This is especially true for object systems, where these redundant pieces of data may be scattered all over a difficult-to-navigate object graph, and where an artificial barrier like encapsulation often stands in the way.

This first type of redundancy is pretty easy to get rid of with the relational model. In relational databases you may still occasionally have to denormalize your schema for performance, but in Cell, due to the different flavor of the relational model and the fact that Cell is a programming language and not a database system, that simply doesn't apply.

But there's also a slightly less obvious and much more toxic form of redundancy, that occurs when a piece of data is not the exact copy of another, but can still be computed from data that's already there. This second type of redundancy is much more difficult to avoid, for the simple reason that those calculations can be computationally expensive. On top of that, its consequences are a lot worse. As an example, let's say that we have a redundant (in the second sense) piece of data, y, which can be derived from the non-redundant pieces of data x1 .. xn using the formula y = f(x1, .. xn). First of all, it's obvious that this situation is more error-prone that the one we had before, for the simple reason that y has to be updated every time any of the xi change. The number of dependencies is multiplied by n, so the chances of making a mistake go up.

The second, more important reason, is that in many situations you simply cannot afford to reevaluate the f(x1, .. xn) expression, because that may be computationally expensive (if it isn't, reactive programming and memoization can easily solve the problem). What you need to do, instead, is to efficiently update only the part of the derived data structures (y, in this case) that has changed, and that is almost invariably a lot more complex than recalculating everything from scratch. We'll see a non-trivial example of this later.

Relations are can be navigated in any direction

Say your your domain model contains entities like companies and employees/contractors/freelancers, you may have a relation like the following one:

works_for(PersonId, CompanyId)

You can easily navigate from a company to its employees/contractors/freelancers using works_for(?, a_company) or from a person to its employers with works_for(a_person, ?) or just works_for(a_person) if a person has only one employer.

With pointers, you've to set up two of them, one for each direction. One of them will be redundant, in the sense that it doesn't add any information that is not already there, and that will cause all the redundancy-related problems mentioned before: more work to do when updating your data, and the possibility that a bug in the code may leave it in an inconsistent state: a situation, for example, where an employee object "thinks" it works for a certain company, but that company doesn't have him or her in its list of employees.

Relations can be searched efficiently on any combination of their arguments

Using the domain model described in the introductory example, you could, for example, easily retrieve:

// All users who signed up on day d
signup_date(?, d)

// All users whose last name is n
last_name(?, n)

// All users that joined a chat group g on a day d
joined_on(?, g, d)

The first result of those searches would be retrieved (on average) in constant time, and every subsequent one would require only a few clock cycles. With objects, you would need to build ad-hoc data structures for each of those searches, if you wanted them to be efficient.

No need to wire an object graph

When designing your object model, you've to make sure that every object is wired to all the other objects it needs to have access to in order to perform the tasks that have been assigned to it. This wiring can easily get complex, and it tends to be rather unstable, in the sense that as the application grows, you'll often find yourself spending a not insignificant amount of time adding extra wiring to it. You mileage may vary here of course, but in my experience it's pretty common for a new feature to be difficult to implement not so much because it involves complex algorithms, but rather because it requires several objects/classes to cooperate for its implementation, in a way that requires them to be all wired to one another, and if that wiring is not already in place adding it may involve making significant changes all over your code base.

With the relational model, on the other hand, the wiring problem simply disappears. That's in part a consequence of the fact that relations can be navigated in any direction, and in part stems from the fact that every piece of data is always accessible from the automaton that contains it. The end result, anyway, is that, just like with searches, you get navigation for free.

Relations can elegantly encode complex facts that are awkward to represent with objects

Take, for example, the following statements:

  • Supplier S sells part P at price C (with every supplier charging a different price for the same part)
  • Supplier S sells part P at price C for orders of at least N items
  • Person A was introduced to person B by person C
  • User U joined chat group G on date D

In the relational world, they can be model by the following relations:

sells_for(SupplierId, PartId, Money)        [key: 0:1];
sells_for(SupplierId, PartId, Money, Int)   [key: 0:1:3];
introduced_by(PersonId, PersonId, PersonId) [key: 0:1];
joined_on(UserId, ChatGroupId, Date)        [key: 0:1];

where each entry in a relation represents an instance of the corresponding informal statement. I can't imagine a more natural way of encoding that information.

With objects, what you can model naturally is attributes of an entity. If, for example, every supplier charged the same price for the same part, you could model price as an attribute of parts, which could be represented as member variable price of class Part. But in a more complex case where every supplier charges a different price, you'll have to bend over backwards to do it.

If you're not convinced, just try to write the equivalent data structures using objects for one of the above facts (let's say it's the first one), remembering that you'll need to be able to, among other things:

  • Given a supplier s and a part p to check if the supplier sells a given part, and if so to retrieve its price (sells_for(s, p, *) and sells_for(s, p) in Cell, respectively)
  • Given a supplier s, to retrieve all the parts it sells, and their prices (sells_for(p, ?, ?) in Cell)
  • Given a part p, to retrieve all the suppliers that sell it, and the price they charge (sells_for(?, p, ?) in Cell)
  • To make sure that there cannot be duplicate entries for the same supplier/part combination (that's what the [key: 0:1] is for)

Of course you'll need to do all those things efficiently, since you might have a lot of parts and suppliers, so linear searches are out of the question. In the best-case scenario, you'll end up building an ad-hoc, informally-specified, bug-ridden, slow implementation of a relation.

Integrity constraints

With the relational model you can express declarative integrity constraints on your data: using the domain model described in the introductory example, you can for instance enforce the facts that no two users can have the same username, or that for every combination of user and chat group there can be only one join date and karma. That's in addition to the fact that the relational model allows you to squeeze most redundancy out of your data structures, thereby reducing the likelihood that they'll end up in an inconsistent state.

With object you've don't have such a simple, declarative and reliable way to enforce the consistency of your data, which deprives you of a powerful mechanism to avoid and to defend yourself against programming errors.

Declarative query languages

With relations you can have declarative query languages (think Datalog, not SQL). I've never seen anything similar for any other data model. A language like Datalog is not Turing complete, of course, but for the kind of computation it allows, its simplicity and expressiveness are simply unrivaled.

I can't quite put my finger on why there's nothing remotely equivalent for any other data model, but I think that the fact that relations can be easily searched, and that their tuples/entries are not ordered (unlike what happens, for example, with sequences, where some of the information is conveyed by the order in which elements are arranged, not just by the elements themselves) are both crucial to that.

The availability of a declarative query language is not important only because it simplifies some types of computation. Another reason is that the result of certain classes of declarative expressions can be recalculated incrementally and automatically whenever the input data changes. That's especially important in a programming language, as opposed to a database system.

Let's illustrate this with an (admittedly contrived) example. Let's say you're building a massively multiplayer online game, where each player gets to build their own base/town, and where they need to extract resources of some sort (oil/timber/ore/...) to do so, and to build their armies. Let's also say that players can sign bilateral trade agreements, that allow them to buy the resources they need and sell those they have an excess of. Trade can be done through intermediaries: if Alice has a trade agreement with Bob, and Bob has one with Charlie, then Alice and Charlie can trade even if they don't have a direct trade agreement, although in this case Bob will be charging a fee for his services. For simplicity, every trade will be allowed to go through only one or maybe two intermediaries at most (things would get a lot more complicated if we allowed intermediary chains of arbitrary length).

In the game, we want every player to be able to see in real time if any given resource can be purchased, what's the best price for it, what's the second best one and so on, and how much of it can be bought at each price.

Expressing this logic in a Datalog-like language is straightforward: it's almost as easy as stating it in natural language. Computing the same information from scratch in either an imperative or functional language is significantly more complex that that, although still not excessively so.

But what we would probably want to do in a case like this is not to recompute everything from scratch every time something in the (massively multiplayer) game changes, but to update the results incrementally as the game progresses. In a declarative query language, all the queries involved in this particular case can be updated incrementally and automatically by a reasonably sophisticated query engine (or by code generated under the hood by a reasonably sophisticated compiler) without any extra effort on the part of the developer. But in an imperative (or functional) one, incrementally recomputing the output map is significantly more complex than computing it from scratch, and a lot more error-prone. That's because there's a lot of factors than can change the results, all of which have to be accounted for. The stocks of resources for sales can be replenished, or they can decrease and run out. Prices can change. So can the fees that players charge for acting as intermediaries in a trade. New trade agreements can be signed, and existing ones can expire or be canceled. All of these changes have to propagated to all pieces of information that depend on them, and the exact propagation paths are different in each case.

Attributes can be modeled in a uniform way independently of their cardinality

Let's say in your data model you want to store the telephone numbers of your users. If you want to allow multiple phone numbers for each user, you can declare a binary relation like the following one:

phone(UserId, String)

If on the other hand you want each user to have at most one phone number, you'll declare the relation as follow:

phone(UserId, String) [key: 0]

In both cases the data representation is exactly the same, and the only thing that changes is that in the latter case you'll have to declare a key for the relation, that is, an integrity constraint. The same goes for relationships: the works_for relation we saw before, for instance, can encode either a many-to-many or a many-to-one relationship (it could also become a one-to-one relationship of course, if it weren't for the fact that that doesn't make any sense in this context) depending on the integrity constraints that are applied to it, and the same happens with mandatory and optional attributes.

There's at least a couple of benefits in having such a uniform representation for attributes or relationships with different cardinalities. The first one is that code written in a declarative language like Datalog will keep working without changes even as those integrity constraints change over time. The second one is that this uniformity will make it easier for new code to work with old data: if an attribute that used to have only one value becomes multivalued, you'll be able to reuse the old data without having to do any sort of conversion (of course that's not going to work in the opposite direction, since the data model in that case becomes more restrictive). The same happens when a mandatory attribute suddenly becomes optional.

This is, by the way, one of the reasons why in Cell every attribute or relationship is stored in a separate relation. In SQL databases, all single-valued attributes are stored in the same table, while multivalued ones require a separate table, and optional attributes require the use of a problematic feature like NULLs.

Controlling the scope of side effects

Since the relational model is value-based, you can have mutability and control over the scope of side effects at the same time: when you update the state of an automaton instance in Cell you've the guarantee that nothing else is affected, because automata don't share mutable state (and also because state mutation and I/O cannot be mixed, of course). That combines the advantages of functional and imperative programming, while mostly avoiding the problems of both: when you can control the scope of side effects it becomes much easier to reason about your code, and you can safely update different automata concurrently, but with most of the convenience and performance that comes from being able to update the application's state in place.

But once you introduce pointers in your data model, it becomes very difficult to retain any amount of control over the scope of the side effects (if there's an effective way to do it, I'm not aware of it). By mutating an object, you affect any part of your program from which such object can be reached, either directly or indirectly.

Pointer aliasing is incompatible with reactive programming

Let's illustrate that with an example. Say you've two variables, x and y, and you want y = f(x) to hold in your program at any time. Moreover, let's say that evaluating f(x) is computationally expensive, so that you cannot afford to do it every time you need to read the value of y. That's the kind of task reactive programming is for. In a reactive system, every time x changes, f(x) is automatically reevaluated and the result stored in y. But in order for the compiler to insert the under-the-hood code that does so, it must be able to detect when x changes. With a value-based data model that's easy to to: a variable can change only if a new value is explicitly assigned to it. But in the presence of pointer aliasing that's not possible anymore, at least in the general case. That's because there can be other references to whatever object x is pointing to, and any of them can be used to mutate it. Moreover, any change of state of any object that is reachable from x even indirectly can be enough to change the result of the evaluation of f(x).

There are, of course, reactive frameworks for object-oriented languages. For the most part, they just pretend that the above problem can't happen, and typically place the burden of detecting when a given data structure has changed on the programmer. The end result is that the code ends up being a lot more complex than it needs to be, thereby obfuscating the much simpler underlying logic. There are also other, more subtle (and insidious) problems: for example, when dealing with non-trivial reactive systems, where changes are propagated through several layers of state, all recalculations have to be done in a very specific order to ensure that they never make use of "stale" values (check the Wikipedia article on reactive programming for more details). In a value-based language, the compiler can easily figure out the correct order, but with objects that's not possible anymore and those recalculations end up being done in what is basically a random order.

Other frameworks, like ReactiveX, eschew the notion of state altogether and go fully functional (even though they are designed for imperative languages), with computation being expressed as the composition of functions that transform streams of events. That's a sound and interesting approach, but it takes a rather restrictive view of reactive programming, whose underlying ideas are useful in many context where that specific approach cannot be employed.