A comparison with OOP
Note: if you haven't already, read the introductory example before reading this.
We're now going to re-implement the Cell code we saw in the previous chapter in Java, in order to get a better sense of the differences between the two programming paradigms, OOP and the functional/relational approach used in Cell. Were're not trying here to replicate more "advanced" features of Cell, like orthogonal persistence, transactions, or the ability to record and replay the execution of the application. We just want to show how they compare in terms of the usual metrics that we value as software engineers, things like simplicity, reliability, maintainability, overall development effort and so on.
On the Java side, in order to avoid boring boilerplate code that would make the comparison less clear, we'll be only partially following one of the fundamental tenets of OOP, encapsulation. All member variables will be public, but they are intended to be read-only outside the class they belong to. Any change to the state of any object will only be carried out by methods of its class. This is similar to what happens in Cell with relational automata, whose state is visible in read-only mode to the outside world, but can be mutated only by the message handlers of the same automaton, although there are several major differences here, one of them being the fact that in Cell that's actually enforced by the language, while in Java it is left to the programmer's self-discipline.
In the Java version we will also omit all constructors and a number of very simple methods, since their implementation will be obvious.
We've already seen the Cell schema for the core logic of our online forum:
There are of course many ways to design a set of equivalent data structures in Java (unlike what happens with Cell, where there's just one obvious design), each of them with its own trade-offs, but this seems a reasonable starting point:
There are a couple obvious issues with the Java version of the schema. The first one is that there's some redundancy in the data. The most obvious part of it is the
memberships member variables in
Group: either one could be eliminated without actually removing any information from the data set. The redundancy is needed because we have to be able to efficiently retrieve both the list of groups a given user has joined (and also join date and karma for each of them) and the list of all members of a given group. The same goes for
friendships, because the fact that Alice and Bob are friends has to be recorded in both the corresponding objects.
Redundancy causes two main problems: the first one is that if some piece of information is stored in multiple places, every time it changes one has to update each of the locations that contain it, and that means writing more code. The other problem has to do with consistency: a bug in the code could cause redundant data structures to go out of sync, and that would almost surely cause the software to misbehave. One could end up for example in a situation where a
User "thinks" it has joined a
Group, but the corresponding
Group object doesn't have such user among its members. The redundancy also makes data structures less clear and more difficult to understand.
The other major problem is the fact that in OOP there's no way to enforce a number of useful integrity constraints on the data. Nothing would prevent buggy code from, for example, creating two different
Membership objects for the same combination of user and chat group, with the result that a user could end up having two different join dates and karmas in the same group. Similar problems could occur with
Friendship objects. And of course there's no way to declaratively enforce the fact that usernames have to be unique.
All these problems are easily avoided using the relational model. In OOP the only way to keep your application or dataset from ending up in an inconsistent, invalid state is to write bug-free code. But code is complex and difficult to write, and will invariably contain bugs in any non-trivial application, while redundance-free data structures and declarative integrity constraints are very easy to design and get right.
Updating the information in the dataset
We're now going to write methods that are equivalent to the Cell message handlers. Inserting a new user or group is trivial and very similar in both languages, so we will just skip that. Instead, we'll start with the code that allows a user to join a chat group, or two users to become online friends. Here's the Cell code for that:
In Java the same functionality is implemented by the following methods of the
Group classes (constructors are not shown):
The Java code is slightly more complex, in the case of
join(..) because the same
Membership object has to be inserted in both
Group.memberships, and in the case of
befriend(..) because the same
Friendship object has to be added to the
friendships member variables of both parties involved. But that's not much of deal.
Keep in mind though that in Cell all message handlers that insert an entity or a relationship and all their attributes in the dataset have a standard structure and can and will be auto-generated in a future version of the compiler, so a few months from now you won't have to write any code at all for that. That's in general not possible in OOP, because the data model is a lot messier.
The code that deletes data from the dataset is more interesting. Here's the Cell version (it's a slightly more general version of the code we saw previously):
In Java, one would add a
delete() methods to both
Here the problems created by a bad data representation are more evident. The first thing we've to do is to add new wiring to the class model: both
Group now need to have a pointer to
OnlineForum, since their
delete() methods need to remove the corresponding objects from the list of live objects that is held by
OnlineForum. That means adding a new member variable to both, changing both constructors (not shown in the above code) and all places where objects of either class are instantiated. In a toy application like this one that may not be much of a problem, but everyone who has enough experience of real-world software development is familiar with this situation: implementing a new feature causes the classes/objects where the logic is located to need access to other nodes of the object graph, which has therefore to be augmented with some extra wiring. Depending on how "distant" the two parts of the graph are, many places in the codebase may have to be modified in order to carry a pointer to the required data structures from the place where it's already available to the one where it's now needed. This is a problem that completely disappears with the relational model.
As was the case with the
befriend(..) methods we saw before, the
delete() methods have more work to do because of the redundancy in the data:
User.delete() has to leave all the chat groups the user has joined, and remove the other side of each friendship, in orders to remove all traces of itself, while
Group.delete() has to make all members leave the group. And both of them need to remove the target objects from
OnlineForum as we mentioned earlier.
Another thing that is making the code more difficult to understand and less elegant is the fact that there's a mutual dependency between
removeMember(..): each of them needs to call the other in order to complete its work, and therefore there has to be some check in place to avoid entering an infinite loop. It's an example of how easily a low-level data representation can lead to spaghetti code as soon as your data model includes relationships that need to be navigated in either direction.
The same type of cyclical dependency is repeated with the
delete() methods and the corresponding
remove(..) methods in
OnlineForum. If the latter were simply defined like this:
they would be unsafe, because if they were erroneously called on their own by the user of our classes (which could well happen since they are public) they would cause the dataset to enter an inconsistent state: the user or group in question would be removed from the list of objects held by
OnlineForum, but would still be reachable from all the
Friendship objects that reference them.
Let's now implement one of the (read-only) methods that we saw previously,
This is one way to implement it, using the
Multimap class in Google's Guava libraries:
usersWhoSignedUpOn(..) method and the
usersBySignupDate member variable are new, while
remove(..) were already there, but had to be modified to support the new functionality (The
add(...) methods was not shown before, as it was a trivial one-liner).
Here the changes to the Java code are pretty simple, but that still has to be repeated for each type of search for which efficiency is a concern, and there's always the possibility that one might miss some of the locations in one's code that need to be updated. With the relational model, on the other hand, you get efficient searches for free.
There's obviously only so much that can be gathered from a toy example like this, but some of the issues that affect code written in a low-level programming paradigm like OOP should already be discernible:
- Data structures are significantly more complex than they should be, and contain a lot of noise/redundancy that cannot be removed without compromising their usability and performance.
- Code that updates those same data structures ends up being longer and more difficult to write, mainly because of data redundancy
- Redundancy and the lack of declarative integrity constraints make it way, way easier to have bugs that lead to an inconsistent and invalid application state
- There's a nontrivial amount of effort that has to be spent writing and maintaining code that wires the object graph and creates and updates the extra data structures that are needed to efficiently search your data. Using relations makes those issues disappear
- None of the above problems are specific to Java: instead, they affect any programming language that uses records and pointers in its data model, and that includes all object-oriented languages
One thing that cannot be gauged by a simplistic example like this one is the magnitude of the savings in terms of code size (and complexity) that are made possible by the use of the relational model. For that, we definitely need a bigger and more realistic example, but you'll have to wait a few more months for that. Stay tuned.