A comparison with OOP

Note: if you haven't already, read the introductory example before reading this.

We're now going to re-implement the Cell code we saw in the previous chapter in Java, in order to get a better sense of the differences between the two programming paradigms, OOP and the functional/relational approach used in Cell. Were're not trying here to replicate more "advanced" features of Cell, like orthogonal persistence, transactions, or the ability to record and replay the execution of the application. We just want to show how they compare in terms of the usual metrics that we value as software engineers, things like simplicity, reliability, maintainability, overall development effort and so on.

On the Java side, in order to avoid boring boilerplate code that would make the comparison less clear, we'll be only partially following one of the fundamental tenets of OOP, encapsulation. All member variables will be public, but they are intended to be read-only outside the class they belong to. Any change to the state of any object will only be carried out by methods of its class. We will also omit constructors and some very simple methods, since their implementation will be obvious.

Both the Cell and Java versions of the code are available on github.

Data structures

We've already seen the Cell schema for the core logic of our online forum:

type UserId  = user_id(Int);
type GroupId = group_id(Int);

schema OnlineForum {
  next_id : Int = 0;

  user(UserId)
    username       : String [unique],
    signup_date    : Date,
    first_name     : String,
    last_name      : String,
    date_of_birth? : Date;

  chat_group(GroupId)
    name  : String,
    admin : UserId;

  member(UserId, GroupId)
    joined_on : Date,
    karma     : Int;

  friends(UserId | UserId)
    since : Time;

  admin(_, id) -> user(id);
  member(u, g) -> user(u), chat_group(g);
  friends(u1, u2) -> user(u1), user(u2);
}

There are of course many ways to design a set of equivalent data structures in Java (unlike what happens with Cell, where there's just one obvious design), each of them with its own trade-offs, but this seems a reasonable starting point:

public class User {
  public int        id;
  public String     username;
  public LocalDate  signupDate;
  public String     firstName;
  public String     lastName;
  public LocalDate  dateOfBirth;

  public Set<Membership> memberships;
  public Set<Friendship> friendships;
}

class Group {
  public int    id;
  public String name;
  public User   admin;

  public Set<Membership> memberships;
}

class Membership {
  public User       user;
  public Group      group;
  public LocalDate  joinedOn;
  public int        karma;
}

class Friendship {
  public User       user1;
  public User       user2;
  public LocalDate  since;
}

class OnlineForum {
  int nextId = 0;

  public Set<User>  users;
  public Set<Group> groups;
}

There are a couple obvious issues with the Java version of the schema. The first one is that there's some redundancy in the data. The most obvious part of it is the memberships member variables in User and Group: either one could be eliminated without actually removing any information from the data set. The redundancy is needed because we have to be able to efficiently retrieve both the list of groups a given user has joined (and also join date and karma for each of them) and the list of all members of a given group. The same goes for friendships, because the fact that Alice and Bob are friends has to be recorded in both the corresponding objects.

Redundancy causes two main problems: the first one is that if some piece of information is stored in multiple places, every time it changes one has to update each of the locations that contain it, and that means writing more code. The other problem has to do with consistency: a bug in the code could cause redundant data structures to go out of sync, and that would almost surely cause the software to misbehave. One could end up for example in a situation where a User "thinks" it has joined a Group, but the corresponding Group object doesn't have such user among its members. The redundancy also makes data structures less clear and more difficult to understand.

The other major problem is the fact that in OOP there's no way to enforce a number of useful integrity constraints on the data. Nothing would prevent buggy code from, for example, creating two different Membership objects for the same combination of user and chat group, with the result that a user could end up having two different join dates and karmas in the same group. Similar problems could occur with Friendship objects. And of course there's no way to declaratively enforce the fact that usernames have to be unique.

All these problems are easily avoided using the relational model. In OOP the only way to keep your application or dataset from ending up in an inconsistent, invalid state is to write bug-free code. But code is complex and difficult to write, and will invariably contain bugs in any non-trivial application, while redundance-free data structures and declarative integrity constraints are very easy to design and get right.

Updating the information in the dataset

We're now going to write methods that are equivalent to the Cell message handlers. Inserting a new user or group is trivial and very similar in both languages, so we will just skip that. Instead, we'll start with the code that allows a user to join a chat group, or two users to become online friends. Here's the Cell code for that:

OnlineForum.join_group(user: UserId, group: GroupId, date: Date) {
  insert member(this.user, this.group)
    joined_on = this.date,
    karma     = 0;
}

OnlineForum.add_friendship(user1: UserId, user2: UserId, since: Time) {
  insert friends(this.user1, this.user2)
    since = this.since;
}

In Java the same functionality is implemented by the following methods of the User and Group classes (constructors are not shown):

class User {
  ... // Same as before

  public void join(Group group, LocalDate date) {
    Membership membership = new Membership(this, group, date);
    memberships.add(membership);
    group.addMembership(membership);
  }

  public void befriend(User user, LocalDate since) {
    Friendship friendship = new Friendship(this, user, since);
    friendships.add(friendship);
    user.friendships.add(friendship);
  }
}

class Group {
  ... // Same as before

  public void addMembership(Membership membership) {
    memberships.add(membership);
  }
}

The Java code is slightly more complex, in the case of join(..) because the same Membership object has to be inserted in both User.memberships and Group.memberships, and in the case of befriend(..) because the same Friendship object has to be added to the friendships member variables of both parties involved. But that's not much of deal.

Keep in mind though that in Cell all message handlers that insert an entity or a relationship and all their attributes in the dataset have a standard structure and can and will be auto-generated in a future version of the compiler, so a few months from now you won't have to write any code at all for that. That's in general not possible in OOP, because the data model is a lot messier.

The code that deletes data from the dataset is more interesting. Here's the Cell version (it's a slightly more general version of the code we saw previously):

OnlineForum.delete_user(id: UserId) {
  id = this.id;
  delete user(id), friends(id, *);
  for g <- member(id, ?)
    leave_group(id, g);
}

OnlineForum.delete_group(id: GroupId) {
  delete chat_group(this.id), member(*, this.id);
}

OnlineForum.leave_group(user: UserId, group: GroupId) {
  leave_group(this.user, this.group);
}

using OnlineForum {
  leave_group(UserId user, GroupId group) {
    delete member(user, group);
    members = [u : u <- member(?, group), u != user];
    if members != [] {
      new_admin = any(max_by(members, karma($, group)));
      update admin(group, new_admin);
    }
    else
      delete chat_group(group);
  }
}

In Java, one would add a delete() methods to both User and Group:

class User {
  ... // Same as before

  public OnlineForum onlineForum;

  public void delete() {
    onlineForum.remove(this);

    for (Membership m : memberships)
      m.group.removeMember(this);

    for (Friendship f : friendships) {
      User otherUser = f.user1 == this ? f.user2 : f.user1;
      otherUser.friendships.remove(this);
    }
  }

  public void leaveGroup(Group group) {
    if (memberships.remove(group))
      group.removeMember(this);
  }
}

class Group {
  ... // Same as before

  public OnlineForum onlineForum;

  public void delete() {
    onlineForum.remove(this);

    for (Membership m : memberships)
      m.user.leaveGroup(this);
  }

  public void removeMember(User user) {
    if (memberships.remove(user)) {
      user.leaveGroup(this);
      if (user == admin) {
        Optional<Membership> newAdminMembership = memberships.stream()
          .collect(Collectors.maxBy(m1, m2 -> m1.karma - m2.karma));
        if (newAdminMembership.isPresent())
          admin = newAdminMembership.get().user;
        else
          delete();
      }
    }
  }
}

class OnlineForum {
  ... // Same as before

  public void remove(User user) {
    if (users.remove(user))
      user.delete();
  }

  public void remove(Group group) {
    if (groups.remove(group))
      group.delete();
  }
}

Here the problems created by a bad data representation are more evident. The first thing we've to do is to add new wiring to the class model: both User and Group now need to have a pointer to OnlineForum, since their delete() methods need to remove the corresponding objects from the list of live objects that is held by OnlineForum. That means adding a new member variable to both, changing both constructors (not shown in the above code) and all places where objects of either class are instantiated. In a toy application like this one that may not be much of a problem, but everyone who has enough experience of real-world software development is familiar with this situation: implementing a new feature causes the classes/objects where the logic is located to need access to other nodes of the object graph, which has therefore to be augmented with some extra wiring. Depending on how "distant" the two parts of the graph are, many places in the codebase may have to be modified in order to carry a pointer to the required data structures from the place where it's already available to the one where it's now needed. This is a problem that completely disappears with the relational model.

As was the case with the join(..) and befriend(..) methods we saw before, the delete() methods have more work to do because of the redundancy in the data: User.delete() has to leave all the chat groups the user has joined, and remove the other side of each friendship, in orders to remove all traces of itself, while Group.delete() has to make all members leave the group. And both of them need to remove the target objects from OnlineForum as we mentioned earlier.

Another thing that is making the code more difficult to understand and less elegant is the fact that there's a mutual dependency between leaveGroup(..) and removeMember(..): each of them needs to call the other in order to complete its work, and therefore there has to be some check in place to avoid entering an infinite loop. It's an example of how easily a low-level data representation can lead to spaghetti code as soon as your data model includes relationships that need to be navigated in either direction.

The same type of cyclical dependency is repeated with the delete() methods and the corresponding remove(..) methods in OnlineForum. If the latter were simply defined like this:

class OnlineForum {
  ... // Same as before

  public void remove(User user) {
    users.remove(user.id)
  }

  public void remove(Group group) {
    groups.remove(group.id);
  }
}

they would be unsafe, because if they were erroneously called on their own by the user of our classes (which could well happen since they are public) they would cause the dataset to enter an inconsistent state: the user or group in question would be removed from the list of objects held by OnlineForum, but would still be reachable from all the Membership and Friendship objects that reference them.

Searches

Let's now implement one of the (read-only) methods that we saw previously, users_who_signed_up_on(..):

using OnlineForum {
  [UserId] users_who_signed_up_on(Date d) = [u : u <- signup_date(?, d)];
}

This is one way to implement it, using the Multimap class in Google's Guava libraries:

class OnlineForum {
  ... // Same as before

  Multimap<LocalDate, User> usersBySignupDate =
    new HashMultimap<LocalDate, User>();

  public Collection<User> usersWhoSignedUpOn(LocalDate date) {
    return usersBySignupDate.get(date);
  }

  public void add(User user) {
    users.add(user);
    usersBySignupDate.put(user.signupDate, user);
  }

  public void remove(User user) {
    if (users.remove(user)) {
      user.delete();
      usersBySignupDate.remove(user.signupDate, user);
    }
  }
}

The usersWhoSignedUpOn(..) method and the usersBySignupDate member variable are new, while add(..) and remove(..) were already there, but had to be modified to support the new functionality (The add(...) methods was not shown before, as it was a trivial one-liner).

Here the changes to the Java code are pretty simple, but that still has to be repeated for each type of search for which efficiency is a concern, and there's always the possibility that one might miss some of the locations in one's code that need to be updated. With the relational model, on the other hand, you get efficient searches for free.

Conclusions

There's obviously only so much that can be gathered from a toy example like this, but some of the issues that affect code written in a low-level programming paradigm like OOP should already be discernible:

  • Data structures are significantly more complex than they should be, and contain a lot of noise/redundancy that cannot be removed without compromising their usability and performance.
  • Code that updates those same data structures ends up being longer and more difficult to write, mainly because of data redundancy
  • Redundancy and the lack of declarative integrity constraints make it way, way easier to have bugs that lead to an inconsistent and invalid application state
  • There's a nontrivial amount of effort that has to be spent writing and maintaining code that wires the object graph and creates and updates the extra data structures that are needed to efficiently search your data. Using relations makes those issues disappear
  • None of the above problems are specific to Java: instead, they affect any programming language that uses records and pointers in its data model, and that includes all object-oriented languages

One thing that cannot be gauged by a simplistic example like this one is the magnitude of the savings in terms of code size (and complexity) that are made possible by the use of the relational model. For that, we definitely need a bigger and more realistic example, but you'll have to wait a few more months for that. Stay tuned.