Relational automata

With the functional subset of the language now out of the way, we can finally examine those constructs and features that are unique to Cell, and we'll start with the most important one, relational automata. Before we delve into all the gory details, make sure you've read both the introductory example and the overview: the former will give you a much gentler introduction and the latter will explain the basic principles that govern the workings of relational automata.

Schemas

The first thing you need to provide in order to define a relational automaton is the type of its state. An automaton instance is a sort of glorified variable, which has, at any given point in time, a value which we'll usually refer to as its state. The state of an automaton is always a record, and its type is defined with a schema definition. Here's an example, similar to something we've seen in a previous chapter:

schema Counter {
  value:   Int = 0;
  updates: Int = 0;
}

The above schema definition looks a lot like a type definition, and indeed a schema can be used, with some limitations, as a normal type in function signatures. When Counter is used that way, it works exactly as if it had been defined like this:

type Counter = (value: Int, updates: Int);

Schemas are essentially types, but they are augmented by extra bits of information that apply only to automata. One such piece of information is on display in the schema definition above: a default value for the initial state of the automaton. Whenever an instance of Counter is created its intial state will be (value: 0, updates: 0), unless a different value is provided explicitly.

While an instance of an automaton like Counter can be though of as a variable of a record type, whose value can be read just as if it were a normal variable, in some ways such an instance behaves more like a mutable data structure (like a struct in C or a class in an object-oriented language) with two member variables value and updates that can be modified directly inside a message handler. In the functional subset of the language we examined in the previous chapters there's a strict distinction between variables and values: variables can be updated, that is, the value they contain can change over time, but values themselves are immutable: if you have, say, a variable v that contains a record value with two fields x and y you cannot update those fields directly like you do in imperative languages: all you can do is create a mutated copy of the record in v and store that new record inside v. If the language allowed imperative updates of the form v.x = new_x; and another variable v' somewhere else in the program pointed to the same physical data structures as v, updating v would have the side effect of updating v' as well. That's what happens all the time in imperative languages of course, but preventing this kind of unrestriced side effects is quite possibly the most fundamental feature of functional programming. In the case of automata, though, the same end is achieved by different means: either by preventing (or restricting) aliasing in the first place (that's what happens with mutable relation variables, discussed below) or by making a physical copy of some of the data structures involved, which is what usually happens under the hood when you read the whole value of an automaton variable, as opposed to the value of one of its individual fields.

Mutable relation variables

As already mentioned in the overview, the most important feature of relational automata, and of the entire language in general, is the ability to use (mutable) relation variables to encode their state. Here we need a slightly more complex example: let's say you want to build an old-fashioned chat server, that will allow users to connect from a remote computer using a command line tool like telnet. As soon as they connect they will be automatically assigned a numeric id by the server (which will be invisible to them) and they will be able to choose a (unique) username using a command of the form login somecutename. Once they're logged in they will be able to send and receive messages to and from other individual users, and will also be able to join any number of chat groups with the command join chatgroupname. Groups are managed dynamically: a group will be automatically created, if it doesn't exist already, as soon as a user tries to join it, and destroyed when the last member leaves. If you wanted to create an automaton that keeps track of usernames and the list of chat groups each user has joined, you could for instance start with something like this:

schema Logins1 {
  usernames   : [Nat -> String] = [];
  memberships : [Nat, String]   = [];
}

The usernames field/variable is a map that associates each numeric user id with its corresponding username (which is a string, of course), and memberships is a binary relation that keeps track of which chat groups each user has joined, with users identified by their numeric id and groups by their name. Here's a sample value for the state of (an instance of) the Logins1 automaton:

( usernames: [
    0 -> "tom",
    1 -> "sara",
    2 -> "betty",
    3 -> "luke",
    4 -> "clark"
  ],
  memberships: [
    0, "football";
    0, "politics";
    2, "politics";
    3, "football";
    4, "football"
  ]
)

Alternatively, the normal relation variables usernames and memberships could be turned into mutable relation variables, as shown here:

schema Logins2 {
  usernames(Nat, String) [key: 0];
  memberships(Nat, String);
}

What's the difference between Logins1 and Logins2? With regards to the set of possible states they can assume, they're identical: every valid state for Logins1 is also a valid state for Logins2, and vice versa. The difference lies in the operations you're allowed to perform on the relation variables usernames and memberships. With Logins1 you can use them as ordinary variables, whose value can be read and passed around, and you can assign a new value to them inside a message handler, subject to a number of restriction we'll talk about later. But you cannot imperatively update the values they point to, as explained earlier. With Logins2, on the other hand, those relations can be efficiently updated in place, although of course only inside a message handler and with all the restrictions that apply. You can insert new tuples and update or delete existing ones, just like you would do with SQL in a relational database. The price to pay for such a priviledge is that mutable relation variables cannot be aliased. You cannot copy them, pass them to other functions, return them from methods, or use them to build larger composite values. The following expressions or statements, for example, are all rejected by the compiler in the case of Logins2, but are perfectly valid if applied to Logins1:

// Creating a composite value
(usernames, memberships)

// Passing a relation variable to a function
merge(usernames, other_usernames)

// Copying a relation variable
usernames_copy = usernames;

// Returning a relation variable from a method
return memberships;

All you can do with mutable relation variables, apart from inserting, updating and deleting tuples, is lookups, searches, linear scans and retrieving the size of the entire relation or a projection of it:

// Looking up a user's name given their numeric id
usernames(id, !!)

// Syntactic sugared version of usernames(id, !!)
usernames(id)

// Looking up a user's numeric id give their name
usernames(!!, name)

// Returns true if there's a user whose numeric identifier
// is <id> and whose username is <name>, false otherwise
usernames(id, name)

// Returns true if there's a user whose
// numeric id is <id>, false otherwise
usernames(id, *)

// Returns true if there's a user whose username
// is <name>, false otherwise
usernames(*, name)

// Retrieving all groups a given user has joined
[g : g <- memberships(id, ?)]

// Retrieving the ids of all users who have joined a given group
[id : id <- memberships(?, group_name)]

// Making a copy of the entire relation
[u, g : u, g <- memberships]

// Number of logged-in users
|usernames|

// Number of users who joined a given group
|memberships(?, group_name)|

// Number of groups a user has joined
|memberships(id, ?)|

The result of all these expressions is an ordinary value, which cannot be updated in place but can otherwise be manipulated without restrictions. In particular, if you need to copy the content of a mutable relation variable elsewhere you need to make a physical copy of it, as show in the last of the above expressions.

You can also iterate through a mutable relation variable or a projection of it using a for loop:

// Iterating through all groups a given user has joined
for g <- memberships(u, ?):
  ...
;

// Iterating through the ids of all users who have joined a given group
for u <- memberships(?, g):
  ...
;

// Iterating through the entire relation
for u, g <- memberships:
  ...
;

Just like normal relations, mutable relation variables can be either unary, binary or ternary. Each type has its own set of use cases: unary ones are typically used to state the existence of a given entity, or to encode boolean attributes; binary ones to store entity attributes (both mandatory and optional, and single-valued or multi-valued) or relationships; and ternary ones for attributes of relationships (as opposed to of entities). We'll see plenty of example later.

Keys

The [key: 0] annotation to the declaration of usernames in Logins2 simply declares that the first column is a key for the relation, that is, that no two tuples in usernames can have the same value for the left argument. This is the same as saying that the usernames relation is actually a map, just like it is in Login1. But in this case we can do better: not only numeric identifiers but also usernames should be unique. This can be enforced by declaring a second key for usernames as shown here:

schema Logins {
  usernames(Nat, String) [key: 0, key: 1];
  memberships(Nat, String);
}

Now usernames is a bidirectional map: every numeric id is associated to a single username and vice versa. The ability to declare multiple keys is only available for mutable relation variables and not for ordinary types, at least for now. Ternary relations can also have composite keys: we'll see examples of that later.

Foreign keys

The second main type of integrity constraints that is common to pretty much all flavors of the relational model is foreign keys. Here's an example:

schema BooksAndAuthors {
  // Book ids and titles
  books(Nat, String) [key: 0];

  // Author ids and names
  authors(Nat, String) [key: 0];

  // The first argument is the id of a book, the second one that
  // of the author that wrote it, or one of the authors that did.
  // It's a many-to-many relationships, because an authors
  // generally writes many books and a book may be written
  // by more than one author
  written_by(Nat, Nat);

  // Foreign keys that ensure that written_by can
  // only reference valid books and authors
  written_by(b, a) -> books(b, _), authors(a, _);

  // Foreign key that ensure that for every book
  // there's at least one author
  books(b, _) -> written_by(b, _);
}

The last two lines in the definition of BooksAndAuthors are foreign key declarations. The first one states that an entry b, a can appear in written_by if and only if there's an entry in books whose left value is b and another entry in authors whose left value is a. Using foreign keys to guarantee that all references to a certain entity or relationship are valid is something that we're all already familiar with, since it's standard practice in the design of database schemas.

In Cell though foreign keys are also used for a second purpose: to make attributes mandatory. The second foreign key declaration in BooksAndAuthors is an example of that: it states that in order to insert a book in our dataset we must also insert at least one author for it. In a sense this second foreign key is the inverse of (part of) the first one: for every entry of the form b, _ in written_by we must have a corresponding entry of the form b, _ in books and vice versa. Together the two foreign keys basically form a single "bidirectional" key between the first column of books and the first column of written_by.

This second use of foreign keys is not something that is common in SQL databases, because there the identifier of an entity or relationship and all its single-valued attributes are stored in the same table, and therefore you're forced, when you insert a new entity or relationship, to also insert all its non-nullable attributes, or your insertion command will fail.

In Cell, on the other hand, each attribute is stored in a separate relation, and without all the appropriate foreign keys in place, you could insert a new entity or relationship without all its mandatory attributes. Therefore a typical Cell schema will contain many more foreign keys declarations than a typical SQL schema. Fortunately, the vast majority of them will be declared implicitly when using the syntactic sugar that we'll examine in the next chapter.

Here's a complete list of all types of foreign keys that are currently supported in Cell:

// Unary to unary
unary_rel_1(x) -> unary_rel_2(x);

// Binary to unary
binary_rel(x, _) -> unary_rel(x);
binary_rel(_, y) -> unary_rel(y);

// Ternary to unary
ternary_rel(x, _, _) -> unary_rel(x);
ternary_rel(_, y, _) -> unary_rel(y);
ternary_rel(_, _, z) -> unary_rel(z);

// Ternary to binary
ternary_rel(x, y, _) -> binary_rel(x, y);

// Unary to binary
unary_rel(x) -> binary_rel(x, _);
unary_rel(y) -> binary_rel(_, y);

// Unary to ternary
unary_rel(x) -> ternary_rel(x, _, _);
unary_rel(y) -> ternary_rel(_, y, _);
unary_rel(z) -> ternary_rel(_, _, z);

// Binary to ternary
binary_rel(x, y) -> ternary_rel(x, y, _);

If two foreign keys originate from the same relation you can combine their declarations. This declaration, for example:

binary_rel(x, y) -> unary_rel_1(x), unary_rel_2(y);

is equivalent to:

binary_rel(x, _) -> unary_rel_1(x);
binary_rel(_, y) -> unary_rel_2(y);

Some combinations of foreign keys are checked at runtime, while others can be checked at compile time. We'll say more about that in another chapter.

Polymorphic mutable relation variables

In the same automaton you can declare multiple mutable relation variables with the same name. Here's an example:

type UserId   = user_id(Nat);
type GroupId  = user_id(Nat);

schema UsersAndGroups {
  user(UserId);                  // Ids of all registered user
  name(UserId, String) [key: 0]; // User's names

  user(u) -> name(u, _);
  name(u, _) -> user(u);

  group(GroupId);                 // Ids of all groups
  name(GroupId, String) [key: 0]; // Group's names

  group(g) -> name(g, _);
  name(g, _) -> group(g);
}

The mutable relation variable name in UsersAndGroups is polymorphic, as name is an attribute of both users and groups. The first instance of name stores the names of the users, and the second one the names of the groups. Polymorphic relation variables are particularly useful in the presence of inheritance hierarchies: we'll see plenty of examples later.

Polymorphic mutable relation variables are subject to a number of constraints: for starters, they all must have the same arity, and their signatures have to be disjoint. They also must have the same set of keys, and for each key the types of the columns that form it must also be disjoint. name for example has a key on the first column, and that's fine because the types of such column (UserId and GroupId) are disjoint. But if you placed a key on the second column of both, like so:

name(UserId,  String) [key: 0, key: 1];
name(GroupId, String) [key: 0, key: 1];

or so:

name(UserId,  String) [key: 1];
name(GroupId, String) [key: 1];

the compiler would reject the code, because the type of the second column, String, is the same in both cases. That's because the meaning of such a key would be ambiguous: should the values in the second column unique across the entire relation, or only within each instance? In other words, should a user and a group be allowed to have the same name? In this case the answer is obviously yes, but in other circumstances (for example when dealing with inheritance hierarchies) but we would probably want the opposite. The compiler avoids the ambiguity by rejecting such a key entirely, at least for now.

In order to keep the implementation of relational automata reasonably simple and efficient, the compiler also places some restrictions on the combined use of polymorphic relations and foreign keys. We won't explain the details here, both because they're extremely convoluted and because they're still a work in progress. This problem should arise only in very unusual circumstances, but if you come across one of such cases, the compiler will tell you exactly which foreign keys you should forgo, even though it won't be able to explain exactly why.

Symmetric relations

Sometimes entities of a given type are linked together by relationships that are intrinsically symmetric. An obvious example is online friendship between user of chat apps or some social networks. Encoding a symmetric relationship with a non-symmetric relation creates a number of problems. Let's extend the UsersAndGroups schema above so that we can keep track of which users are online friends and since when:

schema UsersAndGroups {
  ... // Same as before

  are_friends(UserId, UserId);
  are_friends(u1, u2) -> user(u1), user(u2), friends_since(u1, u2, _);

  friends_since(UserId, UserId, Date);
  friends_since(u1, u2, _) -> are_friends(u1, u2);
}

We've now several different ways to encode the same set of facts. These two values for are_friends are conceptually equivalent:

[ user_id(0), user_id(1);
  user_id(0), user_id(2);
  user_id(2), user_id(3)
]

[ user_id(1), user_id(0);
  user_id(2), user_id(0);
  user_id(2), user_id(3);
  user_id(3), user_id(2)
]

but the fact that they can be represented differently creates a number of problems. If you want to check if two given users are friends, you'll have to check both possibilities (u1, u2 and u2, u1), because that specific fact could be encoded in either way. The same would happen when retrieving all friends of a given user, or when deleting a friendship from the dataset. The problem is even worse for attributes of a symmetric relationship. This, for example, is a legal value for friends_since:

[ user_id(0), user_id(1), date(12, 4, 2017);
  user_id(1), user_id(0), date( 7, 2, 2018)
]

but it's clearly inconsistent, because you've two different start dates for the same friendships. In order to avoid these problems, Cell provides symmetric relations. Here's how you can use them in UsersAndGroups:

schema UsersAndGroups {
  ... // Same as before

  // Symmetric binary relation. The two arguments are equivalent
  are_friends(UserId | UserId);
  are_friends(u1, u2) -> user(u1), friends_since(u1, u2, _);

  // Partially symmetric ternary relation. The first two arguments
  // are equivalent, but the third one is distinct
  friends_since(UserId | UserId, Date);
  friends_since(u1, u2, _) -> are_friends(u1, u2);
}

The column that separates the two arguments of are_friends marks it as a symmetric relation. Now it doesn't matter what order the two arguments are inserted in: the compiler takes care of checking both possibilities whenever you search the relation, or when you delete data from it. The same happens with the first two arguments of friends_since. Even the foreign keys are simpler now. These two foreign keys are now equivalent, so you need to declare only one of them:

are_friends(u, _) -> user(u);
are_friends(_, u) -> user(u);

The compiler expects the arguments of a symmetric binary relation to be of the same type. The same goes for the first two arguments of a partially symmetric ternary relation.

Performance

A naive implementation of this data model is, of course, not going to perform particularly well. Luckily, that's not the case for a more optimized one. Mutable binary relations can actually be implemented very efficiently, especially if they have a key, in terms of both speed and memory footprint. There are trade-offs to be made between access speed, update speed and memory consumption for mutable ternary relations, and their implementation is more complex, but they too can be made to perform just as well as informationally equivalent, hand-optimized data structures. And an efficient implementation of mutable unary relations/sets is of course trivial. This is made possible by (among other things) the strict separation that we have in Cell between the logical definition of a data structure and its physical, under-the-hood implementation. We won't go into detail yet, since the implementation of those low-level data structures is not complete yet (it will be completed with the next release of the compiler, 0.4), except to say that the final data structures will essentially be a more general version of those used in the Entity/Component architecture adopted by many videogames.