Frequently Asked Questions

Why yet another programming language?

Because there's currently no high-level programming language that I'm aware of for writing stateful software. That's the short answer.

To elaborate a bit on that, we need to discuss separately the three main types of what we generically refer to as "computation": pure computation, state and updates, and I/O.

Pure computation is the process of computing a result from some input data. That's what pure functions do. Some types of software, like compilers or scripts that perform scientific calculations, do hardly anything else.
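As a minimal illustration (in Python rather than Cell, since the idea is language-independent), a pure function computes a result from its inputs and does nothing else:

```python
from math import pi

# A pure function: its result depends only on its arguments,
# and calling it has no side effects. Calling it twice with
# the same input always yields the same result.
def ring_area(outer, inner):
    return pi * (outer * outer - inner * inner)
```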

But in many cases pure computation is only a (relatively small) part of what software does. Most applications are stateful: they sit there, idle or semi-idle, and wait for some sort of input from the external world. When that happens they change their state in response (and usually also perform some I/O). Think, for example, of desktop, web or mobile applications, which respond to user input, or some types of server applications, which respond to requests issued by the client across the network.

Finally, there's I/O, which is what happens when a piece of software interacts with the external world.

For pure computation, there are certainly a number of languages that can be regarded as high-level, at least in some specific niches. Languages like Matlab and Mathematica, for example, are very effective for things like numerical and scientific computation, and functional languages like Haskell do a decent enough job (although there's certainly a lot of room for improvement there) when implementing things like parsers, compilers and other types of data-transformation software.

But when it comes to managing state, we're still living in the stone age. The programming paradigm that has been by far the most popular for the last twenty years or so, OOP, is a complete disaster, hampered by, among (many) other things, a very low-level way of representing information, one that makes data difficult to inspect, manipulate and update, with the result that code ends up being far more complex than it should be, and much less reliable. Functional programming's sins, on the other hand, are mainly of omission: in its pure form it simply doesn't have a satisfactory way to deal with state, which I suspect is the main reason it never expanded outside a few niches. Impure functional programming is probably the best thing we have at the moment, but it's still far from ideal.

What's worse, innovation seems to be dead. With only one (admittedly important) exception I'm aware of, there's been no real progress towards higher-level languages for this type of software for the last 30 years or so.

The situation is even worse, if possible, for I/O. Even tasks that could be carried out automatically in any kind of sane programming language/environment often end up being a huge drag on developers. Think for example of how much effort goes into writing code that only moves data into and out of databases. Or how much more complex it is to implement a client/server application, compared to a local application with similar functionality, and how tedious and time-consuming it often is to implement even trivial things like sending data from the client to the server and vice versa. Yet the majority of these tasks could be performed automatically by the compiler under the hood, if one starts with a properly designed language and network architecture.

So while there's certainly a lot that could be done to improve the ability of today's programming languages to perform pure computation, it's rather obvious that the low-hanging fruit in language design is in the other two areas, state and I/O.

Cell is not about pure computation, and it's a rather conventional language in that regard: it's basically a combination of ideas from Matlab and functional programming. Its aim instead is to improve the way state is managed, and to enable the compiler and runtime environment to automatically perform many I/O tasks that are managed by the developer in conventional languages.

How does Cell improve on existing languages?

The single most important improvement is in the way information is represented. Cell's data model combines a staple of functional programming, algebraic data types, with relations and other ideas from relational databases. That alone goes a long way toward simplifying (drastically, in some cases) both the data structures used to encode the state of an application and the code that manipulates them. The data model and the type system are also entirely structural, which brings a number of benefits that are discussed elsewhere, and most importantly enables one of the language's major features: orthogonal persistence.

There's also a strict separation between pure computation and state updates, and the developer is encouraged to shift as much complexity as possible into the purely functional part of the code, which is intrinsically less problematic to deal with. That provides many of the benefits of functional programming, while avoiding the difficulties the latter paradigm has in dealing with state. That's made more effective by the fact that the state of the application can be partitioned into separate components (called automata) that do not share any mutable state and can be safely updated concurrently. Future versions of the language will provide additional tools that will help simplify the part of the codebase that mutates the application's internal state.

The combination of all the above properties, plus the fact that I/O is also separate from pure computation and updates, enables a number of features that are not found in conventional languages.

The first one is the ability to "replay" the execution of a Cell program. One can easily reconstruct the exact state of a Cell program at any point in time. Obviously that's very useful for debugging, but there are other, more interesting ways of taking advantage of it. It's for example possible to run identical replicas of the same Cell process on different machines over a network, and that's something that will be crucial in Cell's future network architecture.

The second one is orthogonal persistence: the ability to take a snapshot of the state of the application (or part of it), which can later be used to create, with minimal effort, an identical copy that will behave in exactly the same way. In Cell, you don't save your data the way you do in conventional languages. Instead, you simply declare which parts of your application's data structures have to be persisted, and the compiler generates for you all the code needed to save and load them.
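In a conventional language, that save/load code has to be written by hand. The following Python sketch (all names here are hypothetical, and this is only an approximation of the effect, not Cell's actual mechanism) shows the kind of boilerplate that Cell's compiler is meant to generate automatically:

```python
import json

# Hand-written persistence code of the kind Cell generates for you.
class Counter:
    def __init__(self, value=0):
        self.value = value

    def snapshot(self):
        # Serialize the persisted part of the state.
        return json.dumps({"value": self.value})

    @classmethod
    def restore(cls, data):
        # Rebuild an identical copy from a snapshot.
        return cls(json.loads(data)["value"])

c = Counter(41)
clone = Counter.restore(c.snapshot())
assert clone.value == 41  # the copy behaves exactly like the original
```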

The third one is a very robust error-handling mechanism, based on transactions. If an exception is thrown during an update, or if your code violates any of the integrity constraints that are placed on your data model, the entire update is simply rolled back and discarded, and the state of your application is left untouched.
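The effect can be approximated in Python (a sketch of the idea, not Cell's actual implementation): run the update on a copy of the state, and commit the copy only if no exception is thrown and all integrity checks pass.

```python
import copy

def transactional_update(state, update, check):
    # Work on a copy; commit only if the update succeeds
    # and the integrity constraints still hold.
    candidate = copy.deepcopy(state)
    try:
        update(candidate)
        if not check(candidate):
            return state   # constraint violated: roll back
        return candidate   # commit
    except Exception:
        return state       # exception thrown: roll back

balances = {"alice": 100, "bob": 50}
no_negatives = lambda s: all(v >= 0 for v in s.values())

def withdraw_200(s):
    s["alice"] -= 200

# The update would leave alice's balance negative, so it is discarded
# and the original state is left untouched.
new = transactional_update(balances, withdraw_200, no_negatives)
assert new == {"alice": 100, "bob": 50}
```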

Then there's support for reactive programming, which is still in its infancy. At the moment all the language provides is reactive automata, which can be used to implement certain specific types of reactive software, but reactive programming itself is something much more general than that, and a lot more is coming in the future.

The last major part of the language will be its network model/architecture, which hasn't been implemented yet.

What is functional/relational programming?

The term "functional/relational programming" was coined (I believe) in a paper that came out more than a decade ago, Out of the Tar Pit. It refers to a programming paradigm that combines functional programming with some version of the relational model (in the case of Cell, one that is very different from the flavor used in SQL databases), with the former providing the general purpose programming capabilities, and the latter the ability to encode, navigate, query and update large and complex application states or datasets.

The starting point of functional/relational programming is the observation that the primary source of complexity and unreliability in our programs is the presence of way too much mutable state. So in order to address that problem, the first step is to avoid state altogether whenever possible. That means doing all pure computation using the functional part of the language, and clearly separating pure code from code that has side effects.

The next step is to identify the parts of an application's state that are redundant, and to get rid of them. A piece of information is redundant if it can be derived from other pieces of information that are already present in the system, or, in other words, if it can always be reconstructed after having been deleted. Once you've removed all redundant information, what you're left with is the essential state: at this point, there's nothing else you can remove without permanently losing information. That leaves you with a much smaller state, which is therefore a lot easier to deal with. Then you need to design the data structures that encode the essential state so as to avoid invalid/inconsistent states as much as possible. These are not new ideas: they have been conventional wisdom in the database community for nearly half a century.
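A tiny Python example of the distinction (names made up for illustration): a running total is redundant state, because it can always be recomputed from the individual orders, so only the orders themselves belong in the essential state.

```python
# Essential state: the individual order amounts.
orders = [("o1", 30.0), ("o2", 45.5), ("o3", 24.5)]

# Redundant state: the total can always be reconstructed from the
# orders, so it should be derived on demand rather than stored
# (where it could silently drift out of sync with the orders).
def total(orders):
    return sum(amount for _, amount in orders)

assert total(orders) == 100.0
```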

For this latter step to be possible, you need a high-level data model, and here the relational model, in one form or another, is the only game in town. A high-level data model also has the added benefit of making information easier to search, navigate and manipulate.
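The core idea can be sketched in Python (this is not Cell syntax, and the relation and attribute names are made up): each relation is just a set of tuples, and queries become plain comprehensions over those sets.

```python
# A relational encoding of a tiny dataset: each relation is
# simply a set of tuples.
authored = {("Tolkien", "The Hobbit"),
            ("Tolkien", "The Silmarillion"),
            ("Herbert", "Dune")}

published = {("The Hobbit", 1937),
             ("The Silmarillion", 1977),
             ("Dune", 1965)}

# A query is a set comprehension: Tolkien's books, with their year.
tolkien_books = {(book, year)
                 for (author, book) in authored
                 for (book2, year) in published
                 if author == "Tolkien" and book == book2}

assert tolkien_books == {("The Hobbit", 1937), ("The Silmarillion", 1977)}
```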

Why is data/state representation so important?

This is the kind of thing that is difficult to explain in the abstract, but which becomes obvious after you've implemented a nontrivial piece of software with a language that provides a high-level data model.

But as an example, compare the mathematical definition of a set with its implementation in an imperative or even functional language. A set in mathematics is an extremely simple entity: it's just a collection of elements with no order and no duplicates. Operations like union, intersection, difference or Cartesian product can be defined in a single very short line.

Now compare that to an implementation in any imperative or even functional language that uses, say, red-black trees. Those data structures contain a lot of low-level "information" that has nothing to do with the abstract notion of set, and are subject to complex invariants that have to be carefully maintained every time they are updated. A typical implementation in an imperative language consists of around a thousand lines of code, and it's very difficult to get right.

And while red-black trees may be an especially complex example, the data structures used to encode the state of any imperative (or, to a lesser extent, functional) program are every bit as low-level and suffer from the same problems, like the presence of an overwhelming amount of low-level noise and complex, hard-to-maintain dependencies between their components, only on a much larger scale.
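When the language provides sets as a high-level value type, each of those operations really is a one-liner that mirrors the mathematical definition; a Python illustration:

```python
a, b = {1, 2, 3}, {3, 4}

# Each operation is a single comprehension, written exactly
# the way it is defined mathematically.
union        = {x for s in (a, b) for x in s}
intersection = {x for x in a if x in b}
difference   = {x for x in a if x not in b}
product      = {(x, y) for x in a for y in b}

assert union == {1, 2, 3, 4}
assert intersection == {3}
assert difference == {1, 2}
assert len(product) == 6
```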

What is reactive programming?

I'm not sure there's a generally agreed upon definition, as reactive programming has been implemented in many different forms. One thing they all have in common, though, is the automatic propagation of change. It is designed to solve one specific problem: very often, in a stateful application, you end up having two (or more) variables or, more generally, pieces of the application's state (let's call them A and B in what follows) that depend on each other, in the sense that whenever one of them changes, the other has to change as well.

The simplest form of dependency is when one of the two can be derived from the other, that is, B = f(A), where f(..) is a pure function. One large-scale example of this is the Elm Architecture (which is basically a functional version of the Model/View/Controller architecture), where A is the model, B the HTML code that describes the (current state of the) UI, and f(..) is the view function. Every time the model changes, the view function is automatically recalculated and the UI is refreshed.
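Stripped to its essentials, that pattern looks like the following Python sketch (this is not Elm or Cell code, and the names are made up): the view is just a pure function of the model, re-run whenever the model changes.

```python
# B = f(A): the view is a pure function of the model.
def view(model):
    return f"<p>count = {model['count']}</p>"

model = {"count": 0}
html = view(model)    # initial render

model["count"] = 1
html = view(model)    # a change to the model triggers a re-render

assert html == "<p>count = 1</p>"
```

In a real reactive system the re-run happens automatically; here it is invoked by hand to keep the example minimal.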

In less trivial cases the dependency between A and B may take the form B' = f(A, B) (with B and B' being the old and new values of B respectively), that is, every time A changes, the value of B is updated but not entirely reset. As an example, say that you're implementing a software thermostat that controls your air conditioner. You want to switch it on when the temperature exceeds 28°C, and to switch it off when it goes back below 24°C. The state of the air conditioner is not entirely determined by the current temperature, because between 24°C and 28°C it could be either on or off, depending on the history of the temperature signal/variable. Here's the Cell code for that:

reactive Thermostat {
  input:
    temperature: Float;

  output:
    on: Bool;

  state:
    // When the system is initialized, on is true if
    // and only if the current temperature exceeds 28°C
    on: Bool = temperature > 28.0;

  rules:
    // Switching on the air conditioner when
    // the temperature exceeds 28°C
    on = true when temperature > 28.0;

    // Switching it off when it falls below 24°C
    on = false when temperature < 24.0;
}
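For comparison, here's the same hysteresis logic written by hand in Python, where the propagation of change has to be made explicit:

```python
class Thermostat:
    def __init__(self, temperature):
        # On initialization, on is true iff the temperature exceeds 28°C.
        self.on = temperature > 28.0

    def set_temperature(self, temperature):
        # Propagate the change by hand: switch on above 28°C,
        # off below 24°C, and keep the previous state in between.
        if temperature > 28.0:
            self.on = True
        elif temperature < 24.0:
            self.on = False

t = Thermostat(25.0)
assert not t.on
t.set_temperature(29.0)
assert t.on
t.set_temperature(26.0)
assert t.on        # between 24°C and 28°C the old state is kept
t.set_temperature(23.0)
assert not t.on
```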

These dependencies can become a lot more complex than this, of course. The dependency between A and B can, for example, be mutual, with a change to A triggering an update/recalculation of B and vice versa. This is not something the reactive layer of Cell is equipped to deal with at the moment, but this kind of situation will arise in the future network architecture, and the language will need special constructs to deal with it.

Note that reactive programming is not in and of itself a general-purpose programming paradigm, but rather a set of capabilities that augment an existing paradigm. It works a lot better when the paradigm it's applied to is a functional or declarative one: in languages with side effects and pointer aliasing it becomes impossible to figure out statically when a variable has changed, and in what order the recalculations should be done. To compensate for that, reactive frameworks for imperative languages have to introduce a number of restrictions and/or resort to baroque architectures that make the code significantly more complex, and obscure the much simpler underlying logic.

If you want to learn more, check out the Wikipedia pages on reactive and functional reactive programming.

Are functional/relational programming and relational automata only for CRUD applications?

Absolutely not, although you might be excused for getting that impression.

The vast majority of developers regard the relational model as some sort of legacy, obsolete technology, and believe that modeling a given problem or domain in terms of objects/classes is somehow superior to modeling it using relations. In fact, that's completely backwards. Objects are a clumsy and very low-level technology, and I'm not aware of any type of information that cannot be modeled much better with the relational model than with objects (Why relations are better than objects explains this claim in more detail).

In all likelihood, the relational model gets its bad rap from the fact that the only implementation of it developers are familiar with is the one found in SQL databases. That particular implementation was very lame to begin with and has not really evolved since its initial design nearly half a century ago. On top of that, it was designed specifically for databases, and it's not well-suited for general purpose programming. But all those flaws and limitations are specific to a particular implementation, not the relational model itself.

The best way to illustrate the advantages of functional/relational programming over OOP is through a case study. Once version 0.5 is out and the basic set of features of the language is finally complete, we'll be comparing the implementations of a number of small but realistic applications in the two paradigms (for problems that are universally regarded as especially well-suited to OOP), and that will show how much shorter, simpler and more elegant the functional/relational implementation can be, even when such a comparison is made on OOP's home turf.