The Cell Programming Language

Imperative code

Functions can also be written using a procedural syntax. The following function returns the number of non-zero elements in a sequence of integers:

Int count_non_zero(Int* xs) {
  count = 0;
  for x <- xs
    if x != 0
      count = count + 1;
  return count;
}

When using the procedural syntax, the body of the function is, instead of an expression, just a number of statements written between braces. Here we see four types of statements: assignments, returns, conditionals and for loops. The body of a composite statements is enclosed in braces, unless it consists of a single statements, in which case the braces are optional. Simple if statements can also written in postfix notation: the count_non_zero(..) function can be rewritten as follows:

Int count_non_zero(Int* xs) {
  count = 0;
  for x <- xs
    count = count + 1 if x != 0;
  return count;
}

The for loop iterates through the sequence xs storing the current element in x. For loops can make use of index variables and tuple destructuring just like sequence comprehension. The following function for example takes a sequence of integer pairs, and returns the index of the first pair whose elements are equal, wrapped as a maybe type:

Maybe[Nat] index_of_first_equal_pair((Int, Int)* ps) {
  for x, y @ i <- ps
    return :just(i) if x == y;
  return :nothing;
}

You can also iterate over sets and relations, but in that case the iteration order is implementation-defined and the index of the element/tuple is not available, since those types of collections are not ordered. The syntax is also slightly different. This is the implementation of the polymorphic function accumulate(..) that we used in the previous chapter:

Int accumulate([T] s, (T -> Int) f) {
  sum = 0;
  for e <- s
    sum = sum + f(e);
  return sum;
}

Int accumulate([+T, T] r, (T -> Int) f) {
  sum = 0;
  for a b <- r // No comma between a and b in this case
    sum = sum + f(a) + f(b);
  return sum;
}

Int accumulate([+T, T, T] r, (T -> Int) f) {
  sum = 0;
  for a b c <- r // No comma between a, b and c
    sum = sum + f(a) + f(b) + f(c);
  return sum;
}

If you don't need one of the loop variables, you can replace it with an underscore, just like in comprehension expressions:

Int sum_left((Int, Any)* ps) {
  sum = 0;
  for n, _ <- ps
    sum = sum + n;
  return sum;
}

Float right_col_prod([Any, Float] r) {
  prod = 1.0;
  for _ x <- r
    prod = prod * x;
  return prod;
}

All the above for loops are actually "foreach" loops: they traverse a collection and execute the body for each of its elements or entries. You can also have standard "for" loops, which iterate over a range of integer. Here's how you can define a Haskell-style foldr(..) function:

B foldr(A* xs, B y0, (A B -> B) f) {
  y = y0;
  len = |xs|;
  for i < len {
    x = xs(len-i-1);
    y = f(x, y);
  }
  return y;
}

The body of the for loop above is executed len times, with the value of the loop variable i going from 0 to len-1. It's basically the equivalent of for (int i=0 ; i < len ; i++) in C or Java. The for loop has a number of different forms, listed here with their C/Java equivalent:

for i <= count {    // for (i=0 ; i <= count ; i++) {
  ...               //   ...
}                   // }

for i = m..n {      // for (i=m ; i < n ; i++) {
  ...               //   ...
}                   // }

for i = m...n {     // for (i=m ; i <= n ; i++) {
  ...               //   ...
}                   // }

Both "for" and "foreach" loops can be combined: this is a (lame and very inefficient) function that looks for duplicates in a sequence:

T* duplicates(T* s) {
  ds = ();
  for x @ i <- s; j = i+1..|s| {
    y = s(j);
    if x == y {
      ds = (ds | x);
      break;
    }
  }
  return ds;
}

It's completely equivalent to this second version, which doesn't combine loops:

T* duplicates(T* s) {
  ds = ();
  for x @ i <- s
    for j = i+1..|s| {
      y = s(j);
      if x == y {
        ds = (ds | x);
        break;
      }
    }
  return ds;
}

Any number of loops can combined, not just two. Note that in both cases, the break statement only terminates the innermost loops.

While loops behave as expected. This is a function that merges two sequences of sorted integers into a single sorted sequence:

Int* merge(Int* xs, Int* ys) {
  zs = ();
  i = 0;
  j = 0;
  while i < |xs| or j < |ys| {
    if i == |xs| or (j < |ys| and ys(j) < xs(i)) {
      zs = (zs | ys(j));
      j = j + 1;
    }
    else {
      zs = (zs | xs(i));
      i = i + 1;
    }
  }
  return zs;
}

There's also special syntax for infinite loops, that have to be explicitly terminated by a break or a return. The following function repeatedly applies a function to a value until it reaches a fixpoint:

T fixpoint(T x, (T -> T) f) {
  last = x;
  loop {
    next = f(last);
    return next if next == last;
  }
}

The fail statement is for statements what undefined is for expressions: it throws an exception. It doesn't take any argument. The following function returns the index of the first occurrence of an element in the sequence, or fails if the sequence doesn't contain such an element:

Nat index_first(Any e, Seq s) {
  for x @ i <- s
    return i if x == e;
  fail;
}

The assert and print statements are used for debugging. They do what you would expect: the first one evaluates a boolean expression and fails if it's false, the second one just prints the value of any given expression using the standard textual representation of values in Cell. Any value printed in such way can be copied and pasted directly inside Cell source code, but it usually has to be enclosed between #{...}. A few examples:

assert i > 0;
assert p(x);

print f(x, y);
print [g(x) : x <- xs];

Whenever an assert statement fails, it also prints out or writes to a file the entire call stack, including the values of all function call arguments, and the value of all variables in scope.

Assignments can also be used to unpack tuples:

(B, A) swap((A, B) p) {
  a, b = p;
  return (b, a);
}

That comes in handy when you need a function to return multiple values:

// type RandGen = ...
// (Int, RandGen) random_int(RandGen g) = ...

(Int*, RandGen) random_int_seq(RandGen init_gen, Nat length) {
  r = ();
  g = init_gen;
  for i < length
    x, g = random_int(g);
  return (r, g);
}

Here too you can replace variables you don't use with an underscore:

A left((A, B) pair) {
  a, _ = pair;
  return a;
}

Just another type of expressions

A sequence of statements enclosed in braces is an expression in its own right, and can appear anywhere an expression is expected. Here's an example:

Int* h(Int* xs) = ({
  y = f(x);
  return if y >= 0 then y else g(y);
} : x <- xs);

Note that the return statement doesn't cause the entire function to return, it just terminates (and provides a value for) the enclosing procedural block. The above function could have just been written, with less clarity and efficiency, like this:

Int* h(Int* xs) = (if f(x) >= 0 then f(x) else g(f(x)) : x <- xs);

The syntax for "procedural" functions we've seen so far is just a tiny bit of syntactic sugar. The following two function definitions are completely equivalent, and are represented internally in exactly the same way by the compiler:

Int sum(Int* xs) {
  s = 0;
  for x <- xs
    s = s + x;
  return s;
}

Int sum(Int* xs) = {
  s = 0;
  for x <- xs
    s = s + x;
  return s;
};

Local variables declaration

Normally you don't need to declare the variables used in a procedural block, as the compiler will automatically infer their type. Their are times, though, when the compiler is not (yet?) able to do so, and in those (very rare) cases, you'll have to explicitly declare the type of some variables. We'll say more about that when we get to the topic of typechecking, but this is an example of a function that would not typecheck without the explicit variable declaration:

type List[T] = empty_list, list(T, List[T]);

List[T] seq2list(T* xs) {
  // Explicitly declaring the type of
  // local variable "l" as "List[T]"
  l : List[T];

  l = :empty_list;
  for i < |xs| {
    x = xs(|xs|-i-1);
    l = :list(x, l);
  }
  return l;
}

Local variables declarations can also be useful to make sense of the error messages produced by the typechecker, which can be pretty confusing at times if one relies entirely on automatic type inference, and they can also make the code easier to read when dealing with complex data structures.

Local functions

Inside the body of functions written in a procedural style, you can also define local functions, that is, functions that are visible only inside the top-level function that contains them. Apart from their restricted scope, they are just like any other function. In particular, they have no access to the arguments or the local variables of the outer function. They have to be defined at the bottom of the enclosing block of statements, and are not available when such a block is used as a normal expression. This is yet another way to define the h(..) function above:

Int* h(Int* xs) {
  return (l(x) : x <- xs);

  Int l(Int x) {
    y = f(x);
    return if y >= 0 then y else g(y);
  }
}

Imperative programming

Cell also includes a couple of pseudo-imperative features, as some algorithms are just a lot easier to write in an imperative style. We've already seen the first one, the functional insert operation. It generally takes amortized O(1), with a pretty small constant factor. Here's how it works: every time a new sequence is created, the runtime allocates some extra memory at the end of the array. When you try to insert a new element at the end, the runtime tries to make use of that extra space to store the new element. Existing references to the original sequence are not affected, as each of them stores not just a pointer to the array, but also its own copy of the length of the sequence, so they just ignore the extra elements and keep seeing only the original part of the sequence.

The same thing, by the way, happens with the _slice_(..) builtin function (which is aliased in the standard library by the function slice(..), and used by other standard functions like take(..) or drop(..)): creating a subsequence is fast, because the newly created object only needs to store a pointer to the first element of the subsequence and its length.

Going back to the functional insertion, the runtime needs to make a copy of the original sequence only in two cases: when it runs out of space at the end, in which case the entire sequence is reallocated with extra space, or when the next slot is already taken, likely because it was used by a previous insertion operation. You should take care to avoid the latter case, as much as possible. The following (buggy) function runs in O((N1+N2)^2), where N1 and N2 are the lengths of s1 and s2 respectively, because of the (useless, in this case) unused = (res | x); instruction:

T* concat(T* s1, T* s2) {
  res = s1;
  for x <- s2 {
    unused = (res | x);
    // The following instruction causes the reallocation
    // of the underlying array at every iteration of the
    // loop because the previous (useless) instruction
    // has already allocated the next slot
    res = (res | x);
  }
  return res;
}

Once you remove the faulty instruction, the function runs in O(N1+N2), as you would expect. The & operator, which is used for concatenating sequences and strings, works in a similar way, so if you build a long sequence/string by appending shorter sequences/strings to its right, the whole process runs in linear time, with respect to the length of the resulting sequence. This is how the join(..) function is defined in the standard library:

T* join(T** seqs) {
  res = ();
  for s <- seqs
    res = res & s;
  return res;
}

If you were to concatenate those shorter sequences to the left, on the other hand, performance would be disastrous. An example:

T* very_bad_join(T** seqs) {
  n = |seqs|;
  res = ();
  for i = 1...n {
    s = seqs(n-i);
    res = s & res;
  }
  return res;
}

Imperative update of sequences

In some very specific circumstances, it's also possible to imperatively update sequences, without giving up any of the desirable properties of functional programming. Here's an example:

// reverse_permutation((1, 2, 4, 0, 3)) == (3, 0, 1, 4, 2)
// reverse_permutation((3, 0, 1, 4, 2)) == (1, 2, 4, 0, 3)
Nat* reverse_permutation(Nat* p) {
  rp : Nat*;

  rp = (0 : i < |p|);
  for x @ i <- p
    rp(x) := i;
  return rp;
}

A statement of the form xs(i) := x; is a sequence update: conceptually, it is completely equivalent to the following code (provided of course that there's no existing variable named xj or j in the current scope):

fail if i < 0 or i >= |xs|;
xs = (if j == i then x else xj : xj @ j <- xs);

But that's not how it is implemented. In a function like reverse_permutation(..) it's pretty easy for the compiler to figure out that at the time of the assignment rp is the only reference to that particular sequence, in which case the underlying data structures can be updated destructively, in O(1), without any side effects. The compiler accepts such an assignment only if it's sure that it is semantically equivalent to making a mutated copy of the target sequence.

The implementation of this feature is a work in progress, and at the moment the compiler is unnecessarily restrictive. A sequence can be updated imperatively only if it's a local variable (not a function/method argument) whose type has been explicitly declared. It has to be allocated using either a sequence comprehension expression or a range comprehension one like for example the following ones:

ys = (f(x) : x <- xs);
ys = (g(x, i) : x @ i <- xs, p(x, i));
ys = (f(i) : i < n);

Moreover, the only permitted operations are accessing its elements individually (ys(i)), reading its length (|ys|), appending elements to it (ys = (ys | y);) and of course updating it (ys(i) := ...;). Do anything else, like for example passing it as argument to another function, or using it to create another value, and from that point in the code on, you won't be able to mutate it any more.