Benchmarks

The benchmarks we'll be examining in this chapter are meant to tests the performance of code that updates and queries the state of relational automata, and compare it to equivalent hand-written Java or C# code (Benchmarks that test the functional subset of the language for purely computational tasks are available here).

As we'll see, there are times when the Java code generated by the Cell compiler outperforms the hand-written one. Now, any claims of being able to outperform Java (or any other language) while compiling to it should always been taken with a grain of salt. The Java or C# code used for comparison is written in an object-oriented style using the collection library provided with the language, which is the style of programming those languages were designed for. It can certainly be made faster, for example by writing custom collection classes, or by steering clear of OOP altogether and using data-oriented design techniques instead, which is basically what the Cell compiler does. Doing so will certainly yield code that is consistently faster than anything the Cell compiler could generates, as one could copy all the compiler's tricks, add one's own, and also avoid many of the inefficiencies of the generated code. But that's a bit like fighting the language instead of working with it, and certainly involves a significant additional effort on the part of the programmer.

Compared to garbage-collected, object-oriented languages like Java or C#, Cell has a number of both structural advantages and disadvantages in terms of performance. One major disadvantage is the deferred update model of relational automata: when an instruction that updates a member or mutable relation variable is executed, the update is not applied immediately, but stored in temporary memory until the message handler returns, which obviously involves a certain amount of overhead. A separate but related source of overhead is support for transactions. Another disadvantage is that in Cell, you can (and usually will) specify a set of integrity constraints (that is, keys and foreign keys) on the state of relational automata that need to be checked at runtime. A static analysis of the codebase can remove some of this overhead, but not all of it all the time. A more general problem is that mutability in Cell is subject to a number of restrictions that are bound to negatively affect performance in at least some situations.

On the other hand, relational automata naturally lend themselves to being implemented using techniques borrowed from the Entity/Component architecture and more generally from data-oriented design. One advantage of these architectures is that they vastly reduce the number of cache misses by improving the locality of the data. As we'll see later, they also seem to put less strain on the garbage collector, which is not suprising: having, say, a million instances of a class with five member variables in OOP means having a million distinct blocks of memory to keep track of, but with the E/C architecture you would end up with five arrays of one million elements each plus a hashtable (another array), which are much easier for the garbage collector to deal with.

While the code generated by the Cell compiler is already pretty fast, remember that optimization is still a work in progress: there's a lot left to do and, presumably, at least some margin for improvement. Among other things, the data structures used to implement ternary relations, while reasonably fast, are still too generic and don't take advantage of the specific usage of each individual relation variable and the constraints that are placed on it; many-to-many binary relations seem to have performance problems under deletion that haven't been investigated yet; the generated code still performs many useless integrity checks that could be eliminated by a more sophisticated static analysis; sending a message to an automaton currently involves more overhead that a method call in an imperative language, but in many cases that overhead could be eliminated; and several of the optimizations the compiler already performs would be more effective if combined with aggressive inlining of methods. The effects of these shortcomings are clearly visible in the example we will be examining.

The other thing to consider is that compiling to Java is far from being ideal. No garbage-collected, memory-safe language I know of would be ideal of course, but the lack of lightweight, non-heap-allocated data structures probably makes Java worst than most. Only compiling to C/C++ could fully take advantage, in terms of performance, of Cell's unique design, but the current C++ code generator hasn't been updated since version 0.1, is completely obsolete, uses inefficient data structures that were meant to be temporary, and does not optimize the generated code at all, although it still manages to beat the Java version for some workloads. It will be brought up to date again before the release of the next (0.6) version of the language.

The IMDB movie database

This example will make use of a a subset of the data in the IMDB movie database. The dataset is contained in a number of CSV files of moderate size (~140 MB in total). The various versions of the code will parse and load the dataset into memory, run a number of queries on it and finally update it in various ways. Both the code and the dataset are available on github (the relevant files are main.cell and imdb.cell for Cell, imdb.java for Java and imdb.cs for C#). If you have any questions or comments or if there's anything you would like to see added to this set of benchmarks, just leave a message in the forum.

All tests were run on a dual-core Intel Celeron 1005M running Lubuntu Linux 18.10, using either the Java HotSpot Server VM (version 12.0.1) or the .NET Core SDK (version 2.2.301). We will examine the single-core results first, since all versions of the code are single-threaded, and discuss dual-core performance later.

Each test was run 25 times, but the first 5 iteractions were discarded, so as to give the JVM/CLR time to warm up, and the remaining 20 were averaged. Cell took longer than Java to warm up (it's just more code) and C# took barely any time at all.

Here's the visual representation of the structure of input dataset:

and this is the definition of the relational automaton we'll be using to store it:

schema MoviesDB {
  actor(Actor)
    first_name  : String,
    last_name   : String,
    gender      : Gender;

  movie(Movie)
    name    : String,
    year    : Int,
    rank    : Float,
    genre*  : Genre;

  director(Director)
    director_first_name : String,
    director_last_name  : String;

  acted_in(Actor, Movie)
    role* : String;

  directed(Director, Movie);

  acted_in(a, m) -> actor(a), movie(m);
  directed(d, m) -> director(d), movie(m);

  // More stuff here
  ...
}

Loading the dataset

The first test consists of parsing each of the CSV files and inserting the data into the corresponding relation variables in the case of Cell, or creating and wiring the corresponding set of objects for Java and C#. The results are show in the following table (all times are in milliseconds):

Cell Java C# Cell/Java Cell/C#
movies.csv 402 294 1047 1.36 0.38
actors.csv 1936 2478 3994 0.78 0.48
directors.csv 82 98 92 0.83 0.89
movies_directors.csv 258 208 369 1.24 0.69
movies_genres.csv 233 126 511 1.84 0.45
roles.csv 8136 5096 9150 1.59 0.88
Total 11047 8300 15163 1.33 0.72

The first three columns show the loading times in milliseconds for each of the languages being tested, and the last two show the performance of the Cell code relative to Java and C# respectively (obviously a number greater than one means that Cell is slower, and conversely a number smaller than one that Cell is faster). Each row correspond to a different CSV file, save for the last one which shows the totals. As you can see Cell is generally slower than Java, and consistently faster than C#, but in either case the difference is not huge.

Note that the individual numbers in the above tables are not particularly reliable (only the totals are), because a lot depends on when the garbage collector decides to kick in, which varies depending on how the tests are set up. Since this phase of the test is dominated by object creation, which puts a lot of pressure on the garbage collector, it's actually interesting to see what the numbers would be in an "infinite memory" scenario, which can be approximated by setting (with the -Xms and -Xmx options) the amount of memory available to the JVM to a sufficiently large amount, and by explicitly invoking the garbage collector before loading each individual file (obviously, the times spent doing the preventive garbage collection is not included in the figures shown here):

Cell Java Cell/Java
movies.csv 297 136 2.18
actors.csv 1659 1102 1.50
directors.csv 51 26 1.96
movies_directors.csv 227 141 1.60
movies_genres.csv 184 81 2.27
roles.csv 7857 2247 3.49
Total 10275 3733 2.75

As you can see, with the garbage collector out of the way, Java is consistently and significantly faster than Cell, which is not particularly surprising: object creation is extremely fast in Java, while the data structures used by Cell are more complex, and updates are also slowed down by a number of factors (transactions, integrity checks, etc.) mentioned in the introduction. But much of that advantage is then lost by overloading the garbage collector.

This also shows where Cell lags the most compared to Java: loading the data in roles.csv. The vast majority of that time is spent updating the role ternary relation. As already explained in the introduction, optimization of ternary relations is still a work in progress. Fixing that should be enough to close most of the performance gap between Cell and Java in this particular test.

Updating the dataset

We'll now be performing a number of update operation, that are described here:

  • U1: Increments the rank of all movies released before a given year by a specified amount.
  • U2: For each actor, computes the average rank of the movies they appear in, and stores the result in the actor_rank relation variable or in the avgMoviesRank member variable.
  • U3: For each director, computes the average rank of the movies they directed, and stores the result in the director_rank relation variable or in the avgMoviesRank member variable.
  • U4: Randomly generates a list of movie ids, and for each of them increments their rank and updates the rank of all actors and directors that worked on it.
  • U5: Deletes all movies below a given rank.
  • U6: Deletes all actors that do not appear in at least one of the remaining movies.
  • U7: Deletes all directors that did not work on any of the remaining movies.

The results are shown here:

Cell Java C# Cell/Java Cell/C#
U1 47 36 45 1.30 1.04
U2 256 228 361 1.12 0.70
U3 25 26 29 0.96 0.86
U4 262 348 351 0.75 0.74
U5 2717 1224 1899 2.21 1.43
U6 201 70 66 2.87 3.04
U7 7 11 7 0.63 1.00

As you can see, the performance of all three languages is in the same ballpark, except for updates U5 and U6 where Cell is significantly slower. For U5 what seems to slow down the Cell implementation is deleting from acted_in (a many-to-many binary relation) and role (a ternary relation). Both problems have already been mentioned in the introduction. The causes of the poor performance of U6 haven't been investigated yet.

Queries

Let's now compare the performance of computation that does not update the dataset. We'll be running the following set of queries:

  • Q1: Counting the number of movies whose rank is no less that a given threshold. Repeated for several different values of the threshold.
  • Q2: Counting the number of actors who played in at least one movie whose rank was no less than a given threshold. Repeated for several different values of the threshold.
  • Q3: For each actor, finding all co-actors in movies with a given minimum rank.
  • Q4: Calculating a histogram of movie ages.
  • Q5: Calculating the average age of all movies with a given minimum rank.
  • Q6: Calculating the sum of the ages of all movies in the dataset.
  • Q7: For each movie in a randomly generated set, finding all movies that have actors in common with it.
  • Q8: For each actor in a randomly generated set, retrieving the last names of all actors with the same first name. Duplicates are eliminated.
  • Q9: Same as Q3, but also returning the number of movies they appeared in together.
  • Q10: Same as Q8, but duplicates are not eliminated.
  • Q11: For each director, checking if they ever also worked as actors.
  • Q12: For each actor, computing the full name by concatenating first and last names.

Some of the queries are a bit bizarre, but they actually manage to test all the basic operations on relations: attribute lookups, navigation, searches and linear scans. The results are shown here:

Cell Java C# Cell/Java Cell/C#
Q1 223 438 1142 0.50 0.19
Q2 5705 9328 10174 0.61 0.56
Q3 5149 6135 10191 0.83 0.50
Q4 201 237 530 0.84 0.37
Q5 176 235 526 0.74 0.33
Q6 537 807 1972 0.66 0.27
Q7 3859 4131 4877 0.93 0.79
Q8 4941 3945 5049 1.25 0.97
Q9 3538 1663 3119 2.12 1.13
Q10 4210 2386 3071 1.76 1.37
Q11 360 282 159 1.27 2.26
Q12 511 220 94 2.32 5.43

Cell is significantly slower than Java only for queries Q9, Q10 and Q12. Let's take a look at each of them in turn. Q9 makes use of a map/dictionary which is repeatedly mutated. In Cell maps only support functional updates, which are reasonably fast, but cannot compete with Java's mutable, hashtable-based maps. In Q10, an optimization made by the Cell compiler clashes with the limitation of Java's object model. The same problem would not present itself when compiling to C/C++. In this specific case though, it could be solved with a more sophisticated analysis of the source code. Q12 is a special case: the cause of the bad performance is just a weird problem with string concatenation, that was discovered too late to be fixed in version 0.4. C# is also significantly faster for Q11, but the reasons for that haven't been properly investigated yet.

We can also run the same queries on the much smaller dataset that is the results of applying the updates described in the previous paragraph:

Cell Java C# Cell/Java Cell/C#
Q1 188 368 454 0.51 0.41
Q2 3178 4444 4230 0.71 0.75
Q3 6618 7138 11461 0.92 0.57
Q4 186 203 250 0.91 0.74
Q5 177 196 243 0.9 0.72
Q6 621 557 855 1.11 0.72
Q7 144 142 158 1.01 0.91
Q8 882 1198 1696 0.73 0.52
Q9 2480 1026 2001 2.41 1.23
Q10 793 655 894 1.21 0.88
Q11 85 98 49 0.86 1.73
Q12 303 131 76 2.31 3.98

The numbers vary a bit, but there aren't any big surprises.

Dual-core results

The results we saw in the previous chapter were obtained by running the benchmarks on a single core, using the taskset command. Even though none of the programs is multithreaded, we'll now take a look at what happens when we allow them to take advantage of both cores. Here are the loading times:

Cell Java C# Cell/Java Cell/C#
movies.csv 413 825 395 0.50 1.04
actors.csv 2033 3124 3384 0.65 0.60
directors.csv 68 51 213 1.33 0.31
movies_directors.csv 276 2255 331 0.12 0.83
movies_genres.csv 239 164 249 1.45 0.95
roles.csv 8489 8794 6758 0.96 1.25
Total 11518 15213 11330 0.75 1.01

Cell is slightly slower in this case, and that's just what you would expect from a single-threaded program that is moved from core to core. C# is a lot faster, to the point that it achieves speed parity with Cell. It might be that it's able to run the garbage collector on the second core (as already explained, this part of the test puts a lot of strain on the garbage collector). Java, on the other hands, inexplicably takes nearly twice as long to complete the task, and is now the slowest of the three. This I've no idea what could be causing this behavior (if you know, or can guess, please leave a message on the forum). Let's now check the updates:

Cell Java C# Cell/Java Cell/C#
U1 57 103 46 0.55 1.23
U2 245 403 347 0.60 0.70
U3 25 24 28 1.04 0.89
U4 295 430 359 0.68 0.82
U5 2968 4278 1888 0.69 1.57
U6 265 325 65 0.81 4.07
U7 9 25 7 0.36 1.28

Here again, Cell seems to be a little bit slower, on average. There's no discernible difference for C#. Java, again, is significantly slower on average. Queries seem to follow the same patterns, whether they are run on the entire dataset:

Cell Java C# Cell/Java Cell/C#
Q1 225 1228 1155 0.18 0.19
Q2 5951 13282 9858 0.44 0.60
Q3 5307 8577 10032 0.61 0.52
Q4 199 1000 529 0.19 0.37
Q5 173 941 525 0.18 0.32
Q6 537 1773 1960 0.30 0.27
Q7 3830 5236 4620 0.73 0.82
Q8 4724 4884 4974 0.96 0.94
Q9 3788 2459 3196 1.54 1.18
Q10 4206 3324 2997 1.26 1.40
Q11 307 216 157 1.42 1.95
Q12 507 180 88 2.81 5.76

or the part of it that is left after the updates:

Cell Java C# Cell/Java Cell/C#
Q1 186 666 455 0.27 0.40
Q2 3119 6893 4169 0.45 0.74
Q3 6517 8600 11356 0.75 0.57
Q4 184 397 248 0.46 0.74
Q5 173 354 241 0.48 0.71
Q6 614 654 853 0.93 0.71
Q7 141 185 157 0.76 0.89
Q8 827 1425 1518 0.58 0.54
Q9 2625 1316 1946 1.99 1.34
Q10 795 945 886 0.84 0.89
Q11 71 77 50 0.92 1.42
Q12 287 106 76 2.70 3.77

Embedded mode

Cell is primarily meant to be used as an embedded language, so we'll now take a look at its performance for that specific use case. As one might expect, the generated code can sometimes be slower in embedded mode, because of all the data conversion and validation that takes place at the boundary between handwritten and generated code.

The performance hit varies considerably, and it's impossible to quantify exactly, because it depends on the "surface to volume ratio" of a given method or message handler, that is, the ratio between the amount of data that needs to be converted back and forth and the amount of actual work the method/handler needs to perform.

In order to maximize performance it's important to carefully design the public interface of relational automata in order to minimize the amount of data that has to be transferred between the hand-written and the generated parts of the codebase. Guidelines and examples of how to do so will be published soon, but in practice that's easier than it sounds. That's because, even though it can be used to replace an embedded database, Cell is not a database but a fully-featured programming language, and computation that needs to access the data held by the Cell side of the code can, in the vast majority of cases, be done directly in Cell, without involving the host language. Most of the time, you'll need to transfer only data that is used for I/O operations, which cannot be done directly in Cell: for example, when such data has to be displayed to the user or sent to another computer or application. In those cases, the computational costs or the intrinsic latency of the I/O will typically dwarf the costs of the data conversions.

Time to look at some numbers. Let's start with the loading times:

Standalone Embedded Ratio
movies.csv 402 519 1.29
actors.csv 1936 1844 0.95
directors.csv 82 75 0.91
movies_directors.csv 258 238 0.92
movies_genres.csv 233 190 0.81
roles.csv 8136 8940 1.09
Total 11047 11806 1.06

The first column shows the results when running the Cell-only program, and are exactly the same numbers we saw previously, the second column shows the result of running the tests in embedded mode, and the last one the ratio between the two. Not much of a difference. Updates are shown here:

Standalone Embedded Ratio
U1 47 40 0.85
U2 256 248 0.96
U3 25 23 0.92
U4 262 257 0.98
U5 2717 3752 1.38
U6 201 119 0.59
U7 7 8 1.14

It looks like U5 has become somewhat slower, but that's probably an illusion. It's more likely that the performance hit is taken during the loading phase, and it only appears to happen here because the garbage collector decides to kick in at this particular time. Finally, these are the results for our set of queries:

Standalone Embedded Ratio
Q1 223 226 1.01
Q2 5705 5680 0.99
Q3 5149 5972 1.15
Q4 201 198 0.98
Q5 176 180 1.02
Q6 537 526 0.97
Q7 3859 3927 1.01
Q8 4941 8665 1.75
Q9 3538 4159 1.17
Q10 4210 15406 3.65
Q11 360 796 2.21
Q12 511 1492 2.91

Apart from Q11 the queries that exhibit the worst slowdowns are Q8, Q10, Q12, which is not surprising, since they all do very little work but return a lot of data that needs to be converted. Q11 on the other hand is an example of a computation that is unnecessarily split between the Cell and Java sides of the code. The overhead almost entirely disappears if it is performed entirely in Cell.

Saving and loading the dataset

The last thing we need to check is how long it takes to load and save the dataset using the generated void load(Reader) and void save(Writer) methods. Those methods use the standard literal representation of Cell data, which is rather verbose: the ~140 MB of CSV files used as input increase in size to ~400 MB when saved in the standard Cell format. The updated version of the dataset, which is smaller, results in a file size of ~170 MB.

Saving the entire dataset takes between 10s and 11s on first run, which goes down to around 6.5s once the JVM warms up on the machine used these tests. In order to get those results though, it's necessary to manually increase (with the -Xms and -Xms command line flags) the amount of memory used by the JVM, otherwise the number stays at ~10s even after the JVM has warmed up. Saving the updated version of the dataset takes between 6s and 7s on first run, which goes down to less than 3s after a couple iterations. There's no need to tweak the memory settings of the JVM in this case.

Loading the dataset takes around 25s on first run, and about 15s once the JVM has warmed up. The latter time is not too bad (it is, after all, more or less the same time it takes the handwritten C# code to load the data from the original CSV files), but since loading the data is typically one of the first things an application would do, in most cases you would be stuck with the cold-run performance. Loading the updated version of the dataset takes about 15s seconds on the first run, and a little over 6s seconds afterwards.

Here there's still room for improvement, for example by replacing java.io.Reader and java.io.Writer with the NIO API, and by switching to asynchronous I/O, and at least some of that will probably be implemented in the next release (0.5). The longer term solution though is to define a more compact binary format that is equivalent to the textual representation (with a tool to convert between the two) but faster to read and write.