Benchmarks
The benchmarks we'll be examining in this chapter are meant to test the performance of code that updates and queries the state of relational automata, and compare it to equivalent hand-written Java or C# code (Benchmarks that test the functional subset of the language for purely computational tasks are available here).
Compared to garbage-collected, object-oriented languages like Java or C#, Cell has a number of structural advantages and disadvantages in terms of performance. One major disadvantage is the deferred update model of relational automata: when an instruction that updates a member or mutable relation variable is executed, the update is not applied immediately but stored in temporary memory until the message handler returns, which inevitably involves a certain amount of overhead. A separate but related source of overhead is the support for transactions. Another disadvantage is that in Cell you can (and usually will) specify a set of integrity constraints (that is, keys and foreign keys) on the state of relational automata that need to be checked at runtime. A static analysis of the codebase can remove some of this overhead, but not all of it in every case. A more general problem is that mutability in Cell is subject to a number of restrictions that are bound to negatively affect performance in at least some situations.
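To make the deferred update model more concrete, here's a minimal Python sketch (not Cell's actual implementation, and with hypothetical names) of the general idea: updates are buffered, a key constraint is checked against the committed state, and the whole batch is either applied atomically or discarded:

```python
class RelVar:
    """Toy unary relation with a key constraint, deferred updates and rollback."""
    def __init__(self):
        self.data = {}        # committed state: key -> value
        self.pending = {}     # buffered updates, applied only on commit

    def insert(self, key, value):
        # The update is only recorded; the committed state is untouched.
        self.pending[key] = value

    def commit(self):
        # Check the key constraint against the committed state first;
        # if any check fails, discard everything (transaction rollback).
        for key in self.pending:
            if key in self.data:
                self.pending.clear()
                return False
        self.data.update(self.pending)
        self.pending.clear()
        return True

rank = RelVar()
rank.insert("movie1", 7.5)
assert rank.commit()                 # first insert succeeds
rank.insert("movie1", 9.0)
assert not rank.commit()             # duplicate key: whole update is rolled back
assert rank.data == {"movie1": 7.5}  # committed state is unchanged
```

The buffering, the constraint checks and the rollback path are exactly the kinds of work a hand-written Java or C# class simply doesn't do, which is where the overhead comes from.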
On the other hand, relational automata naturally lend themselves to being implemented using techniques borrowed from the Entity/Component architecture and more generally from data-oriented design. One advantage of these architectures is that they vastly reduce the number of cache misses by improving the locality of the data. As we'll see later, they also seem to put less strain on the garbage collector, which is not surprising: having, say, a million instances of a class with five member variables in OOP means having a million distinct blocks of memory to keep track of, but with the E/C architecture you would end up with five arrays of one million elements each plus a hashtable (another array), which are much easier for the garbage collector to deal with.
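A hypothetical Python sketch of the two layouts, with made-up attribute names, may help visualize the difference: one heap object per entity versus a few parallel arrays plus an id-to-slot table:

```python
from array import array

# OOP layout: one heap object per entity, five fields each.
class MovieObj:
    def __init__(self, year, rank, runtime, votes, budget):
        self.year, self.rank, self.runtime = year, rank, runtime
        self.votes, self.budget = votes, budget

# E/C-style layout: five parallel arrays plus one id -> slot hashtable.
class MovieTable:
    def __init__(self):
        self.slot_of = {}        # the extra "array": id -> row index
        self.year    = array('i')
        self.rank    = array('d')
        self.runtime = array('i')
        self.votes   = array('i')
        self.budget  = array('d')

    def insert(self, movie_id, year, rank, runtime, votes, budget):
        self.slot_of[movie_id] = len(self.year)
        self.year.append(year); self.rank.append(rank)
        self.runtime.append(runtime); self.votes.append(votes)
        self.budget.append(budget)

    def rank_of(self, movie_id):
        return self.rank[self.slot_of[movie_id]]

t = MovieTable()
t.insert(42, 1999, 8.7, 136, 1_800_000, 63e6)
assert t.rank_of(42) == 8.7
# A scan over one attribute touches a single contiguous array:
assert sum(t.year) == 1999
```

However many rows are inserted, the garbage collector only ever sees six objects here, and a scan over a single attribute walks contiguous memory.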
While the code generated by the Cell compiler is already very fast, remember that optimization is still a work in progress: there's a lot left to do and, presumably, at least some margin for improvement.
The IMDB movie database
This example will make use of a subset of the data in the IMDB movie database. The dataset is contained in a number of CSV files of moderate size (~140 MB in total). The various versions of the code will parse and load the dataset into memory, run a number of queries on it and finally update it in various ways. Both the code and the dataset are available on GitHub (the relevant files are main.cell and imdb.cell for Cell, imdb.java for Java and imdb.cs for C#). If you have any questions or comments, or if there's anything you would like to see added to this set of benchmarks, just leave a message in the forum.
All tests were run on a dual-core Intel Celeron 1005M running Lubuntu Linux 18.10, using either the Java HotSpot Server VM (version 16.0.2) or the .NET Core SDK (version 3.1.426). The code was run on a single core, using the `taskset` command. Data on multi-core performance will be added in the near future.
Each test was run 25 times, but the first 5 iterations were discarded, so as to give the JVM/CLR time to warm up, and the remaining 20 were averaged. Cell took longer than Java to warm up (it's just more code), while C# took barely any time at all.
Here's a visual representation of the structure of the input dataset:
and this is the definition of the relational automaton we'll be using to store it:
Loading the dataset
The first test consists of parsing each of the CSV files and inserting the data into the corresponding relation variables in the case of Cell, or creating and wiring the corresponding set of objects for Java and C#. The results are shown in the following table (all times are in milliseconds):
| | Cell (via C++) | Java | C# |
|---|---|---|---|
| movies.csv | 166 | 412 | 1099 |
| actors.csv | 579 | 2244 | 4643 |
| directors.csv | 33 | 151 | 350 |
| movies_directors.csv | 113 | 232 | 212 |
| movies_genres.csv | 108 | 344 | 456 |
| roles.csv | 3033 | 4041 | 13031 |
| Total | 4032 | 7424 | 19791 |
The first column shows the loading times for the code produced by the Cell-to-C++ code generator, and the second and third ones the equivalent times for the handwritten, object-oriented Java and C# code respectively. Each row corresponds to a different CSV file, save for the last one, which shows the totals. The next table shows the same results normalized:
| | Cell (via C++) | Java | C# |
|---|---|---|---|
| movies.csv | 1.00 | 2.48 | 6.62 |
| actors.csv | 1.00 | 3.88 | 8.02 |
| directors.csv | 1.00 | 4.58 | 10.61 |
| movies_directors.csv | 1.00 | 2.05 | 1.88 |
| movies_genres.csv | 1.00 | 3.19 | 4.22 |
| roles.csv | 1.00 | 1.33 | 4.30 |
| Mean (Geometric) | 1.00 | 2.70 | 5.17 |
For each row the times are scaled so that the best one becomes 1.00. As you can see, Cell is always faster, followed by Java, with C# a distant third.
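For reference, this is how the normalized figures and their summary row are computed (a short Python sketch; the raw times are those from the first table above):

```python
from math import prod

# Raw loading times in ms, from the first table (Cell, Java, C#).
raw = {
    "movies.csv":           (166, 412, 1099),
    "actors.csv":           (579, 2244, 4643),
    "directors.csv":        (33, 151, 350),
    "movies_directors.csv": (113, 232, 212),
    "movies_genres.csv":    (108, 344, 456),
    "roles.csv":            (3033, 4041, 13031),
}

# Each row is scaled so that the best (smallest) time becomes 1.00.
normalized = {f: tuple(t / min(times) for t in times)
              for f, times in raw.items()}

# The summary row is the geometric mean of each column.
def geo_mean(xs):
    return prod(xs) ** (1 / len(xs))

means = tuple(geo_mean([normalized[f][i] for f in raw]) for i in range(3))

assert round(normalized["movies.csv"][1], 2) == 2.48   # Java, first row
assert round(means[1], 2) == 2.70                      # Java column mean
assert round(means[2], 2) == 5.17                      # C# column mean
```

The geometric mean is used rather than the arithmetic one because it is the standard way to average ratios: it doesn't let a single large ratio dominate the summary.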
Note that the individual numbers in the above tables are not particularly reliable (only the totals are), because a lot depends on when the garbage collector decides to kick in, which varies depending on how the tests are set up. A better analysis is definitely needed to factor this out, and it will be provided in the near future.
Since this phase of the test is dominated by object creation, which puts a lot of pressure on the garbage collector, it's actually interesting to see what the numbers would be in an "infinite memory" scenario. That can be approximated by setting the amount of memory available to the JVM to a sufficiently large value (with the `-Xms` and `-Xmx` options), and by explicitly invoking the garbage collector before loading each individual file (the time spent doing this preventive garbage collection is obviously not included in the figures shown here):
| | Cell (via C++) | Java |
|---|---|---|
| movies.csv | 166 | 127 |
| actors.csv | 579 | 989 |
| directors.csv | 33 | 25 |
| movies_directors.csv | 113 | 139 |
| movies_genres.csv | 108 | 78 |
| roles.csv | 3033 | 2151 |
| Total | 4032 | 3509 |
Here are the normalized figures:
| | Cell (via C++) | Java |
|---|---|---|
| movies.csv | 1.31 | 1.00 |
| actors.csv | 1.00 | 1.71 |
| directors.csv | 1.32 | 1.00 |
| movies_directors.csv | 1.00 | 1.23 |
| movies_genres.csv | 1.38 | 1.00 |
| roles.csv | 1.41 | 1.00 |
| Mean (Geometric) | 1.22 | 1.13 |
As you can see, with the garbage collector out of the way, Java is still a bit faster than Cell, which is not particularly surprising: object creation is extremely fast in Java, while the data structures used by Cell are more complex, and updates are also slowed down by factors (transactions, integrity checks, etc.) that have no equivalent in OOP. But much of that advantage is then lost by overloading the garbage collector.
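The "infinite memory" measurement can be approximated with a harness along these lines (a Python sketch of the idea; the actual Java harness uses `System.gc()` and the `-Xms`/`-Xmx` flags instead):

```python
import gc
from time import perf_counter

def time_excluding_gc(task):
    """Collect garbage up front, then keep the collector out of the timed
    section; the time spent in the preventive collection is not counted."""
    gc.collect()
    gc.disable()
    try:
        start = perf_counter()
        task()
        return perf_counter() - start
    finally:
        gc.enable()

# Hypothetical workload standing in for loading one CSV file.
elapsed = time_excluding_gc(lambda: sum(range(1_000_000)))
assert elapsed >= 0.0
```

The point of the pre-collection is that each timed section starts from a clean heap, so collections triggered by a previous file's allocations aren't charged to the wrong measurement.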
Updating the dataset
We'll now be performing a number of update operations, which are described here:
- U1: Increments the rank of all movies released before a given year by a specified amount.
- U2: For each actor, computes the average rank of the movies they appear in, and stores the result in the `actor_rank` relation variable or in the `avgMoviesRank` member variable.
- U3: For each director, computes the average rank of the movies they directed, and stores the result in the `director_rank` relation variable or in the `avgMoviesRank` member variable.
- U4: Randomly generates a list of movie ids, and for each of them increments their rank and updates the rank of all actors and directors that worked on it.
- U5: Deletes all movies below a given rank.
- U6: Deletes all actors that do not appear in at least one of the remaining movies.
- U7: Deletes all directors that did not work on any of the remaining movies.
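As an illustration, U2 boils down to an aggregation over two relations. Here's a toy Python sketch with made-up data (Cell expresses this directly over its relation variables):

```python
# Toy relations: the rank of each movie, and who appears in which movie.
movie_rank = {"m1": 8.0, "m2": 6.0, "m3": 9.0}
appears_in = [("a1", "m1"), ("a1", "m2"), ("a2", "m3")]

# U2: for each actor, the average rank of the movies they appear in.
totals, counts = {}, {}
for actor, movie in appears_in:
    totals[actor] = totals.get(actor, 0.0) + movie_rank[movie]
    counts[actor] = counts.get(actor, 0) + 1

actor_rank = {actor: totals[actor] / counts[actor] for actor in totals}
assert actor_rank == {"a1": 7.0, "a2": 9.0}
```

U3 is the same computation over the director relations, and U4-U7 similarly combine lookups, scans and deletions over the same handful of relations.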
The results are shown here:
| | Cell (via C++) | Java | C# |
|---|---|---|---|
| U1 | 166 | 419 | 1198 |
| U2 | 579 | 2360 | 4181 |
| U3 | 33 | 197 | 43 |
| U4 | 113 | 174 | 219 |
| U5 | 107 | 302 | 861 |
| U6 | 3054 | 4291 | 12825 |
| U7 | 15 | 38 | 41 |
and here are the normalized ones:
| | Cell (via C++) | Java | C# |
|---|---|---|---|
| U1 | 1.00 | 2.52 | 7.22 |
| U2 | 1.00 | 4.08 | 7.22 |
| U3 | 1.00 | 5.97 | 1.30 |
| U4 | 1.00 | 1.54 | 1.94 |
| U5 | 1.00 | 2.82 | 8.05 |
| U6 | 1.00 | 1.41 | 4.20 |
| U7 | 1.00 | 2.53 | 2.73 |
| Mean (Geometric) | 1.00 | 2.37 | 3.06 |
As you can see, Cell is again clearly the fastest one, and C# the slowest.
Queries
Let's now compare the performance of computations that do not update the dataset. We'll be running the following set of queries:
- Q1: Counting the number of movies whose rank is no less than a given threshold. Repeated for several different values of the threshold.
- Q2: Counting the number of actors who played in at least one movie whose rank was no less than a given threshold. Repeated for several different values of the threshold.
- Q3: For each actor, finding all co-actors in movies with a given minimum rank.
- Q4: Calculating a histogram of movie ages.
- Q5: Calculating the average age of all movies with a given minimum rank.
- Q6: Calculating the sum of the ages of all movies in the dataset.
- Q7: For each movie in a randomly generated set, finding all movies that have actors in common with it.
- Q8: For each actor in a randomly generated set, retrieving the last names of all actors with the same first name. Duplicates are eliminated.
- Q9: Same as Q3, but also returning the number of movies they appeared in together.
- Q10: Same as Q8, but duplicates are not eliminated.
- Q11: For each director, checking if they ever also worked as actors.
- Q12: For each actor, computing the full name by concatenating first and last names.
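To give an idea of what these queries involve, here's a toy Python sketch of Q3 (finding all co-actors in movies with a given minimum rank), over hypothetical data:

```python
# Toy relations: movie ranks and the cast of each movie.
movie_rank = {"m1": 8.0, "m2": 5.0}
cast = {"m1": {"a1", "a2", "a3"}, "m2": {"a1", "a4"}}

def co_actors(min_rank):
    """For each actor, the set of co-actors in movies ranked >= min_rank."""
    result = {}
    for movie, actors in cast.items():
        if movie_rank[movie] >= min_rank:      # attribute lookup + filter
            for a in actors:                   # navigation actor <-> movie
                result.setdefault(a, set()).update(actors - {a})
    return result

assert co_actors(7.0) == {"a1": {"a2", "a3"},
                          "a2": {"a1", "a3"},
                          "a3": {"a1", "a2"}}
```

The query combines an attribute lookup (the rank), navigation between two relations and a scan, which is exactly the mix of basic operations the benchmark is meant to exercise.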
Some of the queries are a bit bizarre, but they actually manage to test all the basic operations on relations: attribute lookups, navigation, searches and linear scans. The results are shown here in milliseconds:
| | Cell (via C++) | Java | C# |
|---|---|---|---|
| Q1 | 78 | 416 | 1040 |
| Q2 | 2346 | 9365 | 9080 |
| Q3 | 3365 | 6339 | 9788 |
| Q4 | 65 | 221 | 484 |
| Q5 | 24 | 217 | 482 |
| Q6 | 100 | 861 | 1858 |
| Q7 | 2725 | 4070 | 4420 |
| Q8 | 3881 | 3774 | 4577 |
| Q9 | 5223 | 1724 | 3247 |
| Q10 | 2781 | 2156 | 2689 |
| Q11 | 88 | 304 | 160 |
| Q12 | 148 | 224 | 86 |
and here is the normalized version:
| | Cell (via C++) | Java | C# |
|---|---|---|---|
| Q1 | 1.00 | 5.33 | 13.33 |
| Q2 | 1.00 | 3.99 | 3.87 |
| Q3 | 1.00 | 1.88 | 2.91 |
| Q4 | 1.00 | 3.40 | 7.45 |
| Q5 | 1.00 | 9.04 | 20.08 |
| Q6 | 1.00 | 8.61 | 18.58 |
| Q7 | 1.00 | 1.49 | 1.62 |
| Q8 | 1.03 | 1.00 | 1.21 |
| Q9 | 3.03 | 1.00 | 1.88 |
| Q10 | 1.29 | 1.00 | 1.25 |
| Q11 | 1.00 | 3.45 | 1.82 |
| Q12 | 1.72 | 2.60 | 1.00 |
| Mean (Geometric) | 1.17 | 2.69 | 3.51 |
When compared to Java, Cell is much slower for `Q9`, which makes use of a map/dictionary that is repeatedly mutated. In Cell, maps only support functional updates, which are reasonably fast but cannot compete with Java's mutable, hashtable-based maps. Cell is also somewhat slower for query `Q10`.

Compared to C#, Cell is again slower for `Q9`, and also for `Q12`, but the latter is a special case, as `Q12` is dominated by string concatenation, which is heavily optimized in C# (and Java) but not in Cell.
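The difference can be sketched like this: Q9-style counting repeatedly updates a map, which an imperative hashtable does in place, while a purely functional map must produce a new version for every update (crudely simulated below by copying; real persistent maps share structure, but each update still allocates):

```python
# Hypothetical pairs of actor ids appearing together.
pairs = [("a1", "a2"), ("a1", "a2"), ("a2", "a3")]

# Imperative style (e.g. a Java HashMap): mutate one table in place.
counts = {}
for p in pairs:
    counts[p] = counts.get(p, 0) + 1

# Functional style: every update yields a new map value.
fcounts = {}
for p in pairs:
    fcounts = {**fcounts, p: fcounts.get(p, 0) + 1}

assert counts == fcounts == {("a1", "a2"): 2, ("a2", "a3"): 1}
```

Both styles compute the same result; the functional one just pays an allocation on every update, which is where the Q9 gap comes from.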
Dual-core results
All the figures we've seen so far are for a single-core configuration, and now we'll take a look at what happens with a dual-core one. For Cell nothing changes, as at this stage the Cell compiler doesn't take advantage of the second core, but the figures are different for Java and C#, even though neither program is multithreaded. Here are the results of the loading tests:
| | Cell | Java | C# |
|---|---|---|---|
| movies.csv | 166 | 312 | 235 |
| actors.csv | 579 | 2590 | 2709 |
| directors.csv | 33 | 25 | 480 |
| movies_directors.csv | 113 | 220 | 490 |
| movies_genres.csv | 108 | 130 | 463 |
| roles.csv | 3033 | 5972 | 6657 |
| Total | 4032 | 9249 | 11034 |
As you can see, Java is (surprisingly) slower when considering the overall loading times, by about 25%, while C# is much faster, by a whopping 80%, though it's still the slowest of the three. I have no explanation for the worsening of Java's performance, but I'd guess the C# improvement is due to the fact that the runtime is able to run the garbage collector on the second core, which makes a lot of difference here, since this set of tests puts a lot of pressure on the garbage collector. The individual figures bounce around a lot for both Java and C#, but that's just an artifact in the data, caused by the unpredictability of a garbage collector that kicks in at different times in this dual-core configuration.
These are the results for the update tests, both raw and normalized:
| | Cell | Java | C# |
|---|---|---|---|
| U1 | 166 | 309 | 253 |
| U2 | 579 | 2544 | 3485 |
| U3 | 33 | 26 | 43 |
| U4 | 113 | 208 | 217 |
| U5 | 107 | 134 | 442 |
| U6 | 3054 | 5995 | 6397 |
| U7 | 15 | 75 | 41 |
| | Cell | Java | C# |
|---|---|---|---|
| U1 | 1.00 | 1.86 | 1.52 |
| U2 | 1.00 | 4.39 | 6.02 |
| U3 | 1.27 | 1.00 | 1.65 |
| U4 | 1.00 | 1.84 | 1.92 |
| U5 | 1.00 | 1.25 | 4.13 |
| U6 | 1.00 | 1.96 | 2.09 |
| U7 | 1.00 | 5.00 | 2.73 |
| Mean (Geometric) | 1.03 | 2.40 | 2.30 |
Here again the results are made less clear by the intervention of the garbage collector, but it looks like overall not much has changed for Java, while there's a clear (but much smaller than in the loading tests) improvement for C#.
The results of the query tests are more reliable, due to the fact that the garbage collector is not as much of a factor as it is in the other sets of tests. They show a clear deterioration of performance for Java, and no significant difference for C#:
| | Cell | Java | C# |
|---|---|---|---|
| Q1 | 78 | 1158 | 1024 |
| Q2 | 2346 | 12067 | 8939 |
| Q3 | 3365 | 8188 | 9403 |
| Q4 | 65 | 606 | 478 |
| Q5 | 24 | 607 | 477 |
| Q6 | 100 | 2319 | 1833 |
| Q7 | 2725 | 5154 | 4359 |
| Q8 | 3881 | 5039 | 4690 |
| Q9 | 5223 | 2275 | 2997 |
| Q10 | 2781 | 3422 | 2803 |
| Q11 | 88 | 271 | 161 |
| Q12 | 148 | 152 | 84 |
| | Cell | Java | C# |
|---|---|---|---|
| Q1 | 1.00 | 14.85 | 13.13 |
| Q2 | 1.00 | 5.14 | 3.81 |
| Q3 | 1.00 | 2.43 | 2.79 |
| Q4 | 1.00 | 9.32 | 7.35 |
| Q5 | 1.00 | 25.29 | 19.88 |
| Q6 | 1.00 | 23.19 | 18.33 |
| Q7 | 1.00 | 1.89 | 1.60 |
| Q8 | 1.00 | 1.30 | 1.21 |
| Q9 | 2.30 | 1.00 | 1.32 |
| Q10 | 1.00 | 1.23 | 1.01 |
| Q11 | 1.00 | 3.08 | 1.83 |
| Q12 | 1.76 | 1.81 | 1.00 |
| Mean (Geometric) | 1.12 | 4.01 | 3.31 |
Embedded mode
Cell is primarily meant to be used as an embedded language, so we'll now take a look at its performance for that specific use case. As one might expect, the generated code can sometimes be slower in embedded mode, because of all the data conversion and validation that takes place at the boundary between handwritten and generated code.
The performance hit varies considerably, and it's impossible to quantify exactly, because it depends on the "surface to volume ratio" of a given method or message handler, that is, the ratio between the amount of data that needs to be converted back and forth and the amount of actual work the method/handler needs to perform.
In order to maximize performance it's important to carefully design the public interface of relational automata in order to minimize the amount of data that has to be transferred between the hand-written and the generated parts of the codebase. Guidelines and examples of how to do so will be published soon, but in practice that's easier than it sounds. That's because, even though it can be used to replace an embedded database, Cell is not a database but a fully-featured programming language, and computation that needs to access the data held by the Cell side of the code can, in the vast majority of cases, be done directly in Cell, without involving the host language. Most of the time, you'll need to transfer only data that is used for I/O operations, which cannot be done directly in Cell: for example, when such data has to be displayed to the user or sent to another computer or application. In those cases, the computational costs or the intrinsic latency of the I/O will typically dwarf the costs of the data conversions.
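As a toy illustration of that "surface to volume" point, compare two ways of exposing the same computation across the boundary (the names and the `convert` stand-in for the marshalling cost are hypothetical):

```python
# Stand-in for the cost of converting one value across the
# handwritten/generated boundary.
def convert(value):
    return value

# Hypothetical relation held on the Cell side of the code.
movie_rank = {"m%d" % i: float(i % 10) for i in range(1000)}

# Poor design: transfer the whole relation, then compute on the host side.
def all_ranks():
    return [convert(r) for r in movie_rank.values()]   # 1000 conversions

# Better design: compute on the Cell side, transfer a single number.
def average_rank():
    return convert(sum(movie_rank.values()) / len(movie_rank))  # 1 conversion

host_avg = sum(all_ranks()) / len(all_ranks())
assert abs(host_avg - average_rank()) < 1e-9
```

Both designs produce the same answer, but the first one pays the conversion cost a thousand times over, while the second pays it once.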
Time to look at some numbers. These are the loading times for Cell embedded into Java (the numbers are for version 0.4 of the compiler, which is somewhat slower than the most recent one):
| | Standalone | Embedded | Ratio |
|---|---|---|---|
| movies.csv | 402 | 519 | 1.29 |
| actors.csv | 1936 | 1844 | 0.95 |
| directors.csv | 82 | 75 | 0.91 |
| movies_directors.csv | 258 | 238 | 0.92 |
| movies_genres.csv | 233 | 190 | 0.81 |
| roles.csv | 8136 | 8940 | 1.09 |
| Total | 11047 | 11806 | 1.06 |
The first column shows the results when running the Cell-only program (the same numbers we saw previously); the second column shows the results of running the tests in embedded mode, and the last one the ratio between the two. Not much of a difference. Updates are shown here:
| | Standalone | Embedded | Ratio |
|---|---|---|---|
| U1 | 47 | 40 | 0.85 |
| U2 | 256 | 248 | 0.96 |
| U3 | 25 | 23 | 0.92 |
| U4 | 262 | 257 | 0.98 |
| U5 | 2717 | 3752 | 1.38 |
| U6 | 201 | 119 | 0.59 |
| U7 | 7 | 8 | 1.14 |
It looks like U5 has become somewhat slower, but that's probably an illusion. It's more likely that the performance hit is taken during the loading phase, and it only appears to happen here because the garbage collector decides to kick in at this particular time. Finally, these are the results for our set of queries:
| | Standalone | Embedded | Ratio |
|---|---|---|---|
| Q1 | 223 | 226 | 1.01 |
| Q2 | 5705 | 5680 | 0.99 |
| Q3 | 5149 | 5972 | 1.15 |
| Q4 | 201 | 198 | 0.98 |
| Q5 | 176 | 180 | 1.02 |
| Q6 | 537 | 526 | 0.97 |
| Q7 | 3859 | 3927 | 1.01 |
| Q8 | 4941 | 8665 | 1.75 |
| Q9 | 3538 | 4159 | 1.17 |
| Q10 | 4210 | 15406 | 3.65 |
| Q11 | 360 | 796 | 2.21 |
| Q12 | 511 | 1492 | 2.91 |
Apart from Q11, the queries that exhibit the worst slowdowns are Q8, Q10 and Q12, which is not surprising, since they all do very little work but return a lot of data that needs to be converted. Q11, on the other hand, is an example of a computation that is unnecessarily split between the Cell and Java sides of the code; the overhead almost entirely disappears if it is performed entirely in Cell.
Saving and loading the dataset
The last thing we need to check is how long it takes to load and save the dataset using the generated `void load(Reader)` and `void save(Writer)` methods. Those methods use the standard literal representation of Cell data, which is rather verbose: the ~140 MB of CSV files used as input grow to ~400 MB when saved in the standard Cell format. The tests have also been performed on a smaller version of the dataset, which results in a file size of ~170 MB.
When compiling to Java (results for the C# code generator will be added soon), saving the entire dataset takes between 10s and 11s on the first run, which goes down to around 6.5s once the JVM warms up, on the machine used for these tests. In order to get those results, though, it's necessary to manually increase the amount of memory available to the JVM (with the `-Xms` and `-Xmx` command line flags); otherwise the time stays at ~10s even after the JVM has warmed up. Saving the smaller version of the dataset takes between 6s and 7s on the first run, which goes down to less than 3s after a couple of iterations. There's no need to tweak the memory settings of the JVM in this case.
Loading the dataset takes around 25s on the first run, and about 15s once the JVM has warmed up. The latter time is not too bad (it is, after all, more or less the same time it takes the handwritten C# code to load the data from the original CSV files), but since loading the data is typically one of the first things an application does, in most cases you would be stuck with the cold-run performance. Loading the smaller dataset takes about 15s on the first run, and a little over 6s afterwards.
Here there's still room for improvement, for example by replacing `java.io.Reader` and `java.io.Writer` with the NIO API, and by switching to asynchronous I/O. The longer-term solution, though, is to define a more compact binary format that is equivalent to the textual representation (with a tool to convert between the two) but faster to read and write.
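To give an idea of what such a binary format might look like, here's a toy length-prefixed encoding for a small relation, sketched in Python with the standard `struct` module (a hypothetical layout, not the planned Cell format):

```python
import struct

def encode(pairs):
    """Encode (id, rank) pairs as: a 4-byte count, then one 8-byte record
    per pair (uint32 id + float32 rank, little-endian)."""
    out = [struct.pack("<I", len(pairs))]
    for movie_id, rank in pairs:
        out.append(struct.pack("<If", movie_id, rank))
    return b"".join(out)

def decode(data):
    (count,) = struct.unpack_from("<I", data, 0)
    return [struct.unpack_from("<If", data, 4 + 8 * i) for i in range(count)]

pairs = [(1, 8.5), (2, 6.0)]
blob = encode(pairs)
assert [(i, round(r, 1)) for i, r in decode(blob)] == pairs
assert len(blob) == 4 + 8 * len(pairs)   # far denser than the textual form
```

Fixed-width records like these can be read and written without any parsing or formatting, which is where the speedup over a textual representation would come from.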