Subject: Smalltalk/C Benchmark results
From: Greg & Cindy Gritton
Reply-To: gritton.family@verizon.net
Newsgroups: comp.lang.smalltalk
Date: Sun, 10 Feb 2002 00:25:19 GMT
In honor of the Olympics (or the fact that I completed the runs), I am
presenting the results of some benchmarks run on various versions of
Smalltalk as well as in C. These are small benchmarks, originally written
in C and then ported to Smalltalk. They were taken from the Self project
with some alterations (see notes below).
These benchmarks don't represent how Smalltalk is currently used. Most often,
Smalltalk seems to be used in business logic and web applications. Its
main competitors there are C++ and perhaps Perl. Business applications
typically use virtual functions, dynamic memory allocation, and an
object-oriented design style, which results in similar execution speeds
for Smalltalk and C++.
These benchmarks are much simpler than business applications. They are
small programs that perform basic tasks such as sorting, matrix multiply,
or classical logic problems such as "8 queens". They don't use any
dynamic memory allocation, are not object oriented, generally hit the
cache, and run very fast in C. As such, they show Smalltalk in its
worst light.
With that disclaimer, they are still useful. They help show where
Smalltalk is weak in execution speed, and how weak each version is.
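As a rough illustration of what these kernels look like in Smalltalk (a
sketch written for this posting, not the suite's actual code), here is the
heart of a bubble sort on a reverse-sorted array: nothing but indexed
access, comparisons, and SmallInteger arithmetic, every one of them a
message send that an optimizing C compiler turns into a couple of machine
instructions.

    | data n t |
    n := 500.
    data := Array new: n.
    1 to: n do: [:i | data at: i put: n - i + 1].    "reverse-sorted input"
    1 to: n - 1 do: [:pass |
        1 to: n - pass do: [:j |
            (data at: j) > (data at: j + 1) ifTrue: [
                t := data at: j.
                data at: j put: (data at: j + 1).
                data at: j + 1 put: t]]].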
All benchmarks were run on a 700MHz Pentium-III notebook with 256MB
of memory running Windows 2000. Times are in milliseconds per run.
Runtime (ms)
Benchmark     C (opt)  C (debug)    VW   ST-X    MT  Dolphin  Squeak
BubbleSort        1.7        3.8    24     39    30       88      96
IntMM             0.8        2.4    11     20    19       39     102
MM                1.0        2.7    52     71    23       74     150
Perm              1.9       11.3    13     16    14       76      87
Queens            1.5        3.0     9      9     8       42      52
Quicksort         1.5        3.7    10     11    12       38      46
Towers            1.4       10.0    15     19    26       85     117
Puzzle            8.0       14.9    74    117    67      252     455
FFT               1.3        3.6   103    114   493      173     318
WeightedTotal    13.0       43.4   205    282   330      626     984
Ratio to C        1.0        3.3    16     22    25       48      76
Ratio to optimized C
Benchmark     C (opt)  C (debug)    VW   ST-X    MT  Dolphin  Squeak
BubbleSort        1.0        2.2    14     23    18       53      57
IntMM             1.0        3.1    14     26    25       51     134
MM                1.0        2.7    52     71    23       74     150
Perm              1.0        6.0     7      8     7       40      46
Queens            1.0        2.0     6      6     5       28      34
Quicksort         1.0        2.5     7      8     8       26      32
Towers            1.0        7.1    11     13    18       60      83
Puzzle            1.0        1.9     9     15     8       31      57
FFT               1.0        2.9    82     90   391      137     252
Median            1.0        2.7    11     15    18       51      58
Average           1.0        3.4    23     29    56       56      94
Geo. Mean         1.0        1.6    14     19    18       49      74
Versions:
  Microsoft Visual C++ 6.0 Enterprise Edition
    - Default "Release" and "Debug" settings
  VisualWorks Non-Commercial 5i.4
  Smalltalk-X 4.1.3
  Smalltalk MT version 4.00 (Build 517) 9x Edition (ANSI)
    - Optimized image
  Dolphin 4.01.3 Professional
  Squeak 3.0
Notes:
[1] I also tested evaluation versions of VisualAge and SmallScript.
Unfortunately, both of their license agreements seem to
prevent posting of their results.
[2] Most benchmarks are integer only, except MM and FFT, which
are floating point benchmarks.
[3] The benchmarks are mostly as they exist in the Self distribution
with the following changes.
[a] I created a common file-in format.
[b] I increased the number of runs to give more accurate timing
on newer computers. I am now running each benchmark 10 times
in Smalltalk and 1000 times in C. I verified that changing
the number of runs doesn't affect the time per run.
(A sketch of this kind of driver follows this note.)
[c] I ported the FFT benchmark to Smalltalk. The port is a
straightforward translation from C and doesn't use a Smalltalk
style. It doesn't implement the mathematical functions in the
SComplex element.
[d] I fixed Uniform11 in C to take pointers to integers instead
of plain integers. It was supposed to maintain its values between
calls but, because it received copies, it didn't.
[e] I made the MM benchmark (Matrix Multiply) its own benchmark
with its own methods, and initialized the sum variable
in innerproductOf: to 0.0. It used to be a subclass of
IntMMBenchmark and initialized the sum variable to 0.
This defeated Smalltalk MT's float optimization in a
situation that wasn't very realistic.
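For what it's worth, the driver described in [b] amounts to something
like the following workspace sketch (the 1000 factorial is just a
stand-in workload, and this assumes a dialect that provides Time
class>>millisecondsToRun:):

    | runs body totalMs |
    runs := 10.
    body := ["the benchmark body goes here" 1000 factorial].
    totalMs := Time millisecondsToRun: [runs timesRepeat: body].
    Transcript show: 'ms per run: ', (totalMs / runs) asFloat printString; cr.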
[4] The exact formulation of a floating point test makes
a big difference to Smalltalk-MT, which tries to optimize
floating point. For example, changing the initial value of the
sum variable in MM from 0 to 0.0 allowed Smalltalk-MT to optimize
the inner loop, moving it from the slowest Smalltalk on that
benchmark to the fastest. Unfortunately, it can't optimize
the inner loop of FFT and remains the slowest Smalltalk
on that benchmark.
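To make the distinction concrete, here is a minimal workspace sketch
(the array contents are arbitrary): starting the accumulator with the
Float literal 0.0 keeps the whole loop in floating point, while starting
it with 0 makes the first addition a SmallInteger operation and defeats
the optimization.

    | row col sum |
    row := (1 to: 100) collect: [:i | i / 10.0].
    col := (1 to: 100) collect: [:i | i / 20.0].
    sum := 0.0.    "with  sum := 0.  the loop is no longer float-only"
    1 to: row size do: [:i |
        sum := sum + ((row at: i) * (col at: i))].
    sum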
[5] The FFT benchmark contains a lot of redundant expressions
in its inner loop that are eliminated automatically by the
optimizing C compiler. I used the same code for Smalltalk.
The Smalltalk compiler can't optimize the expressions and
so the results of this benchmark are very poor. On the other
hand, I can nearly double the benchmark speed by optimizing
a few lines by hand to eliminate redundant expressions.
A real FFT kernel written in Smalltalk would have such optimizations.
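Here is the kind of rewrite involved, using made-up variables rather
than the actual FFT code: the C optimizer hoists the repeated
subexpression automatically, while in Smalltalk it has to be factored
out by hand.

    | re w j sumRe sumIm t |
    re := (1 to: 8) collect: [:i | i / 10.0].
    w := 0.25.  j := 3.
    sumRe := 0.0.  sumIm := 0.0.
    "As written: w * (re at: j) is evaluated twice."
    sumRe := sumRe + (w * (re at: j)).
    sumIm := sumIm + (w * (re at: j)).
    "Hand-optimized: evaluate it once and reuse the temporary."
    t := w * (re at: j).
    sumRe := sumRe + t.
    sumIm := sumIm + t.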
[6] Reporting the runtimes of all tests is easy. Creating a single
performance number from a number of test results is trickier.
There are numerous ways of doing so.
(1) Sum of execution times.
This may be the most realistic as you are usually interested
in how long it takes to do something. The disadvantage is
that a baseline and scaling factor are needed. Some options are:
(a) Simply add up the execution time of the tests as-is.
Hopefully the tests are reasonably well balanced
so certain tests don't dominate.
(b) Get the execution time of each benchmark on some
reference system. Divide the time of each benchmark
by that reference's results.
(c) Get the execution time of each benchmark relative to
some reference on the current system. The execution
time of optimized C is a reasonable reference.
Tests that are slow relative to the reference will end
up emphasized in the results.
(2) Sum of speeds.
This has the same biasing issues as (1), plus it doesn't
reflect usage very well. It emphasizes very fast tests.
(3) Geometric mean.
Using the geometric mean gets rid of most weighting
problems - it is easy to weight tests evenly.
Using the geometric mean of times or of speeds
is equivalent. There still needs to be some reference
for reporting. Overly fast and overly slow tests have
about an equal effect on the geometric mean.
(4) Median
This can be the absolute median, if the tests are
properly weighted, or the median relative to some
baseline such as optimized C.
In the end, I reported the totals as follows:
(1) Total time in milliseconds, totaling most benchmarks
but only 1/2 of the FFT and 1/3 of the Puzzle benchmarks.
These benchmarks take longer than the others and
would dominate otherwise.
(2) The ratio to optimized C of the total.
(3) The median test relative to optimized C.
(4) The average of the execution time relative to optimized C.
(5) The geometric mean of the execution time relative
to optimized C.
The various versions are reported fastest to slowest on
most measures. (A sketch of the median and geometric-mean
calculations follows this note.)
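As a concrete check of (3) and (4), this workspace sketch uses the
VisualWorks ratios from the table above and matches the reported
Median of 11 and, after rounding, the Geo. Mean of 14:

    | ratios n sorted median geoMean |
    ratios := #(14 14 52 7 6 7 11 9 82) collect: [:r | r asFloat].
    n := ratios size.
    sorted := ratios asSortedCollection asArray.
    median := n odd
        ifTrue: [sorted at: n + 1 // 2]
        ifFalse: [((sorted at: n // 2) + (sorted at: n // 2 + 1)) / 2].
    geoMean := (ratios inject: 1.0 into: [:product :r | product * r])
        raisedTo: 1.0 / n.
    Transcript show: 'Median: ', median printString; cr.
    Transcript show: 'Geo. mean: ', geoMean printString; cr.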
[7] If anyone has other benchmarks that run in both Smalltalk and C
or C++, and that aren't too difficult to set up, please send me
pointers or post them here. Thanks.
I hope that this is useful to people.
Greg Gritton
Subject: Re: Smalltalk/C Benchmark results
From: Greg & Cindy Gritton
Reply-To: gritton.family@verizon.net
Newsgroups: comp.lang.smalltalk
Date: Mon, 11 Feb 2002 05:57:23 GMT
I don't normally follow up my own postings, but I added treesort
to the benchmark mix. This benchmark was already there but had
not been run in the suite before.
The key difference between treesort and the other benchmarks
is that the C version of the benchmark dynamically allocates
each element that it inserts into the tree. Of course
the Smalltalk version does likewise. However, the presence
of dynamic allocation brings Smalltalk and C much closer.
VisualWorks is only 3 times slower than optimized C and
actually beats the C debug version.
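To illustrate, here is a workspace sketch of the pattern treesort
stresses (three-slot Arrays stand in for node objects; this is not the
suite's actual code): every inserted element costs a fresh object
allocation in Smalltalk, just as every node costs a malloc in the C
version, so both sides pay a per-element allocation price.

    | newNode root node idx data |
    newNode := [:v | Array with: v with: nil with: nil].  "value, left, right"
    root := nil.
    data := #(5 3 8 1 4 7 9 2 6).
    data do: [:v |
        root isNil
            ifTrue: [root := newNode value: v]
            ifFalse: [
                node := root.
                [idx := v < (node at: 1) ifTrue: [2] ifFalse: [3].
                 (node at: idx) isNil
                     ifTrue: [node at: idx put: (newNode value: v). false]
                     ifFalse: [node := node at: idx. true]] whileTrue]].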
Here are the updated results with treesort included.
Also, the geometric mean slowdown of "C-Debug" is corrected.
(It should have been 3.0)
Runtime (ms)
Benchmark     C (opt)  C (debug)    VW   ST-X    MT  Dolphin  Squeak
BubbleSort        1.7        3.8    24     39    30       88      96
IntMM             0.8        2.4    11     20    19       39     102
MM                1.0        2.7    52     71    23       74     150
Perm              1.9       11.3    13     16    14       76      87
Queens            1.5        3.0     9      9     8       42      52
Quicksort         1.5        3.7    10     11    12       38      46
Treesort          5.2       25.2    17     19    28       69     144
Towers            1.4       10.0    15     19    26       85     117
Puzzle            8.0       14.9    74    117    67      252     455
FFT               1.3        3.6   103    114   493      173     318
WeightedTotal    18.0       68.1   225    296   427      673    1090
Ratio to C        1.0        3.8    13     17    24       38      61
Ratio to optimized C
Benchmark     C (opt)  C (debug)    VW   ST-X    MT  Dolphin  Squeak
BubbleSort        1.0        2.2    14     23    18       53      57
IntMM             1.0        3.1    14     26    25       51     134
MM                1.0        2.7    52     71    23       74     150
Perm              1.0        6.0     7      8     7       40      46
Queens            1.0        2.0     6      6     5       28      34
Quicksort         1.0        2.5     7      8     8       26      32
Treesort          1.0        4.8     3      4     5       13      28
Towers            1.0        7.1    11     13    18       60      83
Puzzle            1.0        1.9     9     15     8       31      57
FFT               1.0        2.9    82     90   391      137     252
Median            1.0        2.8    10     14    13       46      57
Average           1.0        3.5    20     27    51       51      87
Geo. Mean         1.0        3.2    12     16    16       42      67
Also, I forgot to include some of my notes.
The original benchmarks included a second version of each
benchmark (except Puzzle and FFT) which, instead of using messages
that were instance messages of the benchmark itself, found some type
of element object to send the messages to. It was supposed to be more
object oriented. As the results were generally similar to those of the
main benchmarks, I didn't report the results of those versions.
Greg Gritton