Subject: Smalltalk/C Benchmark results
From: Greg & Cindy Gritton
Reply-To: gritton.family@verizon.net
Newsgroups: comp.lang.smalltalk
Date: Sun, 10 Feb 2002 00:25:19 GMT
In honor of the Olympics (or the fact that I completed the runs), I am
presenting the results of some benchmarks run on various versions of
Smalltalk as well as in C. These are small benchmarks, originally written
in C and then ported to Smalltalk. They were taken from the Self project
with some alterations (see notes below).
These benchmarks don't represent how Smalltalk is currently used. Most often,
Smalltalk seems to be used in business logic and web applications. Its
main competitors there are C++ and perhaps Perl. Business applications
typically use virtual functions, dynamic memory allocation, and an
object-oriented design style, which results in similar execution speeds
for Smalltalk and C++.
These benchmarks are much simpler than business applications. They are
small programs that perform basic tasks such as sorting, matrix multiply,
or classical logic problems such as "8 queens". They don't use any
dynamic memory allocation, are not object oriented, generally hit the
cache, and run very fast in C. As such, they show Smalltalk in its
worst light.
With that disclaimer, they are still useful. They help show where
Smalltalk is weak in execution speed, and how weak each version is.
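As a rough illustration of what these kernels look like in Smalltalk (a
sketch written for this posting, not the suite's actual code), here is the
heart of a bubble sort on a reverse-sorted array: nothing but indexed
access, comparisons, and SmallInteger arithmetic, every one of them a
message send that an optimizing C compiler turns into a couple of machine
instructions.

    | data n t |
    n := 500.
    data := Array new: n.
    1 to: n do: [:i | data at: i put: n - i + 1].    "reverse-sorted input"
    1 to: n - 1 do: [:pass |
        1 to: n - pass do: [:j |
            (data at: j) > (data at: j + 1) ifTrue: [
                t := data at: j.
                data at: j put: (data at: j + 1).
                data at: j + 1 put: t]]].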
All benchmarks were run on a 700MHz Pentium-III notebook with 256MB
of memory running Windows 2000. Times are in milliseconds per run.
Runtime (ms)
Benchmark     C (opt)  C (debug)    VW   ST-X    MT  Dolphin  Squeak
BubbleSort        1.7        3.8    24     39    30       88      96
IntMM             0.8        2.4    11     20    19       39     102
MM                1.0        2.7    52     71    23       74     150
Perm              1.9       11.3    13     16    14       76      87
Queens            1.5        3.0     9      9     8       42      52
Quicksort         1.5        3.7    10     11    12       38      46
Towers            1.4       10.0    15     19    26       85     117
Puzzle            8.0       14.9    74    117    67      252     455
FFT               1.3        3.6   103    114   493      173     318
WeightedTotal    13.0       43.4   205    282   330      626     984
Ratio to C        1.0        3.3    16     22    25       48      76
Ratio to optimized C
Benchmark     C (opt)  C (debug)    VW   ST-X    MT  Dolphin  Squeak
BubbleSort        1.0        2.2    14     23    18       53      57
IntMM             1.0        3.1    14     26    25       51     134
MM                1.0        2.7    52     71    23       74     150
Perm              1.0        6.0     7      8     7       40      46
Queens            1.0        2.0     6      6     5       28      34
Quicksort         1.0        2.5     7      8     8       26      32
Towers            1.0        7.1    11     13    18       60      83
Puzzle            1.0        1.9     9     15     8       31      57
FFT               1.0        2.9    82     90   391      137     252
Median            1.0        2.7    11     15    18       51      58
Average           1.0        3.4    23     29    56       56      94
Geo. Mean         1.0        1.6    14     19    18       49      74
Versions:
  Microsoft Visual C++ 6.0 Enterprise Edition
    - Default "Release" and "Debug" settings
  VisualWorks Non-Commercial 5i.4
  Smalltalk-X 4.1.3
  Smalltalk MT version 4.00 (Build 517) 9x Edition (ANSI)
    - Optimized image
  Dolphin 4.01.3 Professional
  Squeak 3.0
Notes:
[1] I also tested evaluation versions of VisualAge and SmallScript.
Unfortunately, both of their license agreements seem to
prevent posting of their results.
[2] Most benchmarks are integer only, except MM and FFT, which
are floating point benchmarks.
[3] The benchmarks are mostly as they exist in the Self distribution
with the following changes.
[a] I created a common file-in format.
[b] I increased the number of runs to give more accurate timing
on newer computers. I am now running each benchmark 10 times
in Smalltalk and 1000 times in C. I verified that changing
the number of runs doesn't affect the time per run.
(A sketch of this kind of driver follows this note.)
[c] I ported the FFT benchmark to Smalltalk. The port is a
straightforward translation from C and doesn't use a Smalltalk
style. It doesn't implement the mathematical functions in the
SComplex element.
[d] I fixed Uniform11 in C to take pointers to integers instead
of plain integers. It was supposed to maintain its values between
calls but, because it received copies, it didn't.
[e] I made the MM benchmark (Matrix Multiply) its own benchmark
with its own methods, and initialized the sum variable
in innerproductOf: to 0.0. It used to be a subclass of
IntMMBenchmark and initialized the sum variable to 0.
This defeated Smalltalk MT's float optimization in a
situation that wasn't very realistic.
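For what it's worth, the driver described in [b] amounts to something
like the following workspace sketch (the 1000 factorial is just a
stand-in workload, and this assumes a dialect that provides Time
class>>millisecondsToRun:):

    | runs body totalMs |
    runs := 10.
    body := ["the benchmark body goes here" 1000 factorial].
    totalMs := Time millisecondsToRun: [runs timesRepeat: body].
    Transcript show: 'ms per run: ', (totalMs / runs) asFloat printString; cr.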
[4] The exact formulation of a floating point test makes
a big difference to Smalltalk-MT, which tries to optimize
floating point. For example, changing the initial value of the
sum variable in MM from 0 to 0.0 allowed Smalltalk-MT to optimize
the inner loop, moving it from the slowest Smalltalk on that
benchmark to the fastest. Unfortunately, it can't optimize
the inner loop of FFT and remains the slowest Smalltalk
on that benchmark.
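To make the distinction concrete, here is a minimal workspace sketch
(the array contents are arbitrary): starting the accumulator with the
Float literal 0.0 keeps the whole loop in floating point, while starting
it with 0 makes the first addition a SmallInteger operation and defeats
the optimization.

    | row col sum |
    row := (1 to: 100) collect: [:i | i / 10.0].
    col := (1 to: 100) collect: [:i | i / 20.0].
    sum := 0.0.    "with  sum := 0.  the loop is no longer float-only"
    1 to: row size do: [:i |
        sum := sum + ((row at: i) * (col at: i))].
    sum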
[5] The FFT benchmark contains a lot of redundant expressions
in its inner loop that are eliminated automatically by the
optimizing C compiler. I used the same code for Smalltalk.
The Smalltalk compiler can't optimize the expressions and
so the results of this benchmark are very poor. On the other
hand, I can nearly double the benchmark speed by optimizing
a few lines by hand to eliminate redundant expressions.
A real FFT kernel written in Smalltalk would have such optimizations.
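Here is the kind of rewrite involved, using made-up variables rather
than the actual FFT code: the C optimizer hoists the repeated
subexpression automatically, while in Smalltalk it has to be factored
out by hand.

    | re w j sumRe sumIm t |
    re := (1 to: 8) collect: [:i | i / 10.0].
    w := 0.25.  j := 3.
    sumRe := 0.0.  sumIm := 0.0.
    "As written: w * (re at: j) is evaluated twice."
    sumRe := sumRe + (w * (re at: j)).
    sumIm := sumIm + (w * (re at: j)).
    "Hand-optimized: evaluate it once and reuse the temporary."
    t := w * (re at: j).
    sumRe := sumRe + t.
    sumIm := sumIm + t.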
[6] Reporting the runtimes of all tests is easy. Creating a single
performance number from a number of test results is trickier.
There are numerous ways of doing so.
(1) Sum of execution times.
This may be the most realistic as you are usually interested
in how long it takes to do something. The disadvantage is
that a baseline and scaling factor are needed. Some options are:
(a) Simply add up the execution time of the tests as-is.
Hopefully the tests are reasonably well balanced
so certain tests don't dominate.
(b) Get the execution time of each benchmark on some
reference system. Divide the time of each benchmark
by that reference's results.
(c) Get the execution time of each benchmark relative to
some reference on the current system. The execution
time of optimized C is a reasonable reference.
Tests that are slow relative to the reference will end
up emphasized in the results.
(2) Sum of speeds.
This has the same biasing issues as (1), plus it doesn't
reflect usage very well. It emphasizes very fast tests.
(3) Geometric mean.
Using the geometric mean gets rid of most weighting
problems - it is easy to weight tests evenly.
Using the geometric mean of times or of speeds
is equivalent. There still needs to be some reference
for reporting. Overly fast and overly slow tests have
about an equal effect on the geometric mean.
(4) Median
This can be the absolute median, if the tests are
properly weighted, or the median relative to some
baseline such as optimized C.
In the end, I reported the totals as follows:
(1) Total time in milliseconds, totaling most benchmarks
but only 1/2 of the FFT and 1/3 of the Puzzle benchmarks.
These benchmarks take longer than the others and
would dominate otherwise.
(2) The ratio to optimized C of the total.
(3) The median test relative to optimized C.
(4) The average of the execution time relative to optimized C.
(5) The geometric mean of the execution time relative
to optimized C.
The various versions are reported fastest to slowest on
most measures. (A sketch of the median and geometric-mean
calculations follows this note.)
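As a concrete check of (3) and (4), this workspace sketch uses the
VisualWorks ratios from the table above and matches the reported
Median of 11 and, after rounding, the Geo. Mean of 14:

    | ratios n sorted median geoMean |
    ratios := #(14 14 52 7 6 7 11 9 82) collect: [:r | r asFloat].
    n := ratios size.
    sorted := ratios asSortedCollection asArray.
    median := n odd
        ifTrue: [sorted at: n + 1 // 2]
        ifFalse: [((sorted at: n // 2) + (sorted at: n // 2 + 1)) / 2].
    geoMean := (ratios inject: 1.0 into: [:product :r | product * r])
        raisedTo: 1.0 / n.
    Transcript show: 'Median: ', median printString; cr.
    Transcript show: 'Geo. mean: ', geoMean printString; cr.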
[7] If anyone has other benchmarks that run in both Smalltalk and C
or C++, and that aren't too difficult to set up, please send me
pointers or post them here. Thanks.
I hope that this is useful to people.
Greg Gritton
Subject: Re: Smalltalk/C Benchmark results
From: Greg & Cindy Gritton
Reply-To: gritton.family@verizon.net
Newsgroups: comp.lang.smalltalk
Date: Mon, 11 Feb 2002 05:57:23 GMT
I don't normally follow up my own postings, but I added treesort
to the benchmark mix. This benchmark was already there but had
not been run in the suite before.
The key difference between treesort and the other benchmarks
is that the C version of the benchmark dynamically allocates
each element that it inserts into the tree. Of course
the Smalltalk version does likewise. However, the presence
of dynamic allocation brings Smalltalk and C much closer.
VisualWorks is only 3 times slower than optimized C and
actually beats the C debug version.
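To illustrate, here is a workspace sketch of the pattern treesort
stresses (three-slot Arrays stand in for node objects; this is not the
suite's actual code): every inserted element costs a fresh object
allocation in Smalltalk, just as every node costs a malloc in the C
version, so both sides pay a per-element allocation price.

    | newNode root node idx data |
    newNode := [:v | Array with: v with: nil with: nil].  "value, left, right"
    root := nil.
    data := #(5 3 8 1 4 7 9 2 6).
    data do: [:v |
        root isNil
            ifTrue: [root := newNode value: v]
            ifFalse: [
                node := root.
                [idx := v < (node at: 1) ifTrue: [2] ifFalse: [3].
                 (node at: idx) isNil
                     ifTrue: [node at: idx put: (newNode value: v). false]
                     ifFalse: [node := node at: idx. true]] whileTrue]].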
Here are the updated results with treesort included.
Also, the geometric mean slowdown of "C-Debug" is corrected.
(It should have been 3.0)
Runtime (ms)
Benchmark     C (opt)  C (debug)    VW   ST-X    MT  Dolphin  Squeak
BubbleSort        1.7        3.8    24     39    30       88      96
IntMM             0.8        2.4    11     20    19       39     102
MM                1.0        2.7    52     71    23       74     150
Perm              1.9       11.3    13     16    14       76      87
Queens            1.5        3.0     9      9     8       42      52
Quicksort         1.5        3.7    10     11    12       38      46
Treesort          5.2       25.2    17     19    28       69     144
Towers            1.4       10.0    15     19    26       85     117
Puzzle            8.0       14.9    74    117    67      252     455
FFT               1.3        3.6   103    114   493      173     318
WeightedTotal    18.0       68.1   225    296   427      673    1090
Ratio to C        1.0        3.8    13     17    24       38      61
Ratio to optimized C
Benchmark     C (opt)  C (debug)    VW   ST-X    MT  Dolphin  Squeak
BubbleSort        1.0        2.2    14     23    18       53      57
IntMM             1.0        3.1    14     26    25       51     134
MM                1.0        2.7    52     71    23       74     150
Perm              1.0        6.0     7      8     7       40      46
Queens            1.0        2.0     6      6     5       28      34
Quicksort         1.0        2.5     7      8     8       26      32
Treesort          1.0        4.8     3      4     5       13      28
Towers            1.0        7.1    11     13    18       60      83
Puzzle            1.0        1.9     9     15     8       31      57
FFT               1.0        2.9    82     90   391      137     252
Median            1.0        2.8    10     14    13       46      57
Average           1.0        3.5    20     27    51       51      87
Geo. Mean         1.0        3.2    12     16    16       42      67
Also, I forgot to include some of my notes.
The original benchmarks included a second version of each
benchmark (except Puzzle and FFT) which, instead of using messages
that were instance messages of the benchmark itself, found some type
of element object to send the messages to. It was supposed to be more
object oriented. As the results were generally similar to those of the
main benchmarks, I didn't report the results of those versions.
Greg Gritton