From: Regis d'Aubarede on 30 Jun 2010 11:20

Charles Nutter wrote:
>> If with --server on your system JRuby's still slower than IronRuby,...
> Maybe also worth showing an experimental dynopt flag for JRuby that
> seems to improve performance ....

Sorry for my bad English!

My test is meant to verify that a symmetric multi-core (SMP) machine is used well by the VM. In this respect, pure performance is not important; my concern is that the calculation's duration decreases as the number of threads increases. (http://programmingzen.com/2010/06/28/the-great-ruby-shootout-windows-edition/ shows that JRuby is faster than IronRuby...)

To determine whether the issue is on the JRuby side or the JVM side, I ran the same JRuby code, but invoking a pure Java computation:

    (1..nb_threads).map { Thread.new { Calc.calc(p1, n1) } }

with:

    class Calc {
      public static long calc(int a, int b) {
        long res = 0;
        for (int i = 0; i < a; i++)
          for (int j = 0; j < b; j++)
            for (int k = 0; k < 1000; k++)
              res += i + j + k;
        return res;
      }
    }

Results:

    c:\usr\ruby\local>jruby thread_bench2.rb
    1.8.7, java, 2010-05-12
    1000 iterations by 1 threads , Duration = 15404 ms
     500 iterations by 2 threads , Duration =  8147 ms
     333 iterations by 3 threads , Duration =  5812 ms
     250 iterations by 4 threads , Duration =  4690 ms
     200 iterations by 5 threads , Duration =  4648 ms
     166 iterations by 6 threads , Duration =  4749 ms
     142 iterations by 7 threads , Duration =  4371 ms
     125 iterations by 8 threads , Duration =  4222 ms

So the JVM scales right :) And my Intel Core i7 really does have 4 cores...

Attachments: http://www.ruby-forum.com/attachment/4829/thread_bench2.rb
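[For readers without the attachment: the general shape of such an SMP scaling test can be sketched in plain Ruby. This is a hedged reconstruction, not the attached thread_bench2.rb; the `calc` body and constants here are illustrative assumptions. The total work is held fixed and split across 1..N threads, so on a VM with truly parallel threads the duration should fall as the thread count rises.]

```ruby
require 'benchmark'

# A pure-Ruby stand-in for the workload: nested numeric loops.
def calc(a, b)
  res = 0
  a.times { |i| b.times { |j| res += i + j } }
  res
end

TOTAL = 48  # total iterations, kept small for illustration

(1..4).each do |nb_threads|
  per_thread = TOTAL / nb_threads
  ms = Benchmark.realtime do
    # Fan the fixed amount of work out across N threads and wait for all.
    (1..nb_threads).map { Thread.new { calc(per_thread, 100) } }.each(&:join)
  end * 1000
  puts "#{per_thread} iterations by #{nb_threads} threads , Duration = #{ms.round} ms"
end
```

Note that on MRI 1.8/1.9 the GIL prevents the durations from shrinking; the scaling shown above is a JRuby/JVM property.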
From: Charles Oliver Nutter on 30 Jun 2010 13:57
On Wed, Jun 30, 2010 at 10:20 AM, Regis d'Aubarede <regis.aubarede(a)gmail.com> wrote:
> To determine whether the issue is on the JRuby side or the JVM side, I
> ran the same JRuby code, but invoking a pure Java computation:
>     (1..nb_threads).map { Thread.new { Calc.calc(p1, n1) } }
> with:
>
>     class Calc {
>       public static long calc(int a, int b) {
>         long res = 0;
>         for (int i = 0; i < a; i++)
>           for (int j = 0; j < b; j++)
>             for (int k = 0; k < 1000; k++)
>               res += i + j + k;
>         return res;
>       }
>     }

Yes, this result is not surprising to me. In the original case, the benchmark suffers mostly from all the object creation. For example:

* All the numeric loops (in JRuby) create at least one new Fixnum object for every iteration
* All the math operations create Fixnum or Float objects as well

Running an allocation profile of your benchmark (which actually runs pretty slowly, because there's *so much* allocation happening) shows the amount of data being chewed up... it's very likely that the bottleneck for this particular case is allocating all those closures and all those Fixnums:

    ~/projects/jruby > jruby -J-Xrunhprof thread_bench.rb
    1.8.7, java, 2010-06-17
    1000 iterations by 1 threads , Duration = 399267 ms
    ^CDumping Java heap ... allocation sites ... done.
    ~/projects/jruby > egrep "%|objs" java.hprof.txt | head -n 11
    rank  self    accum   bytes     objs    bytes       objs      trace  name
    1     65.18%  65.18%  13545024  423282  1133938432  35435576  302318 org.jruby.RubyFixnum
    2     22.61%  87.79%  4697920   146810  381348672   11917146  302867 org.jruby.RubyFloat
    3      1.32%  89.12%  274992    5350    274992      5350      300000 char[]
    4      0.62%  89.74%  128488    5341    128488      5341      300000 java.lang.String
    5      0.18%  89.92%  38184     1       38184       1         306423 short[]
    6      0.18%  90.10%  38184     1       38184       1         306428 short[]
    7      0.14%  90.24%  28720     718     29400       735       300521 java.util.WeakHashMap$Entry
    8      0.13%  90.37%  27792     70      27792       70        300000 byte[]
    9      0.13%  90.50%  26832     1118    35040       1460      300704 java.util.concurrent.ConcurrentHashMap$HashEntry
    10     0.12%  90.63%  25232     166     25232       166       300557 org.jruby.MetaClass

Note that this is only after the 1000-iteration run; during execution over 1GB of memory was allocated and released, mostly in Fixnum objects, with a smaller amount (380MB+) in Float objects. Running with verbose GC:

    ~/projects/jruby > jruby -J-verbose:gc thread_bench.rb
    1.8.7, java, 2010-06-17
    [GC 13184K->1128K(63936K), 0.0108696 secs]
    [GC 14312K->2124K(63936K), 0.0077762 secs]
    [GC 15308K->1445K(63936K), 0.0010409 secs]
    [GC 14629K->1246K(63936K), 0.0031958 secs]
    ...

Adding up all the size changes (number of GC runs * difference in live object size) produces roughly the same estimate; for the period the 1000-iteration part of the bench runs, it allocates a *lot* of objects.

IronRuby may do better here if it's able to treat Fixnum objects as value types, which the CLR handles more efficiently than the JVM's "every object is on the heap" model. Ultimately this is largely an allocation-rate benchmark, at least on JRuby, since our Fixnum objects are "real" objects (or, to put it in MRI's favor... our Fixnum objects are forced to be "real" objects with heap lifecycles).
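[Editor's aside: the per-iteration allocation Charlie describes can also be observed from Ruby itself, without a Java profiler. This is a minimal sketch, not from the original thread, and it assumes MRI 2.1+ where `GC.stat(:total_allocated_objects)` exists; on JRuby the counters differ. On MRI, machine-word integers are immediates and cost nothing, so the sketch uses an integer beyond the machine-word range, which forces a fresh heap object on every addition, roughly analogous to what every Fixnum operation costs on JRuby.]

```ruby
# Each += on a beyond-word-size Integer produces a new heap object,
# so the loop below allocates at least one object per iteration.
def bignum_loop(n)
  acc = 2**70                 # too big for an immediate: lives on the heap
  n.times { |i| acc += i }    # every addition builds a fresh heap Integer
  acc
end

before = GC.stat(:total_allocated_objects)
bignum_loop(100_000)
after = GC.stat(:total_allocated_objects)
puts "~#{after - before} objects allocated for 100_000 iterations"
```

The delta comes out near (or above) the iteration count, which is the same "allocation-rate benchmark" effect the hprof table shows for RubyFixnum.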
The dynopt work is part of efforts in JRuby to bring math performance closer to Java's, largely by eliminating the excessive object churn and layers of noise around math operations.

- Charlie