From: Thomas Pornin on 23 Jun 2010 16:19

Hello all,

sphlib-2.1 has been released: http://www.saphir2.com/sphlib/

sphlib is a library of implementations of hash functions, in both C and Java. The C version includes a variant optimized for small architectures (those with about 8 kB of L1 cache). The C code also includes a command-line tool which can act as a drop-in replacement for the md5sum / sha1sum / etc. tools commonly found on Linux systems. The Java code is compatible with J2ME (the "reduced Java" for mobile phones). A flexible HMAC implementation is provided with the Java code.

sphlib-2.1 includes implementations for the fourteen second-round SHA-3 candidates, as well as a bunch of pre-SHA-3 functions (including SHA-1 and SHA-2).

The archive contains (in its 'doc/' subdirectory) a report on sphlib speed, as measured on a variety of architectures.

sphlib-2.1 is open source (MIT-like license).

--Thomas Pornin
From: Maaartin on 23 Jun 2010 18:32

Nice work!

I managed to achieve a substantial speedup for CubeHash in Java in a very trivial but surprising way: I extracted two methods from sixteenRounds() by simply putting the first and second halves each into a separate method. It seems the JIT can't deal with very large methods. I ran both versions several times on my Phenom II X4 and always gained a factor of about 1.25.

original:   long messages -> 44.27 MBytes/s
my version: long messages -> 56.13 MBytes/s
From: Maaartin on 23 Jun 2010 18:55

I was a bit imprecise in my last post. Of course I didn't split the whole method sixteenRounds(), only the loop body (so I created one method for a single even round and one for a single odd round).

I then continued the method extraction, and this time got a big slowdown. Splitting the loop body into four parts (each corresponding to half a round) gave me

long messages -> 35.65 MBytes/s

Could somebody confirm this? This behavior of Java is suitable both for a bug report and for the DailyWTF.

In case it matters, I'm using
AMD Phenom(tm) II X4 920 Processor, 2800 MHz
Win Professional XP64 Version 2003, SP2
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)
From: Maaartin on 23 Jun 2010 19:19

Now I extracted the whole loop body, so I have

    private final void sixteenRounds() {
        for (int i = 0; i < 8; i ++)
            p();
    }

and again I get

long messages -> 55.78 MBytes/s

Maybe this extraction prevented the variable i from needlessly occupying a register, and the additional free register allowed for much better pipeline utilization.

I think some optimizations are possible by manually reordering the instructions, which could be useful for C as well. I haven't looked at the code produced by the compiler, but I'd guess that it doesn't do the reordering well enough. As you wrote, there's a lot of parallelism available. I would add that, after the reordering, many operations are possible on a subset of the registers before that subset has to be swapped out to memory. But I haven't tried it yet.
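For anyone who wants to try reproducing the effect, here is a minimal sketch of the kind of method extraction described in the posts above. Everything in it is an assumption made for illustration: the class name RoundSplitDemo, the eight-word state, and the mixing steps are invented and are neither CubeHash nor sphlib code. It only shows the shape of the refactoring (one large loop body versus one extracted method per round) and that both variants compute the same state.

```java
import java.util.Arrays;

/** Toy illustration of method extraction. The mixing steps are NOT CubeHash. */
final class RoundSplitDemo {
    // Hypothetical 8-word state standing in for a hash function's internal state.
    private final int[] x = new int[8];

    RoundSplitDemo(int seed) {
        for (int i = 0; i < x.length; i++) {
            x[i] = seed + i; // arbitrary initial state
        }
    }

    /** Variant A: both rounds written inline in a single loop body. */
    void roundsInline() {
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 4; j++) { // "even" round
                x[j + 4] += x[j];
                x[j] = Integer.rotateLeft(x[j], 7) ^ x[j + 4];
            }
            for (int j = 0; j < 4; j++) { // "odd" round
                x[j + 4] += x[j ^ 1];
                x[j] = Integer.rotateLeft(x[j], 11) ^ x[(j ^ 2) + 4];
            }
        }
    }

    /** Variant B: same work, with each round extracted into its own method. */
    void roundsExtracted() {
        for (int i = 0; i < 8; i++) {
            evenRound();
            oddRound();
        }
    }

    private void evenRound() {
        for (int j = 0; j < 4; j++) {
            x[j + 4] += x[j];
            x[j] = Integer.rotateLeft(x[j], 7) ^ x[j + 4];
        }
    }

    private void oddRound() {
        for (int j = 0; j < 4; j++) {
            x[j + 4] += x[j ^ 1];
            x[j] = Integer.rotateLeft(x[j], 11) ^ x[(j ^ 2) + 4];
        }
    }

    /** Returns a copy of the state so callers can compare the two variants. */
    int[] state() {
        return x.clone();
    }

    public static void main(String[] args) {
        RoundSplitDemo a = new RoundSplitDemo(1);
        RoundSplitDemo b = new RoundSplitDemo(1);
        a.roundsInline();
        b.roundsExtracted();
        System.out.println("same state: " + Arrays.equals(a.state(), b.state()));
    }
}
```

A toy this small will almost certainly not show the speed difference reported above; the point is only that the extraction is behavior-preserving, so any timing difference on a real, much larger round function comes from how the JIT compiles each variant on a given JVM.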
From: Thomas Pornin on 23 Jun 2010 19:52

According to Maaartin <grajcar1(a)seznam.cz>:
> In case it matters, I'm using
> AMD Phenom(tm) II X4 920 Processor, 2800 MHz
> Win Professional XP64 Version 2003, SP2
> Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
> Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)

It would be interesting to bench the C code, too. "Morally", the Java version cannot be faster than the C code, but CubeHash is one of the functions where Java is closest to the C performance: on my Intel Q6600 (2.4 GHz), in 64-bit mode, I get 60 MB/s with the C code, and 45 MB/s with the Java implementation. That Java achieves 75% of the C speed is quite rare; for most functions (hash functions and other computation-heavy code I have tried), the speed of Java is more typically between 30% and 50% of the speed achieved with optimized C code. (Note that I am talking here about computations occurring entirely in the L1 cache; in "normal" code, memory, network or disk bandwidth dominates, and Java fares as well as C or just about any other language.)

Anyway, your Phenom should run faster than my Q6600, and the 44 MB/s you get with my code is a bit too low. This looks like a misoptimization by the JVM.

By the way, you may want to update your JVM. The current version from Sun (Oracle) appears to be 1.6.0_20, and it fixes some bugs; it may also include code generation improvements. Version 1.6.0_14 introduced "extensive performance updates to the HotSpot JIT compiler" (or at least so says Wikipedia). If you want to submit a bug report to Sun, the first thing they will ask is whether you use the latest version (that is, if they respond at all).

--Thomas Pornin
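To compare numbers across JVM versions or machines, a small throughput harness helps. The following is only a sketch under stated assumptions: it is not the speed tool shipped with sphlib, the class name HashThroughput and its measure method are hypothetical, and it times the JDK's built-in SHA-1 through java.security.MessageDigest rather than any sphlib class. The 8 kB buffer is chosen to keep the data small, in the spirit of the L1-cache measurements discussed above.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical micro-benchmark; not part of sphlib. */
final class HashThroughput {

    /**
     * Feeds at least totalBytes of zero bytes to the digest in chunk-sized
     * updates and returns the throughput in MBytes/s (1 MByte = 10^6 bytes).
     */
    static double measure(String algorithm, int chunk, long totalBytes) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("unsupported algorithm: " + algorithm, e);
        }
        byte[] buf = new byte[chunk];
        long hashed = 0;
        long start = System.nanoTime();
        while (hashed < totalBytes) {
            md.update(buf);
            hashed += chunk;
        }
        md.digest(); // finalize so the last block is included in the timing
        long elapsed = Math.max(1L, System.nanoTime() - start);
        return (hashed / 1e6) / (elapsed / 1e9);
    }

    public static void main(String[] args) {
        // Warm-up pass so the JIT has a chance to compile the hot loop first.
        measure("SHA-1", 8192, 16L << 20);
        for (int run = 0; run < 5; run++) {
            System.out.printf("SHA-1, long messages -> %.2f MBytes/s%n",
                    measure("SHA-1", 8192, 64L << 20));
        }
    }
}
```

Running it several times, and under both the client and server VMs where available, gives a rough idea of how much of a measured difference is noise and how much comes from the JIT.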