Prev: RSA Proof using CRTs
Next: learning again
From: Thomas Pornin on 24 Jun 2010 08:46 According to Maaartin <grajcar1(a)seznam.cz>: > Btw., did you try a native Java compiler? There is a couple of them > available... I tried the one I wrote -- which translates JVM bytecode to C, and then runs a C compiler. Performance was somewhat lower than what C code achieves, mainly due to the array bounds checks. > That's really bad news for Java. That's not news, and that's not really bad either. A factor of 2.5 or 3 is a worst case, when crunching data which is all in L1 cache. Most applications are limited by I/O (you cannot hash data faster than you actually get it, and 100baseT network tops at about 10 MB/s -- so even if CubeHash is not the fastest of hash functions, its performance is still adequate in many situations). In the few applications where pure processing speed matters, it actually matters for a very small portion of the code only. Java strives on the 99.9% of the code where you do not need cycle-to-cycle records. > Cubehash is very simple, but what normal algorithm would profit from > 32+ registers? CubeHash is an extreme case with regards to register usage. In the early 90's, when DEC developped the first Alpha processor, they ran some simulations, and found out that while having 32 registers instead of 16 was offered significant gains, going from 32 to a whooping 256 registers increased performance only marginally. Thus the Alpha got 32 registers. > Could you provide me the cubehash.s created by something like > gcc -S -O1 -m64 cubehash.c > ? I'm an unfortunate winblow$ user and I can't get -m64 using cygwin. You could try Visual C. If you have a 64-bit Windows, you can get the "Windows SDK" (for free) and it comes with the command-line versions of Visual C for 64-bit architectures. I will try to produce your file. > What do you think about the success chances of the manual instruction > reordering? Anything can happen. The problem of producing optimal code is hard so compilers tend to rely on heuristics which produce "good" code "most of the time". Those heuristics are rarely documented, so manual optimization looks like a hit-and-miss game until you happen to find a situation where the compiler does not do anything stupid. --Thomas Pornin
From: Maaartin on 26 Jun 2010 04:07
Ignoring the swaps, there are 6*32 operations in one round of CubeHash, which means at least 64 cycles on a 3-way superscalar processor. There are 16 rounds for each 32 input bytes, which makes 32 cycles per byte. On your 64-bit system, your C version runs with 39.84 cycles per byte and your Java version runs with 53.05 cycles per byte. On my computer it was slower before I applied the trivial optimization, now it's the same. I we managed to make the compiler do it right, it could run with about the same speed as in C. This should be doable, but there's the 32 cycles/byte limit, so you're quite close already. This limit makes me much less interested in the optimizations... Using XMM, it should run at about 12.5 cycles/byte (not exactly 4x as fast because of missing rotation instructions), but that's not what you're doing. This is valid only for CPUs capable of executing 3 XMM instructions per second, which is AFAIK not the case for AMD processors. |