From: tomstdenis@gmail.com on 28 Feb 2005 11:22 [quick blurb]. TomsFastMath is my public domain port of LibTomMath specifically for fast RSA/DH/ECC math. It has been ported to x86_32, x86_64, SSE2 and ARMv4 in previous releases. URL: http://libtomcrypt.org/tfm/index.html Today [just got back from France on Saturday ... hehehe] I've sped up my Squaring code by removing redundant "doubling" steps ... Essentially I was doing n^2 doublings when only n doublings were required. Doubling isn't hard [three adds] but it adds up [excuse the pun] and we're about speed here ;-) Some numbers on two Athlons [64 and XP] in cycles per operation [groups of 10000 operations take the min time] OLD 64-bit... Squaring: 256-bit: 89 512-bit: 234 1024-bit: 815 2048-bit: 2851 NEW 64-bit ... Squaring: 256-bit: 89 512-bit: 228 1024-bit: 691 2048-bit: 2228 OLD 32-bit Squaring: 256-bit: 327 512-bit: 1044 1024-bit: 3646 2048-bit: 17055 NEW 32-bit Squaring: 256-bit: 332 512-bit: 894 1024-bit: 2983 2048-bit: 15197 Amongst other things I want to speed up the generic and Karat code... but generally the 32/64-bit performance is good (1024-bit squaring for instance is used in CRT based RSA-2048). In particular I may add a 64x routine since that would speed up 2048-bit mul/sqr operations BIG TIME. The code expansion is huge but new cpus have huge caches for a reason [and TomsFastMath is tunable]. On the AMD64 I was *already* faster than OpenSSL by [iirc] around 38% in exptmod timings. Since squaring is the second most dominant step in exptmod (you do n-squarings for a n-bit exptmod) this will speed it up. on my AMD64 I just measured NEW Exptmod: 512-bit: 616,647 [~40% faster than OpenSSL] 1024-bit: 3,075,211 2048-bit: 19,613,160 Whereas with TFM 0.02 I got Exptmod: 512: 641,743 1024: 3,167,406 2048: 20,158,609 I can likely get better when I tune the generic routines a bit more. Basically if you move outside of the nice powers of two the comba routines slow down significantly [e.g. 320-bit squaring is much slower than a 256-bit squaring and not much faster than a 512-bit squaring if at all...] What I'm looking for at the moment is a shell account on a P4 running a recent distro of linux + GCC [preferably GCC 3.4.3 ;-)]. I can test the code on my AMD64 with SSE2 but it's obviously not the same as testing on a real P4 [for speed I mean]. Similarly if anyone has access to an ARM box I can use that would be great. Otherwise I'll try setting up a test on my gameboy again [which is annoying cuz I have to go into windows...] Right now I have to write and test the SSE2 and ARMv4 ports of the new squaring code. Then I'll clean up the code a bit and release TFM 0.03 ;-) Tom
From: Jean-Luc Cooke on 28 Feb 2005 13:26 tomstdenis(a)gmail.com <tomstdenis(a)gmail.com> wrote: > [quick blurb]. TomsFastMath is my public domain port of LibTomMath > specifically for fast RSA/DH/ECC math. It has been ported to x86_32, > x86_64, SSE2 and ARMv4 in previous releases. URL: > http://libtomcrypt.org/tfm/index.html > Today [just got back from France on Saturday ... hehehe] I've sped up > my Squaring code by removing redundant "doubling" steps ... Essentially > I was doing n^2 doublings when only n doublings were required. > Doubling isn't hard [three adds] but it adds up [excuse the pun] and > we're about speed here ;-) Shouldn't it be only 2 adds? Or a shift? Or am I thinking too hard about this? > Some numbers on two Athlons [64 and XP] in cycles per operation [groups > of 10000 operations take the min time] > OLD 64-bit... > Squaring: > 256-bit: 89 > 512-bit: 234 > 1024-bit: 815 > 2048-bit: 2851 > NEW 64-bit ... > Squaring: > 256-bit: 89 > 512-bit: 228 > 1024-bit: 691 > 2048-bit: 2228 > OLD 32-bit > Squaring: > 256-bit: 327 > 512-bit: 1044 > 1024-bit: 3646 > 2048-bit: 17055 > NEW 32-bit > Squaring: > 256-bit: 332 > 512-bit: 894 > 1024-bit: 2983 > 2048-bit: 15197 > Amongst other things I want to speed up the generic and Karat code... > but generally the 32/64-bit performance is good (1024-bit squaring for > instance is used in CRT based RSA-2048). > In particular I may add a 64x routine since that would speed up > 2048-bit mul/sqr operations BIG TIME. The code expansion is huge but > new cpus have huge caches for a reason [and TomsFastMath is tunable]. > On the AMD64 I was *already* faster than OpenSSL by [iirc] around 38% > in exptmod timings. Since squaring is the second most dominant step in > exptmod (you do n-squarings for a n-bit exptmod) this will speed it up. > on my AMD64 I just measured > NEW > Exptmod: > 512-bit: 616,647 [~40% faster than OpenSSL] > 1024-bit: 3,075,211 > 2048-bit: 19,613,160 > Whereas with TFM 0.02 I got > Exptmod: > 512: 641,743 > 1024: 3,167,406 > 2048: 20,158,609 > I can likely get better when I tune the generic routines a bit more. > Basically if you move outside of the nice powers of two the comba > routines slow down significantly [e.g. 320-bit squaring is much slower > than a 256-bit squaring and not much faster than a 512-bit squaring if > at all...] > What I'm looking for at the moment is a shell account on a P4 running a > recent distro of linux + GCC [preferably GCC 3.4.3 ;-)]. I can test > the code on my AMD64 with SSE2 but it's obviously not the same as > testing on a real P4 [for speed I mean]. Similarly if anyone has > access to an ARM box I can use that would be great. Otherwise I'll try > setting up a test on my gameboy again [which is annoying cuz I have to > go into windows...] > Right now I have to write and test the SSE2 and ARMv4 ports of the new > squaring code. Then I'll clean up the code a bit and release TFM 0.03 > ;-) > Tom --
From: tomstdenis@gmail.com on 28 Feb 2005 13:39 Jean-Luc Cooke wrote: > tomstdenis(a)gmail.com <tomstdenis(a)gmail.com> wrote: > > [quick blurb]. TomsFastMath is my public domain port of LibTomMath > > specifically for fast RSA/DH/ECC math. It has been ported to x86_32, > > x86_64, SSE2 and ARMv4 in previous releases. URL: > > http://libtomcrypt.org/tfm/index.html > > > Today [just got back from France on Saturday ... hehehe] I've sped up > > my Squaring code by removing redundant "doubling" steps ... Essentially > > I was doing n^2 doublings when only n doublings were required. > > Doubling isn't hard [three adds] but it adds up [excuse the pun] and > > we're about speed here ;-) > > Shouldn't it be only 2 adds? Or a shift? Or am I thinking too hard > about this? It's a three-word carry that acts as an open ended shift register throughout the product. In LTM i do two-word ops but I also don't fill the variable [e.g. 28-bit digits]. In TFM I use asm inlines so I can easily use carries and hence three word carries. Tom
From: Kevin Drapel on 2 Mar 2005 09:16 < Otherwise I'll try > setting up a test on my gameboy again [which is annoying cuz I have to > go into windows...] > GBA or old Gameboy ? There are flash tools for the GBA on Linux : http://www.gameboy-advance.net/fal_soft/gba_flash_2_advance_linux.htm
From: tomstdenis@gmail.com on 2 Mar 2005 19:50 Kevin Drapel wrote: > < Otherwise I'll try > > setting up a test on my gameboy again [which is annoying cuz I have to > > go into windows...] > > > > GBA or old Gameboy ? There are flash tools for the GBA on Linux : > > http://www.gameboy-advance.net/fal_soft/gba_flash_2_advance_linux.htm I found most linux tools don't work well out of the box [mostly the crt/libc isn't built for the memory model of an actual GBA]. So I use an older win32 build of devkitadv for my GBA dev. Tom
|
Pages: 1 Prev: Can you break this code? Next: Need help decoding cipher |