From: Skybuck Flying on 2 Mar 2010 09:24

Hello,

First the bad news:

The bad news is probably that with my current design, too many passes are needed, which means lots of shader switching, texture switching, framebuffer switching and possibly even vertex buffer switching (which would need to happen for each pass).

The number of passes is something like:

for Rounds := 1 to 100 do
  PassA
  for Cycles := 1 to 80000 do
    PassB
    for Warriors := 1 to 10 do
      PassQ
      PassX
      PassY
      PassZ
    end
    PassC
  end
  PassD
end

So that's roughly: 100 x 80000 x 10 x (4 + ?) = 320.000.000 passes.

I did a little benchmark test with just the cpu, and some almost empty routines... and the cpu does it in about 2 seconds or so...

I did a benchmark test some time ago with opengl... and opengl would do somewhere between 300 and 300.000 passes per second... so that's nowhere near the number that's necessary... so I am pretty much losing faith in the current design.

So that's the bad news ;) :)

(I guess this is also nvidia's dirty little secret?! Maybe cuda suffers the same fate... and needs to switch a lot between "kernels/warps" for more advanced algorithms... this could be the reason why nvidia might be developing a cpu to be integrated into the gpu... so that these "pass" switches could be done by a special cpu-like chip on the gpu... to accelerate it. How it would work... I have almost no idea, except... I guess it would upload some kind of special cpu-like program to it... just like a shader goes to the gpu. This has been predicted by a recent Tom's Hardware article ;) :))

Ok, such bad news cannot go without good news! LOL. Unfortunately... I now have to make up some good news... so I am going to go back over all the numbers to see if I can come up with a new design/algorithm which takes more advantage of the gpu itself... without requiring so many fricking passes!

The general idea is to use the registers of the cores to keep track of data values, like a tiny minimum cpu cache. However I am not sure if each core inside the gpu has its own registers?!? I would guess so... However... these registers need to be freed... before the core can go on to the next input/output data??? (Or maybe the gpu has a stack for the registers??? I think not... but I'm not sure... ;))

So this would mean... that by the end of the shader... the registers need to be written to the output.

So this means I have to go back to one of my postings/ideas from the past... where I wrote something like: "I can see a possible future for gpu computing" ;) :)

The idea was to treat the cores of the gpu and the registers of the shaders... like little tiny intelligent cells... which only do work if it was addressed to them.

So the idea (here) I guess is to "attach" intelligent little cells to each pixel/index, which then all get iterated by a single pass/shader... which is hopefully executed 300.000 times per second or so... or 3000 times or 300 times... but at least it will just be one shader, maybe two or so, but that's it... maybe a vertex shader plus a fragment shader... maybe two times.

So the idea is as follows. Each pixel has the following data attached:

Core[Pixel].Instruction
Warrior[0].P-Space[Pixel].Value
Warrior[1].P-Space[Pixel].Value
Warrior[0].ProcessQueue[Pixel].Value
Warrior[1].ProcessQueue[Pixel].Value

Additional pixels could keep track of:

Simulator[S].WarriorsAlive
Simulator[S].Warrior[W].Score

Ultimately the pixel/vertex shaders could simply treat the input memory as one giant byte array in rgba32 format... by using unpack/pack functions to extract bytes from and store bytes into the pixels.
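To make this "cell" idea a bit more concrete, here is a rough Delphi-style sketch of the data each pixel could carry. All field names and sizes here are made-up illustrations, not actual simulator code:

type
  // One redcode instruction, roughly 6 bytes; the "5x6 bytes of updates"
  // estimate further below assumes about 5 of these change per cycle.
  TInstruction = packed record
    OpCode : Byte; // operation plus modifier
    Modes  : Byte; // addressing modes for the A and B fields
    FieldA : Word; // A-field
    FieldB : Word; // B-field
  end;

  // The "intelligent cell" attached to each pixel/index:
  TCell = packed record
    Instruction  : TInstruction;        // Core[Pixel].Instruction
    PSpace       : array[0..1] of Word; // Warrior[0..1].P-Space[Pixel].Value
    ProcessQueue : array[0..1] of Word; // Warrior[0..1].ProcessQueue[Pixel].Value
  end;

With a layout like this one cell is 14 bytes, so it spans three and a half rgba32 texels... which is exactly why treating the whole texture as one giant byte array with pack/unpack functions is attractive.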
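And here is a minimal sketch of the pack/unpack trick itself, written on the cpu side in Pascal just to show the byte twiddling (a shader would do the equivalent with its own pack/unpack routines):

program PackUnpackDemo;

// Extract byte Index (0..3) from a 32-bit rgba texel.
function UnpackByte(const Texel: LongWord; const Index: Integer): Byte;
begin
  UnpackByte := (Texel shr (Index * 8)) and $FF;
end;

// Store Value as byte Index (0..3) of a 32-bit rgba texel.
function PackByte(const Texel: LongWord; const Index: Integer;
                  const Value: Byte): LongWord;
begin
  PackByte := (Texel and not (LongWord($FF) shl (Index * 8)))
           or (LongWord(Value) shl (Index * 8));
end;

var
  Texel : LongWord;
begin
  Texel := $11223344;               // some example texel
  Texel := PackByte(Texel, 2, $AB); // replace byte 2
  WriteLn(UnpackByte(Texel, 2));    // prints 171 (= $AB)
end.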
Then two textures exist... an input texture and an output texture, which are more or less the same... These would be close to 256 MB each, to fit in the 512 MB of ram...

If the shaders are to do updates on these textures then it could take lots of bandwidth, but that could be prevented by doing smart updates... but let's see what a dumb algorithm would be like, if one had to copy all of it all the time:

roughly 50*1024 MB/sec available / 256 MB per cycle = 200 cycles per second... woopsie! ;) :)

That's very bad lol.

Let's see how much bandwidth we can actually burn to achieve cpu-like capabilities:

100 x 80.000 x 10 = 80.000.000 cycles

50 * 1024 * 1024 * 1024 B/sec / 80.000.000 cycles/sec = 53.687.091.200 / 80.000.000 = 671 bytes per cycle.

Wow, that's not a lot... quite surprising really...

Then again, per cycle something like 5x6 bytes of updates are needed for the instructions... that's 30 bytes... for pspace maybe another 2 bytes... for the process queue maybe another 4 bytes... plus some additional process queue overhead/head/tail/processes thingies... so this is well within range... however the addresses would need to be specified as well... and then this stuff copied... but it should be doable...

However I said "cpu-like capabilities"... so this means the gpu is only as fast as the cpu at this point... which is quite weird...

So all in all... this means something like 671 / ? = ? performance benefit over the cpu... so this number could actually be quite important... I estimated the number to be 58 bytes; this would need to be copied twice, so that's 116.

So 671 / 116 = 5.7x speedup over the cpu (the little program near the end of this posting redoes all of this arithmetic). This is the worst case scenario though... actual performance might be better. Though this is kinda saddening ;) :)

Lesson learned during this posting:

GPU GB/sec / CPU GB/sec = speedup of GPU over CPU.

Now according to the figures/specs:

the gpu is roughly: 50 GB/sec
the cpu is roughly: 16 GB/sec

So the speedup when both run at maximum efficiency: 50 / 16 = 3.125x.

This does not include the bandwidth limit of the pci-express lane... which could bottleneck the gpu for some bigger algorithms... if that's the case the opposite could happen:

CPU speedup: 16 / 2 = 8x... the cpu could be 8x faster if pci-express (at roughly 2 GB/sec) is the bottleneck?! ;) :) and the data remains inside the cpu, which is unlikely?!? so maybe not a fair comparison ;)

With main memory the cpu could be 2 GB/sec, so the speedup of the gpu is: 50 / 2 = 25x max. But since it needs to do double buffering/swapping etc... 25 / 2 = 12.5x max. Which is kinda the number reported by others...

Access time seems to be a totally different matter though... here the gpu could prevail... once the memory is uploaded.

So I guess there is very little good news... and only some lessons learned...

However the good news could be:

I learned a lot about opengl/cg shaders and the gpu and its capabilities...

I could now give up on the corewars executor on the gpu because it would probably not run fast enough?!? (opengl does not achieve high enough iterations/passes per second, and bandwidth could be an issue too ;))

A 5x speedup is not enough for the effort, a 12x speedup is not enough for the effort, a 30x speedup is not enough for the effort! ;) :)

I want a 9000x speedup! LOL ;) :)

So I think it's time to give up on this pipedream of corewars on the gpu... I mainly did it because others on a forum dreamed about it and I thought it would be nice if I could make their dream a reality... however it remains a pipe dream, methinks... ;) :)

I can now go back to my older software and focus on that instead... I could also create other new software... and/or try to create a new ai for my game, which is what actually got me into corewars in the first place! ;) :)
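Before I sign off, here is the whole back-of-the-envelope estimate from above as one little console program. All the constants are just my rough estimates from this posting, nothing measured:

program GpuBudget;

const
  GpuBandwidth  = 50.0 * 1024 * 1024 * 1024; // bytes/sec, roughly 50 GB/sec
  CyclesNeeded  = 100.0 * 80000 * 10;        // rounds x cycles x warriors = 80.000.000
  BytesPerCycle = 58.0 * 2;                  // ~58 bytes of updates, copied twice = 116

var
  Budget : Double;
begin
  // Bytes available per cycle if all cycles must finish in about one second:
  Budget := GpuBandwidth / CyclesNeeded;
  WriteLn('Budget per cycle   : ', Budget:0:0, ' bytes');            // ~671
  WriteLn('Worst-case speedup : ', Budget / BytesPerCycle:0:1, 'x'); // roughly the 5.7x from above

  // Raw bandwidth ratios from the specs:
  WriteLn('GPU vs CPU (16 GB/sec) : ', 50.0 / 16.0:0:3, 'x'); // 3.125x
  WriteLn('GPU vs CPU (2 GB/sec)  : ', 50.0 / 2.0:0:1, 'x');  // 25x, halved to 12.5x by double buffering
end.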
However I think I have spent enough time on this "entertainment/educational" thing ;) :) So I think it's now time for me to switch to something else... and leave it for now! =D

Bye,
  Skybuck ;) =D
From: Skybuck Flying on 2 Mar 2010 14:51

Oh well... I came this far... and I have been positively surprised by some performance benchmarks... maybe there is some magic in there after all... like caching effects...

And going back to just 2 cores kinda sux anyway...

So maybe I will continue the project... and finish it... just to see what the final performance would be ;) :)

Bye,
  Skybuck =D