From: Jan Panteltje on 27 May 2010 14:59

On a sunny day (Thu, 27 May 2010 12:00:04 -0500) it happened "Tim Williams"
<tmoranwms(a)charter.net> wrote in <htm8oh$vqf$1(a)news.eternal-september.org>:

>"Jan Panteltje" <pNaonStpealmtje(a)yahoo.com> wrote in message
>news:htm259$o5f$1(a)news.albasani.net...
>>>How long do you figure the original software took to write? 300 days? If
>>>they had designed it for 300-core operation from the get-go, they wouldn't
>>>have had that problem.
>>>
>>>Sounds like a failure of management to me :)
>>
>> You are an idiot, I can hardly decrypt your rant.
>
>Strange, it's perfect American English.

Let's see, let's use something simple, right? An H264 encoder.
Now just, in some simple words, how would you split that up over 300 cores?
I think you have no clue what you are on about; honestly, have you ever
written a video stream manipulation tool?
Perhaps like this:
 http://panteltje.com/panteltje/subtitles/index.html
And that then is a 'filter' (I do not like the word filter, as a filter
'removes' something, and this adds something) that has to fit the API of yet
another stream processor (transcode in this case), which has to work with
about every codec available.
You can split things up a bit, maybe over 5 or 6 cores, but after that
splitting becomes next to impossible.
What does help is dedicated hardware for things like this, say using a
graphics card GPU for encoding.
Multicore? 300?? Forget it.

>> The original soft was written when there WERE no multicores.
>> And I wrote large parts of it,
>> AND it cannot be split up in more than say 6 threads if you wanted to.
>
>Sounds like you aren't trying hard enough. Design constraints chosen early
>on, like algorithmic methods, can severely impact the final solution.
>Drawing pixels, sure, put a core on each and let them chug. Embarrassingly
>parallel applications are trivial to split up. If there's some higher level
>structure to it that prevents multicore execution, that would be the thing
>to look at.

Blabber. Write it, publish it if you have the guts.

>And yes, it may result in rewriting the whole damn program. Which was my
>point, it may be necessary to reinvent the entire program, in order to
>accommodate new design constraints as early as possible.

Insanity tries to invent the impossible.
Many have gone before you, and none are remembered.
It is like a plane with one half wing missing: great design, but it will
NOT fly.

>> But, OK, I guess somebody could use a core for each pixel,
>
>GPUs have been doing it for decades.
>
>> plus do multiple data single instruction perhaps,
>> will it be YOU who writes all that? Intel has a job for you!
>
>No. SIMD is an instruction thing, not a core thing. Not at all parallel.

Actually it *is* parallel. I have some nice code here that does decryption
that way, but it is only possible if you need to do the same operations on
every stream.
What happens is that in one, say 32-bit-wide, instruction you do a logical
operation on 32 single-bit streams.

>SIMD tends to be cache limited, like any other instruction set, running on
>any other core. The only way to beat the bottleneck is with more cores
>running on more cache lines.

Dunno what you are saying here.
The whole thing is limited by cache; if you can execute in cache without
reloading from memory it will be faster.
Of course moving data from one to another core via whatever infrastructure
you have is a big bottleneck.
Architectures exist, many of them, and none is perfect.
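On the bit-slice point above, here is a rough C sketch of what "one
32-bit-wide instruction doing a logical operation on 32 single-bit streams"
looks like. The struct layout and the toy mixing round are made up purely
for illustration, not taken from any real cipher:

  #include <stdint.h>

  /* bit[i] holds bit i of 32 independent 8-bit states, one per stream. */
  typedef struct {
      uint32_t bit[8];
  } slice8_t;

  /* One made-up mixing step: every 32-bit logic op below acts on all
     32 streams at once. */
  static void toy_round(slice8_t *s, const slice8_t *key)
  {
      for (int i = 0; i < 8; i++) {
          uint32_t x = s->bit[i] ^ key->bit[i];           /* 32 XORs at once */
          uint32_t y = s->bit[(i + 1) % 8] & key->bit[i]; /* 32 ANDs at once */
          s->bit[i] = x ^ y;
      }
  }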
>FWIW, maybe you've noticed here before, I've done a bit of x86 assembly
>before. I'm not unfamiliar with some of the enhanced instructions that've
>been added, like SIMD. However, I've never used anything newer than 8086,
>not directly in assembly.

SIMD is not an instruction, it is a way of processing, although there are
specific instruction sets, like MMX on x86, that use it.

>More and more, especially with APIs and objects and abstraction and
>RAM-limited bandwidth, assembly is less and less practical. Only compiler
>designers really need to know about it. The days of hand-assembled inner
>loops went away some time after 2000.

Well, you should really try to see the difference between ffmpeg with asm
optimisation enabled (the default) and ffmpeg compiled as plain C (for
processors where the time-critical parts written in asm are not available).
It is shocking; all of a sudden it hogs the system.
I have compiled it both ways on x86, and VERY quickly went back to the asm
version.
FYI, ffmpeg is sort of the Linux Swiss codec knife.

>> I'd love to see a 300 GHz gallium arsenide x86 :-)
>> I would buy one.
>
>If Cray were still around, I bet they would actually be crazy enough to make
>John's GaAs RTL monster.
>
>Tim

If these wafers work, maybe it could be done. I dunno much about
semiconductor manufacturing, so maybe it needs totally different processes,
and at that speed for sure VERY short connections, but with GaAs some
expertise exists; maybe they can easily do optical on chip too?
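Back on the ffmpeg asm point: the kind of inner loop where the hand-written
asm pays off is something like a 16x16 sum of absolute differences, the
classic motion-search hot spot. The plain-C version below only shows the
shape of the loop; the name and signature are mine for illustration, not
ffmpeg's actual API, and the real thing replaces loops like this with
hand-written MMX/SSE:

  #include <stdint.h>
  #include <stdlib.h>

  /* Plain-C 16x16 sum of absolute differences between a block of the
     current frame and a candidate block in the reference frame. */
  static unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref,
                            int stride)
  {
      unsigned sad = 0;
      for (int y = 0; y < 16; y++) {
          for (int x = 0; x < 16; x++)
              sad += abs(cur[x] - ref[x]);
          cur += stride;
          ref += stride;
      }
      return sad;
  }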
From: Tim Williams on 27 May 2010 17:37

"Jan Panteltje" <pNaonStpealmtje(a)yahoo.com> wrote in message
news:htmffu$eoa$1(a)news.albasani.net...
> Let's see, let's use something simple, right? An H264 encoder.
> Now just, in some simple words, how would you split that up over 300
> cores?
> I think you have no clue what you are on about, honestly, have you ever
> written a video stream manipulation tool?
                  ^^^^^^
                  ^^^^^^
Well there's your problem, using streams. So not only is your problem an
early design constraint, it's a fundamental construct of your operating
system. Pipes and streams, such ludicrosity.

Now, if you download the whole damn file and work on it with random access,
you can split it into 300, 3000, however many pieces you want, as fine as
the block level, maybe even frame by frame.

It's my understanding that most video formats have frame-to-frame coherence,
with a total refresh every couple of seconds maybe (hence those awful, awful
videos where the error builds up and not a damn thing looks right, then
WHAM, in comes a refresh and everything looks ok again, for a while). So
the most you could reasonably work with, in that case, is a block. Still,
if there's a block every 2 seconds, and you assign each block to a separate
core, a two-hour movie gives you some 3600 blocks to chug on. That movie
might be ~4GB, which might transfer in a few seconds over the computer's
bus. More than likely, it would take longer to send to all the cores than
it would take for all of them to do their computations.

Obviously, there is little point in >1k cores in such a bandwidth-limited
application which only takes a few minutes on ordinary systems anyway, so
obviously, you use such systems to solve much more complex problems, like
quantum mechanics. folding(a)home work units range in size from a few megs
to hundreds, and they all take about the same processing time (a day or two
on modern processors). Such activities are clearly not
storage-bandwidth-limited, and would gain a lot from a multicore approach.
Which is, after all, how they are implemented. From the ground up.

Tim

--
Deep Friar: a very philosophical monk.
Website: http://webpages.charter.net/dawill/tmoranwms
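A minimal sketch of that block-per-core idea, assuming the file has already
been indexed into independent chunks at the refresh points. transcode_gop(),
the chunk count, and the worker count are hypothetical stand-ins for
illustration, not any library's API:

  #include <pthread.h>
  #include <stdio.h>

  #define NUM_GOPS    3600     /* e.g. a two-hour movie, one chunk per ~2 s */
  #define NUM_WORKERS 8        /* or 300, if you have them */

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static int next_gop = 0;

  /* Stand-in for the real per-chunk decode/encode work. */
  static void transcode_gop(int gop)
  {
      printf("transcoding chunk %d\n", gop);
  }

  /* Each worker pulls the next unclaimed chunk and processes it
     independently, since chunks start at a refresh point. */
  static void *worker(void *arg)
  {
      (void)arg;
      for (;;) {
          pthread_mutex_lock(&lock);
          int gop = (next_gop < NUM_GOPS) ? next_gop++ : -1;
          pthread_mutex_unlock(&lock);
          if (gop < 0)
              return NULL;
          transcode_gop(gop);
      }
  }

  int main(void)
  {
      pthread_t tid[NUM_WORKERS];
      for (int i = 0; i < NUM_WORKERS; i++)
          pthread_create(&tid[i], NULL, worker, NULL);
      for (int i = 0; i < NUM_WORKERS; i++)
          pthread_join(tid[i], NULL);
      return 0;
  }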
From: Paul Keinanen on 28 May 2010 03:57

On Thu, 27 May 2010 18:59:58 GMT, Jan Panteltje
<pNaonStpealmtje(a)yahoo.com> wrote:

>Let's see, let's use something simple, right? An H264 encoder.
>Now just, in some simple words, how would you split that up over 300 cores?

If we talk about video encoding and compression in general, there are
several cases in which a huge parallel processing power would be useful.

In modern compression systems, there are numerous options for how to encode
a sequence. It is hard to predict in advance which method will give the
best result in terms of quality and transfer or storage requirements. With
sufficient computing power, each encoding option can be executed in
parallel on the same uncompressed material, and after encoding, the method
which gives the best result can be selected on a second-by-second basis.

Generating motion vectors requires that an object is detected and also
detecting where it has moved in the next picture (or where it was in the
previous picture). The search in all directions around the current location
can be performed in parallel, and the best match is then used to generate
the motion vector.

Video sequences consist of several pictures, each of which could be
processed by a separate group of processors, at least as long as only
intra-coding is used.

For instance the HDTV 1920x1080 picture can be divided into 8100
macro-blocks of 16x16 pixels each. With only 300 cores, each core would
have to handle a slice of macro-blocks :-).
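A rough sketch of that slice-of-macroblocks split, farming the macroblock
rows out over the available cores with OpenMP. encode_macroblock() is a
hypothetical stand-in for the real per-block work (motion search, transform,
quantisation), and the neighbour dependencies a real encoder has (intra
prediction, deblocking) are ignored here:

  /* Stand-in for the real per-macroblock work; not any encoder's API. */
  static void encode_macroblock(int mb_x, int mb_y)
  {
      (void)mb_x;
      (void)mb_y;
  }

  /* A 1920x1080 picture is 120x68 macroblocks (the bottom row is partly
     padding); the rows are distributed over whatever cores are present. */
  void encode_picture(void)
  {
      enum { MB_COLS = 120, MB_ROWS = 68 };

      #pragma omp parallel for schedule(static)
      for (int mb_y = 0; mb_y < MB_ROWS; mb_y++)
          for (int mb_x = 0; mb_x < MB_COLS; mb_x++)
              encode_macroblock(mb_x, mb_y);
  }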
From: Jan Panteltje on 28 May 2010 05:31

On a sunny day (Fri, 28 May 2010 10:57:50 +0300) it happened Paul Keinanen
<keinanen(a)sci.fi> wrote in <hmsuv5p6rs8fh469a0oj32ou4s17l8208b(a)4ax.com>:

>On Thu, 27 May 2010 18:59:58 GMT, Jan Panteltje
><pNaonStpealmtje(a)yahoo.com> wrote:
>
>>Let's see, let's use something simple, right? An H264 encoder.
>>Now just, in some simple words, how would you split that up over 300 cores?
>
>If we talk about video encoding and compression in general, there are
>several cases in which a huge parallel processing power would be useful.
>
>In modern compression systems, there are numerous options for how to encode
>a sequence. It is hard to predict in advance which method will give the
>best result in terms of quality and transfer or storage requirements. With
>sufficient computing power, each encoding option can be executed in
>parallel on the same uncompressed material, and after encoding, the method
>which gives the best result can be selected on a second-by-second basis.
>
>Generating motion vectors requires that an object is detected and also
>detecting where it has moved in the next picture (or where it was in the
>previous picture). The search in all directions around the current location
>can be performed in parallel, and the best match is then used to generate
>the motion vector.
>
>Video sequences consist of several pictures, each of which could be
>processed by a separate group of processors, at least as long as only
>intra-coding is used.

Thank you for the deep insight.
Yes, it does not work for all frame types, hehe.

>For instance the HDTV 1920x1080 picture can be divided into 8100
>macro-blocks of 16x16 pixels each. With only 300 cores, each core would
>have to handle a slice of macro-blocks :-).

Thank you for the deep insight.
Just a quick question:
How do you transfer data between those 300 cores?
I am just eager to see all the theoretical advantages of 300 cores result
in a real product that beats a 300x-clock single core.
It will never happen.
Publish the code! Your chance at fame!
From: Jan Panteltje on 28 May 2010 05:45
On a sunny day (Thu, 27 May 2010 16:37:00 -0500) it happened "Tim Williams"
<tmoranwms(a)charter.net> wrote in <htmom2$f3$1(a)news.eternal-september.org>:

>"Jan Panteltje" <pNaonStpealmtje(a)yahoo.com> wrote in message
>news:htmffu$eoa$1(a)news.albasani.net...
>> Let's see, let's use something simple, right? An H264 encoder.
>> Now just, in some simple words, how would you split that up over 300
>> cores?
>> I think you have no clue what you are on about, honestly, have you ever
>> written a video stream manipulation tool?
>                  ^^^^^^
>                  ^^^^^^
>Well there's your problem, using streams. So not only is your problem an
>early design constraint, it's a fundamental construct of your operating
>system. Pipes and streams, such ludicrosity.

Often video comes in ONE FRAME AT A TIME.
So I also do real-time processing.
In a real broadcast environment, say HD, you have several HD cameras
streaming; encoding is needed, recording is needed.
I give you that you could indeed chop a stream up once it is recorded and
work on sections of that; not a bad idea actually.
But it is not always easy to do; say you render a sequence with Blender?
But for pure transcoding it could work.

As to pipes and streams, it is the best way I know to do multiple operations
on signals.
It is the Unix way; many have criticised it, and it has always won in the
end.
First it was text only, filters via grep or awk or sed or whatever, then it
was audio, then as speed increased it was video.
I was one of the first to use it for video, I think, more than 10 years ago.
The system has proved itself; I wrote the C code with Moore's law in mind,
knowing it would run faster and faster and finally in real time.
(A minimal sketch of such a filter is at the end of this post.)
Now you go and write the 300-core parallel processing stuff, I am waiting.

>Now, if you download the whole damn file and work on it with random access,
>you can split it into 300, 3000, however many pieces you want, as fine as
>the block level, maybe even frame by frame.
>
>It's my understanding that most video formats have frame-to-frame coherence,
>with a total refresh every couple of seconds maybe (hence those awful, awful
>videos where the error builds up and not a damn thing looks right, then
>WHAM, in comes a refresh and everything looks ok again, for a while). So
>the most you could reasonably work with, in that case, is a block. Still,
>if there's a block every 2 seconds, and you assign each block to a separate
>core, a two-hour movie gives you some 3600 blocks to chug on. That movie
>might be ~4GB, which might transfer in a few seconds over the computer's
>bus. More than likely, it would take longer to send to all the cores than
>it would take for all of them to do their computations.
>
>Obviously, there is little point in >1k cores in such a bandwidth-limited
>application which only takes a few minutes on ordinary systems anyway, so
>obviously, you use such systems to solve much more complex problems, like
>quantum mechanics. folding(a)home work units range in size from a few megs
>to hundreds, and they all take about the same processing time (a day or two
>on modern processors). Such activities are clearly not
>storage-bandwidth-limited, and would gain a lot from a multicore approach.
>Which is, after all, how they are implemented. From the ground up.
>
>Tim

You are so clever, I am amazed.
I am just waiting for the programs.
OTOH I ain't buying more than a 6-core, I think.
In fact I am not buying anything in the form of a computer now, but I will
go for the 300 GHz single core.
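To make the 'filter' model above concrete, here is a minimal sketch of such
a tool: raw frames in on stdin, raw frames out on stdout, so it can sit in a
pipeline between a decoder and an encoder. The frame size is hard-coded
purely for illustration; a real tool would take it from the command line or
the stream header:

  #include <stdio.h>

  /* 720x576 YUV 4:2:0, i.e. 1.5 bytes per pixel, chosen only as an example. */
  #define FRAME_BYTES (720 * 576 * 3 / 2)

  int main(void)
  {
      static unsigned char frame[FRAME_BYTES];

      /* Read one frame at a time, process it, pass it on. */
      while (fread(frame, 1, FRAME_BYTES, stdin) == FRAME_BYTES) {
          /* Per-frame processing would go here, e.g. burning in subtitles. */
          if (fwrite(frame, 1, FRAME_BYTES, stdout) != FRAME_BYTES)
              return 1;
      }
      return 0;
  }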
Been fixing up the house lately :-)
Tell you one thing, computahs are much cheaper.

>--
>Deep Friar: a very philosophical monk.
>Website: http://webpages.charter.net/dawill/tmoranwms