From: John Larkin on
On 7 Aug 2008 07:47:13 GMT, nmm1(a)cus.cam.ac.uk (Nick Maclaren) wrote:

>
>In article <b0pk941drmfvmlr4osre4evus6dlpu2iq4(a)4ax.com>,
>John Larkin <jjlarkin(a)highNOTlandTHIStechnologyPART.com> writes:
>|> On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
>|> <no(a)spam.invalid> wrote:
>|> >"John Larkin" <jjlarkin(a)highNOTlandTHIStechnologyPART.com> wrote in message
>|> >news:rtrg9458spr43ss941mq9p040b2lp6hbgg(a)4ax.com...
>|> >
>|> >> This has got to affect OS design.
>|> >
>|> >They need to completely rethink their multi-threaded synchronization
>|> >algorithms. I have a feeling that efficient distributed non-blocking
>|> >algorithms, which are comfortable running under a very weak cache coherency
>|> >model will be all the rage. Getting rid of atomic RMW or StoreLoad style
>|> >memory barriers is the first step.
>|>
>|> Run one process per CPU. Run the OS kernel, and nothing else, on one
>|> CPU. Never context switch. Never swap. Never crash.
>
>Been there - done that :-)
>
>That is precisely how the early SMP systems worked, and it works
>for dinky little SMP systems of 4-8 cores. But the kernel becomes
>the bottleneck for many workloads even on those, and it doesn't
>scale to large numbers of cores. So you HAVE to multi-thread the
>kernel.

Why? All it has to do is grant run permissions and look at the big
picture. It certainly wouldn't do I/O or networking or file
management. If memory allocation becomes a burden, it can set up four
(or fourteen) memory-allocation cores and let them do the crunching.
Why multi-thread *anything* when hundreds or thousands of CPUs are
available?

Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.

John


From: Nick Maclaren on

In article <d10m94d7etb6sfcem3hmdl3hk8qnels3kg(a)4ax.com>,
John Larkin <jjlarkin(a)highNOTlandTHIStechnologyPART.com> writes:
|>
|> >|> Run one process per CPU. Run the OS kernel, and nothing else, on one
|> >|> CPU. Never context switch. Never swap. Never crash.
|> >
|> >Been there - done that :-)
|> >
|> >That is precisely how the early SMP systems worked, and it works
|> >for dinky little SMP systems of 4-8 cores. But the kernel becomes
|> >the bottleneck for many workloads even on those, and it doesn't
|> >scale to large numbers of cores. So you HAVE to multi-thread the
|> >kernel.
|>
|> Why? All it has to do is grant run permissions and look at the big
|> picture. It certainly wouldn't do I/O or networking or file
|> management. If memory allocation becomes a burden, it can set up four
|> (or fourteen) memory-allocation cores and let them do the crunching.
|> Why multi-thread *anything* when hundreds or thousands of CPUs are
|> available?

I don't have time to describe 40 years of experience to you, and
it is better written up in books, anyway. Microkernels of the sort
you mention were trendy a decade or two back (look up Mach), but
introduced too many bottlenecks.

In theory, the kernel doesn't have to do I/O or networking, but
have you ever used a system where they were outside it? I have.

The reason that exporting them to multiple CPUs doesn't solve the
scalability problems is that the interaction rate goes up more
than linearly with the number of CPUs. And the same problem
applies to memory management, if you are going to allow shared
memory - or even virtual shared memory, as in PGAS languages.

And so it goes. TANSTAAFL.

|> Using multicore properly will require undoing about 60 years of
|> thinking, 60 years of believing that CPUs are expensive.

Now, THAT is true.


Regards,
Nick Maclaren.
From: Chris M. Thomasson on
"John Larkin" <jjlarkin(a)highNOTlandTHIStechnologyPART.com> wrote in message
news:d10m94d7etb6sfcem3hmdl3hk8qnels3kg(a)4ax.com...
> On 7 Aug 2008 07:47:13 GMT, nmm1(a)cus.cam.ac.uk (Nick Maclaren) wrote:
>
>>
>>In article <b0pk941drmfvmlr4osre4evus6dlpu2iq4(a)4ax.com>,
>>John Larkin <jjlarkin(a)highNOTlandTHIStechnologyPART.com> writes:
>>|> On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
>>|> <no(a)spam.invalid> wrote:
>>|> >"John Larkin" <jjlarkin(a)highNOTlandTHIStechnologyPART.com> wrote in
>>message
>>|> >news:rtrg9458spr43ss941mq9p040b2lp6hbgg(a)4ax.com...
>>|> >
>>|> >> This has got to affect OS design.
>>|> >
>>|> >They need to completely rethink their multi-threaded synchronization
>>|> >algorithms. I have a feeling that efficient distributed non-blocking
>>|> >algorithms, which are comfortable running under a very weak cache
>>coherency
>>|> >model will be all the rage. Getting rid of atomic RMW or StoreLoad
>>style
>>|> >memory barriers is the first step.
>>|>
>>|> Run one process per CPU. Run the OS kernel, and nothing else, on one
>>|> CPU. Never context switch. Never swap. Never crash.
>>
>>Been there - done that :-)
>>
>>That is precisely how the early SMP systems worked, and it works
>>for dinky little SMP systems of 4-8 cores. But the kernel becomes
>>the bottleneck for many workloads even on those, and it doesn't
>>scale to large numbers of cores. So you HAVE to multi-thread the
>>kernel.
>
> Why? All it has to do is grant run permissions and look at the big
> picture. It certainly wouldn't do I/O or networking or file
> management. If memory allocation becomes a burden, it can set up four
> (or fourteen) memory-allocation cores and let them do the crunching.


FWIW, I have a memory allocation algorithm which can scale because it's based
on per-thread/core/node heaps:

http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855

AFAICT, there is absolutely no need for memory-allocation cores. Each thread
can have a private heap such that local allocations do not need any
synchronization. Thread-local deallocations of memory do not need any sync
either; "local" means that Thread A allocates memory M which is subsequently
freed by Thread A. When a thread's memory pool is exhausted, it tries to
allocate from the core-local heap. If that fails, it asks the system, and
perhaps virtual memory comes into play.
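To make the level-1 path concrete, here is a hypothetical C sketch of a
per-thread free list of fixed-size blocks (all names are illustrative, not
taken from the linked thread). Because the list is thread-local, neither
allocation nor local deallocation needs any atomic RMW or membars:

```c
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 64

struct block { struct block *next; };

/* One free list per thread; plain loads/stores only. */
static _Thread_local struct block *tl_free_list = NULL;

/* Level 1: pop a block from the thread-local list, or NULL if
   exhausted (caller then falls through to the core-local level). */
void *tl_try_allocate(void) {
    struct block *b = tl_free_list;
    if (b) {
        tl_free_list = b->next;   /* no sync needed */
        return b;
    }
    return NULL;
}

/* Local free: push back onto the thread-local list, again no sync. */
void tl_free(void *mem) {
    struct block *b = mem;
    b->next = tl_free_list;
    tl_free_list = b;
}

/* Carve a chunk obtained from a lower level into local blocks. */
void tl_refill(void *chunk, size_t nblocks) {
    char *p = chunk;
    for (size_t i = 0; i < nblocks; i++)
        tl_free(p + i * BLOCK_SIZE);
}
```

The free list is LIFO, so a just-freed block is the next one handed out,
which also tends to be cache-warm.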


A scalable high-level memory allocation algorithm for a super-computer
could look something like:
_____________________________________________________________
void* malloc(size_t sz) {
    void* mem;

    /* level 1 - thread local */
    if (!(mem = Per_Thread_Try_Allocate(sz))) {

        /* level 2 - core local */
        if (!(mem = Per_Core_Try_Allocate(sz))) {

            /* level 3 - physical chip local */
            if (!(mem = Per_Chip_Try_Allocate(sz))) {

                /* level 4 - node local */
                if (!(mem = Per_Node_Try_Allocate(sz))) {

                    /* level 5 - system-wide */
                    if (!(mem = System_Try_Allocate(sz))) {

                        /* level 6 - failure */
                        Report_Allocation_Failure(sz);
                        return NULL;
                    }
                }
            }
        }
    }

    return mem;
}
_____________________________________________________________



Level 1 does not need any atomic RMW OR membars at all.

Level 2 does not need membars, but needs atomic RMW.

Level 3 would need membars and atomic RMW.

Level 4 is the same as level 3.

Level 5 is the worst-case scenario; it may need MPI...

Level 6 is total memory exhaustion! Ouch...



All local frees have the same overhead, while all remote frees need atomic
RMW and possibly membars.
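One plausible way to implement a remote free without locks (my assumption,
not necessarily how the linked allocator does it) is a per-owner MPSC free
stack: the remote thread pushes the block with a single CAS, and the owning
thread drains the whole stack with one atomic exchange and returns the blocks
to its private list with no further sync. A sketch in C11 atomics:

```c
#include <stdatomic.h>
#include <stddef.h>

struct rblock { struct rblock *next; };

/* Each owner thread has one stack that remote threads push onto. */
struct remote_stack { _Atomic(struct rblock *) head; };

/* Remote free: one CAS loop, no locks, release ordering so the
   owner sees the block's contents. */
void remote_free(struct remote_stack *rs, void *mem) {
    struct rblock *b = mem;
    struct rblock *old = atomic_load_explicit(&rs->head, memory_order_relaxed);
    do {
        b->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &rs->head, &old, b,
                 memory_order_release, memory_order_relaxed));
}

/* Owner side: take the whole stack in one atomic exchange. */
struct rblock *remote_drain(struct remote_stack *rs) {
    return atomic_exchange_explicit(&rs->head, NULL, memory_order_acquire);
}
```

The owner can call remote_drain() opportunistically, e.g. whenever its local
pool runs dry, so the common path stays entirely synchronization-free.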


This algorithm can scale to very large numbers of cores, chips and nodes.




> Using multicore properly will require undoing about 60 years of
> thinking, 60 years of believing that CPUs are expensive.

The bottleneck is the cache-coherency system. Luckily, there are years of
experience in dealing with weak cache schemes... Think RCU.
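For illustration, the core RCU publish/read pattern can be sketched with C11
atomics (a minimal sketch of the pattern only; the names are mine, and the
hard part, grace-period tracking for reclamation, is elided). Readers take no
locks and issue no stores; the writer's only ordering cost is a release
store:

```c
#include <stdatomic.h>
#include <stddef.h>

struct config { int threshold; };

/* The shared pointer that readers dereference. */
static _Atomic(struct config *) g_config;

/* Reader side: one load-acquire, no atomic RMW, no stores. */
int read_threshold(void) {
    struct config *c = atomic_load_explicit(&g_config, memory_order_acquire);
    return c ? c->threshold : 0;
}

/* Writer side: build the new version privately, then publish it.
   The returned old version may only be freed after a grace period,
   i.e. once no reader can still hold a pointer to it; that machinery
   is omitted here. */
struct config *publish_config(struct config *fresh) {
    return atomic_exchange_explicit(&g_config, fresh, memory_order_release);
}
```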




> Why multi-thread *anything* when hundreds or thousands of CPUs are
> available?

You don't think there is any need for communication between cores on a chip?

From: Chris M. Thomasson on

"Chris M. Thomasson" <no(a)spam.invalid> wrote in message
news:PNDmk.8961$Bt6.3201(a)newsfe04.iad...
> "John Larkin" <jjlarkin(a)highNOTlandTHIStechnologyPART.com> wrote in
> message news:d10m94d7etb6sfcem3hmdl3hk8qnels3kg(a)4ax.com...
[...]
>> Using multicore properly will require undoing about 60 years of
>> thinking, 60 years of believing that CPUs are expensive.
>
> The bottleneck is the cache-coherency system.

I meant to say:

/One/ bottleneck is the cache-coherency system.



> Luckily, there are years of experience in dealing with weak cache
> schemes... Think RCU.


From: Jan Panteltje on
On a sunny day (Thu, 07 Aug 2008 07:08:52 -0700) it happened John Larkin
<jjlarkin(a)highNOTlandTHIStechnologyPART.com> wrote in
<d10m94d7etb6sfcem3hmdl3hk8qnels3kg(a)4ax.com>:

>>Been there - done that :-)
>>
>>That is precisely how the early SMP systems worked, and it works
>>for dinky little SMP systems of 4-8 cores. But the kernel becomes
>>the bottleneck for many workloads even on those, and it doesn't
>>scale to large numbers of cores. So you HAVE to multi-thread the
>>kernel.
>
>Why? All it has to do is grant run permissions and look at the big
>picture. It certainly wouldn't do I/O or networking or file
>management. If memory allocation becomes a burden, it can set up four
>(or fourteen) memory-allocation cores and let them do the crunching.
>Why multi-thread *anything* when hundreds or thousands of CPUs are
>available?
>
>Using multicore properly will require undoing about 60 years of
>thinking, 60 years of believing that CPUs are expensive.
>
>John

Ah, and this all reminds me of when 'object oriented programming' was going
to change everything.
It did lead to such language disasters as C++ (and of course MS went for it),
where the compiler writers at one time did not even know how to implement
things.
Now the next big thing is 'think an object for every core' LOL.
Days of future wasted.
All the little things have to communicate and deliver data at the right time
to the right place.
Sounds a bit like Intel made a bigger version of Cell.
And Cell is a beast to program (for optimum speed).
Maybe it will work for graphics, as things are sort of fixed; I'd like to see
real numbers though.
A couple of PS3s clustered together make a great renderer; there is a demo
on YouTube.



