From: John Larkin on 7 Aug 2008 10:08

On 7 Aug 2008 07:47:13 GMT, nmm1(a)cus.cam.ac.uk (Nick Maclaren) wrote:

>
>In article <b0pk941drmfvmlr4osre4evus6dlpu2iq4(a)4ax.com>,
>John Larkin <jjlarkin(a)highNOTlandTHIStechnologyPART.com> writes:
>|> On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
>|> <no(a)spam.invalid> wrote:
>|> >"John Larkin" <jjlarkin(a)highNOTlandTHIStechnologyPART.com> wrote in message
>|> >news:rtrg9458spr43ss941mq9p040b2lp6hbgg(a)4ax.com...
>|> >
>|> >> This has got to affect OS design.
>|> >
>|> >They need to completely rethink their multi-threaded synchronization
>|> >algorithms. I have a feeling that efficient distributed non-blocking
>|> >algorithms, which are comfortable running under a very weak cache coherency
>|> >model will be all the rage. Getting rid of atomic RMW or StoreLoad style
>|> >memory barriers is the first step.
>|>
>|> Run one process per CPU. Run the OS kernel, and nothing else, on one
>|> CPU. Never context switch. Never swap. Never crash.
>
>Been there - done that :-)
>
>That is precisely how the early SMP systems worked, and it works
>for dinky little SMP systems of 4-8 cores. But the kernel becomes
>the bottleneck for many workloads even on those, and it doesn't
>scale to large numbers of cores. So you HAVE to multi-thread the
>kernel.

Why? All it has to do is grant run permissions and look at the big
picture. It certainly wouldn't do I/O or networking or file
management. If memory allocation becomes a burden, it can set up four
(or fourteen) memory-allocation cores and let them do the crunching.
Why multi-thread *anything* when hundreds or thousands of CPUs are
available?

Using multicore properly will require undoing about 60 years of
thinking, 60 years of believing that CPUs are expensive.

John

From: Nick Maclaren on 7 Aug 2008 10:25

In article <d10m94d7etb6sfcem3hmdl3hk8qnels3kg(a)4ax.com>,
John Larkin <jjlarkin(a)highNOTlandTHIStechnologyPART.com> writes:
|>
|> >|> Run one process per CPU. Run the OS kernel, and nothing else, on one
|> >|> CPU. Never context switch. Never swap. Never crash.
|> >
|> >Been there - done that :-)
|> >
|> >That is precisely how the early SMP systems worked, and it works
|> >for dinky little SMP systems of 4-8 cores. But the kernel becomes
|> >the bottleneck for many workloads even on those, and it doesn't
|> >scale to large numbers of cores. So you HAVE to multi-thread the
|> >kernel.
|>
|> Why? All it has to do is grant run permissions and look at the big
|> picture. It certainly wouldn't do I/O or networking or file
|> management. If memory allocation becomes a burden, it can set up four
|> (or fourteen) memory-allocation cores and let them do the crunching.
|> Why multi-thread *anything* when hundreds or thousands of CPUs are
|> available?

I don't have time to describe 40 years of experience to you, and it
is better written up in books, anyway. Microkernels of the sort you
mention were trendy a decade or two back (look up Mach), but introduced
too many bottlenecks.

In theory, the kernel doesn't have to do I/O or networking, but have
you ever used a system where they were outside it? I have.

The reason that exporting them to multiple CPUs doesn't solve the
scalability problems is that the interaction rate goes up more than
linearly with the number of CPUs. And the same problem applies to
memory management, if you are going to allow shared memory - or
even virtual shared memory, as in PGAS languages.

And so it goes. TANSTAAFL.

|> Using multicore properly will require undoing about 60 years of
|> thinking, 60 years of believing that CPUs are expensive.

Now, THAT is true.

Regards,
Nick Maclaren.

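[Editorial sketch, not part of the original post.] One simple way to see how an interaction rate can grow faster than the CPU count: if any CPU may need to coordinate with any other, the number of distinct pairs is n(n-1)/2. The post does not say this is the model Nick Maclaren had in mind; it is an assumption used here only for illustration, and `cpu_pairs` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct CPU pairs that could need to coordinate,
 * assuming (for illustration only) that any CPU may talk to any other. */
uint64_t cpu_pairs(uint64_t n) {
    return n * (n - 1) / 2;
}
```

Under that assumption, going from 8 CPUs to 1024 multiplies the CPU count by 128 but the number of pairs by roughly 18,700 (28 versus 523,776), which is the kind of growth a single service core would have to absorb.
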
From: Chris M. Thomasson on 7 Aug 2008 10:42

"John Larkin" <jjlarkin(a)highNOTlandTHIStechnologyPART.com> wrote in
message news:d10m94d7etb6sfcem3hmdl3hk8qnels3kg(a)4ax.com...
> On 7 Aug 2008 07:47:13 GMT, nmm1(a)cus.cam.ac.uk (Nick Maclaren) wrote:
>
>>
>>In article <b0pk941drmfvmlr4osre4evus6dlpu2iq4(a)4ax.com>,
>>John Larkin <jjlarkin(a)highNOTlandTHIStechnologyPART.com> writes:
>>|> On Tue, 5 Aug 2008 12:54:14 -0700, "Chris M. Thomasson"
>>|> <no(a)spam.invalid> wrote:
>>|> >"John Larkin" <jjlarkin(a)highNOTlandTHIStechnologyPART.com> wrote in message
>>|> >news:rtrg9458spr43ss941mq9p040b2lp6hbgg(a)4ax.com...
>>|> >
>>|> >> This has got to affect OS design.
>>|> >
>>|> >They need to completely rethink their multi-threaded synchronization
>>|> >algorithms. I have a feeling that efficient distributed non-blocking
>>|> >algorithms, which are comfortable running under a very weak cache coherency
>>|> >model will be all the rage. Getting rid of atomic RMW or StoreLoad style
>>|> >memory barriers is the first step.
>>|>
>>|> Run one process per CPU. Run the OS kernel, and nothing else, on one
>>|> CPU. Never context switch. Never swap. Never crash.
>>
>>Been there - done that :-)
>>
>>That is precisely how the early SMP systems worked, and it works
>>for dinky little SMP systems of 4-8 cores. But the kernel becomes
>>the bottleneck for many workloads even on those, and it doesn't
>>scale to large numbers of cores. So you HAVE to multi-thread the
>>kernel.
>
> Why? All it has to do is grant run permissions and look at the big
> picture. It certainly wouldn't do I/O or networking or file
> management. If memory allocation becomes a burden, it can set up four
> (or fourteen) memory-allocation cores and let them do the crunching.

FWIW, I have a memory allocation algorithm which can scale because it's based
on per-thread/core/node heaps:

http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855

AFAICT, there is absolutely no need for memory-allocation cores. Each thread
can have a private heap such that local allocations do not need any
synchronization. Also, thread-local deallocations of memory do not need any
sync. Local meaning that Thread A allocates memory M which is subsequently
freed by Thread A. When a thread's memory pool is exhausted, it then tries to
allocate from the core-local heap. If that fails, then it asks the system, and
perhaps virtual memory comes into play. A scalable high-level memory
allocation algorithm for a super-computer could look something like:
_____________________________________________________________
void* malloc(size_t sz) {
  void* mem;

  /* level 1 - thread local */
  if (! (mem = Per_Thread_Try_Allocate(sz))) {

    /* level 2 - core local */
    if (! (mem = Per_Core_Try_Allocate(sz))) {

      /* level 3 - physical chip local */
      if (! (mem = Per_Chip_Try_Allocate(sz))) {

        /* level 4 - node local */
        if (! (mem = Per_Node_Try_Allocate(sz))) {

          /* level 5 - system-wide */
          if (! (mem = System_Try_Allocate(sz))) {

            /* level 6 - failure */
            Report_Allocation_Failure(sz);
            return NULL;
          }
        }
      }
    }
  }

  return mem;
}
_____________________________________________________________

Level 1 does not need any atomic RMW or membars at all.

Level 2 does not need membars, but needs atomic RMW.

Level 3 would need membars and atomic RMW.

Level 4 is the same as level 3.

Level 5 is the worst-case scenario, may need MPI...

Level 6 is total memory exhaustion! Ouch...

All local frees have the same overhead, while all remote frees need atomic RMW
and possibly membars. This algorithm can scale to very large numbers of cores,
chips and nodes.

> Using multicore properly will require undoing about 60 years of
> thinking, 60 years of believing that CPUs are expensive.

The bottleneck is the cache-coherency system. Luckily, there are years of
experience in dealing with weak cache schemes... Think RCU.

> Why multi-thread *anything* when hundreds or thousands of CPUs are
> available?

You don't think there is any need for communication between cores on a chip?

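[Editorial sketch, not part of the original post.] The tiered scheme described in the post above can be illustrated with a much smaller C11 example. This is not the algorithm behind the linked comp.arch thread; it is a minimal sketch under stated assumptions: only three tiers (thread-local, one shared arena, then the C library's `malloc`), simple bump allocation, and no freeing. All names (`tiered_alloc`, `local_try_allocate`, `shared_try_allocate`) and the arena sizes are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define LOCAL_ARENA_SIZE  4096u    /* hypothetical sizes, for illustration */
#define SHARED_ARENA_SIZE 65536u
#define ALIGNMENT         16u

/* Tier 1: a private bump arena per thread; plain loads and stores only. */
static _Thread_local _Alignas(16) unsigned char local_arena[LOCAL_ARENA_SIZE];
static _Thread_local size_t local_used;

/* Tier 2: one arena shared by all threads; space is claimed with an atomic RMW. */
static _Alignas(16) unsigned char shared_arena[SHARED_ARENA_SIZE];
static atomic_size_t shared_used;

/* Round a request up to the alignment; wraps to a small value on overflow. */
static size_t round_up(size_t sz) {
    return (sz + (ALIGNMENT - 1)) & ~(size_t)(ALIGNMENT - 1);
}

static void *local_try_allocate(size_t sz) {
    size_t need = round_up(sz);
    if (need < sz || need > LOCAL_ARENA_SIZE - local_used)
        return NULL;                      /* overflow, or arena exhausted */
    void *mem = &local_arena[local_used];
    local_used += need;                   /* no synchronization needed */
    return mem;
}

static void *shared_try_allocate(size_t sz) {
    size_t need = round_up(sz);
    if (need < sz || need > SHARED_ARENA_SIZE)
        return NULL;
    size_t old = atomic_load_explicit(&shared_used, memory_order_relaxed);
    do {
        if (need > SHARED_ARENA_SIZE - old)
            return NULL;                  /* shared arena exhausted */
    } while (!atomic_compare_exchange_weak_explicit(
                 &shared_used, &old, old + need,
                 memory_order_relaxed, memory_order_relaxed));
    return &shared_arena[old];            /* this thread now owns [old, old+need) */
}

/* Try the cheapest tier first and fall back outward, as in the post. */
void *tiered_alloc(size_t sz) {
    void *mem;
    if (sz == 0)
        sz = 1;
    if ((mem = local_try_allocate(sz)) != NULL)
        return mem;
    if ((mem = shared_try_allocate(sz)) != NULL)
        return mem;
    return malloc(sz);                    /* stand-in for the system-wide tier */
}
```

The thread-local tier touches only `_Thread_local` data, which matches the claim that level 1 needs neither atomic RMW nor membars; the shared tier claims its range with a relaxed compare-and-swap, i.e. an atomic RMW without an explicit barrier. A real allocator would also need per-tier free lists and a way to route a remote free back to the owning heap, which this sketch leaves out.
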
From: Chris M. Thomasson on 7 Aug 2008 10:44

"Chris M. Thomasson" <no(a)spam.invalid> wrote in message
news:PNDmk.8961$Bt6.3201(a)newsfe04.iad...
> "John Larkin" <jjlarkin(a)highNOTlandTHIStechnologyPART.com> wrote in
> message news:d10m94d7etb6sfcem3hmdl3hk8qnels3kg(a)4ax.com...
[...]
>> Using multicore properly will require undoing about 60 years of
>> thinking, 60 years of believing that CPUs are expensive.
>
> The bottleneck is the cache-coherency system.

I meant to say:

/One/ bottleneck is the cache-coherency system.

> Luckily, there are years of experience in dealing with weak cache
> schemes... Think RCU.

From: Jan Panteltje on 7 Aug 2008 10:51
On a sunny day (Thu, 07 Aug 2008 07:08:52 -0700) it happened John Larkin
<jjlarkin(a)highNOTlandTHIStechnologyPART.com> wrote in
<d10m94d7etb6sfcem3hmdl3hk8qnels3kg(a)4ax.com>:

>>Been there - done that :-)
>>
>>That is precisely how the early SMP systems worked, and it works
>>for dinky little SMP systems of 4-8 cores. But the kernel becomes
>>the bottleneck for many workloads even on those, and it doesn't
>>scale to large numbers of cores. So you HAVE to multi-thread the
>>kernel.
>
>Why? All it has to do is grant run permissions and look at the big
>picture. It certainly wouldn't do I/O or networking or file
>management. If memory allocation becomes a burden, it can set up four
>(or fourteen) memory-allocation cores and let them do the crunching.
>Why multi-thread *anything* when hundreds or thousands of CPUs are
>available?
>
>Using multicore properly will require undoing about 60 years of
>thinking, 60 years of believing that CPUs are expensive.
>
>John

Ah, and this all reminds me of when 'object oriented programming' was going
to change everything.
It did lead to such language disasters as C++ (and of course MS went for it),
where the compiler writers at one time did not even know how to implement
things.
Now the next big thing is 'think an object for every core' LOL.
Days of future wasted.
All the little things have to communicate and deliver data at the right time
to the right place.
Sounds a bit like Intel made a bigger version of Cell.
And Cell is a beast to program (for optimum speed).
Maybe it will work for graphics, as things are sort of fixed; I'd like to see
real numbers though.
A couple of PS3s together make great rendering, there is a demo on YouTube.