From: Mayan Moudgill on 26 Dec 2009 12:57

nmm1(a)cam.ac.uk wrote:
> Yes. How many have been published in places you can find them, or
> even written up suitable for publication, I don't know. I know that
> mine weren't.

Pity.

> Note that the situation involves more than just the synchronisation
> operations, because a lot of it is about scheduling. If you are
> trying to parallelise code with a 10 microsecond grain, having to do
> ANY interaction with the system scheduler runs the risk of a major
> problem. That is one of the main reasons that almost all HPC codes
> rely on gang scheduling, with all threads running all the time.

Agreed.

BTW: my experience is with systems where we're synchronizing at less than
100-cycle granularity - at that granularity, you're basically programming
against bare metal, with fixed thread mappings, all-or-none thread
scheduling and no "system software" to speak of.
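For concreteness, the sub-100-cycle regime described here is roughly the world of pinned threads spinning on shared flags. Below is a minimal sketch, in C11 atomics, of a sense-reversing spin barrier that never touches the scheduler; NTHREADS, the function name and the calling convention are illustrative assumptions, not anything taken from the thread.

/* Minimal sketch: sense-reversing spin barrier for NTHREADS pinned threads.
 * Pure spinning, no system calls, so it only makes sense when every thread
 * has its own core and is always running (gang / all-or-none scheduling). */
#include <stdatomic.h>
#include <stdbool.h>

#define NTHREADS 4                               /* assumption for the sketch */

static atomic_int  remaining = NTHREADS;
static atomic_bool sense     = false;

/* Each thread calls this at the end of a work phase, passing its own
 * local_sense (initially false); nobody proceeds until all have arrived. */
void barrier_wait(bool *local_sense)
{
    *local_sense = !*local_sense;                /* flip phase */
    if (atomic_fetch_sub(&remaining, 1) == 1) {  /* last arrival */
        atomic_store(&remaining, NTHREADS);
        atomic_store(&sense, *local_sense);      /* release everyone */
    } else {
        while (atomic_load(&sense) != *local_sense)
            ;                                    /* spin: tens of cycles, not microseconds */
    }
}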
From: nmm1 on 26 Dec 2009 13:04

In article <hIGdnRxV8ZyP1qvWnZ2dnUVZ_uSdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
> More heavyweight synchronization operations (such as a lock with suspend
> on the lock if already locked) *can* be more expensive - but the cost
> is due to all the additional function in the operation. It's not clear
> that tweaking the underlying hardware primitives is going to do much for
> this.

It's not clear, I agree, but one problem with existing ones is that they
are usually privileged, which forces a system call. That isn't what you
want, for many reasons.

> BTW: my experience is with systems where we're synchronizing at less
> than 100-cycle granularity - at that granularity, you're basically
> programming against bare metal, with fixed thread mappings, all-or-none
> thread scheduling and no "system software" to speak of.

That's largely because there are no adequate facilities for doing it any
other way :-(

Regards,
Nick Maclaren.
From: Mayan Moudgill on 26 Dec 2009 13:28

nmm1(a)cam.ac.uk wrote:
> In article <hIGdnRxV8ZyP1qvWnZ2dnUVZ_uSdnZ2d(a)bestweb.net>,
> Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
>> More heavyweight synchronization operations (such as a lock with suspend
>> on the lock if already locked) *can* be more expensive - but the cost
>> is due to all the additional function in the operation. It's not clear
>> that tweaking the underlying hardware primitives is going to do much for
>> this.
>
> It's not clear, I agree, but one problem with existing ones is that
> they are usually privileged, which forces a system call. That isn't
> what you want, for many reasons.

Again, that supports my original point - the performance of
synchronization has nothing to do with improving the synchronization
primitives, and everything to do with the rest of the system.

The reason you need that system call, I assume, is to suspend a thread on
a contended lock or to resume suspended threads. You could always use
spin-locks and avoid that system call - but then you run into the issue
of utilization.
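To make the utilization trade-off concrete, here is a minimal sketch of a pure user-mode spin lock in C11 atomics (the type and function names are illustrative assumptions). No system call is needed to acquire or release it, but every waiter occupies a core for as long as the lock is held.

/* Minimal sketch: user-mode spin lock.  No kernel involvement at all,
 * at the price of 100% CPU use by every waiting thread. */
#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock_t;
/* initialize with:  spinlock_t lock = { ATOMIC_FLAG_INIT }; */

void spin_lock(spinlock_t *l)
{
    /* test-and-set loop; a pause/backoff hint could be added in the body */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;
}

void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

This is the fast, syscall-free alternative being alluded to; the cost shows up as wasted cycles whenever the holder is delayed or descheduled.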
From: nmm1 on 26 Dec 2009 13:36

In article <XIidncgyP_VLyKvWnZ2dnUVZ_hKdnZ2d(a)bestweb.net>,
Mayan Moudgill <mayan(a)bestweb.net> wrote:
>
> Again, that supports my original point - the performance of
> synchronization has nothing to do with improving the synchronization
> primitives, and everything to do with the rest of the system.

"Nothing to do with" is too strong - part of the reason that the rest of
a system gets it wrong is that the hardware primitives do. Only a part,
I agree.

> The reason you need that system call, I assume, is to suspend a thread
> on a contended lock or to resume suspended threads. You could always use
> spin-locks and avoid that system call - but then you run into the issue
> of utilization.

It's worse than that :-( Let's say that thread A wants to suspend itself
in favour of thread B, until the latter next suspends itself. If thread A
uses a spin-loop for its wait, thread B may never get to run, so thread A
will wait for ever ....

There are lots of important threading paradigms that are known to be
useful but are close to infeasible to use on modern systems.

Regards,
Nick Maclaren.
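The hand-off scenario above is exactly the case where the waiter has to block rather than spin, which on Linux today means a futex system call. The sketch below is Linux-specific and the helper names are illustrative assumptions; it shows thread A handing the "turn" to thread B and then sleeping in the kernel until B hands it back, so B can run even when the two threads share a core. It also assumes unsigned int is 32 bits, as futex requires.

/* Minimal sketch: blocking hand-off between two threads via futex. */
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_uint turn;   /* 0: A may run, 1: B may run */

static void futex_wait(atomic_uint *addr, unsigned expected)
{
    /* sleeps only if *addr still equals expected, avoiding lost wakeups */
    syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static void futex_wake(atomic_uint *addr)
{
    syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

/* Thread A: hand over to B and block until B hands back. */
void handoff_to_b(void)
{
    atomic_store(&turn, 1);
    futex_wake(&turn);
    while (atomic_load(&turn) != 0)
        futex_wait(&turn, 1);   /* blocks in the kernel, no spinning */
}

The point of the complaint stands: the blocking path requires a trip through the kernel, which is far heavier than the sub-100-cycle primitives discussed earlier.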
From: EricP on 26 Dec 2009 13:38
Mayan Moudgill wrote:
>
> So core 1 writes some data, core 1&2 synchronize, and core 2 reads the
> data. What actually happens post-synchronization?
>
> Well, cache lines get copied from dcache-CPU-1 to dcache-CPU-2. This
> takes time. This time will be proportional to the shared data. The cost
> can actually be higher than in the case of an explicit message passing
> system.
>
> The synchronization, by contrast, can involve the transfer of exactly
> one cache-line [e.g. if you're doing an atomic-increment].
>
> More heavyweight synchronization operations (such as a lock with suspend
> on the lock if already locked) *can* be more expensive - but the cost
> is due to all the additional function in the operation. It's not clear
> that tweaking the underlying hardware primitives is going to do much for
> this.

I believe Mitch is referring to potential new hardware functionality like
AMD's Advanced Synchronization Facility proposal. I can't seem to find
any information on it on the AMD website; the proposal seems to have
degenerated into just a registered trademark notice.

Having the ability to perform a LoadLocked/StoreConditional on up to 4
separate memory locations would eliminate much of the need to escalate to
the heavyweight OS synchronization ops.

Eric
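As a concrete illustration of the single-cache-line case mentioned above, an atomic increment is one RMW on one line, so only that line bounces between cores. The sketch below uses C11 atomics; the names are illustrative assumptions.

/* Minimal sketch: an atomic increment as a single-cache-line sync. */
#include <stdatomic.h>

static atomic_long counter;

long ticket(void)
{
    /* one atomic read-modify-write; exactly one cache line is transferred */
    return atomic_fetch_add(&counter, 1);
}

By contrast, atomically updating two unrelated locations today generally requires taking a lock (or a double-width CAS if the words happen to be adjacent); covering several independent addresses in one atomic unit is precisely what a multi-address LoadLocked/StoreConditional facility of the kind described above would provide.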