From: EricP on 26 Dec 2009 15:39

EricP wrote:
>
> Having the ability to perform a LoadLocked/StoreConditional on
> up to 4 separate memory locations would eliminate much of the
> need to escalate to the heavyweight OS synchronization ops.

This appears to require a method of establishing a global
first-come-first-served ordering for the cpus that is independent of
the physical memory locations involved. In the unlikely event of cache
line ownership contention, the first cpu to begin its multiple update
sequence wins and the other rolls back.

The trick is for it to be a low-cost mechanism (ideally the cost of a
single cache miss to establish the order) that works within the
existing cpu hardware, bus and coherency protocol.

For that I'm thinking that maybe a global device located at some
special physical memory location would establish the global order at
the start of a multiple update sequence. Then using Invalidate bus ops
to a set of other special physical memory locations could communicate
that ordering to other cpus, and they can associate that with the
bus id.

So in this example the overhead cost would be 3 bus ops: a Read of the
global device, an Invalidate to indicate my order to the peers, and an
Invalidate at the end of the sequence.

Eric
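A minimal sketch of the three-bus-op sequence Eric describes. The MMIO
addresses and the update body are made-up placeholders for
illustration; no real chipset defines such an ordering device:

```c
#include <stdint.h>

/* Hypothetical memory-mapped locations -- placeholders only. */
#define GLOBAL_ORDER_DEVICE   ((volatile uint64_t *)0xF0000000u)
#define ORDER_BROADCAST_SLOT  ((volatile uint64_t *)0xF0001000u) /* one per cpu */

/* Sketch of the 3-bus-op overhead for one multi-line LL/SC sequence.
 * Each numbered step would be one bus transaction. */
static void multi_update_sequence(void)
{
    /* 1. Read the global device: a single cache-miss-priced access
     *    that returns a globally ordered, first-come-first-served
     *    ticket for this sequence.                                  */
    uint64_t my_order = *GLOBAL_ORDER_DEVICE;

    /* 2. Write the ticket to this cpu's broadcast slot: the resulting
     *    Invalidate tells the peers which order this cpu holds,
     *    associated with the bus id of the transaction.             */
    *ORDER_BROADCAST_SLOT = my_order;

    /* ... perform the LoadLocked/StoreConditional updates on up to
     *     4 cache lines; on ownership contention the cpu that began
     *     first wins and the other rolls back ...                   */

    /* 3. Clear the slot at the end of the sequence: a final
     *    Invalidate announces that the sequence is complete.        */
    *ORDER_BROADCAST_SLOT = 0;
}
```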
From: Bart Swinnen on 26 Dec 2009 18:52

EricP wrote:
> I believe Mitch is referring to potential new hardware functionality
> like AMD's Advanced Synchronization Facility proposal.
> I can't seem to find any info on it on the AMD website as the proposal
> seems to have degenerated into just a registered trademark notice.

Hurray for Google?
http://developer.amd.com/CPU/ASF/Pages/default.aspx
From: "Andy "Krazy" Glew" on 26 Dec 2009 23:10 nmm1(a)cam.ac.uk wrote: > In article <hIGdnRxV8ZyP1qvWnZ2dnUVZ_uSdnZ2d(a)bestweb.net>, > Mayan Moudgill <mayan(a)bestweb.net> wrote: > >> BTW: my experience is with systems where we're synchronizing on less >> than 100 cycle granulatrity - at that granularity, you're basically >> programming against bare metal, with fixed thread mappings, all-or-none >> thread scheduling and no "system software" to speak of. > > That's largely because there are no adequate facilities for doing it > any other way :-( I'm not so sure. I tend to agree with sonething that is implicit in what Mayan is saying. I have seen two main classes of system: (1) Systems that allow thread and process migration from CPU to CPU. (2) Systems that don't. When you allow migration, if you are cache coherent you need do nothing much. Maybe a fence, but no cache flushes. However, if you use caches but are not cache coherent, then on a process or thread migration you need to flush all of the cache footprint of the thread being migrated. Not necessarily all the way to memory, but at least to the shared cache level. E.g. in a multicore chip, typically the L1 and L2 caches are private, but the LLC is shared. So if you migrate between processors on the same chip, you need to flush to the LLC. If you migrate between chips, further out, typically to memory. With AMD Bulldozer's MCMT, private L1's but shared L2s, sometimes you would need to flush to L2, sometimes to L3, and sometimes all the way to memory. Unfortunately, x86 doesn't have all the varieties of cache flush you would need to do this. Unfortunately, it's chicken and egg: the vast majority of shared memory cache coherent systems don't need this. So it isn't even provided, let alone made fast. Because of this, there seems to be a clustering of usage models: (1) General purpose cache coherent systems that allow thread migration (2) Special purpose systems that are not necessarily cache coherent, which disallow, or at least highly deprecate, thread migration. These latter I call embedded systems. Even if they are in big supercomputers. Unfortunately, high degrees of parallelism require thread migration for load balancing. But thread migration in the absence of cache coherency is really slow, unless somebody creates the appropriate cache flush operations. Which are easy to do, but hard to justify in today's incremental climate.
From: EricP on 26 Dec 2009 23:26

EricP wrote:
>
> I was suggesting in my other message that it looks to me that
> in order to guarantee that _one_ contender makes forward progress,
> one need only establish a global order for aborts in the face of
> contention. That does not mean fairness or eliminate livelock
> as an individual can still wind up being continuously victimized.
> To guarantee fairness there must be some sort of queue to
> fall back upon.

Actually I was being a bit dumb here. I was thinking that each locker
would acquire an 'order ticket' and if there is a collision, the one
with the higher (older) order would abort, acquire a new ticket, and
retry. And that does lead to potential retry starvation.

However, if the loser just aborts but retains its ticket for the next
try, then that is a fair queue. And since the winner knows who it
collided with, it can send a message to the loser letting it know when
it is ok to retry. So the loser can watch for a 'Done' message or have
a timer (just in case) and retry the multiple update sequence.

That doesn't guarantee success on the next try, but it does minimize
latencies due to timer-delay retry loops.

Eric
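A sketch of the ticket-retention retry loop Eric describes. The
collision detection and the 'Done' notification would be hardware /
coherency-protocol events in his scheme, so they are stubbed here with
made-up helper names purely so the sketch compiles:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Global ticket counter standing in for the ordering device. */
static _Atomic uint64_t next_ticket;

/* Hypothetical stand-ins for hardware events, stubbed for illustration. */
static bool try_multi_update(uint64_t my_ticket)
{
    (void)my_ticket;
    return true;              /* pretend: no collision this time */
}
static void wait_for_done_or_timeout(void)
{
    /* watch for the winner's 'Done' message, or a backstop timer */
}

/* Fair retry loop: the ticket is drawn once and retained across
 * aborts, so a loser keeps its place in the queue instead of
 * re-drawing a new (younger) ticket on every retry. */
static void multi_update(void)
{
    uint64_t my_ticket = atomic_fetch_add(&next_ticket, 1);

    while (!try_multi_update(my_ticket)) {
        /* Lost a collision: keep the same ticket, wait for the
         * winner's 'Done' (or the timer), then retry the whole
         * multiple-update sequence. */
        wait_for_done_or_timeout();
    }
}
```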
From: Mayan Moudgill on 27 Dec 2009 01:07
EricP wrote:
> Mayan Moudgill wrote:
>> More heavyweight synchronization operations (such as a lock with
>> suspend on the lock if already locked) *can* be more expensive - but
>> the cost is due to all the additional function in the operation. It's
>> not clear that tweaking the underlying hardware primitives is going
>> to do much for this.
>
> I believe Mitch is referring to potential new hardware functionality
> like AMD's Advanced Synchronization Facility proposal.
> I can't seem to find any info on it on the AMD website as the proposal
> seems to have degenerated into just a registered trademark notice.
>
> Having the ability to perform a LoadLocked/StoreConditional on
> up to 4 separate memory locations would eliminate much of the
> need to escalate to the heavyweight OS synchronization ops.
>
> Eric

Let me ask the following question: assume that we have an N-process
barrier; how do you avoid using N system calls (use any hardware
synchronization you want)?

The fundamental problem is that, in the general case, the first N-1
processes that arrive at the barrier will need to (have the kernel)
put themselves on a waiting queue, and the last process to arrive will
need to (have the kernel) put the N-1 waiting processes back on the
ready queue. I doubt that software transactional memory solves this
problem.
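To make the problem concrete, here is a minimal sketch of such a
barrier over a Linux futex. The struct layout and helper names are my
own, and it assumes the barrier lives in memory shared by all N
processes; the point it illustrates is that, when everyone actually
has to wait, the first N-1 arrivals each make a FUTEX_WAIT system call
and the last arrival makes a FUTEX_WAKE call -- roughly N kernel
entries in the general case:

```c
#include <linux/futex.h>   /* FUTEX_WAIT, FUTEX_WAKE */
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <limits.h>

struct barrier {
    _Atomic int arrived;   /* how many have reached the barrier */
    _Atomic int phase;     /* incremented when the barrier releases */
    int         n;         /* number of participants */
};

static long futex(_Atomic int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void barrier_wait(struct barrier *b)
{
    int my_phase = atomic_load(&b->phase);

    if (atomic_fetch_add(&b->arrived, 1) + 1 == b->n) {
        /* Last arrival: reset, advance the phase, and wake the N-1
         * sleepers with one FUTEX_WAKE system call. */
        atomic_store(&b->arrived, 0);
        atomic_fetch_add(&b->phase, 1);
        futex(&b->phase, FUTEX_WAKE, INT_MAX);
    } else {
        /* Early arrival: sleep in the kernel until the phase changes
         * (one FUTEX_WAIT system call per waiter). */
        while (atomic_load(&b->phase) == my_phase)
            futex(&b->phase, FUTEX_WAIT, my_phase);
    }
}
```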