From: Ken Hagan on 26 Jan 2010 04:57

On Mon, 25 Jan 2010 21:48:20 -0000, Stephen Fuld
<SFuld(a)alumni.cmu.edu.invalid> wrote:

> ISTM that the time lag is pretty small at least for clusters within the
> same room and incurring that delay is worth the guarantee of
> sequentiality. As for the scaling issues, even with current technology,
> given that a single CPU register can be accessed easily multiple times
> per ns, I just don't see a scaling issue for any reasonable usage. Does
> anyone have a feel for how often the clock/counter needs to be accessed
> for any typical/reasonable use?

Depends whether it is a clock or a counter. If you are timing code, the
chances are you will be comparing two values read on the same CPU, so the
issue really doesn't arise.

If you want a counter to assign a unique order to transactions, then I'd
have thought it rather likely that two transactions might be pulled from
the "inbox" in quick succession, dispatched to separate processors, take
roughly the same length of time to get started, and consequently both
request their "order number" at roughly the same time. Then it is just a
matter of practice before the system turns "roughly" into "exactly". This
is one of those "inevitable co-incidences" that make real parallel systems
so much more exciting than time-sliced ones.
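A minimal C sketch of the coincidence Ken describes, assuming POSIX
threads and clock_gettime() are available; the truncation to microseconds
stands in for any shared clock of finite resolution, and all names here
are illustrative only:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    static uint64_t order_number[2];

    /* Each "transaction" asks for its order number by reading a shared
     * clock; with finite resolution, two near-simultaneous reads on
     * different cores can legitimately return the same value. */
    static void *request_order(void *arg)
    {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        order_number[(intptr_t)arg] =
            (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (intptr_t i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, request_order, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("%s\n", order_number[0] == order_number[1]
                       ? "collision: identical order numbers"
                       : "distinct order numbers (this time)");
        return 0;
    }

Whether a collision actually shows up on a given run is a matter of
timing, which is exactly the point: nothing in the timestamp alone rules
it out.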
From: nmm1 on 26 Jan 2010 05:41

In article <op.u64wluxbss38k4(a)khagan.ttx>,
Ken Hagan <K.Hagan(a)thermoteknix.com> wrote:
>On Mon, 25 Jan 2010 21:48:20 -0000, Stephen Fuld
><SFuld(a)alumni.cmu.edu.invalid> wrote:
>
>> ISTM that the time lag is pretty small at least for clusters within the
>> same room and incurring that delay is worth the guarantee of
>> sequentiality. As for the scaling issues, even with current technology,
>> given that a single CPU register can be accessed easily multiple times
>> per ns, I just don't see a scaling issue for any reasonable usage. Does
>> anyone have a feel for how often the clock/counter needs to be accessed
>> for any typical/reasonable use?
>
>Depends whether it is a clock or a counter. If you are timing code, the
>chances are you will be comparing two values read on the same CPU, so the
>issue really doesn't arise.

Once you start to program in parallel, and not merely tack serial code
together with a bit of parallel glue, that ceases to be the case. You
can't do any serious tuning (or even quite a lot of debugging) of
parallel code without knowing whether events in one domain[*] happened
before events in another.

The problem is that the question is easy to pose, but can't be answered
without (a) a more precise definition of what "happened before" means
and (b) accepting that parallel time is not a monotonic scalar.

For example, one of the basic questions to answer when parallel code is
running far slower than expected is whether the problem is the time
taken to communicate data from one domain to another. I had to write a
clock synchroniser to track down one such problem with MPI. When I
located it, I realised that I had completely misunderstood where the
problem was.

[*] The word "domain" means core, thread or other concept, depending on
the program and system details.

Regards,
Nick Maclaren.
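For reference, the usual four-timestamp estimate that NTP-style clock
synchronisers are built on; this is only the textbook formula, not
necessarily what Nick's MPI synchroniser did. t0 and t3 are read on the
local clock, t1 and t2 on the remote one:

    /* Offset and round-trip delay between two clocks, in seconds.
     * A request leaves locally at t0, arrives remotely at t1; the reply
     * leaves remotely at t2 and arrives back locally at t3. */
    struct clock_estimate {
        double offset;   /* estimated (remote clock - local clock) */
        double delay;    /* estimated round-trip network delay     */
    };

    static struct clock_estimate
    estimate_clock(double t0, double t1, double t2, double t3)
    {
        struct clock_estimate e;
        e.offset = ((t1 - t0) + (t2 - t3)) / 2.0;
        e.delay  = (t3 - t0) - (t2 - t1);
        return e;
    }

With per-domain offsets estimated this way, a timestamp taken in one
domain can be compared with one taken in another, but only to within the
delay estimate, which is why "happened before" needs the careful
definition Nick asks for.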
From: Morten Reistad on 26 Jan 2010 05:41

In article <g2j237-the1.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Morten Reistad wrote:
>> In article<u5d137-vtc1.ln1(a)ntp.tmsw.no>,
>> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>>> Stephen Fuld wrote:
>>
>> Voila, "synchronous" clock on a lot of cpus. From my old theory, such
>> latches and registers should be able to run at about a fourth of the
>> fundamental switching speed, which would beat even the fastest
>> instruction engines at least by a factor of two. So every read would
>> get a different value.
>
>Such a global reference, possibly in the form of a PPS (or higher
>frequency) signal going to each cpu is a way to sync up the individual
>counters, it still allows two "simultaneous" reads on two independent
>cores/cpus/boards to return the identical value, right?

Yes, that is possible. I am just trying to point out that it does not
require all that much in hardware design to build slave clocks that run
at speeds faster than what cpus can realistically utilise.

>This means that you still need to add a cpu ID number to make them
>globally unique, and at that point you're really back to the same
>problem (and solution), although your high-frequency reference signal
>allows for much easier synchronization of all the individual timers.

I am usually advocating software solutions, but this question just
screams out for a simple hardware solution.

>The key is that as soon as you can sync them all to better than the
>maximum read speed, there is no way to make two near-simultaneous reads
>in two locations and still be able to guarantee that they will be
>different, so you need to handle this in some way.

That is correct. The raw switching speeds of the transistors in modern
computers should be in or close to single-digit picoseconds. So building
elementary logic like shift registers and clock drivers to operate at
the 100 ps / 0.1 ns / 10 GHz level should be doable without jumping
through too many hoops. That should still be a factor of two faster than
the best cpu speeds, which means the clock has finer resolution than the
instruction decode that has to read it. Adding a cpu id beyond this
precision should not be problematic; the time jitter for instruction
decodes would be orders of magnitude higher.

At these speeds the "clock event horizon" between ticks is around 30
centimeters. So if the distance between processors is bigger than that,
you cannot have causality between events within a single tick; that is
prohibited by relativity. Database ordering kind of breaks down from
there.

>Appending a cpu ID is by far the easiest solution. :-)

Yep. You can define a sequence from a clock synchronised from a single
source and uniquely time-staggered units. This gives us sequence, but it
becomes physically rather meaningless. It does give us unique
transaction IDs, though.

-- mrr
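A sketch of the "append a cpu ID" construction being discussed; the
48/16-bit split and the function name are assumptions for illustration,
not anything specified in the thread:

    #include <stdint.h>

    /* Globally unique, totally ordered transaction ID: the high bits
     * hold a counter synchronised across all cpus, the low bits a cpu
     * ID.  Two cpus that read the same counter value are ordered
     * arbitrarily but deterministically by their IDs. */
    static inline uint64_t make_txid(uint64_t sync_counter, uint16_t cpu_id)
    {
        return (sync_counter << 16) | cpu_id;  /* counter must fit in 48 bits */
    }

Comparing two such IDs as plain integers gives exactly the sequence
Morten calls "physically rather meaningless": it is unique and
consistent, but says nothing about causality between cpus whose reads
fall within the synchronisation error.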
From: Larry on 26 Jan 2010 10:14

On Jan 20, 9:33 am, n...(a)cam.ac.uk wrote:
> In article <b6gj27-5bn....(a)ntp.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
> >n...(a)cam.ac.uk wrote:
> >> In article<n7dj27-n7n....(a)ntp.tmsw.no>,
> >> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
> >>> If you instead use a memory-mapped timer chip register, then you've
> >>> still got the cost of a real bus transaction instead of a couple of
> >>> core-local instructions.
>
> >> Eh? But how are you going to keep a thousand cores synchronised?
> >> You can't do THAT with a couple of core-local instructions!
>
> >You and I have both written NTP-type code, so as I wrote in another
> >message: Separate motherboards should use NTP to stay in sync, with or
> >without hw assists like ethernet timing hw and/or a global PPS source.
>
> Yes, but I was thinking of a motherboard with a thousand cores on it.
> While it could use NTP-like protocols between cores, and for each
> core to maintain its own clock, that's a fairly crazy approach.
>
> All right, realistically, it would be 64 groups of 16 cores, or
> whatever, but the point stands. Having to use TWO separate
> protocols on a single board isn't nice.
>
> Regards,
> Nick Maclaren.

For what it's worth, the SiCortex machines, even at the 5800-core
level, had synchronous and nearly synchronized cycle counters.

The entire system, ultimately, ran off a single 100 MHz or so clock,
with PLLs on each multicore chip to upconvert that to the proper
internal rates. On those cores, the cycle counters started from zero
when reset was released, so they were not synchronized at boot time.
There was a low-level timestamp-the-interconnect scheme that would
then synchronize all the cycle counters to within a few counts, giving
~10 nanosecond synchronization across all 5832 cores.

This was used to create MPI_WTIME and other system-wide timestamps,
and was very handy for large-scale performance tuning, but not useful
for UIDs.

By the way, once your applications get to large scale (over 1000
cores), problems of synchronization and load balancing start to
dominate, and in that regime I suspect variable-speed clocks make the
situation worse. Better to turn off cores to save power than to let
them run at variable speed.

-Larry (ex SiCortex)
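A sketch of how a system-wide timestamp such as the one Larry describes
gets used for that kind of tuning. Only standard MPI calls appear here;
MPI_Wtime() is the portable interface, with a scheme like SiCortex's
sitting underneath it, and the placeholder comment stands in for the
real work being measured:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);      /* rough common starting point */
        double t0 = MPI_Wtime();
        /* ... the computation and communication being measured ... */
        double t1 = MPI_Wtime();

        /* Find the slowest rank; load imbalance shows up as the gap
         * between this and the typical per-rank time. */
        double local = t1 - t0, slowest;
        MPI_Reduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("slowest rank took %.6f s\n", slowest);

        MPI_Finalize();
        return 0;
    }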
From: nmm1 on 27 Jan 2010 06:11
In article
<0b40dbdb-53c0-4c5c-a19b-e68316f3d9c4(a)p17g2000vbl.googlegroups.com>,
Larry <lstewart2(a)gmail.com> wrote:
>
>For what it's worth, the SiCortex machines, even at the 5800 core
>level, had synchronous and nearly synchronized cycle counters.
>
>The entire system, ultimately, ran off a single 100 MHz or so clock,
>with PLLs on each multicore chip to upconvert that to the proper
>internal rates. On those cores, the cycle counters started from zero
>when reset was released, so they were not synchronized at boot time.
>There was a low level timestamp-the-interconnect scheme that would
>then synchronize all the cycle counters within a few counts, giving
>~10 nanosecond synchronization across all 5832 cores.

That's impressive. The demise of SiCortex was very sad :-(

>This was used to create MPI_WTIME and other system wide timestamps,
>and very handy for large scale performance tuning, but not useful for
>UIDs.

Yes.

>By the way, once your applications get to large scale (over 1000
>cores), problems of synchronization and load balancing start to
>dominate, and in that regime, I suspect variable speed clocks make the
>situation worse. Better to turn off cores to save power than to let
>them run at variable speed.

Oh, gosh, YES! The more I think about tuning parallel codes in a
variable clock context, the more I think that I don't want to go there.
And that's independent of whether I have an application or an
implementor hat on.

Regards,
Nick Maclaren.