From: Stephen Fuld on 25 Jan 2010 16:48

On 1/25/2010 12:56 PM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>>
>> Perhaps I am missing something, but I don't think that, by itself, works.
>> If you have multiple timers, doesn't that require a much smaller
>> granularity timer? i.e. say 10 ns versus 1 us. If you stuck with the 1
>> us granularity, nothing prevents two calls within the same us from the
>> same processor from getting the same value. But if you try to maintain
>> multiple clocks in different chips in sync with each other within 10 ns,
>> you run into other problems which make that hard.
>
> The trick is simply to add a bunch of bits below the least significant
> timer bit, and then use those as a cpu/core ID.
>
> I.e. each time cpu 0 and cpu 1 happen to record exactly the same real
> timestamp, cpu 0 will be considered to have happened before cpu 1, since
> those trailing bits will be ...000 for cpu 0 and ...001 for cpu 1.
>
> With 16 such bits you can handle a 64K cluster and still guarantee that
> all timestamps will be globally unique.

Yes, I understand that. But I am still missing something. If the idea of
this is to guarantee uniqueness across a cluster when using a single
clock register, then the other mechanisms seem to provide that and more.

If the idea is to allow multiple clocks/registers (perhaps one per board)
in order to reduce potential scaling issues, or to reduce the time lag
required to go across the interconnect network, then ISTM that you lose
the guarantee of sequentiality, that is, that a timer call that occurs
before a second one gets a lower number. The numbers will be unique, but
not necessarily in time order.

ISTM that the time lag is pretty small, at least for clusters within the
same room, and that incurring that delay is worth the guarantee of
sequentiality. As for the scaling issues, even with current technology,
given that a single CPU register can easily be accessed multiple times
per ns, I just don't see a scaling issue for any reasonable usage. Does
anyone have a feel for how often the clock/counter needs to be accessed
in any typical/reasonable use?

So, in summary, the static ID seems to me to be a sub-optimal solution
in all situations.

--
- Stephen Fuld (e-mail address disguised to prevent spam)
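To make the trick under discussion concrete: a minimal C sketch of the
composite timestamp, assuming a 64-bit per-core counter and the 16 ID
bits from Terje's example. The names (read_local_counter, make_stamp)
are illustrative, with the POSIX monotonic clock standing in for a real
per-core hardware counter.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ID_BITS 16   /* 16 ID bits -> up to a 64K cluster */

/* Stand-in for the per-core hardware counter (e.g. a TSC-like
 * register); approximated here with the POSIX monotonic clock. */
static uint64_t read_local_counter(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Counter value in the high bits, core ID in the ID_BITS low bits.
 * Two cores that read the identical counter value still produce
 * distinct stamps, and plain numeric order breaks the tie toward the
 * lower-numbered core. (The top ID_BITS of the counter are sacrificed
 * to make room for the ID.) */
static uint64_t make_stamp(uint32_t core_id)
{
    return (read_local_counter() << ID_BITS) |
           (core_id & ((1u << ID_BITS) - 1));
}

int main(void)
{
    printf("cpu 0: %llx\n", (unsigned long long)make_stamp(0));
    printf("cpu 1: %llx\n", (unsigned long long)make_stamp(1));
    return 0;
}

Comparing two such stamps as plain integers then gives exactly the
behaviour described above: equal counter readings order by core ID.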
From: Morten Reistad on 25 Jan 2010 20:15

In article <u5d137-vtc1.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Stephen Fuld wrote:
>>
>> Perhaps I am missing something, but I don't think that, by itself, works.
>> If you have multiple timers, doesn't that require a much smaller
>> granularity timer? i.e. say 10 ns versus 1 us. If you stuck with the 1
>> us granularity, nothing prevents two calls within the same us from the
>> same processor from getting the same value. But if you try to maintain
>> multiple clocks in different chips in sync with each other within 10 ns,
>> you run into other problems which make that hard.
>
>The trick is simply to add a bunch of bits below the least significant
>timer bit, and then use those as a cpu/core ID.
>
>I.e. each time cpu 0 and cpu 1 happen to record exactly the same real
>timestamp, cpu 0 will be considered to have happened before cpu 1, since
>those trailing bits will be ...000 for cpu 0 and ...001 for cpu 1.
>
>With 16 such bits you can handle a 64K cluster and still guarantee that
>all timestamps will be globally unique.

And just distributing a master clock using a serial wire, with some
RLL-like code that ticks out timing plus some info to mark major epochs,
like seconds, should be a reasonably trivial thing to implement. Just
make all the wires have the same delay, and you are set. The wire drives
a counter, a shift register, and two latches. Every tick, the counter
increments and loads the first latch. At a much lower speed, the shift
register loads the absolute value, which is shifted into the latches on
an event pulse, say once a second. The lower bits are the cpu-id.

Voila, a "synchronous" clock on a lot of cpus. From my old theory, such
latches and registers should be able to run at about a fourth of the
fundamental switching speed, which would beat even the fastest
instruction engines by at least a factor of two. So every read would get
a different value.

--
mrr

Yes, the cable will contain data. Lots of data. But it is junior to the
transatlantic cables, which actually contain a few hundred gigabytes,
in transit across the ocean.
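A rough software model of the receiver Morten describes, purely as a
sketch: a counter driven by the wire ticks, a shift register
accumulating the serially transmitted absolute value, and a latch
loaded on the event pulse. All names, field widths, and the assumption
that the epoch value is expressed in tick units are illustrative.

#include <stdint.h>

struct clock_rx {
    uint64_t tick;      /* increments on every wire tick               */
    uint64_t shiftreg;  /* absolute value arriving bit-serially        */
    uint64_t epoch;     /* latched copy of shiftreg, in tick units     */
    uint32_t cpu_id;    /* this receiver's static ID, kept in low bits */
};

void on_tick(struct clock_rx *rx)                  { rx->tick++; }

void on_data_bit(struct clock_rx *rx, unsigned bit)
{
    rx->shiftreg = (rx->shiftreg << 1) | (bit & 1);
}

void on_event_pulse(struct clock_rx *rx)   /* e.g. once a second */
{
    rx->epoch = rx->shiftreg;   /* latch the absolute value          */
    rx->tick  = 0;              /* fine count restarts at each epoch */
}

/* A read combines the latched epoch with the fine tick count and puts
 * the cpu-id in the 16 low bits, so no two receivers ever return the
 * same value. */
uint64_t read_clock(const struct clock_rx *rx)
{
    return ((rx->epoch + rx->tick) << 16) | (rx->cpu_id & 0xffffu);
}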
From: Stefan Monnier on 25 Jan 2010 21:15

> Perhaps I am missing something, but I don't think that, by itself, works.
> If you have multiple timers, doesn't that require a much smaller
> granularity timer? i.e. say 10 ns versus 1 us.

Yes, of course.


        Stefan
From: Terje Mathisen "terje.mathisen at tmsw.no" on 26 Jan 2010 02:36

Stephen Fuld wrote:
> On 1/25/2010 12:56 PM, Terje Mathisen wrote:
>> With 16 such bits you can handle a 64K cluster and still guarantee that
>> all timestamps will be globally unique.
>
> Yes, I understand that. But I am still missing something. If the idea of
> this is to guarantee uniqueness across a cluster when using a single
> clock register, then the other mechanisms seem to provide that and more.
> If the idea is to allow multiple clocks/registers (perhaps one per board)
> in order to reduce potential scaling issues or reduce the time lag
> required to go across the interconnect network, then ISTM that you lose
> the guarantee of sequentiality, that is, that a timer call that occurs
> before a second one gets a lower number. That is, the numbers will be
> unique, but not necessarily in time order.

That is a feature, not a bug!

When we time sufficiently small intervals, i.e. smaller than the minimum
time to get from one node to the nearest neighbor, then there is no way
to globally determine the "real" order, simply because using such a
global timer would have to give less resolution than what each
cpu/core-based counter can do.

Adding the core ID makes each timestamp unique, so the idea is simply to
be able to compare them after the fact, using the numeric order as the
effective time order.

> So, in summary, the static ID seems to me to be a sub-optimal solution
> in all situations.

Except that it carries an order of magnitude less overhead, scales
perfectly, and allows timing resolution down to whatever the local core
can do (~ns). Otherwise it might be sub-optimal. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
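In C terms, the after-the-fact comparison Terje describes is just
numeric order on the composite values. A sketch, assuming the 16-bit ID
field from the earlier example; struct event and the helper names are
illustrative:

#include <stdint.h>
#include <stdlib.h>

struct event { uint64_t stamp; };   /* (counter << 16) | core_id */

static int cmp_event(const void *a, const void *b)
{
    uint64_t sa = ((const struct event *)a)->stamp;
    uint64_t sb = ((const struct event *)b)->stamp;
    return (sa > sb) - (sa < sb);   /* avoids overflow of sa - sb */
}

/* Sort events gathered from many cores into one global order:
 * ties on the counter value fall back to core-ID order for free. */
void order_events(struct event *ev, size_t n)
{
    qsort(ev, n, sizeof ev[0], cmp_event);
}

int main(void)
{
    struct event ev[3] = {
        { (100ull << 16) | 1 },  /* counter 100 on cpu 1              */
        { (100ull << 16) | 0 },  /* counter 100 on cpu 0: same "time" */
        { ( 99ull << 16) | 7 },  /* counter  99 on cpu 7              */
    };
    order_events(ev, 3);         /* result: 99/7, 100/0, 100/1        */
    return 0;
}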
From: Terje Mathisen "terje.mathisen at tmsw.no" on 26 Jan 2010 02:43
Morten Reistad wrote:
> In article <u5d137-vtc1.ln1(a)ntp.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>> Stephen Fuld wrote:
>>>
>>> Perhaps I am missing something, but I don't think that, by itself, works.
>>> If you have multiple timers, doesn't that require a much smaller
>>> granularity timer? i.e. say 10 ns versus 1 us. If you stuck with the 1
>>> us granularity, nothing prevents two calls within the same us from the
>>> same processor from getting the same value. But if you try to maintain
>>> multiple clocks in different chips in sync with each other within 10 ns,
>>> you run into other problems which make that hard.
>>
>> The trick is simply to add a bunch of bits below the least significant
>> timer bit, and then use those as a cpu/core ID.
>>
>> I.e. each time cpu 0 and cpu 1 happen to record exactly the same real
>> timestamp, cpu 0 will be considered to have happened before cpu 1, since
>> those trailing bits will be ...000 for cpu 0 and ...001 for cpu 1.
>>
>> With 16 such bits you can handle a 64K cluster and still guarantee that
>> all timestamps will be globally unique.
>
> And just distributing a master clock using a serial wire, with some
> RLL-like code that ticks out timing plus some info to mark major epochs,
> like seconds, should be a reasonably trivial thing to implement. Just
> make all the wires have the same delay, and you are set. The wire drives
> a counter, a shift register, and two latches. Every tick, the counter
> increments and loads the first latch. At a much lower speed, the shift
> register loads the absolute value, which is shifted into the latches on
> an event pulse, say once a second. The lower bits are the cpu-id.
>
> Voila, a "synchronous" clock on a lot of cpus. From my old theory, such
> latches and registers should be able to run at about a fourth of the
> fundamental switching speed, which would beat even the fastest
> instruction engines by at least a factor of two. So every read would get
> a different value.

Such a global reference, possibly in the form of a PPS (or higher
frequency) signal going to each cpu, is a way to sync up the individual
counters, but it still allows two "simultaneous" reads on two
independent cores/cpus/boards to return the identical value, right?

This means that you still need to add a cpu ID number to make them
globally unique, and at that point you're really back to the same
problem (and solution), although your high-frequency reference signal
allows for much easier synchronization of all the individual timers.

The key point is that as soon as you sync them all to better than the
maximum read speed, there is no way to make two near-simultaneous reads
in two locations and still guarantee that they will be different, so you
need to handle this in some way. Appending a cpu ID is by far the
easiest solution. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
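A sketch of what disciplining each per-core counter to such a shared
PPS edge might look like, with the core ID still appended to the result.
The 1 GHz tick rate, the names, and the step (rather than slew)
correction are all illustrative assumptions, not anything from the
thread.

#include <stdint.h>

#define TICKS_PER_SEC 1000000000ull   /* assumes a 1 GHz local counter */
#define ID_BITS 16

static int64_t  pps_offset;       /* measured error at the last pulse   */
static uint64_t expected_at_pps;  /* counter value the pulse should hit */

/* Called from the once-per-second PPS edge handler: measure how far
 * the free-running local counter has drifted from the shared pulse. */
void on_pps_edge(uint64_t counter_now)
{
    pps_offset = (int64_t)(counter_now - expected_at_pps);
    expected_at_pps += TICKS_PER_SEC;
    /* a real implementation would slew gradually rather than step */
}

/* Disciplined read: drift-corrected counter with the core ID in the
 * low bits -- still needed, since two well-synced cores can read the
 * same corrected value in the same cycle. */
uint64_t read_synced_stamp(uint64_t raw_counter, uint32_t core_id)
{
    return ((raw_counter - (uint64_t)pps_offset) << ID_BITS) |
           (core_id & ((1u << ID_BITS) - 1));
}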