From: Andy 'Krazy' Glew on 26 Jun 2010 14:51

https://semipublic.comp-arch.net/wiki/Picking_N-th_ready_element

One common operation in schedulers is picking the 1st, or 1st and 2nd, or ... N-th ... ready elements to schedule.

Without loss of generality we will assume that we are picking in physical order. E.g. we assume that an array looks like

0
1
2
...
N-1

from top to bottom, and we are trying to find the highest (lowest numbered) element that has a ready bit set.

If we want to pick according to some permuted, priority order, we might simply send the ready bits through a [[permutation circuit]], perhaps a [[permutation matrix circuit]], select in that order, and then transform back to determine the original locations.

= Picking 1 =

Finding the first (highest priority, lowest index number) ready entry in the array is straightforward: use a find-first-set bit circuit ([[FFS]]). E.g. a ripple carry circuit, possibly transformed via any of a large number of techniques for parallel prefix, such as carry lookahead.

= Picking 2 =

Picking 2 is slightly harder.

We could use two [[FFS]] circuits, cascading one into the other. This adds delay in most design styles, and scales poorly when picking N > 2.

A common trick is to start two FFS circuits at opposite ends of the array, working in opposite directions. Unfortunately, if the array is sorted by priority, this gives you the highest priority element and the lowest priority element, not the first and second highest priority elements. And it does not scale past N=2.

= Picking M =

The following technique has been suggested to pick any of the top M elements:

* for every entry K in the array, calculate Ready_Before_Me[K] = the number of ready entries I<K such that Ready[I] is true.
** This can be calculated using a well-known [[parallel prefix circuit]]. It allows considerable sharing of logic: log(N) delay, N*log(N) hardware.
* at every entry in the array, perform the comparison Ready_Before_Me[K] = 0, 1, 2, ... M-1
** use this to select which of the M output ports an entry should be connected to
** this is basically the well-known [[sum addressed memory decoder]] circuit, which is carry free
** i.e. this whole computation - calculating Ready_Before_Me and selecting - is [[carry propagate free]]

Hardware complexity is N*log(N) + M*N - which is not good if M is large, comparable to N, but good if M is small.
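The Ready_Before_Me technique above can be modeled in software. The following is a sketch, not from the original post; the function name, the bit-list encoding, and the `None` marker for unused ports are illustrative assumptions:

```python
# Model of the "Picking M" technique: each ready entry K computes
# Ready_Before_Me[K], the number of ready entries with index I < K, via a
# prefix sum; entry K drives output port p exactly when it is ready and
# Ready_Before_Me[K] == p, for p in 0 .. M-1.

from itertools import accumulate

def pick_m(ready, m):
    """Return the indices driving each of the m output ports, oldest first."""
    # Exclusive prefix sum of the ready bits: Ready_Before_Me[K].
    ready_before_me = [0] + list(accumulate(ready))[:-1]
    ports = [None] * m
    for k, (r, rbm) in enumerate(zip(ready, ready_before_me)):
        if r and rbm < m:      # compare against 0, 1, ..., m-1
            ports[rbm] = k     # connect entry k to output port rbm
    return ports

# 8-entry array; entries 2, 3, and 6 are ready; pick the first 2.
print(pick_m([0, 0, 1, 1, 0, 0, 1, 0], 2))  # [2, 3]
```

In hardware the prefix sum and the per-entry comparison are both carry-propagate free, as the post notes; the serial loop here is only a software stand-in for N parallel comparators.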
From: MitchAlsup on 26 Jun 2010 21:35

On Jun 26, 1:51 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
> https://semipublic.comp-arch.net/wiki/Picking_N-th_ready_element
>
> One common operation in schedulers is picking the 1st, or 1st and 2nd,
> or ... N-th ... ready elements to schedule.
[snip]
> = Picking 1 =
>
> Finding the first, highest priority, lowest index number ready entry in
> the array is straightforward: use a find-first-set bit circuit ([[FFS]]).
> E.g. a ripple carry circuit, or possibly transformed via any of a large
> number of techniques for parallel prefix, such as carry lookahead.

With proper access to true and complement signals from latching points, one can pick the first element of 4 in one gate delay, 9 in 2 gate delays, and 24 in 3 gate delays. None of this violates 4-input fan-in rules, but it does apply some heavy loads to some inverter fan-outs. I have the schematics if anyone is interested.

For 2 more gate delays, one can make this whole process circular, with a different starting point each cycle (if desired) and a window limit if desired.

> Picking 2 is slightly harder.
>
> We could use two [[FFS]] circuits, cascading one into the other. This
> adds delay in most design styles, and scales poorly beyond picking N > 2.

So, for 7 gate delays (3+1+3) one can pick 2 from at least 24 entries. For 11 gate delays (3+1+3+1+3) one can pick 3 from at least 24 entries. For 14 gates, one can pick 4 from at least 24 entries.

In all of these cases, one is picking the oldest ready element in the vector of candidates. That is, the oldest ready is picked first, the second oldest is picked second, ...

> A common trick is to start two FFS circuits at opposite ends of the
> array, in opposite directions. Unfortunately, if the array is sorted by
> priority, this gives you the highest priority element and the lowest
> priority element. Not the first and second highest priority elements.
> And it does not scale past N=2.

And not exactly what you want to do anyway. This kind of picker can pick 'instructions' in unexpected orders and lead to other instruction order issues.

Pickers are a seriously degenerate form of logic that uses something resembling a carry chain from an integer adder. In the picker sense, the carry chain is a single AND gate, while in the adder sense the carry chain is an Or-And (Manchester) gate per carry stage. The same kinds of lookaheads, selectors, and tree arrangements that are appropriate for adders work with pickers, except that the picker carry chains are a lot simpler.

I strongly suspect, as Andy describes, that when attempting to pick more than 4-ish from (maybe) up to 60-ish elements, multiplication array techniques would be in (logic) order, using circuits that smell a lot like carry save adder cells.

For pickers faster than described above (14 gates for 4 from 24), simply take all of the cones of logic, express them, and send the whole kaboodle through a logic minimizer. Assign a logic engineer and don't let him give up before the minimization is complete, and the minimization can be easily expressed in the logic library one happens to have at hand.

Mitch
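The cascaded picker Mitch describes, which always returns the oldest ready elements in age order, can be modeled as find-first-set followed by masking. This sketch is not from the thread; the function names are illustrative, and the serial loop stands in for the cascaded hardware stages (e.g. the 3+1+3 gate delays for 2-from-24):

```python
# Cascaded-FFS picker: find the first set bit, mask it off, repeat.
# Each loop iteration models one FFS stage of the cascade.

def find_first_set(bits):
    """Index of the first (lowest-index, i.e. oldest) set bit, or None."""
    return next((i for i, b in enumerate(bits) if b), None)

def cascaded_pick(ready, n):
    """Pick up to n ready entries, oldest first."""
    bits = list(ready)
    picked = []
    for _ in range(n):
        i = find_first_set(bits)
        if i is None:          # fewer than n entries were ready
            break
        picked.append(i)
        bits[i] = 0            # mask off the picked entry for the next stage
    return picked

print(cascaded_pick([0, 1, 0, 1, 1, 0], 2))  # [1, 3]
```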
From: Terje Mathisen on 27 Jun 2010 11:25

MitchAlsup wrote:
> On Jun 26, 1:51 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
>> https://semipublic.comp-arch.net/wiki/Picking_N-th_ready_element
>>
>> One common operation in schedulers is picking the 1st, or 1st and
>> 2nd, or ... N-th ... ready elements to schedule.
[snip]
> I strongly suspect, as Andy describes, that when attempting to pick
> more than 4-ish from (maybe) up to 60-ish elements, that
> multiplication array techniques would be in (logic) order; using
> circuits that smell a lot like carry save adder cells.
>
> For pickers faster than described above (14 gates for 4 from 24),
> simply take all of the cones of logic, express them, and send the
> whole kaboodle through a logic minimizer. Assign a logic engineer
> and don't let him give up before the minimization is complete, and
> the minimization can be easily expressed in the logic library one
> happens to have at hand.

For large N and M, would it be appropriate to look at probabilistic methods?

I.e. a single counter which knows how many ready (R) elements there are at a given time; when asking for ready(M) we start at (approximately) element M*N/R.

What I'm asking is: has anyone modeled probabilistic pickers?

BTW, it is well-known that you can pick the Nth percentile out of an unordered array in log(N) time, using a non-recursive half quicksort which always picks the relevant partition to drill further into.

I assume this will always be out of the question though, since it would throw away nearly all the effort expended in partly sorting the array.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
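The partition-and-drill selection Terje describes resembles what is usually called quickselect: partition as in quicksort, but descend only into the side that contains the wanted rank, leaving the rest unsorted. A software sketch, not from the thread (the function name and the Hoare-style partition are illustrative choices):

```python
# Non-recursive selection: repeatedly partition the current window around
# a pivot, then narrow the window to whichever partition contains index k.

def select_kth(a, k):
    """Return the k-th smallest element (0-based) of sequence a."""
    a = list(a)            # work on a copy; the array ends up only partly sorted
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:      # Hoare-style partition around the pivot
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i, j = i + 1, j - 1
        # drill only into the partition that contains index k
        if k <= j:
            hi = j
        elif k >= i:
            lo = i
        else:
            break          # k landed between the partitions: a[k] is final
    return a[k]

print(select_kth([7, 2, 9, 4, 1, 8, 3], 3))  # 4
```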
From: Andy 'Krazy' Glew on 27 Jun 2010 12:40

On 6/27/2010 8:25 AM, Terje Mathisen wrote:
> MitchAlsup wrote:
>> On Jun 26, 1:51 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:
>>> https://semipublic.comp-arch.net/wiki/Picking_N-th_ready_element
>>>
>>> One common operation in schedulers is picking the 1st, or 1st and
>>> 2nd, or ... N-th ... ready elements to schedule.
>
> For large N and M, would it be appropriate to look at probabilistic
> methods?
>
> I.e. a single counter which knows how many ready (R) elements there are
> at a given time, when asking for ready(M) we start at (approximately)
> element M*N/R.
>
> What I'm asking is if anyone has modeled probabilistic pickers?
>
> BTW, it is well-known that you can pick the Nth percentile out of an
> unordered array in log(N) time, using a non-recursive half quick sort
> which always picks the relevant partition to drill further into.
>
> I assume this will always be out of the question though, since it would
> throw away nearly all the effort expended in partly sorting the array.

Terje, I don't really know of any work on probabilistic pickers in hardware. I encourage you to think/talk/work on them some more.

A complication: your first suggestion, starting at element M*N/R, I think makes the implicit assumption that the array is sorted. This is often not true in hardware schedulers. I did in that first post talk about running the array ready bits through a permutation, which allows us to look at the ready bits in order even though the array itself is out of order, but that really only works for small arrays.

The quicksort idea is interesting. I wonder if we get into the usual heapsort vs. quicksort argument, deterministic vs. probabilistic, where deterministic corresponds to current synchronous clocking, and probabilistic corresponds to asynchronous clocking? As I have discussed here, I have long suspected that asynchronous circuits are one road to the future.

Forgetting that, most present OOO schedulers must be pipelineable, delivering back-to-back instructions in successive clocks. There has been little work that I am aware of on multicycle loop schedulers.

Of course I have been thinking about it, e.g. in my multicycle work. Some of the best aspects are figuring out how to make use of a big slow outer scheduler. Key: make the big slow outer scheduler work in units of multiple instructions, instruction groups of 8, 16, or more, where the latency through the execution group is related to, ideally at least, the latency of the scheduler.

Make the outer scheduler approximate: don't schedule based on all inputs, only on one or two or a few. E.g. worst case, a 16-instruction group might have 32 or even 48 inputs, but you might only provide 2 or 3. Don't stop filling an instruction group when the number of inputs it can handle is exceeded: instead just select a subset of the most important inputs, e.g. the last arriving. (Although there may be better choices still.) Don't restrict instruction groups to dependency chains.
From: Andy 'Krazy' Glew on 27 Jun 2010 13:11
On 6/27/2010 9:40 AM, Andy 'Krazy' Glew wrote:
> On 6/27/2010 8:25 AM, Terje Mathisen wrote:
>> For large N and M, would it be appropriate to look at probabilistic
>> methods?
>>
>> I.e. a single counter which knows how many ready (R) elements there are
>> at a given time, when asking for ready(M) we start at (approximately)
>> element M*N/R.
>>
>> What I'm asking is if anyone has modeled probabilistic pickers?
>
> Terje, I don't really know of any work on probabilistic pickers in
> hardware. I encourage you to think/talk/work on them some more.

Tweaking my earlier answer:

I am not aware of anyone building probabilistic pickers for the main scheduling task in an OOO scheduler: picking N ready instructions to execute out of a large array.

I am aware of people building probabilistic arbiters: given M>K ready instructions, using probability to choose which K to execute first.

I am quite a big advocate of using probabilistic methods between threads, since it is the best way I know of to provide nice properties like fair-share scheduling (e.g. each thread getting 25% of the machine, or one getting 60% and the other two getting 20% each).

I am not aware of much work on probabilistic methods within a single thread.

Butler and Patt do evaluate a random scheduler in the paper referenced below. However, I do not attach much credence to that paper, since it failed to reproduce the scheduler anomalies seen in the real world: the eddies that we encountered on the P6 schedulers, and the tornadoes that plagued Willamette. They say "We have found for the scheduling techniques modeled, performance is largely independent of the scheduling discipline."

Since real-world experience has indicated the contrary, I suspect (a) they didn't model the right scheduler techniques, (b) their modelling technology (simulator) was suspect, (c) they weren't looking at a good selection of workloads, and/or (d) their analysis was faulty. Wrt the last, I recall on P6 that we were getting reasonable average performance across workloads; we only noticed the scheduler anomalies when looking into smaller regions, the fixing of which helped overall.

Guri Sohi is/was always insistent on his students modeling random as a policy to compare against. I'm not sure if any serious work went into modeling random seriously as a goal.

---

Butler, M. and Patt, Y. 1992. An investigation of the performance of various dynamic scheduling techniques. In Proceedings of the 25th Annual International Symposium on Microarchitecture (Portland, Oregon, United States, December 01-04, 1992). IEEE Computer Society Press, Los Alamitos, CA, 1-9. DOI= http://doi.acm.org/10.1145/144953.144974
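The probabilistic fair-share arbitration between threads that Andy advocates above can be illustrated with a lottery-style sketch. This is an assumed illustration, not from the thread; the function name, the ticket encoding, and the thread labels are all invented for the example:

```python
# Lottery-style fair-share arbiter between threads: each thread holds a
# share of "tickets", and each issue slot is granted to a thread drawn
# with probability proportional to its share (e.g. a 60/20/20 split).

import random
from collections import Counter

def lottery_arbiter(shares, slots, seed=0):
    """Grant `slots` issue slots among threads with the given ticket shares."""
    rng = random.Random(seed)          # seeded for reproducibility
    threads = list(shares)
    weights = [shares[t] for t in threads]
    grants = rng.choices(threads, weights=weights, k=slots)
    return Counter(grants)

# Three threads splitting the machine 60/20/20 over 10000 slots.
print(lottery_arbiter({"T0": 60, "T1": 20, "T2": 20}, slots=10000))
```

Over many slots each thread's grant count converges to its share, which is the fair-share property; within any short window the outcome is random, which is what distinguishes this from a deterministic round-robin arbiter.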