From: Mok-Kong Shen on 4 Jun 2010 04:54

This is a layman's question: If I don't err, multi-precision integer arithmetic is done with algorithms employing one accumulator. Would parallel processing exploiting multiple accumulators be advantageous and, if so, could someone give some references? Thanks.

M. K. Shen
From: Walter Banks on 4 Jun 2010 09:51

Mok-Kong Shen wrote:
>
> This is a layman's question: If I don't err, multi-precision integer
> arithmetic is done with algorithms employing one accumulator. Would
> parallel processing exploiting multiple accumulators be advantageous
> and, if so, could someone give some references?

The question doesn't have a completely clear answer. My experience suggests that with current processors multi-precision integer arithmetic would normally be done on a single ALU. This is not a direct answer to your question. It may, however, use different processor registers for the LS and MS calculations if there is a time advantage in doing so. The main advantage of using multiple registers connected to the same ALU is to prepare the second set of calculations while waiting for the results of the first set.

With current processors multi-precision calculations would stay with a single ALU because it would take too much time to transfer the condition-code information between ALUs.

There are quite a few considerations to be made when deciding whether to create a parallel execution. The first is the fundamental decision, when parallelizing an application, of whether to parallelize the program at the function level or at the instruction level (coarse- or fine-grained parallelism).

For the last decade or so many researchers have been working on fine-grained parallelism, where applications distribute a function for execution on several execution units. This can happen in two places: during software creation, where the code-generation tools create co-operating execution threads on more than one processor, or automatically in some processors during code execution. In the latter case, for example, when a conditional branch is executed both paths are executed on separate processors until the branch is resolved, at which point the results of one of the execution paths are discarded.
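[The serial condition-code dependency described above can be made concrete with a small C sketch, not from the original posts. It adds two integers stored as arrays of 64-bit limbs; the function name and a four-limb width are illustrative choices. Each limb's addition consumes the carry produced by the previous one, which is the chain that keeps the whole operation on one ALU.]

```c
#include <stdint.h>

#define LIMBS 4  /* 4 x 64-bit limbs = one 256-bit integer, least-significant limb first */

/* Multi-precision addition: r = a + b.
   The carry out of limb i is an input to limb i+1, so the limbs
   cannot be added independently -- this is the condition-code
   dependency that serializes the computation on a single ALU. */
void mp_add(uint64_t r[LIMBS], const uint64_t a[LIMBS], const uint64_t b[LIMBS])
{
    unsigned carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint64_t s = a[i] + carry;   /* may wrap to 0 when a[i] == UINT64_MAX */
        carry = (s < carry);         /* detect wrap from adding the carry in */
        r[i] = s + b[i];
        carry += (r[i] < b[i]);      /* detect wrap from adding b[i] */
    }
}
```

[Splitting the loop across two ALUs would require forwarding `carry` between them on every iteration, which is exactly the transfer cost mentioned above.]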
Randy Allen and others have published several papers and books on identifying blocking cases for separating out fine-grained execution.

Coarse-grained execution is essentially breaking an application at the function level and making function calls between processors. At one level we see that happen when a system is implemented using a co-processor programmed to perform some specialized set of functions. There are several ways this can be implemented. In the simplest case it can be done by creating separate application code for each processor. This was how we implemented the four-PDP-11 computer network protocol simulator in the mid 70's. (Yep, it's been around for a long time.) By the mid 80's some personal computers were using separate processors to do detailed work on peripherals, like the processors found in keyboards or mice that handled key rollover, auto-repeat, quadrature accumulation and button presses.

Applications can be split in a similar way to execute in parallel, either through an execution environment like IEC 61131 or IEC 61499 block-structured programming, or through tools that create multiprocessor execution from a single source. Single-source multiprocessor execution can be done automatically with some compiler tools. Byte Craft has developed a number of algorithms and processes to do this in some of our tools. In automotive engine controllers the divisions between processors in a heterogeneous system are well defined, and when compiling for one processor an interprocessor interface was exported as part of the build of a second processor. In many of the consumer applications the compilers used some form of geographical center of reference to automatically or semi-automatically break an application over several processors. Byte Craft has done a lot of coarse-grained parallel development tools work.
Regards,
--
Walter Banks
Byte Craft Limited
http://www.bytecraft.com

--- news://freenews.netfront.net/ - complaints: news(a)netfront.net ---
From: Mok-Kong Shen on 4 Jun 2010 10:41

Walter Banks wrote:
[snip]
> With current processors multi-precision calculations would stay with
> a single ALU because it would take too much time to transfer the
> condition-code information between ALUs.
[snip]

Just an additional question for my clear understanding: for the special case of multiplication of two large integers, does that mean that the cost of the hardware needed for parallel processing outweighs the net processing efficiency gained?

M. K. Shen
From: Walter Banks on 4 Jun 2010 16:55

Mok-Kong Shen wrote:
>
> Just an additional question for my clear understanding: for the special
> case of multiplication of two large integers, does that mean that the
> cost of the hardware needed for parallel processing outweighs the
> net processing efficiency gained?

Yes. However, multiply may be a bad example because there are some very fast multipliers, and partial products don't need to wait for condition codes. It would probably work well in a fine-grained parallel environment. Depending on the application and processor, a pipelined software implementation might also be very efficient.

Regards,
Walter
--
Walter Banks
Byte Craft Limited
http://www.bytecraft.com
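[The point that partial products don't wait for condition codes can be illustrated with a minimal C sketch, not from the original posts; 32-bit limbs and the name `mp_mul2` are illustrative. Each of the four products below depends only on the input limbs, never on a carry from a neighbouring product, so in principle they could all be formed at once; carries appear only in the final summation.]

```c
#include <stdint.h>

/* 2-limb x 2-limb schoolbook multiply with 32-bit limbs, so each
   partial product fits in a uint64_t.  The four products p00..p11
   are mutually independent -- no condition code links them -- and
   only the reduction step below propagates carries. */
void mp_mul2(uint32_t r[4], const uint32_t a[2], const uint32_t b[2])
{
    uint64_t p00 = (uint64_t)a[0] * b[0];   /* independent */
    uint64_t p01 = (uint64_t)a[0] * b[1];   /* independent */
    uint64_t p10 = (uint64_t)a[1] * b[0];   /* independent */
    uint64_t p11 = (uint64_t)a[1] * b[1];   /* independent */

    /* Reduction: sum the partial products column by column. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    r[0] = (uint32_t)p00;
    r[1] = (uint32_t)mid;
    uint64_t hi = (mid >> 32) + (p01 >> 32) + (p10 >> 32) + (uint32_t)p11;
    r[2] = (uint32_t)hi;
    r[3] = (uint32_t)((hi >> 32) + (p11 >> 32));
}
```

[A hardware fast multiplier does essentially this: form all partial products concurrently and feed them into an adder tree.]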
From: Ira Baxter on 10 Jun 2010 11:40
"Mok-Kong Shen" <mok-kong.shen(a)t-online.de> wrote in message news:hub3ak$i34$02$1(a)news.t-online.com...
> Walter Banks wrote:
> [snip]
>> With current processors multi-precision calculations would stay with
>> a single ALU because it would take too much time to transfer the
>> condition-code information between ALUs.
> [snip]
>
> Just an additional question for my clear understanding: for the special
> case of multiplication of two large integers, does that mean that the
> cost of the hardware needed for parallel processing outweighs the
> net processing efficiency gained?
>
> M. K. Shen

For really large multiprecision multiplication, you might be able to partition a single multiplication into computing partial products in (task) parallel and using an adder tree to reduce/sum the partial products, and get some performance.

If I were doing lots of this kind of arithmetic, I'd be tempted instead to find a way to call the multiprecision routines themselves as units of parallel work. They are big enough to make nice units of parallelism. We have a parallel programming language, PARLANSE (www.semanticdesigns.com/Products/PARLANSE), that is designed to handle modest-size units of code in task parallel for lots of combinations (pure parallel, partial order, arbitrary forking). It happens to have a multiprecision library (integers and rationals), and calls on that library *do* get used in parallel in our everyday use of PARLANSE for program analysis purposes. In practice we don't see a lot of instances of this occurring, but it does occur.

-- IDB
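[The partitioning idea above can be sketched in C with POSIX threads; this is not from the original posts, and the row decomposition, names, and four-limb size are illustrative. Each thread computes one independent row of the schoolbook multiplication (one limb of b times all of a); a single serial carry-propagating pass then plays the role of the adder tree. For sizes this small, thread overhead would of course swamp the arithmetic, which is why the post suggests parallelizing at the level of whole multiprecision calls instead.]

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define N 4  /* 32-bit limbs per operand */

struct row_job {
    const uint32_t *a;
    uint32_t bj;          /* the limb of b for this row */
    int shift;            /* row index j: result is shifted 32*j bits */
    uint64_t row[2 * N];  /* this row's partial product, one value per column */
};

/* Each thread computes one row a * b[j].  Rows are mutually
   independent: no carry flag crosses between them. */
static void *row_worker(void *arg)
{
    struct row_job *job = arg;
    uint64_t carry = 0;
    for (int i = 0; i < N; i++) {
        uint64_t p = (uint64_t)job->a[i] * job->bj + carry;
        job->row[job->shift + i] = (uint32_t)p;
        carry = p >> 32;
    }
    job->row[job->shift + N] = carry;
    return NULL;
}

/* r = a * b, with the N rows computed in task parallel and a single
   serial reduction pass summing the rows column by column. */
void mp_mul_parallel(uint32_t r[2 * N], const uint32_t a[N], const uint32_t b[N])
{
    struct row_job jobs[N];
    pthread_t tids[N];
    memset(jobs, 0, sizeof jobs);
    for (int j = 0; j < N; j++) {
        jobs[j].a = a;
        jobs[j].bj = b[j];
        jobs[j].shift = j;
        pthread_create(&tids[j], NULL, row_worker, &jobs[j]);
    }
    for (int j = 0; j < N; j++)
        pthread_join(tids[j], NULL);

    /* Reduction (the "adder tree", done serially here). */
    uint64_t carry = 0;
    for (int i = 0; i < 2 * N; i++) {
        uint64_t s = carry;
        for (int j = 0; j < N; j++)
            s += jobs[j].row[i];
        r[i] = (uint32_t)s;
        carry = s >> 32;
    }
}
```

[The design point is that only the reduction needs carry propagation; everything before it is carry-free and therefore partitionable.]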