From: nmm1 on 27 Jul 2010 03:56

In article <hae2i7-e4g.ln1(a)ntp.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Andy Glew wrote:
>> And phrase it in terms of a dancehall configuration - all memory on one
>> side, all processors on the other. I.e. do NOT neglect bandwidth to
>> processor local memory.
>
>Nice way of putting it: All accesses are just as expensive, i.e. very
>expensive.

And it makes reasoning about consistency easier - Lamport used that
model in his seminal paper on sequential consistency.

>> Once you get to a certain level of the FFT, we have lots of nice cache
>> line sized accesses. Unless they are being distributed to different
>> processors. Which is where my scatter/gather interconnect would help.
>>
>> We might want to create 3D tiles.
>
>This sounds a lot more reasonable: 1D FFT is very nice from a cache
>viewpoint, while 2D and 3D are much worse.

Now you have lost me. That is exactly the converse of my understanding.

1-D FFTs are pretty dire from a caching viewpoint, because there is
virtually no data reuse, except in some special cases and with code
designed to use them. Or have you found a new approach?

2- and 3-D FFTs are quite good for caching, if the code is changed
to make use of that. For a weak meaning of 'quite good', true. And,
yes, you need a blocked transpose operation.

I think that a good scatter/gather interconnect would help with both,
but the devil is in the details.

Regards,
Nick Maclaren.
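To make the blocked-transpose point concrete, here is a minimal C sketch of a
tiled out-of-place transpose of the kind a cache-aware 2-D FFT inserts between
its row and column passes. The tile size B is an illustrative choice, not a
tuned value.

#include <complex.h>
#include <stddef.h>

#define B 32   /* illustrative tile size; tune so two BxB tiles fit in cache */

void blocked_transpose(size_t n, const double complex *src,
                       double complex *dst)
{
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t jj = 0; jj < n; jj += B)
            /* Transpose one BxB tile: the rows of src and dst touched
             * here stay cache-resident for the whole tile. */
            for (size_t i = ii; i < ii + B && i < n; i++)
                for (size_t j = jj; j < jj + B && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}

A naive element-by-element transpose takes a cache miss on nearly every write
once n is large; the tiled version amortises each miss over a whole tile row.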
From: Terje Mathisen <"terje.mathisen at tmsw.no"> on 27 Jul 2010 08:34

nmm1(a)cam.ac.uk wrote:
> In article<hae2i7-e4g.ln1(a)ntp.tmsw.no>,
> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>> Andy Glew wrote:
>>> And phrase it in terms of a dancehall configuration - all memory on one
>>> side, all processors on the other. I.e. do NOT neglect bandwidth to
>>> processor local memory.
>>
>> Nice way of putting it: All accesses are just as expensive, i.e. very
>> expensive.
>
> And it makes reasoning about consistency easier - Lamport used that
> model in his seminal paper on sequential consistency.
>
>>> Once you get to a certain level of the FFT, we have lots of nice cache
>>> line sized accesses. Unless they are being distributed to different
>>> processors. Which is where my scatter/gather interconnect would help.
>>>
>>> We might want to create 3D tiles.
>>
>> This sounds a lot more reasonable: 1D FFT is very nice from a cache
>> viewpoint, while 2D and 3D are much worse.
>
> Now you have lost me. That is exactly the converse of my understanding.
>
> 1-D FFTs are pretty dire from a caching viewpoint, because there is
> virtually no data reuse, except in some special cases and with code
> designed to use them. Or have you found a new approach?

Not a new approach afaik, this is just from memory, from the IMDCT I
worked on for an audio codec:

For each of the log2(n) iterations there are just two pointers which
traverse the data array, and they do so sequentially, so caching
behavior should be close to optimal as long as the cache is at least
2-way.

OK, I do see that when the data size is significantly larger than the
outermost cache layer, this becomes a pure streaming vector job, with
no reuse at all between those log2(n) iterations. :-(

For arrays that do fit in some cache layer, sequential access is
optimal, but for significantly larger data sizes you would like to
overlap multiple iterations...

>
> 2- and 3-D FFTs are quite good for caching, if the code is changed
> to make use of that. For a weak meaning of 'quite good', true. And,
> yes, you need a blocked transpose operation.

I'll have to look into this at some point in time.

>
> I think that a good scatter/gather interconnect would help with both,
> but the devil is in the details.

S/G is very useful, IFF it can increase the total available bandwidth,
but as long as DRAMs keep getting more and more dependent upon big
block transfers, it sounds like a hard problem. :-(

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
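A minimal sketch of the access pattern Terje describes, assuming a
constant-geometry formulation: each of the log2(n) passes reads the array
through two sequentially advancing pointers and writes sequentially into a
second buffer. The butterfly here is the trivial twiddle-free one, so this
computes a Walsh-Hadamard transform rather than an FFT, but the per-pass
memory behaviour is the same.

#include <stddef.h>

/* n must be a power of two; returns whichever of buf/tmp holds the result. */
double *wht_constant_geometry(size_t n, double *buf, double *tmp)
{
    for (size_t pass = 1; pass < n; pass <<= 1) {   /* log2(n) passes        */
        const double *lo = buf;                     /* 1st sequential reader */
        const double *hi = buf + n / 2;             /* 2nd sequential reader */
        double *out = tmp;                          /* sequential writer     */
        for (size_t j = 0; j < n / 2; j++) {
            double a = lo[j], b = hi[j];
            *out++ = a + b;                         /* twiddle-free butterfly */
            *out++ = a - b;
        }
        double *t = buf; buf = tmp; tmp = t;        /* ping-pong the buffers */
    }
    return buf;
}

Once the two buffers outgrow the last-level cache, every pass streams the
whole data set from memory - exactly the no-reuse case Terje concedes above.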
From: Kai Harrekilde-Petersen on 27 Jul 2010 16:59

Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
> Benny Amorsen wrote:
>> Andy Glew<"newsgroup at comp-arch.net"> writes:
>>
>>> Network routing (big routers, lots of packets).
>>
>> With IPv4 you can usually get away with routing on the top 24 bits + a
>> bit of special handling of the few local routes on longer than 24-bit.
>> That's a mere 16MB table if you can make do with possible 256
>> "gateways".
>
> Even going to a full /32 table is just 4GB, which is very cheap these
> days. :-)

The problem that I saw when I was designing Ethernet switch/routers 5
years ago wasn't one particular lookup, but the fact that you need to
do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
learning), DIP, SIP, VLAN, ACLs, whatnot).

Each Gbit of Ethernet can generate 1.488M packets per second. The
DRAMs may be cheap enough, but the pins and the power to drive
multiple banks sure ain't cheap.

Remember that you want to do this in the switch/router hardware path
(ie no CPU should touch the packet), and at wire-speed for all ports
at the same time.

Kai

--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
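For reference, the figures quoted in this subthread follow from
back-of-the-envelope arithmetic; the sketch below just spells it out. A
minimum-size Ethernet frame occupies 64 + 8 (preamble/SFD) + 12 (inter-frame
gap) = 84 byte times on the wire, and both table sizes assume one byte of
next-hop id per entry.

#include <stdio.h>

int main(void)
{
    double wire_bits = (64 + 8 + 12) * 8.0;              /* 672 bits/frame */
    printf("1 Gbit/s  -> %.3f Mpps\n", 1e9 / wire_bits / 1e6);  /* ~1.488  */

    /* Benny's 16MB: 2^24 prefixes x one 8-bit id (256 "gateways")         */
    printf("/24 table -> %.0f MB\n", (1u << 24) * 1.0 / (1 << 20));
    /* Terje's 4GB: the same single byte for every /32                     */
    printf("/32 table -> %.0f GB\n", 4294967296.0 / (1u << 30));
    return 0;
}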
From: Morten Reistad on 27 Jul 2010 19:30

In article <m3hbjlifki.fsf(a)ursa.amorsen.dk>,
Benny Amorsen <benny+usenet(a)amorsen.dk> wrote:
>Andy Glew <"newsgroup at comp-arch.net"> writes:
>
>> Network routing (big routers, lots of packets).

Well, for loose coupling, ethernet switching is a better option as long
as you can keep the mac table within the associative memory of the
latest switches. This means around 8000 as an upper limit. You can push
this somewhat by fancy interconnect patterns, but hardly by more than an
order of magnitude.

So the hardware can get you a few tens of thousands of processors
around 50 us away from each other.

>
>With IPv4 you can usually get away with routing on the top 24 bits + a
>bit of special handling of the few local routes on longer than 24-bit.
>That's a mere 16MB table if you can make do with possible 256
>"gateways".

Stop thinking in the limits of ipv4. We have had a more scalable
version for a decade now. It even works in the last three versions of
windows.

>You can simply replicate the whole routing table to local memory on all
>processors; updates are a challenge but it's usually ok if packets go to
>the wrong route for a few ms.

Why? Map the mac addresses into the ip address, and let the switches
do the heavy lifting.

--
mrr
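Morten does not spell out the mapping, but the standard way to embed a MAC
address in an IPv6 address is the modified EUI-64 interface identifier of
RFC 4291, which is presumably what "map the mac addresses into the ip
address" alludes to. A minimal sketch (the example MAC is made up):

#include <stdint.h>
#include <stdio.h>

static void mac_to_eui64(const uint8_t mac[6], uint8_t iid[8])
{
    iid[0] = mac[0] ^ 0x02;   /* flip the universal/local bit  */
    iid[1] = mac[1];
    iid[2] = mac[2];
    iid[3] = 0xFF;            /* fixed 0xFFFE in the middle    */
    iid[4] = 0xFE;
    iid[5] = mac[3];
    iid[6] = mac[4];
    iid[7] = mac[5];
}

int main(void)
{
    uint8_t mac[6] = { 0x00, 0x1B, 0x21, 0x3C, 0x4D, 0x5E };
    uint8_t iid[8];
    mac_to_eui64(mac, iid);

    /* Prepend the fe80::/64 link-local prefix and print the address. */
    unsigned g4 = (iid[0] << 8) | iid[1];
    unsigned g5 = (iid[2] << 8) | iid[3];
    unsigned g6 = (iid[4] << 8) | iid[5];
    unsigned g7 = (iid[6] << 8) | iid[7];
    printf("fe80::%x:%x:%x:%x\n", g4, g5, g6, g7);
    return 0;
}

With the example MAC this prints fe80::21b:21ff:fe3c:4d5e, so a switch that
already knows the MAC effectively knows the host part of the IPv6 address.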
From: Kai Harrekilde-Petersen on 28 Jul 2010 06:11
Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
> Kai Harrekilde-Petersen wrote:
>> Terje Mathisen<"terje.mathisen at tmsw.no"> writes:
>>
>>> Benny Amorsen wrote:
>>>> Andy Glew<"newsgroup at comp-arch.net"> writes:
>>>>
>>>>> Network routing (big routers, lots of packets).
>>>>
>>>> With IPv4 you can usually get away with routing on the top 24 bits + a
>>>> bit of special handling of the few local routes on longer than 24-bit.
>>>> That's a mere 16MB table if you can make do with possible 256
>>>> "gateways".
>>>
>>> Even going to a full /32 table is just 4GB, which is very cheap these
>>> days. :-)
>>
>> The problem that I saw when I was designing Ethernet switch/routers 5
>> years ago wasn't one particular lookup, but the fact that you need to
>> do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
>> learning), DIP, SIP, VLAN, ACLs, whatnot).
>
> I sort of assumed that not all of these would require the same size
> table. What is the total index size, i.e. sum of all the index bits?

I had to go back to some old documentation to remember it. So take the
following as a real-life example, but not absolute gospel.

The first thing you need to do in a router is to identify which logical
port you're on (think link aggregation) and whether you need to do
routing or not (a VID,MAC lookup).

If you're doing routing (L3 forwarding), next you need to do an ingress
VLAN check and then a CIDR longest-prefix lookup for IPv4 unicast + an
ARP lookup, and another lookup for IPv4 multicast (the ARP and
multicast tables can share storage). Oh, and you need to do an egress
VLAN lookup as well. The CIDR lookup has around 55 bits, whereas the
ARP was a straight 13-bit index.

At Layer two, you have a (VID, MAC) table that is around 75-80 bits
wide (12+48-bit [VID,MAC] plus the data you need to store, such as
autolearning status, aging info & destination, CPU copy/move,
mirroring etc), and it has as many entries as you want to throw at it.
8K-16K entries is the norm in the low-end, door-stopper boxes.

I've probably forgotten a couple of minor lookups.

The size of the tables also depends on whether you put real port
bitmaps in all the tables, or you put an Id in there and then have
secondary Id->port bitmap conversion tables later. We did the latter.

>> Each Gbit of Ethernet can generate 1.488M packets per second.
>
> And unless you can promise wire-speed for any N/2->N/2 full duplex
> mesh you should not try to sell the product, right?

Correct. The ability to do this has become cheap enough that it's
become a tickmark requirement (at least at the low port-count end).
For the big modular switches used at the campus/corporate/WAN level,
this is not feasible.

Kai

--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
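To put rough numbers on the 75-80 bit figure, here is an illustrative guess at
such a (VID, MAC) entry; the field names and widths are assumptions for the
sake of the sketch, not Kai's actual layout. It also shows the id->port-bitmap
indirection he says they chose over storing full bitmaps in every table.

#include <stdint.h>

struct l2_entry {                 /* hypothetical layout, 77 bits of payload */
    uint64_t mac       : 48;      /* key: station MAC address                */
    uint64_t vid       : 12;      /* key: VLAN id                            */
    uint64_t dest_id   : 10;      /* index into the id->port-bitmap table    */
    uint64_t age       : 3;       /* aging state for autolearned entries     */
    uint64_t learned   : 1;       /* autolearned vs. static                  */
    uint64_t cpu_copy  : 1;       /* also send a copy to the CPU             */
    uint64_t mirror    : 1;       /* mirror frames hitting this entry        */
    uint64_t valid     : 1;       /* entry in use                            */
};

struct dest_map {                 /* small secondary conversion table        */
    uint64_t port_bitmap;         /* one bit per physical port               */
};

Multiplying an 8K-16K entry table by roughly 80 bits per entry, and then by
the handful of other lookups per packet, shows why the pin and power cost of
the lookup memories dominates long before the raw DRAM price does.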