From: Terje Mathisen <"terje.mathisen at tmsw.no"> on 28 Jul 2010 03:04

Kai Harrekilde-Petersen wrote:
> Terje Mathisen <"terje.mathisen at tmsw.no"> writes:
>
>> Benny Amorsen wrote:
>>> Andy Glew <"newsgroup at comp-arch.net"> writes:
>>>
>>>> Network routing (big routers, lots of packets).
>>>
>>> With IPv4 you can usually get away with routing on the top 24 bits + a
>>> bit of special handling of the few local routes longer than 24 bits.
>>> That's a mere 16MB table if you can make do with 256 possible
>>> "gateways".
>>
>> Even going to a full /32 table is just 4GB, which is very cheap these
>> days. :-)
>
> The problem that I saw when I was designing Ethernet switch/routers 5
> years ago wasn't one particular lookup, but the fact that you need to
> do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
> learning), DIP, SIP, VLAN, ACLs, whatnot).

I sort of assumed that not all of these would require the same size
table. What is the total index size, i.e. the sum of all the index bits?

> Each Gbit of Ethernet can generate 1.488M packets per second.

And unless you can promise wire speed for any N/2->N/2 full-duplex mesh,
you should not try to sell the product, right?

> The DRAMs may be cheap enough, but the pins and the power to drive
> multiple banks sure ain't cheap.
>
> Remember that you want to do this in the switch/router hardware path (ie
> no CPU should touch the packet), and at wire speed for all ports at the
> same time.

Obviously.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
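Both numbers in the exchange above reduce to small arithmetic: the 1.488M packets/second figure follows from the minimum Ethernet frame plus preamble and inter-frame gap, and the "16MB table with 256 gateways" scheme is just one next-hop byte per /24 prefix. A quick sketch (Python, illustrative only; the gateway numbering is hypothetical):

```python
# Minimum-size Ethernet frame on the wire: 64B frame + 8B preamble
# + 12B inter-frame gap = 84 bytes per packet slot.
GBIT = 10**9
WIRE_BYTES = 64 + 8 + 12
pps = GBIT // (WIRE_BYTES * 8)
print(pps)  # 1488095 -- the 1.488M packets/second quoted above

# /24 routing: 2^24 one-byte entries = 16 MB, each entry a gateway
# index in 0..255 (0 here stands for "default gateway").
table = bytearray(1 << 24)

def route(dst_ip: int) -> int:
    """Return the gateway index for a destination IPv4 address,
    indexing the table by the top 24 bits of the address."""
    return table[dst_ip >> 8]
```

Going to a full /32 table as suggested is the same structure with `1 << 32` entries (4GB); the longer-than-/24 local routes are what this flat scheme punts on.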
From: Bakul Shah on 29 Jul 2010 02:36

On 7/28/10 12:04 AM, Terje Mathisen wrote:
> Kai Harrekilde-Petersen wrote:
>> [...]
>> The problem that I saw when I was designing Ethernet switch/routers 5
>> years ago wasn't one particular lookup, but the fact that you need to
>> do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
>> learning), DIP, SIP, VLAN, ACLs, whatnot).
>
> I sort of assumed that not all of these would require the same size
> table. What is the total index size, i.e. the sum of all the index bits?

The sum is too large to allow a single access in a realistic system. You
use whatever tricks you have to in order to achieve wire-speed
switching/IP forwarding while staying within the given constraints.
Nowadays ternary CAMs (TCAMs) are used to do the heavy lifting for
lookups, but for indexing, some sort of tries are also used. In addition
to the things mentioned above, you also have to deal with QoS (based on
some subset of {src, dst} {ether-addr, ip-addr, port}, ether-type,
protocol, VLAN, MPLS tags), policing, shaping, scheduling, counter
updates, etc., and everything requires access to memory or TCAM!
Typically a heavily pipelined network processor or special-purpose ASIC
is used. The nice thing is that packet forwarding is almost completely
parallelizable.

To bring this back to comp.arch: if you have on-chip L1, L2 & L3 caches,
chances are you have to transfer large chunks of data on every access to
external RAM for efficiency reasons. Seems to me you may as well wrap
each such request and response in an Ethernet frame by putting multiple
Ethernet framers on board! Similarly, memory modules should provide an
Ethernet interface. Note that there is already AoE (ATA over Ethernet)
to connect disks via Ethernet. And since external memories are now like
disk (or tape)... :-)
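The trie-based lookups mentioned above implement longest-prefix match: the most specific route that covers the destination wins. A minimal software sketch of the idea (a one-bit-at-a-time binary trie; real forwarding hardware uses compressed multibit tries or TCAMs, and the prefixes and gateway names below are made up for illustration):

```python
# Minimal binary trie for IPv4 longest-prefix match.
class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]  # one child per address bit
        self.next_hop = None          # set if a prefix ends at this node

root = TrieNode()

def insert(prefix, length, next_hop):
    """Install a route for prefix/length, walking bits MSB-first."""
    node = root
    for i in range(length):
        bit = (prefix >> (31 - i)) & 1
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.next_hop = next_hop

def lookup(addr):
    """Walk the trie, remembering the longest matching prefix seen."""
    node, best = root, root.next_hop
    for i in range(32):
        node = node.children[(addr >> (31 - i)) & 1]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best
```

With routes for 10.0.0.0/8 and 10.1.0.0/16 installed, a destination of 10.1.2.3 matches both, and the /16's next hop is returned because it is the longer match, which is exactly the behavior a flat /24 table cannot give you for longer prefixes.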
From: Terje Mathisen <"terje.mathisen at tmsw.no"> on 29 Jul 2010 03:57

Bakul Shah wrote:
> On 7/28/10 12:04 AM, Terje Mathisen wrote:
>> [...]
>> I sort of assumed that not all of these would require the same size
>> table. What is the total index size, i.e. the sum of all the index
>> bits?
>
> The sum is too large to allow a single access in a realistic

Sorry, I was unclear, but the OP did understand what I meant: I was
really asking for the full list of lookups, with individual sizes,
needed to do everything a router has to do these days.

> system. You use whatever tricks you have to in order to achieve
> wire-speed switching/IP forwarding while staying within the given
> constraints. Nowadays ternary CAMs are used to do the heavy lifting
> for lookups, but for indexing, some sort of tries are also used. In
> addition to the things mentioned above, you also have to deal with QoS
> (based on some subset of {src, dst} {ether-addr, ip-addr, port},
> ether-type, protocol, VLAN, MPLS tags), policing, shaping, scheduling,
> counter updates, etc., and everything requires access to memory or
> TCAM! Typically a heavily pipelined network processor or
> special-purpose ASIC is used. The nice thing is that packet forwarding
> is almost completely parallelizable.

Yes, but if you want to take a chance and skip the trailing checksum
test, in order to forward packets as soon as you have the header, then
you would have even more severe timing restrictions, right?

(Skipping/delaying the checksum test would mean depending upon the end
node to detect the error.)

BTW, is anyone doing this? Maybe in order to win benchmarketing tests?

> To bring this back to comp.arch: if you have on-chip L1, L2 & L3
> caches, chances are you have to transfer large chunks of data on every
> access to external RAM for efficiency reasons. Seems to me you may as
> well wrap each such request and response in an Ethernet frame by
> putting multiple Ethernet framers on board! Similarly, memory modules
> should provide an Ethernet interface. Note that there is already AoE
> (ATA over Ethernet) to connect disks via Ethernet. And since external
> memories are now like disk (or tape)... :-)

Well, what do the current cross-CPU protocols look like?

AMD or Intel doesn't seem to matter, there's still a nice little HW
network stack in there.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
From: Bakul Shah on 29 Jul 2010 04:27

On 7/29/10 12:57 AM, Terje Mathisen wrote:
> Bakul Shah wrote:
>> [...]
>> Typically a heavily pipelined network processor or special-purpose
>> ASIC is used. The nice thing is that packet forwarding is almost
>> completely parallelizable.
>
> Yes, but if you want to take a chance and skip the trailing checksum
> test, in order to forward packets as soon as you have the header, then
> you would have even more severe timing restrictions, right?

You still have the same time budget: worst case, you still have to send
out 64-byte packets back to back. Most lookups can be done as soon as
the NPU can get at the header.

> (Skipping/delaying the checksum test would mean depending upon the end
> node to detect the error.)
>
> BTW, is anyone doing this? Maybe in order to win benchmarketing tests?

You can drop a bad-CRC packet at a later point in the pipeline, but
before sending it out.

> Well, what do the current cross-CPU protocols look like?
>
> AMD or Intel doesn't seem to matter, there's still a nice little HW
> network stack in there.

Ethernet frames seem to have become the most common denominator in
networks, so I was speculating that maybe that'd be the cheapest way to
shove lots of data around?
From: Terje Mathisen <"terje.mathisen at tmsw.no"> on 29 Jul 2010 12:04

Bakul Shah wrote:
> On 7/29/10 12:57 AM, Terje Mathisen wrote:
>> Yes, but if you want to take a chance and skip the trailing checksum
>> test, in order to forward packets as soon as you have the header, then
>> you would have even more severe timing restrictions, right?
>
> You still have the same time budget: worst case, you still have to send
> out 64-byte packets back to back. Most lookups can be done as soon as
> the NPU can get at the header.
>
>> (Skipping/delaying the checksum test would mean depending upon the end
>> node to detect the error.)
>>
>> BTW, is anyone doing this? Maybe in order to win benchmarketing tests?
>
> You can drop a bad-CRC packet at a later point in the pipeline, but
> before sending it out.

I meant sending it out _before_ you have received all of it, as soon as
you have the dest address.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
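The "same time budget" point, and the cut-through idea of transmitting as soon as the destination address is in, can be put in numbers. At 1 Gbit/s the worst-case slot for a minimum-size packet is fixed regardless of when the CRC is checked; cut-through only shortens the store-and-forward latency per hop, not the per-slot lookup budget. A back-of-the-envelope check (Python, illustrative only):

```python
LINE_RATE = 10**9              # 1 Gbit/s
MIN_WIRE_BYTES = 64 + 8 + 12   # min frame + preamble + inter-frame gap

# Per-packet slot at wire speed: every pipeline stage must finish its
# lookups within this window, checksum-checked or not.
slot_ns = MIN_WIRE_BYTES * 8 / LINE_RATE * 1e9
print(f"{slot_ns:.0f} ns per minimum-size packet slot")  # 672 ns

# Cut-through forwarding can start transmitting once the 14-byte
# Ethernet header (dest MAC, src MAC, ethertype) has arrived, instead
# of waiting for the whole frame -- at the cost of forwarding frames
# whose trailing CRC later turns out to be bad.
header_ns = 14 * 8 / LINE_RATE * 1e9
print(f"{header_ns:.0f} ns to receive the Ethernet header")  # 112 ns
```

So per hop, cut-through trades roughly a full frame time of latency for a header time, while the 672 ns lookup deadline per port stays exactly where it was.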