From: Terje Mathisen "terje.mathisen at on 5 Jun 2010 01:29 robertwessel2(a)yahoo.com wrote: > On Jun 4, 3:33 am, Terje Mathisen<"terje.mathisen at tmsw.no"> wrote: >> The 16450 rs232 chip could be programmed to delay interrupts until a >> given percentage of the 16-entry FIFO buffer had been consumed, but a >> receive irq handler could still see that the buffer was non-empty by >> polling the status. > > > To be pedantic, that was the 16550A. The 16450 was basically an 8250 > clone, with official support for higher speeds. Thanks, you're right. My memory isn't what it was 25 (or so) years ago. > > Of course programming the 16550 and using the buffer was complicated > by a number of bugs, not least being its propensity of the write FIFO > getting stuck if you put a single byte into it at just the wrong time. Aha! So that was the reason _some_ PCs could fail unless the write buffer was skipped... Terje -- - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"
From: Rob Warnock on 6 Jun 2010 06:43

Rick Jones <rick.jones2(a)hp.com> wrote:
+---------------
| Rob Warnock <rpw3(a)rpw3.org> wrote:
| > 3. "Coalescing": The device driver interrupt service routine shall
| > *continue* to poll for attention status and continue to service
| > the device until the attention status goes false. [In fact, in
| > some systems it's a good idea if it polls the *first* time into
| > the ISR as well, to deal with potential spurious interrupts. But
| > that's another story...]
|
| I trust that is something other than a PIO Read?
+---------------
Sometimes a PIO Read was all you had, and in that case performance
could still suck, even with all the other optimizations. (*sigh*)

But in all the devices I worked on at SGI & later, the attention/status
"poll" was a main memory read of the completion/status circular queue
that the device updated with DMA, so the poll was quite cheap on the
host CPU side.

[Note that there are tricks you can play (and *need* to, for best
performance) to avoid the usual cache line ownership thrashing you
get with DMA that uses naive "producer/consumer" pointers in its
command/status queues. But that's another story...]

+---------------
| > In most typical applications, the optimal dally time will be a
| > small fraction of the holdoff time. And in any case, the peaks
| > for "good" values of both parameters are rather broad.
|
| Sounds a little like deciding how long to spin on a mutex before going
| into the "other guy has it" path :)
+---------------
Just so! ;-}

-Rob

-----
Rob Warnock <rpw3(a)rpw3.org>
627 26th Avenue <URL:http://rpw3.org/>
San Mateo, CA 94403 (650)572-2607
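A sketch of the coalescing pattern Rob describes, with all names
hypothetical: the ISR polls the DMA-updated completion ring on entry
(which also absorbs spurious interrupts for free) and keeps servicing
until the attention status goes false. The per-entry phase/valid flag,
written last by the device, is one well-known way to avoid the
cache-line ownership thrashing of shared producer/consumer pointers:

    #include <stdint.h>

    #define RING_SIZE 256  /* must be a power of two for the mask */

    /* One completion record, DMA'd into host memory by the device.
     * The device writes "valid" last, so seeing the expected value
     * means the rest of the entry is stable.  (A real driver also
     * needs a read barrier between the valid check and the reads.) */
    struct completion {
        uint32_t status;
        uint32_t length;
        volatile uint32_t valid;  /* toggles each pass around the ring */
    };

    static struct completion ring[RING_SIZE];  /* zero-initialized */
    static unsigned head;          /* next entry the host will examine */
    static unsigned expected = 1;  /* "valid" value expected this pass */

    extern void service_completion(struct completion *c);
    extern void ack_interrupt(void);

    void isr(void)
    {
        /* Poll *before* doing anything else: a spurious interrupt
         * simply finds no valid entries and falls straight through. */
        for (;;) {
            struct completion *c = &ring[head & (RING_SIZE - 1)];

            if (c->valid != expected)      /* attention status false */
                break;

            service_completion(c);

            if (++head % RING_SIZE == 0)   /* wrapped: flip phase */
                expected ^= 1;
        }
        /* A production driver would re-check the ring once more after
         * the ack, to close the race with a just-arrived completion. */
        ack_interrupt();
    }

Note that the host only ever reads ring entries it owns and never
writes a pointer the device polls, so the cache lines migrate between
device and CPU exactly once per entry.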
From: Morten Reistad on 6 Jun 2010 10:17

In article <X8ydnWeXrKN85pbRnZ2dnUVZ_j2dnZ2d(a)speakeasy.net>,
Rob Warnock <rpw3(a)rpw3.org> wrote:
>Rick Jones <rick.jones2(a)hp.com> wrote:
>+---------------
>| Rob Warnock <rpw3(a)rpw3.org> wrote:
>| > 3. "Coalescing": The device driver interrupt service routine shall
>| > *continue* to poll for attention status and continue to service
>| > the device until the attention status goes false. [In fact, in
>| > some systems it's a good idea if it polls the *first* time into
>| > the ISR as well, to deal with potential spurious interrupts. But
>| > that's another story...]
>|
>| I trust that is something other than a PIO Read?
>+---------------
>
>Sometimes a PIO Read was all you had, and in that case performance
>could still suck, even with all the other optimizations. (*sigh*)
>
>But in all the devices I worked on at SGI & later, the attention/status
>"poll" was a main memory read of the completion/status circular queue
>that the device updated with DMA, so the poll was quite cheap on the
>host CPU side.
>
>[Note that there are tricks you can play (and *need* to, for best
>performance) to avoid the usual cache line ownership thrashing you
>get with DMA that uses naive "producer/consumer" pointers in its
>command/status queues. But that's another story...]

All of these I/O methods overlay interrupts and DMA and polls onto the
classic von Neumann model. The costs of these overlays go up radically
as we tune the core CPUs for pipeline length, memory bandwidth etc.
When we test modern systems for performance we usually end up
benchmarking how the core system I/O is implemented.

The best news in systems design we have seen since the CPU frequencies
stalled somewhere between 2 and 4 GHz has been the hyperchannel. It is
really effective at what it says it will do.

So why not use that, or a similar mechanism, to do the master I/O on
our systems? Instead of 137 interrupts (really, the last large server
we tested had this many interrupts enabled) we can do with around 5.
We can put the vast bulk of the I/O into the wide, fast, low-latency
pipe, and have one low-level driver demultiplex it to where we want
the data, and ship from where we want to ship stuff.

OK, we still need to schedule, wake other processors up, etc., but we
don't need all those precise interrupts for that. When the
performance-critical I/O is done, we can send signals to the
instruction dispatcher instead of blowing away the pipeline with a
high-priority interrupt. Once timing and high-speed stuff is dealt
with we can easily live with a 10us latency for the rest.

Then make some version of the "south/north bridge" chips that break
the hyperchannel FIFO out into the I/O channels we already know, like
Ethernet, USB, SATA and PCI, and let the stock interfaces take it from
there.

>+---------------
>| > In most typical applications, the optimal dally time will be a
>| > small fraction of the holdoff time. And in any case, the peaks
>| > for "good" values of both parameters are rather broad.
>|
>| Sounds a little like deciding how long to spin on a mutex before going
>| into the "other guy has it" path :)
>+---------------
>
>Just so! ;-}

For the real high-speed I/O, let the hardware do it.

-- 
mrr
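As a rough illustration of the proposal, and nothing more than that
(every name here is hypothetical, since no such interface exists):
one low-level driver drains typed messages from the single fast
channel and fans them out to the stock subsystem handlers, so only
that one driver ever takes the interrupt:

    #include <stdint.h>
    #include <stddef.h>

    /* A typed message as it might arrive over the one fast channel.
     * The layout is illustrative, not any real protocol. */
    enum msg_class { MSG_ETHERNET, MSG_USB, MSG_SATA, MSG_PCI,
                     MSG_CLASSES };

    struct channel_msg {
        uint16_t class_id;   /* which subsystem this belongs to */
        uint16_t flags;
        uint32_t length;
        uint8_t  payload[];
    };

    typedef void (*subsys_handler)(struct channel_msg *);

    /* Stock per-subsystem entry points, registered once at boot. */
    static subsys_handler handlers[MSG_CLASSES];

    extern struct channel_msg *channel_next(void); /* NULL = drained */

    /* The single low-level demultiplexer: runs on the one interrupt
     * (or a timer tick), drains the channel, and fans messages out. */
    void channel_demux(void)
    {
        struct channel_msg *m;

        while ((m = channel_next()) != NULL) {
            if (m->class_id < MSG_CLASSES && handlers[m->class_id])
                handlers[m->class_id](m);
            /* else: count and drop messages of unknown class */
        }
    }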
From: Rick Jones on 7 Jun 2010 14:36

Morten Reistad <first(a)last.name> wrote:
> The best news in systems design we have seen since the CPU
> frequencies stalled somewhere between 2 and 4 GHz has been the
> hyperchannel. It is really effective at what it says it will
> do.

> So why not use that, or a similar mechanism, to do the master
> I/O on our systems? Instead of 137 interrupts (really, the last
> large server we tested had this many interrupts enabled) we
> can do with around 5. We can put the vast bulk of the I/O into
> the wide, fast, low-latency pipe, and have one low-level
> driver demultiplex it to where we want the data, and ship from
> where we want to ship stuff.

Ah, the one channel to feed them all. That reminds me of the HP9000
K-Class systems - when they first shipped in the mid-1990s it was
felt that just the one or two "HSC" (aka GSC+) I/O slots would be
sufficient to feed the beast. There was, I believe, a two-slot I/O
expansion card one could install.

Before the system went off the price-list, HP were shipping a
four-slot HSC expansion module for the thing.

Going back farther, there was the "DTC" (aka Avesta) on the PA-RISC
HP3000s (and used with the HP9000s) - all that ugly slow serial stuff
was put out into the DTC, which was linked with the host via a (by the
standards of the day) blazing-fast 10 Mbit/s Ethernet link. The X.25
functionality was offloaded into that thing too.

Trouble was, there ended up being an entire networking stack just for
talking to the DTC...

rick jones
--
The computing industry isn't as much a game of "Follow The Leader" as
it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
                                                       - Rick Jones
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
From: Morten Reistad on 7 Jun 2010 17:12
In article <huje81$5kq$5(a)usenet01.boi.hp.com>,
Rick Jones <rick.jones2(a)hp.com> wrote:
>Morten Reistad <first(a)last.name> wrote:
>> The best news in systems design we have seen since the CPU
>> frequencies stalled somewhere between 2 and 4 GHz has been the
>> hyperchannel. It is really effective at what it says it will
>> do.
>
>> So why not use that, or a similar mechanism, to do the master
>> I/O on our systems? Instead of 137 interrupts (really, the last
>> large server we tested had this many interrupts enabled) we
>> can do with around 5. We can put the vast bulk of the I/O into
>> the wide, fast, low-latency pipe, and have one low-level
>> driver demultiplex it to where we want the data, and ship from
>> where we want to ship stuff.
>
>Ah, the one channel to feed them all. That reminds me of the HP9000
>K-Class systems - when they first shipped in the mid-1990s it was
>felt that just the one or two "HSC" (aka GSC+) I/O slots would be
>sufficient to feed the beast. There was, I believe, a two-slot I/O
>expansion card one could install.

No, not necessarily one channel. But a few channels, and a channel,
not a bus. With some chip like a south bridge on the other side, to
let hardware handle what it does best: eat larger chunks of data with
no hands-on CPU involvement. If the hardware knows where stuff goes,
there is no need to interrupt the CPU.

>Before the system went off the price-list, HP were shipping a
>four-slot HSC expansion module for the thing.
>
>Going back farther, there was the "DTC" (aka Avesta) on the PA-RISC
>HP3000s (and used with the HP9000s) - all that ugly slow serial stuff
>was put out into the DTC, which was linked with the host via a (by the
>standards of the day) blazing-fast 10 Mbit/s Ethernet link. The X.25
>functionality was offloaded into that thing too.
>
>Trouble was, there ended up being an entire networking stack just for
>talking to the DTC...

The mistake they, and Prime, and DEC, and HP, made was to think
outboarding, and then use slow links. Think FAST link; sufficiently
fast to run the L2->memory interface on. Like hyperchannel, only
better integrated into the hardware. Maybe we can use hyperchannel
without changing the hardware much.

Besides, we use full network stacks today to talk to USB, SCSI, IP,
SATA, PCI, SCI, FireWire, even the sons of PCMCIA. Just use them
through a single-digit number of really fast links and let hardware
demultiplex that. Exactly what we would do with 10G Ethernet, only
with a little lower latency.

-- 
mrr