From: Andy Glew "newsgroup at comp-arch.net" on 22 Jul 2010 12:42

On 7/21/2010 8:38 AM, nmm1(a)cam.ac.uk wrote:
> Arising out of a course I am writing, I want to find out roughly
> how Intel and AMD handle I/O transfers to and from memory at the
> hardware level. So far, my searching has led nowhere beyond what
> I know, such as:
>
>     The I/O controller uses the HyperTransport or QuickPath link
>     to talk to the memory controller on the CPU that owns the memory.
>
> Well, I assume that, because anything else would be silly.
>
> But:
>
> Do they do those in a cache-coherent fashion, or is that
> independent of the cache?

I/O DMAs can be either cache coherent or non-coherent.

[UC] They can be directed to memory ranges that are never cached - which I suppose is cache coherent, in a manner.

[WB] They can be directed to memory ranges that are cached by the processors. In so doing, the caches are kept coherent - which usually means that the processor caches are snooped, and flushed or invalidated, on a line-by-line basis.

(By the way, one of the issues was: "Is an I/O DMA allowed to completely overwrite an M-state line without first obtaining ownership?" We allowed this on the P6 FSB, but I think that QPI has lost this ability.)

[NC] They can be directed to memory that can be cached, but for which the I/O DMA transactions are not snooped, and so it is not coherent. This is, I believe, HIGHLY DISCOURAGED, although every few years somebody gets the bright idea to do this.

Much of this is configured in what used to be called the chipset, which is now usually integrated. There is a plethora of range registers that describe this.

I was quite surprised by Mitch's post, where I think he said that most I/O DMAs are to UC uncached or NC non-coherent memory. This is not my understanding. (I may have misunderstood Mitch's post.)

It is my understanding that the vast majority of all I/O DMAs, in terms of bytes transferred, are into WB memory and are coherent.
Reason: NC sucks, and if you I/O DMA into UC you need to transfer from UC to WB, or back again. Which means you need a DMA copy engine. Which just puts off the problem.

However, there may just be a terminology mismatch:

(a) although it is my understanding that the majority of I/O traffic is DMAed, in terms of bytes transferred:
  a.1) most disks
  a.2) many network interfaces (Ethernet, etc.), although uncached/NC is more common with NICs

(b) conversely, it may well be that many I/O devices are UC or NC - because most I/O devices are not really important enough to have been tuned.

(c) some of the highest performing I/O devices may use UC/NC, seeking to reduce overhead. E.g. in supercomputers.

By the way, there is a relatively new motivation for NC: it saves power, avoiding having to power on the CPU to snoop its caches.

> Indeed, do they update the cache (as some systems used to) and,
> if so, up to which level?

For the most part, the caches are flushed or invalidated on a line-by-line basis by I/O DMAs.

However, Intel has recently added DCA, Direct Cache Access:

  http://www.intel.com/network/connectivity/vtc_ioat.htm

  Direct Cache Access (DCA) allows a capable I/O device, such as a
  network controller, to place data directly into CPU cache, reducing
  cache misses and improving application response times.

I am only aware of DCA being used by NICs, although it might be used by disk drives. The earliest implementations of DCA were inefficient, and not always a win. They may have improved by now. I am not aware of public docs describing which levels of the cache DCA may insert into.

(By the way, there are several strategies here:
  0 - don't snoop caches
  1 - snoop / invalidate / flush caches
  2 - update lines that already exist in the cache (but don't push new lines in)
  3 - insert lines into the cache
)

> Is there any public documentation on this? I am not looking
> for details, but enough information to be able to write reliable
> notes on advanced tuning.
I have found some of this information

a) in the chipset (not the processor) manuals. I'm reasonably certain I have seen discussions of this in public AMD manuals, as well as Intel's.

b) in the QPI book (Singh, Safranek, et al.)

c) in Open Source manuals

I wonder if the EMON performance counters could determine what fraction of I/O DMA requests fall into which camp.

Heck: I just googled a bit, and found AMD's IOMMU:

  http://support.amd.com/us/Embedded_TechDocs/34434-IOMMU-Rev_1.26_2-11-09.pdf

It has bits like:

  The FC bit in the page translation entry is used to specify if DMA
  transactions that target the page must clear the PCI-defined No Snoop
  bit. The state of this bit is returned to a device with an IOTLB on an
  explicit translation request. If FC=1 for an untranslated access, the
  IOMMU sets the coherent bit in the upstream HyperTransport request
  packet. If FC=0 for an untranslated access, the IOMMU passes upstream
  the coherent attribute from the originating request.

which I think tends to imply that there is some possibility of coherent I/O.
From: MitchAlsup on 22 Jul 2010 12:47

On Jul 22, 11:42 am, Andy Glew <"newsgroup at comp-arch.net"> wrote:
> I was quite surprised by Mitch's post, where I think he said that most
> I/O DMAs are to UC uncached or NC non-coherent memory. This is not my
> understanding. (I may have misunderstood Mitch's post.)

I was only referring to buffers an OS may use to support a DMA device with limited address bits, so the I/O goes to a page in memory where the device can read or write, and then the OS moves the page to its real location.

The UC info may be out-of-date by now.

Mitch
From: nmm1 on 22 Jul 2010 13:00

In article <loCdnfqEY4Vp6dXRnZ2dnUVZ_sCdnZ2d(a)giganews.com>, Andy Glew <"newsgroup at comp-arch.net"> wrote:

Thanks very much. That is very useful, even if not quite the same.

> [NC] They can be directed to memory that can be cached, but for which
> the I/O DMA transactions are not snooped, and so it is not coherent.
> This is, I believe, HIGHLY DISCOURAGED, although every few years
> somebody gets the bright idea to do this.

The ability of people to reinvent three-sided wheels is incredible.

> I/O DMAs are to UC uncached or NC non-coherent memory. This is not my
> understanding. (I may have misunderstood Mitch's post.)

One of us did, because that's not what I understood him to say.

> It is my understanding that the vast majority of all I/O DMAs, in terms
> of bytes transferred, are into WB memory and are coherent. Reason: NC
> sucks, and if you I/O DMA into UC you need to transfer from UC to WB, or
> back again. Which means you need a DMA copy engine. Which just puts
> off the problem.

Yes and no. Consider a Unix-like system (aren't they all, nowadays?) One sane implementation is to read blocks of data from disk into uncached memory, and the read and write calls then copy that (which they have to do anyway). So you have not lost anything.

The same applies when implementing MPI on top of an unhelpful protocol, such as TCP/IP. I don't think that many people do use uncached memory for that, though.

> However, Intel has recently added DCA, Direct Cache Access:
>
>   http://www.intel.com/network/connectivity/vtc_ioat.htm
>
>   Direct Cache Access (DCA) allows a capable I/O device, such as a
>   network controller, to place data directly into CPU cache, reducing
>   cache misses and improving application response times.

Hmm. I wonder how many cards use that.

Regards,
Nick Maclaren.
From: Andy Glew "newsgroup at comp-arch.net" on 23 Jul 2010 12:58

On 7/22/2010 10:00 AM, nmm1(a)cam.ac.uk wrote:
> In article <loCdnfqEY4Vp6dXRnZ2dnUVZ_sCdnZ2d(a)giganews.com>,
> Andy Glew <"newsgroup at comp-arch.net"> wrote:
>> It is my understanding that the vast majority of all I/O DMAs, in terms
>> of bytes transferred, are into WB memory and are coherent. Reason: NC
>> sucks, and if you I/O DMA into UC you need to transfer from UC to WB, or
>> back again. Which means you need a DMA copy engine. Which just puts
>> off the problem.
>
> Yes and no. Consider a Unix-like system (aren't they all, nowadays?)
> One sane implementation is to read blocks of data from disk into
> uncached memory, and the read and write calls then copy that (which
> they have to do anyway). So you have not lost anything.

That might be sane, except:

The copy from the uncached (UC memory type) area that the I/O DMA read from the disk wrote to can be done either

a) by the CPU, or

b) by some sort of programmable copy engine.

Intel's I/O-AT (I/O Acceleration Technology) makes the programmable copy engine a bit more standard. Nuff said.

On current machines, having the CPU do that copy from the UC area to ordinary memory is very, Very, VERY slow. As we have discussed here previously, UC memory is just plain not optimized. There is no distinction between UC memory that is ordinary memory, which could have burst accesses, etc., and UC memory that has side effects - so the worst-case assumptions are made.

The USWC memory type could be used as a target to copy from CPU memory into this I/O staging area. This makes transfers from ordinary memory to the staging area, to be subsequently written to disk, fast. I.e. it makes disk writes fast(er). But it doesn't help disk reads.

There have been recent steps to improve disk reads - or, rather, the reads from an uncacheable area (actually USWC) that might be used. This is mainly in the form of a new instruction, whose mnemonic I can't remember at the moment.
(:-)

I don't really like the implementation of this instruction, but it helps.

In any case, however, DMA'ing between the I/O device and a staging area, and then between the staging area and ordinary memory, repeats operations unnecessarily. Avoiding the double copy by snooping caches usually far outweighs the cost of snooping.
From: nmm1 on 23 Jul 2010 13:23
In article <sMmdnbsyi8_eV9TRnZ2dnUVZ_gWdnZ2d(a)giganews.com>, Andy Glew <"newsgroup at comp-arch.net"> wrote:

> On 7/22/2010 10:00 AM, nmm1(a)cam.ac.uk wrote:
>> In article <loCdnfqEY4Vp6dXRnZ2dnUVZ_sCdnZ2d(a)giganews.com>,
>> Andy Glew <"newsgroup at comp-arch.net"> wrote:
>>
>>> It is my understanding that the vast majority of all I/O DMAs, in terms
>>> of bytes transferred, are into WB memory and are coherent. Reason: NC
>>> sucks, and if you I/O DMA into UC you need to transfer from UC to WB, or
>>> back again. Which means you need a DMA copy engine. Which just puts
>>> off the problem.
>>
>> Yes and no. Consider a Unix-like system (aren't they all, nowadays?)
>> One sane implementation is to read blocks of data from disk into
>> uncached memory, and the read and write calls then copy that (which
>> they have to do anyway). So you have not lost anything.
>
> That might be sane, except:
>
> The copy from the uncached (UC memory type) area that the I/O DMA read
> from the disk wrote to can be done either
>
> a) by the CPU, or
>
> b) by some sort of programmable copy engine.
>
> Intel's I/O-AT (I/O Acceleration Technology) makes the programmable copy
> engine a bit more standard. Nuff said.

Right.

> On current machines, having the CPU do that copy from the UC area to
> ordinary memory is very, Very, VERY slow. As we have discussed here
> previously, UC memory is just plain not optimized. There is no
> distinction between UC memory that is ordinary memory, which could have
> burst accesses, etc., and UC memory that has side effects - so the
> worst-case assumptions are made.

Yuck. All right, given that design, I take your point. I was assuming just plain not cached and accessible to I/O devices.

> In any case, however, DMA'ing between the I/O device and a staging area,
> and then between the staging area and ordinary memory, repeats operations
> unnecessarily. Avoiding the double copy by snooping caches usually far
> outweighs the cost of snooping.
And that's exactly what ISN'T the case, given my assumption! The point is that the software forces a copy between the staging area (which I was assuming could be written into directly by the device) and the cached memory visible to applications.

Given that division of memory properties, what I said is the case; it's an old mainframe approach, after all. However, if that isn't the division used, well, then it isn't the case ....

Regards,
Nick Maclaren.