From: Scott Lurndal on 22 Mar 2010 18:46

David Schwartz <davids(a)webmaster.com> writes:
>On Mar 22, 1:11 pm, Chris Friesen <cbf...(a)mail.usask.ca> wrote:
>
>> I was under the impression that the hardware prefetcher was independent
>> of threads of execution, in which case this wouldn't make any
>> difference. Are you aware of CPUs which tie the prefetcher to execution
>> context?
>
>The prefetcher is a per-core construct and only sees the flow of
>instructions on that particular core. Two cores means two prefetchers,
>each seeing half of the operations.

Not necessarily; on the current gen Opteron the prefetcher is part of the
L3/Northbridge on each socket; since the L3 is shared, the NB can
adaptively prefetch for all active cores on the socket. The AMD
Family 10h BKDG goes into greater detail on the prefetcher and how
it is configured.

There are applications for which the prefetcher is completely unsuitable
(e.g. social graph analysis, extremely large datasets, pointer chasing
applications), and there are system configurations for which the
prefetcher is invaluable or pathetic (ccNUMA with long latencies to some
remote memory).

scott
From: David Schwartz on 22 Mar 2010 19:03

On Mar 22, 3:46 pm, sc...(a)slp53.sl.home (Scott Lurndal) wrote:

> Not necessarily; on the current gen Opteron the prefetcher is part of the
> L3/Northbridge on each socket; since the L3 is shared, the NB can
> adaptively prefetch for all active cores on the socket. The AMD
> Family 10h BKDG goes into greater detail on the prefetcher and how
> it is configured.

I think we're talking about two different prefetchers, though I'm not
100% sure -- I'm not that familiar with the internals of modern AMD
CPUs. What I mean by "prefetcher" is the mechanism that sees an
upcoming memory read in the instruction stream and attempts to get
that data before the CPU actually has to wait for the contents of the
memory to be read from the cache hierarchy. It has to be a per-core
construct because it's looking at the instruction stream for that core
at various stages in the pipeline.

DS
From: Chris Friesen on 23 Mar 2010 10:23

On 03/22/2010 05:03 PM, David Schwartz wrote:
> On Mar 22, 3:46 pm, sc...(a)slp53.sl.home (Scott Lurndal) wrote:
>
>> Not necessarily; on the current gen Opteron the prefetcher is part of the
>> L3/Northbridge on each socket; since the L3 is shared, the NB can
>> adaptively prefetch for all active cores on the socket. The AMD
>> Family 10h BKDG goes into greater detail on the prefetcher and how
>> it is configured.
>
> I think we're talking about two different prefetchers, though I'm not
> 100% sure -- I'm not that familiar with the internals of modern AMD
> CPUs. What I mean by "prefetcher" is the mechanism that sees an
> upcoming memory read in the instruction stream and attempts to get
> that data before the CPU actually has to wait for the contents of the
> memory to be read from the cache hierarchy. It has to be a per-core
> construct because it's looking at the instruction stream for that core
> at various stages in the pipeline.

I'm not a hardware guy, but I think what you're referring to is
generally called a speculative read. It requires access to the
instruction stream and thus must exist on every core.

The hardware prefetcher that Scott is referring to monitors the actual
requested memory accesses and tries to look for patterns. So if I'm in
a tight loop and indirectly access address X, X+8, and X+16 the
prefetcher is going to preload X+24, X+32, X+40... for me.

Chris
From: David Schwartz on 23 Mar 2010 10:34

On Mar 23, 7:23 am, Chris Friesen <cbf...(a)mail.usask.ca> wrote:

> I'm not a hardware guy, but I think what you're referring to is
> generally called a speculative read. It requires access to the
> instruction stream and thus must exist on every core.

Yes.

> The hardware prefetcher that Scott is referring to monitors the actual
> requested memory accesses and tries to look for patterns. So if I'm in
> a tight loop and indirectly access address X, X+8, and X+16 the
> prefetcher is going to preload X+24, X+32, X+40... for me.

That's interesting. I didn't know there was such a mechanism.

DS
From: Scott Lurndal on 23 Mar 2010 13:18

Chris Friesen <cbf123(a)mail.usask.ca> writes:
>On 03/22/2010 05:03 PM, David Schwartz wrote:
>> On Mar 22, 3:46 pm, sc...(a)slp53.sl.home (Scott Lurndal) wrote:
>>
>>> Not necessarily; on the current gen Opteron the prefetcher is part of the
>>> L3/Northbridge on each socket; since the L3 is shared, the NB can
>>> adaptively prefetch for all active cores on the socket. The AMD
>>> Family 10h BKDG goes into greater detail on the prefetcher and how
>>> it is configured.
>>
>> I think we're talking about two different prefetchers, though I'm not
>> 100% sure -- I'm not that familiar with the internals of modern AMD
>> CPUs. What I mean by "prefetcher" is the mechanism that sees an
>> upcoming memory read in the instruction stream and attempts to get
>> that data before the CPU actually has to wait for the contents of the
>> memory to be read from the cache hierarchy. It has to be a per-core
>> construct because it's looking at the instruction stream for that core
>> at various stages in the pipeline.
>
>I'm not a hardware guy, but I think what you're referring to is
>generally called a speculative read. It requires access to the
>instruction stream and thus must exist on every core.
>
>The hardware prefetcher that Scott is referring to monitors the actual
>requested memory accesses and tries to look for patterns. So if I'm in
>a tight loop and indirectly access address X, X+8, and X+16 the
>prefetcher is going to preload X+24, X+32, X+40... for me.

Yes, although it works by prefetching 64-byte cache-lines.

scott
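
For anyone reading the archive later, the stride-detection idea described above can be sketched as a toy model. This is only an illustration, not how any AMD or Intel part actually implements it: the class name `StridePrefetcher`, the `depth` parameter, and the "two equal deltas in a row" trigger are all assumptions made for the example. The one detail taken from the thread is the 64-byte cache-line granularity.

```python
LINE_SIZE = 64  # bytes per cache line, as stated in the last post of the thread


def line_of(addr):
    """Return the cache-line number that contains a byte address."""
    return addr // LINE_SIZE


class StridePrefetcher:
    """Toy stride detector (hypothetical; real hardware is far more elaborate).

    It watches the stream of demand addresses. Once two consecutive
    accesses show the same non-zero stride, it proposes the cache lines
    that the next `depth` accesses would touch.
    """

    def __init__(self, depth=3):
        self.depth = depth        # how many future accesses to look ahead (assumed)
        self.last_addr = None     # previous demand address
        self.last_stride = None   # previous delta between addresses

    def observe(self, addr):
        """Record one demand access and return the cache lines to prefetch."""
        prefetch = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                current = line_of(addr)
                for k in range(1, self.depth + 1):
                    target = addr + k * stride
                    if target < 0:
                        break  # ran off the bottom of the address space
                    line = line_of(target)
                    # Skip the line already being accessed and duplicates:
                    # several predicted addresses can share one 64-byte line.
                    if line != current and line not in prefetch:
                        prefetch.append(line)
            self.last_stride = stride
        self.last_addr = addr
        return prefetch


def run(addresses, depth=3):
    """Feed a sequence of addresses through a fresh prefetcher."""
    p = StridePrefetcher(depth)
    return [p.observe(a) for a in addresses]


if __name__ == "__main__":
    print(run([0, 8, 16]))            # X, X+8, X+16 with X = 0
    print(run([40, 48, 56]))          # same stride, near the end of a line
    print(run([0, 64, 128]))          # one access per cache line
    print(run([0, 4096, 320, 9000]))  # pointer-chasing style, no fixed stride
```

In this model, the X, X+8, X+16 case with X at the start of a line yields nothing to fetch, because X+24, X+32 and X+40 all sit in the 64-byte line that is already loaded. A prefetch only appears once the predicted addresses cross into a new line. The irregular sequence never triggers a prefetch at all, which matches the remark that pointer-chasing workloads get little from a pattern-based prefetcher.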