From: William Ahern on 23 Mar 2010 15:34

Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
> The algorithm is essentially a huge deterministic finite
> automaton where the memory required is much larger than the
> largest cache, and the memory access pattern is essentially
> unpredictable to any cache algorithm.
>
> The essential core processing of this DFA is to lookup in
> memory the next location to look up in memory, it does very
> little else.

I don't have sufficient experience in this particular area to be of much help. But assuming you're just executing a single DFA, and if I was in your shoes, I see two non-exclusive options: (1) research parallel DFA algorithms using ACM Portal (subscription required) and CiteSeerX; and (2) look for ways prefetching--as mentioned elsethread--might help, including using a separate dedicated thread to prefetch or otherwise make sure the memory is being fully utilized.

To reiterate, just because it's memory bound doesn't mean it's using memory optimally. Multiple processors may be helpful. Indeed, a prefetcher is just another processor, either a general purpose CPU being used to prefetch, or dedicated hardware executing in parallel to prefetch.
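If the input can be split into independent streams, one way to act on the "make sure the memory is being fully utilized" advice without explicit prefetching is to walk several streams through the table in one interleaved loop, so that several cache/TLB misses are outstanding at once. This is only a sketch under that assumption; the table layout, K, and the function names are invented for illustration:

#include <cstddef>
#include <cstdint>
#include <vector>

// Transition table: next_state = table[state * 256 + input_byte].
// Walking K independent streams side by side lets the CPU overlap the
// cache/TLB misses of one stream with work on the others
// (memory-level parallelism).
static const int K = 4;  // number of interleaved streams (tune per machine)

void run_interleaved(const std::vector<uint32_t>& table,
                     const uint8_t* const input[K],
                     size_t len,            // assume all K streams have equal length
                     uint32_t state[K])     // one current state per stream
{
    for (size_t i = 0; i < len; ++i) {
        for (int k = 0; k < K; ++k) {
            state[k] = table[state[k] * 256u + input[k][i]];
        }
    }
}

Explicit __builtin_prefetch calls only pay off when a future address is known in advance, which a single dependent chase does not provide; interleaving independent chases sidesteps that limitation.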
From: Scott Lurndal on 23 Mar 2010 16:05

William Ahern <william(a)wilbur.25thandClement.com> writes:
>Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
>> The algorithm is essentially a huge deterministic finite
>> automaton where the memory required is much larger than the
>> largest cache, and the memory access pattern is essentially
>> unpredictable to any cache algorithm.
>>
>> The essential core processing of this DFA is to lookup in
>> memory the next location to look up in memory, it does very
>> little else.
>
>I don't have sufficient experience in this particular area to be of much
>help. But assuming you're just executing a single DFA, and if I was in your
>shoes, I see two non-exclusive options: (1) research parallel DFA algorithms
>using ACM Portal (subscription required) and CiteSeerX; and (2) look
>for ways prefetching--as mentioned elsethread--might help, including using a
>separate dedicated thread to prefetch or otherwise make sure the memory is
>being fully utilized.
>
>To reiterate, just because it's memory bound doesn't mean it's using memory
>optimally. Multiple processors may be helpful. Indeed, a prefetcher is just
>another processor, either a general purpose CPU being used to prefetch, or
>dedicated hardware executing in parallel to prefetch.

One of the things that really kills this type of application is TLB refills. Since each TLB refill for a 4k page results in several memory accesses (3 or 4, depending on 32-bit or 64-bit page tables), a TLB miss on each reference (a characteristic of such applications) means you're performing 4 or 5 memory accesses to retrieve the desired cache line. Granted, the higher-level page table entries _may_ still be in the processor cache, but the lower-level ones are very unlikely to be.

To improve this, code your application to use larger page sizes (4MB on 32-bit systems, 2MB on 64-bit, 1GB on Opteron). If such page sizes are supported by the OS (Linux supports only 4MB/2MB pages for applications), you can improve your application's performance considerably. 1GB pages are best if you can get them, since a TLB fill then needs only two references, one of which will be cached. I don't think any of the distros support 1GB pages yet; not sure about the kernel.

On Linux, look for hugetlbfs (mmap) or the SHM_HUGETLB flag for shmget. One caveat: because Intel and AMD are lazy[*], they require large pages to be naturally aligned - this means that large pages need to be reserved and allocated at boot time, or early after boot, to ensure that sufficient aligned contiguous physical memory is available.

scott

[*] Perhaps not fair, but they can currently just wire-OR the address bits within the page with the physical address in the TLB to generate the resulting address. If they allowed misaligned large pages, they'd need to do an add instead, which may perform less well but would make large pages _much_ more usable.
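A minimal sketch of requesting 2MB pages on Linux via an anonymous mmap, assuming the kernel is recent enough to have MAP_HUGETLB and that huge pages have been reserved beforehand (e.g. through /proc/sys/vm/nr_hugepages); on older setups you would instead mmap a file from a mounted hugetlbfs. The 512MB size and the fallback path are illustrative, not taken from the thread:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* Length must be a multiple of the huge page size (2MB here). */
    size_t len = 512UL * 1024 * 1024;

    /* Ask for anonymous memory backed by huge pages. */
    void *table = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (table == MAP_FAILED) {
        /* No huge pages reserved (or kernel too old): fall back to 4k pages. */
        perror("mmap(MAP_HUGETLB)");
        table = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (table == MAP_FAILED) { perror("mmap"); return 1; }
    }

    memset(table, 0, len);   /* touch the memory so it is actually faulted in */
    printf("table at %p\n", table);
    munmap(table, len);
    return 0;
}

The win is exactly the one described above: with 2MB pages the DFA's transition table spans far fewer TLB entries, so a random access is much less likely to pay for a page-table walk on top of the cache miss.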
From: Peter Olcott on 23 Mar 2010 21:41

"William Ahern" <william(a)wilbur.25thandClement.com> wrote in message news:3nhn77-cnl.ln1(a)wilbur.25thandClement.com...
> Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
>> The algorithm is essentially a huge deterministic finite
>> automaton where the memory required is much larger than the
>> largest cache, and the memory access pattern is essentially
>> unpredictable to any cache algorithm.
>>
>> The essential core processing of this DFA is to lookup in
>> memory the next location to look up in memory, it does very
>> little else.
>
> I don't have sufficient experience in this particular area to be of much
> help. But assuming you're just executing a single DFA, and if I was in your
> shoes, I see two non-exclusive options: (1) research parallel DFA algorithms
> using ACM Portal (subscription required) and CiteSeerX; and (2) look
> for ways prefetching--as mentioned elsethread--might help, including using a
> separate dedicated thread to prefetch or otherwise make sure the memory is
> being fully utilized.
>
> To reiterate, just because it's memory bound doesn't mean it's using memory
> optimally. Multiple processors may be helpful. Indeed, a prefetcher is just
> another processor, either a general purpose CPU being used to prefetch, or
> dedicated hardware executing in parallel to prefetch.

I was inspired by a member of another group to write this code, which tests the memory access performance of a machine. I ran it as four separate processes, and because the execution time (on a quad-core processor) was only slightly degraded compared to a single process, I reasonably concluded that I am likely not experiencing a memory bandwidth bottleneck. All the code does is jump to distant places in a very large std::vector. Slight modifications may be required to run under Unix.

#include <stdio.h>
#include <stdlib.h>
#include <vector>
#include <time.h>

#define uint32 unsigned int

const uint32 repeat = 2;
//const uint32 size = 488000000;
const uint32 size = 100000000;
std::vector<uint32> Data;

void Process()
{
    clock_t finish;
    clock_t start = clock();
    double duration;
    uint32 num = 0;                       // start the chase at element 0

    for (uint32 r = 0; r < repeat; r++)
        for (uint32 i = 0; i < size; i++)
            num = Data[num];              // each load's address depends on the previous load

    finish = clock();
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("%4.2f Seconds (num=%u)\n", duration, num);   // print num so the loop isn't optimized away
}

int main()
{
    printf("Size in bytes--->%u\n", size * 4);
    Data.reserve(size);
    for (uint32 N = 0; N < size; N++) {
        uint32 Random = (uint32)rand() * (uint32)rand();  // unsigned multiply avoids signed overflow
        Random %= size;
        // printf("Random--->%d\n", Random);
        Data.push_back(Random);
    }
    char N;
    printf("Hit any key to Continue:");
    scanf("%c", &N);
    Process();
    return 0;
}
From: Moi on 24 Mar 2010 07:13
On Mon, 22 Mar 2010 12:10:48 -0500, Peter Olcott wrote:

> "William Ahern" <william(a)wilbur.25thandClement.com> wrote in message
> news:pe9k77-gjk.ln1(a)wilbur.25thandClement.com...
>> Peter Olcott <NoSpam(a)ocr4screen.com> wrote:
>>> "Eric Sosman" <esosman(a)ieee-dot-org.invalid> wrote in message
>>> news:ho5tof$lon$1(a)news.eternal-september.org...
>> <snip>
>>>> But if there's another CPU/core/strand/pipeline, it's possible that
>>>> one processor's stall time could be put to productive use by another
>>>> if there were multiple execution threads.
>> <snip>
>>> is there any possible way that this app is not memory bound that you
>>> can provide a specific concrete example of?
>>
>> Your question was answered.
>>
>> You're hung up on your numbers and preconceived ideas. Your application
>> could be BOTH memory bound AND able to benefit from multiple CPUs. But
>> it's nearly impossible to guess without knowing at least the algorithm;
>> more specifically, the code.
>
> The algorithm is essentially a huge deterministic finite automaton where
> the memory required is much larger than the largest cache, and the
> memory access pattern is essentially unpredictable to any cache
> algorithm.
>
> The essential core processing of this DFA is to lookup in memory the
> next location to look up in memory, it does very little else.

The obvious way to accelerate this program would, IMHO, be to 'cluster' the "frequently occurring together" states into adjacent positions in core (you only need to renumber the states). You could start by putting the "hard hitters" together. Assuming a 90/10 rule, that would buy you a factor of about four in memory bandwidth pressure. Also, keeping the structures aligned (modulo) slot size could save you some cache misses.

/2cts
AvK
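A rough sketch of the renumbering idea: count state visits during a representative run, give the hottest states the smallest ids so they sit adjacent in memory, and rewrite the transition table. The 256-wide table layout and the function names are assumptions for illustration, not from the post:

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// old_table[s * 256 + c] is the successor of state s on input byte c.
// visit_count[s] comes from instrumenting a run over representative input.
std::vector<uint32_t> renumber_hot_states(const std::vector<uint32_t>& old_table,
                                          const std::vector<uint64_t>& visit_count)
{
    const uint32_t n = static_cast<uint32_t>(visit_count.size());

    // Order states by descending visit count: the hottest states get the
    // smallest new ids, so they end up packed together in memory.
    std::vector<uint32_t> by_heat(n);
    std::iota(by_heat.begin(), by_heat.end(), 0u);
    std::sort(by_heat.begin(), by_heat.end(),
              [&](uint32_t a, uint32_t b) { return visit_count[a] > visit_count[b]; });

    std::vector<uint32_t> new_id(n);
    for (uint32_t rank = 0; rank < n; ++rank)
        new_id[by_heat[rank]] = rank;

    // Rebuild the transition table in the new numbering.
    std::vector<uint32_t> new_table(old_table.size());
    for (uint32_t s = 0; s < n; ++s)
        for (uint32_t c = 0; c < 256; ++c)
            new_table[new_id[s] * 256 + c] = new_id[old_table[s * 256 + c]];

    return new_table;
}

A more ambitious version would cluster states that tend to follow one another, so that a whole neighborhood of the hot region lands in the same page or cache line, which is what the 90/10 argument above is aiming at.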