From: Piotr Wyderski on 3 Aug 2010 09:00

Terje Mathisen wrote:
> Writing to CR3 to invalidate the entire TLB subsystem is _very_
> expensive: Not because the operation itself takes so long, but because
> you have to reload the 90+% of data which is still needed.

Of course, but I wonder why the operation itself takes so long.
Conceptually it is very similar to mfence.

Best regards
Piotr Wyderski
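For reference, the operation in question - and the finer-grained alternative that comes up later in the thread - look roughly like this in a minimal ring-0 sketch (GCC-style inline assembly for x86-64; the wrapper names are illustrative, not taken from any particular kernel):

#include <stdint.h>

/* Full flush: reloading CR3 with its current value discards all
 * non-global TLB entries for the current address space. Ring 0 only. */
static inline void tlb_flush_all(void)
{
    uint64_t cr3;
    __asm__ volatile("mov %%cr3, %0\n\t"
                     "mov %0, %%cr3"
                     : "=r"(cr3) : : "memory");
}

/* Targeted flush: INVLPG drops only the TLB entry (or entries) that
 * translate the given linear address, leaving the rest of the TLB warm. */
static inline void tlb_flush_page(const void *linear_addr)
{
    __asm__ volatile("invlpg (%0)" : : "r"(linear_addr) : "memory");
}

Both are privileged instructions, which is part of why they are awkward to measure casually from user space, as James notes below.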
From: James Harris on 3 Aug 2010 09:34

On 3 Aug, 14:00, "Piotr Wyderski" <piotr.wyder...(a)mothers.against.spam.gmail.com> wrote:
> Terje Mathisen wrote:
> > Writing to CR3 to invalidate the entire TLB subsystem is _very_
> > expensive: Not because the operation itself takes so long, but because
> > you have to reload the 90+% of data which is still needed.
>
> Of course, but I wonder why the operation itself takes so long.
> Conceptually it is very similar to mfence.

I'm not sure I see the similarity to mfence, but this branch of the thread has become x86-based so I'll carry on in that vein.

I haven't measured either a reload of CR3 or an invlpg; both need low-level access that is not readily available. However, we can say that there is an immediate cost and a longer-term one, and neither is specified anywhere. If you are *sure* they are cheap then it's fine to go ahead and do them in all cases, but if there's any doubt it's best to avoid them. There have been reports - though I haven't measured them myself - that some Intel operations take a surprisingly long time.

That said, to put them in context: if a page has had to be swapped in from disk then the cost of invalidating the whole TLB would be negligible compared to the disk access time. In fact, since the faulting task is just about to be restarted, the TLB likely contains entries for another task and needs to be flushed anyway.

On the other hand, swapping in a page is just one possible response. The page fault may, instead, require a pre-zeroed page to be mapped in. That would be very quick, could be done immediately, and keeps the faulting process running. In this case the existing TLB entries would still be in use and should be kept. Invalidations here could be expensive to carry out and/or to recover from.

James
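Given ring-0 access, the immediate cost James distinguishes could be timed with something like the sketch below (x86-64, GCC-style inline assembly; a more careful benchmark would also serialize after the second timestamp, e.g. with RDTSCP, and average many runs):

#include <stdint.h>

/* CPUID before RDTSC keeps earlier instructions from drifting past the
 * timestamp read (64-bit code; EBX/ECX are clobbered by CPUID). */
static inline uint64_t rdtsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ volatile("cpuid\n\t"
                     "rdtsc"
                     : "=a"(lo), "=d"(hi)
                     : "a"(0)
                     : "rbx", "rcx");
    return ((uint64_t)hi << 32) | lo;
}

/* Cycles for the CR3 round trip itself.  The longer-term cost - the
 * page walks that refill the entries just thrown away - is paid later,
 * in whatever code runs next, and does not show up in this number. */
static uint64_t time_cr3_reload(void)
{
    uint64_t cr3;
    uint64_t t0 = rdtsc_serialized();
    __asm__ volatile("mov %%cr3, %0\n\t"
                     "mov %0, %%cr3"
                     : "=r"(cr3) : : "memory");
    return rdtsc_serialized() - t0;
}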
From: Andy Glew "newsgroup at on 3 Aug 2010 10:43

On 8/3/2010 4:55 AM, Piotr Wyderski wrote:
> Andy Glew wrote:
>
>> Heck: if you yourself can rewalk the page tables, on all machines you
>> can avoid the "expensive TLB invalidation".
>
> On the other hand, why is the TLB invalidation expensive?
> There are two ways to do it, the first is via invlpg and the
> other is to write to cr3. But both of them should be relatively cheap,
> i.e. wait until the LSU pipe is empty and then pulse
> a global edge/level reset line of the TLB subsystem. Why
> isn't the reality as simple as that?

As Terje notes, invalidating the entire TLB, or even only the local (non-global) entries, via a write to CR3 imposes a major TLB reload cost.

INVLPG should not need to be that expensive, although it should be noted that "waiting until the LSU pipe is empty" can itself take quite a few cycles, 30-100. More likely, though, the implementation waits until the entire pipeline is drained, to take into account the possibility of ITLB invalidation. I suppose you could check whether the entry is in the ITLB, and drain only the data side if not.

I suspect, however, that the original poster is thinking about doing a multiprocessor TLB shootdown.
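For readers unfamiliar with the term: because each CPU caches translations in its own TLB, removing or demoting a mapping in a shared address space means every other CPU that might hold the stale entry has to be interrupted and asked to invalidate it too. A bare-bones sketch of that protocol, with the IPI plumbing left as hypothetical stand-ins rather than any real OS's API:

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical kernel services -- stand-ins, not a real interface. */
extern int  smp_num_cpus(void);
extern int  smp_this_cpu(void);
extern void smp_send_ipi(int cpu, int vector);

#define TLB_SHOOTDOWN_VECTOR 0xfd        /* illustrative vector number */

static _Atomic(void *) shootdown_addr;
static atomic_int      pending_acks;

/* Run on the CPU that just changed a PTE. */
static void tlb_shootdown(void *linear_addr)
{
    int me = smp_this_cpu();

    atomic_store(&shootdown_addr, linear_addr);
    atomic_store(&pending_acks, smp_num_cpus() - 1);

    for (int cpu = 0; cpu < smp_num_cpus(); cpu++)
        if (cpu != me)
            smp_send_ipi(cpu, TLB_SHOOTDOWN_VECTOR);

    /* Flush the local entry, then wait for every other CPU to do the
     * same; only then is it safe to reuse or free the old page. */
    __asm__ volatile("invlpg (%0)" : : "r"(linear_addr) : "memory");
    while (atomic_load(&pending_acks) != 0)
        __asm__ volatile("pause");
}

/* Interrupt handler that the IPI runs on each of the other CPUs. */
static void tlb_shootdown_ipi_handler(void)
{
    void *addr = atomic_load(&shootdown_addr);
    __asm__ volatile("invlpg (%0)" : : "r"(addr) : "memory");
    atomic_fetch_sub(&pending_acks, 1);
}

Most of the expense is in the synchronization: every target CPU takes an interrupt and the initiator waits for the slowest responder, which dwarfs the cost of the INVLPG itself.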
From: EricP on 3 Aug 2010 10:49

James Harris wrote:
>
> I'm not sure I see the similarity to mfence but this branch of the
> thread has become x86-based so I'll carry on in that vein.

http://developer.intel.com/products/processor/manuals/index.htm

Intel manual 3A, System Programming Guide, Part 1 (#253668), Section 4.10 "CACHING TRANSLATION INFORMATION" covers TLB caching (over 16 pages of information):
http://developer.intel.com/Assets/PDF/manual/253668.pdf

Eric
From: EricP on 3 Aug 2010 10:59
Andy Glew wrote:
>
> If ever you see flakey results, on x86 or elsewhere I would strongly
> suggest that you have your invalid page exception handler rewalk the
> page tables to see if the page is, indeed, invalid.

In a multi-threaded SMP OS I think you may, depending on the OS design, have to always do that. The OS _should_ allow concurrent page faults from different threads in the same process - there is no reason not to - with access to the process page table coordinated by a mutex. It is therefore possible that, between when a fault occurs and when the table mutex is granted, another thread patches up the PTE.

There are also various atomic PTE update issues to consider, and other race conditions, because the CPU hardware does not use spinlocks or mutexes to coordinate its table accesses and updates (it does use atomic operations to update the PTE A and D bits), whereas any OS changes made by the various threads are coordinated.

Eric
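A sketch of the recheck-under-mutex pattern Eric describes, plus the atomic-update point about the A and D bits (the lock, lookup, and population routines are hypothetical placeholders; only the PTE bit positions are real x86):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 PTE bits used below. */
#define PTE_PRESENT  0x001ULL
#define PTE_WRITABLE 0x002ULL
#define PTE_DIRTY    0x040ULL

/* Hypothetical OS hooks -- placeholders, not a real kernel interface. */
typedef struct process process_t;
extern void              pagetable_lock(process_t *p);
extern void              pagetable_unlock(process_t *p);
extern _Atomic uint64_t *pte_lookup(process_t *p, uintptr_t vaddr);
extern uint64_t          make_present_pte(process_t *p, uintptr_t vaddr);

void handle_page_fault(process_t *proc, uintptr_t fault_vaddr)
{
    pagetable_lock(proc);

    _Atomic uint64_t *pte = pte_lookup(proc, fault_vaddr);

    /* Recheck under the mutex: another thread of the same process may
     * have faulted on the same page and patched the PTE while we were
     * waiting for the lock. */
    if (atomic_load(pte) & PTE_PRESENT) {
        pagetable_unlock(proc);
        return;              /* nothing to do; just rerun the access */
    }

    /* The entry is not present, so the hardware walker is not racing
     * us on its A/D bits and a plain store is enough to install it. */
    atomic_store(pte, make_present_pte(proc, fault_vaddr));

    pagetable_unlock(proc);
}

/* Demoting a *present* PTE (e.g. write-protecting it) is different:
 * the CPU may set the Accessed or Dirty bit in that word at any time,
 * so a blind read-modify-write could lose the update.  Use CAS. */
bool pte_clear_writable(_Atomic uint64_t *pte)
{
    uint64_t old = atomic_load(pte);
    uint64_t newval;
    do {
        newval = old & ~PTE_WRITABLE;
    } while (!atomic_compare_exchange_weak(pte, &old, newval));
    return (old & PTE_DIRTY) != 0;   /* caller may need a writeback */
}

And, of course, after something like pte_clear_writable the stale writable translation may still sit in one or more TLBs, which is exactly where the shootdown discussed earlier in the thread comes in.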