Prev: [PATCH 0/3] HID: make raw output callback more flexible
Next: kprobes: Disable booster when CONFIG_PREEMPT=y
From: Benjamin Herrenschmidt on 6 Mar 2010 16:10 On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote: > On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote: > > On a fault in of exec data, we first try to get the page out of the page > > cache. If it's not present, we put the faulting process to sleep and > > fetch it in from storage. When we do the read, on the PIO path, the > > kernel alias for the page becomes dirty. Some time later, we place the > > page into the user space (updating the pte entry that caused a fault). > > At this point, we'll call both flush_icache_page() and > > update_mmu_cache() ... this is where the I/D resolution should be done. > > No - this is where things get extremely icky. > > The problem at this point occurs on SMP architectures. As soon as you > update the PTE entry, it is visible to other threads of the application. > If you do I-cache handling after updating the PTE, then there is a window > where another CPU can execute the page: Right, we actually hit that bug on powerpc, however, James explanation is misleading, ie, I think the -code- actually is right and flush_icache_page() is called before set_pte_at(). However, see my other email, I have other issues with it as it is, but nothing unfixable. So for now, I keep my flush in set_pte_at() and ptep_set_access_flags(), we'll see if I can move that to an improved flush_icache_page(). In fact, even set_pte_at() isn't a panacea for me, as I want the fault type as well. Cheers, Ben. > CPU0 CPU1 > speculatively prefetches from page N via kernel > mapping, loads garbage into I-cache > attempts to execute P > page fault > page N allocated > set_pte_at > executes P > *splat* > flush I-cache -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: James Bottomley on 6 Mar 2010 22:40 On Sun, 2010-03-07 at 08:03 +1100, Benjamin Herrenschmidt wrote: > On Sat, 2010-03-06 at 16:17 +0530, James Bottomley wrote: > > On a fault in of exec data, we first try to get the page out of the page > > cache. If it's not present, we put the faulting process to sleep and > > fetch it in from storage. When we do the read, on the PIO path, the > > kernel alias for the page becomes dirty. Some time later, we place the > > page into the user space (updating the pte entry that caused a fault). > > At this point, we'll call both flush_icache_page() and > > update_mmu_cache() ... this is where the I/D resolution should be done. > > Since it's after any I/O has occurred, it doesn't matter whether the CPU > > speculatively moved anything in or not. As long as you flush the kernel > > alias and invalidate the user I and D aliases, we're good to go. Using > > the page arch flags is really only to optimise this process (defer > > kernel D alias flushing). > > Ok, so while flush_icache_page() looks like something we could use > instead of set_pte_at() for the icache flushing, it doesn't answer all > the questions. Off the top of my mind: OK, so what I was actually trying to get across is the point that we don't handle I cache problems in the I/O or page cache code ... we handle them in the mm code, so the mm piece of the above was deliberately a bit vague. > - I see the calls to flush_icache_page() in mm/memory.c but I don't see > them next to all set_pte_at() that insert a valid PTE. For example, we > don't flush the icache for anonymous pages. While that might seem like a > good idea, we have been under pressure to "fix" that on powerpc to make > sure there is no stale icache content from another process leaking into > userspace. I'm not entirely sure what flush_icache_page() is supposed to do. On parisc it flushes the *kernel* icache ... which has got to be wrong. According to cachetlb.txt it's an obsolete interface. > - It needs to be done -before- set_pte_at() but I think the code does it > right, only your explanation above makes it unclear :-) Sorry, like I said, I only sketched the mm piece. However, at least on parisc, there's a technical problem with flushing before we have the pte: On VIPT systems, we need a mapping before the flush will work. I was experimenting with a mechanism whereby we set aside in the kernel an aligned region of our congruence size and simply flushed in that region with the correct mappings, but we haven't got around to implementing it in the kernel yet. > - It doesn't take the PTE pointer as an argument, so here goes our trick > on powerpc of filtering out exec permission rather than flushing when a > page is accessed by a read fault > > - We -still- have the problem of tracking whether the icache has been > flushed or not yet for a given physical page on archs with PIPT (or non > aliasing VIPT) like powerpc. Without that tracking, we flush a lot more > than necessary since we'll end up flushing things like glibc text pages > for every process they are mapped into which is totally wasteful. Thus > the idea of using a new PG bit to separate D$ from I$ tracking still > makes sense. So, assuming full congruence of user space, can't you use the VMA as an indicator? i.e. if we have no user space mappings, we have to flush the icache ... if we have one or more, the icache has been flushed and placing the same page congruently in a different address space benefits from that prior flush, so consequently there's no need to flush again? I also think we've established the relevant facts for the I/O thread (that we only need to either flush the kernel D cache or mark it as to be flushed later on PIO reads). We're now into deep technicalities of how the mm system operates at the architecture level, so perhaps we should move this to linux-arch? James -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: James Bottomley on 7 Mar 2010 01:00 On Sat, 2010-03-06 at 19:36 +0000, Russell King - ARM Linux wrote: > On Sat, Mar 06, 2010 at 04:17:23PM +0530, James Bottomley wrote: > > On a fault in of exec data, we first try to get the page out of the page > > cache. If it's not present, we put the faulting process to sleep and > > fetch it in from storage. When we do the read, on the PIO path, the > > kernel alias for the page becomes dirty. Some time later, we place the > > page into the user space (updating the pte entry that caused a fault). > > At this point, we'll call both flush_icache_page() and > > update_mmu_cache() ... this is where the I/D resolution should be done. > > No - this is where things get extremely icky. OK, but the point I'm trying to make is that the page cache code, including the I/O layer, only manages kernel D alias state (either by flushing or marking it dirty). The user space I/D handling is done in the mm code (I'm not claiming it's done correctly there, just claiming it's done there). > The problem at this point occurs on SMP architectures. As soon as you > update the PTE entry, it is visible to other threads of the application. > If you do I-cache handling after updating the PTE, then there is a window > where another CPU can execute the page: > > CPU0 CPU1 > speculatively prefetches from page N via kernel > mapping, loads garbage into I-cache > attempts to execute P > page fault > page N allocated > set_pte_at > executes P > *splat* > flush I-cache OK, so I can believe this. We see extremely rare segfaults on parisc which look to be the result of some I flush race like this. However, I think for a discussion of problems with the arch and mm interfaces, we should probably move off the usb list and onto linux-arch. Our specific problem on parisc is that being VIPT we can't do an I (or D) user flush without a mapping. We have two schemes for fixing this: One is to use a PAGE_FLUSH flag for the mapping ... it allows the flushes to work but refuses any type of RWX access (can do this because we have a software TLB). The other is to use a flush area within the kernel where we flush a page congruent to the userspace address ... I haven't got this working yet, and it's a bit wasteful of kernel address space because our congruence modulus is 4MB. James -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Pavel Machek on 7 Mar 2010 03:30 Hi! > > Seems like ARM has requirement other architectures do not, that is > > a) not documented anywhere > > b) causes problems > > Well, ARM is pretty similar to other architectures in this respect. And > I'm sure other architectures have similar problems, only that they only > become visible in some circumstances they may not have encountered (i.e. > PIO drivers + filesystem that doesn't call flush_dcache_page like ext*). > Some other architectures may do heavier flushing > > Of course, a Documentation/arm/cachetlb.txt file would make sense. Actually, short/simple documentation for driver authors would be even better. Then you can claim it is bug in driver :-). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: FUJITA Tomonori on 8 Mar 2010 03:50
On Sun, 07 Mar 2010 09:07:17 +0530 James Bottomley <James.Bottomley(a)HansenPartnership.com> wrote: > So, assuming full congruence of user space, can't you use the VMA as an > indicator? i.e. if we have no user space mappings, we have to flush the > icache ... if we have one or more, the icache has been flushed and > placing the same page congruently in a different address space benefits > from that prior flush, so consequently there's no need to flush again? I'm not sure about this (sounds like the trick might work for some though). As I said earlier, I think that IA64 could avoid flushing I-cache even if the page has no user space mappings (if it did dma to the page). ia64 needs to track pages for that. As Ben said, I guess that we need two separate bits for D and I. I think that it's a good idea to standardize how to use the bits for optimization (some uses none, some uses only one, some needs both though). And then we need to revisit I/O path (fs, the block layer, drivers). Seems that we added flush_dcache_page() everywhere. > I also think we've established the relevant facts for the I/O thread > (that we only need to either flush the kernel D cache or mark it as to > be flushed later on PIO reads). We have the PIO issue about D-cache aliasing now? That's, don't mm/ or fs/ already flush D-cache properly? I thought that Catalin has only D/I cache consistency issue. If not, PIO doesn't also work powerpc that handles properly D/I cache consistency. > We're now into deep technicalities of > how the mm system operates at the architecture level, so perhaps we > should move this to linux-arch? Yeah, probably we should. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |