From: Andrea Arcangeli on 28 Nov 2009 14:00

Hi Hugh and everyone,

On Tue, Nov 10, 2009 at 04:33:30PM +0000, Hugh Dickins wrote:
> In fairness I've added Andrea and KOSAKI-san to the Cc, since I know
> they are two people keen to fix this issue once and for all. Whereas

Right, I'm sure Nick also wants to fix this once and for all (adding
him too to Cc ;).

I thought and I still think it's bad to leave races like this open for
people to find out the hard way. It just takes somebody to use
pthread_create, open a file with O_DIRECT with 512-byte (not page)
alignment and call fork to trigger this, and they may find out only
later after going productive on thousands of servers...

If this were too hard a problem to fix I would understand, but I have
all the patches ready to fix this completely! And they're quite
localized: they only touch fork and gup, and they don't alter the fast
path (except for 1 conditional jump in fork that is surely lost in the
noise, plus fork is anything but a fast path).

I tried to fix this in RHEL but eventually the user affected added
larger alignment to the userland app to prevent this, so it isn't as
urgent anymore and so I'd rather fix this in mainline first. This isn't
the first and surely won't be the last user that is bitten by this,
unless we take action.

> I am with Linus in the opposite camp: solutions have looked nasty,
> and short of bright new ideas, I feel we've gone as far as we ought.

There are two gup races that materialize when we wrprotect and share
an anonymous page.

bug 1) If a parent thread writes to the first half of the page while
the gup user writes to the second half of the page and then fork is
run, the O_DIRECT read from disk in the second half of the page gets
lost. In addition the child will still receive the O_DIRECT writes to
memory when it should not.

bug 2) The backward race happens after fork, when the parent starts an
O_DIRECT write to disk from the first half of the page, and then
writes to memory in the second half of the page; after that, the
child's writes to the page will be read by the parent's direct-io.

fix for bug 1) is what Nick and I implemented: it consists of copying
(instead of sharing) anon pages during fork if they could be under
gup. The two implementations are vastly different but they appear to
do the same thing (he used bitflags in the vma and in the page, I only
used a bitflag in the page; the worst thing about my patch was having
to set that bitflag in gup_fast too, but I don't like having to add a
bit to the vma when a bit in the page is enough).

fix A for bug 2) is what KOSAKI tried to implement in message-id
20090414151554.C64A.A69D9226. The trick is in having do_wp_page not
take over a page under GUP (that means reuse_swap_cache has to take
the page_count into account too, not just the mapcount). However,
taking page_count into account in reuse_swap_cache means that it won't
be capable of taking over a page under gup that got temporarily
converted to swapcache and unmapped, which leads to losing O_DIRECT
reads from disk during paging. So another change is required in the
rmap code to prevent ever unmapping any pinned anon page that could be
under GUP, to avoid losing I/O during paging.

fix B for bug 2) is what Nick and I implemented: it consists of always
de-cowing anon shared pages during gup, even in the case of
gup(write=0). That's much simpler than fix A for bug 2 and the fix
doesn't affect rmap swap semantics, but it loses some sharing
capability in gup(write=0) cases, which is not a practical matter
though.

All other patches floating around spread an mm-wide semaphore over
fork fast path, and across O_DIRECT, nfs, and aio, and they most
certainly didn't fix the two races for all gup users, and they weren't
stable because of having to identify the closure of the I/O across all
possible put_page. That approach kind of opens a can of worms and it
looks the wrong way to go to me, and I think they scale worse too for
the fast path (no O_DIRECT or no fork). Identifying the gup closure
points and replacing the raw put_page with gup_put_page would not be
a useless effort though and I felt if the gup API was just a little
bit more sophisticated I could simplify a bit the put_compound_page to
serialize the race against split_huge_page_refcount, but this is an
orthogonal issue with the mm-wide semaphore release addition which I
personally dislike.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

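The sequence behind bug 1, and the "copy instead of share during fork"
fix, can be followed in a small userspace toy model. This is only a
sketch and not kernel code: the Page and Mapping classes, the gup(),
fork() and run_bug1() helpers and the 8-byte "page" are all
hypothetical names invented for illustration, and the write fault is
simplified to always copy.

```python
class Page:
    """Toy physical page: a small byte buffer plus a count of gup pins."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.pins = 0  # stand-in for the extra references a gup user holds


class Mapping:
    """One process's view of a single virtual page."""

    def __init__(self, page, writable=True):
        self.page = page
        self.writable = writable  # models the pte write bit

    def write(self, offset, payload):
        # Simplification: a write to a write-protected mapping always
        # breaks COW by copying, leaving the old page behind.
        if not self.writable:
            self.page = Page(self.page.data)
            self.writable = True
        self.page.data[offset:offset + len(payload)] = payload


def gup(mapping):
    """Pin the page currently mapped and return it as the DMA target."""
    mapping.page.pins += 1
    return mapping.page


def fork(parent, copy_pinned):
    """Return the child's mapping.

    copy_pinned=False: always write-protect and share (the racy case).
    copy_pinned=True: copy a page that may be under gup instead of sharing.
    """
    if copy_pinned and parent.page.pins > 0:
        return Mapping(Page(parent.page.data), writable=True)
    parent.writable = False
    return Mapping(parent.page, writable=False)


def run_bug1(copy_pinned):
    parent = Mapping(Page(b"AAAAAAAA"))  # two 4-byte halves
    dma_target = gup(parent)             # O_DIRECT read into 2nd half starts
    child = fork(parent, copy_pinned)    # fork runs while I/O is in flight
    parent.write(0, b"CPU!")             # a parent thread writes the 1st half
    dma_target.data[4:8] = b"DISK"       # device completes into pinned page
    return bytes(parent.page.data), bytes(child.page.data)
```

With copy_pinned=False the parent ends up on a fresh copy that never
receives the disk data while the child sees it; with copy_pinned=True
the pinned page stays with the parent and the child gets an untouched
copy.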
From: Mark Veltzer on 28 Nov 2009 17:30

On Saturday 28 November 2009 20:50:52 you wrote:
> Hi Hugh and everyone,
>
> On Tue, Nov 10, 2009 at 04:33:30PM +0000, Hugh Dickins wrote:
> > In fairness I've added Andrea and KOSAKI-san to the Cc, since I know
> > they are two people keen to fix this issue once and for all. Whereas
>
> Right, I'm sure Nick also wants to fix this once and for all (adding
> him too to Cc ;).
>
> I thought and I still think it's bad to leave races like this open for
> people to find out the hard way. It just takes somebody to use
> pthread_create, open a file with O_DIRECT with 512-byte (not page
> ....

Hello all!

First let me state that I solved my problems by simply avoiding GUP
completely and going with a clean mmap implementation (with the nopage
version) which causes no problems whatsoever. mmap does not suffer
from all the problems discussed above (aside from the fact that you
have to do your own bookkeeping as far as vma_open, vma_close and the
fault function go...). Please correct me if I'm wrong...:)

The fact that I solved all my problems with mmap and the complexity of
the proposed solutions got me thinking about GUP in more general
terms. Would it be fair to say that mmap is much more aligned to the
kernel's way of doing things than GUP? It feels like the vma concept,
which is a solid one and probably works well for most architectures,
is in conflict with GUP and so is fork. It also feels like the vma
concept is well aligned with fork, which means in turn that mmap is
well aligned with fork while GUP is not. This is a new conclusion for
me and one which did not register back when reading the LDD book (I
got the impression that you can pick between mmap and GUP and it does
not really matter, but now I feel that mmap is much more advanced and
trouble-free).

Testing it out, I grepped the drivers folder of a recent kernel and
found ONLY 26 mentions of GUP in the entire drivers folder! The main
drivers using GUP are scsi st and infiniband. If GUP is so unused, is
it really essential as an in-kernel interface? If GUP is slowly
dropped and drivers converted to mmap, would it not simplify kernel
code or at least prevent complex solutions to GUP problems from
complicating mm code even more? Again, I'm no kernel expert so please
don't flame me too hard if I'm talking heresy or nonsense, I would
just like to hear your take on this. It may well be that I simply have
no clue and so my conclusions are way too radical...

Cheers,
Mark

From: Nick Piggin on 30 Nov 2009 07:00

On Sat, Nov 28, 2009 at 07:50:52PM +0100, Andrea Arcangeli wrote:
> All other patches floating around spread an mm-wide semaphore over
> fork fast path, and across O_DIRECT, nfs, and aio, and they most
> certainly didn't fix the two races for all gup users, and they weren't
> stable because of having to identify the closure of the I/O across all
> possible put_page. That approach kind of opens a can of worms and it
> looks the wrong way to go to me, and I think they scale worse too for
> the fast path (no O_DIRECT or no fork). Identifying the gup closure
> points and replacing the raw put_page with gup_put_page would not be
> a useless effort though and I felt if the gup API was just a little
> bit more sophisticated I could simplify a bit the put_compound_page to
> serialize the race against split_huge_page_refcount, but this is an
> orthogonal issue with the mm-wide semaphore release addition which I
> personally dislike.

IIRC, the last time this came up, it kind of became stalled on this
point. Linus hated our "preemptive cow" approaches, and thought the
above approach was better.

I don't think we need to bother arguing details between our former
approaches until we get past this sticking point.

FWIW, I need to change get_user_pages semantics somewhat because we
have filesystems that cannot tolerate a set_page_dirty() to dirty a
clean page (it must only be dirtied with page_mkwrite).

This should probably require converting callers to use put_user_pages
and disallowing lock_page, mmap_sem, user-copy etc. within these
sections.

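For readers following the "preemptive cow" discussion, here is a toy
userspace model of bug 2 and of the de-cow-during-gup idea (fix B in
Andrea's mail). It is a sketch under loud assumptions, not kernel
code: Page, Mapping, break_cow(), fork(), gup(), run_bug2() and the
decow flag are hypothetical names, and mapcount is reduced to a plain
counter that decides whether a write fault copies the page or reuses
it in place.

```python
class Page:
    """Toy physical page with a count of mappings pointing at it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.mapcount = 1


class Mapping:
    """One process's view of a single virtual page."""

    def __init__(self, page, writable=True):
        self.page = page
        self.writable = writable  # models the pte write bit

    def break_cow(self):
        """Make this mapping privately writable, as a write fault would."""
        if self.writable:
            return
        if self.page.mapcount > 1:
            # Still shared: leave the old page to the other mapper and copy.
            self.page.mapcount -= 1
            self.page = Page(self.page.data)
        # Sole mapper (or fresh copy): the page can be written in place.
        self.writable = True

    def write(self, offset, payload):
        self.break_cow()
        self.page.data[offset:offset + len(payload)] = payload


def fork(parent):
    """Write-protect the parent's page and share it with the child."""
    parent.writable = False
    parent.page.mapcount += 1
    return Mapping(parent.page, writable=False)


def gup(mapping, decow):
    """Return the page a device would read from for an O_DIRECT write.

    decow=False models gup(write=0) handing back the shared page as-is;
    decow=True models breaking COW first, even for a read-only gup.
    """
    if decow:
        mapping.break_cow()
    return mapping.page


def run_bug2(decow):
    parent = Mapping(Page(b"AAAAAAAA"))  # two 4-byte halves
    child = fork(parent)
    dma_source = gup(parent, decow)      # O_DIRECT write from 1st half starts
    parent.write(4, b"CPU!")             # parent writes the 2nd half
    child.write(0, b"KID!")              # child then writes the 1st half
    return bytes(dma_source.data[0:4])   # what the device ends up reading
```

Without de-cowing, the parent's write moves it to a copy, the child
becomes the sole mapper of the old page and writes it in place, and
the device reads the child's bytes. With de-cowing, the device reads
from the parent's private copy; the price in this model is that the
page stops being shared as soon as gup runs.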
From: Nick Piggin on 30 Nov 2009 07:10

On Sun, Nov 29, 2009 at 12:22:17AM +0200, Mark Veltzer wrote:
> On Saturday 28 November 2009 20:50:52 you wrote:
> > Hi Hugh and everyone,
> >
> > On Tue, Nov 10, 2009 at 04:33:30PM +0000, Hugh Dickins wrote:
> > > In fairness I've added Andrea and KOSAKI-san to the Cc, since I know
> > > they are two people keen to fix this issue once and for all. Whereas
> >
> > Right, I'm sure Nick also wants to fix this once and for all (adding
> > him too to Cc ;).
> >
> > I thought and I still think it's bad to leave races like this open for
> > people to find out the hard way. It just takes somebody to use
> > pthread_create, open a file with O_DIRECT with 512-byte (not page
> > ....
>
> Hello all!
>
> First let me state that I solved my problems by simply avoiding GUP
> completely and going with a clean mmap implementation (with the nopage
> version) which causes no problems whatsoever. mmap does not suffer
> from all the problems discussed above (aside from the fact that you
> have to do your own bookkeeping as far as vma_open, vma_close and the
> fault function go...). Please correct me if I'm wrong...:)
>
> The fact that I solved all my problems with mmap and the complexity of
> the proposed solutions got me thinking about GUP in more general
> terms. Would it be fair to say that mmap is much more aligned to the
> kernel's way of doing things than GUP? It feels like the vma concept,
> which is a solid one and probably works well for most architectures,
> is in conflict with GUP and so is fork. It also feels like the vma
> concept is well aligned with fork, which means in turn that mmap is
> well aligned with fork while GUP is not. This is a new conclusion for
> me and one which did not register back when reading the LDD book (I
> got the impression that you can pick between mmap and GUP and it does
> not really matter, but now I feel that mmap is much more advanced and
> trouble-free).
>
> Testing it out, I grepped the drivers folder of a recent kernel and
> found ONLY 26 mentions of GUP in the entire drivers folder! The main
> drivers using GUP are scsi st and infiniband. If GUP is so unused, is
> it really essential as an in-kernel interface? If GUP is slowly
> dropped and drivers converted to mmap, would it not simplify kernel
> code or at least prevent complex solutions to GUP problems from
> complicating mm code even more? Again, I'm no kernel expert so please
> don't flame me too hard if I'm talking heresy or nonsense, I would
> just like to hear your take on this. It may well be that I simply have
> no clue and so my conclusions are way too radical...

GUP is basically required to do any kind of IO operations on user
addresses (those not owned by your driver, i.e. arbitrary addresses)
without first copying memory into the kernel.

If you can wean O_DIRECT off get_user_pages, you'd have most of the
battle won. I don't think it's really possible though.

From: Andrea Arcangeli on 30 Nov 2009 11:20

On Mon, Nov 30, 2009 at 01:01:45PM +0100, Nick Piggin wrote:
> If you can wean O_DIRECT off get_user_pages, you'd have most of the
> battle won. I don't think it's really possible though.

Agreed. Not just O_DIRECT, virtualization requires it too: the kvm
page fault calls get_user_pages, and practically anything that uses
mmu notifier also uses get_user_pages. There are things you simply
can't do without it.

In general if the memory doesn't need to be persistently stored on
disk to survive task killage, there's not much point in using
pagecache MAP_SHARED on-disk instead of anonymous memory. This is why
anonymous memory is backing malloc, and there's no reason why people
should be prevented from issuing disk I/O in zero-copy with anonymous
memory (or tmpfs), if they know they access this data only once and
they want to manage the cache in some logical form rather than in
physical on-disk format (or if there are double physical caches more
efficiently kept elsewhere, like in the KVM guest case).

OTOH if you'd be using the I/O data in physical format in your
userland memory, then using pagecache by mmapping the file and
disabling O_DIRECT on the filesystem is surely preferred and more
efficient (if nothing else, because it also provides caching just in
case).

For drivers (Mark's case) it depends, but if you can avoid using
get_user_pages without slowing down anything you should; that usually
makes code simpler... and it won't risk suffering from these race
conditions either ;).

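As a rough userspace illustration of the "mmap the file and let the
pagecache do the work" option, the sketch below reads a range of a
file through a read-only file mapping instead of O_DIRECT. The helper
names read_via_pagecache() and demo() and the temporary data file are
assumptions made up for this example; only the standard Python mmap,
os and tempfile modules are used.

```python
import mmap
import os
import tempfile


def read_via_pagecache(path, offset, length):
    """Read bytes through a read-only file mapping (no O_DIRECT)."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m[offset:offset + length])


def demo():
    # Hypothetical data file created only for this example.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"0123456789" * 100)
        return read_via_pagecache(path, 10, 10)
    finally:
        os.remove(path)
```

Because the data is read through the file's own pagecache pages, no
anonymous page is pinned for the I/O, so the fork races discussed in
this thread have nothing to act on in this pattern.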