Prev: [Bug #15615] NULL pointer deref in task_is_waking
Next: [Bug #15603] lockdep warning at boot time when determining whether to resume
From: Rik van Riel on 7 Apr 2010 18:00 On 04/07/2010 05:19 PM, Linus Torvalds wrote: > Comments? I remember there being an "unfixable" spot with this approach when I originally wrote the new anon_vma linking code. However, I can't for the life of me find that spot. I am starting to believe I made it fixable as a side effect of one of the changes I made :) One of the issues with your patch is that anon_vma_prepare can fail and this patch ignores its return value. Having anon_vma-prepare fail after an mremap or mprotect might result in messing up the VMAs of a process, or having to undo the VMA changes that were made. In fact, this may be the problem I was running into - not wanting to add even more complex error paths to the vma shuffling code. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on 7 Apr 2010 18:20 On Wed, 7 Apr 2010, Rik van Riel wrote: > > One of the issues with your patch is that anon_vma_prepare > can fail and this patch ignores its return value. Yes. The failure point is too late to do anything really interesting with, and the old code also just causes a SIGBUS. My intention was to change the WARN_ONCE(!vma->anon_vma); into returning that SIGBUS - which is not wonderful, but is no different from old failures. In the long run, it would be nicer to actually return an error from the mmap() that fails, but that's more complicated, and as mentioned, it's not what the old code used to do either (since the failure point was always at the page fault stage). > Having anon_vma-prepare fail after an mremap or mprotect > might result in messing up the VMAs of a process, or having > to undo the VMA changes that were made. We really aren't any worse off than we have always been. If anon_vma_prepare() fails, the vma list will be valid, but no new pages can be added to that vma. That used to be true before too. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on 7 Apr 2010 18:30 On Wed, 7 Apr 2010, Linus Torvalds wrote: > > In the long run, it would be nicer to actually return an error from the > mmap() that fails, but that's more complicated, and as mentioned, it's not > what the old code used to do either (since the failure point was always at > the page fault stage). Put another way: I'm not proud of it, but the new code isn't any worse than what we used to have, and I think the new code is _fixable_. The easiest way to do that would likely be to pre-allocate the anon_vma struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That way anon_vma_prepare() itself can never fail, and all we need to do is a simple allocation earlier in the call-chain. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Linus Torvalds on 7 Apr 2010 19:50 On Wed, 7 Apr 2010, Linus Torvalds wrote: > > Yes. The failure point is too late to do anything really interesting with, > and the old code also just causes a SIGBUS. My intention was to change the > > WARN_ONCE(!vma->anon_vma); > > into returning that SIGBUS - which is not wonderful, but is no different > from old failures. Not SIGBUS, but VM_FAULT_OOM, of course. IOW, something like this should be no worse than what we have now, and has the much nicer locking semantics. Having done some more digging, I can point to a downside: we do end up having about twice as many anon_vma entries. It seems about half of the vma's never need an anon_vma entry, probably because they end up being read-only file mappings, and thus never trigger the anonvma case. That said: - I don't really think you can fix the locking problem you have in a saner way - the anon_vma entry is much smaller than the vm_area_struct, so we're still using much less memory for them than for vma's. - We _could_ avoid allocating anonvma entries for shared mappings or for mappings that are read-only. That might force us to allocate some of them at mprotect time, and/or when doing a forced COW event with ptrace, but we have the mmap_sem for writing for the one case, and we could decide to get it for the other. So it's not a _fundamental_ problem if we decide we want to recover most of the memory lost by doing unconditional allocations. There are alternative models. For example, the VM layer _could_ decide to just release the mmap_sem, and re-do it and take it for writing if the vma doesn't have an anon_vma. I dunno. I like how this patch makes things so much less subtle, though. For example: with this in place, we could further simplify anon_vma_prepare(), since it would now never have the re-entrancy issue and wouldn't need to worry about taking that page_table_lock and re-testing vma->anon_vma for races. Linus --- mm/memory.c | 12 +++--------- mm/mmap.c | 17 ++++------------- 2 files changed, 7 insertions(+), 22 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 833952d..b5efe76 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2223,9 +2223,6 @@ reuse: gotten: pte_unmap_unlock(page_table, ptl); - if (unlikely(anon_vma_prepare(vma))) - goto oom; - if (is_zero_pfn(pte_pfn(orig_pte))) { new_page = alloc_zeroed_user_highpage_movable(vma, address); if (!new_page) @@ -2766,8 +2763,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, /* Allocate our own private page. */ pte_unmap(page_table); - if (unlikely(anon_vma_prepare(vma))) - goto oom; page = alloc_zeroed_user_highpage_movable(vma, address); if (!page) goto oom; @@ -2863,10 +2858,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (flags & FAULT_FLAG_WRITE) { if (!(vma->vm_flags & VM_SHARED)) { anon = 1; - if (unlikely(anon_vma_prepare(vma))) { - ret = VM_FAULT_OOM; - goto out; - } page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address); if (!page) { @@ -3115,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, pmd_t *pmd; pte_t *pte; + if (!vma->anon_vma) + return VM_FAULT_OOM; + __set_current_state(TASK_RUNNING); count_vm_event(PGFAULT); diff --git a/mm/mmap.c b/mm/mmap.c index 75557c6..c14284b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -463,6 +463,8 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, mm->map_count++; validate_mm(mm); + + anon_vma_prepare(vma); } /* @@ -479,6 +481,8 @@ static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) BUG_ON(__vma && __vma->vm_start < vma->vm_end); __vma_link(mm, vma, prev, rb_link, rb_parent); mm->map_count++; + + anon_vma_prepare(vma); } static inline void @@ -1674,12 +1678,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address) if (!(vma->vm_flags & VM_GROWSUP)) return -EFAULT; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; anon_vma_lock(vma); /* @@ -1720,13 +1718,6 @@ static int expand_downwards(struct vm_area_struct *vma, { int error; - /* - * We must make sure the anon_vma is allocated - * so that the anon_vma locking is not a noop. - */ - if (unlikely(anon_vma_prepare(vma))) - return -ENOMEM; - address &= PAGE_MASK; error = security_file_mmap(NULL, 0, 0, 0, address, 1); if (error) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Rik van Riel on 7 Apr 2010 20:50
On 04/07/2010 06:15 PM, Linus Torvalds wrote: > On Wed, 7 Apr 2010, Linus Torvalds wrote: >> >> In the long run, it would be nicer to actually return an error from the >> mmap() that fails, but that's more complicated, and as mentioned, it's not >> what the old code used to do either (since the failure point was always at >> the page fault stage). > > Put another way: I'm not proud of it, but the new code isn't any worse > than what we used to have, and I think the new code is _fixable_. Agreed, it is no worse than what we had before. As to fixable, I supect both situations are fixable. The new code by getting the error paths right, the old code by completely bailing out of the page fault and retrying it (the pageout code should trigger an OOM kill at some point, if we are really out of memory). > The easiest way to do that would likely be to pre-allocate the anon_vma > struct (and anon_vma_chain), and pass it down to anon_vma_prepare. That > way anon_vma_prepare() itself can never fail, and all we need to do is a > simple allocation earlier in the call-chain. That may not work, because we may want to merge the anon_vma with the anon_vma in an adjacant VMA ... and that adjacant VMA could be chained onto multiple anon_vmas. That means allocating a single anon_vma_chain may not be enough. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |