Do not call ->writepage[s] from direct reclaim and use a_ops->writepages() where possible [Kernel]

From: Rik van Riel on 15 Jun 2010 15:20

On 06/15/2010 12:54 PM, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 12:49:49PM -0400, Rik van Riel wrote:
>> This is already in a filesystem. Why does ->writepage get
>> called a second time? Shouldn't this have a gfp_mask
>> without __GFP_FS set?
>
> Why would it? GFP_NOFS is not for all filesystem code, but only for
> code where we can't re-enter the filesystem due to deadlock potential.

Why? How about because you know the stack is not big enough
to have the XFS call path on it twice? :)

Isn't the whole purpose of this patch series to prevent writepage
from being called by the VM, when invoked from a deep callstack
like xfs writepage?

That sounds a lot like simply wanting to not have GFP_FS...

--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Chris Mason on 15 Jun 2010 15:50

On Tue, Jun 15, 2010 at 03:17:16PM -0400, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 03:13:09PM -0400, Rik van Riel wrote:
> > Why? How about because you know the stack is not big enough
> > to have the XFS call path on it twice? :)
> >
> > Isn't the whole purpose of this patch series to prevent writepage
> > from being called by the VM, when invoked from a deep callstack
> > like xfs writepage?
>
> It's not invoked from xfs writepage, but from xfs_file_aio_write via
> generic_file_buffered_write. Which isn't actually all that deep a
> callstack, just an example of one that's already bad enough to overflow
> the stack.

Keep in mind that both ext4 and btrfs have similar checks in their
writepage path. I think Dave Chinner's stack analysis was very clear
here: there's no room in the stack for any filesystem and direct reclaim
to live happily together.

Circling back to an older thread:

> 32) 3184 64 xfs_vm_writepage+0xab/0x160 [xfs]
> 33) 3120 384 shrink_page_list+0x65e/0x840
> 34) 2736 528 shrink_zone+0x63f/0xe10
> 35) 2208 112 do_try_to_free_pages+0xc2/0x3c0
> 36) 2096 128 try_to_free_pages+0x77/0x80
> 37) 1968 240 __alloc_pages_nodemask+0x3e4/0x710
> 38) 1728 48 alloc_pages_current+0x8c/0xe0
> 39) 1680 16 __get_free_pages+0xe/0x50
> 40) 1664 48 __pollwait+0xca/0x110
> 41) 1616 32 unix_poll+0x28/0xc0
> 42) 1584 16 sock_poll+0x1d/0x20
> 43) 1568 912 do_select+0x3d6/0x700
> 44) 656 416 core_sys_select+0x18c/0x2c0
> 45) 240 112 sys_select+0x4f/0x110
> 46) 128 128 system_call_fastpath+0x16/0x1b

So, before xfs can hand this work off to one of its 16 btrees, push
it through the hand tuned irix simulator or even think about spreading
the work across 512 cpus (whoops, I guess that's just btrfs), we've used
up quite a lot of the stack.
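
To put a number on "quite a lot", here is a minimal sketch that sums the
frame sizes from the trace quoted above (entries 32 through 46, duplicates
dropped). The struct, the helper names and the 8 KiB stack size are
assumptions made for illustration; the thread itself does not state the
stack size.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the quoted trace: frame size in bytes and symbol name. */
struct frame {
    unsigned size;
    const char *symbol;
};

/* Entries 32..46 of the quoted trace, deepest call first. */
static const struct frame quoted_frames[] = {
    {  64, "xfs_vm_writepage" },
    { 384, "shrink_page_list" },
    { 528, "shrink_zone" },
    { 112, "do_try_to_free_pages" },
    { 128, "try_to_free_pages" },
    { 240, "__alloc_pages_nodemask" },
    {  48, "alloc_pages_current" },
    {  16, "__get_free_pages" },
    {  48, "__pollwait" },
    {  32, "unix_poll" },
    {  16, "sock_poll" },
    { 912, "do_select" },
    { 416, "core_sys_select" },
    { 112, "sys_select" },
    { 128, "system_call_fastpath" },
};

#define NFRAMES (sizeof(quoted_frames) / sizeof(quoted_frames[0]))

/* ASSUMPTION: an 8 KiB kernel stack. Not stated anywhere in the thread. */
#define ASSUMED_STACK_BYTES 8192u

/* Sum the frame sizes of n consecutive trace entries. */
static unsigned stack_used(const struct frame *frames, size_t n)
{
    unsigned total = 0;

    assert(frames != NULL);
    for (size_t i = 0; i < n; i++)
        total += frames[i].size;
    return total;
}

/* Bytes remaining under the assumed stack size. */
static unsigned stack_left(unsigned used)
{
    return ASSUMED_STACK_BYTES - used;
}
```

Calling stack_used(quoted_frames, NFRAMES) gives 3184 bytes, which matches
the depth column of entry 32. The five reclaim and allocator frames
(entries 33 through 37) account for 1392 of those bytes on their own.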

I'm not against direct reclaim, but I think we have to admit that it has
to be done directly with another stack context. Handoff to a different
thread, whatever.

When the reclaim does happen, it would be really nice if I/Os were done
in large-ish clusters. Small I/Os reclaim less memory in more time and
slow everything down.

-chris

From: Nick Piggin on 16 Jun 2010 04:00

On Tue, Jun 15, 2010 at 03:13:09PM -0400, Rik van Riel wrote:
> On 06/15/2010 12:54 PM, Christoph Hellwig wrote:
> >On Tue, Jun 15, 2010 at 12:49:49PM -0400, Rik van Riel wrote:
> >>This is already in a filesystem. Why does ->writepage get
> >>called a second time? Shouldn't this have a gfp_mask
> >>without __GFP_FS set?
> >
> >Why would it? GFP_NOFS is not for all filesystem code, but only for
> >code where we can't re-enter the filesystem due to deadlock potential.
>
> Why? How about because you know the stack is not big enough
> to have the XFS call path on it twice? :)
>
> Isn't the whole purpose of this patch series to prevent writepage
> from being called by the VM, when invoked from a deep callstack
> like xfs writepage?
>
> That sounds a lot like simply wanting to not have GFP_FS...

The buffered write path uses __GFP_FS by design because huge amounts
of (dirty) memory can be allocated in doing pagecache writes. It
would be nasty if that was not allowed to wait for filesystem
activity.

From: Andrea Arcangeli on 16 Jun 2010 13:10

On Wed, Jun 16, 2010 at 12:59:54PM -0400, Rik van Riel wrote:
> __GFP_IO can wait for filesystem activity

Hmm, I think it's for submitting I/O, not about waiting. At some point
you may not enter the FS because of the FS locks you already hold
(like within writepage itself), but you can still submit I/O through
the blkdev layer.

> __GFP_FS can kick off new filesystem activity

Yes, that's for dcache/icache/writepage or anything that can reenter
the fs locks and deadlock, IIRC.
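
The distinction can be sketched as a small user-space model. All MODEL_*
names and the helper functions are hypothetical, and the bit values are
only illustrative; the real definitions live in include/linux/gfp.h and
differ between kernel versions. The composite masks follow how GFP_NOIO,
GFP_NOFS and GFP_KERNEL were composed in kernels of that era, as far as I
recall. The last helper is a deliberately simplified reading of the policy
this patch series argues for, not the actual patch.

```c
#include <stdbool.h>

/* Illustrative bit values only; see include/linux/gfp.h for the real ones. */
#define MODEL_GFP_WAIT 0x10u  /* allocation may sleep */
#define MODEL_GFP_IO   0x40u  /* reclaim may submit block I/O */
#define MODEL_GFP_FS   0x80u  /* reclaim may call back into the filesystem */

/* Composite masks, modelled on GFP_NOIO, GFP_NOFS and GFP_KERNEL. */
#define MODEL_GFP_NOIO   (MODEL_GFP_WAIT)
#define MODEL_GFP_NOFS   (MODEL_GFP_WAIT | MODEL_GFP_IO)
#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)

/* May reclaim submit I/O through the block layer? */
static bool may_submit_io(unsigned gfp_mask)
{
    return (gfp_mask & MODEL_GFP_IO) != 0;
}

/* May reclaim re-enter filesystem code (dcache, icache, writepage)? */
static bool may_enter_fs(unsigned gfp_mask)
{
    return (gfp_mask & MODEL_GFP_FS) != 0;
}

/*
 * HYPOTHETICAL policy: even when the mask allows entering the filesystem,
 * direct reclaim skips ->writepage because the caller's stack may already
 * be deep. A background thread with a fresh stack would do the writeback.
 */
static bool reclaim_may_writepage(unsigned gfp_mask, bool in_direct_reclaim)
{
    return may_enter_fs(gfp_mask) && !in_direct_reclaim;
}
```

In this model a NOFS-style allocation can still submit I/O but cannot
re-enter the filesystem, and a KERNEL-style allocation made from the
buffered write path is allowed into the filesystem by its mask yet is
still refused writepage when it ends up in direct reclaim. That is the gap
between "no __GFP_FS" and "no writepage from direct reclaim".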

From: Rik van Riel on 16 Jun 2010 13:10

On 06/16/2010 03:57 AM, Nick Piggin wrote:
> On Tue, Jun 15, 2010 at 03:13:09PM -0400, Rik van Riel wrote:
>> On 06/15/2010 12:54 PM, Christoph Hellwig wrote:
>>> On Tue, Jun 15, 2010 at 12:49:49PM -0400, Rik van Riel wrote:
>>>> This is already in a filesystem. Why does ->writepage get
>>>> called a second time? Shouldn't this have a gfp_mask
>>>> without __GFP_FS set?
>>>
>>> Why would it? GFP_NOFS is not for all filesystem code, but only for
>>> code where we can't re-enter the filesystem due to deadlock potential.
>>
>> Why? How about because you know the stack is not big enough
>> to have the XFS call path on it twice? :)
>>
>> Isn't the whole purpose of this patch series to prevent writepage
>> from being called by the VM, when invoked from a deep callstack
>> like xfs writepage?
>>
>> That sounds a lot like simply wanting to not have GFP_FS...
>
> buffered write path uses __GFP_FS by design because huge amounts
> of (dirty) memory can be allocated in doing pagecache writes. It
> would be nasty if that was not allowed to wait for filesystem
> activity.

__GFP_IO can wait for filesystem activity

__GFP_FS can kick off new filesystem activity

At least, that's how I remember it from when I last looked
at that code in detail. Things may have changed subtly.

--
All rights reversed