new ->perform_write fop [Kernel]

Prev: [PATCH 1/2] sysfs: add struct file* to bin_attr callbacks
Next: [git pull] Please pull powerpc.git merge branch

From: Jan Kara on 20 May 2010 16:20

On Thu 20-05-10 09:50:54, Dave Chinner wrote:
> On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote:
> > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote:
> > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote:
> > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote:
> > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote:
> > > > > > Well you could do a large span block allocation at the beginning,
> > > > > > and then dirty the pagecache one by one like we do right now.
> > > > >
> > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or
> > > > > fail the copy (EFAULT) after the block allocation, we have to undo
> > > > > the allocation we have already completed. If we don't, we leave
> > > > > uninitialisaed allocations on disk that will expose stale data.
> > > > >
> > > > > In the second case (EFAULT) we might be able to zero the pages to
> > > > > avoid punching out blocks, but the first case where pages can't be
> > > > > allocated to cover the block allocated range makes it very
> > > > > difficult without being able to punch holes in allocated block
> > > > > ranges.
> > > > >
> > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary
> > > > > ranges of allocated blocks from an inode - there is not VFS method
> > > > > for it, just an ioctl (XFS_IOC_UNRESVSP).
> > > > >
> > > > > Hence the way to avoid needing hole punching is to allocate and lock
> > > > > down all the pages into the page cache fіrst, then do the copy so
> > > > > they fail before the allocation is done if they are going to fail.
> > > > > That makes it much, much easier to handle failures....
> > > >
> > > > So it is just a matter of what is exposed as a vfs interface?
> > >
> > > More a matter of utilising the functionality most filesystems
> > > already have and minimising the amount of churn in critical areas of
> > > filesytsem code. Hole punching is not simple, anѕ bugs will likely
> > > result in a corrupted filesystem. And the hole punching will only
> > > occur in a hard to trigger corner case, so it's likely that bugs
> > > will go undetected and filesystems will suffer from random,
> > > impossible to track down corruptions as a result.
> > >
> > > In comparison, adding reserve/unreserve functionality might cause
> > > block accounting issues if there is a bug, but it won't cause
> > > on-disk corruption that results in data loss. Hole punching is not
> > > simple or easy - it's a damn complex way to handle errors and if
> > > that's all it's required for then we've failed already.
> >
> > As I said, we can have a dumb fallback path for filesystems that
> > don't implement hole punching. Clear the blocks past i size, and
> > zero out the allocated but not initialized blocks.
> >
> > There does not have to be pagecache allocated in order to do this,
> > you could do direct IO from the zero page in order to do it.
>
> I don't see that as a good solution - it's once again a fairly
> complex way of dealing with the problem, especially as it now means
> that direct io would fall back to buffered which would fall back to
> direct IO....
>
> > Hole punching is not only useful there, it is already exposed to
> > userspace via MADV_REMOVE.
>
> That interface is *totally broken*. It has all the same problems as
> vmtruncate() for removing file blocks (because it uses vmtruncate).
> It also has the fundamental problem of being called un the mmap_sem,
> which means that inode locks and therefore de-allocation cannot be
> executed without the possibility of deadlocks. Fundamentally, hole
> punching is an inode operation, not a VM operation....
>
>
> > > > > > Basically, once pagecache is marked uptodate, I don't think we should
> > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate
> > > > > > that page and put a *new* page in there.
> > > > >
> > > > > Ok, so lets do that...
> > > > >
> > > > > > Why? Because user mappings are just one problem, but once you had a
> > > > > > user mapping, you can have been subject to get_user_pages, so it could
> > > > > > be in the middle of a DMA operation or something.
> > > > >
> > > > > ... because we already know this behaviour causes problems for
> > > > > high end enterprise level features like hardware checksumming IO
> > > > > paths.
> > > > >
> > > > > Hence it seems that a multipage write needs to:
> > > > >
> > > > > 1. allocate new pages
> > > > > 2. attach bufferheads/mapping structures to pages (if required)
> > > > > 3. copy data into pages
> > > > > 4. allocate space
> > > > > 5. for each old page in the range:
> > > > > lock page
> > > > > invalidate mappings
> > > > > clear page uptodate flag
> > > > > remove page from page cache
> > > > > 6. for each new page:
> > > > > map new page to allocated space
> > > > > lock new page
> > > > > insert new page into pagecache
> > > > > update new page state (write_end equivalent)
> > > > > unlock new page
> > > > > 7. free old pages
> > > > >
> > > > > Steps 1-4 can all fail, and can all be backed out from without
> > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we
> > > > > should be able to run this safely after the allocation without
> > > > > needing significant error unwinding...
> > > > >
> > > > > Thoughts?
> > > >
> > > > Possibly. The importance of hot cache is reduced, because we are
> > > > doing full-page copies, and bulk copies, by definition. But it
> > > > could still be an issue. The allocations and deallocations could
> > > > cost a little as well.
> > >
> > > They will cost far less than the reduction in allocation overhead
> > > saves us, and there are potential optimisations there
> >
> > An API that doesn't require that, though, should be less overhead
> > and simpler.
> >
> > Is it really going to be a problem to implement block hole punching
> > in ext4 and gfs2?
>
> I can't follow the ext4 code - it's an intricate maze of weird entry
> and exit points, so I'm not even going to attempt to comment on it.
Hmm, I was thinking about it and I see two options how to get out
of problems:
a) Filesystems which are not able to handle hole punching will allow
multipage writes only after EOF (which can be easily undone by
truncate in case of failure). That should actually cover lots of
cases we are interested in (I don't expect multipage writes to holes
to be a common case).
b) E.g. ext4 can do even without hole punching. It can allocate extent
as 'unwritten' and when something during the write fails, it just
leaves the extent allocated and the 'unwritten' flag makes sure that
any read will see zeros. I suppose that other filesystems that care
about multipage writes are able to do similar things (e.g. btrfs can
do the same as far as I remember, I'm not sure about gfs2).

Honza
--
Jan Kara <jack(a)suse.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Dave Chinner on 20 May 2010 19:10

On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote:
> On Thu 20-05-10 09:50:54, Dave Chinner wrote:
> > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote:
> > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote:
> > > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote:
> > > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote:
> > > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote:
> > > > > > > Well you could do a large span block allocation at the beginning,
> > > > > > > and then dirty the pagecache one by one like we do right now.
> > > > > >
> > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or
> > > > > > fail the copy (EFAULT) after the block allocation, we have to undo
> > > > > > the allocation we have already completed. If we don't, we leave
> > > > > > uninitialisaed allocations on disk that will expose stale data.
> > > > > >
> > > > > > In the second case (EFAULT) we might be able to zero the pages to
> > > > > > avoid punching out blocks, but the first case where pages can't be
> > > > > > allocated to cover the block allocated range makes it very
> > > > > > difficult without being able to punch holes in allocated block
> > > > > > ranges.
> > > > > >
> > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary
> > > > > > ranges of allocated blocks from an inode - there is not VFS method
> > > > > > for it, just an ioctl (XFS_IOC_UNRESVSP).
> > > > > >
> > > > > > Hence the way to avoid needing hole punching is to allocate and lock
> > > > > > down all the pages into the page cache fіrst, then do the copy so
> > > > > > they fail before the allocation is done if they are going to fail.
> > > > > > That makes it much, much easier to handle failures....
> > > > >
> > > > > So it is just a matter of what is exposed as a vfs interface?
> > > >
> > > > More a matter of utilising the functionality most filesystems
> > > > already have and minimising the amount of churn in critical areas of
> > > > filesytsem code. Hole punching is not simple, anѕ bugs will likely
> > > > result in a corrupted filesystem. And the hole punching will only
> > > > occur in a hard to trigger corner case, so it's likely that bugs
> > > > will go undetected and filesystems will suffer from random,
> > > > impossible to track down corruptions as a result.
> > > >
> > > > In comparison, adding reserve/unreserve functionality might cause
> > > > block accounting issues if there is a bug, but it won't cause
> > > > on-disk corruption that results in data loss. Hole punching is not
> > > > simple or easy - it's a damn complex way to handle errors and if
> > > > that's all it's required for then we've failed already.
> > >
> > > As I said, we can have a dumb fallback path for filesystems that
> > > don't implement hole punching. Clear the blocks past i size, and
> > > zero out the allocated but not initialized blocks.
> > >
> > > There does not have to be pagecache allocated in order to do this,
> > > you could do direct IO from the zero page in order to do it.
> >
> > I don't see that as a good solution - it's once again a fairly
> > complex way of dealing with the problem, especially as it now means
> > that direct io would fall back to buffered which would fall back to
> > direct IO....
> >
> > > Hole punching is not only useful there, it is already exposed to
> > > userspace via MADV_REMOVE.
> >
> > That interface is *totally broken*. It has all the same problems as
> > vmtruncate() for removing file blocks (because it uses vmtruncate).
> > It also has the fundamental problem of being called un the mmap_sem,
> > which means that inode locks and therefore de-allocation cannot be
> > executed without the possibility of deadlocks. Fundamentally, hole
> > punching is an inode operation, not a VM operation....
> >
> >
> > > > > > > Basically, once pagecache is marked uptodate, I don't think we should
> > > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate
> > > > > > > that page and put a *new* page in there.
> > > > > >
> > > > > > Ok, so lets do that...
> > > > > >
> > > > > > > Why? Because user mappings are just one problem, but once you had a
> > > > > > > user mapping, you can have been subject to get_user_pages, so it could
> > > > > > > be in the middle of a DMA operation or something.
> > > > > >
> > > > > > ... because we already know this behaviour causes problems for
> > > > > > high end enterprise level features like hardware checksumming IO
> > > > > > paths.
> > > > > >
> > > > > > Hence it seems that a multipage write needs to:
> > > > > >
> > > > > > 1. allocate new pages
> > > > > > 2. attach bufferheads/mapping structures to pages (if required)
> > > > > > 3. copy data into pages
> > > > > > 4. allocate space
> > > > > > 5. for each old page in the range:
> > > > > > lock page
> > > > > > invalidate mappings
> > > > > > clear page uptodate flag
> > > > > > remove page from page cache
> > > > > > 6. for each new page:
> > > > > > map new page to allocated space
> > > > > > lock new page
> > > > > > insert new page into pagecache
> > > > > > update new page state (write_end equivalent)
> > > > > > unlock new page
> > > > > > 7. free old pages
> > > > > >
> > > > > > Steps 1-4 can all fail, and can all be backed out from without
> > > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we
> > > > > > should be able to run this safely after the allocation without
> > > > > > needing significant error unwinding...
> > > > > >
> > > > > > Thoughts?
> > > > >
> > > > > Possibly. The importance of hot cache is reduced, because we are
> > > > > doing full-page copies, and bulk copies, by definition. But it
> > > > > could still be an issue. The allocations and deallocations could
> > > > > cost a little as well.
> > > >
> > > > They will cost far less than the reduction in allocation overhead
> > > > saves us, and there are potential optimisations there
> > >
> > > An API that doesn't require that, though, should be less overhead
> > > and simpler.
> > >
> > > Is it really going to be a problem to implement block hole punching
> > > in ext4 and gfs2?
> >
> > I can't follow the ext4 code - it's an intricate maze of weird entry
> > and exit points, so I'm not even going to attempt to comment on it.
> Hmm, I was thinking about it and I see two options how to get out
> of problems:
> a) Filesystems which are not able to handle hole punching will allow
> multipage writes only after EOF (which can be easily undone by
> truncate in case of failure). That should actually cover lots of
> cases we are interested in (I don't expect multipage writes to holes
> to be a common case).

multipage writes to holes is a relatively common operation in the
HPC space that XFS is designed for (e.g. calculations on huge sparse
matrices), so I'm not really fond of this idea....

> b) E.g. ext4 can do even without hole punching. It can allocate extent
> as 'unwritten' and when something during the write fails, it just
> leaves the extent allocated and the 'unwritten' flag makes sure that
> any read will see zeros. I suppose that other filesystems that care
> about multipage writes are able to do similar things (e.g. btrfs can
> do the same as far as I remember, I'm not sure about gfs2).

Allocating multipage writes as unwritten extents turns off delayed
allocation and hence we'd lose all the benefits that this gives...

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Steven Whitehouse on 21 May 2010 05:40

Hi,

On Fri, 2010-05-21 at 09:05 +1000, Dave Chinner wrote:
> On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote:
> > On Thu 20-05-10 09:50:54, Dave Chinner wrote:
> > > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote:
> > > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote:
> > > >
> > > > Is it really going to be a problem to implement block hole punching
> > > > in ext4 and gfs2?
> > >
[snip]
> > b) E.g. ext4 can do even without hole punching. It can allocate extent
> > as 'unwritten' and when something during the write fails, it just
> > leaves the extent allocated and the 'unwritten' flag makes sure that
> > any read will see zeros. I suppose that other filesystems that care
> > about multipage writes are able to do similar things (e.g. btrfs can
> > do the same as far as I remember, I'm not sure about gfs2).
>
> Allocating multipage writes as unwritten extents turns off delayed
> allocation and hence we'd lose all the benefits that this gives...

It should be possible to implement hole punching in GFS2 I think. The
main issue is locking order of resource groups. We have on our todo list
a rewrite of the truncate/delete code which is currently used to
deallocate data blocks and metadata tree blocks. The current algorithm
is a rather inefficient recursive scanning of the tree which is done
multiple times depending on the tree height.

Adapting that to punch holes should be possible without too much effort
if we need to do that. We do need to allow for the possibility that such
a deallocation might have to be split into multiple transactions
depending on the amount of metadata involved (for large files, this
could be larger than the size of the log for example). Currently the
code will split up truncates into multiple transactions which allows the
deallocation to be restartable from any transaction boundary.

GFS2 does not have any way to mark unwritten extents, so we cannot do
delayed allocation or implement an efficient fallocate. We can do better
performance-wise than just dd'ing zeros to a file for fallocate, but
we'll never be able to match a fs that can mark extents unwritten in
performance terms,

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Josef Bacik on 21 May 2010 09:50

On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote:
> On Thu, May 20, 2010 at 10:12:32PM +0200, Jan Kara wrote:
> > On Thu 20-05-10 09:50:54, Dave Chinner wrote:
> > > On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote:
> > > > On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote:
> > > > > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote:
> > > > > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote:
> > > > > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote:
> > > > > > > > Well you could do a large span block allocation at the beginning,
> > > > > > > > and then dirty the pagecache one by one like we do right now.
> > > > > > >
> > > > > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or
> > > > > > > fail the copy (EFAULT) after the block allocation, we have to undo
> > > > > > > the allocation we have already completed. If we don't, we leave
> > > > > > > uninitialisaed allocations on disk that will expose stale data.
> > > > > > >
> > > > > > > In the second case (EFAULT) we might be able to zero the pages to
> > > > > > > avoid punching out blocks, but the first case where pages can't be
> > > > > > > allocated to cover the block allocated range makes it very
> > > > > > > difficult without being able to punch holes in allocated block
> > > > > > > ranges.
> > > > > > >
> > > > > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary
> > > > > > > ranges of allocated blocks from an inode - there is not VFS method
> > > > > > > for it, just an ioctl (XFS_IOC_UNRESVSP).
> > > > > > >
> > > > > > > Hence the way to avoid needing hole punching is to allocate and lock
> > > > > > > down all the pages into the page cache fіrst, then do the copy so
> > > > > > > they fail before the allocation is done if they are going to fail.
> > > > > > > That makes it much, much easier to handle failures....
> > > > > >
> > > > > > So it is just a matter of what is exposed as a vfs interface?
> > > > >
> > > > > More a matter of utilising the functionality most filesystems
> > > > > already have and minimising the amount of churn in critical areas of
> > > > > filesytsem code. Hole punching is not simple, anѕ bugs will likely
> > > > > result in a corrupted filesystem. And the hole punching will only
> > > > > occur in a hard to trigger corner case, so it's likely that bugs
> > > > > will go undetected and filesystems will suffer from random,
> > > > > impossible to track down corruptions as a result.
> > > > >
> > > > > In comparison, adding reserve/unreserve functionality might cause
> > > > > block accounting issues if there is a bug, but it won't cause
> > > > > on-disk corruption that results in data loss. Hole punching is not
> > > > > simple or easy - it's a damn complex way to handle errors and if
> > > > > that's all it's required for then we've failed already.
> > > >
> > > > As I said, we can have a dumb fallback path for filesystems that
> > > > don't implement hole punching. Clear the blocks past i size, and
> > > > zero out the allocated but not initialized blocks.
> > > >
> > > > There does not have to be pagecache allocated in order to do this,
> > > > you could do direct IO from the zero page in order to do it.
> > >
> > > I don't see that as a good solution - it's once again a fairly
> > > complex way of dealing with the problem, especially as it now means
> > > that direct io would fall back to buffered which would fall back to
> > > direct IO....
> > >
> > > > Hole punching is not only useful there, it is already exposed to
> > > > userspace via MADV_REMOVE.
> > >
> > > That interface is *totally broken*. It has all the same problems as
> > > vmtruncate() for removing file blocks (because it uses vmtruncate).
> > > It also has the fundamental problem of being called un the mmap_sem,
> > > which means that inode locks and therefore de-allocation cannot be
> > > executed without the possibility of deadlocks. Fundamentally, hole
> > > punching is an inode operation, not a VM operation....
> > >
> > >
> > > > > > > > Basically, once pagecache is marked uptodate, I don't think we should
> > > > > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate
> > > > > > > > that page and put a *new* page in there.
> > > > > > >
> > > > > > > Ok, so lets do that...
> > > > > > >
> > > > > > > > Why? Because user mappings are just one problem, but once you had a
> > > > > > > > user mapping, you can have been subject to get_user_pages, so it could
> > > > > > > > be in the middle of a DMA operation or something.
> > > > > > >
> > > > > > > ... because we already know this behaviour causes problems for
> > > > > > > high end enterprise level features like hardware checksumming IO
> > > > > > > paths.
> > > > > > >
> > > > > > > Hence it seems that a multipage write needs to:
> > > > > > >
> > > > > > > 1. allocate new pages
> > > > > > > 2. attach bufferheads/mapping structures to pages (if required)
> > > > > > > 3. copy data into pages
> > > > > > > 4. allocate space
> > > > > > > 5. for each old page in the range:
> > > > > > > lock page
> > > > > > > invalidate mappings
> > > > > > > clear page uptodate flag
> > > > > > > remove page from page cache
> > > > > > > 6. for each new page:
> > > > > > > map new page to allocated space
> > > > > > > lock new page
> > > > > > > insert new page into pagecache
> > > > > > > update new page state (write_end equivalent)
> > > > > > > unlock new page
> > > > > > > 7. free old pages
> > > > > > >
> > > > > > > Steps 1-4 can all fail, and can all be backed out from without
> > > > > > > changing the current state. Steps 5-7 can't fail AFAICT, so we
> > > > > > > should be able to run this safely after the allocation without
> > > > > > > needing significant error unwinding...
> > > > > > >
> > > > > > > Thoughts?
> > > > > >
> > > > > > Possibly. The importance of hot cache is reduced, because we are
> > > > > > doing full-page copies, and bulk copies, by definition. But it
> > > > > > could still be an issue. The allocations and deallocations could
> > > > > > cost a little as well.
> > > > >
> > > > > They will cost far less than the reduction in allocation overhead
> > > > > saves us, and there are potential optimisations there
> > > >
> > > > An API that doesn't require that, though, should be less overhead
> > > > and simpler.
> > > >
> > > > Is it really going to be a problem to implement block hole punching
> > > > in ext4 and gfs2?
> > >
> > > I can't follow the ext4 code - it's an intricate maze of weird entry
> > > and exit points, so I'm not even going to attempt to comment on it.
> > Hmm, I was thinking about it and I see two options how to get out
> > of problems:
> > a) Filesystems which are not able to handle hole punching will allow
> > multipage writes only after EOF (which can be easily undone by
> > truncate in case of failure). That should actually cover lots of
> > cases we are interested in (I don't expect multipage writes to holes
> > to be a common case).
>
> multipage writes to holes is a relatively common operation in the
> HPC space that XFS is designed for (e.g. calculations on huge sparse
> matrices), so I'm not really fond of this idea....
>
> > b) E.g. ext4 can do even without hole punching. It can allocate extent
> > as 'unwritten' and when something during the write fails, it just
> > leaves the extent allocated and the 'unwritten' flag makes sure that
> > any read will see zeros. I suppose that other filesystems that care
> > about multipage writes are able to do similar things (e.g. btrfs can
> > do the same as far as I remember, I'm not sure about gfs2).
>
> Allocating multipage writes as unwritten extents turns off delayed
> allocation and hence we'd lose all the benefits that this gives...
>

I just realized we have another problem, the mmap_sem/page_lock deadlock.
Currently BTRFS is susceptible to this, since we don't prefault any of the pages
in yet. If we're going to do multi-page writes we're going to need to have a
way to fault in all of the iovec's at once, so when we do the
pagefault_disable() copy pagefault_enable() we don't just end up copying the
first iovec. Nick have you done something like this already? If not I assume
I can just loop through all the iovec's and call fault_in_pages_readable on all
of them and be good to go right? Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nick Piggin on 21 May 2010 10:30

On Fri, May 21, 2010 at 09:50:54AM -0400, Josef Bacik wrote:
> On Fri, May 21, 2010 at 09:05:24AM +1000, Dave Chinner wrote:
> > Allocating multipage writes as unwritten extents turns off delayed
> > allocation and hence we'd lose all the benefits that this gives...
> >
>
> I just realized we have another problem, the mmap_sem/page_lock deadlock.
> Currently BTRFS is susceptible to this, since we don't prefault any of the pages
> in yet. If we're going to do multi-page writes we're going to need to have a
> way to fault in all of the iovec's at once, so when we do the
> pagefault_disable() copy pagefault_enable() we don't just end up copying the
> first iovec. Nick have you done something like this already? If not I assume
> I can just loop through all the iovec's and call fault_in_pages_readable on all
> of them and be good to go right? Thanks,

Yes, well it's a different issue. With multi-page writes, even a single
iovec may not be faulted in properly. And with multiple iovecs, we are
already suboptimal with faulting.

faulting in multiple iovecs may already be a good idea. I didn't add
that code, I had hoped for a test case first, but perhaps we can just
go and add it.

With multipage writes, we would want to fault in multiple source pages
at once if the design was to lock multiple pages at once and do the
copy. I still think we might be able to just lock and copy one page at
a time, but I could be wrong.

Oh wow, btrfs is deadlocky there. Firstly, fault_in_pages_readable does
not guarantee success (race can always unmap the page in the meantime).
Secondly, calling it inside the page lock section just means it will
cause the deadlock rather than the copy_from_user.

Quick workaround to reduce probability is to do fault_in_pages_readable
calls before locking the pages.

But you really need to handle the short-copy case. From the error
handling there, it seems like you can just free_reserved_data_space and
retry the copy?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8
Prev: [PATCH 1/2] sysfs: add struct file* to bin_attr callbacks
Next: [git pull] Please pull powerpc.git merge branch