From: Nick Piggin on 18 May 2010 06:50
On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote:
> On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote:
> > Well you could do a large span block allocation at the beginning,
> > and then dirty the pagecache one by one like we do right now.
>
> The problem is that if we fail to allocate a page (e.g. ENOMEM) or
> fail the copy (EFAULT) after the block allocation, we have to undo
> the allocation we have already completed. If we don't, we leave
> uninitialised allocations on disk that will expose stale data.
>
> In the second case (EFAULT) we might be able to zero the pages to
> avoid punching out blocks, but the first case where pages can't be
> allocated to cover the block allocated range makes it very
> difficult without being able to punch holes in allocated block
> ranges.
>
> AFAIK, only XFS and OCFS2 currently support punching out arbitrary
> ranges of allocated blocks from an inode - there is no VFS method
> for it, just an ioctl (XFS_IOC_UNRESVSP).
>
> Hence the way to avoid needing hole punching is to allocate and lock
> down all the pages into the page cache first, then do the copy so
> they fail before the allocation is done if they are going to fail.
> That makes it much, much easier to handle failures....

So it is just a matter of what is exposed as a vfs interface?

I would much prefer to make it a requirement to support hole
punching in the block allocator if the filesystem wishes to do
such writes.

If you have an ->allocate(inode, RESERVE/ALLOCATE, interval) API
then it makes sense to have a DEALLOCATE operation there too.

That seems like it should be cleaner than to work around it
if we're talking about adding new APIs anyway.

> Remember, we still don't do the "truncate blocks past eof on failure"
> correctly in block_write_begin (still calls vmtruncate, not
> ->setattr), so we can't claim that the current way of doing things
> is a model of perfection that we can't improve on....

Right, but the API (write_begin/write_end) is sufficient now to
make it work correctly. What should really happen is the
filesystem detects and handles these error cases.

In the truncate patchset (that deprecates ->truncate and vmtruncate
entirely), that is exactly what happens.

> > The only reason to do operations on multiple pages at once is if
> > we need to lock them all.
>
> Well, to avoid the refaulting of pages we just unmapped we'd need to
> do that...

Well, the lock/unmap/copy/unlock could be done on a per-page
basis.

> > Now the fs might well have that requirement
> > (if it is not using i_mutex for block (de)allocation
> > serialisation), but I don't think generic code needs to be doing
> > that.
>
> XFS already falls into the category of a filesystem using the
> generic code that does not use i_mutex for allocation serialisation.
> I'm sure it isn't the only filesystem that this is true for, so it
> seems sensible for the generic code to handle this case.

Well, does it need page lock? All pages locked concurrently in
a range under which block allocation is happening? I would much
prefer an allocation API that supports allocation/freeing
without requiring any pagecache at all.

> > Basically, once pagecache is marked uptodate, I don't think we should
> > ever put maybe-invalid data into it -- the way to do it is to invalidate
> > that page and put a *new* page in there.
>
> Ok, so let's do that...
>
> > Why? Because user mappings are just one problem, but once you had a
> > user mapping, you can have been subject to get_user_pages, so it could
> > be in the middle of a DMA operation or something.
>
> ... because we already know this behaviour causes problems for
> high end enterprise level features like hardware checksumming IO
> paths.
>
> Hence it seems that a multipage write needs to:
>
>     1. allocate new pages
>     2. attach bufferheads/mapping structures to pages (if required)
>     3. copy data into pages
>     4. allocate space
>     5. for each old page in the range:
>         lock page
>         invalidate mappings
>         clear page uptodate flag
>         remove page from page cache
>     6. for each new page:
>         map new page to allocated space
>         lock new page
>         insert new page into pagecache
>         update new page state (write_end equivalent)
>         unlock new page
>     7. free old pages
>
> Steps 1-4 can all fail, and can all be backed out from without
> changing the current state. Steps 5-7 can't fail AFAICT, so we
> should be able to run this safely after the allocation without
> needing significant error unwinding...
>
> Thoughts?

Possibly. The importance of hot cache is reduced, because we are
doing full-page copies, and bulk copies, by definition. But it
could still be an issue. The allocations and deallocations could
cost a little as well.

Compared to having a nice API to just do bulk allocate/free block
operations and then just doing simple per-page copies, I think it
doesn't look so nice.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
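
[Illustration] The seven-step proposal quoted in the message above splits a
multipage write into a part that may fail (steps 1-4) and a part that cannot
(steps 5-7). The following is only a userspace toy model of that ordering,
not kernel code: struct toy_file, toy_multipage_write() and the
fail_copy_at parameter are hypothetical names invented here, locking and
mapping invalidation are left out, and every page written is assumed to
need one fresh block.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ     4096
#define CACHE_PAGES 8

/* Toy stand-ins for a file's page cache and the block allocator. */
struct toy_file {
    unsigned char *cache[CACHE_PAGES]; /* NULL means "page not cached" */
    long free_blocks;                  /* blocks still free on the "disk" */
};

/*
 * Write npages full pages from src, starting at page index first.
 * fail_copy_at >= 0 simulates an EFAULT while copying that page.
 * On any error the cache and the block count are left untouched.
 */
int toy_multipage_write(struct toy_file *f, int first, int npages,
                        const unsigned char *src, int fail_copy_at)
{
    unsigned char *newpg[CACHE_PAGES] = { 0 };
    int i, err;

    if (first < 0 || npages <= 0 || first + npages > CACHE_PAGES)
        return -EINVAL;

    /* Steps 1-3: allocate new pages and copy the data into them. */
    for (i = 0; i < npages; i++) {
        newpg[i] = malloc(PAGE_SZ);
        if (!newpg[i]) {
            err = -ENOMEM;
            goto undo;
        }
        if (i == fail_copy_at) {
            err = -EFAULT;
            goto undo;
        }
        memcpy(newpg[i], src + (size_t)i * PAGE_SZ, PAGE_SZ);
    }

    /* Step 4: allocate space. This is the last step that may fail. */
    if (f->free_blocks < npages) {
        err = -ENOSPC;
        goto undo;
    }
    f->free_blocks -= npages;

    /* Steps 5-7: swap new pages in and free the old ones; no failure path. */
    for (i = 0; i < npages; i++) {
        free(f->cache[first + i]);
        f->cache[first + i] = newpg[i];
    }
    return 0;

undo:
    /* Nothing visible has changed yet, so undo is just freeing new pages. */
    for (i = 0; i < npages; i++)
        free(newpg[i]);
    return err;
}
```

In this model an ENOMEM, EFAULT or ENOSPC never needs hole punching, because
no block is handed out until the data is already sitting in the new pages.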
From: Dave Chinner on 18 May 2010 08:30
On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote:
> On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote:
> > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote:
> > > Well you could do a large span block allocation at the beginning,
> > > and then dirty the pagecache one by one like we do right now.
> >
> > The problem is that if we fail to allocate a page (e.g. ENOMEM) or
> > fail the copy (EFAULT) after the block allocation, we have to undo
> > the allocation we have already completed. If we don't, we leave
> > uninitialised allocations on disk that will expose stale data.
> >
> > In the second case (EFAULT) we might be able to zero the pages to
> > avoid punching out blocks, but the first case where pages can't be
> > allocated to cover the block allocated range makes it very
> > difficult without being able to punch holes in allocated block
> > ranges.
> >
> > AFAIK, only XFS and OCFS2 currently support punching out arbitrary
> > ranges of allocated blocks from an inode - there is no VFS method
> > for it, just an ioctl (XFS_IOC_UNRESVSP).
> >
> > Hence the way to avoid needing hole punching is to allocate and lock
> > down all the pages into the page cache first, then do the copy so
> > they fail before the allocation is done if they are going to fail.
> > That makes it much, much easier to handle failures....
>
> So it is just a matter of what is exposed as a vfs interface?

More a matter of utilising the functionality most filesystems
already have and minimising the amount of churn in critical areas of
filesystem code. Hole punching is not simple, and bugs will likely
result in a corrupted filesystem. And the hole punching will only
occur in a hard to trigger corner case, so it's likely that bugs
will go undetected and filesystems will suffer from random,
impossible to track down corruptions as a result.

In comparison, adding reserve/unreserve functionality might cause
block accounting issues if there is a bug, but it won't cause
on-disk corruption that results in data loss. Hole punching is not
simple or easy - it's a damn complex way to handle errors and if
that's all it's required for then we've failed already.

> I would much prefer to make it a requirement to support hole
> punching in the block allocator if the filesystem wishes to do
> such writes.

I think that's an unrealistic requirement simply because it can be
avoided. With a reserve/alloc/unreserve interface, btrfs will work
almost unmodified, XFS will require some new wrappers and bufferhead
mapping code, and ext4 and gfs2 look to be pretty much in the same
boat. All relatively simple on the filesystem side.

If we have to add hole punching, XFS will require an extra wrapper
but btrfs, gfs2 and ext4 will have to implement hole punching from
the ground up. Personally, I think that requiring hole punching is
asking far too much for multipage writes, esp. given that btrfs
already implements them without needing such functionality...

> > > The only reason to do operations on multiple pages at once is if
> > > we need to lock them all.
> >
> > Well, to avoid the refaulting of pages we just unmapped we'd need to
> > do that...
>
> Well, the lock/unmap/copy/unlock could be done on a per-page
> basis.

The moment we unmap the old page we cannot unlock it until the new
page is in the page cache. If we do unlock it, we risk having it
faulted again before we insert the new copy. Yes, that can be done
page by page, but should only be done after all the pages are
allocated and copied into.

FWIW, I don't think we can unmap the old page until after the entire
copy is done because the old page(s) might be where we are copying
from....

> > > Now the fs might well have that requirement
> > > (if it is not using i_mutex for block (de)allocation
> > > serialisation), but I don't think generic code needs to be doing
> > > that.
> >
> > XFS already falls into the category of a filesystem using the
> > generic code that does not use i_mutex for allocation serialisation.
> > I'm sure it isn't the only filesystem that this is true for, so it
> > seems sensible for the generic code to handle this case.
>
> Well, does it need page lock? All pages locked concurrently in
> a range under which block allocation is happening?

No, allocation doesn't require page locks either - XFS has its own
inode locks for serialisation of allocation, truncation and hole
punching.

> I would much
> prefer an allocation API that supports allocation/freeing
> without requiring any pagecache at all.

Allocation doesn't require any pagecache at all. It's the fact that
the allocation needs to be synchronised with the page cache state
change that requires page locks to be taken as part of the write
process.

> > > Basically, once pagecache is marked uptodate, I don't think we should
> > > ever put maybe-invalid data into it -- the way to do it is to invalidate
> > > that page and put a *new* page in there.
> >
> > Ok, so let's do that...
> >
> > > Why? Because user mappings are just one problem, but once you had a
> > > user mapping, you can have been subject to get_user_pages, so it could
> > > be in the middle of a DMA operation or something.
> >
> > ... because we already know this behaviour causes problems for
> > high end enterprise level features like hardware checksumming IO
> > paths.
> >
> > Hence it seems that a multipage write needs to:
> >
> >     1. allocate new pages
> >     2. attach bufferheads/mapping structures to pages (if required)
> >     3. copy data into pages
> >     4. allocate space
> >     5. for each old page in the range:
> >         lock page
> >         invalidate mappings
> >         clear page uptodate flag
> >         remove page from page cache
> >     6. for each new page:
> >         map new page to allocated space
> >         lock new page
> >         insert new page into pagecache
> >         update new page state (write_end equivalent)
> >         unlock new page
> >     7. free old pages
> >
> > Steps 1-4 can all fail, and can all be backed out from without
> > changing the current state. Steps 5-7 can't fail AFAICT, so we
> > should be able to run this safely after the allocation without
> > needing significant error unwinding...
> >
> > Thoughts?
>
> Possibly. The importance of hot cache is reduced, because we are
> doing full-page copies, and bulk copies, by definition. But it
> could still be an issue. The allocations and deallocations could
> cost a little as well.

They will cost far less than the reduction in allocation overhead
saves us, and there are potential optimisations there for reuse of
old pages....

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
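
[Illustration] The reserve/alloc/unreserve interface argued for in the
message above can be pictured as pure counter bookkeeping. This is a
minimal sketch under that assumption, not an existing kernel or filesystem
interface: struct toy_space and the toy_* functions are hypothetical names.
The point it tries to show is that backing out of a reservation only moves
numbers between counters, so a bug there would cause accounting errors
rather than on-disk corruption.

```c
#include <errno.h>

/* Hypothetical in-memory block accounting; not a kernel structure. */
struct toy_space {
    long free;      /* blocks neither reserved nor allocated */
    long reserved;  /* promised to a writer, nothing on disk yet */
    long allocated; /* mapped into a file and visible on disk */
};

/* Reserve n blocks before touching any pages. May fail with -ENOSPC. */
int toy_reserve(struct toy_space *s, long n)
{
    if (n < 0 || s->free < n)
        return -ENOSPC;
    s->free -= n;
    s->reserved += n;
    return 0;
}

/* Error path: hand back up to n reserved blocks. Only counters change. */
void toy_unreserve(struct toy_space *s, long n)
{
    if (n < 0)
        return;
    if (n > s->reserved)
        n = s->reserved;
    s->reserved -= n;
    s->free += n;
}

/* Success path: convert n reserved blocks into real allocations. */
int toy_allocate(struct toy_space *s, long n)
{
    if (n < 0 || s->reserved < n)
        return -EINVAL;
    s->reserved -= n;
    s->allocated += n;
    return 0;
}
```

A writer would call toy_reserve() first, copy the data, and then call
either toy_allocate() on success or toy_unreserve() on ENOMEM/EFAULT.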
From: Nick Piggin on 18 May 2010 11:10
On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote:
> On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote:
> > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote:
> > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote:
> > > > Well you could do a large span block allocation at the beginning,
> > > > and then dirty the pagecache one by one like we do right now.
> > >
> > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or
> > > fail the copy (EFAULT) after the block allocation, we have to undo
> > > the allocation we have already completed. If we don't, we leave
> > > uninitialised allocations on disk that will expose stale data.
> > >
> > > In the second case (EFAULT) we might be able to zero the pages to
> > > avoid punching out blocks, but the first case where pages can't be
> > > allocated to cover the block allocated range makes it very
> > > difficult without being able to punch holes in allocated block
> > > ranges.
> > >
> > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary
> > > ranges of allocated blocks from an inode - there is no VFS method
> > > for it, just an ioctl (XFS_IOC_UNRESVSP).
> > >
> > > Hence the way to avoid needing hole punching is to allocate and lock
> > > down all the pages into the page cache first, then do the copy so
> > > they fail before the allocation is done if they are going to fail.
> > > That makes it much, much easier to handle failures....
> >
> > So it is just a matter of what is exposed as a vfs interface?
>
> More a matter of utilising the functionality most filesystems
> already have and minimising the amount of churn in critical areas of
> filesystem code. Hole punching is not simple, and bugs will likely
> result in a corrupted filesystem. And the hole punching will only
> occur in a hard to trigger corner case, so it's likely that bugs
> will go undetected and filesystems will suffer from random,
> impossible to track down corruptions as a result.
>
> In comparison, adding reserve/unreserve functionality might cause
> block accounting issues if there is a bug, but it won't cause
> on-disk corruption that results in data loss. Hole punching is not
> simple or easy - it's a damn complex way to handle errors and if
> that's all it's required for then we've failed already.

As I said, we can have a dumb fallback path for filesystems that
don't implement hole punching. Clear the blocks past i_size, and
zero out the allocated but not initialized blocks.

There does not have to be pagecache allocated in order to do this,
you could do direct IO from the zero page in order to do it.

Hole punching is not only useful there, it is already exposed to
userspace via MADV_REMOVE.

> > I would much prefer to make it a requirement to support hole
> > punching in the block allocator if the filesystem wishes to do
> > such writes.
>
> I think that's an unrealistic requirement simply because it can be
> avoided. With a reserve/alloc/unreserve interface, btrfs will work
> almost unmodified, XFS will require some new wrappers and bufferhead
> mapping code, and ext4 and gfs2 look to be pretty much in the same
> boat. All relatively simple on the filesystem side.
>
> If we have to add hole punching, XFS will require an extra wrapper
> but btrfs, gfs2 and ext4 will have to implement hole punching from
> the ground up. Personally, I think that requiring hole punching is
> asking far too much for multipage writes, esp. given that btrfs
> already implements them without needing such functionality...

I think that is the better long term choice rather than having
inferior APIs then being stuck with them for years.

> > > > The only reason to do operations on multiple pages at once is if
> > > > we need to lock them all.
> > >
> > > Well, to avoid the refaulting of pages we just unmapped we'd need to
> > > do that...
> >
> > Well, the lock/unmap/copy/unlock could be done on a per-page
> > basis.
>
> The moment we unmap the old page we cannot unlock it until the new
> page is in the page cache. If we do unlock it, we risk having it
> faulted again before we insert the new copy. Yes, that can be done
> page by page, but should only be done after all the pages are
> allocated and copied into.
>
> FWIW, I don't think we can unmap the old page until after the entire
> copy is done because the old page(s) might be where we are copying
> from....

Yeah true.

> > > > Now the fs might well have that requirement
> > > > (if it is not using i_mutex for block (de)allocation
> > > > serialisation), but I don't think generic code needs to be doing
> > > > that.
> > >
> > > XFS already falls into the category of a filesystem using the
> > > generic code that does not use i_mutex for allocation serialisation.
> > > I'm sure it isn't the only filesystem that this is true for, so it
> > > seems sensible for the generic code to handle this case.
> >
> > Well, does it need page lock? All pages locked concurrently in
> > a range under which block allocation is happening?
>
> No, allocation doesn't require page locks either - XFS has its own
> inode locks for serialisation of allocation, truncation and hole
> punching.

Right, I was just mentioning multiple page locks could just be
useful for that. I was not advocating that we only support i_mutex
filesystems from the generic code.

> > I would much
> > prefer an allocation API that supports allocation/freeing
> > without requiring any pagecache at all.
>
> Allocation doesn't require any pagecache at all. It's the fact that
> the allocation needs to be synchronised with the page cache state
> change that requires page locks to be taken as part of the write
> process.

When setting up the buffer state with the filesystem state. Sure.

> > > > Basically, once pagecache is marked uptodate, I don't think we should
> > > > ever put maybe-invalid data into it -- the way to do it is to invalidate
> > > > that page and put a *new* page in there.
> > >
> > > Ok, so let's do that...
> > >
> > > > Why? Because user mappings are just one problem, but once you had a
> > > > user mapping, you can have been subject to get_user_pages, so it could
> > > > be in the middle of a DMA operation or something.
> > >
> > > ... because we already know this behaviour causes problems for
> > > high end enterprise level features like hardware checksumming IO
> > > paths.
> > >
> > > Hence it seems that a multipage write needs to:
> > >
> > >     1. allocate new pages
> > >     2. attach bufferheads/mapping structures to pages (if required)
> > >     3. copy data into pages
> > >     4. allocate space
> > >     5. for each old page in the range:
> > >         lock page
> > >         invalidate mappings
> > >         clear page uptodate flag
> > >         remove page from page cache
> > >     6. for each new page:
> > >         map new page to allocated space
> > >         lock new page
> > >         insert new page into pagecache
> > >         update new page state (write_end equivalent)
> > >         unlock new page
> > >     7. free old pages
> > >
> > > Steps 1-4 can all fail, and can all be backed out from without
> > > changing the current state. Steps 5-7 can't fail AFAICT, so we
> > > should be able to run this safely after the allocation without
> > > needing significant error unwinding...
> > >
> > > Thoughts?
> >
> > Possibly. The importance of hot cache is reduced, because we are
> > doing full-page copies, and bulk copies, by definition. But it
> > could still be an issue. The allocations and deallocations could
> > cost a little as well.
>
> They will cost far less than the reduction in allocation overhead
> saves us, and there are potential optimisations there for reuse of
> old pages....

An API that doesn't require that, though, should be less overhead
and simpler.

Is it really going to be a problem to implement block hole punching
in ext4 and gfs2?
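
[Illustration] The "dumb fallback" described in the message above, zeroing
blocks that were allocated but never initialised, has a rough userspace
analogue: overwrite the byte range with zeros from a static zero buffer.
The sketch below assumes plain POSIX pwrite() on an ordinary file
descriptor; toy_zero_range() is a hypothetical helper, and the static
buffer merely stands in for the kernel's zero page. It says nothing about
how a filesystem would build the actual I/O.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite the byte range [off, off + len) of fd with zeros.
 * Returns 0 on success or a negative errno value on failure.
 */
int toy_zero_range(int fd, off_t off, off_t len)
{
    /* Static storage is zero-initialised; stand-in for the zero page. */
    static const unsigned char zeros[4096];

    while (len > 0) {
        size_t chunk = len < (off_t)sizeof(zeros) ? (size_t)len
                                                  : sizeof(zeros);
        ssize_t n = pwrite(fd, zeros, chunk, off);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            return -errno;
        }
        if (n == 0)
            return -EIO;        /* no progress; give up rather than spin */
        off += n;
        len -= n;
    }
    return 0;
}
```

Unlike hole punching, this keeps the blocks allocated; it only guarantees
that stale data is not exposed through them.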
From: Dave Chinner on 19 May 2010 20:00
On Wed, May 19, 2010 at 01:09:12AM +1000, Nick Piggin wrote:
> On Tue, May 18, 2010 at 10:27:14PM +1000, Dave Chinner wrote:
> > On Tue, May 18, 2010 at 08:43:51PM +1000, Nick Piggin wrote:
> > > On Tue, May 18, 2010 at 06:05:03PM +1000, Dave Chinner wrote:
> > > > On Tue, May 18, 2010 at 04:36:47PM +1000, Nick Piggin wrote:
> > > > > Well you could do a large span block allocation at the beginning,
> > > > > and then dirty the pagecache one by one like we do right now.
> > > >
> > > > The problem is that if we fail to allocate a page (e.g. ENOMEM) or
> > > > fail the copy (EFAULT) after the block allocation, we have to undo
> > > > the allocation we have already completed. If we don't, we leave
> > > > uninitialised allocations on disk that will expose stale data.
> > > >
> > > > In the second case (EFAULT) we might be able to zero the pages to
> > > > avoid punching out blocks, but the first case where pages can't be
> > > > allocated to cover the block allocated range makes it very
> > > > difficult without being able to punch holes in allocated block
> > > > ranges.
> > > >
> > > > AFAIK, only XFS and OCFS2 currently support punching out arbitrary
> > > > ranges of allocated blocks from an inode - there is no VFS method
> > > > for it, just an ioctl (XFS_IOC_UNRESVSP).
> > > >
> > > > Hence the way to avoid needing hole punching is to allocate and lock
> > > > down all the pages into the page cache first, then do the copy so
> > > > they fail before the allocation is done if they are going to fail.
> > > > That makes it much, much easier to handle failures....
> > >
> > > So it is just a matter of what is exposed as a vfs interface?
> >
> > More a matter of utilising the functionality most filesystems
> > already have and minimising the amount of churn in critical areas of
> > filesystem code. Hole punching is not simple, and bugs will likely
> > result in a corrupted filesystem. And the hole punching will only
> > occur in a hard to trigger corner case, so it's likely that bugs
> > will go undetected and filesystems will suffer from random,
> > impossible to track down corruptions as a result.
> >
> > In comparison, adding reserve/unreserve functionality might cause
> > block accounting issues if there is a bug, but it won't cause
> > on-disk corruption that results in data loss. Hole punching is not
> > simple or easy - it's a damn complex way to handle errors and if
> > that's all it's required for then we've failed already.
>
> As I said, we can have a dumb fallback path for filesystems that
> don't implement hole punching. Clear the blocks past i_size, and
> zero out the allocated but not initialized blocks.
>
> There does not have to be pagecache allocated in order to do this,
> you could do direct IO from the zero page in order to do it.

I don't see that as a good solution - it's once again a fairly
complex way of dealing with the problem, especially as it now means
that direct io would fall back to buffered which would fall back to
direct IO....

> Hole punching is not only useful there, it is already exposed to
> userspace via MADV_REMOVE.

That interface is *totally broken*. It has all the same problems as
vmtruncate() for removing file blocks (because it uses vmtruncate).
It also has the fundamental problem of being called under the mmap_sem,
which means that inode locks and therefore de-allocation cannot be
executed without the possibility of deadlocks. Fundamentally, hole
punching is an inode operation, not a VM operation....

> > > > > Basically, once pagecache is marked uptodate, I don't think we should
> > > > > ever put maybe-invalid data into it -- the way to do it is to invalidate
> > > > > that page and put a *new* page in there.
> > > >
> > > > Ok, so let's do that...
> > > >
> > > > > Why? Because user mappings are just one problem, but once you had a
> > > > > user mapping, you can have been subject to get_user_pages, so it could
> > > > > be in the middle of a DMA operation or something.
> > > >
> > > > ... because we already know this behaviour causes problems for
> > > > high end enterprise level features like hardware checksumming IO
> > > > paths.
> > > >
> > > > Hence it seems that a multipage write needs to:
> > > >
> > > >     1. allocate new pages
> > > >     2. attach bufferheads/mapping structures to pages (if required)
> > > >     3. copy data into pages
> > > >     4. allocate space
> > > >     5. for each old page in the range:
> > > >         lock page
> > > >         invalidate mappings
> > > >         clear page uptodate flag
> > > >         remove page from page cache
> > > >     6. for each new page:
> > > >         map new page to allocated space
> > > >         lock new page
> > > >         insert new page into pagecache
> > > >         update new page state (write_end equivalent)
> > > >         unlock new page
> > > >     7. free old pages
> > > >
> > > > Steps 1-4 can all fail, and can all be backed out from without
> > > > changing the current state. Steps 5-7 can't fail AFAICT, so we
> > > > should be able to run this safely after the allocation without
> > > > needing significant error unwinding...
> > > >
> > > > Thoughts?
> > >
> > > Possibly. The importance of hot cache is reduced, because we are
> > > doing full-page copies, and bulk copies, by definition. But it
> > > could still be an issue. The allocations and deallocations could
> > > cost a little as well.
> >
> > They will cost far less than the reduction in allocation overhead
> > saves us, and there are potential optimisations there
>
> An API that doesn't require that, though, should be less overhead
> and simpler.
>
> Is it really going to be a problem to implement block hole punching
> in ext4 and gfs2?

I can't follow the ext4 code - it's an intricate maze of weird entry
and exit points, so I'm not even going to attempt to comment on it.

The gfs2 code is easier to follow and it looks like it would require
a redesign and rewrite of the block truncation implementation as it
appears to assume that blocks are only ever removed from the end of
the file - I don't think the recursive algorithms for trimming the
indirect block trees can be easily modified for punching out
arbitrary ranges of blocks. I could be wrong, though, as I'm
not a gfs2 expert....

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
From: Nick Piggin on 20 May 2010 02:50
On Thu, May 20, 2010 at 09:50:54AM +1000, Dave Chinner wrote:
> > As I said, we can have a dumb fallback path for filesystems that
> > don't implement hole punching. Clear the blocks past i_size, and
> > zero out the allocated but not initialized blocks.
> >
> > There does not have to be pagecache allocated in order to do this,
> > you could do direct IO from the zero page in order to do it.
>
> I don't see that as a good solution - it's once again a fairly
> complex way of dealing with the problem, especially as it now means
> that direct io would fall back to buffered which would fall back to
> direct IO....

Well it wouldn't use the full direct IO path. It has the block, just
build a bio with the source zero page and write it out. If the fs
requires anything more fancy than that, tough, it should just
implement hole punching.

> > Hole punching is not only useful there, it is already exposed to
> > userspace via MADV_REMOVE.
>
> That interface is *totally broken*.

Why?

> It has all the same problems as
> vmtruncate() for removing file blocks (because it uses vmtruncate).
> It also has the fundamental problem of being called under the mmap_sem,
> which means that inode locks and therefore de-allocation cannot be
> executed without the possibility of deadlocks.

None of that is an API problem, it's all implementation. Yes,
fadvise would be a much better API, but the madvise API is still
there.

Implementation wise: it does not use vmtruncate; it has no mmap_sem
problem.

> Fundamentally, hole
> punching is an inode operation, not a VM operation....

VM acts as a handle to inode operations. It's no big deal.

> > An API that doesn't require that, though, should be less overhead
> > and simpler.
> >
> > Is it really going to be a problem to implement block hole punching
> > in ext4 and gfs2?
>
> I can't follow the ext4 code - it's an intricate maze of weird entry
> and exit points, so I'm not even going to attempt to comment on it.
>
> The gfs2 code is easier to follow and it looks like it would require
> a redesign and rewrite of the block truncation implementation as it
> appears to assume that blocks are only ever removed from the end of
> the file - I don't think the recursive algorithms for trimming the
> indirect block trees can be easily modified for punching out
> arbitrary ranges of blocks. I could be wrong, though, as I'm
> not a gfs2 expert....

I'm far more in favour of doing the interfaces right, and making
the filesystems fix themselves to use it.
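
[Illustration] The interface shape argued for in this thread, a single
range-based entry point of the form ->allocate(inode, RESERVE/ALLOCATE,
interval) with a matching DEALLOCATE operation, can be modelled on a toy
block map. Everything named here (struct toy_inode, enum toy_op,
toy_change_range()) is hypothetical and exists only for this sketch; the
RESERVE operation is left out for brevity, and a real filesystem's extent
or indirect-block structures are far more involved than a flat array.

```c
#include <errno.h>

#define TOY_NBLOCKS 64

/* Operations accepted by the single range-based entry point. */
enum toy_op { TOY_ALLOCATE, TOY_DEALLOCATE };

/* Toy inode: one flag per file block, 1 = a disk block backs it. */
struct toy_inode {
    unsigned char mapped[TOY_NBLOCKS];
};

/*
 * Apply op to file blocks [start, start + len).
 * TOY_DEALLOCATE on a range in the middle of the file is a hole punch.
 * Returns 0 on success or -EINVAL if the range is out of bounds.
 */
int toy_change_range(struct toy_inode *ino, enum toy_op op,
                     unsigned start, unsigned len)
{
    unsigned i;

    if (start > TOY_NBLOCKS || len > TOY_NBLOCKS - start)
        return -EINVAL;

    for (i = start; i < start + len; i++)
        ino->mapped[i] = (op == TOY_ALLOCATE);
    return 0;
}
```

With such an interface a writer could allocate eight blocks up front, fail
while copying the sixth page, and call toy_change_range() with
TOY_DEALLOCATE on the last three blocks instead of leaving them
uninitialised.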