Prev: [PATCH 2/3] DMAENGINE: add runtime slave control to COH 901 318
Next: trace syscalls: Clean confusing {s,r,}name and fix ABI breakage
From: Dave Chinner on 28 Jun 2010 20:30 On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote: > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c. Hi Joel, I have no problems with it being reverted - it's a really just a WAR for the simplest case of the sync hold holdoff. However, I had no idea that any filesystem relied on being able to write pages beyond EOF, and I'd like to understand the implications of it on the higher level code and, more importantly, understand how the writes are getting to disk through multiple layers of page-beyond-i_size checks in the writeback code.... > This patch causes any filesystem with an allocation unit larger than the > filesystem blocksize will leak unzeroed data. During a file extend, the > entire allocation unit is zeroed. XFS has this same underlying issue - it can have uninitialised, allocated blocks past EOF that have to be zeroed when extending the file. > However, this patch prevents the tail > blocks of the allocation unit from being written back to disk. When the > file is next extended, i_size will now cover these unzeroed blocks, > leaking the old contents of the disk to userspace and creating a corrupt > file. XFS doesn't zero blocks at allocation. Instead, XFS zeros the range between the old EOF and the new EOF on each extending write. Hence these pages get written because they fall inside the new i_size that is set during the write. The i_size on disk doesn't get changed until after the data writes have completed, so even on a crash we don't expose uninitialised blocks. > This affects ocfs2 directly. As Tao Ma mentioned in his reporting > email: > > 1. all the place we use filemap_fdatawrite in ocfs2 doesn't flush pages > after i_size now. > 2. sync, fsync, fdatasync and umount don't flush pages after i_size(they > are called from writeback_single_inode). I'm not sure this was ever supposed to work - my understanding is that we should never do anything with pages beyong i_size as pages beyond EOF as being beyond i_size implies we are racing with a truncate and the page is no longer valid. In that case, we should definitely not write it back to disk. Looking at ocfs2_writepage(), it simply calls block_write_full_page(), which does: /* Is the page fully outside i_size? (truncate in progress) */ offset = i_size & (PAGE_CACHE_SIZE-1); if (page->index >= end_index+1 || !offset) { /* * The page may have dirty, unmapped buffers. For example, * they may have been added in ext3_writepage(). Make them * freeable here, so the page does not leak. */ do_invalidatepage(page, 0); unlock_page(page); return 0; /* don't care */ } i.e. pages beyond EOF get invalidated. If it somehow gets through that check, __block_write_full_page() will avoid writing dirty bufferheads beyond EOF because the write is "racing with truncate". Hence there are multiple layers of protection against writing past i_size, so I'm wondering how these pages are even getting to disk in the first place.... > 3. reflink have a BUG_ON triggered because we have some dirty pages > while during CoW. http://oss.oracle.com/bugzilla/show_bug.cgi?id=1265 I'd suggest that the reason you see the BUG_ON() with this patch is that the pages beyond EOF are not being invalidated because they are not being passed to ->writepage and hence are remaining dirty in the cache. IOWs, I suspect that this commit has uncovered a bug in ocfs2, not that it has caused a regression. Your thoughts, Joel? Cheers, Dave. -- Dave Chinner david(a)fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Joel Becker on 28 Jun 2010 21:00 On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote: > On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote: > > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c. > > Hi Joel, > > I have no problems with it being reverted - it's a really just a > WAR for the simplest case of the sync hold holdoff. I have to insist that we revert it until we find a way to make ocfs2 work. The rest of the email will discuss the ocfs2 issues therein. > > This patch causes any filesystem with an allocation unit larger than the > > filesystem blocksize will leak unzeroed data. During a file extend, the > > entire allocation unit is zeroed. > > XFS has this same underlying issue - it can have uninitialised, > allocated blocks past EOF that have to be zeroed when extending the > file. Does XFS do this in get_blocks()? We deliberately do no allocation in get_blocks(), which is where our need for up-front allocation comes from. > > However, this patch prevents the tail > > blocks of the allocation unit from being written back to disk. When the > > file is next extended, i_size will now cover these unzeroed blocks, > > leaking the old contents of the disk to userspace and creating a corrupt > > file. > > XFS doesn't zero blocks at allocation. Instead, XFS zeros the range > between the old EOF and the new EOF on each extending write. Hence > these pages get written because they fall inside the new i_size that > is set during the write. The i_size on disk doesn't get changed > until after the data writes have completed, so even on a crash we > don't expose uninitialised blocks. We do the same, but we zero the entire allocation. This works both when filling holes and when extending, though obviously the extending is what we're worried about here. We change i_size in write_end, so our guarantee is the same as yours for the page containing i_size. > Looking at ocfs2_writepage(), it simply calls > block_write_full_page(), which does: > > /* Is the page fully outside i_size? (truncate in progress) */ > offset = i_size & (PAGE_CACHE_SIZE-1); > if (page->index >= end_index+1 || !offset) { > /* > * The page may have dirty, unmapped buffers. For example, > * they may have been added in ext3_writepage(). Make them > * freeable here, so the page does not leak. > */ > do_invalidatepage(page, 0); > unlock_page(page); > return 0; /* don't care */ > } > > i.e. pages beyond EOF get invalidated. If it somehow gets through > that check, __block_write_full_page() will avoid writing dirty > bufferheads beyond EOF because the write is "racing with truncate". Your contention is that we've never gotten those tail blocks to disk. Instead, our code either handles the future extensions of i_size or we've just gotten lucky with our testing. Our current BUG trigger is because we have a new check that catches this case. Does that summarize your position correctly? I'm not averse to having a zero-only-till-i_size policy, but I know we've visited this problem before and got bit. I have to go reload that context. Regarding XFS, how do you handle catching the tail of an allocation with an lseek(2)'d write? That is, your current allocation has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF and write there. The code has to recognize to zero around old_i_size before moving out to new_i_size, right? I think that's where our old approaches had problems. Joel -- "The real reason GNU ls is 8-bit-clean is so that they can start using ISO-8859-1 option characters." - Christopher Davis (ckd(a)loiosh.kei.com) Joel Becker Consulting Software Developer Oracle E-mail: joel.becker(a)oracle.com Phone: (650) 506-8127 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Dave Chinner on 28 Jun 2010 22:00 On Mon, Jun 28, 2010 at 05:54:04PM -0700, Joel Becker wrote: > On Tue, Jun 29, 2010 at 10:24:21AM +1000, Dave Chinner wrote: > > On Mon, Jun 28, 2010 at 10:35:29AM -0700, Joel Becker wrote: > > > This reverts commit d87815cb2090e07b0b0b2d73dc9740706e92c80c. > > > > Hi Joel, > > > > I have no problems with it being reverted - it's a really just a > > WAR for the simplest case of the sync hold holdoff. > > I have to insist that we revert it until we find a way to make > ocfs2 work. The rest of the email will discuss the ocfs2 issues > therein. > > > > This patch causes any filesystem with an allocation unit larger than the > > > filesystem blocksize will leak unzeroed data. During a file extend, the > > > entire allocation unit is zeroed. > > > > XFS has this same underlying issue - it can have uninitialised, > > allocated blocks past EOF that have to be zeroed when extending the > > file. > > Does XFS do this in get_blocks()? We deliberately do no > allocation in get_blocks(), which is where our need for up-front > allocation comes from. No, it does it in xfs_file_aio_write() (i.e. ->aio_write()) so it catches both buffered and direct IO. This can't be done in the get_blocks() callback because (IMO) there really isn't the context available to know exactly how we are extending the file in get_blocks(). > > > However, this patch prevents the tail > > > blocks of the allocation unit from being written back to disk. When the > > > file is next extended, i_size will now cover these unzeroed blocks, > > > leaking the old contents of the disk to userspace and creating a corrupt > > > file. > > > > XFS doesn't zero blocks at allocation. Instead, XFS zeros the range > > between the old EOF and the new EOF on each extending write. Hence > > these pages get written because they fall inside the new i_size that > > is set during the write. The i_size on disk doesn't get changed > > until after the data writes have completed, so even on a crash we > > don't expose uninitialised blocks. > > We do the same, but we zero the entire allocation. This works > both when filling holes and when extending, though obviously the > extending is what we're worried about here. We change i_size in > write_end, so our guarantee is the same as yours for the page containing > i_size. Ok, so the you've got cached pages covering the file and the tail of the last page/block zeroed in memory. I'd guess that ordered mode journalling then ensures the inode size update doesn't hit the disk until after the data does, so this is crash-safe. That would explain (to me, at least) why you are not seeing stale data exposure on crashes. > > Looking at ocfs2_writepage(), it simply calls > > block_write_full_page(), which does: > > > > /* Is the page fully outside i_size? (truncate in progress) */ > > offset = i_size & (PAGE_CACHE_SIZE-1); > > if (page->index >= end_index+1 || !offset) { > > /* > > * The page may have dirty, unmapped buffers. For example, > > * they may have been added in ext3_writepage(). Make them > > * freeable here, so the page does not leak. > > */ > > do_invalidatepage(page, 0); > > unlock_page(page); > > return 0; /* don't care */ > > } > > > > i.e. pages beyond EOF get invalidated. If it somehow gets through > > that check, __block_write_full_page() will avoid writing dirty > > bufferheads beyond EOF because the write is "racing with truncate". > > Your contention is that we've never gotten those tail blocks to > disk. Instead, our code either handles the future extensions of i_size > or we've just gotten lucky with our testing. Our current BUG trigger is > because we have a new check that catches this case. Does that summarize > your position correctly? Yes, that summarises it pretty well ;) > I'm not averse to having a zero-only-till-i_size policy, but I > know we've visited this problem before and got bit. I have to go reload > that context. There's no hurry from my perspective - I just prefer to understand the the root cause of a problem before jumping.... > Regarding XFS, how do you handle catching the tail of an > allocation with an lseek(2)'d write? That is, your current allocation > has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF > and write there. The code has to recognize to zero around old_i_size > before moving out to new_i_size, right? I think that's where our old > approaches had problems. xfs_file_aio_write() handles both those cases for us via xfs_zero_eof(). What it does is map the region from the old EOF to the start of the new write and zeroes any allocated blocks that are not marked unwritten that lie within the range. It does this via the internal mapping interface because we hide allocated blocks past EOF from the page cache and higher layers. FWIW, the way XFS does this is safe against crashes because the inode size does not get updated on disk or in the journal until after the data has hit the disk. Ordered journalling should also provide this guarantee, i think. Cheers, Dave. -- Dave Chinner david(a)fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Joel Becker on 28 Jun 2010 22:10 On Tue, Jun 29, 2010 at 11:56:15AM +1000, Dave Chinner wrote: > > Regarding XFS, how do you handle catching the tail of an > > allocation with an lseek(2)'d write? That is, your current allocation > > has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF > > and write there. The code has to recognize to zero around old_i_size > > before moving out to new_i_size, right? I think that's where our old > > approaches had problems. > > xfs_file_aio_write() handles both those cases for us via > xfs_zero_eof(). What it does is map the region from the old EOF to > the start of the new write and zeroes any allocated blocks that are > not marked unwritten that lie within the range. It does this via the > internal mapping interface because we hide allocated blocks past EOF > from the page cache and higher layers. Makes sense as an approach. We deliberately do this through the page cache to take advantage of its I/O patterns and tie in with JBD2. Also, we don't feel like maintaining an entire shadow page cache ;-) Joel -- Life's Little Instruction Book #356 "Be there when people need you." Joel Becker Consulting Software Developer Oracle E-mail: joel.becker(a)oracle.com Phone: (650) 506-8127 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Dave Chinner on 28 Jun 2010 22:30
On Mon, Jun 28, 2010 at 07:04:20PM -0700, Joel Becker wrote: > On Tue, Jun 29, 2010 at 11:56:15AM +1000, Dave Chinner wrote: > > > Regarding XFS, how do you handle catching the tail of an > > > allocation with an lseek(2)'d write? That is, your current allocation > > > has a few blocks outside of i_size, then I lseek(2) a gigabyte past EOF > > > and write there. The code has to recognize to zero around old_i_size > > > before moving out to new_i_size, right? I think that's where our old > > > approaches had problems. > > > > xfs_file_aio_write() handles both those cases for us via > > xfs_zero_eof(). What it does is map the region from the old EOF to > > the start of the new write and zeroes any allocated blocks that are > > not marked unwritten that lie within the range. It does this via the > > internal mapping interface because we hide allocated blocks past EOF > > from the page cache and higher layers. > > Makes sense as an approach. We deliberately do this through the > page cache to take advantage of its I/O patterns and tie in with JBD2. > Also, we don't feel like maintaining an entire shadow page cache ;-) Just to clarify any possible misunderstanding here, xfs_zero_eof() also does it's IO through the page cache for similar reasons. It's just the mappings are found via the internal interfaces before the zeroing is done via the anonymous pagecache_write_begin()/ pagecache_write_end() functions (in xfs_iozero()) rather than using the generic block functions. Cheers, Dave. -- Dave Chinner david(a)fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |