Prev: driver for mcs7830 (aka DeLOCK) USB ethernet adapter
Next: [PATCH] [MTD] CHIPS: Support for SST 49LF040B flash chip
From: Dave Jones on 9 Oct 2006 19:00 On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote: > journal_dirty_data() would do submit_bh() ONLY if its part of the older > transaction. > > I need to take a closer look to understand the race. > > BTW, is this 1k or 2k filesystem ? (18:41:11:davej(a)gelk:~)$ sudo tune2fs -l /dev/md0 | grep size Block size: 1024 Fragment size: 1024 Inode size: 128 (18:41:16:davej(a)gelk:~)$ > How easy is to reproduce the problem ? I can reproduce it within a few hours of stressing, but only on that one box. I've not figured out exactly what's so special about it yet (though the 1k block thing may be it). I had been thinking it was a raid0 only thing, as none of my other boxes have that. I'm not entirely sure how it got set up that way either. The Fedora installer being too smart for its own good perhaps. Dave -- http://www.codemonkey.org.uk - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Jan Kara on 10 Oct 2006 09:20 > On Mon, 2006-10-09 14:46:30 -0500, Eric Sandeen <sandeen(a)sandeen.net> wrote: > > Andrew Morton wrote: > > > On Tue, 03 Oct 2006 00:43:01 -0500 > > > Eric Sandeen <sandeen(a)sandeen.net> wrote: > > > > Dave Jones wrote: > > > > > So I managed to reproduce it with an 'fsx foo' and a > > > > > 'fsstress -d . -r -n 100000 -p 20 -r'. This time I grabbed it from > > > > > a vanilla 2.6.18 with none of the Fedora patches.. > > > > > > > > > > I'll give 2.6.18-git a try next. > > > > > > > > > > ----------- [cut here ] --------- [please bite here ] --------- > > > > > Kernel BUG at fs/buffer.c:2791 > > > > I had thought/hoped that this was fixed by Jan's patch at > > > > http://lkml.org/lkml/2006/9/7/236 from the thread started at > > > > http://lkml.org/lkml/2006/9/1/149, but it seems maybe not. Dave hit this bug > > > > first by going through that new codepath.... > > > > > > Yes, Jan's patch is supposed to fix that !buffer_mapped() assertion. iirc, > > > Badari was hitting that BUG and was able to confirm that Jan's patch > > > (3998b9301d3d55be8373add22b6bc5e11c1d9b71 in post-2.6.18 mainline) fixed > > > it. > > > > Looking at some BH traces*, it appears that what Dave hit is a truncate > > racing with a sync... > > > > truncate ... > > ext3_invalidate_page > > journal_invalidatepage > > journal_unmap buffer > > > > going off at the same time as > > > > sync ... > > journal_dirty_data > > sync_dirty_buffer > > submit_bh <-- finds unmapped buffer, boom. > > Is this possibly related to the issues that are discussed in another > thread? We're seeing problems while unlinking large files (usually get > it within some hours with 200MB files, but couldn't yet reproduce it > with 20MB.) I don't think this is related (BTW: I've run your test for 5 hours without any luck ;( Maybe I'll try again for some longer time...). Honza -- Jan Kara <jack(a)suse.cz> SuSE CR Labs - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Jan Kara on 10 Oct 2006 10:20 > On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote: > > > journal_dirty_data() would do submit_bh() ONLY if its part of the older > > transaction. > > > > I need to take a closer look to understand the race. > > > > BTW, is this 1k or 2k filesystem ? > > (18:41:11:davej(a)gelk:~)$ sudo tune2fs -l /dev/md0 | grep size > Block size: 1024 > Fragment size: 1024 > Inode size: 128 > (18:41:16:davej(a)gelk:~)$ > > > How easy is to reproduce the problem ? > > I can reproduce it within a few hours of stressing, but only > on that one box. I've not figured out exactly what's so > special about it yet (though the 1k block thing may be it). > I had been thinking it was a raid0 only thing, as none of > my other boxes have that. > > I'm not entirely sure how it got set up that way either. > The Fedora installer being too smart for its own good perhaps. I think it's really the 1KB block size that makes it happen. I've looked at journal_dirty_data() code and I think the following can happen: sync() eventually ends up in journal_dirty_data(bh) as Eric writes. There is finds dirty buffer attached to the comitting transaction. So it drops all locks and calls sync_dirty_buffer(bh). Now in other process, file is truncated so that 'bh' gets just after EOF. As we have 1kb buffers, it can happen that bh is in the partially truncated page. Buffer is marked unmapped and clean. But in a moment the page is marked dirty and msync() is called. That eventually calls set_page_dirty() and all buffers in the page are marked dirty. The first process now wakes up, locks the buffer, clears the dirty bit and does submit_bh() - Oops. This is essentially the same problem Badari found but in a different place. There are two places that are arguably wrong... 1) We mark buffer dirty after EOF. But actually that may be needed - or what is the expected behaviour when we write into mmapped file after EOF, then extend the file and do msync()? 2) We submit a buffer after EOF for IO. This should be clearly avoided but getting the needed info from bh is really ugly... What we could do is: Instead od calling sync_dirty_buffer() we do something like: lock_buffer(bh); jbd_lock_bh_state(bh); if (!buffer_jbd(bh) || jh != bh2jh(bh) || jh->b_transaction != journal->j_committing_transaction) { jbd_unlock_bh_state(bh); unlock_buffer(bh); } jbd_unlock_bh_state(bh); if (test_clear_buffer_dirty(bh)) { get_bh(bh); bh->b_end_io = end_buffer_write_sync; submit_bh(WRITE, bh); wait_on_buffer(bh); } else unlock_buffer(bh); That should deal with the problem... Much more adventurous change would be to remove the syncing code altogether - the new commit code makes sure to write out each buffer just once so the livelock should not happen now. But then we'd have to put running transaction in j_next_transaction and refile data buffers instead of unfiling them. That should actually give quite some performance improvement when intensively rewriting files. But I guess that is JBD2 matter ;). Honza -- Jan Kara <jack(a)suse.cz> SuSE CR Labs - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Jan-Benedict Glaw on 10 Oct 2006 12:40 On Tue, 2006-10-10 15:16:03 +0200, Jan Kara <jack(a)suse.cz> wrote: > > Is this possibly related to the issues that are discussed in another > > thread? We're seeing problems while unlinking large files (usually get > > it within some hours with 200MB files, but couldn't yet reproduce it > > with 20MB.) > I don't think this is related (BTW: I've run your test for 5 hours > without any luck ;( Maybe I'll try again for some longer time...). Looking at it, I also think it's a different thing. Just to add, I couldn't make it bug with a 10 MB file in a day, but it failed again using a 100MB file. So file size seems to matter somehow. MfG, JBG -- Jan-Benedict Glaw jbglaw(a)lug-owl.de +49-172-7608481 Signature of: Wenn ich wach bin, träume ich. the second :
From: Andrew Morton on 10 Oct 2006 14:50
On Tue, 10 Oct 2006 16:11:45 +0200 Jan Kara <jack(a)suse.cz> wrote: > > On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote: > > > > > journal_dirty_data() would do submit_bh() ONLY if its part of the older > > > transaction. > > > > > > I need to take a closer look to understand the race. > > > > > > BTW, is this 1k or 2k filesystem ? > > > > (18:41:11:davej(a)gelk:~)$ sudo tune2fs -l /dev/md0 | grep size > > Block size: 1024 > > Fragment size: 1024 > > Inode size: 128 > > (18:41:16:davej(a)gelk:~)$ > > > > > How easy is to reproduce the problem ? > > > > I can reproduce it within a few hours of stressing, but only > > on that one box. I've not figured out exactly what's so > > special about it yet (though the 1k block thing may be it). > > I had been thinking it was a raid0 only thing, as none of > > my other boxes have that. > > > > I'm not entirely sure how it got set up that way either. > > The Fedora installer being too smart for its own good perhaps. > I think it's really the 1KB block size that makes it happen. > I've looked at journal_dirty_data() code and I think the following can > happen: > sync() eventually ends up in journal_dirty_data(bh) as Eric writes. > There is finds dirty buffer attached to the comitting transaction. So it drops > all locks and calls sync_dirty_buffer(bh). > Now in other process, file is truncated so that 'bh' gets just after EOF. > As we have 1kb buffers, it can happen that bh is in the partially > truncated page. Buffer is marked unmapped and clean. But in a moment the page > is marked dirty and msync() is called. That eventually calls > set_page_dirty() and all buffers in the page are marked dirty. > The first process now wakes up, locks the buffer, clears the dirty bit > and does submit_bh() - Oops. > > This is essentially the same problem Badari found but in a different > place. There are two places that are arguably wrong... > 1) We mark buffer dirty after EOF. But actually that may be needed - > or what is the expected behaviour when we write into mmapped file after > EOF, then extend the file and do msync()? yup. > 2) We submit a buffer after EOF for IO. This should be clearly avoided > but getting the needed info from bh is really ugly... Things like __block_write_full_page() avoid this by checking the block's offset against i_size. (Not racy against truncate-down because the page is locked, not racy against truncate-up because the bh is zero and up-to-date). But for jbd writeout we don't hold the page lock, so checking against bh->b_page->host->i_size is a bit racy. hm. But we do lock the buffers in journal_invalidatepage(), so checking i_size after locking the buffer in the writeout path might get us there. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |