From: Jan Kara on 21 Jul 2010 13:20

  Hi,

> On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> >
> > The problem with not issuing a cache flush when you have dirty meta
> > data or data is that it does not have any tie to the state of the
> > volatile write cache of the target storage device.
>
> We track whether or not there is any metadata updates associated with
> the inode already; if it does, we force a journal commit, and this
> implies a barrier operation.
>
> The case we're talking about here is one where either (a) there is no
> journal, or (b) there have been no metadata updates (I'm simplifying a
> little here; in fact we track whether there have been fdatasync()- vs
> fsync()- worthy metadata updates), and so there hasn't been a journal
> commit to do the cache flush.
>
> In this case, we want to track when is the last time an fsync() has
> been issued, versus when was the last time data blocks for a
> particular inode have been pushed out to disk.
>
> To use an example I used as motivation for why we might want an
> fsync2(int fd[], int flags[], int num) syscall, consider the situation
> of:
>
>       fsync(control_fd);
>       fdatasync(data_fd);
>
> The first fsync() will have executed a cache flush operation.  So when
> we do the fdatasync() (assuming that no metadata needs to be flushed
> out to disk), there is no need for the cache flush operation.
>
> If we had an enhanced fsync command, we would also be able to
> eliminate a second journal commit in the case where data_fd also had
> some metadata that needed to be flushed out to disk.

  The current implementation already avoids a journal commit because of
fdatasync(data_fd). We remember the transaction ID in which the inode
metadata was last updated and do not force a transaction commit if it is
already committed. Thus the first fsync might force a transaction commit
but the second fdatasync likely won't.
  We could actually improve the scheme to work for data as well. I wrote
proof-of-concept patches (attached) and they nicely avoid the second
barrier when doing:

  echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1

  Ted, would you be interested in something like this?

								Honza
--
Jan Kara <jack(a)suse.cz>
SuSE CR Labs
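
The transaction-ID check described above can be illustrated with a toy
model. This is only a sketch of the idea, not the ext4/jbd2 code: the
names ToyJournal, ToyInode, sync_tid and commit_up_to are hypothetical,
and each forced commit is assumed to stand in for one barrier.

```python
class ToyJournal:
    """Hypothetical stand-in for a journal; not the jbd2 implementation."""

    def __init__(self):
        self.running_tid = 1    # transaction currently accepting updates
        self.committed_tid = 0  # highest transaction ID already committed
        self.commits = 0        # forced commits (each assumed to imply a barrier)

    def commit_up_to(self, tid):
        """Force a commit only if transaction `tid` is not yet committed."""
        if tid <= self.committed_tid:
            return False        # already on disk, nothing to force
        self.committed_tid = self.running_tid
        self.running_tid += 1
        self.commits += 1
        return True


class ToyInode:
    """Remembers the transaction that last touched this inode's metadata."""

    def __init__(self, journal):
        self.journal = journal
        self.sync_tid = 0       # 0 means no metadata update recorded

    def update_metadata(self):
        self.sync_tid = self.journal.running_tid

    def fsync(self):
        return self.journal.commit_up_to(self.sync_tid)


journal = ToyJournal()
control, data = ToyInode(journal), ToyInode(journal)
control.update_metadata()       # both updates land in the same transaction
data.update_metadata()

first = control.fsync()         # forces the commit
second = data.fsync()           # transaction already committed, so no commit
print(first, second, journal.commits)
```

In this model the second call finds its transaction already committed and
skips the commit, which is the behaviour the message attributes to the
current implementation for metadata.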
From: Darrick J. Wong on 2 Aug 2010 20:20

On Wed, Jul 21, 2010 at 07:16:09PM +0200, Jan Kara wrote:
>   Hi,
>
> > On Wed, Jun 30, 2010 at 09:21:04AM -0400, Ric Wheeler wrote:
> > >
> > > The problem with not issuing a cache flush when you have dirty meta
> > > data or data is that it does not have any tie to the state of the
> > > volatile write cache of the target storage device.
> >
> > We track whether or not there is any metadata updates associated with
> > the inode already; if it does, we force a journal commit, and this
> > implies a barrier operation.
> >
> > The case we're talking about here is one where either (a) there is no
> > journal, or (b) there have been no metadata updates (I'm simplifying a
> > little here; in fact we track whether there have been fdatasync()- vs
> > fsync()- worthy metadata updates), and so there hasn't been a journal
> > commit to do the cache flush.
> >
> > In this case, we want to track when is the last time an fsync() has
> > been issued, versus when was the last time data blocks for a
> > particular inode have been pushed out to disk.
> >
> > To use an example I used as motivation for why we might want an
> > fsync2(int fd[], int flags[], int num) syscall, consider the situation
> > of:
> >
> >       fsync(control_fd);
> >       fdatasync(data_fd);
> >
> > The first fsync() will have executed a cache flush operation.  So when
> > we do the fdatasync() (assuming that no metadata needs to be flushed
> > out to disk), there is no need for the cache flush operation.
> >
> > If we had an enhanced fsync command, we would also be able to
> > eliminate a second journal commit in the case where data_fd also had
> > some metadata that needed to be flushed out to disk.
>   Current implementation already avoids journal commit because of
> fdatasync(data_fd). We remember a transaction ID when inode metadata has
> last been updated and do not force a transaction commit if it is already
> committed. Thus the first fsync might force a transaction commit but second
> fdatasync likely won't.
>   We could actually improve the scheme to work for data as well. I wrote
> proof-of-concept patches (attached) and they nicely avoid second barrier
> when doing:
> echo "aaa" >file1; echo "aaa" >file2; fsync file2; fsync file1
>
>   Ted, would you be interested in something like this?

Well... on my fsync-happy workloads, this seems to cut the barrier count down
by about 20%, and speeds it up by about 20%.

I also have a patch to ext4_sync_files that batches the fsync requests together
for a further 20% decrease in barrier IOs, which makes it run another 20%
faster. I'll send that one out shortly, though I've not safety-tested it at
all.

--D
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Darrick J. Wong on 4 Aug 2010 14:20
On Tue, Aug 03, 2010 at 05:01:52AM -0400, Christoph Hellwig wrote:
> On Mon, Aug 02, 2010 at 05:09:39PM -0700, Darrick J. Wong wrote:
> > Well... on my fsync-happy workloads, this seems to cut the barrier count down
> > by about 20%, and speeds it up by about 20%.
>
> Care to share the test case for this?  I'd be especially interesting on
> how it behaves with non-draining barriers / cache flushes in fsync.

Sure. When I run blktrace with the ffsb profile, I get these results:

barriers	transactions/sec
16212		206
15625		201
10442		269
10870		266
15658		201

Without Jan's patch:

barriers	transactions/sec
20855		177
20963		177
20340		174
20908		177

The two ~270 results are a little odd... if we ignore them, the net gain with
Jan's patch is about a 25% reduction in barriers issued and about a 15%
increase in tps. (If we don't, it's ~30% and ~30%, respectively.) That said, I
was running mkfs between runs, so it's possible that the disk layout could have
shifted a bit. If I turn off the fsync parts of the ffsb profile, the barrier
counts drop to about a couple every second or so, which means that Jan's patch
doesn't have much of an effect. But it does help if someone is hammering on
the filesystem with fsync.

The ffsb profile is attached below.

--D

-----------

time=300
alignio=1
directio=1

[filesystem0]
	location=/mnt/
	num_files=100000
	num_dirs=1000

	reuse=1
	# File sizes range from 1kB to 1MB.
	size_weight 1KB 10
	size_weight 2KB 15
	size_weight 4KB 16
	size_weight 8KB 16
	size_weight 16KB 15
	size_weight 32KB 10
	size_weight 64KB 8
	size_weight 128KB 4
	size_weight 256KB 3
	size_weight 512KB 2
	size_weight 1MB 1

	create_blocksize=1048576
[end0]

[threadgroup0]
	num_threads=64

	readall_weight=4
	create_fsync_weight=2
	delete_weight=1

	append_weight = 1
	append_fsync_weight = 1
	stat_weight = 1
	create_weight = 1
	writeall_weight = 1
	writeall_fsync_weight = 1
	open_close_weight = 1

	write_size=64KB
	write_blocksize=512KB

	read_size=64KB
	read_blocksize=512KB

	[stats]
		enable_stats=1
		enable_range=1

		msec_range 0.00 0.01
		msec_range 0.01 0.02
		msec_range 0.02 0.05
		msec_range 0.05 0.10
		msec_range 0.10 0.20
		msec_range 0.20 0.50
		msec_range 0.50 1.00
		msec_range 1.00 2.00
		msec_range 2.00 5.00
		msec_range 5.00 10.00
		msec_range 10.00 20.00
		msec_range 20.00 50.00
		msec_range 50.00 100.00
		msec_range 100.00 200.00
		msec_range 200.00 500.00
		msec_range 500.00 1000.00
		msec_range 1000.00 2000.00
		msec_range 2000.00 5000.00
		msec_range 5000.00 10000.00
	[end]
[end0]
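
The percentages quoted in the message can be rechecked from the two
result tables. The sketch below assumes the first table is the set of
runs with Jan's patch applied, compares run averages, and treats the two
~270 tps runs as the outliers the message mentions; the variable and
function names are made up for illustration.

```python
from statistics import mean

# (barriers, transactions/sec) pairs copied from the tables in the message.
with_patch = [(16212, 206), (15625, 201), (10442, 269), (10870, 266), (15658, 201)]
without_patch = [(20855, 177), (20963, 177), (20340, 174), (20908, 177)]


def summarize(patched, baseline):
    """Return (% fewer barriers, % more tps) of patched runs vs. baseline."""
    barriers_base = mean(b for b, _ in baseline)
    tps_base = mean(t for _, t in baseline)
    barrier_cut = 100 * (1 - mean(b for b, _ in patched) / barriers_base)
    tps_gain = 100 * (mean(t for _, t in patched) / tps_base - 1)
    return barrier_cut, tps_gain


# Assumption: "the two ~270 results" are the runs above 250 tps.
typical = [(b, t) for b, t in with_patch if t < 250]

cut_all, gain_all = summarize(with_patch, without_patch)
cut_typ, gain_typ = summarize(typical, without_patch)
print(f"all runs:         {cut_all:.1f}% fewer barriers, {gain_all:.1f}% more tps")
print(f"without outliers: {cut_typ:.1f}% fewer barriers, {gain_typ:.1f}% more tps")
```

Averaged this way, the figures come out at roughly 24% and 15% without
the two fast runs, and roughly 34% and 30% with them, in line with the
rounded numbers given in the message.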