From: Steve Rago on 19 Dec 2009 09:30

On Sat, 2009-12-19 at 20:20 +0800, Wu Fengguang wrote:
> Hi Steve,
>
> // I should really read the NFS code, but maybe you can help us better
> // understand the problem :)
>
> On Thu, Dec 17, 2009 at 04:17:57PM +0800, Peter Zijlstra wrote:
> > On Wed, 2009-12-16 at 21:03 -0500, Steve Rago wrote:
> > > Eager Writeback for NFS Clients
> > > -------------------------------
> > > Prevent applications that write large sequential streams of data
> > > (like backup, for example) from entering into a memory pressure
> > > state, which degrades performance by falling back to synchronous
> > > operations (both synchronous writes and additional commits).
>
> What exactly is the "memory pressure state" condition? What's the
> code to do the "synchronous writes and additional commits" and maybe
> how they are triggered?

Memory pressure occurs when most of the client pages have been dirtied
by an application (think backup server writing multi-gigabyte files
that exceed the size of main memory). The system works harder to free
dirty pages so that they can be reused. For a local file system, this
means writing the pages to disk. For NFS, however, the writes leave
the pages in an "unstable" state until the server responds to a commit
request. Generally speaking, commit processing is far more expensive
than write processing on the server; both are done with the inode
locked, but since the commit takes so long, all writes are blocked,
which stalls the pipeline.

Flushing is generated by the flusher kernel threads (nee pdflush) and
by the applications themselves (balance_dirty_pages), as well as by
periodic sync (kupdate style). This is roughly controlled by adjusting
dirty_ratio and dirty_background_ratio (along with
dirty_expire_centisecs and dirty_writeback_centisecs). In addition,
when the client system *really* needs a page deep down in the page
allocator, it can generate a synchronous write request for individual
pages. This is just about as expensive as a commit, roughly speaking,
again stalling the pipeline.

> > > This is accomplished by preventing the client application from
> > > dirtying pages faster than they can be written to the server:
> > > clients write pages eagerly instead of lazily.
>
> We already have the balance_dirty_pages() based global throttling.
> So what makes the performance difference in your proposed "per-inode"
> throttling? balance_dirty_pages() does have a much larger threshold
> than yours.

I originally spent several months playing with the balance_dirty_pages
algorithm. The main drawback is that it affects more than the inodes
that the caller is writing and that the control of what to do is too
coarse. My final changes (which worked well for 1Gb connections) were
more heuristic than the changes in the patch -- I basically had to come
up with alternate ways to write pages without generating commits on
inodes. Doing this was distasteful, as I was adjusting generic system
behavior for an NFS-only problem. Then a colleague found Peter
Staubach's patch, which worked just as well in less code, and isolated
the change to the NFS component, which is where it belongs.

> > > The eager writeback is controlled by a sysctl: setting
> > > fs.nfs.nfs_max_woutstanding to 0 disables the feature. Otherwise
> > > it contains the maximum number of outstanding NFS writes that can
> > > be in flight for a given file. This is used to block the
> > > application from dirtying more pages until the writes are complete.
>
> What if we do heuristic write-behind for sequential NFS writes?

Part of the patch does implement a heuristic write-behind. See where
nfs_wb_eager() is called.

> Another related proposal from Peter Staubach is to start async
> writeback (without the throttle in your proposal) when one inode has
> enough pages dirtied:
>
>     Another approach that I suggested was to keep track of the
>     number of pages which are dirty on a per-inode basis. When
>     enough pages are dirty to fill an over-the-wire transfer,
>     then schedule an asynchronous write to transmit that data to
>     the server. This ties in with support to ensure that the
>     server/network is not completely overwhelmed by the client
>     by flow controlling the writing application to better match
>     the bandwidth and latencies of the network and server.
>     With this support, the NFS client tends not to fill memory
>     with dirty pages and thus does not depend upon the other
>     parts of the system to flush these pages.
>
> Can the above alternatives fix the same problem? (or perhaps, is the
> per-inode throttling really necessary?)

This alternative *is contained in* the patch (as this is mostly Peter's
code anyway; all I've done is the analysis and porting). The throttling
is necessary to prevent the client from continuing to fill all of its
memory with dirty pages, which it can always do faster than it can
write pages to the server.

> > > This patch is based heavily (okay, almost entirely) on a prior
> > > patch by Peter Staubach. For the original patch, see
> > > http://article.gmane.org/gmane.linux.nfs/24323.
> > >
> > > The patch below applies to linux-2.6.32-rc7, but it should apply
> > > cleanly to vanilla linux-2.6.32.
> > >
> > > Performance data and tuning notes can be found on my web site
> > > (http://www.nec-labs.com/~sar). With iozone, I see about 50%
> > > improvement for large sequential write workloads over 1Gb
> > > Ethernet. With an in-house micro-benchmark, I see 80% improvement
> > > for large, single-stream, sequential workloads (where "large" is
> > > defined to be greater than the memory size on the client).
>
> These are impressive numbers. I wonder what would be the minimal patch
> (just hacking it to be fast, without all the aux bits)? Is it this
> chunk to call nfs_wb_eager()?

The first half is the same as before, with different indentation. The
last half is indeed the heuristic that calls nfs_wb_eager() to invoke
asynchronous write-behind.
With these changes, the number of NFS commit messages drops from a few
thousand to tens when writing a 32GB file over NFS. This is mainly
because the server is writing dirty pages from its cache in the
background, so when a commit comes along, it has less work to do (as
opposed to writing all of the pages on demand and then waiting for the
commit). I have a second set of changes, which I have not yet
submitted, that removes these commits, but it extends the protocol (in
a backward-compatible way), which will probably be a harder sell.

Steve

> > > @@ -623,10 +635,21 @@ static ssize_t nfs_file_write(struct kio
> > >  	nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, count);
> > >  	result = generic_file_aio_write(iocb, iov, nr_segs, pos);
> > >  	/* Return error values for O_SYNC and IS_SYNC() */
> > > -	if (result >= 0 && nfs_need_sync_write(iocb->ki_filp, inode)) {
> > > -		int err = nfs_do_fsync(nfs_file_open_context(iocb->ki_filp), inode);
> > > -		if (err < 0)
> > > -			result = err;
> > > +	if (result >= 0) {
> > > +		if (nfs_need_sync_write(iocb->ki_filp, inode)) {
> > > +			int err;
> > > +
> > > +			err = nfs_do_fsync(nfs_file_open_context(iocb->ki_filp),
> > > +					inode);
> > > +			if (err < 0)
> > > +				result = err;
> > > +		} else if (nfs_max_woutstanding != 0 &&
> > > +			 nfs_is_seqwrite(inode, pos) &&
> > > +			 atomic_read(&nfsi->ndirty) >= NFS_SERVER(inode)->wpages) {
> > > +			nfs_wb_eager(inode);
> > > +		}
> > > +		if (result > 0)
> > > +			nfsi->wrpos = pos + result;
> > >  	}

> Thanks,
> Fengguang

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
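[Editor's note: the write-behind trigger discussed in this message -- start an
asynchronous flush once a file has accumulated one wire-transfer's worth of
dirty pages -- can be sketched in userspace. All names and numbers below are
illustrative stand-ins, not the kernel's:]

```c
/* Toy model of the per-file write-behind heuristic: once ndirty
 * reaches one over-the-wire transfer's worth of pages (wpages),
 * trigger an asynchronous writeback instead of letting dirty
 * pages pile up in memory. */
#include <stddef.h>

struct file_state {
    long ndirty;    /* dirty pages not yet handed to writeback */
    long wpages;    /* pages that fit in one wire transfer */
    long flushes;   /* async writebacks triggered (models nfs_wb_eager) */
};

/* Called after an application dirties 'n' more pages of the file. */
void note_dirty(struct file_state *f, long n)
{
    f->ndirty += n;
    if (f->ndirty >= f->wpages) {  /* enough for a full transfer */
        f->flushes++;              /* start an async write-behind */
        f->ndirty = 0;             /* pages are now owned by writeback */
    }
}
```

[Writing 100 pages one at a time with wpages = 16 yields six flushes with four
pages left dirty; at no point does more than one transfer's worth of dirty
pages sit un-queued, which is the point of the heuristic.]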
From: Wu Fengguang on 21 Dec 2009 21:00

Steve,

On Sat, Dec 19, 2009 at 10:25:47PM +0800, Steve Rago wrote:
> On Sat, 2009-12-19 at 20:20 +0800, Wu Fengguang wrote:
> > On Thu, Dec 17, 2009 at 04:17:57PM +0800, Peter Zijlstra wrote:
> > > On Wed, 2009-12-16 at 21:03 -0500, Steve Rago wrote:
> > > > Eager Writeback for NFS Clients
> > > > -------------------------------
> > > > Prevent applications that write large sequential streams of data
> > > > (like backup, for example) from entering into a memory pressure
> > > > state, which degrades performance by falling back to synchronous
> > > > operations (both synchronous writes and additional commits).
> >
> > What exactly is the "memory pressure state" condition? What's the
> > code to do the "synchronous writes and additional commits" and maybe
> > how they are triggered?
>
> Memory pressure occurs when most of the client pages have been dirtied
> by an application (think backup server writing multi-gigabyte files
> that exceed the size of main memory). The system works harder to free
> dirty pages so that they can be reused. For a local file system, this
> means writing the pages to disk. For NFS, however, the writes leave
> the pages in an "unstable" state until the server responds to a commit
> request. Generally speaking, commit processing is far more expensive
> than write processing on the server; both are done with the inode
> locked, but since the commit takes so long, all writes are blocked,
> which stalls the pipeline.

Let me try to reiterate the problem with code; please correct me if
I'm wrong.

1) A normal fs sets I_DIRTY_DATASYNC when extending i_size; however,
NFS will set the flag for any pages written -- why this trick? To
guarantee the call of nfs_commit_inode()? Which unfortunately turns
almost every server-side NFS write into a sync write:

	writeback_single_inode:
	    do_writepages
	        nfs_writepages
	            nfs_writepage ----[short time later]---> nfs_writeback_release*
	                                                         nfs_mark_request_commit
	                                                         __mark_inode_dirty(I_DIRTY_DATASYNC);

	    if (I_DIRTY_SYNC || I_DIRTY_DATASYNC)  <---- so this will be true most of the time
	        write_inode
	            nfs_write_inode
	                nfs_commit_inode

2) NFS commit stops the pipeline because it sleeps and waits inside
i_mutex, which blocks all other NFSDs trying to write/writeback the
inode:

	nfsd_sync:
	    take i_mutex
	    filemap_fdatawrite
	    filemap_fdatawait
	    drop i_mutex

If filemap_fdatawait() can be moved out of i_mutex (or the lock just
removed), we solve the root problem:

	nfsd_sync:
	    [take i_mutex]
	    filemap_fdatawrite  => can also be blocked, but less of a problem
	    [drop i_mutex]
	    filemap_fdatawait

Maybe it's a dumb question, but what's the purpose of i_mutex here?
For correctness or to prevent livelock? I can imagine some livelock
problem here (the current implementation can easily wait for extra
pages), however it's not too hard to fix.

The proposed patch essentially takes two actions in nfs_file_write():
- to start writeback when the per-file nr_dirty goes high,
  without committing
- to throttle dirtying when the per-file nr_writeback goes high

I guess this effectively prevents pdflush from kicking in with its
bad committing behavior.

In general it's reasonable to keep NFS per-file nr_dirty low, however
questionable to do per-file nr_writeback throttling. This does not
work well with the global limits - eg. when there are many dirty
files, the summed-up nr_writeback will still grow out of control.
And it's more likely to impact user-visible responsiveness than a
global limit. But my opinion can be biased -- me have a patch to do
a global NFS nr_writeback limit ;)

Thanks,
Fengguang
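[Editor's note: the restructuring Wu sketches above -- submit under the lock,
wait outside it -- can be illustrated with an ordinary pthread mutex. This is
a toy model; `fdatawrite`/`fdatawait` here are stand-ins that only mark state,
and none of the names are the kernel's:]

```c
/* Toy model of moving the long fdatawait outside i_mutex so other
 * writers are not blocked behind a slow commit. */
#include <pthread.h>

struct inode_sim {
    pthread_mutex_t i_mutex;
    int dirty;        /* pages dirtied but not yet submitted */
    int in_flight;    /* pages submitted, awaiting completion */
};

void fdatawrite(struct inode_sim *ino)   /* submit dirty pages */
{
    ino->in_flight += ino->dirty;
    ino->dirty = 0;
}

void fdatawait(struct inode_sim *ino)    /* (long) wait for completion */
{
    ino->in_flight = 0;                  /* pretend the I/O finished */
}

/* Original shape: both the submit and the long wait hold i_mutex,
 * so concurrent writers stall for the whole duration. */
void nfsd_sync_locked(struct inode_sim *ino)
{
    pthread_mutex_lock(&ino->i_mutex);
    fdatawrite(ino);
    fdatawait(ino);          /* writers blocked for the entire wait */
    pthread_mutex_unlock(&ino->i_mutex);
}

/* Suggested shape: only the submit holds i_mutex; the wait does not,
 * so writers may take the mutex and proceed during the wait. */
void nfsd_sync_unlocked_wait(struct inode_sim *ino)
{
    pthread_mutex_lock(&ino->i_mutex);
    fdatawrite(ino);
    pthread_mutex_unlock(&ino->i_mutex);
    fdatawait(ino);
}
```

[Both shapes leave the inode in the same final state; the difference is only
how long the mutex is held, which is exactly the pipeline-stall question being
debated in the thread.]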
From: Steve Rago on 22 Dec 2009 11:30

On Tue, 2009-12-22 at 13:25 +0100, Jan Kara wrote:
> > I originally spent several months playing with the
> > balance_dirty_pages algorithm. The main drawback is that it affects
> > more than the inodes that the caller is writing and that the control
> > of what to do is too
> Can you be more specific here please?

Sure; balance_dirty_pages() will schedule writeback by the flusher
thread once the number of dirty pages exceeds dirty_background_ratio.
The flusher thread calls writeback_inodes_wb() to flush all dirty
inodes associated with the bdi. Similarly, the process dirtying the
pages will call writeback_inodes_wbc() when its bdi threshold has been
exceeded. The first problem is that these functions process all dirty
inodes with the same backing device, which can lead to excess
(duplicate) flushing of the same inode. Second, there is no distinction
between pages that need to be committed and pages that have commits
pending in NR_UNSTABLE_NFS/BDI_RECLAIMABLE (a page that has a commit
pending won't be cleaned any faster by sending more commits). This
tends to overstate the amount of memory that can be cleaned, leading
to additional commit requests. Third, these functions generate a
commit for each set of writes they do, which might not be appropriate.
For background writing, you'd like to delay the commit as long as
possible.

[snip]

> > Part of the patch does implement a heuristic write-behind. See where
> > nfs_wb_eager() is called.
> I believe that if we had per-bdi dirty_background_ratio and set it low
> for NFS's bdi, then the write-behind logic would not be needed
> (essentially the flusher thread should submit the writes to the server
> early).
>
> 								Honza

Maybe so, but you still need something to prevent the process that is
dirtying pages from continuing, because a process can always write to
memory faster than writing to disk/network, so the flusher won't be
able to keep up.

Steve
From: Steve Rago on 22 Dec 2009 11:50

On Tue, 2009-12-22 at 09:59 +0800, Wu Fengguang wrote:
> Steve,
>
> On Sat, Dec 19, 2009 at 10:25:47PM +0800, Steve Rago wrote:
> > On Sat, 2009-12-19 at 20:20 +0800, Wu Fengguang wrote:
> > > On Thu, Dec 17, 2009 at 04:17:57PM +0800, Peter Zijlstra wrote:
> > > > On Wed, 2009-12-16 at 21:03 -0500, Steve Rago wrote:
> > > > > Eager Writeback for NFS Clients
> > > > > -------------------------------
> > > > > Prevent applications that write large sequential streams of
> > > > > data (like backup, for example) from entering into a memory
> > > > > pressure state, which degrades performance by falling back to
> > > > > synchronous operations (both synchronous writes and additional
> > > > > commits).
> > >
> > > What exactly is the "memory pressure state" condition? What's the
> > > code to do the "synchronous writes and additional commits" and
> > > maybe how they are triggered?
> >
> > Memory pressure occurs when most of the client pages have been
> > dirtied by an application (think backup server writing
> > multi-gigabyte files that exceed the size of main memory). The
> > system works harder to free dirty pages so that they can be reused.
> > For a local file system, this means writing the pages to disk. For
> > NFS, however, the writes leave the pages in an "unstable" state
> > until the server responds to a commit request. Generally speaking,
> > commit processing is far more expensive than write processing on the
> > server; both are done with the inode locked, but since the commit
> > takes so long, all writes are blocked, which stalls the pipeline.
>
> Let me try to reiterate the problem with code, please correct me if
> I'm wrong.
>
> 1) normal fs sets I_DIRTY_DATASYNC when extending i_size, however NFS
> will set the flag for any pages written -- why this trick? To
> guarantee the call of nfs_commit_inode()? Which unfortunately turns
> almost every server side NFS write into sync writes..

Not really. The commit needs to be sent, but the writes are still
asynchronous. It's just that the pages can't be recycled until they
are on stable storage.

> 	writeback_single_inode:
> 	    do_writepages
> 	        nfs_writepages
> 	            nfs_writepage ----[short time later]---> nfs_writeback_release*
> 	                                                         nfs_mark_request_commit
> 	                                                         __mark_inode_dirty(I_DIRTY_DATASYNC);
>
> 	    if (I_DIRTY_SYNC || I_DIRTY_DATASYNC)  <---- so this will be true for most time
> 	        write_inode
> 	            nfs_write_inode
> 	                nfs_commit_inode
>
> 2) NFS commit stops pipeline because it sleep&wait inside i_mutex,
> which blocks all other NFSDs trying to write/writeback the inode.
>
> 	nfsd_sync:
> 	    take i_mutex
> 	    filemap_fdatawrite
> 	    filemap_fdatawait
> 	    drop i_mutex
>
> If filemap_fdatawait() can be moved out of i_mutex (or just remove
> the lock), we solve the root problem:
>
> 	nfsd_sync:
> 	    [take i_mutex]
> 	    filemap_fdatawrite  => can also be blocked, but less a problem
> 	    [drop i_mutex]
> 	    filemap_fdatawait
>
> Maybe it's a dumb question, but what's the purpose of i_mutex here?
> For correctness or to prevent livelock? I can imagine some livelock
> problem here (current implementation can easily wait for extra
> pages), however not too hard to fix.

Commits and writes on the same inode need to be serialized for
consistency (write can change the data and metadata; commit [fsync]
needs to provide guarantees that the written data are stable). The
performance problem arises because NFS writes are fast (they generally
just deposit data into the server's page cache), but commits can take
a long time, especially if there is a lot of cached data to flush to
stable storage.

> The proposed patch essentially takes two actions in nfs_file_write()
> - to start writeback when the per-file nr_dirty goes high
>   without committing
> - to throttle dirtying when the per-file nr_writeback goes high
> I guess this effectively prevents pdflush from kicking in with
> its bad committing behavior
>
> In general it's reasonable to keep NFS per-file nr_dirty low, however
> questionable to do per-file nr_writeback throttling. This does not
> work well with the global limits - eg. when there are many dirty
> files, the summed-up nr_writeback will still grow out of control.

Not with the eager writeback patch. The nr_writeback for NFS is
limited by the woutstanding tunable parameter multiplied by the number
of active NFS files being written.

> And it's more likely to impact user visible responsiveness than
> a global limit. But my opinion can be biased -- me have a patch to
> do global NFS nr_writeback limit ;)

What affects user-visible responsiveness is avoiding long delays and
avoiding delays that vary widely. Whether the limit is global or
per-file is less important (but I'd be happy to be convinced
otherwise).

Steve

> Thanks,
> Fengguang
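[Editor's note: the bound Steve invokes above is simple arithmetic -- per-file
throttling caps NFS writeback at the woutstanding limit times the pages per
write times the number of actively written files. A sketch with illustrative
names and numbers (none of which come from the patch):]

```c
/* Upper bound on NFS pages under writeback with per-file throttling:
 * each file may have at most max_woutstanding writes in flight, each
 * covering pages_per_write pages. */
long nr_writeback_bound(long max_woutstanding, long pages_per_write,
                        long active_files)
{
    return max_woutstanding * pages_per_write * active_files;
}
```

[For example, 16 outstanding writes of 16 pages each (a 64KB wsize with 4KB
pages) across 4 active files bounds writeback at 1024 pages, about 4MB. Wu's
counterpoint is that this bound still scales with the number of files, where
a single global limit would not.]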
From: Christoph Hellwig on 23 Dec 2009 03:50

On Tue, Dec 22, 2009 at 01:35:39PM +0100, Jan Kara wrote:
> > 	nfsd_sync:
> > 	    [take i_mutex]
> > 	    filemap_fdatawrite  => can also be blocked, but less a problem
> > 	    [drop i_mutex]
> > 	    filemap_fdatawait
> >
> > Maybe it's a dumb question, but what's the purpose of i_mutex here?
> > For correctness or to prevent livelock? I can imagine some livelock
> > problem here (current implementation can easily wait for extra
> > pages), however not too hard to fix.
> Generally, most filesystems take i_mutex during fsync to
> a) avoid all sorts of livelocking problems
> b) serialize fsyncs for one inode (mostly for simplicity)
> I don't see what advantage would it bring that we get rid of i_mutex
> for fdatawait - only that maybe writers could proceed while we are
> waiting but is that really the problem?

It would match what we do in vfs_fsync for the non-nfsd path, so it's
a no-brainer to do it. In fact I did switch it over to vfs_fsync a
while ago, but that got reverted because it caused deadlocks for
nfsd_sync_dir, which for some reason can't take the i_mutex (I'd have
to check the archives why).

Here's an RFC patch to make some more sense of the fsync callers in
nfsd, including fixing up the data write/wait calling conventions to
match the regular fsync path (which might make this a -stable
candidate):

Index: linux-2.6/fs/nfsd/vfs.c
===================================================================
--- linux-2.6.orig/fs/nfsd/vfs.c	2009-12-23 09:32:45.693170043 +0100
+++ linux-2.6/fs/nfsd/vfs.c	2009-12-23 09:39:47.627170082 +0100
@@ -769,45 +769,27 @@ nfsd_close(struct file *filp)
 }
 
 /*
- * Sync a file
- * As this calls fsync (not fdatasync) there is no need for a write_inode
- * after it.
+ * Sync a directory to disk.
+ *
+ * This is odd compared to all other fsync callers because we
+ *
+ * a) do not have a file struct available
+ * b) expect to have i_mutex already held by the caller
  */
-static inline int nfsd_dosync(struct file *filp, struct dentry *dp,
-			      const struct file_operations *fop)
+int
+nfsd_sync_dir(struct dentry *dentry)
 {
-	struct inode *inode = dp->d_inode;
-	int (*fsync) (struct file *, struct dentry *, int);
+	struct inode *inode = dentry->d_inode;
 	int err;
 
-	err = filemap_fdatawrite(inode->i_mapping);
-	if (err == 0 && fop && (fsync = fop->fsync))
-		err = fsync(filp, dp, 0);
-	if (err == 0)
-		err = filemap_fdatawait(inode->i_mapping);
+	WARN_ON(!mutex_is_locked(&inode->i_mutex));
+	err = filemap_write_and_wait(inode->i_mapping);
+	if (err == 0 && inode->i_fop->fsync)
+		err = inode->i_fop->fsync(NULL, dentry, 0);
 	return err;
 }
 
-static int
-nfsd_sync(struct file *filp)
-{
-	int err;
-	struct inode *inode = filp->f_path.dentry->d_inode;
-
-	dprintk("nfsd: sync file %s\n", filp->f_path.dentry->d_name.name);
-	mutex_lock(&inode->i_mutex);
-	err=nfsd_dosync(filp, filp->f_path.dentry, filp->f_op);
-	mutex_unlock(&inode->i_mutex);
-
-	return err;
-}
-
-int
-nfsd_sync_dir(struct dentry *dp)
-{
-	return nfsd_dosync(NULL, dp, dp->d_inode->i_fop);
-}
-
 /*
  * Obtain the readahead parameters for the file
  * specified by (dev, ino).
@@ -1011,7 +993,7 @@ static int wait_for_concurrent_writes(st
 
 	if (inode->i_state & I_DIRTY) {
 		dprintk("nfsd: write sync %d\n", task_pid_nr(current));
-		err = nfsd_sync(file);
+		err = vfs_fsync(file, file->f_path.dentry, 0);
 	}
 	last_ino = inode->i_ino;
 	last_dev = inode->i_sb->s_dev;
@@ -1180,7 +1162,7 @@ nfsd_commit(struct svc_rqst *rqstp, stru
 		return err;
 	if (EX_ISSYNC(fhp->fh_export)) {
 		if (file->f_op && file->f_op->fsync) {
-			err = nfserrno(nfsd_sync(file));
+			err = nfserrno(vfs_fsync(file, file->f_path.dentry, 0));
 		} else {
 			err = nfserr_notsupp;
 		}