From: Neil Brown
On Mon, 24 May 2010 17:47:36 +0100
Al Viro <viro(a)ZenIV.linux.org.uk> wrote:

> On Mon, May 24, 2010 at 12:21:22PM -0400, Trond Myklebust wrote:
> > > Can an nfs4 server e.g. have /x/y being a symlink that resolves to /a/b and
> > > allow mounting of both /x/y/c and /a/b/c? Which path would it return to
> > > client that has mounted both, walked to some referral point and called
> > > nfs_do_refmount(), triggering nfs4_proc_fs_locations()?
> > >
> > > Trond, Neil?
> >
> > When mounting /x/y/c in your example above, the NFSv4 protocol requires
> > the client itself to resolve the symlink, and then walk down /a/b/c
> > (looking up component by component), so it will in practice not see
> > anything other than /a/b/c.
> >
> > If it walks down to a referral, and then calls nfs_do_refmount, it will
> > do the same thing: obtain a path /e/f/g on the new server, and then walk
> > down that component by component while resolving any symlinks and/or
> > referrals that it crosses in the process.
>
> Ho-hum... What happens if the same fs is mounted twice on server? I.e.
> have ext2 from /dev/sda1 mounted on /a and /b on server, then on the client
> do mount -t nfs foo:/a /tmp/a; mount -t nfs foo:/b /tmp/b. Which path
> would we get from GETATTR with fs_locations requested, if we do it for
> /tmp/a/x and /tmp/b/x resp.? Dentry will be the same, since fsid would
> match.
>
> Or would the server refuse to export things that way?

If an explicit fsid or uuid is given for the two different export points,
then the server will happily export both and the client will see two
different filesystems that happen to contain identical content. They could
return different fs_locations, or could return the same depending on what is
specified in /etc/exports.
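
For example, a minimal /etc/exports sketch of the explicit-fsid case (the
fsid values are arbitrary; they just need to differ):

/a *(ro,fsid=1)
/b *(ro,fsid=2)

With that, the client sees foo:/a and foo:/b as two distinct filesystems
even though both are backed by /dev/sda1 on the server.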

If mountd is left to choose the default uuid, then it will only export one of
them at a time. When the client requests the mount of "foo:/b", the
information given to the kernel will override the export of "/a". The
client won't notice much, as the filehandles will all be the same.

The value returned for fs_locations is given to the kernel by mountd based on
the contents of /etc/exports.
So if you mount /dev/sda1 on both /a and /b, and in /etc/exports have:

/a *(locations=foo_a)
/b *(locations=foo_b)

(or whatever the correct syntax is), then when the client does

mount foo:/a /tmp/a

and asks for fs_locations in /tmp/a, it will get something based on foo_a.
If it then does

mount foo:/b /tmp/b

then future fs_locations requests in either /tmp/a or /tmp/b will be based on
foo_b.

If one client mounts foo:/a and then another mounts foo:/b, then both will
see foo_b locations.

Obviously setting /etc/exports up like this with different locations for the
same data would be pretty silly. If the different exports of the same fs
had the same locations, then it would all work sensibly.

At least, the above is a worst case. It is not impossible that mountd could
detect that the two exports are the same and would prefer the "first" one in
all cases. In that case the results would be more stable, but not
necessarily more "correct" (whatever that means here).

NeilBrown
From: Trond Myklebust
On Tue, 2010-05-25 at 00:01 +0100, Al Viro wrote:
> On Mon, May 24, 2010 at 05:13:32PM -0400, Trond Myklebust wrote:
>
> > Sorry... I misunderstood you.
> >
> > In cases like the above, then the default behaviour of the server would
> > be to assign the same filehandles to those mount points. The
> > administrator can, however, make them different by choosing to use the
> > 'fsid' mount option to manually assign different fsids to the different
> > export points.
> >
> > If not, then the client will automatically group these things in the
> > same superblock, so like the server, it too is supposed to share the
> > same inode for these different objects. It will then use
> > d_obtain_alias() to get a root dentry for that inode (see
> > nfs4_get_root()).
>
> Yes, it will. So what will happen in nfs_follow_referral()? Note that
> we check the rootpath returned by the server (whatever it will end up
> being) against the mnt_devname + relative path from mnt_root to referral
> point. In this case it'll be /a/z or /b/z (depending on which export
> will server select when it sees the fsid) vs /a/z/x or /b/z/x (depending
> on which one does client walk into). And the calls of nfs4_proc_fs_locations()
> will get identical arguments whether client walks into a/z/x or b/z/x.
> So will the actual RPC requests seen by the server, so it looks like in
> at least one of those cases we will get the rootpath that is _not_ a prefix
> we are expecting, stepping into
> 	if (strncmp(path, fs_path, strlen(fs_path)) != 0) {
> 		dprintk("%s: path %s does not begin with fsroot %s\n",
> 			__func__, path, fs_path);
> 		return -ENOENT;
> 	}
> in nfs4_validate_fspath().
>
> Question regarding RFC3530: is it actually allowed to have the same fhandle
> show up in two different locations in server's namespace? If so, what
> should GETATTR with FS_LOCATIONS return for it?

I think the general expectation in the protocol is that there are no
hard linked directories. This assumption is reflected in the fact that
we have operations such as LOOKUPP (the NFSv4 equivalent of
lookup("..")) which only take a filehandle argument.

> Client question: what stops you from stack overflows in that area? Call
> chains you've got are *deep*, and I really wonder what happens if you
> hit a referral point while traversing nested symlink, get pathname
> resolution (already several levels into recursion) call ->follow_link(),
> bounce down through nfs_do_refmount/nfs_follow_referral/try_location/
> vfs_kern_mount/nfs4_referral_get_sb/nfs_follow_remote_path into
> vfs_path_lookup, which will cheerfully add a few more loops like that.
>
> Sure, the *total* nesting depth through symlinks is still limited by 8, but
> that pile of stack frames is _MUCH_ fatter than what we normally have in
> pathname resolution. You've suddenly added ~60 extra stack frames to the
> worst-case stack footprint of the pathname resolution. Don't try that
> on sparc64, boys and girls, it won't be happy with attempt to carve ~12Kb
> extra out of its kernel stack... In fact, it's worse than just ~60 stack
> frames - several will contain (on-stack) struct nameidata in them, which
> very definitely will _not_ fit into the minimal stack frame. It's about
> 160 bytes extra, for each of those (up to 7).
>
> Come to think of that, x86 variants might get rather upset about that kind
> of treatment as well. Minimal stack frames are smaller, but so's the stack...

See commit ce587e07ba2e25b5c9d286849885b82676661f3e (NFS: Prevent the
mount code from looping forever on broken exports), which was just
merged. It prevents nesting > 2 levels deep (ignore the changelog
comment about MAX_NESTED_LINKS - that is a typo).
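
For the general shape of such a guard, here is a standalone sketch with
hypothetical names and a made-up constant (it is not the code from that
commit):

#include <errno.h>
#include <stdio.h>

#define MAX_REFERRAL_NESTING 2	/* made-up cap for this sketch */

/* Model the worst case of a broken export whose referral always leads to
 * yet another referral: without a cap the recursion never terminates;
 * with the cap it fails fast instead of eating stack. */
static int follow_referral(unsigned int depth)
{
	if (depth > MAX_REFERRAL_NESTING)
		return -ELOOP;	/* refuse to nest any deeper */

	/* ...mounting the referral target and walking down to it would go
	 * here; this sketch pretends the walk always hits another referral */
	return follow_referral(depth + 1);
}

int main(void)
{
	printf("result: %d\n", follow_referral(0));	/* -ELOOP once the cap trips */
	return 0;
}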

Cheers
Trond

From: Trond Myklebust
On Tue, 2010-05-25 at 00:44 +0100, Al Viro wrote:
> On Tue, May 25, 2010 at 12:01:09AM +0100, Al Viro wrote:
> > Client question: what stops you from stack overflows in that area? Call
> > chains you've got are *deep*, and I really wonder what happens if you
> > hit a referral point while traversing nested symlink, get pathname
> > resolution (already several levels into recursion) call ->follow_link(),
> > bounce down through nfs_do_refmount/nfs_follow_referral/try_location/
> > vfs_kern_mount/nfs4_referral_get_sb/nfs_follow_remote_path into
> > vfs_path_lookup, which will cheerfully add a few more loops like that.
> >
> > Sure, the *total* nesting depth through symlinks is still limited by 8, but
> > that pile of stack frames is _MUCH_ fatter than what we normally have in
> > pathname resolution. You've suddenly added ~60 extra stack frames to the
> > worst-case stack footprint of the pathname resolution. Don't try that
> > on sparc64, boys and girls, it won't be happy with attempt to carve ~12Kb
> > extra out of its kernel stack... In fact, it's worse than just ~60 stack
> > frames - several will contain (on-stack) struct nameidata in them, which
> > very definitely will _not_ fit into the minimal stack frame. It's about
> > 160 bytes extra, for each of those (up to 7).
>
> Actually, just what will happen if you have a referral that would eventually
> resolve to a directory you have no permissions to access? AFAICS, you'll
> end up trying it on all alternates, since nfs_follow_referral() will cheerfully
> keep trying one variant after another, getting -EACCES from each. Worse,
> if there are nested referrals in it, you'll get all sequences of alternates
> tried before you give up.
>
> ..o*O(at least it's merely exponential; Ackermann would be even more fun)

We could perhaps quit if the referral resolves to EACCES, but there is
the theoretical possibility that the administrator of the particular
replicated server has instead just unexported the filesystem (which
will also result in EACCES).
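
Roughly, the trade-off looks like this (a standalone sketch with
hypothetical names, not the real nfs_follow_referral() loop):

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

struct location { const char *server; const char *rootpath; };

/* Stand-in for the real attempt to mount one alternate location. */
static int try_location(const struct location *loc)
{
	(void)loc;
	return -EACCES;
}

/* Trying every alternate even after -EACCES is what makes the worst case
 * explode when referrals nest; bailing out on the first -EACCES would
 * misbehave when that replica has merely been unexported, which surfaces
 * as -EACCES as well. */
static int follow_referral(const struct location *locs, size_t n)
{
	int err = -ENOENT;

	for (size_t i = 0; i < n; i++) {
		err = try_location(&locs[i]);
		if (err == 0)
			return 0;	/* first alternate that works wins */
	}
	return err;	/* every alternate failed */
}

int main(void)
{
	struct location locs[] = {
		{ "replica1", "/export" },
		{ "replica2", "/export" },
	};

	printf("result: %d\n", follow_referral(locs, 2));	/* -EACCES here */
	return 0;
}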


