fanotify as syscalls [Kernel]

From: Alan Cox on 16 Sep 2009 08:10

> You can't rely on the name being non-racy, but you _can_ reliably
> invalidate application-level caches from the sequence of events
> including file writes, creates, renames, links, unlinks, mounts. And
> revalidate such caches by the absence of pending events.

You can't however create the caches reliably because you've no idea if
you are referencing the right object in the first place - which is why
you want a handle in these cases. I see fanotify as a handle producing
addition to inotify, not as a replacement (plus some other bits around
open blocking for HSM etc)

> Clearly, I'm going to have to explain with working code :-)

Always a good demo

> > but it is somewhat inadequate for indexers
>
> For indexers, the real inadequacy is the need to attach inotify
> watches to every directory at system startup, and to stat() everything
> to check it hasn't changed since the indexer was last running. Both

stat doesn't help you - inode numbers are only guaranteed unique (and
constant) while a reference to the object is held.

> Descriptors don't tell you which subtree a file is in any better than
> inotify watches. I.e. they do, if you track them and their containing
> directories all individually.

Don't get me wrong - I don't think fanotify is sufficient on its own -
and this is one reason. Some things care about the namespace, some about
getting the exact content.

> > chroot isn't a security model. You can already do this with AF_UNIX
> > sockets (and there are apps that intentionally use fchdir that way)
>
> Ah, no. AF_UNIX works with explicit sender cooperation.
>
> fanotify gives you access to files without sender cooperation, as it
> intercepts every open().

and is currently not general user accessible for this reason.

> > Inside of containers - unlikely.
>
> Why not? Some people run entire distributions in containers, and
> present them as VMs to the world for other people to admin.

In a word - performance. In two words performance and security. It isn't
a sensible setup because you want to scan the most efficient way possible
and you want to keep your malware scan as far away from attackers, so it
makes sense to keep it outside of the containers and do the job once
- one scan, one database of things to keep current.

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Jamie Lokier on 16 Sep 2009 08:30

Evgeniy Polyakov wrote:
> It all sounds good and simple, but what if you need to modify a command
> with new arguments? Instead of adding a new typed option you will need to
> add another syscall. I already did that for inotify, but via ioctl, and
> I'm pretty sure there will be such a need for the much wider fanotify
> some time in the future.

Ew, that's unpleasant (adding to inotify via ioctl).

You're right, it's not easy to add new syscalls, but it's not that
hard either.

I'd forgotten about Linus' strace argument. That's a good one.

Of course everything should be a syscall by that argument :-)

And strace can trace some ioctls and setsockopts. (But it's never
pretty to see isatty() showing in strace as SNDCTL_TMR_TIMEBASE :-)

Strace never shows structure of reads and writes to devices, so
although efficient (you can batch), it's not nice for tracing.

-- Jamie

From: Jamie Lokier on 16 Sep 2009 09:00

Alan Cox wrote:
> > You can't rely on the name being non-racy, but you _can_ reliably
> > invalidate application-level caches from the sequence of events
> > including file writes, creates, renames, links, unlinks, mounts. And
> > revalidate such caches by the absence of pending events.
>
> You can't however create the caches reliably because you've no idea if
> you are referencing the right object in the first place - which is why
> you want a handle in these cases. I see fanotify as a handle producing
> addition to inotify, not as a replacement (plus some other bits around
> open blocking for HSM etc)

There are two sets of events getting mixed up here. Inode events -
reads, writes, truncates, chmods; and directory events - renames,
links, creates, unlinks.

Inode events alone are _not enough_ to maintain caches, and here's why.

With a file descriptor for an _inode_ event, that's fine. If you have
{ int fd1 = open("/foo/bar", O_RDONLY), fd2 = open("/foo/baz", O_RDONLY); }
early in your program, and later cached_file_read(fd1) and
cached_file_read(fd2), you have to recognise the inode number and
invalidate both.

You have to call fstat() on the event's descriptor and then look up a
device+inode number in your own table. (The inotify way doesn't need
the fstat() but is otherwise the same).
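
A minimal sketch of that lookup, in Python rather than C for brevity. It
is not code from this thread: the class InodeCache and its method names
are hypothetical, and the "event descriptor" is assumed to be whatever
fd an fanotify-style event would hand to the listener. The cache is
keyed on the (st_dev, st_ino) pair that fstat() returns.

```python
import os


class InodeCache:
    """Cache of file contents keyed by (device, inode) rather than by path."""

    def __init__(self):
        self._data = {}  # (st_dev, st_ino) -> cached bytes

    @staticmethod
    def key_for(fd):
        # fstat() on any descriptor for the object yields the same key.
        st = os.fstat(fd)
        return (st.st_dev, st.st_ino)

    def cached_file_read(self, fd, limit=1 << 20):
        """Return cached contents for fd's inode, reading at most `limit` bytes once."""
        key = self.key_for(fd)
        if key not in self._data:
            self._data[key] = os.pread(fd, limit, 0)
        return self._data[key]

    def on_inode_event(self, event_fd):
        """Invalidate the entry for the inode behind event_fd.

        Returns True if something was actually dropped.
        """
        return self._data.pop(self.key_for(event_fd), None) is not None
```

Every descriptor that refers to the same inode maps to the same key, so
one event invalidates the cached data regardless of which fd the
program originally read through.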

That's fine for files you're keeping open and only want to know if the
content changes _of an open file_.

But that's not so useful.

More often, you want to validate cached_file_read("/foo/bar"). That
is, validate what you'd get if you opened that path _now_ and read it.
Same for cached_stat("/foo/bar") to cache permissions, and other
things like that.

That needs to validate the path lookup _and_ the inode state.

For that, we need directory events, and they must include the name in
the directory that's affected. If you receive a directory event
involving name "bar" in directory (identified by inode) "/foo", you
invalidate cached_file_read("/foo/bar") and cached_stat("/foo/bar").

Oh, but wait, how do we know the inode for the directory in our event
still refers to "/foo"? Answer: We're also watching its parent
directory "/". Assuming no reordering of certain events, that's ok.

That way, by watching "/", "/foo" and "/foo/bar", when you receive no
events you validate the results of cached_file_read("/foo/bar") and
cached_stat("/foo/bar"). A lot to set up, but fast to check. Worth
it if you're checking a lot of things that rarely change.
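
The invalidation rule in the last few paragraphs can be sketched as
follows. This is an illustration under stated assumptions, not an
implementation of any kernel interface: PathCache and watch_chain are
hypothetical names, and the listener is assumed to have already mapped
the event's directory inode back to an absolute path. The file itself
would additionally need an inode watch for content changes.

```python
import posixpath


def watch_chain(path):
    """Directories to watch in order to validate `path`: every ancestor, root first."""
    parts = [p for p in path.split("/") if p]
    dirs = ["/"]
    for part in parts[:-1]:
        dirs.append(posixpath.join(dirs[-1], part))
    return dirs


class PathCache:
    """Cache keyed by absolute path, invalidated by (directory, name) events."""

    def __init__(self):
        self._entries = {}  # absolute path -> cached value

    def put(self, path, value):
        self._entries[path] = value

    def get(self, path):
        return self._entries.get(path)

    def on_directory_event(self, dir_path, name):
        """Drop every cached path whose lookup passes through dir_path/name."""
        changed = posixpath.join(dir_path, name)
        stale = [p for p in self._entries
                 if p == changed or p.startswith(changed + "/")]
        for p in stale:
            del self._entries[p]
        return sorted(stale)
```

With this shape, an event for name "foo" in "/" invalidates everything
cached under "/foo", while an event for "bar" in "/foo" touches only
"/foo/bar". If no events are pending, every remaining entry is still
valid.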

If you receive inode events while watching the parent directory of the
path used to access the inode, then you can avoid watching "/foo/bar",
and just watch the path of parent directories. That saves an order of
magnitude of watches typically. fanotify offers something similar,
and in this case the event is probably more useful than inotify's.

(The above is even hard-link-safe, if you do it right. I won't
complicate the explanation with details).

-- Jamie

From: Eric Paris on 16 Sep 2009 12:00

On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote:
> Alan Cox wrote:
> > > You can't rely on the name being non-racy, but you _can_ reliably
> > > invalidate application-level caches from the sequence of events
> > > including file writes, creates, renames, links, unlinks, mounts. And
> > > revalidate such caches by the absence of pending events.
> >
> > You can't however create the caches reliably because you've no idea if
> > you are referencing the right object in the first place - which is why
> > you want a handle in these cases. I see fanotify as a handle producing
> > addition to inotify, not as a replacement (plus some other bits around
> > open blocking for HSM etc)
>
> There are two sets of events getting mixed up here. Inode events -
> reads, writes, truncates, chmods; and directory events - renames,
> links, creates, unlinks.

My understanding of your argument is that fanotify does not yet provide
all inotify events, namely those of directory operations, and thus is
not suitable to wholesale replace everything inotify can do. I've
already said that working towards that goal is something I plan to
pursue, but for now, you still have inotify.

The mlocate/updatedb people ask me about fanotify and it's on the todo
list to allow global reception of such events. The fd you get would
be of the dir where the event happened. They didn't care, and I haven't
decided if we would provide the path component like inotify does. Most
users are perfectly happy to stat everything in the dir.

It's hopefully feasible, but it's going to take some fsnotify hook
movements and possibly some arguments with Al to get the information I
want where I want it. But there is nothing about the interface that
precludes it and it has been discussed and considered.

Am I still missing it?

-Eric


From: Jamie Lokier on 16 Sep 2009 18:00

Eric Paris wrote:
> On Wed, 2009-09-16 at 13:56 +0100, Jamie Lokier wrote:
> > Alan Cox wrote:
> > > > You can't rely on the name being non-racy, but you _can_ reliably
> > > > invalidate application-level caches from the sequence of events
> > > > including file writes, creates, renames, links, unlinks, mounts. And
> > > > revalidate such caches by the absence of pending events.
> > >
> > > You can't however create the caches reliably because you've no idea if
> > > you are referencing the right object in the first place - which is why
> > > you want a handle in these cases. I see fanotify as a handle producing
> > > addition to inotify, not as a replacement (plus some other bits around
> > > open blocking for HSM etc)
> >
> > There are two sets of events getting mixed up here. Inode events -
> > reads, writes, truncates, chmods; and directory events - renames,
> > links, creates, unlinks.
>
> My understanding of your argument is that fanotify does not yet provide
> all inotify events, namely those of directory operations, and thus is
> not suitable to wholesale replace everything inotify can do.

Largely that, plus seeing fanotify look like it'll acquire some
capabilities that would be useful with those inotify events and
inotify not getting them. Bothered by the apparent direction of
development, really.

Btw, I'm not sure you can use inotify+fanotify together simultaneously
in this way, which may be of benefit - caching might help the
anti-malware-style access controls. I'll have to think carefully
about ordering of some events, and using fanotify and inotify
independently at the same time loses that ordering.

> I've already said that working towards that goal is something I plan
> to pursue,

Sorry, I missed that, just as I didn't find a reply to Evgeniy's "I
need pids". And from another mail, I thought you were stopping at the
things with file descriptors.

> but for now, you still have inotify.

That's right. And it sucks for subtrees, so that's why I'd like to
absorb improvements on subtree inclusions, and exclusion nodes look
useful too.

> The mlocate/updatedb people ask me about fanotify and it's on the todo
> list to allow global reception of such events. The fd you get would
> be of the dir where the event happened. They didn't care, and I haven't
> decided if we would provide the path component like inotify does. Most
> users are perfectly happy to stat everything in the dir.

mlocate/updatedb is relatively low performance and of course wants to
be system-wide. It's not looking so good if a user wants an indexer
just on their /home, and the administrator does not want everyone else
to pay the cost.

But I think we're quite agreed on how useful subtrees would be.
System-wide events won't be needed if we can monitor the / subtree
to get the same effect, and that'll also sort out namespaces and chroots.

Stat'ing every entry in a dir event. Thinking out loud:

1. Stat'ing everything in a dir just to find out which 1 file was
   deleted can be quite expensive for some uses (if you have a
   large dir and it happens a lot), and is unpleasant just because
   each change _should_ result in about O(1) work. Taste, style,
   elegance ;-)

   For POSIX filesystems, I don't see any logical problem with
   this, actually. You don't need to call stat()! It's enough to
   call readdir() and look at d_ino to track
   renames/links/creates/unlinks - assuming only directory-change
   events are relevant here.

   Just an unpleasant O(n) scaling with directory size.

   (Note that I ignore mount points not returning the correct
   d_ino, because apps can track the mount list and
   compensate; they should be doing this anyway).

2. updatedb-style indexing apps don't care about the
   readdir/stat-all-entries cost, because they don't need to read
   the directory after every change, they only need to do it once
   every 24 hours if any events were received in that interval!

   (Obviously this isn't the same for pseudo-real-time indexers.)

   For Samba-style caching, on the other hand, the cost of
   rescanning a large directory when one file is being read often
   and another file in it is changing often might be prohibitive,
   forcing it to use heuristics to decide when to monitor a
   directory and when not to cache it, depending on directory
   size. I'd rather avoid that.

3. Non-POSIX filesystems don't always have stable inode numbers.
   You can't tell that foo was renamed to bar by reading the
   directory and looking at d_ino, or by calling stat on each entry.

   You can assume stable inode numbers for inodes where there's an
   open file descriptor; that *might* be just enough to squeeze
   through the logic of a cache. I'm not sure right now.

4. You can't tell when file contents are changed from stat info.

   That means you have to receive an inode event, not a directory
   event for data changes, but that's not a problem of course - the
   name-used-for-access isn't useful for data changes anyway
   (except for debugging perhaps).

5. stat() doesn't tell you about xattr and ACL changes. xattrs can
   be large and slow to read on a whole directory. But as point 4,
   if attribute changes count as inode changes, there's no problem.

6. Calling stat() pulls a lot into cache that doesn't need to be in
   cache: all those inodes. But as I mentioned in points 1, 4 and
   5, provided only directory name operations pass the directory to
   be scanned, and inode operations always pass the inode, you can
   use readdir() and avoid stat(), so the inodes don't have to be
   pulled into cache after all.

   Except for non-POSIX inode instability. Would be good to work
   out if that breaks the algorithm.
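
The readdir()-only tracking from point 1 can be sketched like this. It
is a hedged illustration, not code from the thread: snapshot and
diff_snapshots are hypothetical helpers, and Python's os.scandir() is
used because its DirEntry.inode() exposes d_ino on Unix without a
stat() call. The sketch assumes stable inode numbers (the POSIX case);
as points 3 and 6 warn, an unstable or reused inode number would be
misreported as a rename.

```python
import os


def snapshot(dir_path):
    """Map each entry name to its d_ino, using readdir() data only."""
    with os.scandir(dir_path) as entries:
        return {entry.name: entry.inode() for entry in entries}


def diff_snapshots(old, new):
    """Infer directory changes between two snapshots; O(n) in directory size."""
    known_inodes = set(old.values())
    gone = {}  # inode -> old names that no longer map to it
    for name, ino in old.items():
        if new.get(name) != ino:
            gone.setdefault(ino, []).append(name)

    events = []
    for name, ino in sorted(new.items()):
        if old.get(name) == ino:
            continue  # unchanged entry
        if gone.get(ino):
            # Same inode vanished under another name: treat as a rename.
            events.append(("rename", gone[ino].pop(), name))
        elif ino in known_inodes:
            events.append(("link", name))    # extra name for a known inode
        else:
            events.append(("create", name))  # inode not seen before
    leftovers = sorted(n for names in gone.values() for n in names)
    events.extend(("unlink", name) for name in leftovers)
    return events
```

A listener would take a snapshot when it starts watching a directory and
re-run the diff only when a directory event arrives for it.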

In summary, calling readdir() and maybe stat/getxattr on each entry in
a directory might be workable, but I'd rather it was avoidable.
Simple apps may prefer to do it anyway - and let multiple events in a
directory be merged as a result.

While I'm here it would be nice to receive one event instead of two
for operations which involve two paths: link, rename and bind mount.
Having to pair up two events from inotify isn't helpful in any way.
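
For context, this is roughly the pairing work inotify currently pushes
onto the listener: a rename arrives as IN_MOVED_FROM and IN_MOVED_TO
events that share a cookie. The sketch below works on already-decoded
(wd, mask, cookie, name) tuples; pair_renames is a hypothetical helper,
and only the two mask constants are taken from <sys/inotify.h>.

```python
IN_MOVED_FROM = 0x00000040  # values as defined in <sys/inotify.h>
IN_MOVED_TO = 0x00000080


def pair_renames(events):
    """Merge IN_MOVED_FROM/IN_MOVED_TO pairs that share a cookie.

    `events` is an iterable of (wd, mask, cookie, name) tuples.
    """
    pending = {}  # cookie -> (wd, name) of the unmatched MOVED_FROM half
    merged = []
    for wd, mask, cookie, name in events:
        if mask & IN_MOVED_FROM:
            pending[cookie] = (wd, name)
        elif mask & IN_MOVED_TO:
            # Source is None when the file came from an unwatched directory.
            merged.append(("rename", pending.pop(cookie, None), (wd, name)))
        else:
            merged.append(("other", mask, (wd, name)))
    # MOVED_FROM halves with no partner: moved out of the watched set.
    merged.extend(("moved_out", src, None) for src in pending.values())
    return merged
```

A single two-path event from the kernel would make this bookkeeping
unnecessary.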

Imho an API that satisfies everything we've been talking about would let
you specify which fields you want to receive in the event when you
bind a listener. Not _everything_ is selectable of course, but
whether you want:

For inode events (data read/write, attribute/ACL/xattr changes):

  - Open file descriptor of the affected file [Optional].
  - The inode number and device number (always?).
  - A way to identify the vfsmount (due to bind mounts making the
    device number insufficient to identify a directory; always?).

For directory events (create/unlink/link/rename/reflink/mkdir/rmdir/
mount/umount):

  - Same as inode above, for the object created/linked/deleted.

  - Same as inode above, for the directory containing the source name.
  - Source name [Optional].
  - Same as above, for the directory containing the target name.
  - Target name [Optional].

  Source and target are the two names for
  rename/link/reflink/bind-mount operations. Otherwise there is
  only one name to include.

Ironically, it begins to look a bit like netlink ;-)

As you can see, I've made the open descriptors optional, and the names
for directory events optional. For directory events, the object
descriptor option should be independent from the source/target
directory descriptor option.
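
To make the proposed record layout concrete, here is a purely
hypothetical sketch as Python dataclasses. None of these names exist in
any kernel or library interface; ObjectRef, InodeEvent, DirectoryEvent
and the mount_id field are invented for illustration, and the optional
fields mirror the [Optional] markers in the lists above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectRef:
    """Identifies one filesystem object (hypothetical layout)."""
    dev: int                  # device number
    ino: int                  # inode number
    mount_id: int             # assumed stand-in for "a way to identify the vfsmount"
    fd: Optional[int] = None  # open descriptor, only if the listener asked for it


@dataclass(frozen=True)
class InodeEvent:
    kind: str                 # e.g. "write", "attrib"
    obj: ObjectRef


@dataclass(frozen=True)
class DirectoryEvent:
    kind: str                              # e.g. "create", "rename", "link"
    obj: ObjectRef                         # object created/linked/deleted
    source_dir: ObjectRef                  # directory containing the source name
    source_name: Optional[str] = None      # only if names were requested
    target_dir: Optional[ObjectRef] = None  # set for two-path operations only
    target_name: Optional[str] = None

    def is_two_path(self):
        """True for rename/link/reflink/bind-mount style events."""
        return self.target_dir is not None
```

Keeping the object descriptor inside ObjectRef means the descriptor
option for the object and for its containing directories can be chosen
independently, as the paragraph above asks.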

Add one more option: wait for Ack before file accessing process can
proceed, or don't require Ack. That basically distinguishes inotify
behaviour from fanotify behaviour.

It's not obvious, but that option's useful for directory events too,
if you think about it: Think like an anti-malware or other access
control manager, and ask: what if I have to block something which
depends on the layout of files? Just as directory events are enough
for caching, they are enough for complete access control of
layout-dependent state too. For example, some line of text is no
problem in a random file, but might be forbidden by the access manager
from appearing in .bash_login, including by "mv harmless .bash_login".

The above is not a final proposal, but I'd be delighted if you'd take
a look at whether it's suitable. I realise some things may not work
out for implementation reasons.

Meanwhile, I'll take a look at userspace code for my caching algorithm
and see how well that works out. I think we'll get subtree monitors
out of this before the month is over...

> It's hopefully feasible, but it's going to take some fsnotify hook
> movements and possibly some arguments with Al to get the information I
> want where I want it.

That may, indeed, be a sticking point :-)

> But there is nothing about the interface that
> precludes it and it has been discussed and considered.
>
> Am I still missing it?

No I think we're on the same wavelength now. Thanks for being
patient. (And thanks, Alan, for stepping in and making me describe
what I had in mind better).

-- Jamie