From: Jamie Lokier on 16 Sep 2009 04:00

Eric Paris wrote:
> On Tue, 2009-09-15 at 16:49 -0700, Linus Torvalds wrote:
> > And btw, I still want to know what's so wonderful about fanotify that we
> > would actually want yet-another-filesystem-notification-interface. So I'm
> > not saying that I'll take a system call interface.
>
> The real thing that fanotify provides is an open fd with the event
> rather than some arbitrary 'watch descriptor' that userspace must
> somehow magically map back to data on disk. This means that it could be
> used to provide subtree notification, which inotify is completely
> incapable of doing.

That's a bit of a spurious claim.

- fanotify does not provide subtree notification in its present form. When it is extended to do that, why wouldn't inotify be as well? That's an fsnotify feature, common to both.

- fanotify does not provide notification at all for some events that you get with inotify. It is not a superset, so you can't use fanotify to provide a subtree-capable equivalent to inotify. What a mess when you need the combination of both features!

- fanotify requires you call readlink(/proc/fd/N) for every event to get the path. It's not a particularly efficient way to get it, especially when an app wants to know if it's something in its region of interest but doesn't care about the actual path. When an app knows it needs to map back to the path, why make it slow to get it? That "extensible data format" is being underutilised...

- fanotify's descriptor may be race-prone as a way to get the subtree used for access, because any of the parent directories could have moved and even been deleted before the app calls readlink(/proc/fd/N). I don't know if a _reliable_ way to track changes in a subtree can be built on it. Maybe it can but it appears this hasn't been analysed. It depends on readlink(/proc/fd/N)'s behaviour when the dentries have been changed, among other things.

- Does the descriptor cause umount to fail when the user does "do some stuff in baz; umount baz", or does it serialise nicely? That's one of inotify's nice features - it doesn't cause umounts to fail.

> And it can be used to provide system wide notification. We all know
> who wants that.

People who want to break out of chroot/namespace jails using the conveniently provided open file descriptor? :-)

Seriously, what does system-wide fanotify do when run from a chroot/namespace/cgroup, and a file outside them is accessed? If the event is delivered with a file descriptor, that's a security hole. If it's not delivered, that sounds like working subtree support?

I'd expect anti-malware to want to be run inside VMs quite often...

Note that there's no such thing as "the real system root" any more.

> It provides an extensible data format which allows growth impossible in
> inotify. I don't know if anyone remembers the inotify patches which
> wanted to overload the inotify cookie field for some other information,
> but inotify information extension is not reasonable or backwards
> compatible.

I agree with this (although that's what flags are for -- see clone).

I don't have a problem with the next interface being fanotify (despite arguing a lot); I just want to see the next one being useful for the things I would otherwise be proposing my own yet-another-interface for. So we don't need a fourth one soon after the third due to easily foreseen limitations.

> I've got private commitments for two very large anti malware companies,
> both of which unprotect and hack syscall tables in their customer's
> kernels, that they would like to move to an fanotify interface. Both
> Red Hat and Suse have expressed interest in these patches and have
> contributed to the patch set.
>
> The patch set is actually rather small (entire set of about 20 patches
> is 1800 lines) as it builds on the fsnotify work already in 2.6.31 to
> reuse code from inotify rather than reimplement the same things over and
> over (like we previously had with inotify and dnotify)

I don't have any problem with either of these, and fsnotify generally seems like an improvement. I don't have a problem with fanotify either. For what it does, it's ok.

> Don't know what else to say.....

Answer questions about use-cases that you're not interested in? Why block them? What about Evgeniy's request for an event without an open fd - because he needs the pid information (which inotify doesn't provide) but not the fd?

Sorry to be so harsh. I'm really trying to make sure we don't repeat the mistakes of dnotify and inotify, and end up with a third interface which also is too restrictive (because it's good enough for your anti-malware and HSM customers) so that a fourth interface will be needed soon after.

I'd like to be able to use it from some applications to accelerate userspace caching of things (faster Make, faster Samba) without penalising all other applications touching unrelated parts of the filesystem. The attitude "you can live with 10% slowdown" worries me. I'm sure that can be fixed with a bit of care.

If the intention is to maintain fanotify and inotify side-by-side for different uses (because fanotify returns open descriptors and blocks the accessing process until acked), that's ok with me. It makes sense. But then it's messy that neither offers a superset of the other regarding which files and events are tracked.

If it's right that inotify has no room for extensibility (I'm not sure about this), then it appears we already made a mess with dnotify and inotify, so it would be a shame to repeat the same mistakes again. Let's get the next one right, even if it takes a bit longer, ok?

-- Jamie

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
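
For readers following the readlink(/proc/fd/N) argument in the message above, here is a minimal sketch of that lookup in plain C. It assumes a Linux system with /proc mounted; the helper name fd_to_path is made up for illustration and is not part of fanotify or any posted patch.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of fanotify): ask the kernel which path it
 * currently associates with an open descriptor, via /proc/self/fd/N.
 * Returns the path length, or -1 on error.
 */
static ssize_t fd_to_path(int fd, char *buf, size_t buflen)
{
    char link[64];
    ssize_t n;

    assert(buf != NULL && buflen > 1);
    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);

    /* readlink() does not NUL-terminate, so leave room and do it here. */
    n = readlink(link, buf, buflen - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

The result is only a snapshot taken at the time of the call. If a parent directory is renamed between the event and the lookup, the new name comes back, and for an unlinked file Linux appends " (deleted)" to the link target, which is the kind of race the message is concerned about.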
From: Alan Cox on 16 Sep 2009 06:50

> - fanotify does not provide subtree notification in its
> present form. When it is extended to do that, why wouldn't
> inotify be as well? That's an fsnotify feature, common to both.

Because inotify gives you no reliable access to the object monitored as the name passed back is not an object reference and is racy. Inotify is fine for making pretty icons pop up on desktops and making file selectors update, but it is somewhat inadequate for indexers and completely useless for stuff like HSM.

> - fanotify requires you call readlink(/proc/fd/N) for every event to
> get the path. It's not a particularly efficient way to get it,

IFF you want the path, but the path isn't usually the most valuable bit. Plus you'll find the readlink is extremely quick anyway.

> People who want to break out of chroot/namespace jails using the
> conveniently provided open file descriptor? :-)

chroot isn't a security model. You can already do this with AF_UNIX sockets (and there are apps that intentionally use fchdir that way)

> I'd expect anti-malware to want to be run inside VMs quite often...

Inside of containers - unlikely. Inside of guests sure, but that's not going to be relevant to fanotify()

> the accessing process until acked), that's ok with me. It makes
> sense. But then it's messy that neither offers a superset of the
> other regarding which files and events are tracked.

Agreed.
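
The AF_UNIX mechanism mentioned above is descriptor passing with SCM_RIGHTS. The following is a minimal sketch of it using only the standard sendmsg()/recvmsg() interface; the helper names send_fd and recv_fd are illustrative, not taken from the thread.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one descriptor over a connected AF_UNIX socket. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
    char byte = 0;                      /* at least one data byte must travel */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor. Returns the new descriptor, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

Note that the receiver only gets a descriptor because some process deliberately called the sending side. Whether that makes it comparable to a descriptor delivered by a notification interface is exactly what the thread goes on to dispute.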
From: Arnd Bergmann on 16 Sep 2009 07:40

On Tuesday 15 September 2009, Eric Paris wrote:
> fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch
> fanotify_modify_mark_fd() --- same but with an fd instead of a path

I think these two can be merged into one without adding complexity, in the same way that sys_utimensat can take a file descriptor or a path or both.

> fanotify_response() --- userspace tells the kernel what to do if requested/allowed
> (could probably be done using write() to the fanotify fd)
> fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed....

Please don't use an ioctl here. While ioctl is fine for character devices and sort of fine for sockets, I think it would be very bad style to use it on file descriptors that you get back from specialized syscalls like fanotify_init. Do one or the other, but do it consistently.

Why is it not strongly typed anyway? Something like

    int fanotify_ignore_sb(int fanotify_fd, unsigned int flag,
                           long f_type, fsid_t f_fsid);

would be type safe, although I think it would be better to only handle one of the two cases. Can you think of a case that you can't handle if you have to decide between them and only do one interface (f_type or fsid)?

Arnd <><
From: Jamie Lokier on 16 Sep 2009 07:50

Alan Cox wrote:
> > - fanotify does not provide subtree notification in its
> > present form. When it is extended to do that, why wouldn't
> > inotify be as well? That's an fsnotify feature, common to both.
>
> Because inotify gives you no reliable access to the object monitored as
> the name passed back is not an object reference and is racy. Inotify is
> fine for making pretty icons pop up on desktops and making file
> selectors update, but it is somewhat inadequate for indexers and
> completely useless for stuff like HSM.

That was my point. (Why do people keep not getting it?)

You can't rely on the name being non-racy, but you _can_ reliably invalidate application-level caches from the sequence of events including file writes, creates, renames, links, unlinks, mounts. And revalidate such caches by the absence of pending events.

(There is one obscure case which inotify is missing, though, which means it cannot detect file changes in certain cases with hard links. I intend to fix that one.)

For that, an inode isn't useful, a descriptor isn't useful, a directory descriptor/inode and pathname isn't useful, and file write events by themselves aren't useful. None of them quite do it by themselves.

But with the correct combination of events, you can maintain very efficient application-level caching of file data / directory listing and lookups / stat results you have previously read from the filesystem.

That's because the information you have previously depended upon, including path lookups, is all notified as one sort of inotify event or another when changed.

Which doesn't sound all that special until you realise you can very quickly revalidate application-caches of any data structure calculated from reading things from the filesystem, no matter how many prerequisites or how complex the data structures, in a single system call. Amortised over many revalidations if you have them in parallel.

That can apply to things like git, make, ccache, samba, rsync, httpd path walks, and virtually any "web templating" framework. Of course it takes userspace support as well, but that's where I'm coming from regarding "acceleration" and the essential kernel infrastructure.

Clearly, I'm going to have to explain with working code :-)

> but it is somewhat inadequate for indexers

For indexers, the real inadequacy is the need to attach inotify watches to every directory at system startup, and to stat() everything to check it hasn't changed since the indexer was last running. Both are very slow on a large directory tree.

The former can be dealt with using subtree watches (yes, even with hard links - I have proposed an algorithm for this but I think nobody understood it ;-). The latter needs filesystem support for a persistent change attribute.

> > - fanotify requires you call readlink(/proc/fd/N) for every event to
> > get the path. It's not a particularly efficient way to get it,
>
> IFF you want the path, but the path isn't usually the most valuable bit.
> Plus you'll find the readlink is extremely quick anyway.

I agree, you don't usually want the whole path.

So what was the point about fanotify making subtree tracking possible with its file descriptor, if not by readlink(/proc/fd/N)?

Descriptors don't tell you which subtree a file is in any better than inotify watches. I.e. they do, if you track them and their containing directories all individually.

> > People who want to break out of chroot/namespace jails using the
> > conveniently provided open file descriptor? :-)
>
> chroot isn't a security model. You can already do this with AF_UNIX
> sockets (and there are apps that intentionally use fchdir that way)

Ah, no. AF_UNIX works with explicit sender cooperation. fanotify gives you access to files without sender cooperation, as it intercepts every open().

> > I'd expect anti-malware to want to be run inside VMs quite often...
>
> Inside of containers - unlikely.

Why not? Some people run entire distributions in containers, and present them as VMs to the world for other people to admin.

> > the accessing process until acked), that's ok with me. It makes
> > sense. But then it's messy that neither offers a superset of the
> > other regarding which files and events are tracked.
>
> Agreed.

In the end this is my main gripe.

-- Jamie
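
The "revalidate by the absence of pending events" idea described in this message can be sketched with the existing inotify API. This is only a single-directory illustration with made-up helper names (watch_dir, cache_still_valid), not the author's algorithm, and it ignores the hard-link and subtree cases the message mentions.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Watch one directory for changes that would invalidate cached data about
 * it. Returns a non-blocking inotify descriptor, or -1 on error.
 */
static int watch_dir(const char *dir)
{
    int ifd;

    assert(dir != NULL);
    ifd = inotify_init1(IN_NONBLOCK);
    if (ifd < 0) {
        perror("inotify_init1");
        return -1;
    }
    if (inotify_add_watch(ifd, dir, IN_CREATE | IN_DELETE | IN_MODIFY |
                          IN_ATTRIB | IN_MOVED_FROM | IN_MOVED_TO) < 0) {
        perror("inotify_add_watch");
        close(ifd);
        return -1;
    }
    return ifd;
}

/*
 * Returns 1 if no events were pending (cached data is still valid),
 * 0 if events were pending (they are drained; the caller must invalidate),
 * -1 on error.
 */
static int cache_still_valid(int ifd)
{
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));
    int valid = 1;

    for (;;) {
        ssize_t n = read(ifd, buf, sizeof(buf));

        if (n > 0) {
            valid = 0;                  /* something changed; keep draining */
            continue;
        }
        if (n < 0 && errno == EAGAIN)
            return valid;               /* queue is empty */
        if (n < 0 && errno == EINTR)
            continue;
        return -1;
    }
}
```

In the quiet case the check costs one read() that fails with EAGAIN, however much cached state depends on the watched directory. A real cache would parse the drained events to invalidate selectively rather than treating any event as a full invalidation.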
From: Evgeniy Polyakov on 16 Sep 2009 08:10

On Tue, Sep 15, 2009 at 05:54:59PM -0400, Eric Paris (eparis(a)redhat.com) wrote:
> Nothing's impossible, but is netlink a square peg for this round hole?
> One of the great benefits of netlink, the attribute matching and
> filtering, although possibly useful isn't some panacea as we have to do
> that well before netlink to have anything like decent performance.
> Imagine every single fs event creating an skb and sending it with
> netlink only to have most of them dropped.

There is no problem with performance even with a single IO per skb. Consider the usual send/recv calls, which may end up with the same skb per syscall - most of the overhead comes from data copy or syscall machinery (for small writes) and not from the allocation path. I have a 3.5 year old performance graph at http://www.ioremap.net/gallery/netlink_perf.png which shows 400 MB/s of bandwidth for 4k writes; I'm pretty sure it is limited by copy performance only.

> The only other benefit to netlink that I know of is the reasonable,
> easy, and clean addition of information later in time with backwards
> compatibility as needed. That's really cool, I admit, but with the
> limited amount of additional info that users have wanted out of inotify
> I think my data type extensibility should be enough.

I want a lot from inotify which I'm afraid will not be easy with fanotify either, but its existing model just does not allow its extension. I would not be 100% sure that there will be no additional needs in a year or so for fanotify.

> > Moreover you can implement a pool of working threads and
> > postpone all the work to them and appropriate event queues, which will
> > allow to use rlimits for the listeners and open files 'kind of' on
> > behalf of those processes.
>
> I'm sorry, I don't understand. I don't see how worker threads help
> anything here. Can you explain what you are thinking?

I meant that all the work of queueing, event allocation, fd opening and population could be postponed and done by some other threads in the system, with only the original process's credentials checked to satisfy various limits. In this case there will be no questions about the context in which a given fd was created, and it is possible to use netlink's asynchronous nature.

I do not force you to do this of course, but there is already quite a huge infrastructure for similar tasks and it could be worth changing/reconsidering things to use existing models rather than inventing your own. Of course this is a matter of overall benefit.

> > But it is quite different from the approach you selected and which is
> > more obvious indeed. So if you ask a question whether fanotify should
> > use sockets or syscalls, I would prefer sockets
>
> I've heard someone else off list say this as well. I'm not certain why.
> I actually spent the day yesterday and have fanotify working over 5 new
> syscalls (good thing I wrote the code with separate back and front
> ends for just this purpose) And I really don't hate it. I think 3
> might be enough.
>
> fanotify_init() ---- very much like inotify_init
> fanotify_modify_mark_at() --- like inotify_add_watch and rm_watch
> fanotify_modify_mark_fd() --- same but with an fd instead of a path

Those two can be combined I think.

> fanotify_response() --- userspace tells the kernel what to do if requested/allowed
> (could probably be done using write() to the fanotify fd)
> fanotify_exclude() --- a horrid syscall which might be better as an ioctl since it isn't strongly typed....

It all sounds good and simple, but what if you need a modify command with new arguments? Instead of adding a new typed option you will need to add another syscall. I already did that for inotify, but via ioctl, and I am pretty sure there will be such a need for the much wider fanotify some time in the future.

> I don't see what's gained using netlink. I am already reusing the
> fsnotify code to do all my queuing. Someone help me understand the
> benefit of netlink and help me understand how we can reasonably meet the
> needs and I'll try to prototype it.
>
> 1) fd's must be opened in the recv process

Or just injected into the registered process' fd table with appropriate limit checks? In this case it can be done on behalf of whatever other worker.

> 2) reliability, if loss must know on the send side

You have this knowledge at netlink sending time, but there is no way to wait until the 'fail' condition is removed, the way you can block writing into a socket while waiting for buffer space to become large enough. And there is no way to tell how many listeners got the message and how many were dropped in multicast delivery, except that there were drops. This can be trivially extended though.

--
Evgeniy Polyakov