Prev: [PATCH v4] OMAP: Fix for bus width which improves SD card's peformance.
Next: MIPS: Fix sibyte watchdog initialization
From: Phillip Susi on 22 Apr 2010 17:30 On 4/22/2010 4:35 PM, Jamie Lokier wrote: > POSIX requires concurrent, overlapping writes don't interleave the > data (at least, I have read that numerous times), which is usually > implemented with a mutex even though there are other ways. I think what you are getting at here is that write() needs to atomically update the file pointer, which does not need a mutex. > The trickier stuff in proper AIO is sleeping waiting for memory to be > freed up, sleeping waiting for a rate-limited request queue entry > repeatedly, prior to each of the triple, double, single indirect > blocks, which you then sleep waiting to complete, sleeping waiting for > an atime update journal node, sleeping on requests and I/O on every There's no reason to wait for updating the atime, and I already said if there isn't enough memory then you just return -EAGAIN or -ENOMEM instead of waiting. Whether it's reading indirect blocks or b-trees doesn't make much difference; the fs ->get_blocks() tries not to sleep if possible, and if it must, returns -EAGAIN and the calling code can punt to a work queue to try again in a context that can sleep. > step through b-trees, etc... That's just reads; writing adds just as > much again. Changing those to async callbacks in every > filesystem - it's not worth it and it'd be a maintainability > nightmare. We're talking about changes to the kernel > memory allocator among other things. You can't gfp_mask it away - > except for readahead() because it's an abortable hint. The fs specific code just needs to support a flag like gfp_mask so it can be told we aren't in a context that can sleep; do your best and if you must block, return -EAGAIN. It looks like it almost already does something like that based on this comment from fs/mpage.c: * We pass a buffer_head back and forth and use its buffer_mapped() flag to * represent the validity of its disk mapping and to decide when to do the next * get_block() call. */ If it fixes up a buffer_head for the blocks it needs to finish and returns, then do_mpage_readpage() could queue those reads with a completion routine that would call get_block() again when the data has been read, and when get_block() maps the blocks, then queue reads for those blocks. > Oh, and fine-grained locking makes the async transformation harder, > not easier :-) How so? With fine grained locking you can avoid the use of mutexes and opt for atomic functions or spin locks, so no need to sleep. > For readahead yes because it's just an abortable hint. > For general AIO, no. Why not? aio_read() is perfectly allowed to fail if there is not enough memory to satisfy the request. > Ah, you didn't mention defragging for optimising readahead before. > > In that case, just trace the I/O done a few times and order your > defrag to match the trace, it should handle consistent patterns > without special defrag rules. I'm surprised it doesn't already. > Does ureadahead not use prior I/O traces for guidance? Yes, it traces the IO then on the next boot calls readahead() on the files that were read during the trace, after sorting them by on disk block location. I've been trying to improve things by having defrag pack those files tightly at the start of the disk, and have run into the problem with the indirect blocks and the open() calls blocking because the directories have not been read yet, hence, my desire to readahead() on the directories. Right now defrag lays down the indirect block immediately after the 12 direct blocks, which makes the most sense if you are just reading that one file. Threading the readahead() calls and moving the indirect block to after the next file's direct blocks would make ureadahead faster, at the expense of any one single file read. Probably a good tradeoff that I will have to try. That still leaves the problem of all the open() calls blocking to read one disk directory block at a time, since ureadahead opens all of the files first, then calls readahead() on each of them. This is where it would really help to be able to readahead() the directories first, then try to open all of the files. > Also, having defragged readahead files into a few compact zones, and > gotten the last boot's I/O trace, why not readahead those areas of the > blockdev first in perfect order, before finishing the job with > filesystem operations? The redundancy from no-longer needed blocks is > probably small compared with the gain from perfect order in few big > zones, and if you store the I/O trace of the filesystem stage every > time to use for the block stage next time, the redundancy should stay low. Good point, though I was hoping to be able to accomplish effectively the same thing purely with readahead() and other filesystem calls instead of going direct to the block device. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Jamie Lokier on 22 Apr 2010 18:50 Phillip Susi wrote: > On 4/22/2010 4:35 PM, Jamie Lokier wrote: > > POSIX requires concurrent, overlapping writes don't interleave the > > data (at least, I have read that numerous times), which is usually > > implemented with a mutex even though there are other ways. > > I think what you are getting at here is that write() needs to atomically > update the file pointer, which does not need a mutex. No, that is not the reason. pwrite needs the mutex too. > > The trickier stuff in proper AIO is sleeping waiting for memory to be > > freed up, sleeping waiting for a rate-limited request queue entry > > repeatedly, prior to each of the triple, double, single indirect > > blocks, which you then sleep waiting to complete, sleeping waiting for > > an atime update journal node, sleeping on requests and I/O on every > > There's no reason to wait for updating the atime, and > Whether it's reading indirect blocks or b-trees > doesn't make much difference; the fs ->get_blocks() tries not to sleep > if possible, and if it must, returns -EAGAIN and the calling code can > punt to a work queue to try again in a context that can sleep. Now you are describing using threads in the blocking cases. (Work queues, thread pools, same thing.) Earlier you were saying threads are the wrong approach.... Good, good :-) > The fs specific code just needs to support a flag like gfp_mask so it > can be told we aren't in a context that can sleep; do your best and if > you must block, return -EAGAIN. It looks like it almost already does > something like that based on this comment from fs/mpage.c: Yes, it's not a bad pattern. Simple to understand. There's a slight overhead compared with saving the stack frame fibril-style: The second, sleepable call has to redo much of the work done in the non-sleepable call, and queuing the work queue requires serialising etc. plus extra code for that. Plus the work queue is a bit more scheduling On the other hand, the queue uses less memory than a stack frame. For the in-cache cases, there's no overhead so it's fine. A big problem with it, apart from having to change lots of places in all the filesystems, is that the work-queues run with the wrong security and I/O context. Network filesystems break permissions, quotas break, ionice doesn't work, etc. It's obviously fixable but more involved than just putting a read request on a work queue. That's why the fibril/acall discussions talked about spawning threads from the caller's context or otherwise magically swizzling contexts around to do it with the efficiency of a preexisting thread pool. Once you're doing task security & I/O context swizzling (which turns out to be quite fiddly), the choice between swizzling stack frames or using EAGAIN and work queue type objects becomes a less radical design decision, and could even be a per-filesystem, per-operation choice. > > Oh, and fine-grained locking makes the async transformation harder, > > not easier :-) > > How so? With fine grained locking you can avoid the use of mutexes and > opt for atomic functions or spin locks, so no need to sleep. Fine-grained locking isn't the same thing as using non-sleepable locks. > > For readahead yes because it's just an abortable hint. > > For general AIO, no. > > Why not? aio_read() is perfectly allowed to fail if there is not enough > memory to satisfy the request. So is read(). And then the calling application usually exits, because there's nothing else it can do usefully. Same if aio_read() ever returns ENOMEM. That way lies an application getting ENOMEM often and having to retry aio_read in a loop, probably a busy one, which isn't how the interface is supposed to work, and is not efficient either. The only atomic allocation you might conceivably want is a small one to enqueue the AIO and return immediately. But really even that should sleep. That's the one case where you actually do want aio_read() to sleep. > That still leaves the problem of all the open() calls blocking to read > one disk directory block at a time, since ureadahead opens all of the > files first, then calls readahead() on each of them. This is where it > would really help to be able to readahead() the directories first, then > try to open all of the files. Put open() in threads too! Actually I don't have any idea how well that really goes. > > Also, having defragged readahead files into a few compact zones, and > > gotten the last boot's I/O trace, why not readahead those areas of the > > blockdev first in perfect order, before finishing the job with > > filesystem operations? The redundancy from no-longer needed blocks is > > probably small compared with the gain from perfect order in few big > > zones, and if you store the I/O trace of the filesystem stage every > > time to use for the block stage next time, the redundancy should stay low. > > Good point, though I was hoping to be able to accomplish effectively the > same thing purely with readahead() and other filesystem calls instead of > going direct to the block device. It depends on how accurate your block-level traces are, but if the blocks are consolidated into few contiguous zones, readahead on the blockdev should give perfect seek order, minimal IOPS and maximum I/O sizes. It won't even need any particular order from the defrag. It's hard to see even file readahead() approaching that for speed because it's so simple. Consider: Boot may read 50MB data in countless files including scripts, parts of shared libs etc. Just 0.5 second on any modern system. How long does the ureadahead run take? -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Brad Boyer on 22 Apr 2010 03:10 On Wed, Apr 21, 2010 at 11:06:12PM +0100, Jamie Lokier wrote: > For specific filesystems, you could do it. readahead() on directories > is not an unreasonable thing to add on. > > Generically is not likely. It's not about blocking, it's about the > fact that directories don't always consist of data blocks on the store > organised similarly to a file. For example NFS, CIFS, or (I'm not > sure), maybe even reiserfs/btrfs? Some non-UNIX file systems don't have anything that looks like a directory either. Just as an example, HFS and HFS+ both have a single catalog file for the whole file system. The directory listing method involves walking the tree from this file and picking a few fields out of each record matching the appropriate parent directory. This would make it hard to do something generic, although it would be possible to readahead some range of blocks of the catalog and produce a similar effect. This would really need to be FS specific, and the current readahead impementation is mostly common code. Brad Boyer flar(a)allandria.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Jamie Lokier on 22 Apr 2010 14:00 Phillip Susi wrote: > On 4/21/2010 6:06 PM, Jamie Lokier wrote: > > Generically is not likely. It's not about blocking, it's about the > > fact that directories don't always consist of data blocks on the store > > organised similarly to a file. For example NFS, CIFS, or (I'm not > > sure), maybe even reiserfs/btrfs? > > The contents are stored in blocks somewhere. It doesn't really matter > how or where as long as the fs figures out what it will need to resolve > names in that directory and reads that into the cache. Right, but finding those blocks is highly filesystem-dependent which is why making it a generic feature would need support in each filesystem. However, there could be generic readahead() support in the VFS for filesystems with the right block-getting hook. All those which support FIEMAP on directories should work. We're back to why not do it yourself then, as very few programs need directory readahead. > > Why am I reading all over the place that Linux AIO only works with O_DIRECT? > > Is it out of date? :-) > > Dunno, where did you read that? If you are using O_DIRECT then you > really should be using aio or you will suffer a pretty heavy performance > loss from all of the sleeping, but strictly speaking the two do not have > to be used together. Linux AIO is fickle: Whether it's _really_ async depends on the filesystem type, kernel version, distro-specific patches, and whether you used O_DIRECT or not. Unfortunately there is no API to find out. Even when really async, it's not guaranteed to never block. So it's a lot like the "compromise" readahead we've discussed: Mostly async (if you've checked filesystem type etc.), but not fully async. > As an added optimization, you only need to allocate one bio in > readahead() since it is likely that only one will be needed if all of > the blocks are sequential. Then the callback can use the gfp_mask flags > to prevent allocations from sleeping and if more can not be allocated, > then you sumbit what you've got and when THAT completes, you try to > build more requests. True, you can use gfp_mask flags for allocations and stop readahead where it fails. Probably a good plan. But you can't use it for, say, taking mutexes. > > Plus every little mutex / rwlock is another place where you need those > > callback functions. We don't even _have_ an async mutex facility in > > the kernel. So every user of a mutex has to be changed to use > > waitqueues or something. No more lockdep checking, no more RT > > priority inheritance. > > Yes, it looks like ext4_get_blocks() does use mutexes so it can't be > called from bh context. Perhaps it could be changed to avoid this if > possible and if it must, return -EWOULDBLOCK and the completion callback > would have to punt to a work queue to retry. In the common case though, > it looks like it would be possible for ext4_get_blocks() to avoid using > mutexes and just parse the newly read indirect block and return, then > the completion callback can queue its bios and be done. If you're interested, try finding all the places which could sleep for a write() call... Note that POSIX requires a mutex for write; you can't easily change that. Reading is easier to make fully async than writing. > > To read an indirect block, you have to allocate memory: another > > callback after you've slept waiting for memory to be freed up. > > You allocate the cache pages in the initial readahead() before > returning. No need to do it from the bio completion callback. Then readahead() isn't async, which was your request... It can block waiting for memory and other things when you call it. So that would be the "compromise" version which you complain about below. So I don't know why you're suggesting it :-) > > A compromise where just a few synchronisation points are made async is > > ok. But then it's a compromise... so you still need a multi-threaded > > caller to keep the queues full in all situations. > > Right, which tends to negate most of the gains of having any async at all. Exactly. And making it so it _never_ blocks when called is a ton of work, more lines of code (in C anyway), a maintainability nightmare, and adds some different bottlenecks you've not thought off. At this point I suggest you look up the 2007 discussions about fibrils which are quite good: They cover the overheads of setting up state for async calls when unnecessary, and the beautiful simplicty of treating stack frames as states in their own right. > For example, if we have multiple threads calling readahead() > instead of just one since it may sleep reading an indirect block, then > we can end up with this: > > Thread 1 queues reads of the first 12 blocks of the first file, and the > indirect block. Thread then sleeps waiting for the indirect block. > > Thread 2 queues reads of the first 12 blocks of the second file and its > indirect block. Thread then sleeps waiting for the indirect block. > > Now we have the disk read 12 contiguous blocks of data + indirect from > the first file, then 12 contiguous blocks of data + indirect from the > second file, which are further down the disk, so the head has to seek > forward. Then thread 1 wakes up and parses the indirect block and > queues reading of the subsequent sectors, which now requires a backwards > seek since we skipped reading those sectors to move ahead to the second > file. > > So in our attempt to use threads to keep the queue full, we have > introduced more seeking, which tends to have a higher penalty than just > using a single thread and having the queue drain and the disk idle for a > few ns while we wake up and parse the indirect block. > > Come to think of it, I guess that is a good argument NOT to make > readahead() fully async. No: In that particular case, waiting while the indirect block is parsed is advantageous. But suppose the first indirect block is located close to the second file's data blocks. Or the second file's data blocks are on a different MD backing disk. Or the disk has different seeking characteristics (flash, DRBD). Then the I/O scheduler _should_ overlap the reads from both files. The information to decide that is dependent on filesystem layout (and block device layout if MD). Forcing the order at the generic readahead() level would be suboptimal, even if it sort of works out in common situations. Going a bit off-topic (the 'd' key if bored...) What you've almost identified ;-) is a theoretical scheduling advantage of AIO _sometimes_: If the original I/O submission order has good properties _and_ the I/O queue is close to empty preventing sorting, the hints implicit in the submission order are better preserved with AIO. For example: If you use O_DIRECT with AIO to read a file in block-sequential order, the I/Os reach the disk in that order. If you implement a queue in userspace serviced by multiple threads to overlap synchronous I/O, there will be _slight_ timing randomness which shuffles the order. It doesn't matter when the I/O queue has enough entries, but it'll make a difference when starting up in that example - but even than only rarely - most times even the threads will end up queuing things in the preferred order. But if your AIO submission order doesn't have any particularly special properties anyway, or if you can submit enough in parallel that the kernel can sort things, or if the randomness from threads is small enough that it doesn't matter... that AIO benefit is lost entirely. With Linux AIO, because submission can block unpredictably, I'm guessing the sweet spot is probably to use a small number of threads - number not easy to determine, and dependent on many factors. I reckon the same applies to your readahead() calls: A queue which you make sure is always full enough that threads never block, sorted by inode number or better hints where known, with a small number of threads calling readahead() for files, and doing whatever is useful for directories. It is most unfortunate that such ideas are rather complicated to implemented well, just to try them out :-) Fwiw, my view of this is after writing userspace programs using a very async-callback sort of structure and trying to make the most of O_DIRECT + AIO (i.e. I'm not thinking like an "I love threads and don't understand the point of state machines" sort of person), only to read the fibril discussion and agree that the same issues occur in userspace: There is a balance between thread-like scheduling and event-like scheduling: Certain complex things are better using coroutine-like call stacks, using explicit state callbacks only at particular well-defined sleeping points. In some ways (zoom way, way out of the picture...) the balance between implicit state via threads and explicit state ("event-driven...") is more fundamentally a continuum than it appears close up. Both for simplicity of code, and performance. Neither extreme is optimal in general. -- Jamie -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Phillip Susi on 23 Apr 2010 00:20 On Thu, 2010-04-22 at 23:43 +0100, Jamie Lokier wrote: > No, that is not the reason. pwrite needs the mutex too. Which mutex and what for? > Now you are describing using threads in the blocking cases. (Work > queues, thread pools, same thing.) Earlier you were saying threads > are the wrong approach.... Good, good :-) Sure, in some cases, just not ALL. If you can't control whether or not the call blocks then you HAVE to use threads. If you can be sure it won't block most of the time, then most of the time you don't need any other threads, and when you finally do, you need very few. > A big problem with it, apart from having to change lots of places in > all the filesystems, is that the work-queues run with the wrong > security and I/O context. Network filesystems break permissions, quotas > break, ionice doesn't work, etc. It's obviously fixable but more > involved than just putting a read request on a work queue. Hrm... good point. > Fine-grained locking isn't the same thing as using non-sleepable locks. Yes, it is not the same, but non-sleepable locks can ONLY be used with fine grained locks. The two reasons to use a mutex instead of a spin lock are that you can sleep while holding it, and so it isn't a problem to hold it for an extended period of time. > So is read(). And then the calling application usually exits, because > there's nothing else it can do usefully. Same if aio_read() ever returns ENOMEM. > > That way lies an application getting ENOMEM often and having to retry > aio_read in a loop, probably a busy one, which isn't how the interface > is supposed to work, and is not efficient either. Simply retrying in a loop would be very stupid. The programs using aio are not simple stupid, so they would take more appropriate action. For example a server might decide it already has enough data in the pipe and forget about asking for more until the queues empty, or it might decide to drop that client, which would free up some more memory, or it might decide it has some cache it can free up. Something like readahead could decide that if there isn't enough memory left then it has no business trying to read any more, and exit. Both of these are preferable to waiting for something else to free up enough memory to continue. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
First
|
Prev
|
Next
|
Last
Pages: 1 2 3 4 5 Prev: [PATCH v4] OMAP: Fix for bus width which improves SD card's peformance. Next: MIPS: Fix sibyte watchdog initialization |