From: Christoph Hellwig on 25 Jun 2010 03:20

If you actually want to get this work in, reposting a huge patchkit again
and again probably doesn't help. Start to prioritize areas and work on
small sets to get them ready.

files_lock and vfsmount_lock seem like rather easy targets to start
with. But for files_lock I really want to see something to generalize
the tty special case. If you touch that area in detail, that wart needs
to go. Al didn't seem to like my variant very much, so he might have
a better idea for it - otherwise it really makes the VFS locking simple
by removing any tty interaction with the superblock files list. The
other suggestion would be to only add regular (maybe even just
writeable) files to the list. In addition to reducing the number of
list operations required, it will also make the tty code a lot easier.

As for the other patches: I don't think the massive fine-grained
locking in the hash tables is a good idea. I would recommend to defer
them for now, and then look into better data structures for these caches
instead of working around the inherent problems of global hash tables.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nick Piggin on 25 Jun 2010 04:10

On Fri, Jun 25, 2010 at 03:12:21AM -0400, Christoph Hellwig wrote:
> If you actually want to get this work in, reposting a huge patchkit again
> and again probably doesn't help. Start to prioritize areas and work on
> small sets to get them ready.

Sure, I haven't been posting the same thing (haven't posted it for a
long time). This simply had a lot of new stuff and improvements to all
existing patches. I didn't cc anyone in particular because it's only
for interested people to take a look at. As you saw last time I cc'ed
Al, I was just trying to get exactly those easier targets merged.

> files_lock and vfsmount_lock seem like rather easy targets to start
> with. But for files_lock I really want to see something to generalize
> the tty special case. If you touch that area in detail, that wart needs
> to go. Al didn't seem to like my variant very much, so he might have
> a better idea for it - otherwise it really makes the VFS locking simple
> by removing any tty interaction with the superblock files list.

Actually I didn't like it because the error handling in the tty code
was broken and difficult to fix properly. The concept was OK though.

But the fact is that today tty already "knows" that vfs doesn't need
its files on the superblock list, and so it may take them off and use
that list_head privately. Currently it is also using files_lock to
protect that private usage. These are two independent problems. My
patch fixes the second, and anything that fixes the first also needs
to fix the second in exactly the same way.

> The
> other suggestion would be to only add regular (maybe even just
> writeable) files to the list. In addition to reducing the number of
> list operations required, it will also make the tty code a lot easier.

This was my suggestion, yes. Either way is conceptually the same; this
one just avoids the memory allocation and error handling problems that
yours had.

But again, a locking change is still required and it would look exactly
the same as my patch really.

> As for the other patches: I don't think the massive fine-grained
> locking in the hash tables is a good idea. I would recommend to defer
> them for now, and then look into better data structures for these caches
> instead of working around the inherent problems of global hash tables.

I don't agree actually. I don't think there is any downside to
fine-grained locking of the hash with bit spinlocks. Until I see one,
I will keep them.

I agree that some other data structure may be better, but it should be
compared with the best possible hash implementation, which is a
scalable hash like this one. Also, our big impending performance
problem is SMP scalability, not hash lookup, AFAIKS.
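The fine-grained hash locking under discussion keeps one lock bit per bucket, stored in the bucket head word itself, so the lock costs no extra memory. A minimal userspace sketch of that idea follows, assuming C11 atomics stand in for the kernel's `bit_spin_lock()`. The table size, the `struct node` layout, and every function name here are hypothetical and are not taken from the patch set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 256u            /* hypothetical table size, power of two */
#define LOCK_BIT ((uintptr_t)1)  /* bit 0 of each head word is the lock */

struct node {
    struct node *next;
    unsigned long key;
};

/* One word per bucket: pointer to the first node, low bit used as lock.
 * Node pointers are at least 2-byte aligned, so bit 0 is always free. */
static _Atomic uintptr_t table[NBUCKETS];

static unsigned bucket_of(unsigned long key)
{
    return (unsigned)(key & (NBUCKETS - 1));
}

static void bucket_lock(unsigned b)
{
    /* Spin until this caller is the one that flips bit 0 from 0 to 1. */
    while (atomic_fetch_or_explicit(&table[b], LOCK_BIT,
                                    memory_order_acquire) & LOCK_BIT)
        ;
}

static void bucket_unlock(unsigned b)
{
    uintptr_t old = atomic_fetch_and_explicit(&table[b], ~LOCK_BIT,
                                              memory_order_release);
    assert(old & LOCK_BIT);      /* unlocking an unlocked bucket is a bug */
    (void)old;
}

/* Read the list head with the lock bit masked off. */
static struct node *bucket_head(unsigned b)
{
    uintptr_t v = atomic_load_explicit(&table[b], memory_order_relaxed);
    return (struct node *)(v & ~LOCK_BIT);
}

static void hash_insert(struct node *n)
{
    unsigned b = bucket_of(n->key);

    bucket_lock(b);
    n->next = bucket_head(b);
    /* Publish the new head while keeping the lock bit set. */
    atomic_store_explicit(&table[b], (uintptr_t)n | LOCK_BIT,
                          memory_order_relaxed);
    bucket_unlock(b);
}

static struct node *hash_lookup(unsigned long key)
{
    unsigned b = bucket_of(key);
    struct node *n;

    bucket_lock(b);
    for (n = bucket_head(b); n != NULL; n = n->next)
        if (n->key == key)
            break;
    bucket_unlock(b);
    return n;
}
```

Operations that hash to different buckets never touch the same lock word. Whether that outweighs moving to a different data structure is the point being debated in this thread.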
From: Dave Chinner on 30 Jun 2010 07:40

On Thu, Jun 24, 2010 at 01:02:12PM +1000, npiggin(a)suse.de wrote:
> http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/

Can you put a git tree up somewhere?

> Update to vfs scalability patches:

.....

Now that I've had a look at the whole series, I'll make an overall
comment: I suspect that the locking is sufficiently complex that we
can count the number of people that will be able to debug it on one
hand. This patch set didn't just fall off the locking cliff, it
fell into a bottomless pit...

> Performance:
> Last time I was testing on a 32-node Altix which could be considered as not a
> sweet-spot for Linux performance target (ie. improvements there may not justify
> complexity). So recently I've been testing with a tightly interconnected
> 4-socket Nehalem (4s/32c/64t). Linux needs to perform well on this size of
> system.

Sure, but I have to question how much of this is actually necessary?
A lot of it looks like scalability for scalability's sake, not
because there is a demonstrated need...

> *** Single-thread microbenchmark (simple syscall loops, lower is better):
> Test            Difference at 95.0% confidence (50 runs)
> open/close      -6.07% +/- 1.075%
> creat/unlink    27.83% +/- 0.522%
> Open/close is a little faster, which should be due to one less atomic in the
> dput common case. Creat/unlink is significantly slower, which is due to RCU
> freeing inodes.

That's a pretty big ouch. Why does RCU freeing of inodes cause that
much regression? The RCU freeing is out of line, so where does the big
impact come from?

> *** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
>         vanilla      vfs
> real    0m4.911s     0m0.183s
> user    0m1.920s     0m1.610s
> sys     4m58.670s    0m5.770s
> After vfs patches, 26x increase in throughput, however parallelism is limited
> by test spawning and exit phases. sys time improvement shows closer to 50x
> improvement. vanilla is bottlenecked on dcache_lock.

So if we cherry pick patches out of the series, what is the bare
minimum set needed to obtain a result in this ballpark? Same for the
other tests?

> *** Reclaim
> I have not done much reclaim testing yet. It should be more scalable and lower
> latency due to significant reduction in lru locks interfering with other
> critical sections in inode/dentry code, and because we have per-zone locks.
> Per-zone LRUs mean that reclaim is targeted to the correct zone, and that
> kswapd will operate on lists of node-local memory objects.

This means we no longer have any global LRUness to inode or dentry
reclaim, which is going to significantly change caching behaviour.
It's also got interesting corner cases like a workload running on a
single node with a dentry/icache working set larger than the VM
wants to hold on a single node.

We went through these sorts of problems with cpusets a few years
back, and the workaround for it was not to limit the slab cache to
the cpuset's nodes. Handling this sort of problem correctly seems
distinctly non-trivial, so I'm really very reluctant to move in this
direction without clear evidence that we have no other
alternative....

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
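The "26x" and "closer to 50x" figures quoted in this message follow directly from the timings in the git diff table. A small sketch of that arithmetic, with hypothetical helper names introduced only for illustration:

```c
#include <assert.h>

/* Convert a time(1)-style "XmY.YYYs" reading to seconds. */
static double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Ratio of the vanilla timing to the patched timing. */
static double speedup(double before, double after)
{
    assert(after > 0.0);         /* guard against a zero or missing sample */
    return before / after;
}
```

Calling `speedup(to_seconds(0, 4.911), to_seconds(0, 0.183))` for the real time gives a ratio between 26 and 27, and `speedup(to_seconds(4, 58.670), to_seconds(0, 5.770))` for the sys time gives a ratio between 51 and 52, which matches the quoted claims.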
From: Nick Piggin on 30 Jun 2010 10:40

On Wed, Jun 30, 2010 at 09:30:54PM +1000, Dave Chinner wrote:
> On Thu, Jun 24, 2010 at 01:02:12PM +1000, npiggin(a)suse.de wrote:
> > http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fs-scale/
>
> Can you put a git tree up somewhere?

I suppose I should. I'll try to set one up.

> > Update to vfs scalability patches:
>
> ....
>
> Now that I've had a look at the whole series, I'll make an overall
> comment: I suspect that the locking is sufficiently complex that we
> can count the number of people that will be able to debug it on one
> hand.

As opposed to everyone who understood it beforehand? :)

> This patch set didn't just fall off the locking cliff, it
> fell into a bottomless pit...

I actually think it's simpler in ways. It has more locks, but a lot
of those protect small, well defined data. Filesystems have required
surprisingly minimal changes (except autofs4, but that's a fairly
special case).

> > Performance:
> > Last time I was testing on a 32-node Altix which could be considered as not a
> > sweet-spot for Linux performance target (ie. improvements there may not justify
> > complexity). So recently I've been testing with a tightly interconnected
> > 4-socket Nehalem (4s/32c/64t). Linux needs to perform well on this size of
> > system.
>
> Sure, but I have to question how much of this is actually necessary?
> A lot of it looks like scalability for scalability's sake, not
> because there is a demonstrated need...

People are complaining about vfs scalability already (at least Intel,
Google, IBM, and networking people). By the time people start shouting,
it's too late because it will take years to get the patches merged. I'm
not counting -rt people who have a bad time with global vfs locks.

You saw the "batched dput+iput" hacks that Google posted a couple of
years ago. Those were in the days of 4 core Core2 CPUs, long before 16
thread Nehalems that will scale well to 4/8 sockets at low cost.

At the high end, vaguely extrapolating from my numbers, a big POWER7
may do under 100 open/close operations per second per hw thread. A big
UV probably under 10 per core.

But actually it's not all for scalability. I have some follow-on
patches (that require RCU inodes, among other things) that actually
improve single threaded performance significantly. The git diff
workload IIRC was several % improved from speeding up stat(2).

> > *** Single-thread microbenchmark (simple syscall loops, lower is better):
> > Test            Difference at 95.0% confidence (50 runs)
> > open/close      -6.07% +/- 1.075%
> > creat/unlink    27.83% +/- 0.522%
> > Open/close is a little faster, which should be due to one less atomic in the
> > dput common case. Creat/unlink is significantly slower, which is due to RCU
> > freeing inodes.
>
> That's a pretty big ouch. Why does RCU freeing of inodes cause that
> much regression? The RCU freeing is out of line, so where does the big
> impact come from?

That comes mostly from the inability to reuse the cache-hot inode
structure, and the cost to go over the deferred RCU list and free the
inodes after they get cache cold.

> > *** 64 parallel git diff on 64 kernel trees fully cached (avg of 5 runs):
> >         vanilla      vfs
> > real    0m4.911s     0m0.183s
> > user    0m1.920s     0m1.610s
> > sys     4m58.670s    0m5.770s
> > After vfs patches, 26x increase in throughput, however parallelism is limited
> > by test spawning and exit phases. sys time improvement shows closer to 50x
> > improvement. vanilla is bottlenecked on dcache_lock.
>
> So if we cherry pick patches out of the series, what is the bare
> minimum set needed to obtain a result in this ballpark? Same for the
> other tests?

Well it's very hard to just scale up bits and pieces because the
dcache_lock is currently basically global (except for d_flags and some
cases of d_count manipulations).

Start chipping away at bits and pieces of it as people hit bottlenecks
and I think it will end in a bigger mess than we have now.

I don't think this should be done lightly, but I think it is going to
be required soon.

> > *** Reclaim
> > I have not done much reclaim testing yet. It should be more scalable and lower
> > latency due to significant reduction in lru locks interfering with other
> > critical sections in inode/dentry code, and because we have per-zone locks.
> > Per-zone LRUs mean that reclaim is targeted to the correct zone, and that
> > kswapd will operate on lists of node-local memory objects.
>
> This means we no longer have any global LRUness to inode or dentry
> reclaim, which is going to significantly change caching behaviour.
> It's also got interesting corner cases like a workload running on a
> single node with a dentry/icache working set larger than the VM
> wants to hold on a single node.
>
> We went through these sorts of problems with cpusets a few years
> back, and the workaround for it was not to limit the slab cache to
> the cpuset's nodes. Handling this sort of problem correctly seems
> distinctly non-trivial, so I'm really very reluctant to move in this
> direction without clear evidence that we have no other
> alternative....

As I explained in the other mail, that's not actually how the per-zone
reclaim works.

Thanks,
Nick
From: Frank Mayhar on 30 Jun 2010 13:10

On Wed, 2010-06-30 at 21:30 +1000, Dave Chinner wrote:
> Sure, but I have to question how much of this is actually necessary?
> A lot of it looks like scalability for scalability's sake, not
> because there is a demonstrated need...

Well, we've repeatedly run into problems with contention on the
dcache_lock as well as the inode_lock; changes that improve those paths
are extremely interesting to us. I've also seen numbers from systems
with large (i.e. 32 to 64) numbers of cores that clearly show serious
problems in this area.

Further, while this seems like a bunch of patches, a close look shows
that it basically just pushes the dcache and inode locks down as far as
possible, making other improvements (such as removal of a few atomics
and no longer batching inode reclaims, among other things) based on
that work. I would be hard-pressed to find much to cherry-pick from
this patch set.

One interesting thing might be to do a set of performance tests for
kernels with increasingly more of the patchset, just to see the effect
of the earlier patches against a vanilla kernel and to measure the
cumulative effect of the later patches. (I'm not volunteering, however:
ENOTIME.)
--
Frank Mayhar <fmayhar(a)google.com>
Google, Inc.