From: Ben Gamari on 25 Mar 2010 23:20

On Tue, 16 Mar 2010 21:31:03 -0700 (PDT), Ben Gamari <bgamari.foss(a)gmail.com> wrote:
> On Tue, 16 Mar 2010 23:30:10 -0400, tytso(a)mit.edu wrote:
> > .... so did switching to Btrfs solve your latency issues, or are you
> > still having problems?
>
> Still having troubles, although I'm now running 2.6.34-rc1 and things seem
> mildly better. I'll try doing a backup tonight and report back.

I stand by my assertion that 2.6.34 does seem better in some regards. While there are certainly still latency issues, heavy I/O now spills over into other processes' interactive performance less often. That being said, earlier this evening Tracker and notmuch were both indexing and I saw several events of tens of seconds of latency.

- Ben
From: Ben Gamari on 25 Mar 2010 23:20

On Tue, 16 Mar 2010 08:31:12 -0700 (PDT), Ben Gamari <bgamari.foss(a)gmail.com> wrote:
> Hey all,

I apologize for my extreme tardiness in replying to your responses. I was hoping to have more time over Spring break to deal with this issue than I did (as always). Nevertheless, I should be able to keep up with things from this point on. Specific replies will follow.

- Ben
From: Ben Gamari on 25 Mar 2010 23:30

On Wed, 17 Mar 2010 15:53:50 +1100, Nick Piggin <npiggin(a)suse.de> wrote:
> Where are the unrelated processes waiting? Can you get a sample of
> several backtraces? (/proc/<pid>/stack should do it)

I wish. One of the incredibly frustrating characteristics of this issue is the difficulty of measuring it. By the time processes begin blocking, it's already far too late to open a terminal and cat to a file. By the time the terminal has opened, tens of seconds have passed and things have started to return to normal.

> > Moreover, the hit on unrelated processes is so bad
> > that I would almost suspect that swap I/O is being serialized by fsync() as
> > well, despite being on a separate swap partition beyond the control of the
> > filesystem.
>
> It shouldn't be, until it reaches the bio layer. If it is on the same
> block device, it will still fight for access. It could also be blocking
> on dirty data thresholds, or page reclaim though -- writeback and
> reclaim could easily be getting slowed down by the fsync activity.

Hmm, this sounds interesting. Is there a way to monitor writeback throughput?

> Swapping tends to cause fairly nasty disk access patterns; combined with
> fsync it could be pretty unavoidable.

This is definitely a possibility. However, it seems to me that swapping should be at least mildly favored over other I/O by the I/O scheduler. That being said, I can certainly see how it would be difficult to implement such a heuristic in a fair way, so as not to block out standard filesystem access during a thrashing spree.

> > Xapian, however, is far from the first time I have seen this sort of
> > performance cliff. Rsync, which also uses fsync(), can also trigger this sort
> > of thrashing during system backups, as can rdiff. slocate's updatedb
> > absolutely kills interactive performance as well.
> >
> > Issues similar to this have been widely reported[1-5] in the past, and despite
> > many attempts[5-8] within both the I/O and memory management subsystems to fix
> > it, the problem certainly remains. I have tried reducing swappiness from 60 to
> > 40, with some small improvement, and it has been reported[20] that these sorts
> > of symptoms can be negated through use of memory control groups to prevent
> > interactive process pages from being evicted.
>
> So the workload is causing quite a lot of swapping as well? How much
> pagecache do you have? It could be that you have too much pagecache and
> it is pushing out anonymous memory too easily, or you might have too
> little pagecache causing suboptimal writeout patterns (possibly writeout
> from page reclaim rather than asynchronous dirty page cleaner threads,
> which can really hurt).

As far as I can tell, the workload should fit in memory without a problem. This machine has 4 gigabytes of memory, of which 2.8 GB is currently page cache. That seems high, perhaps? I've included meminfo below. I can completely see how an overly aggressive page cache would result in this sort of behavior.
- Ben

MemTotal:        4048068 kB
MemFree:           47232 kB
Buffers:              48 kB
Cached:          2774648 kB
SwapCached:         1148 kB
Active:          2353572 kB
Inactive:        1355980 kB
Active(anon):    1343176 kB
Inactive(anon):   342644 kB
Active(file):    1010396 kB
Inactive(file):  1013336 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4883756 kB
SwapFree:        4882532 kB
Dirty:             24736 kB
Writeback:             0 kB
AnonPages:        933820 kB
Mapped:            88840 kB
Shmem:            750948 kB
Slab:             150752 kB
SReclaimable:     121404 kB
SUnreclaim:        29348 kB
KernelStack:        2672 kB
PageTables:        31312 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6907788 kB
Committed_AS:    2773672 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      364080 kB
VmallocChunk:   34359299100 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        8552 kB
DirectMap2M:     4175872 kB
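Two practical questions in the exchange above went unanswered in the thread: how to capture backtraces before a stall has already passed, and how to monitor writeback throughput. One way to cover both is a small watcher started ahead of time. The following is only a sketch under stated assumptions (root privileges so /proc/<pid>/stack is readable, a standard Linux procfs, and an arbitrary half-second polling interval), not code posted to the list. It dumps the kernel stack of every task in uninterruptible sleep ("D" state) and logs the nr_dirty and nr_writeback page counters from /proc/vmstat as a rough writeback gauge.

#!/usr/bin/env python3
# Sketch only: poll for D-state tasks and dump their kernel stacks,
# alongside writeback counters. Requires root for /proc/<pid>/stack.
import os
import time

def writeback_counters():
    # Selected writeback-related counters from /proc/vmstat, in pages.
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            name, value = line.split()
            if name in ("nr_dirty", "nr_writeback"):
                counters[name] = int(value)
    return counters

def d_state_tasks():
    # Yield (pid, comm) for tasks in uninterruptible sleep ("D").
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/stat") as f:
                head, tail = f.read().rsplit(")", 1)  # comm may contain spaces
            if tail.split()[0] == "D":  # state is the field after "(comm)"
                yield pid, head.split("(", 1)[1]
        except OSError:  # task exited between listdir() and open()
            continue

while True:
    print(time.strftime("%H:%M:%S"), writeback_counters())
    for pid, comm in d_state_tasks():
        try:
            with open(f"/proc/{pid}/stack") as f:
                print(f"--- pid {pid} ({comm}) ---\n{f.read()}")
        except OSError:  # task exited, or stack not readable
            continue
    time.sleep(0.5)

Left running in the background with output redirected to a file, something like this sidesteps the problem of having to open a terminal mid-stall.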
From: Ben Gamari on 25 Mar 2010 23:40

On Wed, 17 Mar 2010 10:37:04 +0100, Ingo Molnar <mingo(a)elte.hu> wrote:
> A call-graph profile will show the precise reason for IO latencies, and their
> relative likelihood.

Once I get home I'll try to reproduce the issue and get a call graph. Thanks!

- Ben
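For reference, a system-wide call-graph profile of the kind Ingo suggests is normally captured with perf record -a -g and inspected with perf report. Since the latency events here last only tens of seconds and are hard to catch by hand, a trigger along the following lines could start the recording automatically when iowait spikes. This is a sketch with hypothetical choices (the 20% threshold, the 1-second sampling window, and the 10-second capture are arbitrary), not anything from the thread; it assumes perf is installed and the script runs as root.

#!/usr/bin/env python3
# Sketch: trigger a system-wide call-graph profile when iowait spikes.
import subprocess
import time

def iowait_fraction(interval=1.0):
    # Fraction of total CPU jiffies spent in iowait over `interval` seconds.
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return sum(fields), fields[4]  # user nice system idle *iowait* ...
    total0, iowait0 = snapshot()
    time.sleep(interval)
    total1, iowait1 = snapshot()
    elapsed = total1 - total0
    return (iowait1 - iowait0) / elapsed if elapsed else 0.0

while True:
    if iowait_fraction() > 0.20:  # arbitrary threshold
        out = f"/tmp/perf-{time.strftime('%Y%m%d-%H%M%S')}.data"
        # Record callchains system-wide for 10 s while `sleep 10` runs.
        subprocess.run(["perf", "record", "-a", "-g", "-o", out,
                        "sleep", "10"], check=False)
        print("captured", out)

Each capture can then be examined after the fact with perf report -i on the saved file.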
From: Ben Gamari on 27 Mar 2010 21:30
Hey all,

I have posted another profile[1] from an incident yesterday. As you can see, both swapper and init (strange?) show up prominently in the profile. Moreover, most processes seem to be in blk_peek_request a disturbingly large percentage of the time. Both of these profiles were taken with 2.6.34-rc kernels.

Anyone have any ideas on how to proceed? Is more profile data necessary? Are the existing profiles at all useful?

Thanks,

- Ben
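The thread trails off here without a resolution. Given how much time the profiles show in blk_peek_request, one natural follow-up measurement would be the number of requests actually outstanding at the block layer during a stall. The sketch below is an assumption, not something posted to the list: it reads /sys/block/<dev>/inflight, which on kernels of roughly this vintage and later reports the read and write requests issued to the device and not yet completed. The device name sda is a placeholder.

#!/usr/bin/env python3
# Sketch: log per-device in-flight request counts during a stall.
# /sys/block/<dev>/inflight holds two numbers: reads and writes
# currently issued to the device and not yet completed.
import time

DEVICES = ["sda"]  # placeholder; list the block devices under test

while True:
    stamp = time.strftime("%H:%M:%S")
    for dev in DEVICES:
        with open(f"/sys/block/{dev}/inflight") as f:
            reads, writes = f.read().split()
        print(f"{stamp} {dev}: {reads} reads, {writes} writes in flight")
    time.sleep(1)

Correlating a persistently deep in-flight count with the stall windows would at least distinguish a saturated device from time lost elsewhere in the block layer.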