Prev: nvidia controller failed command, possibly related to SMART selftest (2.6.32)
Next: powernow-k8: Core Performance Boost and effective frequency support
From: Tejun Heo on 26 Apr 2010 06:10 Hello, On 04/26/2010 09:22 AM, Pekka Enberg wrote: > On Mon, Apr 26, 2010 at 9:59 AM, Zhang, Yanmin > <yanmin_zhang(a)linux.intel.com> wrote: >>> I haven't been able to reproduce this either on my Core 2 machine. >> Mostly, the regression exists on Nehalem machines. I suspect it's related to >> hyper-threading machine. > > OK, so does anyone know why hyper-threading would change things for > the per-CPU allocator? My wild speculation is that previously the cpu_slub structures of two neighboring threads ended up on the same cacheline by accident thanks to the back to back allocation. W/ the percpu allocator, this no longer would happen as the allocator groups percpu data together per-cpu. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Pekka Enberg on 26 Apr 2010 06:10 Hi, On Mon, Apr 26, 2010 at 9:59 AM, Zhang, Yanmin <yanmin_zhang(a)linux.intel.com> wrote: >>>> I haven't been able to reproduce this either on my Core 2 machine. >>> Mostly, the regression exists on Nehalem machines. I suspect it's related to >>> hyper-threading machine. On 04/26/2010 09:22 AM, Pekka Enberg wrote: >> OK, so does anyone know why hyper-threading would change things for >> the per-CPU allocator? On Mon, Apr 26, 2010 at 1:02 PM, Tejun Heo <tj(a)kernel.org> wrote: > My wild speculation is that previously the cpu_slub structures of two > neighboring threads ended up on the same cacheline by accident thanks > to the back to back allocation. �W/ the percpu allocator, this no > longer would happen as the allocator groups percpu data together > per-cpu. Yanmin, do we see a lot of remote frees for your hackbench run? IIRC, it's the "deactivate_remote_frees" stat when CONFIG_SLAB_STATS is enabled. Pekka -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Tejun Heo on 26 Apr 2010 07:00 On 04/26/2010 12:09 PM, Pekka Enberg wrote: >> My wild speculation is that previously the cpu_slub structures of two >> neighboring threads ended up on the same cacheline by accident thanks >> to the back to back allocation. W/ the percpu allocator, this no >> longer would happen as the allocator groups percpu data together >> per-cpu. > > Yanmin, do we see a lot of remote frees for your hackbench run? IIRC, > it's the "deactivate_remote_frees" stat when CONFIG_SLAB_STATS is > enabled. I'm not familiar with the details or scales here so please take whatever I say with a grain of salt. For hyperthreading configuration I think operations don't have to be remote to be affected. If the data for cpu0 and cpu1 were on the same cache line, and cpu0 and cpu1 are occupying the same physical core thus sharing all the resources it would benefit from the sharing whether any operation was remote or not as it saves the physical processor one cache line. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Pekka J Enberg on 26 Apr 2010 10:20 On 04/26/2010 12:09 PM, Pekka Enberg wrote: >>> My wild speculation is that previously the cpu_slub structures of two >>> neighboring threads ended up on the same cacheline by accident thanks >>> to the back to back allocation. W/ the percpu allocator, this no >>> longer would happen as the allocator groups percpu data together >>> per-cpu. >> >> Yanmin, do we see a lot of remote frees for your hackbench run? IIRC, >> it's the "deactivate_remote_frees" stat when CONFIG_SLAB_STATS is >> enabled. On Mon, 26 Apr 2010, Tejun Heo wrote: > I'm not familiar with the details or scales here so please take > whatever I say with a grain of salt. For hyperthreading configuration > I think operations don't have to be remote to be affected. If the > data for cpu0 and cpu1 were on the same cache line, and cpu0 and cpu1 > are occupying the same physical core thus sharing all the resources it > would benefit from the sharing whether any operation was remote or not > as it saves the physical processor one cache line. Even if the cacheline is dirtied like in the struct kmem_cache_cpu case? If that's the case, don't we want the per-CPU allocator to support back to back allocation for cores that are in the same package? Btw, I focused on remote frees initially before I understood what you actually meant and scetched the following untested patch to take advantage of the fact that struct kmem_cache_cpu doesn't fill a whole cache line. It tries amortize remote free costs by "queuing" objects. It would be interesting to see if it helps here (or in the other SLUB regressions like netperf and the famous Intel one). Pekka diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index 0249d41..b554a67 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -34,10 +34,14 @@ enum stat_item { ORDER_FALLBACK, /* Number of times fallback was necessary */ NR_SLUB_STAT_ITEMS }; +#define SLUB_MAX_NR_REMOTES 5 + struct kmem_cache_cpu { void **freelist; /* Pointer to first free per cpu object */ struct page *page; /* The slab from which we are allocating */ int node; /* The node of the page (or -1 for debug) */ + int nr_remotes; /* Number of remotely free'd objects */ + void *remotelist[SLUB_MAX_NR_REMOTES]; /* List of remotely free'd objects */ #ifdef CONFIG_SLUB_STATS unsigned stat[NR_SLUB_STAT_ITEMS]; #endif diff --git a/mm/slub.c b/mm/slub.c index 7d6c8b1..e8e5523 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1480,6 +1480,24 @@ static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) unfreeze_slab(s, page, tail); } +static void __slab_free(struct kmem_cache *s, struct page *page, void *x, unsigned long addr); + +static void flush_remotelist(struct kmem_cache *s, struct kmem_cache_cpu *c) +{ + int i; + + for (i = 0; i < c->nr_remotes; i++) { + struct page *page; + void *x; + + x = c->remotelist[i]; + page = virt_to_head_page(x); + + __slab_free(s, page, x, _RET_IP_); + } + c->nr_remotes = 0; +} + static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) { stat(s, CPUSLAB_FLUSH); @@ -1496,7 +1514,12 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu); - if (likely(c && c->page)) + if (unlikely(!c)) + return; + + flush_remotelist(s, c); + + if (likely(c->page)) flush_slab(s, c); } @@ -1709,6 +1732,8 @@ static __always_inline void *slab_alloc(struct kmem_cache *s, local_irq_save(flags); c = __this_cpu_ptr(s->cpu_slab); + if (unlikely(c->nr_remotes == SLUB_MAX_NR_REMOTES)) + flush_remotelist(s, c); object = c->freelist; if (unlikely(!object || !node_match(c, node))) @@ -1865,8 +1890,12 @@ static __always_inline void slab_free(struct kmem_cache *s, set_freepointer(s, object, c->freelist); c->freelist = object; stat(s, FREE_FASTPATH); - } else - __slab_free(s, page, x, addr); + } else { + if (unlikely(c->nr_remotes == SLUB_MAX_NR_REMOTES)) + flush_remotelist(s, c); + + c->remotelist[c->nr_remotes++] = x; + } local_irq_restore(flags); } -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Tejun Heo on 26 Apr 2010 10:40
Hello, Pekka. On 04/26/2010 04:17 PM, Pekka J Enberg wrote: > Even if the cacheline is dirtied like in the struct kmem_cache_cpu case? If my hypothesis is the case, I don't think dirtying or not would matter. It's about two cpus sharing a cache line which usually is a bad idea but in this case happens to be a good idea because the two cpus sit on the same cache. > If that's the case, don't we want the per-CPU allocator to support back > to back allocation for cores that are in the same package? I think it's probably gonna be an over-engineering effort. W/ percpu allocator the rest of the cacheline would likely be occupied by another percpu item for the cpu, so it's not really wasted. It's just used differently. It would be good if we have a way to better pack small hot ones (for the same cpu) into the same cachelines but I don't think it would be wise to interleave stuff from different cpus. It's not like there's only single way to save a cacheline after all. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |