From: Andrea Arcangeli on 3 Aug 2010 12:10

Transparent Hugepage Support worst case macro benchmark: kernel build.

host: 24-way SMP (12 cores, 2 sockets) 16G RAM
guest: 24-way SMP (24 vcpus) 15G RAM

- Same kernel in guest and host: aa.git tag THP-29 (2.6.35 based)
  http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=summary
  (see linux-mm for more info)
- Same gcc (and userland) patched with the patch at the end.
- Same kernel source, same .config, on tmpfs (tmpfs only eliminates
  the measurement error across different runs but ext4 leads to the
  same average results).
- khugepaged default settings (khugepaged taking 0% CPU)
- no glibc align tweak (that would improve performance a little
  further with THP always)

The measurement also includes the "make clean", and the full "make
-j32" that includes lots of other time consuming operations not getting
any benefit from transparent hugepages. If this were pure "gcc" the
percentage speedup would be much higher than this. This is a real-life
workload that we run on a daily basis, not a microbenchmark at all.
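The "THP always" and "THP never" labels used in the results refer to the system-wide transparent hugepage mode. Here is a minimal sketch for checking which mode is active before a run. It assumes the mode is exposed through a sysfs file whose active choice is shown in square brackets (in mainline kernels this is /sys/kernel/mm/transparent_hugepage/enabled; the path in a given tree may differ). The helper name thp_active_mode is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the bracketed token of a line such as
   "[always] madvise never" into out (NUL-terminated).
   Returns 0 on success, -1 if there is no bracketed token or it
   does not fit in outlen bytes. */
int thp_active_mode(const char *line, char *out, size_t outlen)
{
    const char *open, *close;
    size_t len;

    assert(line != NULL && out != NULL);

    open = strchr(line, '[');
    if (open == NULL)
        return -1;
    close = strchr(open + 1, ']');
    if (close == NULL)
        return -1;

    len = (size_t) (close - open - 1);
    if (len + 1 > outlen)
        return -1;

    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```

In use, the first line of the sysfs file would be read with fgets() and passed as line; a result of "always" or "never" then corresponds to the two configurations compared here.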
Kernel build on bare metal (note the dTLB-load-misses):

====== build ======
#!/bin/bash
make clean >/dev/null; make -j32 >/dev/null
===================
perf stat -e cycles -e instructions -e dtlb-loads -e dtlb-load-misses --repeat 3 ./build
===================

THP always host (fastest base result)

 Performance counter stats for './build' (3 runs):

    4420734012848  cycles                                   ( +-  0.007% )
    2692414418384  instructions        #   0.609 IPC        ( +-  0.000% )
     696638665612  dTLB-loads                               ( +-  0.001% )
       2982343758  dTLB-load-misses                         ( +-  0.051% )

     83.855147696  seconds time elapsed                     ( +-  0.058% )

THP never host (slowdown 4.06%)

 Performance counter stats for './build' (3 runs):

    4599325985460  cycles                                   ( +-  0.013% )
    2747874065083  instructions        #   0.597 IPC        ( +-  0.000% )
     710631792376  dTLB-loads                               ( +-  0.000% )
       4425816093  dTLB-load-misses                         ( +-  0.039% )

     87.260443531  seconds time elapsed                     ( +-  0.075% )

Kernel build on KVM powered guest:

=======
time (make clean; make -j32) >/dev/null
=======

THP always guest, EPT on + THP always host (slowdown 5.67%)

NOTE: the total KVM virtualization slowdown for the kernel build with
THP always in guest and host compared with bare metal with THP never
(like current upstream) is only 1.54%.

real 1m28.612s -> 88.612 seconds
user 26m13.862s
sys 2m11.376s

THP never guest, EPT on + THP always host (slowdown 12.71%)

real 1m34.516s -> 94.516 seconds
user 26m52.929s
sys 3m35.509s

THP never guest, EPT on + THP never host (slowdown 24.81%)

real 1m44.663s -> 104.663 seconds
user 28m13.382s
sys 5m39.373s

THP always guest, EPT off + THP always host (slowdown 198.33%)

real 4m10.166s -> 250.166 seconds
user 41m5.674s
sys 47m37.671s

THP never guest, EPT off + THP always host (slowdown 254.43%)

real 4m57.211s -> 297.211 seconds
user 53m44.302s
sys 53m21.600s

THP never guest, EPT off + THP never host (slowdown 260.15%)

real 5m2.006s -> 302.006 seconds
user 53m25.876s
sys 53m32.649s

This is trivial to reproduce, you can try yourself with aa.git and the
below gcc patch.
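The slowdown percentages quoted above are consistent with taking the fastest bare-metal run (THP always on the host, 83.855 seconds) as the baseline. Here is a minimal sketch of that arithmetic, together with the dTLB miss rate implied by the perf counters. The helper names slowdown_pct and dtlb_miss_pct are hypothetical and not part of any tool used in the benchmark.

```c
#include <assert.h>

/* Percentage slowdown of `elapsed` relative to `baseline`
   (both in seconds): (elapsed / baseline - 1) * 100. */
double slowdown_pct(double elapsed, double baseline)
{
    assert(baseline > 0.0);
    return (elapsed / baseline - 1.0) * 100.0;
}

/* dTLB load misses expressed as a percentage of dTLB loads. */
double dtlb_miss_pct(double misses, double loads)
{
    assert(loads > 0.0);
    return misses / loads * 100.0;
}
```

For example, slowdown_pct(87.260443531, 83.855147696) gives roughly 4.06, matching the "THP never host" figure. Applying dtlb_miss_pct to the two bare-metal runs gives roughly 0.43% of loads missing with THP always against roughly 0.62% with THP never.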
The results are similar to what I got with NPT some time ago with a
less complete benchmark (no gcc patched in guest) and only 4 cores.

After this worst case macro benchmark, I'll go ahead with more optimal
benchmarks (with bigger memory footprint and longer-living tasks not
quitting so fast and not including make -j32/make clean etc...) and I
expect more pronounced speedups, like the qemu-kvm translate.o gcc build
(the file with the automatically generated .c source for JIT emulation),
Java etc...

And I'll include these results and all other results I'll be getting,
in my KVM Forum 2010 talk on Transparent Hugepage Support next week in
Boston.

Thanks!
Andrea

--- /var/tmp/portage/sys-devel/gcc-4.4.2/work/gcc-4.4.2/gcc/ggc-page.c 2008-07-28 16:33:56.000000000 +0200
+++ /tmp/gcc-4.4.2/gcc/ggc-page.c 2010-04-25 06:01:32.829753566 +0200
@@ -450,6 +450,11 @@
 #define BITMAP_SIZE(Num_objects) \
   (CEIL ((Num_objects), HOST_BITS_PER_LONG) * sizeof(long))
 
+#ifdef __x86_64__
+#define HPAGE_SIZE (2*1024*1024)
+#define GGC_QUIRE_SIZE 512
+#endif
+
 /* Allocate pages in chunks of this size, to throttle calls to memory
    allocation routines. The first page is used, the rest go onto the
    free list. This cannot be larger than HOST_BITS_PER_INT for the
@@ -457,7 +462,7 @@
    can override this by defining GGC_QUIRE_SIZE explicitly. */
 #ifndef GGC_QUIRE_SIZE
 # ifdef USING_MMAP
-# define GGC_QUIRE_SIZE 256
+# define GGC_QUIRE_SIZE 512
 # else
 # define GGC_QUIRE_SIZE 16
 # endif
@@ -654,6 +659,23 @@
 #ifdef HAVE_MMAP_ANON
   char *page = (char *) mmap (pref, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+#ifdef HPAGE_SIZE
+  if (!(size & (HPAGE_SIZE-1)) &&
+      page != (char *) MAP_FAILED && (size_t) page & (HPAGE_SIZE-1)) {
+    char *old_page;
+    munmap(page, size);
+    page = (char *) mmap (pref, size + HPAGE_SIZE-1,
+                          PROT_READ | PROT_WRITE,
+                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+    old_page = page;
+    page = (char *) (((size_t)page + HPAGE_SIZE-1)
+                     & ~(HPAGE_SIZE-1));
+    if (old_page != page)
+      munmap(old_page, page-old_page);
+    if (page != old_page + HPAGE_SIZE-1)
+      munmap(page+size, old_page+HPAGE_SIZE-1-page);
+  }
+#endif
 #endif
 #ifdef HAVE_MMAP_DEV_ZERO
   char *page = (char *) mmap (pref, size, PROT_READ | PROT_WRITE,
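The core of the gcc patch is obtaining an anonymous mapping that starts on a 2M boundary, so the allocator's chunks can be backed by transparent hugepages. Here is a standalone sketch of that over-allocate-and-trim technique. It is not the patch itself: it always over-allocates (by a full HPAGE_SIZE rather than HPAGE_SIZE-1), checks the mmap result before using it, and passes no placement hint. The helper name mmap_hpage_aligned is hypothetical, and the 2 MiB huge page size is an assumption that holds for x86-64.

```c
#define _DEFAULT_SOURCE           /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HPAGE_SIZE ((size_t) 2 * 1024 * 1024)   /* assumed: x86-64 2 MiB huge page */

/* Hypothetical helper: map `size` bytes of anonymous memory starting at
   a HPAGE_SIZE-aligned address. `size` must be a nonzero multiple of
   HPAGE_SIZE. Returns MAP_FAILED on error. */
void *mmap_hpage_aligned(size_t size)
{
    char *raw, *aligned;
    size_t head, tail;

    assert(size != 0 && (size & (HPAGE_SIZE - 1)) == 0);

    /* Over-allocate by one huge page so that an aligned start address
       is guaranteed to exist inside the mapping. */
    raw = mmap(NULL, size + HPAGE_SIZE, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return MAP_FAILED;

    /* Round the start up to the next huge page boundary. */
    aligned = (char *) (((uintptr_t) raw + HPAGE_SIZE - 1)
                        & ~((uintptr_t) HPAGE_SIZE - 1));

    /* Give back the unaligned head and the leftover tail. */
    head = (size_t) (aligned - raw);
    tail = HPAGE_SIZE - head;     /* always nonzero since head < HPAGE_SIZE */
    if (head != 0)
        munmap(raw, head);
    munmap(aligned + size, tail);

    return aligned;
}
```

A caller would request, for example, mmap_hpage_aligned(2 * HPAGE_SIZE) and later release the region with munmap() on the returned address and the same size. Always over-allocating costs one extra munmap in the common case; the patch instead tries a plain mmap first and only retries when the result is misaligned.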