[X86] performance improvement for memcpy_64.S by fast string. [Kernel]

From: Andi Kleen on 9 Nov 2009 14:00

> Ling's numbers didn't seem to show a significant slowdown on Core 2 (it
> was something like 0.95x baseline in the worst case, and most of the
> cases were positive) so Core 2 doesn't seem to have a problem.

I ran quite a lot of micro benchmarks with various alignments and sizes,
and the 'q' variant was not always a win. I haven't checked that particular
version though.

There's also K8 of course.

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Cyrill Gorcunov on 11 Nov 2009 15:50

On Wed, Nov 11, 2009 at 03:05:34PM +0800, Ma, Ling wrote:
> Hi All
> Please use memcpy.c (cc -o memcpy memcpy.c -O2) to test more cases,
> if you are interested. In this program we made a simple modification
> to the memcpy_new function.
>
> Thanks
> Ling

Just my 0.2$ :)

-- Cyrill
---
                                 memcpy_orig  memcpy_new
TPT: Len  1024, alignment 8/ 0:          490         570
TPT: Len  2048, alignment 8/ 0:          826         329
TPT: Len  3072, alignment 8/ 0:          441         464
TPT: Len  4096, alignment 8/ 0:          579         596
TPT: Len  5120, alignment 8/ 0:          723         729
TPT: Len  6144, alignment 8/ 0:          859         861
TPT: Len  7168, alignment 8/ 0:          996         994
TPT: Len  8192, alignment 8/ 0:         1165        1127
TPT: Len  9216, alignment 8/ 0:         1273        1260
TPT: Len 10240, alignment 8/ 0:         1402        1395
TPT: Len 11264, alignment 8/ 0:         1543        1525
TPT: Len 12288, alignment 8/ 0:         1682        1659
TPT: Len 13312, alignment 8/ 0:         1869        1815
TPT: Len 14336, alignment 8/ 0:         1982        1951
TPT: Len 15360, alignment 8/ 0:         2185        2110
---

I've run this test a few times and the results are almost the same:
at lengths 1024, 3072, 4096, 5120 and 6144 the new version is a bit slower.
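
To put the table above on the same footing as the "0.95x"-style speedup
figures quoted elsewhere in this thread, the two columns can be turned into
a ratio. The sketch below assumes the TPT values are costs (lower is
better), which matches the remark that the new version is slower where its
number is larger. The helper names speedup() and print_speedup() are
hypothetical and are not part of memcpy.c.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of memcpy_new relative to memcpy_orig.
 * Assumption: both arguments are costs, so a result above 1.0 means
 * the new routine is faster. */
double speedup(unsigned orig_cost, unsigned new_cost)
{
    assert(new_cost != 0);
    return (double)orig_cost / (double)new_cost;
}

/* Print one row as "length: ratio", e.g. for a line of the table. */
void print_speedup(unsigned len, unsigned orig_cost, unsigned new_cost)
{
    printf("%5u: %.2fx\n", len, speedup(orig_cost, new_cost));
}
```

For example, print_speedup(1024, 490, 570) reports a ratio below 1.0 (a
slowdown) and print_speedup(15360, 2185, 2110) a ratio slightly above 1.0.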

---
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz
stepping : 6
cpu MHz : 800.000
cache size : 3072 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm ida tpr_shadow vnmi flexpriority
bogomips : 4189.60
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz
stepping : 6
cpu MHz : 800.000
cache size : 3072 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm ida tpr_shadow vnmi flexpriority
bogomips : 4189.46
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:


From: Pavel Machek on 12 Nov 2009 18:10

On Mon 2009-11-09 15:24:03, Ma, Ling wrote:
> Hi All
>
> Today we ran our benchmark on Core2 and Sandy Bridge:
>
> 1. Retrieve result on Core2
> Speedup on Core2
> Len Alignment Speedup
> 1024, 0/ 0: 0.95x
> 2048, 0/ 0: 1.03x

Well, so you are running cache hot and it is only a win on huge
copies... how common are those?

> Application run through perf
> for (i = 1024; i < 1024 * 16; i = i + 64)
>         do_memcpy(0, 0, i);
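
For reference, the quoted loop can be sketched as a small user-space timing
harness. Everything below is an assumption made for illustration: the real
do_memcpy() from memcpy.c is not shown in this thread, so the stand-in takes
(source offset, destination offset, length) and calls the C library memcpy.
Because the same two buffers are reused on every iteration, a loop of this
shape measures the cache-hot case.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define MAX_LEN (1024 * 16)

static unsigned char src_buf[MAX_LEN + 64];
static unsigned char dst_buf[MAX_LEN + 64];

/* Fill the source with a simple byte pattern so copies can be checked. */
void fill_source(void)
{
    for (size_t i = 0; i < sizeof(src_buf); i++)
        src_buf[i] = (unsigned char)i;
}

/* Hypothetical stand-in for do_memcpy(); the argument order
 * (source offset, destination offset, length) is an assumption. */
void do_memcpy(size_t src_off, size_t dst_off, size_t len)
{
    assert(src_off < 64 && dst_off < 64 && len <= MAX_LEN);
    memcpy(dst_buf + dst_off, src_buf + src_off, len);
}

/* Read back one destination byte; this also keeps the copies observable. */
unsigned char dst_byte(size_t idx)
{
    assert(idx < sizeof(dst_buf));
    return dst_buf[idx];
}

/* Run the quoted loop once and return the elapsed time in nanoseconds. */
uint64_t run_sweep(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 1024; i < 1024 * 16; i = i + 64)
        do_memcpy(0, 0, i);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    int64_t ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
               + (int64_t)(t1.tv_nsec - t0.tv_nsec);
    return (uint64_t)ns;
}
```

Calling fill_source() and then run_sweep() gives one timing sample; a
cold-cache measurement would need a different setup, for instance a working
set much larger than the last-level cache.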

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

From: Ma, Ling on 13 Nov 2009 00:40

>Well, so you are running cache hot and it is only a win on huge
>copies... how common are those?
>
Hi Pavel Machek
Yes, we intend to use movsq for large cache-hot copies (over 1024 bytes)
and to avoid a regression for copies under 1024 bytes. I guess you are
suggesting the prefetch instruction for cold data (please correct me if I
am wrong). memcpy doesn't know whether the data is already in cache, so
prefetch only helps when the copy size is over (first-level cache)/2 and
under (last-level cache)/2. Currently the first-level cache of most CPUs
is around 32KB, so prefetch is useful when the copy size is over 16KB,
but as H. Peter Anvin mentioned in his last email, copies over 16KB are
rare in the kernel.
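
The size window described above can be written down as a small predicate.
This is only a sketch of the reasoning, not code from the patch: the name
prefetch_may_help() and its parameters are hypothetical, and the cache
sizes are passed in by the caller rather than detected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when a copy of len bytes falls between half the first-level
 * cache size and half the last-level cache size, the range in which
 * the argument above expects prefetch to pay off. */
bool prefetch_may_help(size_t len, size_t l1_size, size_t llc_size)
{
    assert(l1_size <= llc_size);
    return len > l1_size / 2 && len < llc_size / 2;
}
```

With a 32KB first-level cache and the 3072KB cache reported in the cpuinfo
dump earlier in the thread, the window is 16KB to 1536KB, so a 1024-byte
copy falls outside it and a 32KB copy falls inside.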

Thanks
Ling

