x86 rwsem optimization extreme [Kernel]

From: Linus Torvalds on 17 Feb 2010 17:20

On Wed, 17 Feb 2010, Zachary Amsden wrote:
>
> The x86 instruction set provides the ability to add an additional
> bit into addition or subtraction by using the carry flag.
> It also provides instructions to directly set or clear the
> carry flag. By forcibly setting the carry flag, we can then
> represent one particular 64-bit constant, namely
>
> 0xffffffff + 1 = 0x100000000
>
> using only 32-bit values. In particular we can optimize the rwsem
> write lock release by noting it is of exactly this form.

Don't do this.

Just shift the constants down by two, and suddenly you don't need any
clever tricks, because all the constants fit in 32 bits anyway,
regardless of sign issues.

So just change the

# define RWSEM_ACTIVE_MASK 0xffffffffL

line into

# define RWSEM_ACTIVE_MASK 0x3fffffffL

and you're done.
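
A minimal C sketch of why the smaller mask removes the need for the trick. Only RWSEM_ACTIVE_MASK and its two values come from this thread. The helper names are hypothetical, and the sketch assumes that the waiting bias is defined as -(mask) - 1 and that the write-lock release adds its negation, which is not spelled out here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the thread; the macro names themselves are hypothetical. */
#define MASK_CURRENT   INT64_C(0xffffffff)
#define MASK_PROPOSED  INT64_C(0x3fffffff)

/*
 * Assumption of this sketch: the waiting bias is derived from the
 * active mask as -(mask) - 1.
 */
static int64_t waiting_bias(int64_t active_mask)
{
    return -active_mask - 1;
}

/*
 * x86-64 arithmetic instructions take at most a 32-bit immediate,
 * which is sign-extended to 64 bits. A constant is directly usable
 * only if it lies in the signed 32-bit range.
 */
static bool fits_in_imm32(int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}
```

Under those assumptions, -waiting_bias(MASK_CURRENT) is 0x100000000, which fails fits_in_imm32(), while with MASK_PROPOSED both the bias (-0x40000000) and its negation (0x40000000) pass. Shifting down by only one bit would give a mask of 0x7fffffff, whose negated bias 0x80000000 still does not fit as a positive sign-extended immediate, which is presumably what "regardless of sign issues" refers to.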

The cost of 'adc' may happen to be identical in this case, but I suspect
you didn't test on UP, where the 'lock' prefix goes away. An unlocked
'add' tends to be faster than an unlocked 'adc'.

(It's possible that some micro-architectures don't care, since it's a
memory op, and they can see that 'C' is set. But it's a fragile assumption
that it would always be ok).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: H. Peter Anvin on 17 Feb 2010 17:40

On 02/17/2010 02:10 PM, Linus Torvalds wrote:
>
>
> On Wed, 17 Feb 2010, Zachary Amsden wrote:
>>
>> The x86 instruction set provides the ability to add an additional
>> bit into addition or subtraction by using the carry flag.
>> It also provides instructions to directly set or clear the
>> carry flag. By forcibly setting the carry flag, we can then
>> represent one particular 64-bit constant, namely
>>
>> 0xffffffff + 1 = 0x100000000
>>
>> using only 32-bit values. In particular we can optimize the rwsem
>> write lock release by noting it is of exactly this form.
>
> Don't do this.
>
> Just shift the constants down by two, and suddenly you don't need any
> clever tricks, because all the constants fit in 32 bits anyway,
> regardless of sign issues.
>

Why bother at all? I thought it mattered when I saw __downgrade_write()
as an inline, but in fact it is only ever used inside the
downgrade_write() out-of-line function, so we're talking about saving
*five bytes* across the whole kernel in the best case. I vote for
leaving it the way it is, and get the very slight extra readability.
There is no point in moving bits around, either.

-hpa

From: H. Peter Anvin on 17 Feb 2010 18:40

On 02/17/2010 02:10 PM, Linus Torvalds wrote:
>
> The cost of 'adc' may happen to be identical in this case, but I suspect
> you didn't test on UP, where the 'lock' prefix goes away. An unlocked
> 'add' tends to be faster than an unlocked 'adc'.
>
> (It's possible that some micro-architectures don't care, since it's a
> memory op, and they can see that 'C' is set. But it's a fragile assumption
> that it would always be ok).
>

FWIW, I don't know of any microarchitecture where adc is slower than
add, *as long as* the setup time for the CF flag is already used up.
However, as I already commented, I don't think this is worth it. This
inline appears to only be instantiated once, and as such, it takes a
whopping six bytes across the entire kernel.

-hpa


From: Zachary Amsden on 17 Feb 2010 20:10

>
> On 02/17/2010 02:10 PM, Linus Torvalds wrote:
>
>> The cost of 'adc' may happen to be identical in this case, but I suspect
>> you didn't test on UP, where the 'lock' prefix goes away. An unlocked
>> 'add' tends to be faster than an unlocked 'adc'.
>>
>> (It's possible that some micro-architectures don't care, since it's a
>> memory op, and they can see that 'C' is set. But it's a fragile assumption
>> that it would always be ok).
>>
>>
> FWIW, I don't know of any microarchitecture where adc is slower than
> add, *as long as* the setup time for the CF flag is already used up.
> However, as I already commented, I don't think this is worth it. This
> inline appears to only be instantiated once, and as such, it takes a
> whopping six bytes across the entire kernel.
>
>

Without the locks,

stc; adc %rdx, (%rax)

vs.

add %rdx, (%rax)

Shows no statistical difference on Intel.
On AMD, the first form is about twice as expensive.
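
For reference, the arithmetic the two sequences are meant to share can be modelled in plain C. This is only a sketch of the values involved, not of timing or of the resulting flags. The function names are hypothetical, and it assumes the 'adc' source holds the zero-extended 32-bit value 0xffffffff from the original patch description.

```c
#include <stdint.h>

/* Model of 'add src, dst' on a 64-bit destination (wraps modulo 2^64). */
static uint64_t add64(uint64_t dst, uint64_t src)
{
    return dst + src;
}

/* Model of 'adc src, dst': adds the source plus the carry flag (0 or 1). */
static uint64_t adc64(uint64_t dst, uint64_t src, unsigned carry_in)
{
    return dst + src + (carry_in & 1u);
}

/*
 * 'stc; adc': the carry flag is forced to 1, then 0xffffffff is added.
 * Assumption of this sketch: the source operand is zero-extended.
 */
static uint64_t stc_adc(uint64_t dst)
{
    return adc64(dst, UINT64_C(0xffffffff), 1);
}

/* Plain 'add' of the 64-bit constant the trick is meant to represent. */
static uint64_t plain_add(uint64_t dst)
{
    return add64(dst, UINT64_C(0x100000000));
}
```

For any starting value, stc_adc() and plain_add() return the same result, because 0xffffffff + 1 = 0x100000000. The difference measured above is purely one of instruction cost.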

Of course this is all completely useless, but it would not be if the locks
were inline (which is actually an askable question now). There was just so
much awesomeness going on with the 64-bit rwsem constructs I felt I had
to add even more awesomeness to the plate. For some definition of
awesomeness.

Zach

From: Linus Torvalds on 17 Feb 2010 21:00

On Wed, 17 Feb 2010, H. Peter Anvin wrote:
>
> FWIW, I don't know of any microarchitecture where adc is slower than
> add, *as long as* the setup time for the CF flag is already used up.

Oh, I think there are lots.

Look at just about any x86 latency/throughput table, and you'll see:

- adc latencies are typically much higher than a single cycle

But you are right that this is likely not an issue on any out-of-order
chip, since the 'stc' will schedule perfectly.

- but adc _throughput_ is also typically much higher, which indicates
that even if you do flag renaming, the 'adc' quite likely only
schedules in a single ALU unit.

For example, on a Pentium, adc/sbb can only go in the U pipe, and I think
the same is true of 'stc'. Now, nobody likely cares about Pentiums any
more, but the point is, 'adc' does often have constraints that a regular
'add' does not, and that is one example where a 'stc+adc' pair would at the
very least have to be scheduled with an instruction in between.

Linus
