From: H. Peter Anvin on
On 02/17/2010 05:53 PM, Linus Torvalds wrote:
>>
>> FWIW, I don't know of any microarchitecture where adc is slower than
>> add, *as long as* the setup time for the CF flag is already used up.
>
> Oh, I think there are lots.
>
> Look at just about any x86 latency/throughput table, and you'll see:
>
> - adc latencies are typically much higher than a single cycle
>
> But you are right that this is likely not an issue on any out-of-order
> chip, since the 'stc' will schedule perfectly.
>

STC actually tends to schedule poorly, since it has a partial register
stall. In-order or out-of-order doesn't really matter, though; what
matters is that the scoreboarding used for the flags has to settle, or
you will take a huge hit.
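
For readers following the add/adc discussion, here is a minimal portable sketch of the dependency in question: a two-limb addition in which the high limb cannot be computed until the carry out of the low limb is known. The type and function names (u128, u128_add) are hypothetical and not from the thread. A compiler targeting x86 may lower this pattern to an add/adc pair, but that is an assumption about code generation, not something the C code guarantees.

```c
#include <stdint.h>

/* Hypothetical two-limb (128-bit) unsigned value, low limb first. */
struct u128 {
    uint64_t lo;
    uint64_t hi;
};

/*
 * Add two 128-bit values limb by limb. The carry out of the low limb
 * feeds the high limb; this is the dependency that an x86 add/adc pair
 * expresses through the CF flag.
 */
static struct u128 u128_add(struct u128 a, struct u128 b)
{
    struct u128 r;
    uint64_t carry;

    r.lo = a.lo + b.lo;
    carry = r.lo < a.lo;        /* unsigned wraparound means carry out */
    r.hi = a.hi + b.hi + carry; /* must wait for the carry to settle */
    return r;
}
```

The high-limb addition has one more input than a plain add, which is the extra dependency on the flag state that the thread is discussing.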

> - but adc _throughput_ is also typically much higher, which indicates
> that even if you do flag renaming, the 'adc' quite likely only
> schedules in a single ALU unit.
>
> For example, on a Pentium, adc/sbb can only go in the U pipe, and I think
> the same is true of 'stc'. Now, nobody likely cares about Pentiums any
> more, but the point is, 'adc' does often have constraints that a regular
> 'add' does not, and this is an example where a 'stc+adc' pair would at the
> very least have to be scheduled with an instruction in between.

No doubt. I doubt it much matters in this context, but either way I
think the patch is probably a bad idea... for much the same reason as my
incl hack was - since the code isn't actually inline, saving a handful of
bytes is not the right tradeoff.

-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Zachary Amsden on
>
> On 02/17/2010 05:53 PM, Linus Torvalds wrote:
>
>> - but adc _throughput_ is also typically much higher, which indicates
>> that even if you do flag renaming, the 'adc' quite likely only
>> schedules in a single ALU unit.
>>
>> For example, on a Pentium, adc/sbb can only go in the U pipe, and I think
>> the same is true of 'stc'. Now, nobody likely cares about Pentiums any
>> more, but the point is, 'adc' does often have constraints that a regular
>> 'add' does not, and this is an example where a 'stc+adc' pair would at the
>> very least have to be scheduled with an instruction in between.
>>
> No doubt. I doubt it much matters in this context, but either way I
> think the patch is probably a bad idea... for much the same reason as my
> incl hack was - since the code isn't actually inline, saving a handful of
> bytes is not the right tradeoff.
>
> -hpa
>
>

Incidentally, the cost of putting all the rwsem code inline, using the
straightforward approach, for git-tip, using defconfig on x86_64 is 3565
bytes / 20971778 bytes total, or about 0.017%, using gcc 4.4.3.

That's small enough to actually consider it.
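
The ratio can be rechecked from the two byte counts given above. A minimal sketch, with a hypothetical helper name (size_increase_percent) that is not from the thread:

```c
/*
 * Size increase as a percentage of the total image size.
 * Hypothetical helper for rechecking the figures quoted in the thread.
 */
static double size_increase_percent(unsigned long added_bytes,
                                    unsigned long total_bytes)
{
    return 100.0 * (double)added_bytes / (double)total_bytes;
}
```

Calling size_increase_percent(3565UL, 20971778UL) gives the percentage for the defconfig build described here.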

Even smaller if you leave trylock as a function... actually no, that
didn't work, size increased. I'm guessing many call sites also end up
calling the explicit form as a fallback.

If you inline only read_lock functions and write release, nope, that
didn't work either.

If you inline only read_lock functions, that still isn't it. Many other
permutations are possible, but I've wasted enough time.

Although, with a more clever inline implementation, if some of the
constraints to %rdx go away...

Zach
From: Andi Kleen on
Zachary Amsden <zamsden(a)redhat.com> writes:
>
> Incidentally, the cost of putting all the rwsem code inline, using the
> straightforward approach, for git-tip, using defconfig on x86_64 is
> 3565 bytes / 20971778 bytes total, or about 0.017%, using gcc 4.4.3.

The nice advantage of putting lock code inline is that it gets
accounted to the caller in all profilers.

-Andi

--
ak(a)linux.intel.com -- Speaking for myself only.
From: Zachary Amsden on
>
> Zachary Amsden <zamsden(a)redhat.com> writes:
>> Incidentally, the cost of putting all the rwsem code inline, using the
>> straightforward approach, for git-tip, using defconfig on x86_64 is
>> 3565 bytes / 20971778 bytes total, or about 0.017%, using gcc 4.4.3.
>>
> The nice advantage of putting lock code inline is that it gets
> accounted to the caller in all profilers.
>
> -Andi
>
>

Unfortunately, only for the uncontended case. The hot case still ends
up in a call to the lock text section.
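
The split described here can be sketched as an inline fast path plus an out-of-line slow path. This is a toy illustration using C11 atomics, not the kernel's rwsem code: the names (toy_rwsem, toy_down_read, toy_down_read_slow, toy_slow_calls) are hypothetical, the "negative count means contention" convention is an assumption made for the example, and the slow path only counts calls instead of queueing and sleeping. The noinline attribute is a GCC/Clang extension.

```c
#include <stdatomic.h>

/* Hypothetical toy lock word: negative means a writer is involved. */
struct toy_rwsem {
    atomic_long count;
};

/* Demo-only counter of slow-path entries (not thread-safe). */
static long toy_slow_calls;

/*
 * Out-of-line slow path. Because this is a separate function, profiler
 * samples taken here are attributed to it rather than to the caller.
 * A real implementation would queue the task and sleep.
 */
__attribute__((noinline))
static void toy_down_read_slow(struct toy_rwsem *sem)
{
    (void)sem;
    toy_slow_calls++;
}

/*
 * Inline fast path. In the uncontended case this is just one atomic
 * add in the caller's body, so its cost is accounted to the caller.
 */
static inline void toy_down_read(struct toy_rwsem *sem)
{
    long old = atomic_fetch_add_explicit(&sem->count, 1,
                                         memory_order_acquire);
    if (old < 0)
        toy_down_read_slow(sem);
}
```

With this shape, only the uncontended acquisitions show up under the calling function in a profile; time spent under contention still lands in the separate slow-path function.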

Zach
From: Andi Kleen on
On Wed, Feb 17, 2010 at 10:24:58PM -1000, Zachary Amsden wrote:
>>
>> Zachary Amsden <zamsden(a)redhat.com> writes:
>>> Incidentally, the cost of putting all the rwsem code inline, using the
>>> straightforward approach, for git-tip, using defconfig on x86_64 is
>>> 3565 bytes / 20971778 bytes total, or about 0.017%, using gcc 4.4.3.
>>>
>> The nice advantage of putting lock code inline is that it gets
>> accounted to the caller in all profilers.
>>
>> -Andi
>>
>>
>
> Unfortunately, only for the uncontended case. The hot case still ends up
> in a call to the lock text section.

I removed those some time ago because they break unwinding.
Did that get undone?

-Andi
--
ak(a)linux.intel.com -- Speaking for myself only.