From: Denys Vlasenko on
On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
> The disadvantage is that the proposed function is 2.5-3.5 bigger.
> Those are not big functions though -- we are talking here about
> proposed function being below 512.

It's a slippery slope. Here's where it ends: glibc
has memcpy() function which is "only" 8k of code or so.
I'm not joking.



> +#if BITS_PER_LONG == 64
> +
....
> +#else
....
> +/*
> + * Based on code by Douglas W. Jones found at
> + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>. This
> + * performs no 64-bit division and hence should be faster on 32-bit
> + * machines then the version of the function above.
> + */
> +static noinline_for_stack
> +char *put_dec(char *buf, unsigned long long n)
> +{
> + uint32_t d3, d2, d1, q;
> +
> + if (!n) {
> + *buf++ = '0';
> + return buf;
> + }
> +
> + d1 = (n >> 16) & 0xFFFF;
> + d2 = (n >> 32) & 0xFFFF;
> + d3 = (n >> 48) & 0xFFFF;

Are you assuming that sizeof(long long) == 8, always?

--
vda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michal Nazarewicz on
Denys Vlasenko <vda.linux(a)googlemail.com> writes:

> On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
>> The disadvantage is that the proposed function is 2.5-3.5 bigger.
>> Those are not big functions though -- we are talking here about
>> proposed function being below 512.

> It's a slippery slope. Here's where it ends: glibc
> has memcpy() function which is "only" 8k of code or so.
> I'm not joking.

I'm aware of that. I assume that someone more clever then me will
decide whether to accept this patch or not. (Also we win a few bytes on
put_dec_full() and put_dec_8bit()). :P

>> +#if BITS_PER_LONG == 64
>> +
> ...
>> +#else
> ...
>> +/*
>> + * Based on code by Douglas W. Jones found at
>> + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>. This
>> + * performs no 64-bit division and hence should be faster on 32-bit
>> + * machines then the version of the function above.
>> + */
>> +static noinline_for_stack
>> +char *put_dec(char *buf, unsigned long long n)
>> +{
>> + uint32_t d3, d2, d1, q;
>> +
>> + if (!n) {
>> + *buf++ = '0';
>> + return buf;
>> + }
>> +
>> + d1 = (n >> 16) & 0xFFFF;
>> + d2 = (n >> 32) & 0xFFFF;
>> + d3 = (n >> 48) & 0xFFFF;
>
> Are you assuming that sizeof(long long) == 8, always?

Well... yes. C requires long long to be at least 64-bit and I don't
see it being larger in any foreseeable feature. Wouldn't it be enough
to put a static assert here?

--
Best regards, _ _
.o. | Liege of Serenly Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Denys Vlasenko on
On Friday 06 August 2010 09:08, Michal Nazarewicz wrote:
> >> +static noinline_for_stack
> >> +char *put_dec(char *buf, unsigned long long n)
> >> +{
> >> + uint32_t d3, d2, d1, q;
> >> +
> >> + if (!n) {
> >> + *buf++ = '0';
> >> + return buf;
> >> + }
> >> +
> >> + d1 = (n >> 16) & 0xFFFF;
> >> + d2 = (n >> 32) & 0xFFFF;
> >> + d3 = (n >> 48) & 0xFFFF;
> >
> > Are you assuming that sizeof(long long) == 8, always?
>
> Well... yes. C requires long long to be at least 64-bit and I don't
> see it being larger in any foreseeable feature.

"640k will be enough for everybody"?

> Wouldn't it be enough to put a static assert here?

I'd prefer the code which works with arbitrarily wide long long.
If needed, use

if (sizeof(long long) == 8)
64-bit code
else
generic code

--
vda

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michał Nazarewicz on
On Fri, 06 Aug 2010 09:35:26 +0200, Denys Vlasenko <vda.linux(a)googlemail.com> wrote:

> On Friday 06 August 2010 09:08, Michal Nazarewicz wrote:
>> >> +static noinline_for_stack
>> >> +char *put_dec(char *buf, unsigned long long n)
>> >> +{
>> >> + uint32_t d3, d2, d1, q;
>> >> +
>> >> + if (!n) {
>> >> + *buf++ = '0';
>> >> + return buf;
>> >> + }
>> >> +
>> >> + d1 = (n >> 16) & 0xFFFF;
>> >> + d2 = (n >> 32) & 0xFFFF;
>> >> + d3 = (n >> 48) & 0xFFFF;
>> >
>> > Are you assuming that sizeof(long long) == 8, always?
>>
>> Well... yes. C requires long long to be at least 64-bit and I don't
>> see it being larger in any foreseeable feature.
>
> "640k will be enough for everybody"?
>
>> Wouldn't it be enough to put a static assert here?
>
> I'd prefer the code which works with arbitrarily wide long long.
> If needed, use
>
> if (sizeof(long long) == 8)
> 64-bit code
> else
> generic code

Thanks. That seems like a perfect solution. I rearrange the code
and try to post updated version after the weekend.

--
Best regards, _ _
| Humble Liege of Serenely Enlightened Majesty of o' \,=./ `o
| Computer Science, Michał "mina86" Nazarewicz (o o)
+----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Denys Vlasenko on
On Sunday 08 August 2010 21:29, Michal Nazarewicz wrote:
> Compared to previous version: the code is used only if:
> 1. if long long is 64-bit (ie. ULLONG_MAX == 2**64-1), and
> 2. user did not select optimisation for size with Kconfig.

I measured the size and it does not seem to make sense
to exclude it on -Os. On x86:

put_dec_full change: 0x93 -> 0x47 bytes
put_dec change: 0x12c -> 0x137 bytes

IOW, there is net code size reduction (compared to current kernel,
it may be a slight growth compared to patch 1).

So, please use the optimized code even for CONFIG_CC_OPTIMIZE_FOR_SIZE.


> Here are the results (normalised to the fastest/smallest):
> : ARM Atom
> -- Speed ----------------------------------
> orig_put_dec : 9.333822 2.083110 Original
> mod1_put_dec : 9.282045 1.904564
> mod2_put_dec : 9.260409 1.910302
> mod3_put_dec : 9.320053 1.905689 Proposed by previous patch
> mod4_put_dec : 9.297146 1.933971
> mod5_put_dec : 13.034318 2.434942
> mod6_put_dec : 1.000000 1.000000 Proposed by this patch
> mod7_put_dec : 1.009574 1.014147
> mod8_put_dec : 7.226004 1.953460
> -- Size -----------------------------------
> orig_put_dec : 1.000000 1.000000 Original
> mod1_put_dec : 1.000000 1.000000
> mod2_put_dec : 1.361111 1.403226
> mod3_put_dec : 1.000000 1.000000 Proposed by previous patch
> mod4_put_dec : 1.361111 1.403226
> mod5_put_dec : 1.000000 1.000000
> mod6_put_dec : 2.555556 3.508065 Proposed by this patch
> mod7_put_dec : 2.833333 3.911290
> mod8_put_dec : 2.027778 2.258065

I believe these are old results? Size growth is just too big.


> As it can be obsevred, proposed version of the put_dec function is
> twice as fast as the original version on Atom and almost 10 times
> faster on ARM. I imagine that it may be similar on other "embedded"
> processors.
>
> This may be skewed by the fact that the benchmark is using GCC's
> 64-bit division operator instead of kernel's do_div but it would
> appear that by avoiding 64-bit division something can be gained.

Re speed: on Phenom II in 32-bit mode, I see ~x3.3 speedup
on conversions involving large integers (might be skewed
by gcc's full-blown 64-bit division in "old" code - kernel's
div is smarter).


> PS. From Mr. Jones site: "Nonetheless, before relying on the material
> here, it would be prudent to check the arithmetic!" hence I checked
> all the calculations myself and everything seemed fine. I've also run
> test applitacion several times so it tested a few 64-bit numbers..."

I tested [0, 100 million] and [2^64-100 million, 2^64-1] ranges.
No errors.


> +#if BITS_PER_LONG != 32 || defined CONFIG_CC_OPTIMIZE_FOR_SIZE || \
> + ULLONG_MAX != 18446744073709551615ULL

I think it's better to say "if BITS_PER_LONG > 32 and ULLONG_MAX > 2^64-1",
since it expresses your intent better. Also, add comments explaining
what case you optimize for:

#if BITS_PER_LONG > 32 || ULLONG_MAX > 18446744073709551615ULL

/* Generic code */
....

#else /* BITS_PER_LONG <= 32 && ULLONG_MAX <= 2^64-1 */

/* Optimized code for arches with 64-bit long longs */
....


> +static noinline_for_stack
> +char *put_dec(char *buf, unsigned long long n)
> +{
> + uint32_t d3, d2, d1, q;
> +
> + if (!n) {
> + *buf++ = '0';
> + return buf;
> + }

You may as well use the above shortcut for n <= 9, not only for 0.

> + buf = put_dec_full4(buf, q % 10000);
> + q = q / 10000;
> +
> + d1 = q + 7671 * d3 + 9496 * d2 + 6 * d1;
> + buf = put_dec_full4(buf, d1 % 10000);
> + q = d1 / 10000;

I experimented with moving division up, before put_dec_full4:
q = d1 / 10000;
buf = put_dec_full4(buf, d1 % 10000);
but gcc appears to be smart emough to do this transformation
itself. But you may still do it for older (dumber) gcc's.

--
vda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/