From: Robert Haas on
On Thu, Jul 29, 2010 at 4:37 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas(a)gmail.com> writes:
>> On Thu, Jul 29, 2010 at 1:20 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
>>> On-disk is what I'm thinking about.  Right now, a NaN's first word is
>>> all dscale except the sign bits.  You're proposing to change that
>>> but I think it's unnecessary to do so.
>
>> *Where* am I proposing this?
>
> Um, your patch has the comment
>
> !  * If the high bits of n_scale_dscale are NUMERIC_NAN, the two-byte header
> !  * format is also used, but the low bits of n_scale_dscale are discarded in
> !  * this case.
>
> but now that I look a bit more closely, I don't think that's what the
> code is doing.  You've got the NUMERIC_DSCALE and NUMERIC_WEIGHT access
> macros testing specifically for NUMERIC_IS_SHORT, not for high-bit-set
> which I think is what I was assuming they'd do.  So actually that code
> is good as is: a NAN still has the old header format.  It's just the
> comment that's wrong.

OK. I think you're misinterpreting the point of that comment, which
may mean that it needs some clarification. By "the two byte header
format is also used", I think I really meant "the header (and in fact
the entire value) is just 2 bytes". Really, the low order bits have
neither the old interpretation nor the new interpretation: they don't
have any interpretation at all - they're completely meaningless.
That's what the part after the word "but" was intended to clarify.
Every routine in numeric.c checks for NUMERIC_IS_NAN() and gives it
some special handling before doing anything else, so NUMERIC_WEIGHT()
and NUMERIC_DSCALE() are never called in that case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on
On Fri, Jul 30, 2010 at 1:13 AM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas(a)gmail.com> writes:
>> But, looking at it a bit more carefully, isn't the maximum-size logic
>> for numeric rather bogus?
>
> Perhaps, but I think you're confused on at least one point.
> numeric(2,1) has to be able to hold 2 decimal digits, not 2
> NumericDigits (which'd actually be 8 decimal digits given
> the current code).

I get that. The point is: if one of those 2 decimal digits is before
the decimal point and the other is after it, then two NumericDigits
will be used. The value '11'::numeric is only size 10 (untoasted),
but the value '1.1'::numeric is size 12 (untoasted).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane on
Robert Haas <robertmhaas(a)gmail.com> writes:
> On Fri, Jul 30, 2010 at 1:13 AM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
>> Perhaps, but I think you're confused on at least one point.
>> numeric(2,1) has to be able to hold 2 decimal digits, not 2
>> NumericDigits (which'd actually be 8 decimal digits given
>> the current code).

> I get that. The point is: if one of those 2 decimal digits is before
> the decimal point and the other is after it, then two NumericDigits
> will be used.

Ah, I see. Maybe we should allow for one more NumericDigit in the
calculation for such cases. I guess you could look at the scale too
to detect if the case is possible, but not sure it's worth the trouble.

regards, tom lane


From: Robert Haas on
On Fri, Jul 30, 2010 at 9:16 AM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas(a)gmail.com> writes:
>> On Fri, Jul 30, 2010 at 1:13 AM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
>>> Perhaps, but I think you're confused on at least one point.
>>> numeric(2,1) has to be able to hold 2 decimal digits, not 2
>>> NumericDigits (which'd actually be 8 decimal digits given
>>> the current code).
>
>> I get that.  The point is: if one of those 2 decimal digits is before
>> the decimal point and the other is after it, then two NumericDigits
>> will be used.
>
> Ah, I see.  Maybe we should allow for one more NumericDigit in the
> calculation for such cases.  I guess you could look at the scale too
> to detect if the case is possible, but not sure it's worth the trouble.

Since this just needs to be a reasonable upper bound, probably not.

Another small but related issue is this comment is out-of-date:

/* Numeric stores 2 decimal digits/byte, plus header */
return (precision + 1) / 2 + NUMERIC_HDRSZ;

Currently, numeric stores 4 decimal digits/2 bytes, which is not quite
the same thing. I think that for clarity it would make sense to break
this calculation in two parts. First, estimate the number of
NumericDigits that could be needed; then, estimate the amount of space
that could be needed to store them. Maybe something like this,
obviously with a suitable comment which I haven't written yet:

numeric_digits = (precision + 6) / 4;
return (numeric_digits * sizeof(int16)) + NUMERIC_HDRSZ;

The first line would be incorrect for precision 0 (it would produce 1,
not 0) but precision 0 isn't allowed anyway, and overestimating just
that one case would be OK even if it did. As far as I can see, it's
correct for all higher values. If you have 1 decimal digit, you can't
need more than one NumericDigit, but if you have 2 decimal digits, you
can need one for each side of the decimal point. But you can't manage
to get a third NumericDigit until you get up to 6 decimal digits,
because with only 5 you can split them 1-4 or 3-2 across the decimal
point, but getting to a second NumericDigit on either side of the
decimal point uses up all 5 digits, leaving nothing for the other
side. Similarly, to get a fourth NumericDigit, you have to have
enough decimal digits to split them 1-4-4-1 (with the decimal point
somewhere in the middle), so 10 is the minimum.

Another question here is whether we should just fix this in CVS HEAD,
or whether there's any reason to back-patch it. I can't see much
reason to back-patch, unless there's third-party code depending on
this for something more than what we use it for. The only callers of
type_maximum_size are get_typavgwidth(), which doesn't need to be
exact, and needs_toast_table(). So it seems that the worst thing that
could happen as a result of this problem is that we might fail to
create a toast table when one is needed, and even that seems extremely
unlikely. First, you'd need to have a table containing no text or
unlimited-length varchar columns, because those will force toast table
creation anyway. Second, numerics are typically going to be stored
with a short varlena header on disk, so a 2-byte underestimate of the
max size is going to be swamped by a 3-byte savings on the varlena
header. Third, even if you can find a case where the above doesn't
matter, it's not that hard to force a toast table to get created.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Robert Haas on
On Wed, Jul 28, 2010 at 3:00 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas(a)gmail.com> writes:
>> On Fri, Jul 16, 2010 at 2:39 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
>>> I don't like the way you did that either (specifically, not the kluge
>>> in NUMERIC_DIGITS()).  It would probably work better if you declared
>>> two different structs, or a union of same, to represent the two layout
>>> cases.
>>>
>>> n_sign_dscale is now pretty inappropriately named, probably better to
>>> change the field name.  This will also help to catch anything that's
>>> not using the macros.  (Renaming the n_weight field, or at least burying
>>> it in an extra level of struct, would be helpful for the same reason.)
>
>> I'm not sure what you have in mind here.  If we create a union of two
>> structs, we'll still have to pick one of them to use to check the high
>> bits of the first word, so I'm not sure we'll be adding all that much
>> in terms of clarity.
>
> No, you can do something like this:
>
> typedef struct numeric_short
> {
>     uint16  word1;
>     NumericDigit digits[1];
> } numeric_short;
>
> typedef struct numeric_long
> {
>     uint16  word1;
>     int16   weight;
>     NumericDigit digits[1];
> } numeric_long;
>
> typedef union numeric
> {
>     uint16  word1;
>     numeric_short   short;
>     numeric_long    long;
> } numeric;

That doesn't quite work because there's also a varlena header that has
to be accounted for, so the third member of the union can't be a
simple uint16. I'm wondering if it makes sense to do something along
these lines:

typedef struct NumericData
{
    int32 varlen;
    int16 n_header;
    union
    {
        struct
        {
            char n_data[1];
        } short;
        struct
        {
            uint16 n_weight;
            char n_data[1];
        } long;
    };
} NumericData;

Why n_data as char[1] instead of NumericDigit, you ask? It's that way
now, mostly I think so that the rest of the system isn't allowed to
know what underlying type is being used for NumericDigit; it looks
like previously it was signed char, but now it's int16.

>>> It seems like you've handled the NAN case a bit awkwardly.  Since the
>>> weight is uninteresting for a NAN, it's okay to not store the weight
>>> field, so I think what you should do is consider that the dscale field
>>> is still full-width, ie the format of the first word remains old-style
>>> not new-style.  I don't remember whether dscale is meaningful for a NAN,
>>> but if it is, your approach is constraining what is possible to store,
>>> and is also breaking compatibility with old databases.
>
>> There is only one NaN value.  Neither weight nor dscale is meaningful.
>> I think if the high two bits of the first word are 11 we never examine
>> anything else - do you see somewhere that we're doing otherwise?
>
> I hadn't actually looked.  I think though that it's a mistake to break
> compatibility on both dscale and weight when you only need to break one.
> Also, weight is *certainly* uninteresting for NaNs since it's not even
> meaningful unless there are digits.  dscale could conceivably be worth
> something.

I don't think I'm breaking compatibility on anything. Can you clarify
what part of the code you're referring to here? I'm sort of lost.

>>> The sign extension code in the NUMERIC_WEIGHT() macro seems a bit
>>> awkward; I wonder if there's a better way.  One solution might be to
>>> offset the value (ie, add or subtract NUMERIC_SHORT_WEIGHT_MIN) rather
>>> than try to sign-extend per se.
>
>> Hmm... so, if the weight is X we store the value
>> X-NUMERIC_SHORT_WEIGHT_MIN as an unsigned integer?  That's kind of a
>> funny representation - I *think* it works out to sign extension with
>> the high bit flipped.  I guess we could do it that way, but it might
>> make it harder/more confusing to do bit arithmetic with the weight
>> sign bit later on.
>
> Yeah, it was just an idea.  It seems like there should be an easier way
> to extract the sign-extended value, though.

It seemed a bit awkward to me, too, but I'm not sure there's a better one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

