From: Jeff Davis on
On Tue, 2009-12-15 at 13:15 -0500, Tom Lane wrote:
> You probably need some flag bits anyway, so flailing frantically to
> avoid that doesn't seem like a profitable use of time.

I think "need" and "flailing" are both a little too strong here. The
biggest use case will almost certainly be ranges of timestamps, and most
of those people will have no use for flag bits (or if they do, it might
not be worth an 8-byte-per-value overhead).

> One pretty obvious use for a flag bit is open-ended ranges, ie
> range(something, infinity)
> You could only do this without a flag bit if the underlying datatype
> has an "infinity" value, which not all do.

True, but frustrating for people whose primary use case is timestamps or
floats, which already have 'infinity'. Also frustrating for people who
don't mind using the min/max integer as a workaround.
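To make the point concrete, here is a toy sketch (in Python, purely illustrative, not the proposed C implementation): when the element type already supplies an infinity value, an open-ended range needs no flag bits at all, because the bound itself carries the information.

```python
import math

# A minimal range over floats using the [) convention. The element type
# itself supplies 'infinity', so no per-value flag bits are needed to
# express an open upper bound.
def make_range(lower, upper):
    assert lower <= upper
    return (lower, upper)

def contains(r, x):
    lower, upper = r
    return lower <= x < upper  # half-open: includes lower, excludes upper

# range(1.5, infinity): open-ended with zero extra storage
open_ended = make_range(1.5, math.inf)
```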

> I'm also wondering what null range boundaries would do. Maybe that's
> the same as the infinity case, or maybe not.

I would prefer to avoid allowing NULL range boundaries for the following
reasons:
* it reminds me of MySQL dates with zeros in them
* we are not guided by any kind of standard
* we'd have to invent semantics that are pretty far outside of those
defined for mathematical intervals
* we could easily create more confusion or allow subtle traps for
users
* we aren't guided by a clear use case (or I haven't seen it) that
isn't equally solvable (or better) using some other method
* it would impose an extra 8-byte storage burden on people who may not
need any flag bits

Regards,
Jeff Davis




--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
Jeff Davis <pgsql(a)j-davis.com> writes:
> I think "need" and "flailing" are both a little too strong here. The
> biggest use case will almost certainly be ranges of timestamps, and most
> of those people will have no use for flag bits (or if they do, it might
> not be worth an 8-byte-per-value overhead).

When the alternatives are a crippled implementation that might not do
what I want at all, or a full-featured implementation that takes another
8 bytes per value, I'll take the second. The 8 bytes don't matter if it
doesn't solve my problem.

> I would prefer to avoid allowing NULL range boundaries for the following
> reasons:
> * it reminds me of MySQL dates with zeros in them

If we use that notation to represent an open-ended interval, it seems
perfectly reasonable to me. And it doesn't require any assumptions
about whether the underlying type has an infinity.

I think it's a seriously bad idea to tell people that they should depend
on min or max values of a datatype to substitute for the lack of
open-ended intervals. That sort of technique creates unnecessary
implementation dependencies, and magic numbers (especially ones with a
dozen or two digits in them) are bad things for readability in any case.

To take just one example that's pretty near at hand: if type date had
had an exact published max value that applications were hard-wiring into
their code, we'd not have been able to change it to add 'infinity' as a
special value.

regards, tom lane


From: Scott Bailey on
David Fetter wrote:
> On Tue, Dec 15, 2009 at 11:31:05AM -0800, Scott Bailey wrote:
>> Jeff Davis wrote:
>>> On Tue, 2009-12-15 at 10:19 -0500, Tom Lane wrote:
>> Would it be OK if we handled float timestamp ranges as continuous
>> and int64 timestamps discrete?
>
> That sounds like a recipe for disaster. Whatever timestamp ranges
> are, float and int64 should be treated the same way so as not to get
> "surprises" due to implementation details.
>
>> You effectively lose the ability to build non-contiguous sets with
>> continuous ranges. Which is integral to the work I'm doing (union,
>> intersect, coalesce and minus sets of ranges)
>>
>> As for the extra bits, would it be better to just require continuous
>> ranges to be either [] or [)? But I don't know which would be
>> preferred. My inclination would be toward [), but Tom seemed to
>> indicate that perhaps [] was the norm.
>
> [] makes certain operations--namely the important ones in
> calendaring--impossible, or at least incredibly kludgy, to do. I
> think we ought to leave openness at each end up to the user,
> independent of the underlying implementation details.
>
> FWIW, I think it would be a good idea to treat timestamps as
> continuous in all cases.

Ok, let me give an example of what we can do with the current
implementations that would not be possible with timestamps if we
implement them as suggested. Jeff's implementation uses a 1-microsecond
step size, or granule. Mine uses an interval step size that can be
configured database-wide; the default is 1 second.

The function below takes two period arrays that can have overlapping and
adjacent elements. It subtracts all values in pa1 that intersect with
values in pa2. So perhaps pa1 is all of your work shifts for the month
and pa2 is a combination of your leave and holidays. The result is a
coalesced non-contiguous set of the times you would actually be working.
But to do this kind of thing you need to be able to determine prior,
first, last and next. I need an implementation that can do this for
timestamps and not just ints and dates.

CREATE OR REPLACE FUNCTION period_minus(
    pa1 IN period[],
    pa2 IN period[]
) RETURNS period[] AS
$$
    SELECT array_agg(prd)
    FROM (
        SELECT period((t_in).start_time,
                      MIN((t_out).end_time)) AS prd
        FROM (
            SELECT DISTINCT first(p) AS start_time
            FROM unnest($1) p
            WHERE NOT contains($2, first(p))
              AND NOT contains($1, prior(p))

            UNION

            SELECT DISTINCT next(p)
            FROM unnest($2) p
            WHERE contains($1, next(p))
              AND NOT contains($2, next(p))
        ) t_in
        JOIN (
            SELECT next(p) AS end_time
            FROM unnest($1) p
            WHERE NOT contains($1, next(p))

            UNION ALL

            SELECT first(p)
            FROM unnest($2) p
            WHERE contains($1, first(p))
              AND NOT contains($2, prior(p))
        ) t_out ON t_in.start_time < t_out.end_time
        GROUP BY t_in.start_time
        ORDER BY t_in.start_time
    ) sub;
$$ LANGUAGE sql IMMUTABLE STRICT;
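
As an aside, the prior/first/last/next operations the function leans on are only well-defined when the range is discrete. A rough Python model of them, assuming a 1-microsecond granule as in Jeff's implementation (the names mirror the SQL functions above but are otherwise hypothetical; next_ has a trailing underscore only to avoid shadowing Python's builtin):

```python
from datetime import datetime, timedelta

GRANULE = timedelta(microseconds=1)  # assumed step size for int64 timestamps

# A period is modeled as (start, end) under the [) convention.
def first(p):
    """First instant contained in the period."""
    return p[0]

def last(p):
    """Last instant contained: one granule before the open end."""
    return p[1] - GRANULE

def prior(p):
    """Instant immediately before the period."""
    return p[0] - GRANULE

def next_(p):
    """First instant after the period (the open end itself under [))."""
    return p[1]

shift = (datetime(2009, 12, 15, 9, 0), datetime(2009, 12, 15, 17, 0))
```

With float timestamps treated as continuous, no such granule exists, which is exactly why the function above cannot be written for them.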


From: Jeff Davis on
On Tue, 2009-12-15 at 11:49 -0800, David Fetter wrote:
> That sounds like a recipe for disaster. Whatever timestamp ranges
> are, float and int64 should be treated the same way so as not to get
> "surprises" due to implementation details.

It's a compile-time option that will change the semantics of timestamps
regardless. Anyone who changes between float and int64 timestamps may
experience problems for a number of different reasons. What was unique
before might no longer be unique, for instance.

> FWIW, I think it would be a good idea to treat timestamps as
> continuous in all cases.

I disagree. There is a lot of value in treating timestamp ranges as
discrete.

One big reason is that the ranges can be translated between the
different input/output forms, and there's a canonical form. As we know,
a huge amount of the value in an RDBMS is unifying data from multiple
applications with different conventions.

So, let's say one application uses (] and another uses [). If you are
mixing the data and returning it to the application, you want to be able
to provide the result according to its convention. You can't do that
with a continuous range.

And things get more interesting: if you mix (] and [), then range_union
will produce () and range_intersect will produce []. So now you have all
four conventions floating around the same database.

Another reason you might mix conventions: say you have log data from
several sources, and some sources provide timestamps for an event which
is essentially "instantaneous" and other sources will log the period of
time over which the event occurred, in [) format. To mix the data
coherently, the correct thing to do is call the instantaneous points a
singleton range; but the only way to represent a singleton continuous
range is by using [].

Whenever you mix conventions, you either have to be able to change the
format (which is only possible with discrete ranges) or teach the
application to understand your convention. And if you don't have a
canonical form (which is only possible with discrete ranges), you can't
reliably compare values for equality, or see if they are adjacent.
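For a discrete type, the translation between conventions is purely mechanical. A sketch over integers (granule 1), converting any of the four conventions to a canonical [) form; the function names are illustrative only:

```python
def canonicalize(lower, upper, bounds):
    """Convert a discrete integer range in any of the four conventions
    ('[)', '(]', '[]', '()') to the canonical half-open [) form by
    nudging closed/open bounds by one granule."""
    lo = lower if bounds[0] == '[' else lower + 1
    hi = upper if bounds[1] == ')' else upper + 1
    return (lo, hi)

def equal(r1, r2):
    # Equality is only reliable after canonicalization: (1,5] and [2,6)
    # denote the same set of integers.
    return canonicalize(*r1) == canonicalize(*r2)

def adjacent(r1, r2):
    a, b = canonicalize(*r1), canonicalize(*r2)
    return a[1] == b[0] or b[1] == a[0]
```

For a continuous range no such nudge exists, so (1,5] and [2,6) over reals are simply distinct objects and the translation is impossible.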

Saying that discrete ranges are unnecessary is essentially saying that
there's only a use for one convention; or that the conventions will
never be mixed; or that applications will always be smart enough to sort
it out for themselves.

Regards,
Jeff Davis



From: Tom Lane on
Jeff Davis <pgsql(a)j-davis.com> writes:
> On Tue, 2009-12-15 at 11:49 -0800, David Fetter wrote:
>> FWIW, I think it would be a good idea to treat timestamps as
>> continuous in all cases.

> I disagree. There is a lot of value in treating timestamp ranges as
> discrete.

> One big reason is that the ranges can be translated between the
> different input/output forms, and there's a canonical form. As we know,
> a huge amount of the value in an RDBMS is unifying data from multiple
> applications with different conventions.

Actually, that is exactly one of the reasons why what you propose is
a *bad* idea. You want to institutionalize application dependence on
a non-portable implementation detail, namely the granularity of machine
representation of what's in principle a continuous value. That's one
of the fastest routes to non-unifiable data I can think of.

> So, let's say one application uses (] and another uses [). If you are
> mixing the data and returning it to the application, you want to be able
> to provide the result according to its convention. You can't do that
> with a continuous range.

The above is nonsense. [1,2) and [1,2] are simply different objects.
A design that assumes that it is always possible to replace one by
the other is broken so badly it's not even worth discussing.

The only reason you'd have applications that fail to handle both open
and closed intervals would be if someone were to create an
implementation that didn't support both from the outset. Which we
need not and should not do.

> And things get more interesting: if you mix (] and [), then range_union
> will produce () and range_intersect will produce []. So now you have all
> four conventions floating around the same database.

Which is why it's a good idea to support all four...

regards, tom lane

