Review: listagg aggregate [PgSql]

Prev: [HACKERS] further explain changes
Next: Sugerencia de opcion

From: Pavel Stehule on 24 Jan 2010 04:19

Hello

2010/1/22 David E. Wheeler <david(a)kineticode.com>:
> Pavel,
>
> My review of your listagg patch.
>
> Submission Review
> -----------------
> * The diff is a context diff and applies cleanly to HEAD (with just two hunks offset by 2 lines each).
>
> * There is documentation, though I'm not sure it needs to be mentioned in the string functions documentation. No harm in it, I guess.
>
> Â I would like to see an example, though, and the documentation does not currently explain what each of the parameters are for. In fact, it looks like all the existing aggregates take only one parameter, so there was not previously a need to explain it. But listagg() has an optional second param. I think that the description should explain what it's for.
>

can I help with it, please. My English is terrible.

> * There are tests and they look fine.
>
> Usability Review
> ----------------
> * The patch does in fact implement the aggregate function it describes, and OH YES do we want it (I've written my own in SQL a few times).
>
> * No, we don't already have it.
>
> * Yes it follows community-agreed behavior. I'm assuming that there is no special parsing of aggregate functions, so the simple use of commas to separate the two parameters is appropriate, rather than using a keyword like MySQL's SEPARATOR in the group_concat() aggregate.
>
> * No need to have pg_dump support, no dangers that I can see, looks like all the bases have been covered.
>
> Feature Test
> ------------
> * Everything built cleanly, but I got an OID dupe error when I tried to init the DB. Looks like 2997 and 2998 have been used for something else since you created the patch. I changed them to 2995 and 2996 and then it worked.
> * The feature appears to work. I didn't see any tests for encodings or other data types, so I ran a few myself and they work fine:
>
> postgres=# select listagg(a, U&'-\0441\043B\043E\043D-') from (values('aaaa'),('bbbb'),('cccc'
> Â Â Â Â listagg
> --------------------------
> Â aaaa-ÑÐ»Ð¾Ð½-bbbb-ÑÐ»Ð¾Ð½-cccc
> (1 row)
>
> postgres=# select listagg(a, U&'\2014') from (values(U&'\0441\043B\043E\043D'),(U&'d\0061t\+000061'),(U&'\0441\043B\043E\043D')) AS g(a);
> Â Â listagg
> ----------------
> Â ÑÐ»Ð¾Ð½âdataâÑÐ»Ð¾Ð½
> (1 row)
>
>
> postgres=# select listagg(a::text) from (values(1),(2),(3)) AS g(a);
> Â listagg
> ---------
> Â 123
> (1 row)
>
>
> Performance Review
> ------------------
>
> No performance issues, except that it should be faster than a custom aggregate that does the same thing. To test, I created a quick custom aggregate (no second argument, alas, so listagg() is more flexible) like so:
>
> Â Â CREATE OR REPLACE FUNCTION a2s(ANYARRAY)
> Â Â RETURNS TEXT LANGUAGE SQL AS $$
> Â Â Â Â SELECT array_to_string($1, ',');
> Â Â $$;
>
> Â CREATE AGGREGATE string_accum (
> Â Â Â Â SFUNC Â Â = array_accum,
> Â Â Â Â BASETYPE = ANYELEMENT,
> Â Â Â Â STYPE Â Â = ANYARRAY,
> Â Â Â Â INITCOND = '{}',
> Â Â Â Â FINALFUNC = a2s
> Â Â );
>
> Then I ran some simple tests (thanks for the clue, depesz):
>
> postgres=# select count(*) from (select string_accum(a) from (values('aaaa'),('bbbb'),('cccc')) AS g(a), generate_series(1,10000) i) AS x(i);
> Â count
> -------
> Â Â 1
> (1 row)
>
> Time: 1365.382 ms
>
> postgres=# select count(*) from (select listagg(a) from (values('aaaa'),('bbbb'),('cccc')) AS g(a), generate_series(1,10000) i) AS x(i);
> Â count
> -------
> Â Â 1
> (1 row)
>
> Time: 17.989 ms
>
> So overall, it looks like listagg() is 1-2 orders of magnitude faster. YMMV, and my system is built with --enable-cassert and --enable-debug. Still, good job.
>
> Coding Review
> -------------
>
> * Is varchar.c really the best place to put the ListAggState struct and the listagg() function? I grepped the source for array_agg() and it's in src/backend/utils/adt/array_userfuncs.c. Maybe there's an equivalent file for string functions? Otherwise, the style of the C code looks fine to my untrained eye.
>

array user functions are used more time in pg core. The complexity of
array functions are much higher, so I don't think we need special
file.

> Â Actually, shouldn't it return text rather than varchar?
>

I'll recheck it. I am sure so all parameters should be a text.

> * Does it really require four functions to do its work? Might there be some way to use the array_agg() C functions and then just a different final function to turn it into a string (using the internal array_to_string() function, perhaps)?

We can, but it isn't good way. Processing of arrays is little bit more
expensive then processing plain text. It is reason why listagg is
faster, than your custom aggregate. More, the final function could be
faster - the content is final.

I'm not at all sure about it, but given how little code was required
to create the same basic functionality in SQL, I'm surprised that the
C implementation requires four functions (accumStringResult(),
listagg1_transfn(), listagg2_transfn(), and listagg_finalfn()). Maybe
they're required to make it fast and avoid the overhead of an array?

It normal for aggregate functions. We need more transfn function,
because we need two two variant: listagg(col), listagg(col, sep). Our
coding guidlines doesn't advice share C functions - but these
functions are +/- wrapper for accumStringResult - so there is zero
overhead.

>
> * No compiler warnings, I never made it crash, good comments, does what it says on the tin. I doubt that there are any portability issues, as the code seems to use standard PostgreSQL internal macros and functions.
>
> Architecture Review
> -------------------
>
> * No dependencies, things seem to make sense overall, notwithstanding my questions in the Coding Review.
>
> Review Review
> -------------
>
> The only thing I haven't covered so far is the name. I agree with Tom's assertion that the name is awful. Sure there may be a precedent in Oracle, but I hardly find that convincing (some of the big corporations seem to do a really shitty job naming things in their APIs). Given that we already have array_agg(), what about simply concat_agg(), as Tom (wincingly, I grant you) suggested? Or string_agg()?

I don't think. When we have function, with same parameters, same
behave like some Oracle function, then I am strongly prefer Oracle
name. I don't see any benefit from different name. It can only confuse
developers and add the trable to people who porting applications.

>
> My bike shed is puce.
>
> Attached is a new patch with the changed OIDs and an added phrase in the documentation that indicates that the second argument can be used to separate the concatenated values.
>
> Best,
>

Thank you very much
Regards
Pavel

> David
>
>
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: "David E. Wheeler" on 24 Jan 2010 11:57

On Jan 24, 2010, at 1:19 AM, Pavel Stehule wrote:

> can I help with it, please. My English is terrible.

Yes, I added a bit in the patch I submitted.

> array user functions are used more time in pg core. The complexity of
> array functions are much higher, so I don't think we need special
> file.

Okay. Should have tried it in PL/pgSQL then, perhaps.

> I'll recheck it. I am sure so all parameters should be a text.

Probably shouldn't go into varchar.c then, yes?

> We can, but it isn't good way. Processing of arrays is little bit more
> expensive then processing plain text. It is reason why listagg is
> faster, than your custom aggregate. More, the final function could be
> faster - the content is final.

Understood.

> It normal for aggregate functions. We need more transfn function,
> because we need two two variant: listagg(col), listagg(col, sep). Our
> coding guidlines doesn't advice share C functions - but these
> functions are +/- wrapper for accumStringResult - so there is zero
> overhead.

Ah, okay, it's because of the second argument. Now I understand.

> I don't think. When we have function, with same parameters, same
> behave like some Oracle function, then I am strongly prefer Oracle
> name. I don't see any benefit from different name. It can only confuse
> developers and add the trable to people who porting applications.

Meh. If the name is terrible, we don't have to use it, and it's easy enough to create an alias in SQL for those who need it.

Best,

David

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Simon Riggs on 24 Jan 2010 13:02

On Fri, 2010-01-22 at 11:14 -0800, David E.Wheeler wrote:
> No performance issues

ISTM that this class of function is inherently dangerous performance
wise.

* It looks incredibly easy to construct enormous lists. We should test
the explosion limit of this to see how it is handled. Perhaps we need
some parameter limits to control that, depending upon results.

* Optimizer doesn't consider whether the result type of an aggregate get
bigger as the aggregate processes more rows. If we're adding this
function we should give some thought in that area also, or at least a
comment to note that it can and will cause the optimizer problems in
complex queries.

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Pavel Stehule on 24 Jan 2010 13:40

2010/1/24 Simon Riggs <simon(a)2ndquadrant.com>:
> On Fri, 2010-01-22 at 11:14 -0800, David E.Wheeler wrote:
>> No performance issues
>
> ISTM that this class of function is inherently dangerous performance
> wise.

there is potencial risk, but this risk isn't new. The almost all what
you say is true for array aggregates.

>
> * It looks incredibly easy to construct enormous lists. We should test
> the explosion limit of this to see how it is handled. Perhaps we need
> some parameter limits to control that, depending upon results.

There are no limit for generating large values - like bytea, xml, or
text. There are not limit for array_accum or array(query). So, I don't
think we need some special mechanism for listagg. If somebody will
generate too large a value, then he will get "out of memory"
exception.

>
> * Optimizer doesn't consider whether the result type of an aggregate get
> bigger as the aggregate processes more rows. If we're adding this
> function we should give some thought in that area also, or at least a
> comment to note that it can and will cause the optimizer problems in
> complex queries.
>

this is true, but this isn't some new. array_accum working well
without optimizer problems.

> --
> Â Simon Riggs Â Â Â Â Â www.2ndQuadrant.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on 24 Jan 2010 13:43

Simon Riggs <simon(a)2ndQuadrant.com> writes:
> On Fri, 2010-01-22 at 11:14 -0800, David E.Wheeler wrote:
>> No performance issues

> ISTM that this class of function is inherently dangerous performance
> wise.

> * It looks incredibly easy to construct enormous lists. We should test
> the explosion limit of this to see how it is handled. Perhaps we need
> some parameter limits to control that, depending upon results.

> * Optimizer doesn't consider whether the result type of an aggregate get
> bigger as the aggregate processes more rows. If we're adding this
> function we should give some thought in that area also, or at least a
> comment to note that it can and will cause the optimizer problems in
> complex queries.

We have that problem already with array_agg(), and I don't recall many
complaints about it. It might be worth worrying about at some point,
but I don't think it's reasonable to insist that it be fixed before
any more such aggregates are created.

I agree that testing-to-failure would be a good idea just to be sure it
fails cleanly.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

| Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11
Prev: [HACKERS] further explain changes
Next: Sugerencia de opcion