From: Pavel Stehule on
Hello

I am planning to start to implement median function. I wrote some
array based implementation - it is fast, but I hope, so can be much
faster.

The basic question is method of implementation. It can be implemented
via a) custum aggregate functions or b) executor node.

Adventage of @a variant is simplicity - we need to teach aggregates to
ensure ordered input and ensure to use index only (maybe add flag
ORDERED INPUT [DESC|ASC]). Now PostgreSQL doesn't use a index for scan
to orderd aggregate - it can be a problem for large datasets. I found
some missing info in EXPLAIN about ordered aggregates - there are
showed nothing about sort

pavel(a)postgres:5432=# explain analyze verbose select array_agg(a order
by a) from omega;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1643.00..1643.01 rows=1 width=4) (actual
time=555.091..555.092 rows=1 loops=1)
Output: array_agg(a ORDER BY a)
-> Seq Scan on public.omega (cost=0.00..1393.00 rows=100000
width=4) (actual time=0.050..177.547 rows=100000 loops=1)
Output: a
Total runtime: 555.839 ms
(5 rows)

Probably we have to access a tuple store inside sfunc - when the data
size is out of work memory. And for effective evaluating it needs
patch https://commitfest.postgresql.org/action/patch_view?id=292

variant @b is more complex - but allows more possibilities - idea:
median is one from aggregate executor nodes (theoretically it can call
some custom final function in future - but I don't think about it
now). It has a few advantages:

a) we don't need to modify current aggregates
b) for datasets smaller than working_mem can be used a quickselect
algorithms - www-stat.stanford.edu/~ryantibs/papers/median.pdf
c) for larger datasets we can use integrated external sort with direct
reading - we don't need to stack result to array

I prefer a variant b. It offers a more possibilities - and there are
less chance to break some existing.

comments are welcome

Pavel Stehule

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers