From: Heikki Linnakangas on
Josh Berkus wrote:
>> That is exactly the core idea I was trying to suggest in my rambling
>> message. Just that small additional bit of information transmitted and
>> published to the master via that route, and it's possible to optimize
>> this problem in a way not available now. And it's a way that I believe
>> will feel more natural to some users who may not be well served by any
>> of the existing tuning possibilities.
>
> Well, if both you and Tom think it would be relatively easy (or at least
> easier that continuing to pursue query cancel troubleshooting), then
> please start coding it. It was always a possible approach, we just
> collectively thought that query cancel would be easier.

You still need query cancels. A feedback loop just makes them happen less
frequently.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas on
Dimitri Fontaine wrote:
> Bruce Momjian <bruce(a)momjian.us> writes:
>> Doesn't the system already adjust the delay based on the length of slave
>> transactions, e.g. max_standby_delay? It seems there is no need for a
>> user switch --- just max_standby_delay really high.
>
> Well, that GUC looks like it allows one to set a compromise between
> HA and reporting, not to say "never give priority to replay while I'm
> running my reports". At least that's how I understand it.

max_standby_delay=-1 does that. The documentation needs to be updated to
reflect that; it currently says:

> There is no wait-forever setting because of the potential for deadlock which that setting would introduce. This parameter can only be set in the postgresql.conf file or on the server command line.

but that is false: -1 means wait forever. Simon removed that option at
one point, but it was later put back and apparently the documentation
was never updated.
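
To make that concrete, the behaviour Dimitri is after needs nothing
more than this in the standby's postgresql.conf (an illustrative
snippet, not text from the docs):

    # Never cancel standby queries to let replay proceed; recovery
    # waits for them instead (with the deadlock caveat noted above).
    max_standby_delay = -1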

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas on
Heikki Linnakangas wrote:
> Dimitri Fontaine wrote:
>> Bruce Momjian <bruce(a)momjian.us> writes:
>>> Doesn't the system already adjust the delay based on the length of slave
>>> transactions, e.g. max_standby_delay? It seems there is no need for a
>>> user switch --- just max_standby_delay really high.
>> Well, that GUC looks like it allows one to set a compromise between
>> HA and reporting, not to say "never give priority to replay while I'm
>> running my reports". At least that's how I understand it.
>
> max_standby_delay=-1 does that. The documentation needs to be updated to
> reflect that; it currently says:
>
>> There is no wait-forever setting because of the potential for deadlock which that setting would introduce. This parameter can only be set in the postgresql.conf file or on the server command line.
>
> but that is false: -1 means wait forever. Simon removed that option at
> one point, but it was later put back and apparently the documentation
> was never updated.

I've put back the mention of the max_standby_delay=-1 option in the docs.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith on
Bruce Momjian wrote:
>> "The first option is to connect to the primary server and keep a query
>> active for as long as needed to run queries on the standby. This
>> guarantees that a WAL cleanup record is never generated and query
>> conflicts do not occur, as described above. This could be done using
>> contrib/dblink and pg_sleep(), or via other mechanisms."
>>
>
> I am unclear how you would easily advance the snapshot as each query
> completes on the slave.
>

The idea of the workaround is that if you have a single long-running
query to execute, and you want to make sure it doesn't get canceled
because of a vacuum cleanup, you just have it connect back to the master
to keep an open snapshot the whole time. That's basically the same idea
that vacuum_defer_cleanup_age implements, except you don't have to
calculate a value--you just hold open the snapshot to do it.

When that query ended, its snapshot would be released, and the
master's xmin horizon would advance to the oldest remaining snapshot.
Nothing fancier than that. The tie-in to my proposal is that if you
made every query on the standby do this, you would effectively get the
same behavior I'm suggesting could be available via the
standby->master xmin publication.
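
As a minimal sketch, assuming contrib/dblink is installed on the
standby and 'host=master dbname=mydb' stands in for your primary's
connection string, the workaround looks like this:

    -- On the standby, before the long report starts. The parked
    -- pg_sleep() query holds a snapshot open on the master, which
    -- holds back its xmin horizon:
    SELECT dblink_connect('mastercon', 'host=master dbname=mydb');
    SELECT dblink_send_query('mastercon', 'SELECT pg_sleep(3600)');

    -- ... run the long report on the standby ...

    -- Afterwards, cancel the sleeper and disconnect so cleanup on
    -- the master can resume:
    SELECT dblink_cancel_query('mastercon');
    SELECT dblink_disconnect('mastercon');

The 3600 seconds is just a generous guess at the report's runtime;
it's the cancel at the end, not the sleep expiring, that releases the
snapshot as soon as the report is done.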

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us



From: Josh Berkus on
All,

First, from the nature of the arguments, we need to eventually have both
versions of SR: delay-based and xmin-pub. And it would be fantastic if
Greg Smith and Tom Lane could work on xmin-pub to see if we can get it
ready as well.

I also think, based on the discussion and Greg's test case, that there
are two things we could do to make delay-based SR, despite its
shortcomings, a vastly better experience for users:

1) Automated retry of cancelled queries on the slave. I have no idea
how hard this would be to implement, but it makes the difference
between writing lots of exception-handling code for slave connections
(unacceptable) and merely slower response times on the slave
(acceptable).

2) A more usable vacuum_defer_cleanup_age. If it were feasible for a
user to configure the master to not vacuum records less than, say, 5
minutes dead, then that would again offer the user a choice: slightly
degraded performance on the master (acceptable) vs. lots of query
cancel (unacceptable). I'm going to test Greg's case with
vacuum_defer_cleanup_age set fairly liberally to see if this approach
has merit; a sketch of the arithmetic follows.
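
The wrinkle is that vacuum_defer_cleanup_age is measured in
transactions, not time, so "5 minutes" has to be translated through a
guess at the master's transaction rate. With purely illustrative
numbers:

    # postgresql.conf on the master. Assuming ~100 write transactions
    # per second, deferring cleanup by ~5 minutes works out to
    # 100 * 300 = 30000 transactions:
    vacuum_defer_cleanup_age = 30000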

Why do I say that "lots of query cancel" is "unacceptable"? For the
simple reason that one cannot run the same application code against an
HS+SR cluster with lots of query cancel as one runs against a standalone
database. And if that's true, then the main advantage of HS+SR over
Slony and Londiste is gone. MySQL took great pains to make sure that
you could run the same code against replicated MySQL as standalone, and
that was based on having a fairly intimate relationship with their users
(at the time, anyway).

Another thing to keep in mind in these discussions is the
inexpensiveness of servers today. This means that, if slaves have poor
performance, that's OK; one can always spin up more slaves. But if each
slave imposes a large burden on the master, then that limits your
scalability.

--Josh Berkus

