Hot Standby query cancellation and Streaming Replication integration [PgSql]


From: Greg Stark on 26 Feb 2010 21:30

On Sat, Feb 27, 2010 at 1:53 AM, Greg Smith <greg(a)2ndquadrant.com> wrote:
> Greg Stark wrote:
>>
>> Well you can go sit in the same corner as Simon with your high
>> availability servers.
>>
>> I want my ability to run large batch queries without any performance
>> or reliability impact on the primary server.
>>
>
> Thank you for combining a small personal attack with a selfish commentary
> about how yours is the only valid viewpoint. Saves me a lot of trouble
> replying to your messages, can just ignore them instead if this is how
> you're going to act.

Eh? That's not what I meant at all. Actually it's kind of the exact
opposite of what I meant.

What I meant was that the "High Availability first and foremost" case
you describe is only one possible use case. Simon in the past
expressed the same single-minded focus on that use case. It's a
perfectly valid use case, and I would probably agree that if we had to
choose just one it would be the most important.

But we don't have to choose just one. There are other valid use cases
such as load balancing and isolating your large batch queries from
your production systems. I don't want us to throw out all these other
use cases because we treat high availability as the only use case
we're interested in.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Greg Stark on 26 Feb 2010 23:02

On Fri, Feb 26, 2010 at 9:44 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Greg Stark <gsstark(a)mit.edu> writes:
>
>> What extra entries?
>
> Locks, just for starters. I haven't read enough of the code yet to know
> what else Simon added. In the past it's not been necessary to record
> any transient information in WAL, but now we'll have to.

Haven't we been writing locks to the WAL since two-phase commit?

--
greg

From: Robert Haas on 27 Feb 2010 20:00

On Fri, Feb 26, 2010 at 1:53 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> Greg Stark <gsstark(a)mit.edu> writes:
>> In the model you describe any long-lived queries on the slave cause
>> tables in the master to bloat with dead records.
>
> Yup, same as they would do on the master.
>
>> I think this model is on the roadmap but it's not appropriate for
>> everyone and I think one of the benefits of having delayed it is that
>> it forces us to get the independent model right before throwing in
>> extra complications. It would be too easy to rely on the slave
>> feedback as an answer for hard questions about usability if we had it
>> and just ignore the question of what to do when it's not the right
>> solution for the user.
>
> I'm going to make an unvarnished assertion here. I believe that the
> notion of synchronizing the WAL stream against slave queries is
> fundamentally wrong and we will never be able to make it work.
> The information needed isn't available in the log stream and can't be
> made available without very large additions (and consequent performance
> penalties). As we start getting actual beta testing we are going to
> uncover all sorts of missed cases that are not going to be fixable
> without piling additional ugly kluges on top of the ones Simon has
> already crammed into the system. Performance and reliability will both
> suffer.
>
> I think that what we are going to have to do before we can ship 9.0
> is rip all of that stuff out and replace it with the sort of closed-loop
> synchronization Greg Smith is pushing. It will probably be several
> months before everyone is forced to accept that, which is why 9.0 is
> not going to ship this year.

Somewhat unusually for me, I haven't been able to keep up with my
email over the last few days, so I'm weighing in on this one a bit
late. It seems to me that if we're forced to pass the xmin from the
slave back to the master, that would be a huge step backward in terms
of both scalability and performance, so I really hope it doesn't come
to that. I wish I understood better exactly what you mean by "the
notion of synchronizing the WAL stream against slave queries" and why
you don't think it will work. Can you elaborate?

....Robert

From: Greg Stark on 28 Feb 2010 08:54

On Sun, Feb 28, 2010 at 6:07 AM, Greg Smith <greg(a)2ndquadrant.com> wrote:
> Not forced to--have the option of. There are obviously workloads where you
> wouldn't want this. At the same time, I think there are some pretty common
> ones people are going to expect HS+SR to work on transparently where this
> would obviously be the preferred trade-off to make, were it available as one
> of the options. The test case I put together shows an intentionally
> pathological but not completely unrealistic example of such a workload.

Well if we're forced to eventually have both then it kind of takes the
wind out of Tom's arguments. We had better get both features working
so it becomes only a question of which is worth doing first and which
can be held off. Since there aren't any actual bugs in evidence for
the current setup and we already have it that's a pretty easy
decision.

> What I am sure of is that a SR-based xmin passing approach is simpler,
> easier to explain, more robust for some common workloads, and less likely to
> give surprised "wow, I didn't think *that* would cancel my standby query"
> reports from the field

Really? I think we get lots of surprised wows from the field at the
idea that a long-running read-only query can cause your database to
bloat. I think the only reason that's obvious to us is that we've been
grappling with that problem for so long.

> And since I never like to bet against Tom's gut feel, having it
> around as a "plan B" in case he's right about an overwhelming round of bug
> reports piling up against the max_standby_delay etc. logic doesn't hurt
> either.

Agreed. Though I think it'll be bad in that case even if we have a
plan B. It'll mean no file-based log shipping replicas and no
guarantee that what you run on the standby can't affect the master --
which is a pretty nice guarantee. It'll also mean it'll be much more
fragile against network interruptions.

--
greg

From: Joachim Wieland on 28 Feb 2010 10:56

On Sun, Feb 28, 2010 at 2:54 PM, Greg Stark <gsstark(a)mit.edu> wrote:
> Really? I think we get lots of surprised wows from the field from the
> idea that a long-running read-only query can cause your database to
> bloat. I think the only reason that's obvious to us is that we've been
> grappling with that problem for so long.

It seems to me that the scenario that you are looking at is one where
people run different queries with and without HS, i.e. that they will
run longer read-only queries than now once they have HS. I don't think
that is the case. If it isn't, you cannot really speak of master
"bloat".

Instead, I assume that most people who will grab 9.0 and use HS+SR do
already have a database with a certain query profile. Now with HS+SR
they will try to put the most costly and longest read-only queries to
the standby but in the end will run the same number of queries with
the same overall complexity.

Now let's take a look at both scenarios from the administrators' point of view:

1) With the current implementation they will see better performance on
the master and more aggressive vacuum (!), since they have fewer
long-running queries now on the master and autovacuum can kick in and
clean up with less delay than before. On the other hand their queries
on the standby might fail and they will start thinking that this HS+SR
feature is not as convincing as they thought it was... Next step for
them is to take the documentation and study it for a few days to learn
all about vacuum, different delays, transaction ids and age parameters
and experiment a few weeks until no more queries fail - for a while...
But they can never be sure... In the end they might also modify the
parameters in the wrong direction or overshoot because of lack of time
to experiment and lose another important property without noticing
(like being as close as possible to the master).

2) On the other hand if we could ship 9.0 with the xmin-propagation
feature, people would still see better performance and have a hot
standby system but this time without query cancellations. Again: the
read-only queries that will be processed by the HS in the future are
being processed by the master today anyway, so why should it get
worse? The first impression will be that it just works nicely out of
the box, is easy to set up and has no negative effect (query
cancellation) that has not already shown up before (vacuum lag).

I guess that most people will just run fine with this setup and never
get to know about the internals. Of course we should still offer an
expert mode where you can turn all kinds of knobs and where you can
avoid the vacuum dependency but it would be nice if this could be the
expert mode only. Tuning this is highly installation specific and you
need to have a deep understanding of how PostgreSQL and HS work
internally and what you actually want to achieve...

> Agreed. Though I think it'll be bad in that case even if we have a
> plan B. It'll mean no file-based log shipping replicas and no
> guarantee that what you run on the standby can't affect the master --
> which is a pretty nice guarantee. It'll also mean it'll be much more
> fragile against network interruptions.

Regarding the network interruptions... in reality if you have network
interruptions of several minutes between your primary and your
standby, you have worse problems anyway... If the standby does not
renew its xmin for n seconds, log a message and just go on...
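
As a rough illustration of the timeout idea above, here is a minimal Python sketch. It is not code from this thread or from PostgreSQL: the class, the function and the parameter names are all hypothetical, and it assumes transaction ids compare as plain integers (xid wraparound is ignored). The master remembers the last xmin a standby reported, and if no report arrives within the timeout it logs a message and stops holding back cleanup for that standby.

```python
import logging
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("xmin_feedback_sketch")


@dataclass
class StandbyXminTracker:
    """Hypothetical master-side record of the xmin reported by one standby."""
    timeout_s: float                      # the "n seconds" from the text
    xmin: Optional[int] = None            # last reported xmin, if any
    last_report: Optional[float] = None   # time of the last report, in seconds

    def report(self, xmin: int, now: float) -> None:
        # Called whenever the standby sends a fresh xmin.
        self.xmin = xmin
        self.last_report = now

    def effective_xmin(self, now: float) -> Optional[int]:
        # Return the standby's xmin, or None if it never reported or went stale.
        if self.xmin is None or self.last_report is None:
            return None
        if now - self.last_report > self.timeout_s:
            log.warning("standby xmin not renewed for %.0f s, ignoring it",
                        now - self.last_report)
            self.xmin = None              # log a message and just go on
            return None
        return self.xmin


def vacuum_horizon(local_oldest_xmin: int,
                   tracker: StandbyXminTracker,
                   now: float) -> int:
    """Oldest xid that cleanup on the master must still preserve."""
    standby = tracker.effective_xmin(now)
    if standby is None:
        return local_oldest_xmin
    # Simplification: plain integer comparison, no wraparound handling.
    return min(local_oldest_xmin, standby)
```

With a 30 second timeout, a standby that reported xmin 100 holds the horizon at 100 while its report is fresh; once the report is older than 30 seconds the horizon falls back to the master's own oldest xmin, and queries on that standby are again exposed to cancellation.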

Joachim

