From: Heikki Linnakangas on
Tom Lane wrote:
> Josh Berkus <josh(a)agliodbs.com> writes:
>> On 2/26/10 10:53 AM, Tom Lane wrote:
>>> I think that what we are going to have to do before we can ship 9.0
>>> is rip all of that stuff out and replace it with the sort of closed-loop
>>> synchronization Greg Smith is pushing. It will probably be several
>>> months before everyone is forced to accept that, which is why 9.0 is
>>> not going to ship this year.
>
>> I don't think that publishing visibility info back to the master ... and
>> subsequently burdening the master substantially for each additional
>> slave ... are what most users want.
>
> I don't see a "substantial additional burden" there. What I would
> imagine is needed is that the slave transmits a single number back
> --- its current oldest xmin --- and the walsender process publishes
> that number as its transaction xmin in its PGPROC entry on the master.

The additional burden comes from the old snapshot effect: a long-running
query in the standby would hold back cleanup on the master, which makes
the approach unusable for offloading reporting queries, for example. In
general, it is a very good architectural property that the master is not
affected by what happens in a standby, and closed-loop synchronization
would break that.

I don't actually understand how tight synchronization on its own would
solve the problem. What if the connection to the master is lost? Do you
kill all queries in the standby before reconnecting?

One way to think about this is to first consider a simple stop-and-go
system. Clearly the database must be consistent at any point in the WAL
sequence if recovery were stopped and the database started up there. So it is
always safe to pause recovery and run a read-only query against the
database as it is at that point in time (this assumes that the index
"cleanup" operations are not required for consistent query results BTW).
After the read-only transaction is finished, you can continue recovery.
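
To make that concrete, here is a tiny stand-alone C sketch of the
stop-and-go control flow; pause_recovery(), run_readonly_query() and
resume_recovery() are made-up placeholders, not real PostgreSQL functions:

#include <stdio.h>

/* Hypothetical placeholders -- these are not real PostgreSQL functions. */
static void pause_recovery(void)  { printf("recovery paused at a consistent WAL point\n"); }
static void run_readonly_query(const char *sql) { printf("running read-only query: %s\n", sql); }
static void resume_recovery(void) { printf("recovery resumed\n"); }

int main(void)
{
    /*
     * Stop-and-go: the database is consistent at any WAL point where replay
     * stops, so it is safe to freeze replay, answer the query against the
     * database as of that point, and then continue applying WAL.
     */
    pause_recovery();
    run_readonly_query("SELECT count(*) FROM orders");
    resume_recovery();
    return 0;
}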

The next step up is to relax that so that you allow replay of those WAL
records that are known to not cause trouble to the read-only queries.
For example, heap_insert records are very innocent: they only add rows
with a yet-uncommitted xmin.

Things get more complex when you allow the replay of commit records; all
the known-assigned-xids tracking exists for that, to ensure that
transactions that are not yet committed when a snapshot is taken in the
standby are still considered uncommitted by that snapshot even after their
commit record is later replayed. If that feels too fragile, there might
be other methods to achieve that. One I once pondered is to not track
all in-progress transactions in shared memory like we do now, but only
OldestXmin. When a backend wants to take a snapshot in the slave, it
memcpy()s clog from OldestXmin to the latest committed XID, and includes
it in the snapshot. The visibility checks use the copy instead of the
actual clog, so they see the situation as it was when the snapshot was
taken. To keep track of the OldestXmin in the slave, the master can emit
that as a WAL record every now and then; it's ok if it lags behind.
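
As a stand-alone sketch of that clog-copy idea (all of the types and the
one-byte-per-xid "clog" below are made up; the real clog packs two bits
per transaction and has to deal with xid wraparound):

#include <stdint.h>
#include <string.h>
#include <stdio.h>
#include <stdbool.h>

typedef uint32_t XidT;                 /* stand-in for TransactionId */

#define XID_IN_PROGRESS 0
#define XID_COMMITTED   1
#define XID_ABORTED     2

/* A toy "clog": one status byte per xid. */
static uint8_t clog[1024];

/* Snapshot: a private copy of the clog slice covering [oldest_xmin, latest_xid]. */
typedef struct {
    XidT    oldest_xmin;               /* everything below this is treated as committed */
    XidT    latest_xid;                /* latest committed xid at snapshot time */
    uint8_t status_copy[1024];         /* memcpy()'d clog slice */
} SlaveSnapshot;

static void
take_snapshot(SlaveSnapshot *snap, XidT oldest_xmin, XidT latest_xid)
{
    snap->oldest_xmin = oldest_xmin;
    snap->latest_xid = latest_xid;
    /* Copy the clog range once; later commits won't be visible to this snapshot. */
    memcpy(snap->status_copy + oldest_xmin, clog + oldest_xmin,
           latest_xid - oldest_xmin + 1);
}

static bool
xid_visible(const SlaveSnapshot *snap, XidT xid)
{
    if (xid < snap->oldest_xmin)
        return true;                   /* simplification: aborted rows below the horizon are assumed gone */
    if (xid > snap->latest_xid)
        return false;                  /* started after the snapshot: never visible */
    /* Check the frozen copy, not the live clog. */
    return snap->status_copy[xid] == XID_COMMITTED;
}

int main(void)
{
    SlaveSnapshot snap;

    clog[100] = XID_COMMITTED;
    clog[101] = XID_IN_PROGRESS;

    take_snapshot(&snap, 90, 101);

    clog[101] = XID_COMMITTED;         /* commit record replayed after the snapshot */

    printf("xid 100 visible: %d\n", xid_visible(&snap, 100));   /* 1 */
    printf("xid 101 visible: %d\n", xid_visible(&snap, 101));   /* 0: copy still says in-progress */
    return 0;
}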

Then there are the WAL record types that remove data that might still be
required by the read-only transactions. These include vacuum and index
deletion records.
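
Putting those categories together, replay could classify record types
roughly like this; the enum values below are illustrative only, not the
real WAL record identifiers:

#include <stdio.h>

/* Illustrative record types -- not the actual WAL rmgr identifiers. */
typedef enum {
    REC_HEAP_INSERT,        /* adds rows with an uncommitted xmin: harmless */
    REC_XACT_COMMIT,        /* needs known-assigned-xids / snapshot tracking */
    REC_VACUUM_CLEANUP,     /* removes tuples a standby query might still need */
    REC_BTREE_DELETE
} WalRecordType;

typedef enum {
    REPLAY_FREELY,          /* apply without caring about standby queries */
    REPLAY_TRACK_SNAPSHOT,  /* apply, but keep snapshots seeing it as uncommitted */
    REPLAY_MAY_CONFLICT     /* must wait for / cancel conflicting standby queries */
} ReplayAction;

static ReplayAction
classify(WalRecordType rec)
{
    switch (rec)
    {
        case REC_HEAP_INSERT:
            return REPLAY_FREELY;
        case REC_XACT_COMMIT:
            return REPLAY_TRACK_SNAPSHOT;
        case REC_VACUUM_CLEANUP:
        case REC_BTREE_DELETE:
            return REPLAY_MAY_CONFLICT;
    }
    return REPLAY_MAY_CONFLICT;        /* be conservative about anything unknown */
}

int main(void)
{
    printf("heap_insert -> %d\n", classify(REC_HEAP_INSERT));
    printf("xact_commit -> %d\n", classify(REC_XACT_COMMIT));
    printf("vacuum      -> %d\n", classify(REC_VACUUM_CLEANUP));
    return 0;
}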


If you really think the current approach is unworkable, I'd suggest that
we fall back to a stop-and-go system, where you either let recovery
progress or allow queries to run, but not both at the same time. But
FWIW I don't think the situation is that grave.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Josh Berkus on

> Well, as Heikki said, a stop-and-go WAL management approach could deal
> with that use-case. What I'm concerned about here is the complexity,
> reliability, maintainability of trying to interlock WAL application with
> slave queries in any sort of fine-grained fashion.

This sounds a bit brute-force, but what about simply having some form of
automatic query retry on the slave?
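
Something along these lines on the client side, say; is_recovery_conflict()
is a made-up check (a real one would look at the SQLSTATE), and the
connection string and query are only placeholders:

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/* Hypothetical test: decide whether an error looks like a recovery-conflict
 * cancellation worth retrying. */
static int
is_recovery_conflict(PGresult *res)
{
    const char *msg = PQresultErrorMessage(res);
    return msg != NULL && strstr(msg, "conflict with recovery") != NULL;
}

int main(void)
{
    PGconn *conn = PQconnectdb("host=standby dbname=reports");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    const char *sql = "SELECT count(*) FROM big_report_table";
    for (int attempt = 0; attempt < 3; attempt++)
    {
        PGresult *res = PQexec(conn, sql);
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            printf("rows: %s\n", PQgetvalue(res, 0, 0));
            PQclear(res);
            break;
        }
        int retry = is_recovery_conflict(res);
        PQclear(res);
        if (!retry)
            break;                     /* a real error, not a conflict cancel */
        /* else: the query was canceled by WAL replay, try again from scratch */
    }

    PQfinish(conn);
    return 0;
}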

--Josh Berkus



From: Greg Smith on
Tom Lane wrote:
> I don't see a "substantial additional burden" there. What I would
> imagine is needed is that the slave transmits a single number back
> --- its current oldest xmin --- and the walsender process publishes
> that number as its transaction xmin in its PGPROC entry on the master.
>

That is exactly the core idea I was trying to suggest in my rambling
message. With just that small bit of additional information transmitted
back and published to the master via that route, it becomes possible to
optimize this problem in a way that isn't available now. And I believe
that approach will feel more natural to some users who aren't well served
by any of the existing tuning possibilities.
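
A bare-bones sketch of what I have in mind, with invented structures
standing in for the walsender's PGPROC entry and the feedback message
(the real code would obviously be more involved):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t XidT;

/* Invented stand-in for a shared-memory process entry. */
typedef struct {
    XidT xmin;                         /* horizon this process holds back; 0 = none */
} FakeProcEntry;

/* Hypothetical feedback message sent periodically from standby to master. */
typedef struct {
    XidT standby_oldest_xmin;          /* oldest xmin among standby queries */
} StandbyFeedback;

/*
 * On the master, the walsender would publish the reported xmin as its own
 * transaction xmin, so vacuum's cleanup horizon automatically respects the
 * oldest query still running on the standby.
 */
static void
walsender_apply_feedback(FakeProcEntry *walsender_proc, const StandbyFeedback *msg)
{
    walsender_proc->xmin = msg->standby_oldest_xmin;
}

/* The cleanup horizon is simply the minimum xmin over all proc entries. */
static XidT
compute_cleanup_horizon(const FakeProcEntry *procs, int nprocs)
{
    XidT horizon = (XidT) -1;
    for (int i = 0; i < nprocs; i++)
        if (procs[i].xmin != 0 && procs[i].xmin < horizon)
            horizon = procs[i].xmin;
    return horizon;
}

int main(void)
{
    FakeProcEntry procs[2] = { { 5000 }, { 0 } };   /* [0]=local backend, [1]=walsender */
    StandbyFeedback msg = { 4200 };                 /* a long report runs on the standby */

    walsender_apply_feedback(&procs[1], &msg);
    printf("cleanup horizon: %u\n", (unsigned) compute_cleanup_horizon(procs, 2));  /* 4200 */
    return 0;
}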

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us



From: Greg Smith on
Bruce Momjian wrote:
> Doesn't the system already adjust the delay based on the length of slave
> transactions, e.g. max_standby_delay? It seems there is no need for a
> user switch --- just set max_standby_delay really high.
>

The first issue is that you're basically saying "I don't care about high
availability anymore" when you increase max_standby_delay to a high
value. Want to offload an 8-hour-long batch report to the standby every
day? You can do that with max_standby_delay = 8 hours. But the day your
master crashes 7 hours into it, you're in for a long wait before your
standby is available while it replays all the queued-up segments.
Your 'hot standby' has actually turned into the old form of 'cold
standby' just when you need it to be responsive.

This is also the reason why the whole "pause recovery" idea is a
fruitless path to wander down. The whole point of this feature is that
people have a secondary server available for high-availability, *first
and foremost*, but they'd like it to do something more interesting than
sit idle all the time. The idea that you can hold off on applying
standby updates for long enough to run seriously long reports is
completely at odds with the idea of high-availability.

The second major problem is that the day the report actually takes 8.1
hours instead, because somebody else ran another report that slowed you
down a little, you're screwed if you depend on its output: the query gets
canceled only *after* wasting 8 hours of reporting resource time.

max_standby_delay is IMHO only useful for allowing non-real-time web-app
style uses of HS (think "Facebook status updates"), where you say "I'm
OK giving people slightly out of date info sometimes if it lets me split
the query load over two systems". Set max_standby_delay to a few
seconds or maybe a minute, enough time to service a typical user query,
make your app tolerate the occasional failed query and only send big
ones to the master, and you've just scaled up all the small ones.
Distributed queries with "eventual consistency" on all nodes are where
many of the web app designs are going, and this feature is a reasonable
match for that use case.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us



From: Greg Smith on
Bruce Momjian wrote:
> 5 Early cleanup of data still visible to the current query's
> snapshot
>
> #5 could be handled by using vacuum_defer_cleanup_age on the master.
> Why is vacuum_defer_cleanup_age not listed in postgresql.conf?
>

I noticed that myself and fired off a corrective patch to Simon
yesterday; he said it was intentional, but I'm not sure why yet.
We'll sort that out.

You are correct that my suggestion is targeting primarily #5 on this
list. There are two problems with the possible solutions using that
parameter though:

-vacuum_defer_cleanup_age is set in a unit that people cannot be
expected to work in: transaction IDs. The UI is essentially useless, and
there's no obvious way to make a better one. The best you can do will
still be really fragile.

-If you increase vacuum_defer_cleanup_age, it's active all the time.
You're basically making every single transaction that could be cleaned
up pay for the fact that a query it would conflict with *might* be
running on the standby.

You can think of the idea of passing an xmin back from the standby as
being like an auto-tuning vacuum_defer_cleanup_age. It's 0 when no
standby queries are running, but grows in size to match longer ones. And
you don't have to know anything to set it correctly; just toggle
on the proposed "feedback xid from the standby" feature and you're safe.

Expecting that anyone will ever set vacuum_defer_cleanup_age correctly
in the field in its current form is pretty unreasonable, I think. Since
there's no timestamp-based memory of past xid activity, it's difficult
to convert the setting into time-based units instead, yet time is what
people would actually like to set this in.
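
To illustrate both points with made-up numbers: a fixed setting forces you
to guess the xid consumption rate, while the feedback approach just
follows whatever xmin the standby reports:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t XidT;

/*
 * Fixed setting: to cover e.g. a 2-hour report you must guess how many
 * xids the master burns per second.  If the workload spikes, the guess is
 * too low and standby queries get canceled; if it is high, every
 * transaction on the master pays the bloat cost all the time.
 */
static XidT
fixed_defer_age(double assumed_xids_per_sec, double target_seconds)
{
    return (XidT)(assumed_xids_per_sec * target_seconds);
}

/*
 * Feedback approach: the horizon simply follows the oldest xmin reported
 * by the standby, so it costs nothing when no standby query is running
 * and grows exactly as large as needed while a long query runs.
 */
static XidT
feedback_defer_age(XidT master_next_xid, XidT standby_oldest_xmin)
{
    return standby_oldest_xmin ? master_next_xid - standby_oldest_xmin : 0;
}

int main(void)
{
    printf("fixed, guessing 500 xid/s for 2h: defer %u xids\n",
           (unsigned) fixed_defer_age(500.0, 2 * 3600));
    printf("feedback, no standby query:   defer %u xids\n",
           (unsigned) feedback_defer_age(100000, 0));
    printf("feedback, long standby query: defer %u xids\n",
           (unsigned) feedback_defer_age(100000, 96500));
    return 0;
}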

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(a)2ndQuadrant.com www.2ndQuadrant.us

