Replication documentation addition [PgSql]

From: Richard Troy on 25 Oct 2006 12:24

Hi Hannu, everyone,

I apologize for not having read the document in question - will do
shortly. My comments are brought about by the dialogue I read on list this
morning...

> > Here is a new replication documentation section I want to add for 8.2:
> >
> > ftp://momjian.us/pub/postgresql/mypatches/replication
>

> > Data Partitioning
> > -----------------
> >
> > Data partitioning splits the database into data sets. To achieve
> > replication, each data set can only be modified by one server. For
> > example, data can be partitioned by offices, e.g. London and Paris.
> > While London and Paris servers have all data records, only London can
> > modify London records, and Paris can only modify Paris records. Such
> > partitioning is usually accomplished in application code, though rules
> > and triggers can help enforce partitioning and keep the read-only data
> > sets current. Slony can also be used in such a setup. While Slony
> > replicates only entire tables, London and Paris can be placed in
> > separate tables, and inheritance can be used to access from both tables
> > using a single table name.
>
> Maybe another use of partitioning should also be mentioned. That is,
> when partitioning is used to overcome limitations of single servers
> (especially IO and memory, but also CPU), and only a subset of data is
> stored and processed on each server.

> > I think the "official" term for this kind of "replication" is
> > Shared-Nothing Clustering.

"Data partitioning" has two fundamental flavors, "horizontal" and
"vertical", quite a handful of implementations, and even more motivations
behind why one uses either strategy and whatever implementation. The same
is true for "clustering" - a few fundamental strategies, with a larger
number of implementations and yet more motivations. Replication,
meanwhile, is yet another beast altogether, sharing the same fundamentals
of multiple flavors, implementations and motivations. I strongly urge
keeping any documentation on these (and related) topics strictly distinct
and separate.

In my view, one should define the terms first, separately, distinctly, and
as succinctly as possible, and, following this, a dialogue on how these
may be combined can be entertained. The definitions of each should be both
complete and academic in flavor and may include implementation and
motivational information, but never "muddy the water" by mixing with
other concepts - not yet, not until after all the fundamentals have been
introduced.

I don't know much about what PostgreSQL has been doing in these areas of
late - nothing, I gather from someone's post this morning - but I'll try
to help out as I can with a paragraph or two - whatever you want,
whatever's welcome. "I was there" when Randy Eash created the first
commercial RDBMS replicator - for Ingres - and I created the first
commercial RDBMS front-end failover technology, also for Ingres, so I have
a pretty good handle on all the issues.

Also, I liked what Markus Schiltknecht wrote, but will have to read the
original before I can comment on his specific points.

>> I am not inclined to add commercial offerings. If people wanted
>> commercial database offerings, they can get them from companies that
>> advertise. People are coming to PostgreSQL for open source solutions,
>> and I think mentioning commercial ones doesn't make sense.
>>
>> If we are to add them, I need to hear that from people who haven't
>> worked in PostgreSQL commercial replication companies.
>
> I'm not coming to PostgreSQL for open source solutions. I'm coming
> to PostgreSQL for _good_ solutions.
>
> I want to see what solutions might be available for a problem I have.
> I certainly want to know whether they're freely available, commercial
> or some flavour of open source, but I'd like to know about all of them.
>
> A big part of the value of PostgreSQL is the applications and extensions
> that support it. Hiding the existence of some subset of those just
> because of the way they're licensed is both underselling PostgreSQL
> and doing something of a disservice to the user of the document.

> If potential new users look through the docs and it says no options
> available for what they want or consider they will need in the future
> then they go elsewhere, if they know that some options are available
> then they will look further if they want that feature.

I agree that people look through the materials on the web site,
documentation especially, and make choices based upon what they see. Many
of us don't have time to spend a day searching the web for things we don't
even know exist. By including more information, more users will be
attracted to PostgreSQL, whether it be in the documentation or web site. I
have been SURE that certain things must exist in the PG world, but haven't
known about them with certainty due to time constraints, but would gladly
point our customers at Postgres solutions if only I knew about them. Count
this paragraph as praise for doing _something_more_ to help get more
information to (prospective) users.

Consider someone like me; my company supports five RDBMSes, one of them
being Postgres. We are probably not unique in that we've written an SQL
dialect translator so we could write our own code in one code line to run
anywhere, against any RDBMS (it can learn new dialects) - or perhaps
others keep multiple code lines containing variant dialects. Either way,
we "don't care" whether our customer has Oracle, or PostgreSQL, so long as
they buy our stuff. But when our customers - or prospects - come to us
with a given scenario, the more we know about Postgres - and its community
- the more likely we can steer them to a PG solution, which we would
prefer anyway, for lots of reasons, historical, personal, and technical -
not to mention cost. The trouble is, Oracle, for example, has already told
them (sold them?) on whatever, and we need a rebuttal ready at hand or
they'll go with Oracle. We just don't have the time to fight that battle,
nor do we wish to risk the sale when we can work with Oracle just fine.

In sum, I agree with Tom Lane and the others who chimed in with "keep the
docs clean, use the web site for mentioning other projects/products." And
again I applaud this new effort.

Regards,
Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-92

From: Casey Duncan on 25 Oct 2006 12:49

Totally agree. The docs will tend to outlive whatever projects or
websites they mention. Best to not bake that into stone.

-Casey

On Oct 25, 2006, at 3:36 AM, Magnus Hagander wrote:

>> I don't think the PostgreSQL documentation should be
>> mentioning commercial solutions.
>
> I think maybe the PostgreSQL documentation should be careful about
> trying to list a "complete list" of commercial *or* free solutions.
> Instead linking to something on the main website or on techdocs
> that can
> more easily be updated.
>
> //Magnus
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq


From: Richard Troy on 25 Oct 2006 14:40

> Here is a new replication documentation section I want to add for 8.2:
>
> ftp://momjian.us/pub/postgresql/mypatches/replication
>

....Read the document, as promised...

First paragraph, "(fail over)" is inconsistent with title, "failover", as
are other spots throughout the document. The whole document should be
consistent and I vote for "failover" and not "fail over."

Fourth paragraph, "This "sync problem" is the fundamental difficulty for
servers working together"; "Sync problem" hasn't been defined. Actually,
you're talking about the consistency attribute of the "ACID" properties of
all competent databases: Atomicity, Consistency, Isolation, and Durability.
At least define the term you are using - probably most easily done in the
preceding paragraph.

The fifth paragraph needs a lot more help, I think. How about this
alternative:

So-called "two-phase commit" was developed as a strategy in which two or
more databases are updated simultaneously and none of the data is
committed until all are committed. This guarantees consistency between the
databases, with all propagation delay being absorbed by the writer at write
time. There are times when this propagation delay is large, so sometimes
alternatives are worked out, which we'll call here "asynchronous updates."
In these cases, however, there is always a window of time in which some
transaction can be lost should a failure occur. For this reason,
asynchronous updates are only used when the possibility of such losses is
acceptable.
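
To make the two-phase idea concrete, here is a minimal sketch in Python. It
is a toy under stated assumptions, not anything PostgreSQL provides: the
Participant class, its prepare/commit/rollback methods and the
two_phase_commit function are hypothetical names invented for illustration,
and the sketch ignores what happens if a server crashes between the two
phases.

```python
class Participant:
    """Hypothetical stand-in for one database taking part in the commit."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a server that must vote "no"
        self.committed = {}           # data that has been made durable
        self.pending = None           # staged changes, not yet visible

    def prepare(self, changes):
        # Phase 1: stage the changes and vote yes (True) or no (False).
        if not self.can_commit:
            return False
        self.pending = dict(changes)
        return True

    def commit(self):
        # Phase 2: make the staged changes durable.
        if self.pending is not None:
            self.committed.update(self.pending)
            self.pending = None

    def rollback(self):
        # Discard anything staged in phase 1.
        self.pending = None


def two_phase_commit(participants, changes):
    """Commit on every participant, or on none of them."""
    votes = [p.prepare(changes) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

The writer pays for both rounds before it learns the outcome, which is the
propagation delay described above; an asynchronous scheme would return after
the first server accepted the change and so could lose it.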

Paragraphs six through to "shared disk failover" seem very awkward to me.
I don't like them at all.

"Shared disk failover" has nothing to do with "the sync problem" as it's
not a multiple-database solution. It's an uptime, "24 X 7 X 365" issue.
Further, it also has nothing to do with disk arrays, though it is often
used with RAID to help avoid disk based corruption problems.

The point about Warm Standby needs to include a warning about WAL that it
MUST be sensitive to the semantics of the database design or else it's
fatally flawed. I'm talking about "referential integrity". That is to say,
it's inappropriate to capture updates on a table by table basis, as some
such systems do, (I have no idea what's done by anyone in the PG world on
this right now) because an update to one table (esp. inserts) very often
goes hand in glove with updates in other tables and to get one without the
other can corrupt a database.

The description of "Continuously running replication server" should
include the critical caveat - repeated if you think it's already said
elsewhere - that it is ONLY suitable for applications in which a loss of
(missing) update data doesn't matter. For example, an airline reservation
system would be an inappropriate application for such a "solution" because
what seats are available cannot be guaranteed to be correct.

Regarding data partitioning, I strongly disagree with the opening sentence
in that it doesn't split a database into sets, it splits tables into sets.
Data partitioning is often done within a single database on a single
server and therefore, as a concept, has nothing whatsoever to do with
different servers. Similarly, the second paragraph of this section is
problematic. Please define your term first, then talk about some
implementations - this is muddying the water. Further, there are both
vertical and horizontal partitioning - you mention neither - and each has
its own distinct uses. If partitioning is mentioned, it should be more
complete.
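
As a rough illustration of the two flavors, here is a small Python sketch
that partitions one table on a single machine. The EMPLOYEES data and the
function names horizontal_partition and vertical_partition are made up for
the example; nothing here is taken from the document under review.

```python
# A single "table", held as a list of rows (hypothetical sample data).
EMPLOYEES = [
    {"id": 1, "name": "Ann", "office": "London", "salary": 50},
    {"id": 2, "name": "Bob", "office": "Paris", "salary": 60},
    {"id": 3, "name": "Cat", "office": "London", "salary": 70},
]


def horizontal_partition(rows, column):
    """Split by row: each partition holds whole rows sharing a column value."""
    parts = {}
    for row in rows:
        parts.setdefault(row[column], []).append(row)
    return parts


def vertical_partition(rows, key, column_groups):
    """Split by column: each partition keeps the key plus one group of columns."""
    return [
        [{col: row[col] for col in (key, *cols)} for row in rows]
        for cols in column_groups
    ]
```

Horizontal partitioning by "office" gives one London set and one Paris set
with complete rows; vertical partitioning keeps every row but puts, say,
"salary" in a separate set that can be rejoined on "id". Neither requires a
second server.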

Next, Query Broadcast Load Balancing... also needs a lot of work. First,
it's foremost in my memory that sending read queries everywhere and
returning the first result set back is a key way to improve application
performance at the cost of additional load on other systems - I guess
that's not at all what the document is after here, but it's a worthy part
of a dialogue on broadcasting queries. In other words, this has more parts
to it than just what the document now entertains. Secondly, the document
doesn't address _at_all_ whether this is a two-phase-commit environment
or not. If not, how are updates managed? If each server operates
independently and one of them fails, what do you do then? How do you know
_any_ server got an insert/update? ... Each server _can't_ operate
independently unless the application does its own insert/update commits to
every one of them - and that can't be fast, nor does it load balance,
though it may contribute to superior uptime performance by the
application.
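
The failure question can be shown with a short Python sketch. It assumes
the simplest possible broadcaster, one that sends each insert to every
server independently with no commit protocol; the Server class and
broadcast_write function are hypothetical and do not describe how pgpool or
any other product behaves.

```python
class Server:
    """Hypothetical database server that may be down."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.rows = []

    def execute(self, row):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.rows.append(row)


def broadcast_write(servers, row):
    """Send the same insert to every server; return the names that failed."""
    failed = []
    for server in servers:
        try:
            server.execute(row)
        except ConnectionError:
            # The other servers have already applied (or will apply) the row,
            # so the copies now disagree unless something repairs them.
            failed.append(server.name)
    return failed
```

If one server is down during the broadcast, the surviving servers hold a
row the failed one never saw, and the only party that knows is whoever
inspects the returned list. That bookkeeping is what a two-phase commit, or
the application itself, would have to supply.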

Next up; I'm not aware of any current products or projects that provide
parallel query execution, though Informix might - I can ask a colleague or
two. Either way, it's probably best to simply define the term (perhaps in
a little more detail), and not mention solutions - they change with time
anyway.

While I've never used Oracle's clustering tools, I've read up on them and
have customers who use them, and I think this description of Oracle
clustering is a mis-read on what the Oracle system actually does. A check
with a true Oracle clustering expert is in order here.

Hope this helps. If asked, I'm willing to (re)write some of the bits
discussed above.

Regards,
Richard

--
Richard Troy, Chief Scientist
Science Tools Corporation
510-924-1363 or 202-747-1263
rtroy(a)ScienceTools.com, http://ScienceTools.com/


From: "Dawid Kuroczko" on 25 Oct 2006 16:34

On 10/25/06, Bruce Momjian <bruce(a)momjian.us> wrote:
> Joshua D. Drake wrote:
> > Bruce Momjian wrote:
> > > Tom Lane wrote:
> > >> "Magnus Hagander" <mha(a)sollentuna.net> writes:
> > >>> I think this is a good reason not to list *any* of the products by name
> > >>> in the documentation, but instead refer to a page on say techdocs that
> > >>> can be more easily updated.
> > >> I agree with that. If we have statements about other projects in our
> > >> docs, we will have a problem with not being able to update those
> > >> statements in a timely fashion when the other projects change.
> > >
> > > I mention only Slony and pgpool as examples of replication types. They
> > > seem to have risen to high enough visibility to do that. I have not
> > > mentioned any other solutions.
> >
> > What about Slony-II or pgpool2? Which are fundamentally different from
> > their v1 counterparts (o.k. slony-ii isn't out yet but still).
> >
> > I +1 that we move to have all of the replication documentation pushed to
> > techdocs or other facility and just have a link from the docs.
>
> What I did was to mention Slony and pgpool as "examples", so people
> realize there are many other solutions. It would be good to have a
> companion web site that could list them all, both open source and
> commercial. That is going to take a lot more work, but I think would
> have great value, especially since our documentation will clearly
> outline the terms. What you don't want to do is to throw up a list and
> have people try to figure out what solutions they cover.

I'm in quite a unique situation right now, working with a few DBAs
who have deep knowledge but no PostgreSQL background, so I have
a good view of how PostgreSQL is perceived by people with fair knowledge
of other databases.

What I have noticed is a deep respect for community. If they ask about
a replication solution and I tell them about Slony, they ask if Slony is
provided with postgresql-contrib. Well... no, and it won't be. Then they look
back, think a while and say something along the lines of: well,
$SOME_OTHER_DATABASE was using external replication solutions so it is all right.

But then, before I talked with them, they did some quick research on
PostgreSQL and their perception was that there's no replication / replication
is shady in PostgreSQL. It would be quite convenient to tell them:
"No replication? Did you actually read the manual? <here goes URL>"
Well, pointing them to the Slony page is a solution but of a lesser caliber
(how should they know about Slony anyway? They are newbies).
Pointing them at The Documentation is a Good Argument (and it may
cause them to look for some other information, like SQL syntax or
PostgreSQL-specific catalog views there, which is Good).

Enough background.

Bruce, I've read your documentation and I was left with a feeling
that it's a bit too generic. It's almost as if it could be about just about
any major database, not PostgreSQL specifically. I feel that, when I'm
reading PostgreSQL docs, I would like to know how to set up multi-master
replication with PostgreSQL, not get an explanation of what multi-master
replication is. It's not about the actual documentation content, but rather
about where the emphasis falls. Now it is something like: "These are the types
of replication solutions possible, some of them can be done with PostgreSQL",
I think it should rather be: "With PostgreSQL and some third-party tools you
can achieve such and such replication solutions, oh and by the way, research
is done on such and such replication method, but it's not production quality
yet".

And when I try to think as my DBA-mates would if they read the documentation,
I'm not sure they would end up enlightened after reading the docs -- they would
probably say: "hey, I knew that, it's well structured there, but I still don't
know what I should use", or maybe "where can I read something about this slony
thing anyway?".

It may be my "closed thinking schema" though. What I feel is that such an
outsider, after reading these docs, should end with "Aha! I should be using
Slony for my purposes". Or pgpool, if it's what she needs. I believe Tom's
remark that it does NOT belong in the PostgreSQL documentation is quite
right (though I wish there IS some reference to external replication packages,
mainly because over and over again I need to prove PostgreSQL CAN be
replicated, and it's not uncommon). However I'm still unconvinced about
TechDocs -- TechDocs are good but still they are a bit scattered and
unorganised. I am a PostgreSQL enthusiast, but it took me a while to
learn about them, and for newbies not biased towards PostgreSQL it may
take even more time. If it is linked from within the documentation, random
DBAs might read it, and I wish they do.

Right now I am more and more biased towards an additional "documentation
book" for PostgreSQL, something like a "DBA guide" or handbook. In a format
similar to the PostgreSQL documentation, but oriented around
configuring other tools around and together with PostgreSQL. I shall send
some drafts here within 10 days' time to seed a discussion. After all,
PostgreSQL is too big for just one documentation book. [1]

Regards,
Dawid

[1]: Then, later, a programmer's handbook? Deeper knowledge about fancy
stuff with Python, Perl and PgSQL? ;-)


From: Cesar Suga on 25 Oct 2006 22:08

Joshua D. Drake wrote:
> Cesar Suga wrote:
>
>> Hi,
>>
>> I also wrote Bruce about that.
>>
>> It happens that, if you 'freely advertise' commercial solutions (rather
>> than they doing so by other vehicles) you will always happen to be an
>> 'updater' to the docs if they change their product lines, if they change
>> their business model, if and if.
>>
>
> That is no different than the open source offerings. We have had several
> open source offerings that have died over the years. Replicator, for
> example has always been Replicator and has been around longer than any
> of the current replication solutions.
>
The documentation comes with the open source tarball.

I would welcome it if the docs pointed to an unofficial wiki (maintained
externally from authoritative PostgreSQL developers) or a website
listing them and giving a brief description of each solution.

postgresql.org already does this for events (commercial training!) and
news. Point to postgresql.org/download/commercial as there *already* are
brief descriptions, pricing and website links.
>> If you cite a commercial solution, as a fair game you should cite *all*
>> of them.
>>
>
> No. That doesn't make any sense either. I assume we aren't going to list
> all PostgreSQL OSS replication solutions (there are at least a dozen or
> more).
>
> You list the ones that are stable in their existence (commercial or not).
>
And how would you determine it? Years of existence? Contribution to
PostgreSQL's source code? It is not easy and wouldn't be fair. There are
ones that certainly will be listed, and other doubtful ones (which would
perhaps complain, that's why I said 'all' - if they are not stable,
either they stay out of the market or fix their problems).
>> If one enterprise has the right to be listed in the
>> documentation, all of them might, as you will never be favouring one of
>> them.
>>
>
> You are looking at this the wrong way. This isn't about *any*
> enterprise. It is about a PostgreSQL Solution. There happens to be two
> or three known working open source solutions, and two or three known
> working commercial solutions.
>
(see first three paragraphs)
>> That's the main motivation to write this. Moreover, if there are also
>> commercial solutions for high-end installs and they are cited as
>> providers to those solutions, it (to a point) discourages people from
>> getting together and writing open source extensions to PostgreSQL.
>>
>
> No it doesn't. Because there is always the, "It wants to be free!" crowd.
>
Yes, I agree there are. But development at *that* cutting edge is also
scarce. If you list some commercial solution, it feels as if something has
already filled the gap, mainly to people in the trenches (DBAs). They would,
obviously, first seek out the commercial solutions, as they are interested.
So they click 'commercial products' on the main website.
>> If people (who read the documentation) professionally work with
>> PostgreSQL, they may already have been briefed by those commercial
>> offerings in some way.
>>
>
> Maybe, maybe not.
>
> Sincerely,
>
> Joshua D. Drake
>
And I agree with your point, still. However, that would set a precedent
for people to have to maintain lists of stable software in every
documentation area.

Regards,
Cesar
