From: Hannu Krosing on
On Tue, 2010-04-27 at 10:32 -0400, Michael Tharp wrote:
> On 04/27/2010 09:59 AM, Kevin Grittner wrote:
> > Under what circumstances would PostgreSQL
> > modify a file without changing the "last modified" timestamp or the
> > file size?
>
> Do all OSes have sub-second precision mtimes? Because otherwise I could
> see a scenario such as this:
>
> * File is modified
> * Backup inspects and copies the file in the same second
> * File is modified again in the same second, so the mtime doesn't change
> * Backup is run again some time later and sees that the mtime has not
> changed
>
> Even with microsecond precision this kind of scenario makes me squidgy,
> especially if some OSes decide that skipping frequent mtime updates is
> OK.

To be on the safe side you need to record the latest table data change
time _after_ the backup anyway; it is easy to wait a few seconds to be
sure.
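
As a rough illustration (not from this thread; the paths, names and the
one-second slack are made up), the "wait and re-copy with some slack"
idea could look like this: take the cutoff timestamp before the copy
starts, and on the next run re-copy anything whose mtime is at or after
the previous cutoff minus the slack, so a second modification within
the same second is still picked up.

import os
import shutil
import time

SLACK = 1.0   # covers filesystems with whole-second mtime granularity

def incremental_copy(src_dir, dst_dir, prev_cutoff):
    """Copy files whose mtime is at or after prev_cutoff - SLACK.

    Returns the cutoff for the next run, taken *before* the copy starts,
    so a file modified while we are copying is picked up next time."""
    next_cutoff = time.time()
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        if os.stat(src).st_mtime >= prev_cutoff - SLACK:
            shutil.copy2(src, os.path.join(dst_dir, name))
    return next_cutoff

The first run passes 0.0 as prev_cutoff to copy everything; each later
run feeds in the value returned by the previous one.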

> Florian's point about clock changes is also very relevant. Since
> Postgres has the capability to give a better answer about what is in the
> file, it would be best to use that.
>
> -- m. tharp
>


--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training




From: "Kevin Grittner" on
Hannu Krosing <hannu(a)2ndquadrant.com> wrote:

> I see the main value when doing pg_dump based backups

Ah, now that makes more sense.

-Kevin


From: Hannu Krosing on
On Tue, 2010-04-27 at 12:14 -0400, Merlin Moncure wrote:
> On Tue, Apr 27, 2010 at 11:13 AM, Kevin Grittner
> <Kevin.Grittner(a)wicourts.gov> wrote:
> > Merlin Moncure <mmoncure(a)gmail.com> wrote:
> >
> >> The proposal only seems a win to me if a fair percentage of the
> >> larger files don't change, which strikes me as a relatively low
> >> level case to optimize for.
> >
> > That's certainly a situation we face, with a relatively slow WAN in
> > the middle.
> >
> > http://archives.postgresql.org/pgsql-admin/2009-07/msg00071.php
> >
> > I don't know how rare or common that is.
>
> hm...interesting read. pretty clever. Your archiving requirements are high.
>
> With the new stuff (HS/SR) taken into consideration, would you have
> done your DR the same way if you had to do it all over again?
>
> Part of my concern here is that manual filesystem level backups are
> going to become an increasingly arcane method of doing things as the
> HS/SR train starts leaving the station.

Actually, HS/SR speaks _for_ adding explicit change dates to files, as
the mtimes on the slave side will be different, and you may still want
to know when the table was really last modified.

>
> hm, it would be pretty neat to see some of the things you do pushed
> into logical (pg_dump) style backups...with some enhancements so that
> it can skip tables that haven't changed and are present in a previously
> supplied dump. This is more complicated but maybe more useful for a
> broader audience?

Yes, I see the main value of this for pg_dump backups, as physical
files already carry this information in their ctime/mtime/atime.
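
For what it's worth, a rough sketch of how a dump wrapper could
approximate that today (purely illustrative; it leans on the
pg_stat_user_tables counters, which are collected asynchronously and
can be reset, so it is a heuristic rather than a guarantee):

import json
import subprocess
import psycopg2

STATE_FILE = "table_counters.json"   # counters as of the previous backup

def changed_tables(conn, previous):
    """Return (current counters, tables whose counters moved since last run)."""
    cur = conn.cursor()
    cur.execute("""
        SELECT schemaname, relname, n_tup_ins + n_tup_upd + n_tup_del
        FROM pg_stat_user_tables
    """)
    current, to_dump = {}, []
    for schema, table, total in cur.fetchall():
        key = "%s.%s" % (schema, table)
        current[key] = total
        if previous.get(key) != total:
            to_dump.append(key)
    return current, to_dump

conn = psycopg2.connect("dbname=mydb")
try:
    previous = json.load(open(STATE_FILE))
except IOError:
    previous = {}
current, to_dump = changed_tables(conn, previous)

for table in to_dump:
    # one custom-format archive per changed table
    subprocess.check_call(["pg_dump", "-F", "c", "-t", table,
                           "-f", table + ".dump", "mydb"])

json.dump(current, open(STATE_FILE, "w"))

A proper implementation inside pg_dump would of course want something
crash-safe, which is exactly where a tracked per-table modification
time would help.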

>
> Side question: is it impractical to backup via pg_dump a hot standby
> because of query conflict issues?
>
> merlin
>


--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training




From: Hannu Krosing on
On Tue, 2010-04-27 at 17:28 +0200, Csaba Nagy wrote:
> Hi all,
>
> On Tue, 2010-04-27 at 11:07 -0400, Merlin Moncure wrote:
> > The block level case seems pretty much covered by the hot standby feature.
>
> One use case we would have is to dump only the changes from the last
> backup of a single table. This table takes 30% of the DB disk space, it
> is in the order of ~400GB, and it's only inserted, never updated, then
> after ~1 year the old entries are archived. There's ~10M new entries
> daily in this table. If the backup would be smart enough to only read
> the changed blocks (in this case only for newly inserted records), it
> would be a fairly big win...

The standard trick for this kind of table is to have it partitioned by
insertion date - that way you get two benefits:

1) you already know which partition to back up (the latest one, and
maybe the one before it if you have just switched to a new one)

2) archiving will be fast (copy the whole partition away and then
truncate it) instead of slow (copy the "old enough" records out, then
delete the same records, both of which are quite slow, and you also
need to vacuum afterwards, which is also slow on large tables)

This would actually be a good sample case for tracking the "latest
DML" time, except that in this particular corner case you can arrange
for it yourself.
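
To make "arrange for this yourself" concrete, a minimal sketch (the
table, trigger and database names are made up; psycopg2 is used only to
run the DDL): keep one row per tracked table and bump its timestamp
from a statement-level trigger.

import psycopg2

SETUP_SQL = """
CREATE TABLE table_last_dml (
    relname   text PRIMARY KEY,          -- name of the tracked table
    last_dml  timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION note_last_dml() RETURNS trigger AS $$
BEGIN
    UPDATE table_last_dml SET last_dml = now() WHERE relname = TG_TABLE_NAME;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
"""

TRACK_SQL = """
INSERT INTO table_last_dml (relname) VALUES ('%(t)s');

CREATE TRIGGER track_last_dml
    AFTER INSERT OR UPDATE OR DELETE ON %(t)s
    FOR EACH STATEMENT EXECUTE PROCEDURE note_last_dml();
"""

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute(SETUP_SQL)
for table in ("big_table",):             # illustrative: the tables to track
    cur.execute(TRACK_SQL % {"t": table})
conn.commit()

A dump script can then compare last_dml per table against the time of
the previous dump; the obvious cost is one extra UPDATE per DML
statement on the tracked tables.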

--
Hannu Krosing http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
Services, Consulting and Training




From: "Kevin Grittner" on
Hannu Krosing <hannu(a)2ndquadrant.com> wrote:
> On Tue, 2010-04-27 at 17:28 +0200, Csaba Nagy wrote:

>> One use case we would have is to dump only the changes from the
>> last backup of a single table. This table takes 30% of the DB
>> disk space, it is in the order of ~400GB, and it's only inserted,
>> never updated, then after ~1 year the old entries are archived.
>> There's ~10M new entries daily in this table. If the backup would
>> be smart enough to only read the changed blocks (in this case
>> only for newly inserted records), it would be a fairly big win...

That is covered pretty effectively in PITR-style backups with the
hard link and rsync approach cited earlier in the thread. Those 1GB
table segment files which haven't changed aren't read or written,
and only those portions of the other files which have actually
changed are sent over the wire (although the entire disk file is
written on the receiving end).
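
For anyone who has not read the cited post, the general shape of that
style of backup is roughly the following (a generic sketch, not
necessarily exactly what the linked setup does; the paths are made up,
and you still need pg_start_backup()/pg_stop_backup() plus WAL
archiving around it to get a consistent base backup):

import datetime
import subprocess

SOURCE = "dbhost:/var/lib/pgsql/data/"     # illustrative source data directory
BACKUP_ROOT = "/backups/pgdata"

def take_backup(previous):
    """rsync the data directory; files unchanged since the previous backup
    are hard-linked instead of re-transferred, and changed files are sent
    as deltas over the wire."""
    target = "%s/%s" % (BACKUP_ROOT, datetime.date.today().isoformat())
    cmd = ["rsync", "-a", "--delete"]
    if previous:
        cmd.append("--link-dest=%s" % previous)   # hard-link unchanged files
    cmd += [SOURCE, target]
    subprocess.check_call(cmd)
    return target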

> The standard trick for this kind of table is having this table
> partitioned by insertion date

That doesn't always work. In our situation the supreme court sets
records retention rules which can be quite complex, but usually key
on *final disposition* of a case rather than insertion date; that
is, the earliest date on which the data related to a case is
*allowed* to be deleted isn't known until weeks or years after
insertion. Additionally, it is the elected clerk of court in each
county who determines when and if data for that county will be
purged once it has reached the minimum retention threshold set by
supreme court rules.

That's not to say that partitioning couldn't help with some backup
strategies; just that it doesn't solve all "insert-only" (with
eventual purge) use cases. One of the nicest things about
PostgreSQL is the availability of several easy and viable backup
strategies, so that you can tailor one to fit your environment.

-Kevin
