From: Robert Heller on
At Wed, 14 Apr 2010 16:55:19 +0000 Greg Russell <me(a)invalid.com> wrote:

>
> On Wed, 14 Apr 2010 18:10:36 +0200, Harald Meyer wrote:
>
> > Greg Russell wrote:
> >> Does the gzip (-z) option to tar introduce some entropy into a file
> >> between iterations with the same arguments? e.g.:
> >
> > gzip preserves the timestamp of the input file, in this case the .tar
> > file.
>
> What does that mean wrt:
>
> $ tar czf test_0.tgz *.txt; tar czf test_1.tgz *.txt
>
> Is a transient tar file created before the gzip process, and that's the
> difference accounted by:
>
> $ cmp -l *.tgz
> 5 277 310

No 'transient tar file' is created. Tar streams the archive through a pipe
to gzip, and because gzip is reading from a pipe rather than a regular file,
it has no input timestamp to preserve -- it simply stamps the output with
the current time.
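
If you want to see it, the gzip header stores that timestamp as a 32-bit
little-endian value in bytes 5-8 of the file, which is why your cmp output
flags byte 5. On a little-endian machine something like this should print
the two stamps as plain Unix times (file names as in your example):

$ od -An -tu4 -j4 -N4 test_0.tgz
$ od -An -tu4 -j4 -N4 test_1.tgz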

>
> Is there any option to disable the feature that causes the problem?

Why is this a problem? What *really* are you trying to do?


--
Robert Heller -- 978-544-6933
Deepwoods Software -- Download the Model Railroad System
http://www.deepsoft.com/ -- Binaries for Linux and MS-Windows
heller(a)deepsoft.com -- http://www.deepsoft.com/ModelRailroadSystem/

From: Greg Russell on
On Wed, 14 Apr 2010 12:28:29 -0500, Robert Heller quoted and wrote:

>> > gzip preserves the timestamp of the input file, in this case the .tar
>> > file.
>>
>> What does that mean wrt:
>>
>> $ tar czf test_0.tgz *.txt; tar czf test_1.tgz *.txt
>>
>> Is a transient tar file created before the gzip process, and that's the
>> difference accounted by:
>>
>> $ cmp -l *.tgz
>> 5 277 310
>
> No 'transient tar file' is created. Tar streams the archive through a
> pipe to gzip, and because gzip is reading from a pipe rather than a
> regular file, it has no input timestamp to preserve -- it simply stamps
> the output with the current time.
>
>> Is there any option to disable the feature that causes the problem?
>
> Why is this a problem? What *really* are you trying to do?

The intent is to maintain a minimal local archive of the dynamic, text-based
calendar data files that live on a remote web server, so that the work of
several people can be restored to its most recent state if needed.

Once per hour, a local script via cron employs "wget -N" to retrieve any
changed text data files from the remote web server. The complete local
collection of text files is then archived in /tmp with "tar -cz" and if
it differs from the most recently-saved archive, the temp tgz is saved as
a new archive; if it's the same then the /tmp file is discarded.
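
In outline the hourly job looks something like this -- the paths and the
URL below are only placeholders, not the real ones:

#!/bin/sh
# Sketch of the cron job described above; all names are illustrative.
cd /var/local/calendars || exit 1
wget -q -N 'ftp://ftp.example.com/calendars/*.txt'  # fetch only changed files
tar czf /tmp/calendars.tgz *.txt                    # archive the whole local set
if cmp -s /tmp/calendars.tgz archive/latest.tgz; then
    rm /tmp/calendars.tgz                           # identical: discard it
else
    mv /tmp/calendars.tgz archive/latest.tgz        # changed: keep as new archive
fi
# The catch: the timestamp gzip embeds in the .tgz makes cmp see a
# difference on every run, even when none of the *.txt files changed.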

Everything works properly using gzip instead of "tar -cz", but the
archives are about 4 times as large ... not a major obstacle, but now
that the nature of the problem has been discovered with the help of
Usenet, it can be circumvented by other means.

Since tar itself preserves the individual timestamps of the files in the
archive, it was a useful learning exercise to find that the "-z" option adds
one extra piece of information -- the gzip header timestamp -- and that this
is what causes the diff.

Appreciation goes to Mr. Meyer for illuminating this feature, and judging
by the nature of Usenet and the timestamp of your post (16 minutes after
Mr. Meyer's), my appreciation goes to you as well, despite the somewhat
challenging tone of your closing.

From: Vilmos Soti on
Greg Russell <me(a)invalid.com> writes:

>>> Is there any option to disable the feature that causes the problem?

Yes.

>>
>> Why is this a problem? What *really* are you trying to do?
>
> The intent is to maintain a minimal local archive of the dynamic,
> text-based calendar data files that live on a remote web server, so that
> the work of several people can be restored to its most recent state if
> needed.

Did you take a look at rsync? Especially the --link-dest= feature?

> Once per hour, a local script via cron employs "wget -N" to retrieve any
> changed text data files from the remote web server. The complete local
> collection of text files is then archived in /tmp with "tar -cz" and if
> it differs from the most recently-saved archive, the temp tgz is saved as
> a new archive; if it's the same then the /tmp file is discarded.

If you insist on using tar and gzip, then try this:

tar ... | gzip -n > file.tgz

From gzip(1):
-n --no-name
When compressing, do not save the original file name and time
stamp by default. (The original name is always saved if the name
had to be truncated.) When decompressing, do not restore the
original file name if present (remove only the gzip suffix from
the compressed file name) and do not restore the original time
stamp if present (copy it from the compressed file). This option
is the default when decompressing.

Here is a test (output rearranged a bit for readability):

$ while true; do
sleep 0.3
date
echo hello world | gzip | md5sum
done

Thu Apr 15 08:30:39 PDT 2010 ee5f97aead40f0d081508548853e787d -
Thu Apr 15 08:30:39 PDT 2010 ee5f97aead40f0d081508548853e787d -
Thu Apr 15 08:30:39 PDT 2010 ee5f97aead40f0d081508548853e787d -
Thu Apr 15 08:30:40 PDT 2010 0f442555a99878a9a5782a50370869f4 -
Thu Apr 15 08:30:40 PDT 2010 0f442555a99878a9a5782a50370869f4 -
Thu Apr 15 08:30:40 PDT 2010 0f442555a99878a9a5782a50370869f4 -
Thu Apr 15 08:30:41 PDT 2010 ff070d337e6c4ffa8f5d6cc18552d900 -
Thu Apr 15 08:30:41 PDT 2010 ff070d337e6c4ffa8f5d6cc18552d900 -
Thu Apr 15 08:30:41 PDT 2010 ff070d337e6c4ffa8f5d6cc18552d900 -
$

And now the same with the -n switch:

$ while true; do
sleep 0.3
date
echo hello world | gzip -n | md5sum
done

Thu Apr 15 08:32:13 PDT 2010 92b0a5a28433b3ce3ee6d01a84b4508c -
Thu Apr 15 08:32:13 PDT 2010 92b0a5a28433b3ce3ee6d01a84b4508c -
Thu Apr 15 08:32:14 PDT 2010 92b0a5a28433b3ce3ee6d01a84b4508c -
Thu Apr 15 08:32:14 PDT 2010 92b0a5a28433b3ce3ee6d01a84b4508c -
Thu Apr 15 08:32:14 PDT 2010 92b0a5a28433b3ce3ee6d01a84b4508c -
Thu Apr 15 08:32:15 PDT 2010 92b0a5a28433b3ce3ee6d01a84b4508c -
Thu Apr 15 08:32:15 PDT 2010 92b0a5a28433b3ce3ee6d01a84b4508c -
$
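
The same idea applied to a tar pipeline -- assuming the *.txt files are
not touched between the two runs, the checksums should now come out
identical:

$ tar cf - *.txt | gzip -n | md5sum
$ tar cf - *.txt | gzip -n | md5sum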

Vilmos
From: Dave Gibson on
Greg Russell <me(a)invalid.com> wrote:
> On Wed, 14 Apr 2010 12:28:29 -0500, Robert Heller quoted and wrote:
>
>>> > gzip preserves the timestamp of the input file, in this case the .tar
>>> > file.
>>>
>>> What does that mean wrt:
>>>
>>> $ tar czf test_0.tgz *.txt; tar czf test_1.tgz *.txt
>>>
>>> Is a transient tar file created before the gzip process, and that's the
>>> difference accounted by:
>>>
>>> $ cmp -l *.tgz
>>> 5 277 310
>>
>> No 'transient tar file' is created. Tar streams the archive through a
>> pipe to gzip, and because gzip is reading from a pipe rather than a
>> regular file, it has no input timestamp to preserve -- it simply stamps
>> the output with the current time.
>>
>>> Is there any option to disable the feature that causes the problem?
>>
>> Why is this a problem? What *really* are you trying to do?
>
> The intent is to maintain a minimal local archive of the dynamic,
> text-based calendar data files that live on a remote web server, so that
> the work of several people can be restored to its most recent state if
> needed.
>
> Once per hour, a local script via cron employs "wget -N" to retrieve any
> changed text data files from the remote web server. The complete local
> collection of text files is then archived in /tmp with "tar -cz" and if
> it differs from the most recently-saved archive, the temp tgz is saved as
> a new archive; if it's the same then the /tmp file is discarded.

The files' access times are modified when tar reads them; the next time
they are added to an archive, the updated timestamps can make the new
archive differ from the older one.

GNU tar's --atime-preserve option can be used to sidestep that, but you
are still relying on the archive structures not changing (-H gnu helps
here) and on the order in which files are added to the archives staying
the same.

> Everything works properly using gzip instead of "tar -cz", but the
> archives are about 4 times as large ... not a major obstacle, but now
> that the nature of the problem has been discovered with the help of
> Usenet, it can be circumvented by other means.

Use gzip's -n option to prevent it from writing a timestamp to its output.
The option can be set in the GZIP environment variable so gzip still sees
it when invoked via tar:

GZIP=-n tar czf /tmp/whatever.tgz --atime-preserve -H gnu *.txt

As an alternative to comparing tarballs, consider using sorted
md5sum lists instead -- one matching the contents of the archive,
one freshly generated each time wget completes.
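
For instance (file names here are just placeholders):

$ md5sum *.txt | sort > /tmp/new.md5
$ if cmp -s /tmp/new.md5 last.md5; then
      rm /tmp/new.md5                               # nothing changed
  else
      tar czf archive-$(date +%Y%m%d%H).tgz *.txt   # snapshot the changed set
      mv /tmp/new.md5 last.md5
  fi

That way a tarball is only created when the checksum list has actually
changed, so the timestamp gzip writes into it no longer matters.
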
From: Greg Russell on
On Thu, 15 Apr 2010 08:35:00 -0700, Vilmos Soti wrote:

....
>> The intent is to maintain a minimal local archive of the dynamic,
>> text-based calendar data files that live on a remote web server, so
>> that the work of several people can be restored to its most recent
>> state if needed.
>
> Did you take a look at rsync? Especially the --link-dest= feature?

Yes, but we don't have access to a shell on the remote web server, hence
the "wget -N" using ftp.

....
> If you insist on using tar and gzip, then try this:
>
> tar ... | gzip -n > file.tgz

Thank you, that works very well.
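
For the record, the archiving step now amounts to something like this
(the paths are again just placeholders):

$ tar cf - *.txt | gzip -n > /tmp/calendars.tgz
$ if cmp -s /tmp/calendars.tgz archive/latest.tgz; then
      rm /tmp/calendars.tgz
  else
      mv /tmp/calendars.tgz archive/latest.tgz
  fi

With -n in place, the only differences left between successive archives
are real changes to the *.txt files or their metadata.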