From: Robert Heller on 14 Apr 2010 13:28

At Wed, 14 Apr 2010 16:55:19 +0000 Greg Russell <me(a)invalid.com> wrote:

> On Wed, 14 Apr 2010 18:10:36 +0200, Harald Meyer wrote:
>
> > Greg Russell wrote:
> >> Does the gzip (-z) option to tar introduce some entropy into a file
> >> between iterations with the same arguments? e.g.:
> >
> > gzip preserves the timestamp of the input file, in this case the .tar
> > file.
>
> What does that mean wrt:
>
> $ tar czf test_0.tgz *.txt; tar czf test_1.tgz *.txt
>
> Is a transient tar file created before the gzip process, and that's the
> difference accounted by:
>
> $ cmp -l *.tgz
> 5 277 310

No 'transient tar file' is created. Tar sends the tar 'file' to a pipe
to gzip, which simply grabs the current time as a timestamp.

> Is there any option to disable the feature that causes the problem?

Why is this a problem? What *really* are you trying to do?

-- 
Robert Heller             -- 978-544-6933
Deepwoods Software        -- Download the Model Railroad System
http://www.deepsoft.com/  -- Binaries for Linux and MS-Windows
heller(a)deepsoft.com      -- http://www.deepsoft.com/ModelRailroadSystem/
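For reference, the single differing byte reported by cmp falls in the gzip
header's MTIME field (bytes 5 through 8 per RFC 1952), which holds the Unix
time at which gzip ran. A quick way to read that embedded timestamp from
each archive -- a sketch assuming GNU od on a little-endian machine:

  $ od -An -td4 -j4 -N4 test_0.tgz   # MTIME as seconds since the epoch
  $ od -An -td4 -j4 -N4 test_1.tgz   # differs because the second run happened later

The two values differ only because the archives were written at different
moments, not because the archived files changed.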
From: Greg Russell on 14 Apr 2010 22:11

On Wed, 14 Apr 2010 12:28:29 -0500, Robert Heller quoted and wrote:

>> > gzip preserves the timestamp of the input file, in this case the .tar
>> > file.
>>
>> What does that mean wrt:
>>
>> $ tar czf test_0.tgz *.txt; tar czf test_1.tgz *.txt
>>
>> Is a transient tar file created before the gzip process, and that's the
>> difference accounted by:
>>
>> $ cmp -l *.tgz
>> 5 277 310
>
> No 'transient tar file' is created. Tar sends the tar 'file' to a pipe
> to gzip, which simply grabs the current time as a timestamp.
>
>> Is there any option to disable the feature that causes the problem?
>
> Why is this a problem? What *really* are you trying to do?

The intent is to minimize a local archive of dynamic text-based calendar
data files that exist on a remote web server, in order to enable a backup
of the work of several people to the most recent state if needed.

Once per hour, a local script run from cron uses "wget -N" to retrieve any
changed text data files from the remote web server. The complete local
collection of text files is then archived in /tmp with "tar -cz"; if the
result differs from the most recently saved archive, the temp tgz is saved
as a new archive, and if it's the same, the /tmp file is discarded.

Everything works properly using gzip instead of "tar -cz", but the
archives are about 4 times as large ... not a major obstacle, but now
that the nature of the problem has been identified with the help of
Usenet, it can be circumvented by other means.

Since tar preserves the individual timestamps of the files in the
archive, it was a very useful learning exercise to find that the "-z"
option adds an additional piece of information to the archive that
causes the diff.

Appreciation goes to Mssr. Meyer for illuminating this feature, and
judging by the nature of Usenet and the timestamp of your post (16
minutes after Mssr. Meyer's), my appreciation goes to you as well,
despite the somewhat challenging tone of your closing.
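The actual script is not shown in the thread, but the hourly job described
above might look roughly like the following sketch. The FTP URL, directory
names, and archive naming are placeholders, not the poster's setup; the
gzip timestamp discussed in this thread is what makes the cmp test fail
even when nothing has changed.

  #!/bin/sh
  # Hypothetical hourly cron job: mirror the remote text files, then keep a
  # new archive only when its bytes differ from the last one saved.
  cd /var/local/calendar-mirror || exit 1
  wget -q -N 'ftp://example.org/calendar/*.txt'    # fetch only files newer than the local copies
  tar czf /tmp/calendar.tgz *.txt                  # embeds a gzip timestamp unless -n is used
  latest=$(ls -t archive/*.tgz 2>/dev/null | head -n 1)
  if [ -n "$latest" ] && cmp -s /tmp/calendar.tgz "$latest"; then
      rm /tmp/calendar.tgz                         # identical to the newest saved archive
  else
      mv /tmp/calendar.tgz "archive/calendar-$(date +%Y%m%d%H%M).tgz"
  fi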
From: Vilmos Soti on 15 Apr 2010 11:35

Greg Russell <me(a)invalid.com> writes:

>>> Is there any option to disable the feature that causes the problem?

Yes.

>>
>> Why is this a problem? What *really* are you trying to do?
>
> The intent is to minimize a local archive of dynamic text-based calendar
> data files that exist on a remote web server, in order to enable a backup
> of the work of several people to a most recent state if needed.

Did you take a look at rsync? Especially the --link-dest= feature?

> Once per hour, a local script via cron employs "wget -N" to retrieve any
> changed text data files from the remote web server. The complete local
> collection of text files is then archived in /tmp with "tar -cz" and if
> it differs from the most recently-saved archive, the temp tgz is saved as
> a new archive; if it's the same then the /tmp file is discarded.

If you insist on using tar and gzip, then try this:

  tar ... | gzip -n > file.tgz

From gzip(1):

  -n --no-name
       When compressing, do not save the original file name and time stamp
       by default. (The original name is always saved if the name had to be
       truncated.) When decompressing, do not restore the original file name
       if present (remove only the gzip suffix from the compressed file name)
       and do not restore the original time stamp if present (copy it from
       the compressed file). This option is the default when decompressing.

Here is a test (rearranged a bit for readability):

  $ while true; do
      sleep 0.3
      date
      echo hello world | gzip | md5sum
    done
  Thu Apr 15 08:30:39 PDT 2010
  ee5f97aead40f0d081508548853e787d  -
  Thu Apr 15 08:30:39 PDT 2010
  ee5f97aead40f0d081508548853e787d  -
  Thu Apr 15 08:30:39 PDT 2010
  ee5f97aead40f0d081508548853e787d  -
  Thu Apr 15 08:30:40 PDT 2010
  0f442555a99878a9a5782a50370869f4  -
  Thu Apr 15 08:30:40 PDT 2010
  0f442555a99878a9a5782a50370869f4  -
  Thu Apr 15 08:30:40 PDT 2010
  0f442555a99878a9a5782a50370869f4  -
  Thu Apr 15 08:30:41 PDT 2010
  ff070d337e6c4ffa8f5d6cc18552d900  -
  Thu Apr 15 08:30:41 PDT 2010
  ff070d337e6c4ffa8f5d6cc18552d900  -
  Thu Apr 15 08:30:41 PDT 2010
  ff070d337e6c4ffa8f5d6cc18552d900  -
  $

And now the same with the -n switch:

  $ while true; do
      sleep 0.3
      date
      echo hello world | gzip -n | md5sum
    done
  Thu Apr 15 08:32:13 PDT 2010
  92b0a5a28433b3ce3ee6d01a84b4508c  -
  Thu Apr 15 08:32:13 PDT 2010
  92b0a5a28433b3ce3ee6d01a84b4508c  -
  Thu Apr 15 08:32:14 PDT 2010
  92b0a5a28433b3ce3ee6d01a84b4508c  -
  Thu Apr 15 08:32:14 PDT 2010
  92b0a5a28433b3ce3ee6d01a84b4508c  -
  Thu Apr 15 08:32:14 PDT 2010
  92b0a5a28433b3ce3ee6d01a84b4508c  -
  Thu Apr 15 08:32:15 PDT 2010
  92b0a5a28433b3ce3ee6d01a84b4508c  -
  Thu Apr 15 08:32:15 PDT 2010
  92b0a5a28433b3ce3ee6d01a84b4508c  -
  $

Vilmos
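Applied to the commands from the start of the thread, the fix might look
like this sketch (assuming the same *.txt files and that neither their
contents nor their metadata change between the two runs):

  $ tar cf - *.txt | gzip -n > test_0.tgz
  $ tar cf - *.txt | gzip -n > test_1.tgz
  $ cmp -s test_0.tgz test_1.tgz && echo identical   # prints "identical" when the archives match byte for byte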
From: Dave Gibson on 15 Apr 2010 13:18

Greg Russell <me(a)invalid.com> wrote:
> On Wed, 14 Apr 2010 12:28:29 -0500, Robert Heller quoted and wrote:
>
>>> > gzip preserves the timestamp of the input file, in this case the .tar
>>> > file.
>>>
>>> What does that mean wrt:
>>>
>>> $ tar czf test_0.tgz *.txt; tar czf test_1.tgz *.txt
>>>
>>> Is a transient tar file created before the gzip process, and that's the
>>> difference accounted by:
>>>
>>> $ cmp -l *.tgz
>>> 5 277 310
>>
>> No 'transient tar file' is created. Tar sends the tar 'file' to a pipe
>> to gzip, which simply grabs the current time as a timestamp.
>>
>>> Is there any option to disable the feature that causes the problem?
>>
>> Why is this a problem? What *really* are you trying to do?
>
> The intent is to minimize a local archive of dynamic text-based calendar
> data files that exist on a remote web server, in order to enable a backup
> of the work of several people to a most recent state if needed.
>
> Once per hour, a local script via cron employs "wget -N" to retrieve any
> changed text data files from the remote web server. The complete local
> collection of text files is then archived in /tmp with "tar -cz" and if
> it differs from the most recently-saved archive, the temp tgz is saved as
> a new archive; if it's the same then the /tmp file is discarded.

The files' access times will be modified when tar reads them; the next
time they are added to an archive, the updated timestamps will make the
new archive differ from the older one. GNU tar's --atime-preserve option
can be used to sidestep that, but you're still relying on the archive
structure not changing (-H gnu helps here) and on the files being added
to the archives in the same order.

> Everything works properly using gzip instead of "tar -cz", but the
> archives are about 4 times as large ... not a major obstacle, but now
> that the nature of the problem has been discovered with the help of
> Usenet, it can be circumvented by other means.

Use gzip's -n option to prevent it writing a timestamp to its output.
This can be set in the environment so gzip will still see it when invoked
via tar:

  GZIP=-n tar czf /tmp/whatever.tgz --atime-preserve -H gnu *.txt

As an alternative to comparing tarballs, consider using sorted md5sum
lists instead -- one matching the contents of the archive, one freshly
generated each time wget completes.
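The sorted md5sum-list comparison might be sketched as follows; the
directory layout and file names are placeholders, and the GZIP=-n,
--atime-preserve, and -H gnu pieces are taken from the post above:

  # Hypothetical check: rebuild the archive only when the sorted checksum
  # list of the mirrored files no longer matches the one saved last time.
  cd /var/local/calendar-mirror || exit 1
  find . -name '*.txt' -print0 | sort -z | xargs -0 md5sum > /tmp/current.md5
  if ! cmp -s /tmp/current.md5 archive/last.md5; then
      GZIP=-n tar czf "archive/calendar-$(date +%Y%m%d%H%M).tgz" --atime-preserve -H gnu *.txt
      mv /tmp/current.md5 archive/last.md5
  else
      rm /tmp/current.md5
  fi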
From: Greg Russell on 15 Apr 2010 13:20
On Thu, 15 Apr 2010 08:35:00 -0700, Vilmos Soti wrote:

....

>> The intent is to minimize a local archive of dynamic text-based
>> calendar data files that exist on a remote web server, in order to
>> enable a backup of the work of several people to a most recent state if
>> needed.
>
> Did you take a look at rsync? Especially the --link-dest= feature?

Yes, but we don't have access to a shell on the remote web server, hence
the "wget -N" using ftp.

....

> If you insist on using tar and gzip, then try this:
>
>   tar ... | gzip -n > file.tgz

Thank you, that works very well.