From: Ron Hardin on 4 Feb 2010 06:25

Worldnet is terminating its personal web pages.

Apparently wget can save a copy of what I have there on my HD,
but I am snowed by options.

A recursive retrieve looks right, starting from index.html, and
restricted to http://home.att.net/

What's a safe wget command to try that won't flood the server or
download the entire world wide web?

(I can fix links myself with sed scripts once it's on the HD, so
just need a copy of what's there unconverted)

wget is under cygwin on XP Home but ought to work the same as
linux.

Maybe somebody is an expert user and can give the obvious command
line.
--
rhhardin(a)mindspring.com

On the internet, nobody knows you're a jerk.
From: Nuno J. Silva on 4 Feb 2010 14:53

Ron Hardin <rhhardin(a)mindspring.com> writes:

> Worldnet is terminating its personal web pages.
>
> Apparently wget can save a copy of what I have there on my HD,
> but I am snowed by options.

wget can save copies of the files you have there *and* that are
reachable by following links. That is, if you uploaded a file to make
it available but never link to it, wget via http won't guess that it's
there.

> A recursive retrieve looks right, starting from index.html, and
> restricted to http://home.att.net/

I've used wget before, but I can't recall how it decides when other
hosts should be accessed.

> What's a safe wget command to try that won't flood the server or
> download the entire world wide web?

To avoid overloading the server, you should use --wait=seconds (other
units are allowed, too).

> (I can fix links myself with sed scripts once it's on the HD, so
> just need a copy of what's there unconverted)

wget can fix links (--convert-links) and extensions
(--adjust-extension). If it works the way you want, maybe it's worth
letting wget do that.

Maybe you should also look at --page-requisites (to download pictures
and other inlined items), and at --continue (so wget doesn't download
the same file twice).

> wget is under cygwin on XP Home but ought to work the same as
> linux.
>
> Maybe somebody is an expert user and can give the obvious command
> line.

--
Nuno J. Silva
gopher://sdf-eu.org/1/users/njsg
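A rough sketch of how those options fit together on a single page
(untested; "~username" is a placeholder for wherever the pages
actually live, which isn't given in the thread):

  # fetch one page plus its inlined files, 2 seconds between requests
  wget --wait=2 --page-requisites --convert-links --adjust-extension \
       http://home.att.net/~username/index.html

This gets index.html plus its inlined images and stylesheets, pausing
two seconds between requests. --adjust-extension is spelled
--html-extension in older wgets, and --convert-links can be left out
if the sed approach is preferred. The recursive side is covered
further down the thread.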
From: root on 4 Feb 2010 15:21

Nuno J. Silva <nunojsilva(a)invalid.invalid> wrote:
>
>> A recursive retrieve looks right, starting from index.html, and
>> restricted to http://home.att.net/
>
> I've used wget before, but I can't recall how it decides when other
> hosts should be accessed.
>
>> What's a safe wget command to try that won't flood the server or
>> download the entire world wide web?
>

If you direct wget -r to a URL which has a link UP instead of DOWN,
you will end up pulling in much more than you want, to the point of
downloading the whole site.
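One way to fence the crawl in (a sketch, with /~username again standing
in for the real directory):

  # follow only links that stay inside the listed directory
  wget -r -I /~username http://home.att.net/~username/index.html

-I (--include-directories) tells wget to follow only links whose path
falls under the listed directories, so a stray link pointing up or off
to the side won't drag the rest of the site along.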
From: Sidney Lambe on 4 Feb 2010 15:51

On comp.os.linux.misc, Ron Hardin <rhhardin(a)mindspring.com> wrote:

> Worldnet is terminating its personal web pages.
>
> Apparently wget can save a copy of what I have there on my HD,
> but I am snowed by options.
>
> A recursive retrieve looks right, starting from index.html, and
> restricted to http://home.att.net/
>
> What's a safe wget command to try that won't flood the server or
> download the entire world wide web?
>
> (I can fix links myself with sed scripts once it's on the HD, so
> just need a copy of what's there unconverted)
>
> wget is under cygwin on XP Home but ought to work the same as
> linux.
>
> Maybe somebody is an expert user and can give the obvious command
> line.
> --
> rhhardin(a)mindspring.com
>
> On the internet, nobody knows you're a jerk.

Posting the url so that we could look it over ourselves would make a
lot of sense.

And I'd post the question on comp.unix.shell.

Sid
From: Nuno J. Silva on 4 Feb 2010 17:06

root <NoEMail(a)home.org> writes:

> Nuno J. Silva <nunojsilva(a)invalid.invalid> wrote:
>>
>>> A recursive retrieve looks right, starting from index.html, and
>>> restricted to http://home.att.net/
>>
>> I've used wget before, but I can't recall how it decides when other
>> hosts should be accessed.
>>
>>> What's a safe wget command to try that won't flood the server or
>>> download the entire world wide web?
>>
>
> If you direct wget -r to a URL which has a link UP instead of DOWN,
> you will end up pulling in much more than you want, to the point of
> downloading the whole site.

To avoid that (going up on the same host) you can use --no-parent.

--
Nuno J. Silva
gopher://sdf-eu.org/1/users/njsg
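Putting the thread together, a conservative recursive command might
look like this (again with ~username as a stand-in for the real path):

  # recurse, but stay on this host and below the starting directory,
  # waiting 2 seconds between requests
  wget --recursive --no-parent --wait=2 --page-requisites \
       http://home.att.net/~username/index.html

A recursive wget stays on the starting host unless --span-hosts is
added, and --no-parent keeps it from climbing above the starting
directory, so together they keep the retrieval well short of the whole
web. Add --convert-links and --adjust-extension only if letting wget
rewrite the pages is preferred over the sed scripts.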