From: James Harris (es) on 11 Jun 2010 13:05

"Bob Melson" <amia9018(a)mypacks.net> wrote in message
news:D4OdnbLEnJAPUpPRnZ2dnUVZ_sCdnZ2d(a)earthlink.com...
...
> Another thing to consider is that many folks killfile all gmail,
> googlemail and googlegroups addresses because of the huge amount of
> spam originating on them and google's refusal to do anything about it.
> Many of us don't see those original posts, just the rare responses.

Understood. Google's lack of policing, or even of any adequate response
to spam reports, is very bad. The trouble is it's just too useful. The
Usenet service providers I use seem to filter spam - including that from
Google - but keep legitimate posts.

To anyone who didn't see the original query: I'm trying to use wget -r
to back up

  http://sundry.wikispaces.com/

but despite what I try I only ever get the home page. Any ideas why wget
is not recursing to linked pages on the same site?

James
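One common cause of exactly this symptom, worth ruling out first: wget
honours robots.txt during recursive retrieval by default, so a site that
disallows crawlers yields only the page named on the command line. A
minimal sketch of a run that sidesteps that and also captures any
session cookie the site sets (the cookie filename is illustrative):

  # wget obeys robots.txt when recursing unless told otherwise;
  # -e robots=off disables that. The cookie options save the session
  # cookie so a follow-up run can reuse it with --load-cookies.
  wget -r -l inf -e robots=off \
       --keep-session-cookies --save-cookies cookies.txt \
       http://sundry.wikispaces.com/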
From: James Harris (es) on 11 Jun 2010 13:11

"Christian" <cgregoir99(a)yahoo.com> wrote in message
news:hutero$hsp$1(a)writer.imaginet.fr...
>> "James Harris" <james.harris.1(a)googlemail.com> wrote in message
>> news:daed461b-a37a-445a-8c7d-4791875fc4fe(a)t10g2000yqg.googlegroups.com...
>> On 4 June, 23:47, James Harris <james.harri...(a)googlemail.com> wrote:
>
>>> I'm trying to use wget -r to back up
>>>
>>>   http://sundry.wikispaces.com/
>>>
>>> but it fails to back up more than the home page. The same command
>>> works fine elsewhere and I've tried various options for the above web
>>> site to no avail. The site seems to use a session id - if that's
>>> important - but the home page as downloaded clearly has the <a href
>>> links to further pages so I'm not sure why wget fails to follow them.
>>>
>>> Any ideas?
>
>> No response from comp.unix.admin. Trying comp.unix.shell. Maybe
>> someone there has an idea to fix the wget problem...?
>
>> James
>
> Try with a 'standard' user-agent : wget --user-agent="Mozilla/4.0
> (compatible; MSIE 7.0; Windows NT 6.0; GTB6.4; SLCC1; .NET CLR 2.0.50727;
> Media Center PC 5.0; Tablet PC 2.0; .NET CLR 3.5.21022; .NET CLR
> 3.5.30729; .NET CLR 3.0.30729)" ...

Also a good idea. I've just tried with a couple of user-agent strings
but it still doesn't work. I don't think it can be the user-agent id, as
wget loads the specified page successfully and that page looks all
right: it contains embedded <a href=...> links. Unfortunately wget -r
fails to follow them.

James
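Another silent stopper worth checking against the saved copy: wget only
follows links on the same host unless told otherwise, so if the embedded
<a href=...> links are fully qualified with a different hostname,
recursion ends at the home page with no error. A quick way to inspect
the link targets, plus the flags that would widen the net (the second
domain in the list is a guess at what the site might use):

  # List the distinct link targets in the saved home page.
  grep -o 'href="[^"]*"' index.html | sort -u

  # If the links point at another hostname, allow spanning to it.
  wget -r -H --domains=sundry.wikispaces.com,www.wikispaces.com \
       http://sundry.wikispaces.com/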
From: Chris Nehren on 12 Jun 2010 05:46

["Followup-To:" header set to comp.unix.admin.]

On 2010-06-11, Christian scribbled these curious markings:
>> "James Harris" <james.harris.1(a)googlemail.com> wrote in message
>> news:daed461b-a37a-445a-8c7d-4791875fc4fe(a)t10g2000yqg.googlegroups.com...
[cut]
> Try with a 'standard' user-agent : wget --user-agent="Mozilla/4.0
> (compatible; MSIE 7.0; Windows NT 6.0; GTB6.4; SLCC1; .NET CLR 2.0.50727;
> Media Center PC 5.0; Tablet PC 2.0; .NET CLR 3.5.21022; .NET CLR
> 3.5.30729; .NET CLR 3.0.30729)" ...

In addition: have you turned on debugging yet? Have you asked wget to
print the HTTP headers of the requests and responses yet? The server is
giving wget information that it is using to decide not to go any
further. Ask wget for this information and you should be able to discern
why it is behaving the way it is. Otherwise you're just guessing in an
engineering discipline.

--
Thanks and best regards,
Chris Nehren
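The flags Chris is pointing at are standard wget options, so the check
is cheap to run:

  # -d/--debug prints wget's link-extraction and enqueueing decisions;
  # -S/--server-response prints the HTTP headers of every response.
  wget -d -S -r http://sundry.wikispaces.com/ 2>&1 | tee wget.log

The debug output typically names each rejected link and the rule that
rejected it (robots exclusion, different host, depth limit), which pins
the cause down directly instead of by trial and error.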
From: Use-Author-Supplied-Address-Header on 16 Jun 2010 14:26

James Harris <james.harris.1(a)googlemail.com> wrote:
: On 4 June, 23:47, James Harris <james.harri...(a)googlemail.com> wrote:
[cut]
: No response from comp.unix.admin. Trying comp.unix.shell. Maybe
: someone there has an idea to fix the wget problem...?

The best place to deal with this is the wget mailing list. See
http://lists.gnu.org/mailman/listinfo/bug-wget. For an NNTP 'mirror',
see also the newsgroup gmane.comp.web.wget.general on the gmane server
at news.gmane.org.

HTH
Tom.

Ps. The email address in the header is just a spam-trap.
--
Tom Crane, Dept. Physics, Royal Holloway, University of London,
Egham Hill, Egham, Surrey, TW20 0EX, England.
Email: T.Crane at rhul dot ac dot uk  Fax: +44 (0) 1784 472794