From: James Harris on 7 Jun 2010 17:10

On 4 June, 23:47, James Harris <james.harri...(a)googlemail.com> wrote:
> I'm trying to use wget -r to back up
>
> http://sundry.wikispaces.com/
>
> but it fails to back up more than the home page. The same command
> works fine elsewhere and I've tried various options for the above web
> site to no avail. The site seems to use a session id - if that's
> important - but the home page as downloaded clearly has the <a href
> links to further pages so I'm not sure why wget fails to follow them.
>
> Any ideas?

No response from comp.unix.admin. Trying comp.unix.shell. Maybe
someone there has an idea to fix the wget problem...?

James
From: Tony on 8 Jun 2010 17:19

On 07/06/2010 22:10, James Harris wrote:
> On 4 June, 23:47, James Harris<james.harri...(a)googlemail.com> wrote:
>
>> I'm trying to use wget -r to back up
>>
>> http://sundry.wikispaces.com/
>>
>> Any ideas?
>
> No response from comp.unix.admin. Trying comp.unix.shell. Maybe
> someone there has an idea to fix the wget problem...?

Does the site's robots.txt file preclude the links you're trying to
spider? wget plays nice by default.

--
Tony Evans
Saving trees and wasting electrons since 1993
blog -> http://perceptionistruth.com/
books -> http://www.bookthing.co.uk
[ anything below this line wasn't written by me ]
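If robots.txt does turn out to be the blocker, wget can be told to ignore it. A minimal sketch, assuming the target is the wiki above (-e robots=off disables wget's robots.txt handling, so use it only on a site you are entitled to mirror):

    # Inspect the robots rules first
    wget -q -O - http://sundry.wikispaces.com/robots.txt

    # If those rules block the crawl, ignore them for the recursive backup
    wget -r -e robots=off http://sundry.wikispaces.com/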
From: Bob Melson on 8 Jun 2010 19:17

On Tuesday 08 June 2010 15:19, Tony (tony(a)darkstorm.invalid) opined:
> On 07/06/2010 22:10, James Harris wrote:
>> On 4 June, 23:47, James Harris<james.harri...(a)googlemail.com> wrote:
>>
>>> I'm trying to use wget -r to back up
>>>
>>> http://sundry.wikispaces.com/
>>>
>>> Any ideas?
>>
>> No response from comp.unix.admin. Trying comp.unix.shell. Maybe
>> someone there has an idea to fix the wget problem...?
>
> Does the site's robots.txt file preclude the links you're trying to
> spider? wget plays nice by default.

Another thing to consider is that many folks killfile all gmail,
googlemail and googlegroups addresses because of the huge amount of
spam originating on them and google's refusal to do anything about it.
Many of us don't see those original posts, just the rare responses.

--
Robert G. Melson | Rio Grande MicroSolutions | El Paso, Texas
-----
Nothing astonishes men so much as common sense and plain dealing.
                                        Ralph Waldo Emerson
From: Christian on 11 Jun 2010 09:56

"James Harris" <james.harris.1(a)googlemail.com> wrote in message
news:daed461b-a37a-445a-8c7d-4791875fc4fe(a)t10g2000yqg.googlegroups.com...
> On 4 June, 23:47, James Harris <james.harri...(a)googlemail.com> wrote:
>> I'm trying to use wget -r to back up
>>
>> http://sundry.wikispaces.com/
>>
>> but it fails to back up more than the home page. The same command
>> works fine elsewhere and I've tried various options for the above web
>> site to no avail. The site seems to use a session id - if that's
>> important - but the home page as downloaded clearly has the <a href
>> links to further pages so I'm not sure why wget fails to follow them.
>>
>> Any ideas?
>
> No response from comp.unix.admin. Trying comp.unix.shell. Maybe
> someone there has an idea to fix the wget problem...?
>
> James

Try with a 'standard' user-agent:

wget --user-agent="Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0;
GTB6.4; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; Tablet PC 2.0;
.NET CLR 3.5.21022; .NET CLR 3.5.30729; .NET CLR 3.0.30729)" ...

Christian
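A much shorter browser-like value usually works just as well if the server only checks that the client looks like a browser; the exact string below is only an example, combined with -r for the recursive backup:

    wget -r --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
         http://sundry.wikispaces.com/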
From: James Harris (es) on 11 Jun 2010 12:59

"Tony" <tony(a)darkstorm.invalid> wrote in message
news:humc4g$5p7$1(a)matrix.darkstorm.co.uk...
> On 07/06/2010 22:10, James Harris wrote:
>> On 4 June, 23:47, James Harris<james.harri...(a)googlemail.com> wrote:
>>
>>> I'm trying to use wget -r to back up
>>>
>>> http://sundry.wikispaces.com/
>>>
>>> Any ideas?
>>
>> No response from comp.unix.admin. Trying comp.unix.shell. Maybe
>> someone there has an idea to fix the wget problem...?
>
> Does the site's robots.txt file preclude the links you're trying to
> spider? wget plays nice by default.

Good idea. I've been checking it and it doesn't seem to be the problem.
It has lines such as

User-agent: *
Disallow: /file/rename
Disallow: /file/delete

but these don't disallow the data pages that I want to back up. There
is also a sitemap.xml. To my untutored eye it looks fine too.

James
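With robots.txt apparently clean, and given that the original post mentions a session id, one remaining possibility is that the page links only resolve when the session cookie is attached. wget can carry session cookies across a recursive run; a sketch, assuming the cookie-jar file name cookies.txt is arbitrary and that the site sets its session via an ordinary Set-Cookie header:

    # First request: capture the session cookie into a local jar
    wget --save-cookies cookies.txt --keep-session-cookies \
         -O /dev/null http://sundry.wikispaces.com/

    # Recursive backup reusing that session
    wget -r --load-cookies cookies.txt --keep-session-cookies \
         http://sundry.wikispaces.com/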