Urllib2: Only a partial page retrieved [Python]


From: Dragon Lord on 22 May 2010 05:43

I am trying to download a few IEEE pages using urllib2, but for
certain pages I get only the first part of the page. For other pages
from the same server and URL (just a different page ID) I get the right
results. The difference between these pages seems to be the publication
date of the paper the page describes. Pages for papers from before 2000
end just before the date; pages for papers from 2000 and later end at
</html>.

Two example URLs:

Does not work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048
Does work: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=854728

I tried both urlopen and urlretrieve and tried both urllib and
urllib2. With urlopen I tried both .read() and .read(10000) to make
sure I got the whole page, but nothing helped.
Sample code:

import urllib2

# The URL has to be a single string literal; it was wrapped across two lines.
response = urllib2.urlopen("http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048")
html = response.read()
print html

The cutoff is always at the same location: just after the label
"Meeting Date" and before the date itself. Could it be that something
is interpreted as an EOF marker or something like that?

example of the cutoff point with a bad page:
 Meeting Date: 

example of the cutoff point with a good page:
 Meeting Date: 

13 jun 2000

By the way, the bad pages do continue after this point if you use a
web browser, so it does not seem to be a server problem.

From: Dragon Lord on 22 May 2010 12:24

Oops, the "good" page is also handled wrongly. The papers from 2000
are handled wrongly too, so here is a real example of a page that works:

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5206867


From: hpsMouse on 23 May 2010 05:19

On May 22, 5:43 pm, Dragon Lord <dragonlord...(a)gmail.com> wrote:
> The cutoff is always at the same location: just after the label
> "Meeting Date" and before the date itself. Could it be that something
> is interpreted as an EOF marker or something like that?
>
> example of the cutoff point with a bad page:
> Meeting Date: 
>
> example of the cutoff point with a good page:
> Meeting Date: 

I checked the TCP packets and found that the remote HTTP server sends a
data packet with the "PUSH" flag set, causing the client to close the
connection. That is exactly where "Meeting Date: " appears.
This does not seem to be a bug in Python, because Qt, telnet, and wget
all failed in my test as well.
Most browsers use HTTP keep-alive, so the connection won't be closed.
I think that's why a browser shows the page correctly.

From: hpsMouse on 23 May 2010 05:42

I know what the problem is.

The server checks the client's locale setting to determine how the date
should be displayed. Python does not send locale information by default,
so the server fails at that point.

If you add the following field to the HTTP request, the response will
be correct:
Accept-Language: en
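
A minimal sketch of how that header could be sent from Python. The
helper names build_request and fetch_page are made up for illustration,
and the import fallback for Python 3 is an assumption that goes beyond
what was tried in this thread:

```python
try:
    import urllib2  # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2  # assumed Python 3 equivalent


def build_request(url, language="en"):
    """Return a Request object that carries an Accept-Language header."""
    return urllib2.Request(url, headers={"Accept-Language": language})


def fetch_page(url, language="en"):
    """Download the page body, sending Accept-Language with the request."""
    response = urllib2.urlopen(build_request(url, language))
    try:
        return response.read()
    finally:
        response.close()


# Usage (performs a real network request):
# html = fetch_page("http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=517048")
```

Building the Request separately from opening it makes it possible to
inspect the headers without contacting the server.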

From: Dragon Lord on 23 May 2010 07:34

Thanks, that works perfectly!

(oh and I learnt something new too, because I tried using telnet to
connect to the server :) )
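
For anyone who wants to repeat the telnet experiment, here is a rough
sketch of the request text one could type by hand. The function name
build_raw_request is hypothetical, and the choice of HTTP/1.0 is my own
assumption to keep the exchange simple:

```python
def build_raw_request(host, path, language=None):
    """Return the text of a minimal HTTP/1.0 GET request."""
    lines = ["GET %s HTTP/1.0" % path, "Host: %s" % host]
    if language is not None:
        # The header suggested above; leave it out to see the difference.
        lines.append("Accept-Language: %s" % language)
    # An empty line terminates the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


request_text = build_raw_request(
    "ieeexplore.ieee.org", "/xpl/freeabs_all.jsp?arnumber=517048", "en")
```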

