From: Scipione Dal Ferro on
Hi there,

I use Import to parse the hyperlinks of many similar HTML pages without any problem, but for a few pages (such as the example in the subject) it fails.
In more detail, here is the example with the result:

In[1]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", "Hyperlinks"]

Read::readt: Invalid input found when reading <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
from C:\Users\scipione.dalferro\AppData\Local\Temp\mFA3E.tmp\ascensoriromamir.a.m. >>

Out[1]= $Failed

The error message states that there is invalid input, yet the page opens correctly in a browser.

I tried changing the element to "Source" and to others, but got the same result.
Similar pages work correctly, for example this one:

In[2]:= Import["http://www.paginegialle.it/esis", "Hyperlinks"]

I hope you can help me understand this issue.

Thanks,
Scipione

From: Sjoerd C. de Vries on
Hi Scipione,

Given the uncommon file extension, you have to state explicitly that
you're dealing with HTML, like this:

Import["http://www.paginegialle.it/ascensoriromamir.a.m", {"HTML",
"Hyperlinks"}]

Cheers -- Sjoerd

On Mar 18, 11:31 am, Scipione Dal Ferro <scipionedalfe...(a)yahoo.it>
wrote:


From: Hans Michel on
This worked:

In[2]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", {"XHTML",
"Hyperlinks"}]
Out[2]=
{http://www.paginegialle.it,http://www.paginegialle.it,http://www.paginegialle.it/index_numero.html,http://www.paginegialle.it/cat/pagine_gialle_naviga.html,http://www.paginegialle.it/index_video.html,/ascensoriromamir.a.m/segnala,/pg/cgi/vcf.cgi?cc=758980621&cl=1,http://www.paginegialle.it/ascensoriromamir.a.m,mailto:fra.mirabella(a)tiscali.it,http://www.paginegialle.it/ascensoriromamir.a.m/mappa,http://www.paginegialle.it/ascensoriromamir.a.m/fotoaerea,http://www.paginegialle.it/ascensoriromamir.a.m/percorso,/ascensoriromamir.a.m,/ascensoriromamir.a.m/contatto,,http://www.paginebianche.it/,http://www.tuttocitta.it/,http://www.paginegiallevisual.it/,http://www.paginegiallenav.it/,http://www.892424.it/,http://www.seat.it,http://www.europages.it/,http://www.seatconvoi.it/,http://www.convoimagazineseat.it/,http://www.seatcorporateuniversity.it/,http://www.giallopromo.it/,http://www.kompassitalia.it/,http://www.consodata.it/,http://www.Lineaffari.com/,http://www.118000.fr/,
http://www.11880.com/,http://www.thomsonlocal.com/,http://www.11811.es/,http://www.alberghieturismo.it/,http://www.jobville.it/,http://www.paginegialle.it/pg/extra/marchi/seat_protetti.html,http://www.paginegialle.it/pg/offertapgol/cgi/contatta.cgi,http://www.paginegialle.it/pg/extra/privacy.html,http://www.paginegialle.it/pg/extra/copyright/tutelacopyright.html}

In[3]:= $Version
Out[3]= 7.0 for Microsoft Windows (32-bit) (November 10, 2008)

Since the extension of this file was not .htm or .html, and it included an
SGML DOCTYPE declaration, I don't think the imported file was routed to the
correct parser. Apparently the current link can also be parsed successfully
using {"HTML","XMLObject"}.
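
For readers who want to try the {"HTML","XMLObject"} route mentioned above, here is a minimal sketch rather than a definitive recipe. It assumes that the page is still reachable and that the imported tree stores link targets as plain "href" attributes on "a" elements. The names url, xml and hrefs are only illustrative.

```mathematica
(* Assumption: the page still exists and is served as (X)HTML. *)
url = "http://www.paginegialle.it/ascensoriromamir.a.m";

(* Name the format explicitly so the trailing ".m" is not used to guess it. *)
xml = Import[url, {"HTML", "XMLObject"}];

(* Collect the href attribute of every <a> element, at any depth. *)
(* Assumption: attributes appear as plain "href" -> "..." rules.  *)
hrefs = Cases[xml,
   XMLElement["a", {___, "href" -> h_, ___}, _] :> h,
   Infinity];
```

This should give roughly the same list as the "Hyperlinks" element, but it leaves room to filter on other attributes or on the link text if that is ever needed.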

Please note that it helps to tell the function explicitly what kind of
file it is dealing with, as in the following:

In[4]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", {"HTML",
"Hyperlinks"}]
Out[4]=
{http://www.paginegialle.it,http://www.paginegialle.it,http://www.paginegialle.it/index_numero.html,http://www.paginegialle.it/cat/pagine_gialle_naviga.html,http://www.paginegialle.it/index_video.html,/ascensoriromamir.a.m/segnala,/pg/cgi/vcf.cgi?cc=758980621&cl=1,http://www.paginegialle.it/ascensoriromamir.a.m,mailto:fra.mirabella(a)tiscali.it,http://www.paginegialle.it/ascensoriromamir.a.m/mappa,http://www.paginegialle.it/ascensoriromamir.a.m/fotoaerea,http://www.paginegialle.it/ascensoriromamir.a.m/percorso,/ascensoriromamir.a.m,/ascensoriromamir.a.m/contatto,,http://www.paginebianche.it/,http://www.tuttocitta.it/,http://www.paginegiallevisual.it/,http://www.paginegiallenav.it/,http://www.892424.it/,http://www.seat.it,http://www.europages.it/,http://www.seatconvoi.it/,http://www.convoimagazineseat.it/,http://www.seatcorporateuniversity.it/,http://www.giallopromo.it/,http://www.kompassitalia.it/,http://www.consodata.it/,http://www.Lineaffari.com/,http://www.118000.fr/,
http://www.11880.com/,http://www.thomsonlocal.com/,http://www.11811.es/,http://www.alberghieturismo.it/,http://www.jobville.it/,http://www.paginegialle.it/pg/extra/marchi/seat_protetti.html,http://www.paginegialle.it/pg/offertapgol/cgi/contatta.cgi,http://www.paginegialle.it/pg/extra/privacy.html,http://www.paginegialle.it/pg/extra/copyright/tutelacopyright.html}
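
Since the original question involves many similar pages, the explicit-format call can be wrapped once and mapped over a list of addresses. This is only a sketch: importLinks, urls, results and ok are hypothetical names, and the two URLs are simply the ones quoted in this thread, which may no longer resolve. The likely cause of the original failure, that a URL ending in ".m" is treated as a Mathematica package file, is an assumption consistent with the Read::readt message and is not confirmed here.

```mathematica
(* Hypothetical helper: always state the format instead of letting *)
(* Import infer it from whatever the URL happens to end with.      *)
importLinks[url_String] := Import[url, {"HTML", "Hyperlinks"}]

(* The two pages mentioned in this thread (they may have moved). *)
urls = {
   "http://www.paginegialle.it/ascensoriromamir.a.m",
   "http://www.paginegialle.it/esis"
   };

(* One list of links per page; a failed import yields $Failed. *)
results = importLinks /@ urls;

(* Keep only the pages that were imported successfully. *)
ok = DeleteCases[results, $Failed];
```

Naming the format for every page costs nothing for pages that already worked, so there is no need to special-case the unusual addresses.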

Hans

"Scipione Dal Ferro" <scipionedalferro(a)yahoo.it> wrote in message
news:hnsrtt$5ks$1(a)smc.vnet.net...


From: rafscipio on
Sjoerd, Hans, thanks for your clear explanations.

(I don't know why I didn't think of specifying the file type! :) )

Many regards,
Scipione.