From: Scipione Dal Ferro on 18 Mar 2010 05:31

Hi there,

I use Import to parse the hyperlinks of many similar HTML pages without any problem, but for a few pages (such as the example in the subject) it fails. In more detail, here is the example with the result:

In[1]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", "Hyperlinks"]

Read::readt: Invalid input found when reading <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> from C:\Users\scipione.dalferro\AppData\Local\Temp\mFA3E.tmp\ascensoriromamir.a.m. >>

Out[1]= $Failed

The error message says there is invalid input, yet the page opens correctly in a browser.

I tried changing the element to "Source" and others, but with the same result. Similar pages work correctly, for example this one:

In[2]:= Import["http://www.paginegialle.it/esis", "Hyperlinks"]

I hope you can help me understand this issue.

Thanks,
Scipione
From: Sjoerd C. de Vries on 19 Mar 2010 03:37

Hi Scipione,

Given the uncommon file extension, you have to make explicit that you're dealing with HTML, like this:

Import["http://www.paginegialle.it/ascensoriromamir.a.m", {"HTML", "Hyperlinks"}]

Cheers -- Sjoerd
From: Hans Michel on 19 Mar 2010 03:38

This worked:

In[2]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", {"XHTML", "Hyperlinks"}]

Out[2]= {http://www.paginegialle.it,http://www.paginegialle.it,http://www.paginegialle.it/index_numero.html,http://www.paginegialle.it/cat/pagine_gialle_naviga.html,http://www.paginegialle.it/index_video.html,/ascensoriromamir.a.m/segnala,/pg/cgi/vcf.cgi?cc=758980621&cl=1,http://www.paginegialle.it/ascensoriromamir.a.m,mailto:fra.mirabella(a)tiscali.it,http://www.paginegialle.it/ascensoriromamir.a.m/mappa,http://www.paginegialle.it/ascensoriromamir.a.m/fotoaerea,http://www.paginegialle.it/ascensoriromamir.a.m/percorso,/ascensoriromamir.a.m,/ascensoriromamir.a.m/contatto,,http://www.paginebianche.it/,http://www.tuttocitta.it/,http://www.paginegiallevisual.it/,http://www.paginegiallenav.it/,http://www.892424.it/,http://www.seat.it,http://www.europages.it/,http://www.seatconvoi.it/,http://www.convoimagazineseat.it/,http://www.seatcorporateuniversity.it/,http://www.giallopromo.it/,http://www.kompassitalia.it/,http://www.consodata.it/,http://www.Lineaffari.com/,http://www.118000.fr/,http://www.11880.com/,http://www.thomsonlocal.com/,http://www.11811.es/,http://www.alberghieturismo.it/,http://www.jobville.it/,http://www.paginegialle.it/pg/extra/marchi/seat_protetti.html,http://www.paginegialle.it/pg/offertapgol/cgi/contatta.cgi,http://www.paginegialle.it/pg/extra/privacy.html,http://www.paginegialle.it/pg/extra/copyright/tutelacopyright.html}

In[3]:= $Version

Out[3]= 7.0 for Microsoft Windows (32-bit) (November 10, 2008)

Since the extension of this file was not .htm or .html, and it included an SGML DOCTYPE declaration, I don't think the imported file was routed to the correct parser. Apparently the current link can also be parsed successfully using {"HTML","XMLObject"}.
Please note that it helps to tell the function what kind of file it is dealing with, as in the following:

In[4]:= Import["http://www.paginegialle.it/ascensoriromamir.a.m", {"HTML", "Hyperlinks"}]

Out[4]= {http://www.paginegialle.it,http://www.paginegialle.it,http://www.paginegialle.it/index_numero.html,http://www.paginegialle.it/cat/pagine_gialle_naviga.html,http://www.paginegialle.it/index_video.html,/ascensoriromamir.a.m/segnala,/pg/cgi/vcf.cgi?cc=758980621&cl=1,http://www.paginegialle.it/ascensoriromamir.a.m,mailto:fra.mirabella(a)tiscali.it,http://www.paginegialle.it/ascensoriromamir.a.m/mappa,http://www.paginegialle.it/ascensoriromamir.a.m/fotoaerea,http://www.paginegialle.it/ascensoriromamir.a.m/percorso,/ascensoriromamir.a.m,/ascensoriromamir.a.m/contatto,,http://www.paginebianche.it/,http://www.tuttocitta.it/,http://www.paginegiallevisual.it/,http://www.paginegiallenav.it/,http://www.892424.it/,http://www.seat.it,http://www.europages.it/,http://www.seatconvoi.it/,http://www.convoimagazineseat.it/,http://www.seatcorporateuniversity.it/,http://www.giallopromo.it/,http://www.kompassitalia.it/,http://www.consodata.it/,http://www.Lineaffari.com/,http://www.118000.fr/,http://www.11880.com/,http://www.thomsonlocal.com/,http://www.11811.es/,http://www.alberghieturismo.it/,http://www.jobville.it/,http://www.paginegialle.it/pg/extra/marchi/seat_protetti.html,http://www.paginegialle.it/pg/offertapgol/cgi/contatta.cgi,http://www.paginegialle.it/pg/extra/privacy.html,http://www.paginegialle.it/pg/extra/copyright/tutelacopyright.html}

Hans
From: rafscipio on 19 Mar 2010 07:45
Sjoerd, Hans, thanks for your clear explanations. (I don't know why I didn't think about specifying the file type! :) )

Many regards,
Scipione.
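
For anyone who wants to try the fix from the replies without hitting the network, here is a minimal sketch. It assumes Mathematica 7 or later. The inline page and the variable name html are made up for illustration and are not from the thread. The likely reason the original call failed is that the URL ends in .m, an extension Mathematica normally associates with its own package files, so the page was probably read as Mathematica code rather than as HTML.

```mathematica
(* FileExtension returns the part of the name after the last dot. *)
(* For the failing URL that part is "m", not "htm" or "html".    *)
FileExtension["ascensoriromamir.a.m"]

(* Hypothetical inline page, so the example needs no network access. *)
html = "<html><body><a href=\"http://www.paginegialle.it/esis\">esis</a></body></html>";

(* Naming the format explicitly selects the HTML importer,          *)
(* in the same way as {"HTML", "Hyperlinks"} in the replies above.  *)
ImportString[html, {"HTML", "Hyperlinks"}]
```

The same {"HTML", "Hyperlinks"} element specification can be passed to Import with a URL, as shown in the replies, whenever the extension does not identify the page as HTML.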