From: kevin on 13 Jun 2010 04:13

Hi Guys,

Is there any way to use Mathematica to read all the words of a website, say www.bloomberg.com? Thanks in advance.

Best,
Kevin
From: Christopher Arthur on 13 Jun 2010 18:52

Look at the Miscellaneous`Dictionary package and the Web Services Package for easy ways, but otherwise it depends on how nice the website is.

kevin wrote:
> Hi Guys,
>
> Is there any way to use Mathematica to read all the words of a
> website, say www.bloomberg.com? Thanks in advance.
>
> Best,
> Kevin
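If the website is nice enough, the simplest thing may be to let Import do all the work. A minimal sketch (untested on this particular site; when Import's HTML handling fails on a page, the "Plaintext" element can come back empty, and you have to fall back to parsing the "Source" element yourself):

(* Import the rendered text of the page and split it on whitespace *)
words = StringSplit[Import["http://www.bloomberg.com", "Plaintext"]];
Tally[words]  (* word frequencies, if that is what is wanted *)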
From: telefunkenvf14 on 13 Jun 2010 18:53

On Jun 13, 3:13 am, kevin <kevin999ko...(a)gmail.com> wrote:
> Hi Guys,
>
> Is there any way to use Mathematica to read all the words of a
> website, say www.bloomberg.com? Thanks in advance.
>
> Best,
> Kevin

I'm not experienced with this kind of thing, but for the case at hand I think what you'll have to do is Import[] as "Source" and then parse the source code for what you want.

For a specific example, I'll start on the Technology summary page, with the eventual goal of grabbing the title of each article, the url, and the corresponding short summary:

raw = Import[
  "http://www.bloomberg.com/news/industries/technology.html",
  "Source"]

After looking at the raw source code in Firefox, I think this will get us close to grabbing some relevant info:

StringPosition[raw, "<a class=\"summheadline\"" ~~ __ ~~ "</p>"]

And use StringTake[] to grab the above positions:

raw2 = StringTake[raw, %]

My inexperience in parsing strings stops me here... Hopefully someone else can chime in. I'd like to learn.

-RG
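P.S. Poking around the docs a bit more, maybe StringCases[] can grab the pieces directly, skipping the StringPosition[]/StringTake[] detour. I'm not sure this is right - it assumes the href attribute immediately follows the class attribute, which I haven't verified against the page source:

(* Extract {title, url} pairs from each "summheadline" anchor tag; *)
(* Shortest[] keeps a match from running on into the next headline *)
StringCases[raw,
 Shortest["<a class=\"summheadline\" href=\"" ~~ url__ ~~ "\">" ~~
   title__ ~~ "</a>"] :> {title, url}]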
From: Hans Michel on 15 Jun 2010 02:30

Try

ReadList[
 StringToStream[
  StringReplace[Import["http://www.bloomberg.com/", "Source"],
   RegularExpression["<(.|\\n)*?>"] -> " "]],
 Word, WordSeparators -> {" ", "\t", "\n"}]

This particular page does not parse well as Plaintext. Even the XMLObject is missing the body element, so the Data, FullData, Hyperlinks and Plaintext elements are either blank or empty:

In[1]:= Import["http://www.bloomberg.com/", "Elements"]
Out[1]= {Data, FullData, Hyperlinks, Plaintext, Source, Title, XMLObject}

So read the source, use some brute-force regex to strip the tags, and then stream and parse the result into a list of words.

Hans

"kevin" <kevin999koshy(a)gmail.com> wrote in message news:hv23ul$5hl$1(a)smc.vnet.net...
> Hi Guys,
>
> Is there any way to use Mathematica to read all the words of a
> website, say www.bloomberg.com? Thanks in advance.
>
> Best,
> Kevin
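To add to the above: if word frequencies are wanted rather than the raw word list, Tally the result. A sketch along the same lines:

(* Same tag-stripping approach as above, then count the words *)
words = ReadList[
   StringToStream[
    StringReplace[Import["http://www.bloomberg.com/", "Source"],
     RegularExpression["<(.|\\n)*?>"] -> " "]],
   Word, WordSeparators -> {" ", "\t", "\n"}];
Take[Reverse[SortBy[Tally[words], Last]], 10]  (* the 10 most frequent *)

Note that this keeps script and style contents and HTML entities such as &amp; in the word list, since only the tags themselves were stripped.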
From: Leonid Shifrin on 15 Jun 2010 02:31

Hi Kevin,

I think that the notion of "words" is not sharply defined in this context, since HTML is not a plain text format - you have words in the main text, in links, in image descriptions and in many other places.

I wrote a simplistic HTML parser a couple of years ago for my own purposes. I intended (and still do) to polish it, convert it into one or several packages and make it available to anybody interested - but never had the time to do it. Your post prompted me to speed things up in this regard, and as a first step, here is the code as it is now - maybe you or someone else will find it of some use, since it will still take me some time to bring it to a proper package format.

-------------------------------------------------------------------------------------

ClearAll[listSplit, reconstructIntervals, groupElements, groupPositions,
  processPosList, groupElementsNested, getAllUsedTags, refineTags,
  getTagTitle, getPairedTags, makeTagReplaceRules, getTagNames,
  makeTagHashRules, tagSplit, splitText, preparse, openCloseEnumerate,
  getOpenCloseForm, makeTagDepthList, oneStepParse, tagProcess,
  openCloseProcess, documentParse, refineParsed, parseText, removeLeaves,
  attribContainer, postProcess];

listSplit[x_List, lengthlist_List, headlist_List] :=
  MapThread[#1 @@ Take[x, #2] &,
   {headlist,
    Transpose[{Most[#] + 1, Rest[#]} &[FoldList[Plus, 0, lengthlist]]]}];

reconstructIntervals[listlen_Integer, ints_List] :=
  Module[{missed, startint, lastint},
   startint = If[ints[[1, 1]] == 1, {}, {1, ints[[1, 1]] - 1}];
   lastint =
    If[ints[[-1, -1]] == listlen, {}, {ints[[-1, -1]] + 1, listlen}];
   missed =
    Map[If[#[[2, 1]] - #[[1, 2]] > 1, {#[[1, 2]] + 1, #[[2, 1]] - 1}, {}] &,
     Partition[ints, 2, 1]];
   missed = Join[missed, {lastint}];
   Prepend[Flatten[Transpose[{ints, missed}], 1], startint]];

groupElements[lst_List, poslist_List, headlist_List] /;
   And[OrderedQ[Flatten[Sort[poslist]]],
    Length[headlist] == Length[poslist]] :=
  Module[{totalheadlist, allints, llist},
   totalheadlist =
    Append[Flatten[
      Transpose[{Array[Sequence &, {Length[headlist]}], headlist}], 1],
     Sequence];
   allints = reconstructIntervals[Length[lst], poslist];
   llist = Map[If[# === {}, 0, 1 - Subtract @@ #] &, allints];
   listSplit[lst, llist, totalheadlist]];

(* To work on general heads, we need this *)
groupElements[h_[x__], poslist_List, headlist_List] :=
  h[Sequence @@ groupElements[{x}, poslist, headlist]];

(* If we have a single head *)
groupElements[expr_, poslist_List, head_] :=
  groupElements[expr, poslist, Table[head, {Length[poslist]}]];

groupPositions[plist_List] :=
  Reap[Sow[Last[#], {Most[#]}] & /@ plist, _, List][[2]];

processPosList[{openlist_List, closelist_List}] :=
  Module[{opengroup, closegroup, poslist},
   {opengroup, closegroup} = groupPositions /@ {openlist, closelist};
   poslist = Transpose[Transpose[Sort[#]] & /@ {opengroup, closegroup}];
   If[UnsameQ @@ poslist[[1]],
    Return[(Print["Unmatched lists!", {openlist, closelist}]; {})],
    poslist =
     Transpose[{poslist[[1, 1]], Transpose /@ Transpose[poslist[[2]]]}]]];

groupElementsNested[nested_, {openposlist_List, closeposlist_List},
    head_] /; Head[head] =!= List :=
  Fold[Function[{x, y},
    MapAt[groupElements[#, y[[2]], head] &, x, {y[[1]]}]],
   nested,
   Sort[processPosList[{openposlist, closeposlist}],
    Length[#2[[1]]] < Length[#1[[1]]] &]];

(* Find all the tags in the document *)
getAllUsedTags[text_String] :=
  Module[{htmlTagsposlist, result, chars = Characters[text], x},
   htmlTagsposlist =
    StringPosition[text, ShortestMatch["<" ~~ x__ ~~ Whitespace | ">"],
     Overlaps -> True];
   result =
    Union[Map[ToLowerCase,
       StringJoin @@@
        Map[Take[chars, {#[[1]], #[[2]] - 1}] &, htmlTagsposlist]] /.
      {"<br/" :> "<br"}]];

(* Get rid of tags with non-alphabetic characters *)
refineTags[tags_List] :=
  Module[{alphabet = Characters["abcdefghijklmnopqrstuvwxyz/"]},
   DeleteCases[tags, x_ /; ! MemberQ[alphabet, StringTake[x, {2, 2}]]]];

getTagTitle[tag_String] :=
  If[StringTake[tag, {2, 2}] === "/", StringDrop[tag, {2}], tag];

(* Find all paired tags *)
getPairedTags[tags_List] :=
  Reverse /@
   Select[Reap[Sow[#, getTagTitle[#]] & /@ tags, _, #2 &][[2]],
    Length[#] == 2 &];

(* Prepare string replacement rules to be used when we tokenize the string *)
makeTagReplaceRules[pairedtags_List, unpairedtags_List] :=
  Sort[Join[
    Rule @@@
     Map[{#, {StringDrop[#, 1], "Unpaired", "Open"}} &, unpairedtags],
    Rule @@@
     Flatten[Transpose[{#, {{StringDrop[#[[1]], 1], "Open"},
           {StringDrop[#[[2]], 2], "Close"}}}] & /@ pairedtags, 1],
    {Rule[">", {">", "UnpairedClose"}]}],
   StringLength[#1[[2, 1]]] > StringLength[#2[[2, 1]]] &];

getTagNames[pairedtags_List, unpairedtags_List] :=
  StringDrop[#, 1] & /@ Join[Transpose[pairedtags][[1]], unpairedtags];

makeTagHashRules[tagnames_List] :=
  Dispatch[MapThread[Rule, {#, Range[Length[#]]} &[tagnames]]];

(* Tokenize *)
tagSplit[text_String, {tagrules__Rule}] :=
  DeleteCases[StringSplit[text, {tagrules}],
   x_ /; StringMatchQ[x, Whitespace | "" ~~ ">" ~~ Whitespace | ""]];

splitText[text_String, pairedtags_List, unpairedtags_List] :=
  tagSplit[text, makeTagReplaceRules[pairedtags, unpairedtags]];

(* Insert containers for attributes as a pre-parsing stage *)
preparse[text_] :=
  Module[{step1},
   With[{pos = Position[text, {_, "Open"} | {_, _, "Open"}, Infinity]},
    step1 =
     ReplacePart[text, attribContainer /@ Extract[text, pos + 1], pos + 1,
      List /@ Range[Length[pos]]]]];

(* Insert tag depth for each open/close tag sublist in the tokenized list *)
openCloseEnumerate[splittext_List, pairedtags_List, unpairedtags_List] :=
  Module[{tagnames, taghashrules, tagtitlecounters, unpairedstack = {},
    temptag},
   tagnames = getTagNames[pairedtags, unpairedtags];
   taghashrules = makeTagHashRules[tagnames];
   tagtitlecounters = Table[0, {Length[tagnames]}];
   Map[Switch[#,
      {x_, "Open"},
      {#[[1]], "Open", ++tagtitlecounters[[#[[1]] /. taghashrules]]},
      {x_, "Close"},
      {#[[1]], "Close", tagtitlecounters[[#[[1]] /. taghashrules]]--},
      {x_, "Unpaired", "Open"},
      AppendTo[unpairedstack, #];
      {#[[1]], "Open", ++tagtitlecounters[[#[[1]] /. taghashrules]]},
      {"/>" | ">", "UnpairedClose"},
      If[Length[unpairedstack] > 0,
       temptag = unpairedstack[[-1]];
       unpairedstack = Most[unpairedstack];
       {temptag[[1]], "Close",
        tagtitlecounters[[temptag[[1]] /. taghashrules]]--},
       (* else *) #],
      _, #] &, splittext]];

getOpenCloseForm[text_String, pairedtags_List, unpairedtags_List] :=
  openCloseEnumerate[preparse@splitText[text, pairedtags, unpairedtags],
   pairedtags, unpairedtags];

(* Create a list of all used tags and their maximal depth *)
makeTagDepthList[opencloseform_List, pairedtags_List, unpairedtags_List] :=
  DeleteCases[
   {#, Max[Cases[opencloseform, {#, "Open" | "Close", x_Integer} :> x]]} & /@
    getTagNames[pairedtags, unpairedtags],
   {x_, -Infinity}];

(* Parse a single depth level of a given tag *)
oneStepParse[parsed_, depth_Integer, tag_String, head_] :=
  Module[{plist =
     Position[parsed, {tag, #, depth}, Infinity] & /@ {"Open", "Close"}},
   groupElementsNested[parsed, plist, head]];

(* Parse a given tag, all levels *)
tagProcess[parseme_, {tag_String, maxdepth_Integer}] :=
  Module[{hd = ToExpression[tag <> "Container"], result},
   With[{hd1 = hd, ourtag = tag},
    hd1[{ourtag, "Open", n_}, x__, {ourtag, "Close", n_}] := hd1[x];
    result =
     Fold[oneStepParse[#1, #2, ourtag, hd1] &, parseme,
      Range[maxdepth, 1, -1]]];
   Clear[hd];
   result];

(* Parse all tags *)
openCloseProcess[opencloseform_List, pairedtags_List, unpairedtags_List] :=
  Fold[tagProcess, opencloseform,
   makeTagDepthList[opencloseform, pairedtags, unpairedtags]];

(* A few higher-level functions to combine steps *)
documentParse[text_String, pairedtags_List, unpairedtags_List] :=
  openCloseProcess[getOpenCloseForm[text, pairedtags, unpairedtags],
   pairedtags, unpairedtags];

refineParsed[parsed_] :=
  If[# === {}, #, First@#] &@Cases[parsed, _htmlContainer];

parseText[text_String] :=
  Module[{tags, paired, unpaired, parsed},
   tags = refineTags@getAllUsedTags@text;
   paired = getPairedTags@tags;
   unpaired = Complement[tags, Flatten@paired];
   parsed =
    refineParsed@documentParse[text, paired, unpaired] /.
     {">", "UnpairedClose"} :> ">";
   {parsed, paired, unpaired}];

(* Keep the page skeleton only *)
removeLeaves[parsed_] := DeleteCases[parsed, _, {-1}];

(* Some optional post-processing - but leads to information loss *)
postProcess[parsed_] := DeleteCases[parsed, ">" | "", Infinity];

-------------------------------------------------------------------------------------

Some tidbits on the implementation: I developed a groupElementsNested function which can be useful in many contexts; it allows one to wrap elements inside possibly nested expressions in additional wrappers, in a reasonably efficient manner, based on lists of their positions. The parser is based on this functionality, and is breadth-first rather than depth-first. This allows it to parse individual HTML tags and depth levels if needed - for example, we could instruct it to parse only <div> tags, etc. This also means that it will not break completely on malformed HTML - maybe it will just not parse all tags or levels.

The parser computes the used HTML tags dynamically from the document, and is completely heuristic in the sense that no strict HTML format rules are built into it. So it is not at all industrial strength or production quality. However, I have used the parser for my own purposes and it worked reasonably well for me (in terms of both correctness and efficiency).

What it gives you is a symbolic tree representation of the parsed HTML document, with the different tags transformed into ...Container[content] expressions, for example divContainer[aContainer["Some text"]]. The container names mirror the HTML tag names, except for attribContainer, which is for tag attributes.
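For instance, on a toy document (a quick sketch - since the parser is heuristic, the exact output shape may differ in details such as attribContainer wrappers):

(* assumes the definitions above have been evaluated *)
{parsed, paired, unpaired} =
  parseText["<html><div><a>Some text</a></div></html>"];
postProcess[parsed]
(* something like htmlContainer[divContainer[aContainer["Some text"]]] *)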
Also, without post-processing no information is lost - I was able to reconstruct the original HTML files from this symbolic representation, I think in all cases I looked at. Given this structure, one can use Mathematica rules and patterns to further process the document to one's needs. For example, here we reconstruct the text of the Wikipedia page about parsers, reasonably well IMO:

page = Import["http://en.wikipedia.org/wiki/Parser", "Text"];
parsed = postProcess@parseText[page];
StringJoin @@
 Cases[Cases[parsed, _pContainer, Infinity] /.
   _attribContainer :> Sequence[], _String, Infinity]

The code was (and is) intended to go under the GPL, so you can treat it as such. Once again, it has not been polished and certainly contains some bugs. If you discover any, please let me know.

Hope this helps.

Regards,
Leonid

On Sun, Jun 13, 2010 at 1:13 AM, kevin <kevin999koshy(a)gmail.com> wrote:
> Hi Guys,
>
> Is there any way to use Mathematica to read all the words of a
> website, say www.bloomberg.com? Thanks in advance.
>
> Best,
> Kevin
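P.S. To connect this back to the original question: once the plain text has been reconstructed as above, extracting "all the words" is straightforward (a sketch; WordCharacter matches letters and digits, so this splits on everything else):

text = StringJoin @@
   Cases[Cases[parsed, _pContainer, Infinity] /.
     _attribContainer :> Sequence[], _String, Infinity];
words = DeleteCases[StringSplit[text, Except[WordCharacter] ..], ""]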