From: Andreas Marschke on 22 Feb 2010 16:45

> Wherever you build pipelines of cut, head, tail, sed, grep, tr, etc.,
> use (e.g.) awk(1) instead; it's "available on every possible machine":
> standard on Unix and available even for WinDOS if you like. Another
> option, if you're not repelled by its syntax, is perl (it's non-standard
> on Unixes, but generally available as well).
>
> Janis

TBH I haven't taken the time yet to have a look into awk. But I am trying
to learn perl besides my current work on C++ applications. So, well, yes,
I will have a look and see what I can do with your tool of choice. Thanks!
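Janis's suggestion above can be illustrated with a tiny self-contained sketch; the data file and pattern here are made up for illustration, not taken from the thread:

```shell
# Sample data standing in for /etc/passwd-style input.
printf 'root:x:0:0\ndaemon:x:1:1\nmail:x:8:8\n' > /tmp/users.txt

# Pipeline style: two processes, two tools.
grep mail /tmp/users.txt | cut -d: -f1

# One awk call doing the same work: -F sets the field separator,
# the /mail/ pattern selects lines, and {print $1} emits field one.
awk -F: '/mail/ {print $1}' /tmp/users.txt
```

Both commands print the same field, but the awk form keeps the selection and the extraction in one place, which is the point Janis is making.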
From: mop2 on 22 Feb 2010 18:33

On Mon, 22 Feb 2010 11:06:09 -0300, mop2 <invalid(a)mail.address> wrote:

> On Mon, 22 Feb 2010 07:43:38 -0300, Andreas Marschke
> <xxtjaxx(a)gmail.com> wrote:
>
>> To start it off, here is a simple bash script scraping the daily
>> JARGON off the website for the new hackers dictionary:
>>
>> |+-+-+-+-+-+--+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+|
>> #!/bin/bash
>>
>> wget http://www.jargon.net/ -O- 2>/dev/null \
>>   | grep '<A HREF="/jargonfile/[a-z]/[a-zA-Z0-9]*.html">[a-zA-Z0-9]*</A>' \
>>   | sed 's:\(<[a-zA-Z0-9]*>\|</[a-zA-Z0-9]*>\|<A HREF="/[a-zA-Z0-9]*/[a-z]/[a-zA-Z0-9]*\.html">\|<[a-z]*>\|</[a-z]*>\)::g' \
>>   | sed 's/  */ /g'
>> |+-+-+-+-+-+--+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+|
>
> An alternative for one line, monospaced, specific for that site:
>
> echo `wget -qO- http://www.jargon.net/|grep HR|sed 's/<[^>]*>//g'`
>
> For fragments of web pages, I think 3 generic functions are convenient:
>
>   f1 - get the page
>   f2 - filter the desired fragment
>   f3 - remove html tags and display as text, monospaced and
>        honoring newlines and, perhaps, bold tags
>
> Or perhaps one function with 3 parameters.

With this I can see the fragment of the page suggested as an example by
Andreas:

  wget -qO- http://www.jargon.net/|grep -A99 '^</sc'|grep -B99 -m1 '^<img'

It is a bit larger than the final target desired, and intended as the
source for the question below.

If I define the point A as start and B as end in the stream:

A='<br />
</center>
<p>
<font size="+1">'

B="</font></p><center>
You're Visitor"

what is the suggested way to get all the text between these two points,
but without them, and without the html tags?

I don't see anything generic, practical and elegant. Can someone help?
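One possible answer to a question like this, sketched with a stand-in document rather than the live page: read the whole input as one record, locate the marks with awk's index() (plain string matching, so newlines and regex metacharacters in the marks need no escaping), then strip the tags that remain. The file name and the sample HTML below are made up; only the two mark strings come from the thread:

```shell
# A stand-in fragment shaped like the page described in the thread.
cat > /tmp/page.html <<'EOF'
<html><body>
<br />
</center>
<p>
<font size="+1">hello <b>world</b>
more text</font></p><center>
You're Visitor 123
</body></html>
EOF

awk 'BEGIN { RS = "\f" }        # no form feed in HTML: whole file = 1 record
{
    a = "<font size=\"+1\">"    # tail of the start mark A
    b = "</font></p><center>"   # head of the end mark B
    i = index($0, a); j = index($0, b)
    if (i && j) {
        s = substr($0, i + length(a), j - (i + length(a)))
        gsub(/<[^>]*>/, "", s)  # drop the tags left inside the fragment
        print s
    }
}' /tmp/page.html
```

Matching only the tail of A and the head of B is enough here, since index() finds the first occurrence of each; with repeated marks one would need to search from an offset instead.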
From: Ivan Shmakov on 23 Feb 2010 01:26

>>>>> "AM" == Andreas Marschke <xxtjaxx(a)gmail.com> writes:

[...]

 AM> To start it off Here is a simple bash script scraping the daily
 AM> JARGON off the website for the new hackers dictionary:

 AM> #!/bin/bash
 AM> wget http://www.jargon.net/ -O- 2>/dev/null

[...]

	Funny enough, but I have a similar script to fetch OISSTv2 [1]
	data from an FTP server.  Like:

#!/bin/bash

p=~/public/hist/logs/download/ftp.emc.ncep.noaa.gov-$(date +%s).

## NB: here, we parse the Squid caching proxy output, not the FTP
## server's one (as the latter isn't going to be HTML.)
wget -qO - \
    --force-directories --timestamping \
    ftp://ftp.emc.ncep.noaa.gov/cmb/sst/oisst_v2/ \
    | sed -ne '\,.*<'A' HREF="\([^"]\+\)">[^"<>]*</A>/</H2>$, { s,,ftp.emc.ncep.noaa.gov\1, ; h ; } ; \,^<'A' HREF="\([^/"]\+\)">.*, { s//\1/ ; G ; s/\(.*\)\n\(.*\)/\2\1/ ; p ; }' \
    | grep -E -- '\<oisst\.[[:digit:]]*' \
    | LC_ALL=C sort -r \
    | (while read f ; do test -e "$f" || echo ftp://"$f" ; done) \
    > "$p"in

## NB: beware of the race here
LC_ALL=C wget -b -6 --quota=256M \
    --server-response \
    -i "$p"in -o "$p"out
sleep 1s
chmod =r -- "$p"{in,out}

	But perhaps it should read instead:

....

## NB: race is still possible here (though a bit less likely)
exec > "$p"out
chmod =r -- "$p"out
## ... or should we just umask before exec instead?
LC_ALL=C wget -b -6 --quota=256M \
    --server-response \
    -i "$p"in -o /dev/stdout

	As for the race condition, it's avoided by virtue of the fact
	that this script isn't usually run in parallel at all.

[1] http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.html

-- 
FSF associate member #7257
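The "umask before exec" variant Ivan wonders about does close the window entirely: when open(2) creates a file, the access it requested is granted regardless of the permission bits the new file gets, so a descriptor opened for writing at creation time stays writable even though the file is born mode 0400. A minimal sketch, with an illustrative path rather than the one from the script:

```shell
rm -f /tmp/oisst-demo.out       # clean slate for the demonstration
umask 0277                      # newly created files get mode 0400 (r--------)
exec 3> /tmp/oisst-demo.out     # creation-time open: fd 3 is writable anyway
echo 'download log line' >&3    # writing through the open fd still succeeds
exec 3>&-                       # close; any later open for writing now fails
ls -l /tmp/oisst-demo.out       # shows -r-------- from the very first instant
```

Unlike the create-then-chmod sequence, there is no moment at which the file exists with write permission, so no race at all.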
From: Ivan Shmakov on 23 Feb 2010 01:30

>>>>> Ivan Shmakov <ivan(a)main.uusia.org> writes:

[...]

	Oops, a silly mistake here.

 ## NB: here, we parse the Squid caching proxy output, not the FTP
 ## server's one (as the latter isn't going to be HTML.)
 wget -qO - \
-    --force-directories --timestamping \
     ftp://ftp.emc.ncep.noaa.gov/cmb/sst/oisst_v2/ \

[...]

 ## NB: beware of the race here
 LC_ALL=C wget -b -6 --quota=256M \
+    --force-directories --timestamping \
     --server-response \
     -i "$p"in -o "$p"out

[...]

	Also note that the line below (and also the second wget
	invocation) implies that the script should be run from the
	directory below which the retrieved data is stored.

 > | (while read f ; do test -e "$f" || echo ftp://"$f" ; done) \

-- 
FSF associate member #7257
From: mop2 on 23 Feb 2010 03:29
On Mon, 22 Feb 2010 20:33:33 -0300, mop2 <invalid(a)mail.address> wrote:

> On Mon, 22 Feb 2010 11:06:09 -0300, mop2 <invalid(a)mail.address> wrote:
>
>> On Mon, 22 Feb 2010 07:43:38 -0300, Andreas Marschke
>> <xxtjaxx(a)gmail.com> wrote:
>>
>>> To start it off, here is a simple bash script scraping the daily
>>> JARGON off the website for the new hackers dictionary:
>>> [...]
>
>> An alternative for one line, monospaced, specific for that site:
>>
>> echo `wget -qO- http://www.jargon.net/|grep HR|sed 's/<[^>]*>//g'`
>>
>> For fragments of web pages, I think 3 generic functions are convenient:
>>
>>   f1 - get the page
>>   f2 - filter the desired fragment
>>   f3 - remove html tags and display as text, monospaced and
>>        honoring newlines and, perhaps, bold tags
>>
>> Or perhaps one function with 3 parameters.
>
> With this I can see the fragment of the page suggested as an example
> by Andreas:
>
>   wget -qO- http://www.jargon.net/|grep -A99 '^</sc'|grep -B99 -m1 '^<img'
>
> It is a bit larger than the final target desired, and intended as the
> source for the question below.
>
> If I define the point A as start and B as end in the stream:
>
> A='<br />
> </center>
> <p>
> <font size="+1">'
>
> B="</font></p><center>
> You're Visitor"
>
> what is the suggested way to get all the text between these two points,
> but without them, and without the html tags?
>
> I don't see anything generic, practical and elegant. Can someone help?

Without newlines in the marks, no problem (well, elegance? :-)):

$ cat g
#!/bin/bash
wg() {
    wget -qO- "$1" |
        tr -s '\n\t' ' ' |
        sed "s/> </></g;s|.*$2||;s|$3.*||;s/<[^>]*>//g" |
        fmt -w 78
}
case "$1" in
    jargon) wg http://www.jargon.net/ \
               ' 1995.<br /></center><p><font size="+1">' \
               "You're Visitor " ;;
esac
$ ./g jargon
flat-file /adj./ A flattened representation of some database or tree or
network structure as a single file from which the structure could implicitly
be rebuilt, esp. one in flat-ASCII form. See also sharchive.
$

However, newlines are important, particularly when tags like "<pre>"
exist, or when there are repeated marks, where just one more newline can
make the difference.

The search continues...

Sorry, I have a special interest in this kind of "toy". :)
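One direction for keeping the newlines that wg()'s `tr -s '\n\t' ' '` step throws away: first rewrite the tags that imply line breaks (<br>, </p>) into real newlines, then strip whatever tags remain. A sketch with GNU sed (the `\n` in the replacement text is a GNU extension) and a made-up fragment:

```shell
# Made-up HTML fragment; the <pre> indentation is what we want to survive.
cat > /tmp/frag.html <<'EOF'
<p>first line<br />second line</p>
<pre>  indented
  block</pre>
<p>last</p>
EOF

sed -e 's|<br */*>|\n|g' \
    -e 's|</p>|\n|g' \
    -e 's|<[^>]*>||g' /tmp/frag.html
```

Because the break-like tags are converted before the catch-all `<[^>]*>` deletion runs, "second line" ends up on its own line and the indentation inside <pre> is left untouched.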