From: james_027 on
On Apr 29, 5:31 am, Cameron Simpson <c...(a)zip.com.au> wrote:
> On 28Apr2010 22:03, Daniel Fetchinson <fetchin...(a)googlemail.com> wrote:
> | > Any idea how I can replace words in an HTML file? I mean that only the
> | > content should be replaced, while the HTML tags, JavaScript, and CSS
> | > remain untouched.
> |
> | I'm not sure what you tried and what you haven't but as a first trial
> | you might want to
> |
> | <untested>
> |
> | f = open( 'new.html', 'w' )
> | f.write( open( 'index.html' ).read( ).replace( 'replace-this', 'with-that' ) )
> | f.close( )
> |
> | </untested>
>
> If 'replace-this' occurs inside the javascript etc or happens to be an
> HTML tag name, it will get mangled. The OP didn't want that.
>
> The only way to get this right is to parse the file, then walk the doc
> tree editing only the text parts.
>
> The BeautifulSoup module (3rd party, but a single .py file and trivial to
> fetch and use, though it has some dependencies) does a good job of this,
> coping even with typical not quite right HTML. It gives you a parse
> tree you can easily walk, and you can modify it in place and write it
> straight back out.
>
> Cheers,
> --
> Cameron Simpson <c...(a)zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/
>
> The Web site you seek
> cannot be located but
> endless others exist
> - Haiku Error Messages http://www.salonmagazine.com/21st/chal/1998/02/10chal2.html

Hi all,

Thanks for all your input. Cameron Simpson got the idea of what I am
trying to do. I've been looking at Beautiful Soup, but so far I don't know
how to perform a search and replace with it.

Can anyone suggest a good read?

Thanks all,

James
From: Iain King on
On Apr 29, 10:38 am, Daniel Fetchinson <fetchin...(a)googlemail.com>
wrote:
> > | > Any idea how I can replace words in an HTML file? I mean that only the
> > | > content should be replaced, while the HTML tags, JavaScript, and CSS
> > | > remain untouched.
> > |
> > | I'm not sure what you tried and what you haven't but as a first trial
> > | you might want to
> > |
> > | <untested>
> > |
> > | f = open( 'new.html', 'w' )
> > | f.write( open( 'index.html' ).read( ).replace( 'replace-this', 'with-that' ) )
> > | f.close( )
> > |
> > | </untested>
>
> > If 'replace-this' occurs inside the javascript etc or happens to be an
> > HTML tag name, it will get mangled. The OP didn't want that.
>
> Correct, that is why I started with "I'm not sure what you tried and
> what you haven't but as a first trial you might". For instance if the
> OP wants to replace words which he knows are not in javascript and/or
> css and he knows that these words are also not in html attribute
> names/values, etc, etc, then the above approach would work, in which
> case BeautifulSoup is a gigantic overkill. The OP needs to specify
> more clearly what he wants, before really useful advice can be given.
>
> Cheers,
> Daniel
>

Funny, everyone else understood what the OP meant, and useful advice
was given.
From: Daniel Fetchinson on
>> > | > Any idea how I can replace words in an HTML file? I mean that only the
>> > | > content should be replaced, while the HTML tags, JavaScript, and CSS
>> > | > remain untouched.
>> > |
>> > | I'm not sure what you tried and what you haven't but as a first trial
>> > | you might want to
>> > |
>> > | <untested>
>> > |
>> > | f = open( 'new.html', 'w' )
>> > | f.write( open( 'index.html' ).read( ).replace( 'replace-this', 'with-that' ) )
>> > | f.close( )
>> > |
>> > | </untested>
>>
>> > If 'replace-this' occurs inside the javascript etc or happens to be an
>> > HTML tag name, it will get mangled. The OP didn't want that.
>>
>> Correct, that is why I started with "I'm not sure what you tried and
>> what you haven't but as a first trial you might". For instance if the
>> OP wants to replace words which he knows are not in javascript and/or
>> css and he knows that these words are also not in html attribute
>> names/values, etc, etc, then the above approach would work, in which
>> case BeautifulSoup is a gigantic overkill. The OP needs to specify
>> more clearly what he wants, before really useful advice can be given.
>
> Funny, everyone else understood what the OP meant, and useful advice
> was given.

It was a lucky day for the OP then!

:)

Cheers,
Daniel


--
Psss, psss, put it down! - http://www.cafepress.com/putitdown
From: Cameron Simpson on
On 29Apr2010 05:03, james_027 <cai.haibin(a)gmail.com> wrote:
| On Apr 29, 5:31 am, Cameron Simpson <c...(a)zip.com.au> wrote:
| > On 28Apr2010 22:03, Daniel Fetchinson <fetchin...(a)googlemail.com> wrote:
| > | > Any idea how I can replace words in an HTML file? I mean that only the
| > | > content should be replaced, while the HTML tags, JavaScript, and CSS
| > | > remain untouched.
[...]
| > The only way to get this right is to parse the file, then walk the doc
| > tree editing only the text parts.
| >
| > The BeautifulSoup module (3rd party, but a single .py file and trivial to
| > fetch and use, though it has some dependencies) does a good job of this,
| > coping even with typical not quite right HTML. It gives you a parse
| > tree you can easily walk, and you can modify it in place and write it
| > straight back out.
|
| Thanks for all your input. Cameron Simpson got the idea of what I am
| trying to do. I've been looking at Beautiful Soup, but so far I don't know
| how to perform a search and replace with it.

Well the BeautifulSoup web page helped me:
http://www.crummy.com/software/BeautifulSoup/documentation.html

Here's a function from a script I wrote to bulk edit a web site. I was
replacing OBJECT and EMBED nodes with modern versions:

    def recurse(node):
        global didmod
        for O in node.contents:
            if isinstance(O,Tag):
                for attr in 'src', 'href':
                    if attr in O:
                        rurl=O[attr]
                        rurlpath=pathwrt(rurl,SRCPATH)
                        if not os.path.exists(rurlpath):
                            print >>sys.stderr, "%s: MISSING: %s" % (SRCPATH, rurlpath,)
                O2=None
                if O.name == "object":
                    O2, SUBOBJ = fixmsobj(O)
                elif O.name == "embed":
                    O2, SUBOBJ = fixembed(O)
                if O2 is not None:
                    O.replaceWith(O2)
                    SUBOBJ.replaceWith(O)
                    ##print >>sys.stderr, "%s: update: new OBJECT: %s" % (SRCPATH, str(O2), )
                    didmod=True
                    continue
                recurse(O)

but you have only to change it a little to modify things that aren't Tag
objects. The calling end looks like this:

    with open(SRCPATH) as srcfp:
        srctext = srcfp.read()
    SOUP = BeautifulSoup(srctext)
    didmod = False # icky global set by recurse()
    recurse(SOUP)
    if didmod:
        srctext = str(SOUP)

If didmod becomes True we recompute srctext and resave the file (or save it
to a copy).
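
To make the "edit only the text parts" idea concrete, here is a rough
sketch. Note that it swaps in the standard library's html.parser instead
of BeautifulSoup, purely so that it runs with nothing installed; it streams
through the document rather than building a tree. The names
TextOnlyReplacer and replace_in_text are invented for this illustration.
It assumes Python 3 and reasonably well-formed HTML: end tags are written
back in lower case, CDATA sections are not handled, and a phrase that is
interrupted by an entity or an inline tag will not be matched.

```python
from html.parser import HTMLParser


class TextOnlyReplacer(HTMLParser):
    """Re-emit HTML as parsed, replacing a string only in visible text."""

    SKIP = ("script", "style")  # element contents that must stay untouched

    def __init__(self, old, new):
        # convert_charrefs=False reports entities such as &amp; as separate
        # events, so they can be written back the way they appeared.
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            data = data.replace(self.old, self.new)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def replace_in_text(html, old, new):
    """Return html with old replaced by new outside tags, scripts and styles."""
    parser = TextOnlyReplacer(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = '<p class="cat">cat &amp; <b>cat</b></p><script>var cat = 1;</script>'
print(replace_in_text(page, "cat", "dog"))
```

Tag names, attribute values and the contents of script and style elements
pass through as they were; only the text between tags is changed. With
BeautifulSoup the same decision is made per node while walking the tree.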

Cheers,
--
Cameron Simpson <cs(a)zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Democracy is the theory that the people know what they want, and deserve to
get it good and hard. - H.L. Mencken
From: Stefan Behnel on
Cameron Simpson, 30.04.2010 00:47:
> Here's a function from a script I wrote to bulk edit a web site. I was
> replacing OBJECT and EMBED nodes with modern versions:
>
>     def recurse(node):
>         global didmod
>         [...]
>                     didmod=True
>                     continue
>                 recurse(O)
>
> The calling end looks like this:
>
>     SOUP = BeautifulSoup(srctext)
>     didmod = False # icky global set by recurse()
>     recurse(SOUP)
>     if didmod:
>         srctext = str(SOUP)
>
> If didmod becomes True we recompute srctext and resave the file (or save it
> to a copy).

You should rethink your naming in the above code and remove the need for a
global variable.
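
One possible shape for that, as a sketch only: have the recursive function
return whether it changed anything and let the caller keep the flag. The
Node class below is a made-up stand-in for a parse-tree node, not
BeautifulSoup's Tag, and the "embed" to "object" renaming is purely
illustrative; modernise and did_modify are names chosen for this example.

```python
class Node:
    """Hypothetical stand-in for a parse-tree node (not BeautifulSoup's Tag)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def modernise(node, replacements):
    """Rename matching descendants of node in place.

    Returns True if anything was changed, so no global flag is needed.
    """
    changed = False
    for child in node.children:
        if child.name in replacements:
            child.name = replacements[child.name]
            changed = True
            continue  # as in the quoted code: do not descend into a replaced node
        # Always make the recursive call; only then fold its result in,
        # so an earlier change never short-circuits the rest of the walk.
        if modernise(child, replacements):
            changed = True
    return changed


tree = Node("body", [Node("p", [Node("embed")]), Node("div")])
did_modify = modernise(tree, {"embed": "object"})
if did_modify:
    print("tree changed; serialise it and save the file")
```

The caller then decides what to do with the result, and the function can
be reused or tested without any module-level state.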

Stefan