From: Simon on 14 Mar 2010 23:09

Office 2007 and later store each Word document as a zip file containing the XML for each part of the document. If you study the layout (open the .docx file with WinZip and extract it to a temporary location for analysis), you might find a better way to do this search and replace directly in the XML. If so, a tool like TextPipe Pro (which can find and replace in the XML of Office 2007 documents and zip files directly) could be handy.

Hope this helps!

Previous Posts In This Thread:

On Monday, February 02, 2009 4:08 PM PrincessLea wrote:
Frustration with regexes in Word, VBA, VBScript

I am trying to do some fairly simple (in the scheme of things) regular expressions in VBA for Word. For 15 years or more, I have been frustrated by many things about Word, and right now it's back to limitations respecting regular expressions.

I am looking for various patterns such as "section 54", "subsections 56(4) and 62.12(2)", "paragraph 23(2)(b)". The keywords may be singular or plural, and the numbers may or may not have decimal values. I mark the section numbers (such as 62.12) with unique characters on either side (I use upside-down exclamation and question marks), then do a single Word find-and-replace to remove the unique characters and mark the numbers with a character style.

I have done all this very nicely with VBScript regexes incorporated into VBA, but the problem is that any character-based formatting in my text is destroyed. I know why this happens, but it's frustrating nonetheless -- Microsoft should have developed a way to avoid the destruction.

Unfortunately, my frustration is only increased by the REAL cause of the problem -- the lack of competent regexes in Word itself. I mean, why is it that after all these years, Word regexes are still so stunted that they don't even have a zero-or-more option and an OR function??? Surely Microsoft could have spent a tiny bit of money on fixing this long-standing omission, preferably by exposing VBScript regexes in Word itself, as an alternative to what's there.

Yes, I can accomplish my task with a pile of individual search-and-replace operations, but that is inefficient, inelegant, frustrating and downright stupid.

My main regex is the following:

    strFindExpr = "(section|subsection|paragraph|subparagraph)(s)?"
    strFindExpr2 = "[\x20\xA0]+([1-9][0-9]{0,2})(\.[1-9][0-9]{0,2})?"
    objRegEx.Pattern = strFindExpr & strFindExpr2

(I know I don't need the backslash before the period, but I find it a useful holdover from Perl-type regexes.)

I am hoping that there is something I am unaware of that would allow me to use VBScript-style regexes to do what I'm trying to do, without losing my character formatting. Is there such a facility, or am I just relegated to either a brute-force pile of Word regex statements or programming the recognition at each find?

Thanks for any help you can provide, even if it's just to confirm that there is no other possibility within the realm of Word and VBA.

On Monday, February 02, 2009 5:05 PM PrincessLea wrote:
RE: Frustration with regexes in Word, VBA, VBScript

By the way, I realize that the expression:

    strFindExpr = "(section|subsection|paragraph|subparagraph)(s)?"

should be just:

    strFindExpr = "(section|paragraph)(s)?"

At the moment, it's just a bit of "self-documentation" to be removed later.
PL

On Monday, February 02, 2009 5:41 PM PrincessLea wrote:

Another frustration with Word's regex is that it seems impossible to disable case sensitivity. Is there some way to make it case insensitive?

Thanks.

PL

On Wednesday, February 04, 2009 12:16 PM PrincessLea wrote:

Larry, thanks for your efforts. Your program code seems to be a simple replace operation in an SGML/XML DTD (the CDATA keyword), which would not contain any formatting codes (unless done for documentation purposes, in which case you probably wouldn't be doing that replacement).

I've done lots of VBScript regexes in VBA, and they work fine if I don't care about losing character formatting. Your suggestion about "salvaging" character coding is possible, but more complex than programming the solution in VBA by doing a series of finds and parsing found strings in code rather than via regexes (which I have done).

So I have a working program -- I'm just really frustrated that I seem to have to use a more complex solution than I should have to, just because Microsoft seems to have not implemented "fully-competent" regexes in Word. Even a "zero-or-more" operator would be a huge improvement. I can't understand this lack of capability in the most important text-handling program in the world.

Of course, this is one case where I really hope I am wrong, and someone can tell me that there is a way to use "fully-competent" regexes in Word without losing character formatting. I'm not holding my breath, though.

PL

On Wednesday, February 04, 2009 2:09 PM Pablo Cardellino wrote:

Sorry, I really don't want to "steal" the thread, but I didn't find anything about RegExp in the Word VBA help. It would be very useful for me if RegExp were available in Word VBA. Where could I find some documentation about it?

Regards,
--
Pablo Cardellino
Florianópolis, SC
Brazil

On Wednesday, February 04, 2009 3:54 PM Jay Freedman wrote:

Hi Pablo,

The "trick" is that RegExp is supplied as a COM object from the Scripting library, C:\Windows\System32\vbscript.dll. VBA can use it, either by assigning the result of CreateObject("VBScript.RegExp") to an Object variable, or by going into the Tools > References dialog and setting a reference to "Microsoft VBScript Regular Expressions". But because it isn't literally a part of VBA, there's no VBA help topic for it. Try these articles:

http://msdn.microsoft.com/en-us/library/ms974570.aspx
http://www.vbaexpress.com/forum/showthread.php?t=6805
http://msdn.microsoft.com/en-us/library/yab2dx62(VS.85).aspx (the official Help topic)

The problem to which PrincessLeah referred is that RegExp operates only on strings within VBA, not on formatted ranges within documents. So if you use RegExp to find some text in a document and replace it with some other text, you'll lose any character formatting the original text had (and possibly paragraph/style formatting, if you're unlucky enough to replace paragraph marks). Word's built-in Find object can preserve or modify formatting, but its search syntax is comparatively brain-dead and there's no sign that anyone is looking at fixing it.

--
Regards,
Jay Freedman
Microsoft Word MVP
FAQ: http://word.mvps.org
Email cannot be acknowledged; please post all follow-ups to the newsgroup so all may benefit.

On Wednesday, February 04, 2009 5:57 PM Pablo Cardellino wrote:

Hi, Jay,

Thanks, your explanation will be very useful. One more question: if I use this object in Word 2003, should the macro run successfully under both Word 2003 and Word 2000?

Regards
--
Pablo Cardellino
Florianópolis, SC
Brazil

On Wednesday, February 04, 2009 8:25 PM Jay Freedman wrote:

That's correct, there is no difference in the way any of the VBA-enabled applications work with external COM objects. (Word 95 was the last of the WordBasic-using versions.)

On Thursday, February 05, 2009 4:00 AM Larry wrote:

Princess, to protect significant local formatting, I've used a macro that surrounds all italic text with <i>...</i>, all bold with <b>...</b>, etc. Such an approach will surely complicate your regex, but perhaps you can come up with something more subtle that will be seen by your regex as just garden-variety text but which a later macro will be able to recognise and recast back to italic, bold, etc.

Also, I dredged up this snippet from one of my macros (I know I wrote it, I just don't remember much about it):

    Dim rx As RegExp
    ' Replacement: the ampersand wrapped in a CDATA section, then the captured rest of the entity
    strReplace = "<![CDATA[&]]>$1"
    Set rx = New RegExp
    With rx
        ' An ampersand followed by an entity name and a semicolon
        .Pattern = "&([A-Z][A-Z0-9._\-]*;)"
        .IgnoreCase = True
        .Global = True
    End With
    wraptext = rx.Replace(wraptext, strReplace)

I referenced Microsoft VBScript Regular Expressions 5.5 to get the RegExp class. Is this just what you're already doing?

On Sunday, February 28, 2010 1:54 AM Bob Reardon wrote:
Try OpenOffice

Have you tried a different word processor, such as OpenOffice?
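
For anyone who wants to try the pattern from the original post outside Word, here is a minimal sketch in Python rather than VBA (Python's re module accepts the pattern as written). It only locates the section numbers and reports their character offsets, then shows the marker step on a plain string. The function names, and the choice of ¡ on the left and ¿ on the right, are assumptions made for illustration and do not come from the thread.

```python
import re

# The pattern from the original post, joined into a single expression.
# "\." matches a literal period; an unescaped "." would match any character.
PATTERN = re.compile(
    r"(section|subsection|paragraph|subparagraph)(s)?"
    r"[\x20\xA0]+([1-9][0-9]{0,2})(\.[1-9][0-9]{0,2})?"
)


def number_spans(text):
    """Return (start, end) offsets of each matched number, keyword excluded."""
    spans = []
    for match in PATTERN.finditer(text):
        start = match.start(3)
        # Group 4 is the optional decimal part, e.g. ".12" in "62.12".
        end = match.end(4) if match.group(4) is not None else match.end(3)
        spans.append((start, end))
    return spans


def mark_numbers(text, left="\u00a1", right="\u00bf"):
    """Wrap each matched number in marker characters (hypothetical helper)."""
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end in reversed(number_spans(text)):
        text = text[:start] + left + text[start:end] + right + text[end:]
    return text


if __name__ == "__main__":
    sample = "See section 62.12 and paragraph\u00a023(2)(b)."
    print(number_spans(sample))
    print(mark_numbers(sample))
```

Working from offsets instead of a replaced string is one possible way around the formatting loss Jay describes, since the text itself is never rewritten and only the located spans would need a character style. How string offsets line up with positions in a Word document is not covered here and would need checking in Word itself.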