From: Al Dunbar on 5 Mar 2010 01:01

"James" <webmasterhub.net(a)gmail.com> wrote in message
news:cc81aa22-8549-43a8-ac8a-0d96a2bd6314(a)l12g2000prg.googlegroups.com...
> On Mar 5, 1:55 pm, "mayayana" <mayay...(a)nospam.invalid> wrote:
>> The method below must certainly be a better approach to the function
>> linked from this thread, or suggested by Al. What do you think? Also,
>> is there a better way to incorporate the contextual tests for each
>> illegal character in the string?

<snip>

> Thanks Mayayana,
>
> The illegal characters are being removed or replaced as expected. I
> am using a regular expression with the replace function to remove all
> html tags except for "a" tags (hyperlinks). I am then removing all "a"
> tags so that only the href value is left, which is placed after the
> anchor text in brackets.
>
> In the next step I am using the string clean function from the linked
> article (now modified to include suggestions in this thread) to remove
> all special characters from the string except when they are a component
> of a URL.
>
> The final step, which I am currently working on, is to parse the
> cleaned string to replace urls with the internal redirect. It is
> working as expected, but there are some cases where URLs are not
> followed by a space, depending on the context in the original string.
> The problem is that there isn't currently a consistent method to
> find the end of each URL. I am working toward adjusting the function
> so that all URLs are contained in square brackets [] once processed
> using the string clean function, so that they can be found easily when
> parsing to update the URLs.

So I am curious. What was the purpose of your initial post? To get some
feedback on a script you are trying to develop? Or to advertise a site
containing expertly developed code? Or to get feedback on a site
purportedly containing expertly developed code?

/Al

From: James on 5 Mar 2010 01:29

Hi Al,

Thanks for your wise words. The reason for using the function in this
case is not for filenames, although it was written for this purpose.
Your method using the replace function will not work at all for what I
am trying to achieve. If you read the response to your question, you
will see that I agreed with you that the replace method would be more
suitable if all illegal characters are being processed in the same way
(remove all / replace all occurrences with the same char). As I am
removing characters from the text that are not a component of a url, the
replace method in your function would not be suitable, as it doesn't
allow me to test the characters surrounding an illegal character.

> You cannot compare my function as written with your function as
> modified to solve some new problem.

There was no comparison with "some new problem" and your function. I
acknowledged that, in the context of the linked article and in response
to your intelligent rhetorical question, your method would be better.
BUT, in the context of the solution I am working towards, yours would
not be suitable, which is why I needed to explain the scenario in more
detail.

> Regardless, another knock against your function as posted, if you are
> interested in objective criticism, is that it does not fully document
> itself. The nature of an "illegal character" is somewhat inferred, but
> not fully explained. If the goal is to convert a valid path to a string
> that could be used as a filename, here are a few quirks you appear not
> to have addressed:

The term "illegal characters" is used because that is what the article
and function were originally written for: removing characters that are
illegal in filenames. This doesn't mean that the function can only ever
be used to remove characters in filenames. I am not using it for
filenames at all in this case, which makes most of what you have said
irrelevant.

Thanks for pointing out this highly important fact.

Sorry that you seem to have gotten your knickers in a knot. If you are
just looking for an argument, then you should find another community to
abuse.

James

From: mayayana on 5 Mar 2010 10:12

>> I haven't tested the possibilities.
>
> I strongly suspect that the variant thing will make most vbscript code
> less efficient than a compiled language, and that it might cause the
> tokenized approach to be less efficient than it might be expected to be.
>

There's not much sense in talking about it if we're all just going to
speculate, so I tried it out. I think you're clearly right. Replace bogs
down in compiled code, but the reverse is the case with VBS. And a
different-length replacement string doesn't seem to affect the results
to speak of.

While the tokenizing provides a very nice way to do a very complex
operation on a string, it doesn't come close to Replace for speed.

I tried your function, my numeric tokenizer, and a tokenizer that left
each character as a string. Testing a few large HTML files, I found that
the numeric tokenizer was slightly faster than the string tokenizer, but
the Replace method was about 10 times as fast.

Dim Arg, FSO, TS, s1, i1, i2, s2
Arg = WScript.Arguments(0)   ' path of the file to test

Set FSO = CreateObject("Scripting.FileSystemObject")
Set TS = FSO.OpenTextFile(Arg, 1)   ' 1 = ForReading
s1 = TS.ReadAll
TS.Close
Set TS = Nothing

' Timer returns seconds since midnight, so the difference is scaled to ms.
i1 = Timer
s2 = CleanTok(s1)
i2 = Timer
MsgBox "Time for tokenize: " & (i2 - i1) * 1000 & " ms"

i1 = Timer
s2 = CleanTokS(s1)
i2 = Timer
MsgBox "Time for tokenizeS: " & (i2 - i1) * 1000 & " ms"

i1 = Timer
s2 = CleanRep(s1)
i2 = Timer
MsgBox "Time for replace: " & (i2 - i1) * 1000 & " ms"

Set FSO = Nothing

' Replace-based version: one Replace call per unwanted character.
Function CleanRep(strtoclean)
  Dim strtemp, badchars, badchar, goodchar
  strtemp = strtoclean
  badchars = Array("?", "/", "\", ":", "*", """", "<", ">", ",", "&", _
                   "#", "~", "%", "{", "}", "+", "_", ".")
  For Each badchar In badchars
    Select Case badchar
      Case "&": goodchar = " and "
      Case ":": goodchar = "-"
      Case Else: goodchar = " "
    End Select
    strtemp = Replace(strtemp, badchar, goodchar)
  Next
  CleanRep = strtemp
End Function

' String tokenizer: tests each character as a one-character string.
Function CleanTokS(sIn)
  Dim i2, Char, A1()
  ReDim A1(Len(sIn) - 1)
  For i2 = 1 To Len(sIn)
    Char = Mid(sIn, i2, 1)
    Select Case Char
      Case "?", "/", "\", ":", "*", """", "<", ">", ",", "&", "#", "~", _
           "%", "{", "}", "+", "_", "."
        A1(i2 - 1) = "-"
      Case Else
        A1(i2 - 1) = Char
    End Select
  Next
  CleanTokS = Join(A1, "")
End Function

' Numeric tokenizer: tests the character code of each character.
Function CleanTok(sIn)
  Dim i2, iChar, A1()
  ReDim A1(Len(sIn) - 1)
  For i2 = 1 To Len(sIn)
    iChar = Asc(Mid(sIn, i2, 1))
    Select Case iChar
      ' Codes for ? / \ : * " < > , & # ~ % { } + _ . (same set as above)
      Case 63, 47, 92, 58, 42, 34, 60, 62, 44, 38, 35, 126, 37, 123, 125, _
           43, 95, 46
        A1(i2 - 1) = "-"
      Case Else
        A1(i2 - 1) = Chr(iChar)
    End Select
  Next
  CleanTok = Join(A1, "")
End Function

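The numeric tokenizer depends on a hand-typed list of character codes staying in step with the character list used by the other two functions. A minimal sketch of one way to generate that list instead of typing it; CharCodes is a hypothetical helper name, not something from the posts above:

```vbscript
' Hypothetical helper (not part of the original script): returns the Asc
' codes of the given one-character strings as a comma-separated list that
' can be pasted into a Select Case statement.
Function CharCodes(chars)
  Dim ch, parts(), i
  ReDim parts(UBound(chars))
  i = 0
  For Each ch In chars
    parts(i) = CStr(Asc(ch))
    i = i + 1
  Next
  CharCodes = Join(parts, ", ")
End Function

' Usage: print the codes for the character set used in CleanRep/CleanTokS.
WScript.Echo CharCodes(Array("?", "/", "\", ":", "*", """", "<", ">", ",", _
                             "&", "#", "~", "%", "{", "}", "+", "_", "."))
```

Comparing the printed list against the Case line in CleanTok is a quick check that all three functions are cleaning the same set of characters before their timings are compared.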
From: Al Dunbar on 5 Mar 2010 22:26

"James" <webmasterhub.net(a)gmail.com> wrote in message
news:1a1afd8b-2ac6-459a-8be9-f930469d4675(a)g8g2000pri.googlegroups.com...
> Hi Al, Thanks for your wise words. The reason for using the function
> in this case is not for filenames, although it was written for this
> purpose. Your method using the replace function will not work at all
> for what I am trying to achieve. If you read the response to your
> question, you will see that I agreed with you that the replace method
> would be more suitable if all illegal characters are being processed
> in the same way (remove all / replace all occurrences with the same
> char). As I am removing characters from the text that are not a
> component of a url, the replace method in your function would not be
> suitable, as it doesn't allow me to test the characters surrounding an
> illegal character.

I think we are talking at cross-purposes here. I have been comparing my
replace-based version of your "clean" function with your version. I have
not been saying that one should use replace or that it can be used in
every situation. All I have been saying is that if you have two
functions that produce identical results, the better choice is usually
the simpler of the two.

I misread you as representing your "clean" function as one that you were
making available for others to use, as-is, as an example of a
well-written function. I did not anticipate that this thread would
evolve into a discussion of an application for which neither version of
the function would suffice, but one that would need to be adapted.

>> You cannot compare my function as written with your function as
>> modified to solve some new problem.
>
> There was no comparison with "some new problem" and your function.

Thanks for putting me straight on that. This goes to my upthread comment
about talking at cross-purposes.

> I acknowledged that, in the context of the linked article and in
> response to your intelligent rhetorical question, your method would be
> better. BUT, in the context of the solution I am working towards, yours
> would not be suitable, which is why I needed to explain the scenario
> in more detail.

I never suggested that my version of your function would do anything
different than it does. But at least I think I am starting to understand
where you are coming from...

>> Regardless, another knock against your function as posted, if you are
>> interested in objective criticism, is that it does not fully document
>> itself. The nature of an "illegal character" is somewhat inferred, but
>> not fully explained. If the goal is to convert a valid path to a
>> string that could be used as a filename, here are a few quirks you
>> appear not to have addressed:
>
> The term "illegal characters" is used because that is what the article
> and function were originally written for: removing characters that are
> illegal in filenames. This doesn't mean that the function can only
> ever be used to remove characters in filenames. I am not using it for
> filenames at all in this case, which makes most of what you have said
> irrelevant.
>
> Thanks for pointing out this highly important fact.

Not so important a fact, just a comment made with constructive intent on
the assumption that you were, indeed, looking for comment.

> Sorry that you seem to have gotten your knickers in a knot. If you are
> just looking for an argument, then you should find another community
> to abuse.

If my knickers were in a knot over this teapot tempest (which they
aren't), that would be my fault, not yours. I apologize for seeming to
take an abusive approach here, as that was truly not my intent.

/Al

From: James on 6 Mar 2010 23:04

On Mar 6, 2:12 am, "mayayana" <mayay...(a)nospam.invalid> wrote:
> >> I haven't tested the possibilities.
>
> > I strongly suspect that the variant thing will make most vbscript
> > code less efficient than a compiled language, and that it might cause
> > the tokenized approach to be less efficient than it might be expected
> > to be.
>
> There's not much sense in talking about it if we're all just going to
> speculate, so I tried it out. I think you're clearly right. Replace
> bogs down in compiled code, but the reverse is the case with VBS. And a
> different-length replacement string doesn't seem to affect the results
> to speak of.
>
> While the tokenizing provides a very nice way to do a very complex
> operation on a string, it doesn't come close compared to Replace.
>
> I tried your function, my numeric tokenizer, and a tokenizer that left
> each character as a string. Testing a few large HTML files I found that
> the numeric tokenizer was slightly faster than the string tokenizer,
> but the Replace method was about 10 times as fast.
>
> <snip code>

Thanks mayayana, that has cleared things up a lot. I have been trying to
achieve the same thing using regular expressions, which seem to have
similar speeds to the Replace function but are not always consistent.

I think this could be quite a good method, as it avoids looping over
each of the characters being removed or replaced, and I should be able
to incorporate the conditions required to only remove/replace characters
that are not a component of a url.

Function CleanRepReg(strtoclean)
  Dim strtemp, objRegExp, strOutput
  strtemp = strtoclean
  Set objRegExp = New RegExp
  objRegExp.IgnoreCase = True
  objRegExp.Global = True
  ' One character class covering every character to be replaced.
  objRegExp.Pattern = "[?*"",\\<>&#~%{}+_.@|:/]"
  strOutput = objRegExp.Replace(strtemp, "-")
  ' Second pass: collapse runs of "-" into a single "-".
  objRegExp.Pattern = "-+"
  strOutput = objRegExp.Replace(strOutput, "-")
  CleanRepReg = strOutput
End Function

Note that this also uses a second regex replace to collapse runs of "-"
characters once the initial replace has been done.

I am having some trouble trying to derive a regular expression that will
remove any ":" or "/" characters from the text that are not part of a
url. I am able to remove them if they are part of a URL ( "http: (.)*" )
or similar, but nothing happens if I try to do the reverse using the
"not" operator ( ! ) in the regular expression.

Is this possible using the replace method of a regular expression object
in VBScript?

Thanks

James
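
One way to approach the question above, offered as a minimal sketch under stated assumptions rather than a tested solution. As far as I know the VBScript RegExp object supports negative lookahead, written (?!...), but not lookbehind, so "not preceded by a URL" is hard to express in a single pattern. The sketch sidesteps that: it uses Execute to locate the URLs first, cleans only the text between them, and copies each URL through untouched, wrapped in square brackets as proposed earlier in the thread. The function name CleanOutsideUrls and the URL pattern (http or https, ending at whitespace, a quote, an angle bracket or a square bracket) are assumptions for illustration only.

```vbscript
' Hypothetical sketch: replace unwanted characters everywhere except
' inside URLs, and wrap each URL in square brackets.
Function CleanOutsideUrls(strIn)
  Dim reUrl, reBad, objMatch, lngPos, strOut

  Set reUrl = New RegExp
  reUrl.Global = True
  reUrl.IgnoreCase = True
  ' Assumed URL shape; adjust to the URLs that actually occur in the text.
  reUrl.Pattern = "https?://[^\s<>""\[\]]+"

  Set reBad = New RegExp
  reBad.Global = True
  ' A run of unwanted characters becomes a single "-".
  reBad.Pattern = "[?*"",\\<>&#~%{}+_.@|:/]+"

  lngPos = 1      ' 1-based position of the next unprocessed character
  strOut = ""
  For Each objMatch In reUrl.Execute(strIn)
    ' FirstIndex is zero-based: clean the text in front of this URL.
    strOut = strOut & reBad.Replace( _
             Mid(strIn, lngPos, objMatch.FirstIndex + 1 - lngPos), "-")
    ' Copy the URL through unchanged, in brackets.
    strOut = strOut & "[" & objMatch.Value & "]"
    lngPos = objMatch.FirstIndex + objMatch.Length + 1
  Next
  ' Clean whatever follows the last URL (or the whole string if none).
  strOut = strOut & reBad.Replace(Mid(strIn, lngPos), "-")

  CleanOutsideUrls = strOut
End Function
```

For example, CleanOutsideUrls("a:b http://x.com/y?z=1 c/d") should give "a-b [http://x.com/y?z=1] c-d". Punctuation that directly follows a URL without a space, such as a full stop at the end of a sentence, is treated as part of the URL by this pattern and would need handling separately.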