From: mayayana on 4 Mar 2010 09:59

This looks like some kind of advertisement for a blog, but it's an
interesting question. In compiled VB both of the foregoing methods would
be extremely slow on large strings. The webpage sample is allocating a
vast number of strings to do its job. As the strings get bigger it would
slow to a crawl. The Replace function looks much better to me, but it's
also fairly slow. (Replace itself is slow.)

Probably none of that matters if the function is only being used for
filename strings of 20 or so characters. And it's not easy to optimize
for speed in VBS anyway. But personally I'd still much prefer your
Replace loop. I don't see the sense of writing a highly inefficient
Replace method in VBS when the scripting runtime can do it internally.

But in general, why not tokenize? In compiled code that should be by far
the fastest, with much greater speed achieved if the characters can be
treated as numbers in an array, so that the operation is not allocating
new strings or deciphering the Chr value of each stored numeric value of
the string. In VBS, I don't know whether treating characters as numbers
will help, since it's still a variant that has to be "parsed". I haven't
tested the possibilities. But I'm using numeric conversion below. I
figured that it should be a little faster than having the function do a
string comparison. (In a Select Case where the character is not an
"illegal", there would be 20-30 string comparisons happening if one uses
the string version.)

Another advantage of tokenizing is flexibility. There can be dozens of
Case declares with very little cost.

' Note: I just wrote this as an "air code" sample.
' I didn't bother to get all of the ASCII values since
' it's just a demo.
Function Clean(sIn)
    Dim i2, iChar, A1()

    ReDim A1(Len(sIn) - 1)
    For i2 = 1 To Len(sIn)
        iChar = Asc(Mid(sIn, i2, 1))
        Select Case iChar
            Case 63, 47, 92, 58, 42, 60, 62, 44, 46, 43, 126
                A1(i2 - 1) = "-"
            Case Else
                A1(i2 - 1) = Chr(iChar)
        End Select
    Next
    Clean = Join(A1, "")
End Function
From: James on 4 Mar 2010 18:41

On Mar 5, 1:59 am, "mayayana" <mayay...(a)nospam.invalid> wrote:
> But in general, why not tokenize? In compiled
> code that should be by far the fastest ...
>
> ' Note: I just wrote this as an "air code" sample.
> ' I didn't bother to get all of the ascii values since
> ' it's just a demo.
Hi Mayayana,

As the "air code" sample of your method parses the string character by
character, I suspect that a combination of your method and the function
provided should allow characters to be replaced, taking into account the
context of each illegal character.

I am using the method to clean a plain text string that may or may not
contain URLs. If there are URLs present in the string, they are later
replaced with an internal URL with parameters pointing to a logging
script that logs and forwards the request to the original URL. The
cleaned string is also used to generate a set of keywords and keyphrases
from the text supplied.

I have based the code below on the "air code" demo, and it has likewise
not been tested. I have incorporated the contextual tests to only
remove/replace some characters if they are not in a specific context
(using a URL as an example).

The method below must certainly be a better approach than the function
linked from this thread, or the one suggested by Al. What do you think?
Also, is there a better way to incorporate the contextual tests for each
illegal character in the string?
Thanks,
James

-------------------------

Function Clean(sIn)
    Dim i2, iChar, rChars, rChar, lChar, A1()

    ReDim A1(Len(sIn) - 1)
    For i2 = 1 To Len(sIn)
        iChar = Asc(Mid(sIn, i2, 1))
        Select Case iChar
            Case 58  ' ":" - keep only when it starts "://"
                rChars = Mid(sIn, i2 + 1, 2)
                If rChars = "//" Then
                    A1(i2 - 1) = Chr(iChar)
                Else
                    A1(i2 - 1) = ""
                End If
            Case 47  ' "/" - keep only when part of "//"
                rChar = 0
                lChar = 0
                If i2 < Len(sIn) Then rChar = Asc(Mid(sIn, i2 + 1, 1))
                If i2 > 1 Then lChar = Asc(Mid(sIn, i2 - 1, 1))
                If rChar = 47 Or lChar = 47 Then
                    A1(i2 - 1) = Chr(iChar)
                Else
                    A1(i2 - 1) = "-"
                End If
            Case 63, 92, 42, 60, 62
                A1(i2 - 1) = "-"
            Case 44, 46, 43, 126
                A1(i2 - 1) = ""
            Case Else
                A1(i2 - 1) = Chr(iChar)
        End Select
    Next
    Clean = Join(A1, "")
End Function
From: mayayana on 4 Mar 2010 21:55

> The method below must certainly be a better approach than the function
> linked from this thread, or the one suggested by Al. What do you
> think? Also, is there a better way to incorporate the contextual tests
> for each illegal character in the string?

I think that's pretty much what I meant in saying it's flexible. There's
no limit, really. One could even call separate functions from within the
Select Case.

Parsing URLs sounds tricky, but it can be done. For instance, you could
check each ":" to see if it's part of "http://", then get the whole URL
and write your edited URL to the array. You'd just have to find the end
of the URL, calculate the offset of the start and end characters, and
keep track of how many characters you've actually written to the array.
With edits involved you might need to use a bigger array and then ReDim
Preserve it at the end before the Join call.
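The ReDim Preserve idea might be sketched like this (untested "air code"
in the spirit of the thread; the function name and the doubling factor
are purely illustrative):

```vbscript
' Allocate more slots than there are input characters, count how many
' are actually written, then trim the array to size before the Join.
Function CleanGrow(sIn)
    Dim i2, iWritten, A1()

    If Len(sIn) = 0 Then Exit Function
    ReDim A1(Len(sIn) * 2 - 1)  ' room for edits that add characters
    iWritten = 0
    For i2 = 1 To Len(sIn)
        ' ... the Select Case edits would go here, writing zero or
        ' more characters per input character and bumping iWritten ...
        A1(iWritten) = Mid(sIn, i2, 1)
        iWritten = iWritten + 1
    Next
    ReDim Preserve A1(iWritten - 1)  ' trim the unused slots
    CleanGrow = Join(A1, "")
End Function
```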
From: James on 4 Mar 2010 23:23

On Mar 5, 1:55 pm, "mayayana" <mayay...(a)nospam.invalid> wrote:
> Parsing URLs sounds tricky, but it can be done. For instance, you
> could check each ":" to see if it's part of "http://", then get the
> whole URL and write your edited URL to the array. ... With edits
> involved you might need to use a bigger array and then Redim
> Preserve it at the end before the Join call.

Thanks Mayayana,

The illegal characters are being removed or replaced as expected.

I am using a regular expression with the Replace function to remove all
HTML tags except for "a" tags (hyperlinks). I am then removing all "a"
tags so that only the href value is left, which is placed after the
anchor text in brackets.
As the next step, I am using the string clean function from the linked
article (now modified to include suggestions in this thread) to remove
all special characters from the string except when they are part of a
URL.

The final step, which I am currently working on, is to parse the cleaned
string to replace URLs with the internal redirect. It is working as
expected, but there are some cases where URLs are not followed by a
space, depending on the context in the original string. The problem is
that there isn't currently a consistent method to find the end of each
URL. I am working toward adjusting the function so that all URLs are
contained in square brackets [] once processed by the string clean
function, so that they can be found easily when parsing to update the
URLs.

I am replacing all special characters with a space, then re-parsing the
string to remove double (or more) spaces between words / URLs. This
works most of the time, but as I am not removing "." chars (ASCII 46), a
URL may end up with an additional "." at the end (http://address.com.).
To prevent this, I am replacing all "." with " ." before parsing URLs,
to allow URLs to be recognised consistently. After parsing and
converting URLs, I then replace any occurrences of " ." with the
original "."

This seems to work, but I am not sure that it is the best way to do
this, as the same string is parsed a number of times before the desired
results are achieved.

The string clean function works well using the tokenizing method. Thanks
again for your suggestion.

James
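In isolation, that dot-padding step might look something like the
following (illustrative and untested; sText stands in for whatever
variable holds the cleaned string):

```vbscript
' Detach "." from word/URL ends so a trailing dot isn't swallowed
' into a URL, rewrite the URLs, then put the dots back.
sText = Replace(sText, ".", " .")
' ... find URLs here and swap in the internal redirect links ...
sText = Replace(sText, " .", ".")

' Collapse any runs of spaces left over from removed characters.
Do While InStr(sText, "  ") > 0
    sText = Replace(sText, "  ", " ")
Loop
```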
From: mayayana on 5 Mar 2010 00:27
> This seems to work, but I am not sure that it is the best way to do
> this, as the same string is parsed a number of times before the
> desired results are achieved.

I think if it were me I'd put it *all* in the tokenizer. For instance,
for "<" you could do something like:

Case 60
    If UCase(Mid(sIn, i2 + 1, 1)) = "A" Then
        ' This is an anchor tag, so parse it.
    Else
        ' Drop out all other tags.
        Do
            i2 = i2 + 1
            If i2 > Len(sIn) Then Exit Do
            If Mid(sIn, i2, 1) = ">" Then Exit Do
        Loop
    End If

One note with that: you'd want to use Do/Loop for the main loop so that
you can change the value of i2. The code above would go back to the
start of the main loop and begin processing the next character after the
end of the tag. My original code used:

For i2 = .....
Next

I guess it all gets down to a matter of personal preference at some
point, though. You're the one who's going to have to maintain your
script. :)
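A skeleton of that Do/Loop restructuring might look like the following
(untested "air code"; it builds the output by concatenation rather than
the array approach used earlier in the thread, purely to keep the sketch
short):

```vbscript
Function CleanTags(sIn)
    Dim i2, iChar, sOut

    sOut = ""
    i2 = 1
    Do While i2 <= Len(sIn)
        iChar = Asc(Mid(sIn, i2, 1))
        Select Case iChar
            Case 60  ' "<" - skip ahead to the matching ">"
                Do While i2 <= Len(sIn)
                    If Mid(sIn, i2, 1) = ">" Then Exit Do
                    i2 = i2 + 1
                Loop
            Case Else
                sOut = sOut & Chr(iChar)
        End Select
        i2 = i2 + 1  ' safe to adjust i2 because the main loop is Do/Loop
    Loop
    CleanTags = sOut
End Function
```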