From: mayayana on
This looks like some kind of advertisement
for a blog, but it's an interesting question.
In compiled VB both of the foregoing methods
would be extremely slow on large strings.
The webpage sample is allocating a vast
number of strings to do its job. As the strings
get bigger it would slow to a crawl. The Replace
function looks much better to me, but it's also
fairly slow. (Replace itself is slow.)

Probably none of that matters if the function
is only being used for filename strings of 20 or
so characters. And it's not easy to optimize for
speed in VBS anyway. But personally I'd still much
prefer your Replace loop. I don't see the sense of
writing a highly inefficient Replace method in
VBS when the scripting runtime can do it internally.
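
For reference, here's the kind of Replace loop I mean.
This is just a guess at the shape of yours, and the
character list is only an assumption:

' A sketch of the Replace-loop approach - the character
' list here is an assumption, not a complete set:
Function CleanWithReplace(sIn)
   Dim aBad, i
   aBad = Array("?", "/", "\", ":", "*", "<", ">", "|")
   For i = 0 To UBound(aBad)
      sIn = Replace(sIn, aBad(i), "-")
   Next
   CleanWithReplace = sIn
End Function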

But in general, why not tokenize? In compiled
code that should be by far the fastest, with much
greater speed achieved if the characters can be
treated as numbers in an array so that the operation
is not allocating new strings or deciphering the Chr
value of each stored numeric value of the string.
In VBS, I don't know whether treating characters as
numbers will help, since it's still a variant that has
to be "parsed". I haven't tested the possibilities.
But I'm using numeric conversion below. I figured that
it should be a little faster than having the function
need to do a string comparison. (In a Select Case
where the character is not an "illegal" there would be
20-30 string comparisons happening if one uses the
string version.)

Another advantage of tokenizing is flexibility.
There can be dozens of Case clauses with very
little cost.

' Note: I just wrote this as an "air code" sample.
' I didn't bother to get all of the ASCII values since
' it's just a demo.

Function Clean(sIn)
   Dim i2, iChar, A1()

   If Len(sIn) = 0 Then Clean = "": Exit Function ' ReDim A1(-1) would error
   ReDim A1(Len(sIn) - 1)
   For i2 = 1 To Len(sIn)
      iChar = Asc(Mid(sIn, i2, 1))
      Select Case iChar
         ' ? / \ : * < > , . + ~
         Case 63, 47, 92, 58, 42, 60, 62, 44, 46, 43, 126
            A1(i2 - 1) = "-"
         Case Else
            A1(i2 - 1) = Chr(iChar)
      End Select
   Next
   Clean = Join(A1, "")
End Function
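
If anyone wants to actually test the possibilities, a
rough harness along these lines would do. (This assumes
WSH for the Echo, and Timer only resolves to about 1/100
second, so use a big string.)

' Rough timing harness - air code as well:
Dim sTest, t, n
sTest = String(20000, "a") & "?:*<>"
t = Timer
For n = 1 To 100
   Clean sTest
Next
WScript.Echo "100 calls took " & (Timer - t) & " seconds"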


From: James on
On Mar 5, 1:59 am, "mayayana" <mayay...(a)nospam.invalid> wrote:
>    But in general, why not tokenize? [snip]
>
> ' Note: I just wrote this as an "air code" sample.
> ' I didn't bother to get all of the ASCII values since
> ' it's just a demo.
> [snip]

Hi Mayayana,

As the "air code" sample of your method parses the string character by
character, I suspect theat a combination of your method and the
function provided should allow characters to be replaced, taking into
account the context of each illegal character.

I am using the method to clean a plain text string that may or may not
contain URLs. If there are URLs present in the string, they are later
replaced with an internal URL with parameters pointing to a logging
script that logs and forwards the request to the original URL. The
cleaned string is also used to generate a set of keywords and
keyphrases from the text supplied.
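
The rewrite itself is simple string building, something
along these lines, where the script name and parameter
are just placeholders for my actual ones (and Escape is
only a rough URL-encoder):

' Placeholder names: "go.asp" and "u" stand in for the
' real logging script and its parameter.
Function ToRedirect(sUrl)
   ToRedirect = "/go.asp?u=" & Escape(sUrl)
End Function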

I have based the code below on the "air code" demo, and it has
likewise not been tested. I have incorporated the contextual tests to
only remove/replace some characters if they are not in a specific
context (using a URL as an example).

The method below must surely be a better approach than the function
linked from this thread, or the one suggested by Al. What do you think?
Also, is there a better way to incorporate the contextual tests for
each illegal character in the string?

Thanks

James

-------------------------

Function Clean(sIn)
   Dim i2, iChar, rChars, rChar, lChar, A1()

   If Len(sIn) = 0 Then Clean = "": Exit Function ' ReDim A1(-1) would error
   ReDim A1(Len(sIn) - 1)
   For i2 = 1 To Len(sIn)
      iChar = Asc(Mid(sIn, i2, 1))
      Select Case iChar
         Case 58 ' ":" - keep only when it starts "://"
            rChars = Mid(sIn, i2 + 1, 2)
            If rChars = "//" Then
               A1(i2 - 1) = Chr(iChar)
            End If ' otherwise the slot stays Empty and Join drops the ":"

         Case 47 ' "/" - keep only when next to another "/"
            ' Guard the neighbours: Mid(sIn, 0, 1) and Asc("") both error
            rChar = 0
            lChar = 0
            If i2 < Len(sIn) Then rChar = Asc(Mid(sIn, i2 + 1, 1))
            If i2 > 1 Then lChar = Asc(Mid(sIn, i2 - 1, 1))

            If rChar = 47 Or lChar = 47 Then
               A1(i2 - 1) = Chr(iChar)
            Else
               A1(i2 - 1) = "-"
            End If

         Case 63, 92, 42, 60, 62 ' ? \ * < >
            A1(i2 - 1) = "-"

         Case 44, 46, 43, 126 ' , . + ~
            A1(i2 - 1) = ""

         Case Else
            A1(i2 - 1) = Chr(iChar)
      End Select
   Next
   Clean = Join(A1, "")
End Function
From: mayayana on
>
The method below must surely be a better approach than the function
linked from this thread, or the one suggested by Al. What do you think?
Also, is there a better way to incorporate the contextual tests for
each illegal character in the string?
>

I think that's pretty much what I meant in saying
it's flexible. There's no limit, really. One could even
call separate functions from within the Select Case.

Parsing URLs sounds tricky, but it can be done. For instance, you
could check each ":" to see if it's part of "http://",
then get the whole URL and write your edited
URL to the array. You'd just have to find the end
of the URL, calculate the offset of the start and end
characters, and keep track of how many characters
you've actually written to the array. With edits involved
you might need to use a bigger array and then Redim
Preserve it at the end before the Join call.
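
In outline (air code, with iOut standing for a count of
the characters actually written):

' Air code: oversize the array, count what you write,
' then trim the array before the Join.
Dim iOut
ReDim A1(Len(sIn) * 2 - 1) ' roomy guess at the output size
iOut = 0
' ... the tokenizer writes each output character to
' A1(iOut) and increments iOut as it runs ...
If iOut > 0 Then
   ReDim Preserve A1(iOut - 1) ' cut off the unused slots
   Clean = Join(A1, "")
End If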



From: James on
On Mar 5, 1:55 pm, "mayayana" <mayay...(a)nospam.invalid> wrote:
>   I think that's pretty much what I meant in saying
> it's flexible. There's no limit, really. One could even
> call separate functions from within the Select Case.
> [snip]

Thanks Mayayana,

The illegal characters are being removed or replaced as expected. I
am using a regular expression with the replace function to remove all
HTML tags except for "a" tags (hyperlinks). I am then removing the "a"
tags themselves so that only the href value is left, which is placed
after the anchor text in brackets.
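
For reference, the tag-stripping pattern is along these
lines. This is a simplified sketch rather than my exact
pattern; the (?! ) lookahead does work in VBScript's
RegExp:

' Strip every tag except <a ...> and </a> - the lookahead
' rejects matches that open an anchor tag. sText is
' whatever variable holds the text being cleaned.
Dim re
Set re = New RegExp
re.Global = True
re.IgnoreCase = True
re.Pattern = "<(?!/?a\b)[^>]*>"
sText = re.Replace(sText, "")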

In the next step I am using the string clean function from the linked
article (now modified to include suggestions from this thread) to
remove all special characters from the string except when they are
part of a URL.

The final step, which I am currently working on, is to parse the
cleaned string to replace URLs with the internal redirect. It is
working as expected, but there are some cases where URLs are not
followed by a space, depending on the context in the original string.
The problem is that there isn't currently a consistent method to
find the end of each URL. I am working toward adjusting the function
so that all URLs are contained in square brackets [] once processed
by the string clean function, so that they can be found easily when
parsing to update the URLs.
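
With the URLs bracketed, the later pass can find them
without guessing where each one ends. Roughly (air code;
sClean holds the cleaned string):

' Walk the bracketed URLs with InStr:
Dim iStart, iEnd, sUrl
iStart = InStr(sClean, "[")
Do While iStart > 0
   iEnd = InStr(iStart, sClean, "]")
   If iEnd = 0 Then Exit Do
   sUrl = Mid(sClean, iStart + 1, iEnd - iStart - 1)
   ' ... swap sUrl for the internal redirect here ...
   iStart = InStr(iEnd + 1, sClean, "[")
Loop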

I am replacing all special characters with a space, then re-parsing
the string to remove double (or more) spaces between words / URLs.
This works most of the time, but as I am not removing "." chars (ASCII
#46), a URL may end up with an additional "." at the end
(http://address.com.). To prevent this, I am replacing all "." with
" ." before parsing URLs, to allow URLs to be recognised consistently.
After parsing and converting URLs, I then replace any occurrences of
" ." with the original "."

This seems to work, but I am not sure that it is the best way to do
this as the same string is parsed a number of times before the desired
results are achieved.
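
A single check at extraction time might avoid some of
those passes, for example trimming a trailing "." from
each URL at the point where it is pulled out (air code;
sUrl is the extracted URL):

' Trim one trailing "." instead of the " ." round trip:
If Right(sUrl, 1) = "." Then sUrl = Left(sUrl, Len(sUrl) - 1)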

The string clean function works well using the tokenizing method.
Thanks again for your suggestion.

James
From: mayayana on
>
This seems to work, but I am not sure that it is the best way to do
this as the same string is parsed a number of times before the desired
results are achieved.
>

I think if it were me I'd put it *all* in the tokenizer.
For instance, for "<" you could do something like:

Case 60
   If UCase(Mid(sIn, i2 + 1, 1)) = "A" Then
      'This is an anchor tag, so parse it.
   Else 'drop out all other tags.
      Do
         i2 = i2 + 1
         If i2 > Len(sIn) Then Exit Do ' guard against an unclosed tag
         If Mid(sIn, i2, 1) = ">" Then Exit Do
      Loop
   End If

One note with that: You'd want to use Do/Loop
for the main loop so that you can change the
value of i2. The code above would go back to the
start of the main loop and begin processing the next
character after the end of the tag. My original code
used: For i2 = ..... Next
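
A skeleton of that (air code; iOut is a separate write
index, since skipped tags mean the output is shorter
than the input):

i2 = 1
Do While i2 <= Len(sIn)
   iChar = Asc(Mid(sIn, i2, 1))
   Select Case iChar
      Case 60
         ' parse the anchor or skip the tag here, leaving
         ' i2 on the closing ">" as in the snippet above
      Case Else
         A1(iOut) = Chr(iChar)
         iOut = iOut + 1
   End Select
   i2 = i2 + 1
Loop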

I guess it all gets down to a matter of personal
preference at some point, though. You're the one
who's going to have to maintain your script. :)