From: Al Dunbar on


"James" <webmasterhub.net(a)gmail.com> wrote in message
news:cc81aa22-8549-43a8-ac8a-0d96a2bd6314(a)l12g2000prg.googlegroups.com...
> On Mar 5, 1:55 pm, "mayayana" <mayay...(a)nospam.invalid> wrote:
>> The method below must certainly be a better approach to the function
>> linked from this thread, or suggested by Al. What do you think? Also,
>> is there a better way to incorporate the contextual tests for each
>> illegal character in the string?

<snip>

>
> Thanks Mayayana,
>
> The illegal characters are being removed or replaced as expected. I
> am using a regular expression with the replace function to remove all
> html tags except for "a" tags (hyperlinks). I am then removing all "a"
> tags so that only the href value is left, which is placed after the
> anchor text in brackets.
>
> In the next step I am using the string clean function from the linked
> article (now modified to include suggestions in this thread) to remove
> all special characters from the string except when they are a
> component of a URL.
>
> The final step, which I am currently working on, is to parse the
> cleaned string to replace URLs with the internal redirect. It is
> working as expected, but there are some cases where URLs are not
> followed by a space, depending on the context in the original string.
> The problem is that there is currently no consistent way to find the
> end of each URL. I am working toward adjusting the function so that
> all URLs are contained in square brackets [] once processed by the
> string clean function, so that they can be found easily when parsing
> to update the URLs.

So I am curious. What was the purpose of your initial post? To get some
feedback on a script you are trying to develop? Or to advertise a site
containing expertly developed code? Or to get feedback on a site purportedly
containing expertly developed code?

/Al


From: James on
Hi Al, Thanks for your wise words. The reason for using the function
in this case is not for filenames, although it was written for this
purpose. Your method using the replace function will not work at all
for what I am trying to achieve. If you read the response to your
question, you will see that I agreed with you that the replace
method would be more suitable if all illegal characters are being
processed in the same way (remove all / replace all occurrences with
the same char). As I am removing characters from the text that are
not a component of a URL, the replace method in your function would
not be suitable, as it doesn't allow me to test the characters
surrounding an illegal character.

> You cannot compare my function as written with your function as modified to
> solve some new problem.

There was no comparison with "some new problem" and your function. I
acknowledged that, in the context of the linked article and in response
to your intelligent rhetorical question, your method would be
better. BUT, in the context of the solution I am working towards, yours
would not be suitable, which is why I needed to explain the scenario
in more detail.

> Regardless, another knock against your function as posted, if you are
> interested in objective criticism, is that it does not fully document
> itself. The nature of an "illegal character" is somewhat inferred, but not
> fully explained. If the goal is to convert a valid path to a string that
> could be used as a filename, here are a few quirks you appear not to have
> addressed:

The term "illegal characters" is used because that is what the article
and function were originally written for: removing characters that are
illegal in filenames. This doesn't mean that the function can only
ever be used to remove characters in filenames. I am not using it for
filenames at all in this case, which makes most of what you have said
irrelevant. Thanks for pointing out this highly important fact.

Sorry that you seem to have gotten your knickers in a knot. If you
are just looking for an argument, then you should find another
community to abuse.

James
From: mayayana on
>> I haven't tested the possibilities.
>
> I strongly suspect that the variant thing will
> make most vbscript code less
> efficient than a compiled language, and that
> it might cause the tokenized
> approach to be less efficient than it might be expected to be.
>

There's not much sense in talking about it
if we're all just going to speculate, so I tried
it out. I think you're clearly right. Replace bogs
down in compiled code, but the reverse is the
case with VBS. And a different-length replacement
string doesn't seem to affect the results to
speak of.

While tokenizing provides a very nice way to do a
very complex operation on a string, it doesn't come
close to Replace for speed.

I tried your function, my numeric tokenizer, and
a tokenizer that left each character as a string.
Testing a few large HTML files I found that the
numeric tokenizer was slightly faster than the
string tokenizer, but the Replace method was
about 10 times as fast.

Dim Arg, FSO, TS, s1, i1, i2, s2
Arg = WScript.arguments(0)

Set FSO = CreateObject("Scripting.FileSystemObject")
Set TS = FSO.OpenTextFile(Arg, 1)
s1 = TS.ReadAll
TS.Close
Set TS = Nothing

i1 = timer
s2 = CleanTok(s1)
i2 = timer
MsgBox "Time for tokenize: " & (i2 - i1) * 1000 & " ms"

i1 = timer
s2 = CleanTokS(s1)
i2 = timer
MsgBox "Time for tokenizeS: " & (i2 - i1) * 1000 & " ms"


i1 = timer
s2 = CleanRep(s1)
i2 = timer
MsgBox "Time for replace: " & (i2 - i1) * 1000 & " ms"

Set FSO = nothing

' Replace-based version: one Replace call per illegal character.
Function CleanRep(strtoclean)
    Dim strtemp, badchars, badchar, goodchar
    strtemp = strtoclean
    badchars = Array("?", "/", "\", ":", "*", """", "<", ">", ",", "&", _
                     "#", "~", "%", "{", "}", "+", "_", ".")
    For Each badchar In badchars
        Select Case badchar
            Case "&": goodchar = " and "
            Case ":": goodchar = "-"
            Case Else: goodchar = " "
        End Select
        strtemp = Replace(strtemp, badchar, goodchar)
    Next
    CleanRep = strtemp
End Function

' String tokenizer: one array element per character, compared as strings.
Function CleanTokS(sIn)
    Dim i2, Char, A1()
    ReDim A1(Len(sIn) - 1)
    For i2 = 1 To Len(sIn)
        Char = Mid(sIn, i2, 1)
        Select Case Char
            Case "?", "/", "\", ":", "*", """", "<", ">", ",", "&", "#", "~", _
                 "%", "{", "}", "+", "_", "."
                A1(i2 - 1) = "-"
            Case Else
                A1(i2 - 1) = Char
        End Select
    Next
    CleanTokS = Join(A1, "")
End Function

' Numeric tokenizer: same idea, but compares character codes.
Function CleanTok(sIn)
    Dim i2, iChar, A1()
    ReDim A1(Len(sIn) - 1)
    For i2 = 1 To Len(sIn)
        iChar = Asc(Mid(sIn, i2, 1))
        Select Case iChar
            ' Codes for the same characters as in CleanTokS, in the same order
            Case 63, 47, 92, 58, 42, 34, 60, 62, 44, 38, 35, 126, _
                 37, 123, 125, 43, 95, 46
                A1(i2 - 1) = "-"
            Case Else
                A1(i2 - 1) = Chr(iChar)
        End Select
    Next
    CleanTok = Join(A1, "")
End Function
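
A single Timer reading per function can be noisy, so the comparison
above may be easier to reproduce if each function is called several
times and the result averaged. The following is only a sketch of that
idea: AverageMs and UpperOnly are hypothetical names that do not appear
in the script above, and UpperOnly is just a stand-in so that the
sketch runs on its own.

```vbscript
' Hypothetical helper, not part of the script above: returns the average
' time in milliseconds for one call of the function named funcName.
Function AverageMs(funcName, sIn, runs)
    Dim f, i, t0, elapsed, sOut
    Set f = GetRef(funcName)          ' look the function up by name
    t0 = Timer
    For i = 1 To runs
        sOut = f(sIn)
    Next
    elapsed = Timer - t0
    ' Timer counts seconds since midnight, so correct for a rollover.
    If elapsed < 0 Then elapsed = elapsed + 86400
    AverageMs = elapsed * 1000 / runs
End Function

' Stand-in function so that the sketch is self-contained.
Function UpperOnly(sIn)
    UpperOnly = UCase(sIn)
End Function

WScript.Echo "Average: " & AverageMs("UpperOnly", String(100000, "a"), 20) & " ms"
```

With the functions above in the same file, the same helper could be
called as AverageMs("CleanRep", s1, 20), and likewise for CleanTok and
CleanTokS.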



From: Al Dunbar on


"James" <webmasterhub.net(a)gmail.com> wrote in message
news:1a1afd8b-2ac6-459a-8be9-f930469d4675(a)g8g2000pri.googlegroups.com...
> Hi Al, Thanks for your wise words. The reason for using the function
> in this case is not for filenames, although it was written for this
> purpose. Your method using the replace function will not work at all
> for what I am trying to achieve. If you read the response to your
> question, you will see that I agreed with you that the replace
> method would be more suitable if all illegal characters are being
> processed in the same way (remove all / replace all occurrences with
> the same char). As I am removing characters from the text that are
> not a component of a URL, the replace method in your function would
> not be suitable, as it doesn't allow me to test the characters
> surrounding an illegal character.

I think we are talking at cross-purposes here. I have been comparing my
replace-based version of your "clean" function with your version. I have not
been saying that one should use replace or that it can be used in every
situation. All I have been saying is that if you have two functions that
produce identical results, the better choice is usually the simpler of the
two.

I misread you as representing your "clean" function as one that you were
making available for others to use, as-is, as an example of a well-written
function. I did not anticipate that this thread would evolve into a
discussion of an application for which neither version of the function would
suffice, but one that would need to be adapted.

>> You cannot compare my function as written with your function as modified
>> to
>> solve some new problem.
>
> There was no comparison with "some new problem" and your function.

Thanks for putting me straight on that. This goes to my upthread comment
about talking at cross-purposes.

> I acknowledged that, in the context of the linked article and in
> response to your intelligent rhetorical question, your method would be
> better. BUT, in the context of the solution I am working towards, yours
> would not be suitable, which is why I needed to explain the scenario
> in more detail.

I never suggested that my version of your function would do anything
different than it does. But at least I think I am starting to understand
where you are coming from...

>> Regardless, another knock against your function as posted, if you are
>> interested in objective criticism, is that it does not fully document
>> itself. The nature of an "illegal character" is somewhat inferred, but
>> not
>> fully explained. If the goal is to convert a valid path to a string that
>> could be used as a filename, here are a few quirks you appear not to have
>> addressed:
>
> The term "illegal characters" is used because that is what the article
> and function were originally written for: removing characters that are
> illegal in filenames. This doesn't mean that the function can only
> ever be used to remove characters in filenames. I am not using it for
> filenames at all in this case, which makes most of what you have said
> irrelevant. Thanks for pointing out this highly important fact.

Not so important a fact, just a comment made with constructive intent on the
assumption that you were, indeed, looking for comment.

> Sorry that you seem to have gotten your knickers in a knot. If you
> are just looking for an argument, then you should find another
> community to abuse.

If my knickers were in a knot over this teapot tempest (which they aren't)
that would be my fault, not yours. I apologize for seeming to take an
abusive approach here, as that was truly not my intent.

/Al


From: James on
On Mar 6, 2:12 am, "mayayana" <mayay...(a)nospam.invalid> wrote:
<snip>

Thanks mayayana, that has cleared things up a lot.

I have been trying to achieve the same thing using regular expressions
which seem to have similar speeds to the Replace function, but are not
always consistent. I think this could be quite a good method, as it
avoids using the loop for each of the characters being removed or
replaced, and I should be able to incorporate the conditions required
to only remove/replace characters if not a component of a url.


Function CleanRepReg(strtoclean)
    Dim objRegExp, strOutput

    Set objRegExp = New RegExp
    objRegExp.IgnoreCase = True
    objRegExp.Global = True

    ' Replace every illegal character with "-"
    objRegExp.Pattern = "(\?|\*|\""|,|\\|<|>|&|#|~|%|{|}|\+|_|\.|@|\||:|/)"
    strOutput = objRegExp.Replace(strtoclean, "-")

    ' Collapse runs of "-" into a single "-"
    objRegExp.Pattern = "-+"
    strOutput = objRegExp.Replace(strOutput, "-")

    CleanRepReg = strOutput
End Function

Note that this also uses a second reg replace to remove duplicate "-"
characters once the initial replace method has been called.

I am having some trouble trying to derive a regular expression that
will remove any ":" or "/" characters from the text that are not part
of a URL. I am able to remove them if they are part of a URL, using
"http:(.)*" or similar, but nothing happens if I try to do the reverse
using the "not" operator ( ! ) in the regular expression.

Is this possible using the replace method of a regular expression
object in VBScript?

Thanks

James
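
One possible way to approach the question above, offered as a sketch
rather than a tested solution: instead of negating the URL pattern,
find the URLs with Execute, copy them through unchanged, and clean only
the text between them. The VBScript RegExp object supports lookahead
such as (?!...) but not lookbehind, which makes a pure "not inside a
URL" pattern awkward. The names CleanOutsideUrls and CleanChunk are
hypothetical, and the URL pattern assumes that a URL starts with
http:// or https:// and ends at the next whitespace or square bracket,
so punctuation that directly follows a URL would be treated as part of
it.

```vbscript
' Hypothetical helper: replaces ":" and "/" with "-" everywhere except
' inside URLs.
Function CleanOutsideUrls(sIn)
    Dim re, matches, m, pos, sOut
    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True
    ' Assumption: a URL runs from http:// or https:// to the next
    ' whitespace character or square bracket.
    re.Pattern = "https?://[^\s\[\]]+"

    Set matches = re.Execute(sIn)
    pos = 1        ' 1-based position of the next unprocessed character
    sOut = ""
    For Each m In matches
        ' Clean the text between the previous URL and this one
        ' (FirstIndex is 0-based, Mid is 1-based).
        sOut = sOut & CleanChunk(Mid(sIn, pos, m.FirstIndex + 1 - pos))
        ' Copy the URL itself through untouched.
        sOut = sOut & m.Value
        pos = m.FirstIndex + 1 + m.Length
    Next
    ' Clean whatever follows the last URL.
    CleanOutsideUrls = sOut & CleanChunk(Mid(sIn, pos))
End Function

' Hypothetical helper: the per-chunk cleaning step. Any of the clean
' functions in this thread could be used here instead.
Function CleanChunk(s)
    CleanChunk = Replace(Replace(s, ":", "-"), "/", "-")
End Function

WScript.Echo CleanOutsideUrls("Ratio 1/2: see http://a.com/x now")
' prints: Ratio 1-2- see http://a.com/x now
```

If the URLs are wrapped in square brackets first, as described earlier
in the thread, the closing bracket marks the end of each URL and the
same loop applies.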