From: Al Dunbar on


"dsoutter" <webmasterhub.net(a)gmail.com> wrote in message
news:2456c1b9-460d-46dd-af7f-62620e277e83(a)l12g2000prg.googlegroups.com...
> On Mar 3, 3:16 pm, "Al Dunbar" <aland...(a)hotmail.com> wrote:
>> "dsoutter" <webmasterhub....(a)gmail.com> wrote in message

<snip>

>> Here is how I would code this function if I ever needed such a thing:

<snip>

>> IMHO, this has the same result but the logic is somewhat simpler. What
>> benefit would I get from switching from my version to yours?
>>
>> /Al
>
> Hi Al, the logic is simpler as you are using the replace() function to
> perform the string replace, where the function provided takes the left
> and right parts of a string, either side of an illegal character.

A nice analysis, and exactly my point. Thanks for making it for me.

> In
> many cases, your method would be more suitable mainly due to the
> simpler logic, especially when all instances of each character are to
> be processed in the same way.

True. But, as written, your function will also only process all instances of
each character in the same way. My method might therefore appear to be
better in all cases in which the functions, as written, could be used. If
you want to compare our methods when applied to a different problem space,
such as you describe here:

> As the method provided parses the string character by character, you
> should have greater control over the output when more complex
> operations need to be performed, such as removing or replacing a
> character only if it is within a specific context:

You cannot compare my function as written with your function as modified to
solve some new problem. A better comparison would be to compare your
modified function with a different function I might write to solve that
problem.

> Eg. replace "&" with " and " if padded with spaces or other specific
> character, or with a "+" if not
> "something & something else" would become "something and something
> else"
> "somethin&something else" would become "somethin+something else".
>
> Eg. replace ":" only if NOT part of a url:
>
> "the website is http://code-tips.com " would remain "the website is
> http://code-tips.com "
> "See Here: http://code-tips.com " would become "See Here
> http://code-tips.com "
>
> This would be achieved by either checking the previous 3-5 characters
> when a ":" is found to see if it is in the context of a url or not
> (http, https, ftp), or by checking whether the characters following the
> current ":" are "//", which would indicate that the colon is part of
> a url.

There might even be other ways to perform this kind of parsing...

> This functionality has not been included in the function provided, but
> would be easy to implement, as the string is incrementally parsed and
> manipulated using a numeric string position value relative to the
> current position/character in the string.

You seem to be proposing that simple functions be written in such a way that
they are more directly adaptable into more complex ones capable of more
complex operations. I disagree with this approach, UNLESS a function is
coded in such a way that it can be made to perform the more complex work
without first having to be modified to do so by calling it in a different
manner.

I'm not saying that you are wrong to do it your way, just that it may not be
the best approach for others to emulate.

> There may also be differences in performance between the two methods,
> as the function provided includes the code required to remove or
> replace each of the specified characters without calling the replace()
> function.

Yes, you avoid calling replace. But you do that by calling instr for each
possible bad character, plus left, mid, len, and two string
concatenations for each bad character actually present. If you are concerned
with the overhead of calling a built-in function, my method does that fewer
times.
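
To make the call counts concrete, here is a minimal sketch of the two styles
being compared. It is not the code of either poster (both listings were
snipped above). The names BadChars, CleanWithReplace and CleanWithInStr, the
list of characters, and the use of a space as the replacement are all
assumptions made for illustration. It is meant to be run under Windows Script
Host (cscript).

```vbscript
Option Explicit

' Assumed list of "illegal" characters; the real list was snipped from the thread.
Function BadChars()
    BadChars = Array(":", "\", "/", "?", "*", "<", ">", "|")
End Function

' Style 1: one built-in Replace call per bad character.
Function CleanWithReplace(sIn)
    Dim ch, sOut
    sOut = sIn
    For Each ch In BadChars()
        sOut = Replace(sOut, ch, " ")
    Next
    CleanWithReplace = sOut
End Function

' Style 2: InStr to locate each occurrence, then Left, Mid and two
' concatenations to rebuild the string around it.
Function CleanWithInStr(sIn)
    Dim ch, pos, sOut
    sOut = sIn
    For Each ch In BadChars()
        pos = InStr(sOut, ch)
        Do While pos > 0
            sOut = Left(sOut, pos - 1) & " " & Mid(sOut, pos + 1)
            pos = InStr(pos + 1, sOut, ch)   ' resume the search after the edit
        Loop
    Next
    CleanWithInStr = sOut
End Function

WScript.Echo CleanWithReplace("C::\temp")
WScript.Echo CleanWithInStr("C::\temp")
```

Both functions return the same string here. Style 1 makes one built-in call
per bad character. Style 2 makes at least one InStr call per bad character,
plus a further InStr, a Left, a Mid and two concatenations for every
occurrence it finds.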

> I suspect

suspect, but do not know...

> that the replace function uses a similar approach
> to replace the specified characters so any difference in performance
> would be minimal, unless parsing a large string value. I haven't yet
> tested this for performance differences.

I haven't tested either, however, the actual logic used by a built-in
function, while possibly logically identical to that of a function written
in vbscript, is more likely to be faster and more efficient. This is mainly
because the built-in functions are coded in a lower level language.

Regardless, no argument over ultimate relative efficiency can really be
resolved without rigorous testing. Since neither of us feels it important
enough to do that, we probably both are willing to accept some
inefficiencies, given that our functions each perform their intended tasks
perfectly! ;-)

Or do they? I haven't tested your code, but my reading of it suggests to me
that it makes unstated assumptions about the nature of the string it is
processing (does it, for example, presume that the string represents a valid
NTFS, UNC or URL path of some sort?).

If you wouldn't mind, try running your function against a string such as
"C::\". I suspect the result might be "C :\", a string containing an illegal
character. If so, you would have to either include an internal recursive
call, or call your function in a loop until the result no longer changed. Or
you would have to qualify your documentation to explain that it is intended
only to process valid path strings (or whatever the case actually is).
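
The "call it in a loop until the result no longer changes" option can be
sketched as follows. CleanOnce is a hypothetical stand-in for any cleaner that
misses repeated characters; it is not the function from the thread, whose
listing was snipped. CleanUntilStable is likewise an illustrative name.

```vbscript
Option Explicit

' Hypothetical stand-in for a single-pass cleaner: it replaces only the
' FIRST ":" it finds (Replace with start = 1, count = 1).
Function CleanOnce(sIn)
    CleanOnce = Replace(sIn, ":", " ", 1, 1)
End Function

' Re-apply the single-pass cleaner until its output stops changing.
Function CleanUntilStable(sIn)
    Dim sPrev, sCur
    sCur = sIn
    Do
        sPrev = sCur
        sCur = CleanOnce(sPrev)
    Loop Until sCur = sPrev
    CleanUntilStable = sCur
End Function

WScript.Echo CleanOnce("C::\")          ' one pass still leaves a ":" behind
WScript.Echo CleanUntilStable("C::\")   ' no ":" remains
```

The loop only terminates if the replacement text never contains one of the
characters being removed; otherwise every pass would produce a new change.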

Regardless, another knock against your function as posted, if you are
interested in objective criticism, is that it does not fully document
itself. The nature of an "illegal character" is somewhat inferred, but not
fully explained. If the goal is to convert a valid path to a string that
could be used as a filename, here are a few quirks you appear not to have
addressed:

non-uniqueness: Run your function (or mine, for that matter) on these two
different paths: "C:\documents and settings" and
"C:\documents\and\settings", and you get the same result: "C documents and
settings".

other filename invalidities: run it on one of those huge URL strings and you
might wind up with a filename that was actually too long for the file system
to handle.

the concept of adapting the function to do more comprehensive processing. If
that actually was the reason for your less simple approach, your audience is
not getting the benefit if you do not explain that.

the vagueness of the name of the function itself: clean? there's nothing
dirty here. Calling it Path2Filename might be a more accurate representation
of its purpose (or it might not - I could not tell the purpose from the code
itself without your additional explanation).
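
The non-uniqueness and length quirks are easy to demonstrate. In this minimal
sketch the mapping (":" removed, "\" turned into a space) is an assumption,
chosen only because it reproduces the result quoted above; PathToName and
MAX_NAME_LEN are illustrative names. The 255-character figure is the NTFS
limit for a single file name component.

```vbscript
Option Explicit

Const MAX_NAME_LEN = 255   ' NTFS limit for one file name component

' Assumed mapping: ":" is removed and "\" becomes a space.
Function PathToName(sPath)
    PathToName = Replace(Replace(sPath, ":", ""), "\", " ")
End Function

Dim a, b
a = PathToName("C:\documents and settings")
b = PathToName("C:\documents\and\settings")

' Two different paths, one resulting name.
If a = b Then WScript.Echo "Collision: both paths map to """ & a & """"

' A long URL or path could exceed what the file system accepts.
If Len(a) > MAX_NAME_LEN Then WScript.Echo "Name too long for the file system"
```

A caller that needs unique names would have to use a reversible mapping or
append something distinguishing; neither function discussed in this thread
attempts that.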

/Al


From: Al Dunbar on


"WebmasterHub.net" <webmasterhub.net(a)gmail.com> wrote in message
news:923bab94-9163-4786-b9f3-c3f283a97ff2(a)l24g2000prh.googlegroups.com...
> On Mar 3, 3:16 pm, "Al Dunbar" <aland...(a)hotmail.com> wrote:
>> "dsoutter" <webmasterhub....(a)gmail.com> wrote in message

<snip>


I already replied to your identical post from your alter ego ;-)

/Al


From: Al Dunbar on


"mayayana" <mayayana(a)nospam.invalid> wrote in message
news:eut$Rs6uKHA.732(a)TK2MSFTNGP06.phx.gbl...
> This looks like some kind of advertisement
> for a blog,

or of a web site purporting to demonstrate some level of expertise and
authority that some of us have yet to recognize as such...

> but it's an interesting question.
> In compiled VB both of the foregoing methods
> would be extremely slow on large strings.

Granted. But if limited to URLs, for example, they might not be extremely
huge.

> The webpage sample is allocating a vast
> number of strings to do its job. As the strings
> get bigger it would slow to a crawl. The Replace
> function looks much better to me, but it's also
> fairly slow. (Replace itself is slow.)

I do not dispute that, although I do not know the actual metrics. But for a
site dedicated to providing example vbscripts and a newsgroup dedicated to
the same language, a completely different approach (i.e. re-write in C, for
example) would generally be of no interest to those looking for vbscript
solutions.

> Probably none of that matters if the function
> is only being used for filename strings of 20+-
> characters. And it's not easy to optimize for
> speed in VBS anyway.

Exactly.

> But personally I'd still much
> prefer your Replace loop. I don't see the sense of
> writing a highly inefficient Replace method in
> VBS when the scripting runtime can do it internally.

Agreed. But the other issue with less simple code that cannot be discounted
is the greater effort required to develop it, debug it, and test it to
ensure it works in all cases.

> But in general, why not tokenize? In compiled
> code that should be by far the fastest, with much
> greater speed achieved if the characters can be
> treated as numbers in an array so that the operation
> is not allocating new strings or deciphering the Chr
> value of each stored numeric value of the string.
> In VBS, I don't know whether treating characters as
> numbers will help, since it's still a variant that has
> to be "parsed". I haven't tested the possibilities.

I strongly suspect that the variant thing will make most vbscript code less
efficient than a compiled language, and that it might cause the tokenized
approach to be less efficient than it might be expected to be.

<snip>

/Al


From: Al Dunbar on


"James" <webmasterhub.net(a)gmail.com> wrote in message
news:f4d5de01-3c8f-430a-8c8c-1fcbd78aa5df(a)t9g2000prh.googlegroups.com...
> On Mar 5, 1:59 am, "mayayana" <mayay...(a)nospam.invalid> wrote:
>> This looks like some kind of advertisement
>> for a blog, but it's an interesting question.

<snip>

> Hi Mayayana,
>
> As the "air code" sample of your method parses the string character by
> character, I suspect that a combination of your method and the
> function provided should allow characters to be replaced, taking into
> account the context of each illegal character.
>
> I am using the method to clean a plain text string that may or may not
> contain URLs. If there are URLs present in the string, they are later
> replaced with an internal url with parameters pointing to a logging
> script that logs and forwards the request to the original url. The
> cleaned string is also used to generate a set of keywords and
> keyphrases from the text supplied.

You see, that whole description is not inherent in the listing you have
posted of your clean function.

> I have based the code below from the "air code" demo, which has also
> not been tested. I have incorporated the contextual tests to only
> remove/replace some characters if they are not in a specific context
> (using a URL as an example).
>
> The method below must certainly be a better approach than the function
> linked from this thread, or the one suggested by Al.

It might indeed be better, but I don't see where this must certainly be so.
Your original function and my "simpler" version never even tried to do the
contextual bit, so saying code that was designed to do so is better is a bit
like saying a hammer is a better tool than a nailfile for nailing things
together.

> What do you think? Also,
> is there a better way to incorporate the contextual tests for each
> illegal character in the string?

My guess: yes, probably there is. I just find your code below even harder to
follow than the original clean function. But as implied previously, it seems
odd to have two functions doing two different things but having the same
name.

/Al

> Thanks
>
> James
>
> -------------------------
>
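> Function Clean(sIn)
>     Dim i2, iChar, A1(), rChars, rChar, lChar
>
>     ' An empty input has nothing to clean
>     If Len(sIn) = 0 Then
>         Clean = ""
>         Exit Function
>     End If
>
>     ReDim A1(Len(sIn) - 1)
>     For i2 = 1 To Len(sIn)
>         iChar = Asc(Mid(sIn, i2, 1))
>         Select Case iChar
>             Case 58 ' ":" is kept only when followed by "//"
>                 rChars = Mid(sIn, i2 + 1, 2)
>                 If rChars = "//" Then
>                     A1(i2 - 1) = Chr(iChar)
>                 Else
>                     A1(i2 - 1) = ""
>                 End If
>
>             Case 47 ' "/" is kept only when next to another "/"
>                 ' Compare characters, not Asc codes: Asc("") fails at the
>                 ' end of the string, and Mid with start 0 fails at i2 = 1
>                 rChar = Mid(sIn, i2 + 1, 1)
>                 lChar = ""
>                 If i2 > 1 Then lChar = Mid(sIn, i2 - 1, 1)
>
>                 If rChar = "/" Or lChar = "/" Then
>                     A1(i2 - 1) = Chr(iChar)
>                 Else
>                     A1(i2 - 1) = "-"
>                 End If
>
>             Case 63, 92, 42, 60, 62 ' ? \ * < >
>                 A1(i2 - 1) = "-"
>
>             Case 44, 46, 43, 126 ' , . + ~
>                 A1(i2 - 1) = ""
>
>             Case Else
>                 A1(i2 - 1) = Chr(iChar)
>         End Select
>     Next
>     Clean = Join(A1, "")
> End Function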

From: Al Dunbar on


"mayayana" <mayayana(a)nospam.invalid> wrote in message
news:OCgsS8AvKHA.4220(a)TK2MSFTNGP05.phx.gbl...
>>
> The method below must certainly be a better approach than the function
> linked from this thread, or the one suggested by Al. What do you think? Also,
> is there a better way to incorporate the contextual tests for each
> illegal character in the string?
>>
>
> I think that's pretty much what I meant in saying
> it's flexible. There's no limit, really. One could even
> call separate functions from within the Select Case.
>
> Parsing URLs
> sounds tricky, but it can be done. For instance, you
> could check each ":" to see if it's part of "http://",
> then get the whole URL and write your edited
> URL to the array. You'd just have to find the end
> of the URL, calculate the offset of the start and end
> characters, and keep track of how many characters
> you've actually written to the array. With edits involved
> you might need to use a bigger array and then Redim
> Preserve it at the end before the Join call.
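
The bookkeeping described above (a separate count of what has been written,
then ReDim Preserve before Join) can be sketched like this. ExpandAmpersands
and its replacement rules are hypothetical, chosen only to show an output
that differs in length from the input. Because each array slot may hold a
whole string rather than a single character, this variant never needs an
array larger than the input, which is a design choice of the sketch and not
something proposed in the thread.

```vbscript
Option Explicit

' Hypothetical example: "&" becomes " and ", commas and full stops are dropped.
Function ExpandAmpersands(sIn)
    Dim i, n, ch, parts()
    If Len(sIn) = 0 Then
        ExpandAmpersands = ""
        Exit Function
    End If

    ReDim parts(Len(sIn) - 1)   ' one slot per input character is the upper bound
    n = 0                       ' number of slots actually written
    For i = 1 To Len(sIn)
        ch = Mid(sIn, i, 1)
        Select Case ch
            Case "&"
                parts(n) = " and "   ' a slot may hold more than one character
                n = n + 1
            Case ",", "."
                ' dropped: nothing is written and n does not advance
            Case Else
                parts(n) = ch
                n = n + 1
        End Select
    Next

    If n = 0 Then
        ExpandAmpersands = ""
    Else
        ReDim Preserve parts(n - 1)   ' trim the unused slots before Join
        ExpandAmpersands = Join(parts, "")
    End If
End Function

WScript.Echo ExpandAmpersands("fish&chips, please.")
```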

In my opinion, the use of regular expressions seems more likely to be more
efficient than coding all the ifs, ands and buts in vbscript. But sorry, I'm
not a regular expression kind of guy.

/Al
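
For readers who want to try the regular-expression route, here is a minimal
sketch using the RegExp object built into VBScript. StripColons is an
illustrative name, not code from this thread, and it covers only the
"remove ':' unless it is followed by '//'" rule discussed earlier. Whether it
is actually faster than the hand-written loop has not been measured here.

```vbscript
Option Explicit

' Hypothetical helper: removes every ":" that is not part of "://".
Function StripColons(sIn)
    Dim re
    Set re = New RegExp
    re.Global = True          ' replace every match, not just the first
    re.Pattern = ":(?!//)"    ' a ":" that is NOT followed by "//"
    StripColons = re.Replace(sIn, "")
End Function

WScript.Echo StripColons("See Here: http://code-tips.com")
WScript.Echo StripColons("the website is http://code-tips.com")
```

The negative lookahead leaves the colon in "http://" alone, so the second
string comes back unchanged while the first loses only the colon after
"Here".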
