Eliminating duplicate lines - with a twist [VbScript]

Prev: ADSystemInfo and ComputerName
Next: shell.application throw error on asp file

From: Highlander on 21 Sep 2009 17:04

Hello all.

I've used this script before to eliminate duplicate lines in a text
file, or in an array:

http://blogs.technet.com/heyscriptingguy/archive/2005/04/13/how-can-i-eliminate-duplicate-names-in-a-text-file.aspx

Now I need to eliminate duplicate lines in a comma separated file. The
contents of the CSV file are the following:

Software Program One,3.0,9/21/2009
Software Program Two,1.0,9/21/2009
Software Program Two,
Software Program Three,2.0,9/21/2009
Software Program Four,1.0,9/21/2009
Software Program Five,1.0,9/21/2009
Software Program Five,,9/21/2009
Software Program Six,1.0,9/21/2009

I'm not comparing the entire line. Of the 3 fields in the CSV, I want
to find lines where the FIRST FIELD is a duplicate. So these lines
would qualify as duplicates:

Software Program Two,1.0,9/21/2009
Software Program Two,

Software Program Five,1.0,9/21/2009
Software Program Five,,9/21/2009

And then comparing the duplicates, I want to eliminate the line that
has the 2nd and/or the 3rd field empty; and retain the line which has
all 3 fields populated. So my end result would be the CSV looking like
the following:

Software Program One,3.0,9/21/2009
Software Program Two,1.0,9/21/2009
Software Program Three,2.0,9/21/2009
Software Program Four,1.0,9/21/2009
Software Program Five,1.0,9/21/2009
Software Program Six,1.0,9/21/2009

I've come up with a script that will parse a CSV and display the
values in the 3 fields, but I can't figure out the logic to add that
will compare and eliminate the duplicates in the manner I need:

Const ForReading = 1, ForWriting = 2, ForAppending = 8
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objDictionary = CreateObject("Scripting.Dictionary")
Dim sFile, sQueryResults
sFile = "File.txt"
Set objFile = objFSO.OpenTextFile (sFile,1)
sQueryResults = objFile.ReadAll
objFile.Close
GetArrayValues(sQueryResults)
Function GetArrayValues(sList)
Arr = Split(sList, vbCrlf)
For i = 0 To UBound(Arr)
sLine = Arr(i)
ArrFields = Split(sLine, ",")
count = -1
sFieldInfo = ""
For j = 0 To UBound(ArrFields)
count = count + 1
sField = ArrFields(j)
IF sField ="" Then sField = "EMPTY FIELD"
sFieldNum = "Array Field " & count
sFieldInfo = sFieldInfo & sFieldNum & ":" & _
vbTab & sField & vbCrlf
Next
Wscript.Echo vbCrlf & "Array:" & vbTab & vbTab & _
sLine & vbCrlf & "Array UBound:" & vbTab & _
UBound(ArrFields) & vbCrlf & sFieldInfo
count = -1
Next
End Function ' GetArrayValues
Set objFSO = nothing
Set objDictionary = nothing

How can accomplish what I need to?

Any help would be greatly appreciated. Thanks!

- Dave

From: Richard Mueller [MVP] on 21 Sep 2009 18:48

"Highlander" <tron9901(a)msn.com> wrote in message
news:09c67f4d-efdd-482c-88fc-9b26c49df255(a)g6g2000vbr.googlegroups.com...
> Hello all.
>
> I've used this script before to eliminate duplicate lines in a text
> file, or in an array:
>
> http://blogs.technet.com/heyscriptingguy/archive/2005/04/13/how-can-i-eliminate-duplicate-names-in-a-text-file.aspx
>
> Now I need to eliminate duplicate lines in a comma separated file. The
> contents of the CSV file are the following:
>
> Software Program One,3.0,9/21/2009
> Software Program Two,1.0,9/21/2009
> Software Program Two,
> Software Program Three,2.0,9/21/2009
> Software Program Four,1.0,9/21/2009
> Software Program Five,1.0,9/21/2009
> Software Program Five,,9/21/2009
> Software Program Six,1.0,9/21/2009
>
> I'm not comparing the entire line. Of the 3 fields in the CSV, I want
> to find lines where the FIRST FIELD is a duplicate. So these lines
> would qualify as duplicates:
>
> Software Program Two,1.0,9/21/2009
> Software Program Two,
>
> Software Program Five,1.0,9/21/2009
> Software Program Five,,9/21/2009
>
> And then comparing the duplicates, I want to eliminate the line that
> has the 2nd and/or the 3rd field empty; and retain the line which has
> all 3 fields populated. So my end result would be the CSV looking like
> the following:
>
> Software Program One,3.0,9/21/2009
> Software Program Two,1.0,9/21/2009
> Software Program Three,2.0,9/21/2009
> Software Program Four,1.0,9/21/2009
> Software Program Five,1.0,9/21/2009
> Software Program Six,1.0,9/21/2009
>
> I've come up with a script that will parse a CSV and display the
> values in the 3 fields, but I can't figure out the logic to add that
> will compare and eliminate the duplicates in the manner I need:
>
> Const ForReading = 1, ForWriting = 2, ForAppending = 8
> Set objFSO = CreateObject("Scripting.FileSystemObject")
> Set objDictionary = CreateObject("Scripting.Dictionary")
> Dim sFile, sQueryResults
> sFile = "File.txt"
> Set objFile = objFSO.OpenTextFile (sFile,1)
> sQueryResults = objFile.ReadAll
> objFile.Close
> GetArrayValues(sQueryResults)
> Function GetArrayValues(sList)
> Arr = Split(sList, vbCrlf)
> For i = 0 To UBound(Arr)
> sLine = Arr(i)
> ArrFields = Split(sLine, ",")
> count = -1
> sFieldInfo = ""
> For j = 0 To UBound(ArrFields)
> count = count + 1
> sField = ArrFields(j)
> IF sField ="" Then sField = "EMPTY FIELD"
> sFieldNum = "Array Field " & count
> sFieldInfo = sFieldInfo & sFieldNum & ":" & _
> vbTab & sField & vbCrlf
> Next
> Wscript.Echo vbCrlf & "Array:" & vbTab & vbTab & _
> sLine & vbCrlf & "Array UBound:" & vbTab & _
> UBound(ArrFields) & vbCrlf & sFieldInfo
> count = -1
> Next
> End Function ' GetArrayValues
> Set objFSO = nothing
> Set objDictionary = nothing
>
> How can accomplish what I need to?
>
> Any help would be greatly appreciated. Thanks!
>
> - Dave

It might help to use the Jet OLEDB driver to read the csv file. I would add
a header line so you can sort the fields by name. This will arrange all
lines with the same value for the first field together. Because blanks are
sorted before other values, lines with blanks are always first. After the
lines are sorted, you want to output the last line for each unique value in
the first field. There may be a better way to do this using SQL, but I could
only think of the way below. This works for your example lines. I added the
following header line to the start of the csv file:

Field1,Field2,Field3

Then I used the code below:
==========
Option Explicit

Dim adoCSVConnection, adoCSVRecordSet, strPathToTextfile
Dim strCSVFile, k, strLine, strPrevious, strFlag

' Specify path to CSV file.
strPathToTextFile = "c:\Scripts\"

' Specify CSV file name.
strCSVFile = "Example.csv"

' Open connection to the CSV file.
Set adoCSVConnection = CreateObject("ADODB.Connection")
Set adoCSVRecordSet = CreateObject("ADODB.Recordset")

' Open CSV file with header line.
adoCSVConnection.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source=" & strPathtoTextFile & ";" & _
"Extended Properties=""text;HDR=YES;FMT=Delimited"""

' Sort by all fields, so lines with blanks are first.
' If there is any line with no blanks, it will be the last
' for each unique value of the first field.
adoCSVRecordset.Open "SELECT * FROM " & strCSVFile _
& " ORDER BY Field1, Field2, Field3", adoCSVConnection

' Read the CSV file.
strPrevious = ""
Do Until adoCSVRecordset.EOF
' Display the last line for each unique value of the
' first field.
strFlag = adoCSVRecordset.Fields("Field1").Value
If (strPrevious <> "") And (strFlag <> strPrevious) Then
Wscript.Echo strLine
End If
strLine = adoCSVRecordset.Fields("Field1").Value _
& "," & adoCSVRecordset.Fields("Field2").Value _
& "," & adoCSVRecordset.Fields("Field3").Value
strPrevious = adoCSVRecordset.Fields("Field1").Value
adoCSVRecordset.MoveNext
Loop
Wscript.Echo strLine

' Clean up.
adoCSVRecordset.Close
adoCSVConnection.Close

--
Richard Mueller
MVP Directory Services
Hilltop Lab - http://www.rlmueller.net
--

From: HL0105 on 22 Sep 2009 09:54

On Sep 21, 5:48 pm, "Richard Mueller [MVP]" <rlmueller-
nos...(a)ameritech.nospam.net> wrote:
> "Highlander" <tron9...(a)msn.com> wrote in message
>
> news:09c67f4d-efdd-482c-88fc-9b26c49df255(a)g6g2000vbr.googlegroups.com...
>
>
>
>
>
> > Hello all.
>
> > I've used this script before to eliminate duplicate lines in a text
> > file, or in an array:
>
> >http://blogs.technet.com/heyscriptingguy/archive/2005/04/13/how-can-i...
>
> > Now I need to eliminate duplicate lines in a comma separated file. The
> > contents of the CSV file are the following:
>
> > Software Program One,3.0,9/21/2009
> > Software Program Two,1.0,9/21/2009
> > Software Program Two,
> > Software Program Three,2.0,9/21/2009
> > Software Program Four,1.0,9/21/2009
> > Software Program Five,1.0,9/21/2009
> > Software Program Five,,9/21/2009
> > Software Program Six,1.0,9/21/2009
>
> > I'm not comparing the entire line. Of the 3 fields in the CSV, I want
> > to find lines where the FIRST FIELD is a duplicate. So these lines
> > would qualify as duplicates:
>
> > Software Program Two,1.0,9/21/2009
> > Software Program Two,
>
> > Software Program Five,1.0,9/21/2009
> > Software Program Five,,9/21/2009
>
> > And then comparing the duplicates, I want to eliminate the line that
> > has the 2nd and/or the 3rd field empty; and retain the line which has
> > all 3 fields populated. So my end result would be the CSV looking like
> > the following:
>
> > Software Program One,3.0,9/21/2009
> > Software Program Two,1.0,9/21/2009
> > Software Program Three,2.0,9/21/2009
> > Software Program Four,1.0,9/21/2009
> > Software Program Five,1.0,9/21/2009
> > Software Program Six,1.0,9/21/2009
>
> > I've come up with a script that will parse a CSV and display the
> > values in the 3 fields, but I can't figure out the logic to add that
> > will compare and eliminate the duplicates in the manner I need:
>
> > Const ForReading = 1, ForWriting = 2, ForAppending = 8
> > Set objFSO = CreateObject("Scripting.FileSystemObject")
> > Set objDictionary = CreateObject("Scripting.Dictionary")
> > Dim sFile, sQueryResults
> > sFile = "File.txt"
> > Set objFile = objFSO.OpenTextFile (sFile,1)
> > sQueryResults = objFile.ReadAll
> > objFile.Close
> > GetArrayValues(sQueryResults)
> > Function GetArrayValues(sList)
> > Arr = Split(sList, vbCrlf)
> > For i = 0 To UBound(Arr)
> > sLine = Arr(i)
> > ArrFields = Split(sLine, ",")
> > count = -1
> > sFieldInfo = ""
> > For j = 0 To UBound(ArrFields)
> > count = count + 1
> > sField = ArrFields(j)
> > IF sField ="" Then sField = "EMPTY FIELD"
> > sFieldNum = "Array Field " & count
> > sFieldInfo = sFieldInfo & sFieldNum & ":" & _
> > vbTab & sField & vbCrlf
> > Next
> > Wscript.Echo vbCrlf & "Array:" & vbTab & vbTab & _
> > sLine & vbCrlf & "Array UBound:" & vbTab & _
> > UBound(ArrFields) & vbCrlf & sFieldInfo
> > count = -1
> > Next
> > End Function ' GetArrayValues
> > Set objFSO = nothing
> > Set objDictionary = nothing
>
> > How can accomplish what I need to?
>
> > Any help would be greatly appreciated. Thanks!
>
> > - Dave
>
> It might help to use the Jet OLEDB driver to read the csv file. I would add
> a header line so you can sort the fields by name. This will arrange all
> lines with the same value for the first field together. Because blanks are
> sorted before other values, lines with blanks are always first. After the
> lines are sorted, you want to output the last line for each unique value in
> the first field. There may be a better way to do this using SQL, but I could
> only think of the way below. This works for your example lines. I added the
> following header line to the start of the csv file:
>
> Field1,Field2,Field3
>
> Then I used the code below:
> ==========
> Option Explicit
>
> Dim adoCSVConnection, adoCSVRecordSet, strPathToTextfile
> Dim strCSVFile, k, strLine, strPrevious, strFlag
>
> ' Specify path to CSV file.
> strPathToTextFile = "c:\Scripts\"
>
> ' Specify CSV file name.
> strCSVFile = "Example.csv"
>
> ' Open connection to the CSV file.
> Set adoCSVConnection = CreateObject("ADODB.Connection")
> Set adoCSVRecordSet = CreateObject("ADODB.Recordset")
>
> ' Open CSV file with header line.
> adoCSVConnection.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
> "Data Source=" & strPathtoTextFile & ";" & _
> "Extended Properties=""text;HDR=YES;FMT=Delimited"""
>
> ' Sort by all fields, so lines with blanks are first.
> ' If there is any line with no blanks, it will be the last
> ' for each unique value of the first field.
> adoCSVRecordset.Open "SELECT * FROM " & strCSVFile _
> & " ORDER BY Field1, Field2, Field3", adoCSVConnection
>
> ' Read the CSV file.
> strPrevious = ""
> Do Until adoCSVRecordset.EOF
> ' Display the last line for each unique value of the
> ' first field.
> strFlag = adoCSVRecordset.Fields("Field1").Value
> If (strPrevious <> "") And (strFlag <> strPrevious) Then
> Wscript.Echo strLine
> End If
> strLine = adoCSVRecordset.Fields("Field1").Value _
> & "," & adoCSVRecordset.Fields("Field2").Value _
> & "," & adoCSVRecordset.Fields("Field3").Value
> strPrevious = adoCSVRecordset.Fields("Field1").Value
> adoCSVRecordset.MoveNext
> Loop
> Wscript.Echo strLine
>
> ' Clean up.
> adoCSVRecordset.Close
> adoCSVConnection.Close
>
> --
> Richard Mueller
> MVP Directory Services
> Hilltop Lab -http://www.rlmueller.net
> --- Hide quoted text -
>
> - Show quoted text -

Thanks Richard your script does the job! Except for one thing - it's
changing the value of the second field. Looks to be chopping it off
after a certain number of decimal points.

This is more clearly illustrated if you change the contents of the CSV
file to the following, and then run your script:

Field1,Field2,Field3
Software Program One,1.2.3.456789,12/31/2009
Software Program Two,2.1.234.56789,12/31/2009
Software Program Two,
Software Program Three,3.123.456.789,12/31/2009
Software Program Four,4.12.34.56.789,12/31/2009
Software Program Five,5.1234567.89,9,12/31/2009
Software Program Five,,12/31/2009
Software Program Six,6.123456.78.9,12/31/2009

- Dave

From: Richard Mueller [MVP] on 22 Sep 2009 10:47

"HL0105" <tron9901(a)msn.com> wrote in message
news:97f6384c-e7ac-44c2-9b43-64fd17f4aff4(a)m11g2000vbl.googlegroups.com...
On Sep 21, 5:48 pm, "Richard Mueller [MVP]" <rlmueller-
nos...(a)ameritech.nospam.net> wrote:
> "Highlander" <tron9...(a)msn.com> wrote in message
>snip...

Thanks Richard your script does the job! Except for one thing - it's
changing the value of the second field. Looks to be chopping it off
after a certain number of decimal points.

This is more clearly illustrated if you change the contents of the CSV
file to the following, and then run your script:

Field1,Field2,Field3
Software Program One,1.2.3.456789,12/31/2009
Software Program Two,2.1.234.56789,12/31/2009
Software Program Two,
Software Program Three,3.123.456.789,12/31/2009
Software Program Four,4.12.34.56.789,12/31/2009
Software Program Five,5.1234567.89,9,12/31/2009
Software Program Five,,12/31/2009
Software Program Six,6.123456.78.9,12/31/2009

- Dave
-----------

The driver assumes the value is numeric. The solution is to create a
schema.ini file in the folder with the csv file. I used the following
schema.ini file:
===========
[Example.csv]
Format=CSVDelimited

Col1="Field1" Text
Col2="Field2" Text
Col3="Field3" Text
========
The first line is the name of the file in square brackets. This file told
the driver that all fields are text. There are many other things you can do
in such a file. I found this link:

http://msdn.microsoft.com/en-us/library/ms709353(VS.85).aspx

The driver knows to magically look for the schema.ini file and use it if the
specified file name matches.

--
Richard Mueller
MVP Directory Services
Hilltop Lab - http://www.rlmueller.net
--

From: HL0105 on 22 Sep 2009 11:55

On Sep 22, 9:47 am, "Richard Mueller [MVP]" <rlmueller-
nos...(a)ameritech.nospam.net> wrote:
> "HL0105" <tron9...(a)msn.com> wrote in message
>
> news:97f6384c-e7ac-44c2-9b43-64fd17f4aff4(a)m11g2000vbl.googlegroups.com...
> On Sep 21, 5:48 pm, "Richard Mueller [MVP]" <rlmueller-
>
> nos...(a)ameritech.nospam.net> wrote:
> > "Highlander" <tron9...(a)msn.com> wrote in message
> >snip...
>
> Thanks Richard your script does the job! Except for one thing - it's
> changing the value of the second field. Looks to be chopping it off
> after a certain number of decimal points.
>
> This is more clearly illustrated if you change the contents of the CSV
> file to the following, and then run your script:
>
> Field1,Field2,Field3
> Software Program One,1.2.3.456789,12/31/2009
> Software Program Two,2.1.234.56789,12/31/2009
> Software Program Two,
> Software Program Three,3.123.456.789,12/31/2009
> Software Program Four,4.12.34.56.789,12/31/2009
> Software Program Five,5.1234567.89,9,12/31/2009
> Software Program Five,,12/31/2009
> Software Program Six,6.123456.78.9,12/31/2009
>
> - Dave
> -----------
>
> The driver assumes the value is numeric. The solution is to create a
> schema.ini file in the folder with the csv file. I used the following
> schema.ini file:
> ===========
> [Example.csv]
> Format=CSVDelimited
>
> Col1="Field1" Text
> Col2="Field2" Text
> Col3="Field3" Text
> ========
> The first line is the name of the file in square brackets. This file told
> the driver that all fields are text. There are many other things you can do
> in such a file. I found this link:
>
> http://msdn.microsoft.com/en-us/library/ms709353(VS.85).aspx
>
> The driver knows to magically look for the schema.ini file and use it if the
> specified file name matches.
>
> --
> Richard Mueller
> MVP Directory Services
> Hilltop Lab -http://www.rlmueller.net
> --

The schema.ini file solved the problem. Thanks Richard!

Two more questions.

1. I'm using this code within an HTA, running it from a mapped network
drive. So I keep getting the message "This website uses a data
provider that may be unsafe. If you trust the website, click OK,
otherwise click Cancel."
Is there a way to prevent this annoying message?

2. I actually have the data in an array; I'm writing the array to a
CSV file in order to use your code. Is it possible to use this ADO
code to parse the array directly, instead of parsing the CSV file?

Thanks again!

- Dave

| Next | Last
Pages: 1 2
Prev: ADSystemInfo and ComputerName
Next: shell.application throw error on asp file