Parsing a string [Programming]

Prev: Using Color as Criteria in Countif Function
Next: Create a list with multiple criteria

From: Jim Berglund on 24 May 2010 18:58

I want to parse a lot of data with rows like the following:

A A WOODS 1479 22RD STREET CALGARY AB T3H2C4 (403) 555-9999

I would like columns for NAME ADDRESS CITY PROV POSTALCODE PHONENUMBER

I can do it in Excel with formulas, but is there a nice piece of generic
code that does this sort of thing?
(or should I parse it based on using spaces as delimiters and then join the
fields that need joining?)

Jim Berglund

From: JLGWhiz on 24 May 2010 19:35

The phone number, postal code, province and city could probably be picked
our pretty easily. Unless there is some consistency in the address and
names grouping, it would be very difficult to pick those out in one fell
swoop. Consistency would be that the address is always street number and a
two part street name separated by spaces, and the person's name would always
be first middle last separated by spaces. I doubt that those are entered
consitently in that fashion. If the street address always begins with a
number, it could be used a key to separate the name and street address,
which would then make it possible to break all the elements into their
respective groups. But I wouldn't want to tackle it.

"Jim Berglund" <jazzzbo(a)shaw.ca> wrote in message
news:%23hurWS5%23KHA.148(a)TK2MSFTNGP06.phx.gbl...
>I want to parse a lot of data with rows like the following:
>
> A A WOODS 1479 22RD STREET CALGARY AB T3H2C4 (403) 555-9999
>
> I would like columns for NAME ADDRESS CITY PROV POSTALCODE PHONENUMBER
>
> I can do it in Excel with formulas, but is there a nice piece of generic
> code that does this sort of thing?
> (or should I parse it based on using spaces as delimiters and then join
> the fields that need joining?)
>
> Jim Berglund

From: Ron Rosenfeld on 24 May 2010 21:29

On Mon, 24 May 2010 16:58:10 -0600, "Jim Berglund" <jazzzbo(a)shaw.ca> wrote:

>I want to parse a lot of data with rows like the following:
>
>A A WOODS 1479 22RD STREET CALGARY AB T3H2C4 (403) 555-9999
>
>I would like columns for NAME ADDRESS CITY PROV POSTALCODE PHONENUMBER
>
>I can do it in Excel with formulas, but is there a nice piece of generic
>code that does this sort of thing?
>(or should I parse it based on using spaces as delimiters and then join the
>fields that need joining?)
>
>Jim Berglund

How easy or hard this might be to do depends on format variations.

If you cannot set out firm rules for the data, you won't be able to do it.

Probably the PROV, POSTALCODE & PHONENUMBER can be easily parsed from the end
of the string.

The CITY would be one or several words that precede the PROV and you could use
a list of valid cities.

The ADDRESS precedes the city and starts with a number.

The NAME ends with the number.

If all of your rows can be unambiguously described, either as I have done or
some other method, then you should be able to parse either with formulas, or
with a macro.

Here's a sample macro with a very short list of possibly valid cities that
parses rows that fit the above rules. It should give you some ideas.

================================
Option Explicit
Sub ParseAddr()
Dim myRegExp As Object, myMatches As Object
Dim rg As Range, c As Range
Dim i As Long

Set rg = Selection
Set myRegExp = CreateObject("vbscript.regexp")
myRegExp.Pattern = "^(\D+)\s+(.*)\s(CALGARY|MELBOURNE|SYDNEY)" _
& "\s+([A-Z]{2})\s+(\w+)\s+($\d{3}$\s+\d{3}-\d{4})$"

For Each c In rg
If myRegExp.test(c.Text) = True Then
Set myMatches = myRegExp.Execute(c.Text)
For i = 0 To 5
c.Offset(0, i + 1) = myMatches(0).submatches(i)
Next i
End If
Next c
End Sub
====================================

The list of cities needs to be enclosed in parentheses and be pipe delimited.
They are tested in the order listed, so if you had several cities with similar
names, the order you list them in can be critical.
--ron

From: Jim Berglund on 25 May 2010 00:27

Ron. Thanks for the effort you put in on this. I would really like to
understand this more.

Is it checking each row to see if it meets the pattern, and if so, doing the
parsing and then building a new, 5-column list?
What is the following line for?
Set myRegExp = CreateObject("vbscript.regexp")

Also, could you please help me understand the following? I've never seen any
code like it and I just don't have a clue how to interpret it...

myRegExp.Pattern = "^(\D+)\s+(.*)\s(CALGARY|MELBOURNE|SYDNEY)" &
"\s+([A-Z]{2})\s+(\w+)\s+($\d{3}$\s+\d{3}-\d{4})$"

I tried it and it ran without any errors. But it didn't do anything.

BTW, For anyone else looking at this, I found the following entry...

Formulas for parsing names.

1. Using an index:

=T(INDEX($M$1:$M$6,MAX(INDEX(COUNTIF(C1,$M$2:$M$6&{". *","
*"})*{1;2;3;4;5},0))+1))
(Set up an index in M1:M5)

2. Using OR
=IF(LEFT(A30,3)="Mrs","Mrs",IF(LEFT(A30,4)="Miss","Miss",IF(OR(LEFT(A30,2)="Mr",LEFT(A30,2)="Ms",LEFT(A30,2)="Dr"),LEFT(A30,2),"X")))

3. Parsing, using VB

Option Explicit
Function ParseName(str As String, Index As Long) As String
Dim re As Object
Dim mc As Object
Dim sPat As String
Dim sTitle As String

'Index code
' 1 = Salutation
' 3 = First Name

'Pipe-delimited list of possible Titles
sTitle = "Mr|Ms|Miss|Mrs|Dr"

sPat = "^((" & sTitle & ")\.?(\s+))?(\w+)"

Set re = CreateObject("vbscript.regexp")
re.ignorecase = True
re.Pattern = sPat

If re.test(str) = True Then
Set mc = re.Execute(str)
ParseName = mc(0).submatches(Index)
End If
End Function

4. Using TRIM
Parsing names in Excel
Having just had to go through a process to clean up a list of names in a
spreadsheet, I thought I'd share the 'magic' formula I came up with.

As you can imagine, there was no pretty way to get the list into Excel,
there were several columns of names and other information, but with no
obvious delimiters to behold. I was amazed when I got the file, how bad it
looked. I can only assume this was a redirected print file from some
ancient mainframe system. It can't be that hard to get a nice comma
delimited file surely!?

The format of the name was either:
First Last
First MI Last
First Middle Last
Obtaining the first name was simple:
=LEFT(A11,(FIND(" ",A11,1)-1))
Getting the last name was a different matter:
=TRIM( IF(ISERR(FIND(" ",A11,(FIND(" ",A11,1)+1))),
MID(A11,FIND(" ",A11,1)+1,(LEN(A11)-FIND(" ",A11,1))),
MID(A11,FIND(" ",A11,(FIND(" ",A11,1)+1))+1,(LEN(A11)-FIND("
",A11,FIND(" ",A11,1))))))

Basically, remove any spaces and if the FIND returns an error, then the name
is just first and last names, otherwise there is a middle name or initial to
deal with. If no middle name/initial then get the last name, which starts 1
char after the space otherwise get the last name after locating the second
space in the string and start from there

Another Method for Parsing Names
Assume you have names of the following types in column C
Mrs Dorothy Hannity
Dr P R Rogers
Dana Delany
Mr Bradley K Pitts
Type the following formulas into the specified cells:
O1=FIND(" ",C1)
Determines the location/existence of the blank following the Salutaton or
First Name
P1=FIND(" ",C1,FIND(" ",C1)+1)
Determines the location/existence of the blank following the First Name or
Middle Initial(MI)
Q1=FIND(" ",R1)
Determines the location/existence of the blank following the Middle Initial
in the next, adjacent cell
R1=IF(ISERROR(P1),RIGHT(C1,LEN(C1)-O1),RIGHT(C1,LEN(C1)-P1))
Defines the Last Name or MI/LN if there is a MI
S1=LEFT(C1,O1-1)
Creates the Salutation column
T1=IF(ISERROR(S1)," ",MID(C1,O1+1,P1-O1))
Creates the First Name column
U1=IF(FIND(" ",R1)=2,LEFT(R1,1),"")
Creates the MI column
V1=IF(U1<>"",MID(R1,FIND(" ",R1)+1,99),R1)
Creates the Last Name column
Fill the entries down and then copy the resulting values into another set of
columns.

Jim

"Ron Rosenfeld" <ronrosenfeld(a)nospam.org> wrote in message
news:1r8mv51ickslteko3es1jnkh7142228q55(a)4ax.com...
> On Mon, 24 May 2010 16:58:10 -0600, "Jim Berglund" <jazzzbo(a)shaw.ca>
> wrote:
>
>>I want to parse a lot of data with rows like the following:
>>
>>A A WOODS 1479 22RD STREET CALGARY AB T3H2C4 (403) 555-9999
>>
>>I would like columns for NAME ADDRESS CITY PROV POSTALCODE PHONENUMBER
>>
>>I can do it in Excel with formulas, but is there a nice piece of generic
>>code that does this sort of thing?
>>(or should I parse it based on using spaces as delimiters and then join
>>the
>>fields that need joining?)
>>
>>Jim Berglund
>
> How easy or hard this might be to do depends on format variations.
>
> If you cannot set out firm rules for the data, you won't be able to do it.
>
> Probably the PROV, POSTALCODE & PHONENUMBER can be easily parsed from the
> end
> of the string.
>
> The CITY would be one or several words that precede the PROV and you could
> use
> a list of valid cities.
>
> The ADDRESS precedes the city and starts with a number.
>
> The NAME ends with the number.
>
> If all of your rows can be unambiguously described, either as I have done
> or
> some other method, then you should be able to parse either with formulas,
> or
> with a macro.
>
> Here's a sample macro with a very short list of possibly valid cities that
> parses rows that fit the above rules. It should give you some ideas.
>
> ================================
> Option Explicit
> Sub ParseAddr()
> Dim myRegExp As Object, myMatches As Object
> Dim rg As Range, c As Range
> Dim i As Long
>
> Set rg = Selection
> Set myRegExp = CreateObject("vbscript.regexp")
> myRegExp.Pattern = "^(\D+)\s+(.*)\s(CALGARY|MELBOURNE|SYDNEY)" _
> & "\s+([A-Z]{2})\s+(\w+)\s+($\d{3}$\s+\d{3}-\d{4})$"
>
> For Each c In rg
> If myRegExp.test(c.Text) = True Then
> Set myMatches = myRegExp.Execute(c.Text)
> For i = 0 To 5
> c.Offset(0, i + 1) = myMatches(0).submatches(i)
> Next i
> End If
> Next c
> End Sub
> ====================================
>
> The list of cities needs to be enclosed in parentheses and be pipe
> delimited.
> They are tested in the order listed, so if you had several cities with
> similar
> names, the order you list them in can be critical.
> --ron

From: Ron Rosenfeld on 25 May 2010 06:11

On Mon, 24 May 2010 22:27:18 -0600, "Jim Berglund" <jazzzbo(a)shaw.ca> wrote:

>Ron. Thanks for the effort you put in on this. I would really like to
>understand this more.
>
>Is it checking each row to see if it meets the pattern, and if so, doing the
>parsing and then building a new, 5-column list?

Yes. And if your row does not meet the format, it will do nothing.

>What is the following line for?
> Set myRegExp = CreateObject("vbscript.regexp")

That's called late binding, and it invokes another program which allows the use
of Regular Expressions in VBA. Regular Expressions are a powerful method of
manipulating strings.

>
>Also, could you please help me understand the following? I've never seen any
>code like it and I just don't have a clue how to interpret it...
>
>myRegExp.Pattern = "^(\D+)\s+(.*)\s(CALGARY|MELBOURNE|SYDNEY)" &
>"\s+([A-Z]{2})\s+(\w+)\s+($\d{3}$\s+\d{3}-\d{4})$"

That is the pattern derived from the example you gave.

Everything was on one line so:

^ Start at the beginning of the line

(\D+) Capture everything that is not a digit into submatch 0. (In the
definition, the first digit was the beginning of the address). This will be
the name.

\s+ matches the space(s) between the end of the non-digits and the next
group.

(.*) capture everything, into submatch 1, up to the next match. This will
be the address.

\s matches the space between the end of the address and the next group

((CALGARY|MELBOURNE|SYDNEY) Short list of valid city names. This will need
to be extended to match your data. Capture into submatch 2.

\s+ matches the space(s) between the end of the city name and the province.

([A-Z]{2}) capture the next two capital letter abbreviation as the
Province into submatch 3

the rest matches the postcode, and phone number, again, according to the
pattern that was in the example you gave.

>
>I tried it and it ran without any errors. But it didn't do anything.

It ran perfectly here using your line of sample data.

You have to enter and run it properly.

To enter this Macro (Sub), <alt-F11> opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code into the window that opens.

To use this Macro (Sub), FIRST *SELECT* the data you wish to parse. Then
<alt-F8> opens the macro dialog box. Select the macro by name, and <RUN>.

If you did all that, and the macro did not do anything, then the data you ran
it against did not match, in format, what I had laid out. Or, possibly, your
newsreader wrapped a line and you did not copy it correctly into the VBEditor.
***Both I and JLGWhiz emphasized to you the importance of properly defining
your format.***

--ron

| Next | Last
Pages: 1 2 3
Prev: Using Color as Criteria in Countif Function
Next: Create a list with multiple criteria