Extract words from a .dita file [VbScript]

From: Huber57 on 31 Mar 2010 10:11

To whom:
I have a directory with a number of .dita files in it (each can be opened in
notepad).
Inside these files are differing numbers of 'keywords' and 'index terms'.
Each of these words is between tags.

eg: <keyword>help</keyword>
or
<indexterm>data administration</indexterm>

I would like to be able to run a script to pull these keywords and index
terms out and place them in either an MS Word doc or an MS Excel spreadsheet.

Am I in the right forum? How would I go about doing this?

Sincerely,
Doug

From: Bob Barrows on 31 Mar 2010 14:01

So the idea is to get just the two elements from 506 files into a
spreadsheet? Or is it enough to simply get all the data from all 506 files
into a single spreadsheet?

If the former, I still need to see the entire structure of at least one of
the rows of data (two rows would be better).

If the latter, assuming that all 506 files are in a single folder, it
shouldn't be too hard to use the MSXML parser in combination with
FileSystemObject to loop through the files and append the contents of each
file to a single XML document. Again, though, if you need more specific
help, you need to provide more information.

Huber57 wrote:
> Bob,
>
> Thanks much for the reply. I renamed one of the files (to an .xml
> format) and opened it in excel and it (very nicely) dropped the file
> into the spreadsheet with headers and the data listed below.
>
> Unfortunately, I have 506 files. I was hoping to automate. I have
> never done any scripting before.
>
> Thoughts?
>
> Doug
>
> "Bob Barrows" wrote:
>
>> Huber57 wrote:
>>> To whom:
>>> I have a directory with a number of .dita files in it (each can be
>>> opened in notepad).
>>> Inside these files are differing numbers of 'keywords' and 'index
>>> terms'. Each of these words is between tags.
>>>
>>> eg: <keyword>help</keyword>
>>> or
>>> <indexterm>data administration</indexterm>
>>>
>>> I would like to be able to run a script to pull these keywords and
>>> index terms out and place them in either a MS word doc or an MS
>>> excel spreadsheet.
>>>
>>> Am I am in the right forum? How would I go about doing this?
>>>
>> This would be a trivial problem if the files contained well-formed
>> xml as your samples make it appear. The problem is, we cannot be
>> sure if they really contain well-formed xml based on what you've
>> described. You need to show us an actual sample of the data
>> contained in one of these files.
>>
>> To illustrate how trivial this problem might be, create a text file
>> containing nothing but:
>>
>> <items>
>> <item>
>> <keyword>help</keyword>
>> <indexterm>data administration</indexterm>
>> </item>
>> <item>
>> <keyword>help</keyword>
>> <indexterm>network administration</indexterm>
>> </item>
>> </items>
>>
>> Save it as xmltest.txt. Then open Excel, click the Open button on the
>> toolbar, navigate to the folder containing the file you just saved,
>> change the file type to XML Files so you can see the file you saved
>> and open it. Excel will prompt you to tell it how to handle it -
>> tell it to import it as an XML List.
>>
>> If the files don't really contain valid, well-formed xml, we will
>> need to see more of what they contain if you need more than generic
>> advice.
>>

--
Microsoft MVP - ASP/ASP.NET - 2004-2007
Please reply to the newsgroup. This email account is my spam trap so I
don't check it very often. If you must reply off-line, then remove the
"NO SPAM"

From: Huber57 on 31 Mar 2010 14:46

Bob,

I would prefer the former (1 spreadsheet, all index terms and keywords).

Here is some sample code.

<title>Records</title>
<prolog>
<author>Mystery Writers</author>

<metadata><keywords>
<keyword>complication</keyword>
<keyword>complications</keyword>
<keyword>data</keyword>
<keyword>health</keyword>
<keyword>info</keyword>
<keyword>information</keyword>
<keyword>logbook</keyword>
<keyword>logbooks</keyword>
<keyword>my</keyword>
<indexterm>complications</indexterm>
<indexterm>logbook and records, complications</indexterm>
</keywords></metadata>
</prolog>

<conbody>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum felis
massa, ultricies eu auctor in, aliquam et mauris. Sed lobortis facilisis
nisl, vitae sagittis eros interdum ac. In dolor velit, 
</conbody>
<related-links>
<linklist>
<title>Related Links</title>
<link href=otherrecords.dita"><linktext>Other Records</linktext></link>
</linklist>
</related-links>
</concept>

Please let me know if you need anything else.

From: Bob Barrows on 31 Mar 2010 16:23

I will not be able to return to this until tonight. Hopefully someone else
will beat me to it, but if not, I'll check back then.

From: Bob Barrows on 31 Mar 2010 21:31

The problem with this sample is that there is no root element. If the entire
content were nested inside a single element, perhaps called "dita" (see below
for what I am talking about), then there would be no problem. I tried
creating a test file with your data and opening it in Excel, and got the
expected error "document can contain only one top element". So, given your
statement that you were able to open one of these files in Excel, I have to
conclude that you have not shown me the actual structure. There are other
syntax problems with this data (missing quotes, closing tags without opening
tags) that I will have to correct as well. I am going to have to change your
sample data into the correct format to test my code, which appears at
the bottom of this post. It's quick and dirty, but it is tested and it
works.
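
If the real files do turn out to lack a single top element, one possible workaround is to read each file as text and wrap it in a root element before parsing. The following is only a rough sketch under that assumption: the file path and the "dita" wrapper name are made up for illustration, and the wrapping will not work if a file starts with an XML declaration or DOCTYPE.

```vbscript
Option Explicit
Dim fso, xmldoc, raw, samplePath

' Hypothetical path to one .dita file; replace with a real one
samplePath = "c:\filelib\dita\records.dita"

Set fso = CreateObject("Scripting.FileSystemObject")
Set xmldoc = CreateObject("Msxml2.DOMDocument")
xmldoc.async = False

' Read the whole file as text (1 = ForReading)
raw = fso.OpenTextFile(samplePath, 1).ReadAll

' Wrap the fragment in a made-up <dita> root so there is exactly one top element
If xmldoc.loadXML("<dita>" & raw & "</dita>") Then
    WScript.Echo "Parsed OK, keywords found: " & xmldoc.selectNodes("//keyword").length
Else
    ' parseError describes the first well-formedness problem the parser hit
    WScript.Echo "Still not well-formed (line " & xmldoc.parseError.line & "): " & xmldoc.parseError.reason
End If
```

Wrapping only addresses the missing root. The other problems in the posted sample (the missing quote and the closing tag with no opening tag) would still be reported through parseError and need fixing in the files themselves.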

Huber57 wrote:
> Bob,
>
> I would prefer the former (1 spreadsheet, all index terms and
> keywords).
>
> Here is some sample code.
>
<dita>
<title>Records</title>
<prolog>
<author>Mystery Writers</author>

<metadata><keywords>
<keyword>complication</keyword>
<keyword>complications</keyword>
<keyword>data</keyword>
<keyword>health</keyword>
<keyword>info</keyword>
<keyword>information</keyword>
<keyword>logbook</keyword>
<keyword>logbooks</keyword>
<keyword>my</keyword>
<indexterm>complications</indexterm>
<indexterm>logbook and records, complications</indexterm>
</keywords></metadata>
</prolog>

<conbody>
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Vestibulum felis
massa, ultricies eu auctor in, aliquam et mauris. Sed lobortis

facilisis
nisl, vitae sagittis eros interdum ac. In dolor velit, 
</conbody>
<related-links>
<linklist>
<title>Related Links</title>
<link href="otherrecords.dita"><linktext>Other

Records</linktext></link>
</linklist>
</related-links>
<concept></concept>
</dita>
************************************************************************************
Option Explicit
Dim fso, fldr, fil, xmldoc, nodes, kwnode, itnode
Dim xl, wb, ws, kwrow, itrow, pathtofiles

' Replace with your path; the .dita files are expected in its "dita" subfolder
pathtofiles = "c:\filelib"

Set fso = CreateObject("Scripting.FileSystemObject")
Set fldr = fso.GetFolder(pathtofiles & "\dita")
Set xmldoc = CreateObject("Msxml2.DOMDocument")
xmldoc.async = False   ' make Load block until the file is fully parsed

Set xl = CreateObject("Excel.Application")
Set wb = xl.Workbooks.Add
Set ws = wb.Sheets(1)
ws.Name = "dita_values"

' Header row
ws.Cells(1, 1).Value = "Keywords"
ws.Cells(1, 2).Value = "Index Terms"
ws.Range("A1:B1").Font.Bold = True

' Data starts on row 2; each column keeps its own row counter
kwrow = 2
itrow = 2

For Each fil In fldr.Files
    If xmldoc.Load(fil.Path) Then
        ' SelectNodes returns an empty list (never Nothing) when nothing matches
        Set nodes = xmldoc.SelectNodes("//keyword")
        For Each kwnode In nodes
            ws.Cells(kwrow, 1).Value = kwnode.Text
            kwrow = kwrow + 1
        Next
        Set nodes = xmldoc.SelectNodes("//indexterm")
        For Each itnode In nodes
            ws.Cells(itrow, 2).Value = itnode.Text
            itrow = itrow + 1
        Next
    Else
        ' Report files that are not well-formed XML instead of skipping them silently
        MsgBox "Could not parse " & fil.Name & ": " & xmldoc.parseError.reason
    End If
Next

wb.SaveAs pathtofiles & "\keyword_indexterms.xls"
xl.Quit
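
The loop above visits every file in the folder. If the folder might also hold other file types, a check on the extension could be added. This is only a sketch: the folder path is the same placeholder used above, and the assumption that all the files of interest end in .dita comes from the original question.

```vbscript
Option Explicit
Dim fso, fil

Set fso = CreateObject("Scripting.FileSystemObject")

' Placeholder folder; replace with your path
For Each fil In fso.GetFolder("c:\filelib\dita").Files
    ' GetExtensionName returns the extension without the dot
    If LCase(fso.GetExtensionName(fil.Name)) = "dita" Then
        WScript.Echo fil.Path
    End If
Next
```

The same If test could wrap the body of the For Each loop in the main script, so that only .dita files are loaded into the parser.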
