Reading in BibTeX files using SAS [SAS]

From: Dave Haans on 7 Apr 2010 15:32

Hi all,

I'm having trouble reading in a BibTeX file, partially because I'm not
sure which method would work the best.

BibTeX is simply a text list of bibliographic references in a standard
format. Here's a sample listing (one of 1,000):

@ARTICLE{Hotton2004,
author = {T. Hotton and D. Haans},
title = {{Alcohol and Drug Use in Early Adolescence}},
journal = {Health Reports},
year = {2004},
volume = {15},
pages = {9 - 19},
number = {3},
abstract = {Objectives

SNIP -- but the abstract covers many lines.},
datasets = {NLSCY (National Longitudinal Survey of Children and
Youth)},
documenturl = {http://www.statcan.ca/bsolc/english/bsolc?
catno=82-003-X20030036846},
keywords = {adolescent behaviour, alcoholic intoxication, marijuana,
behaviours
influencing health, child and adolescent health, family
relationships,
social networks, drug use, drug prevalence, parenting, cannabis,
marijuana, cocaine, inhalents, heroin, lsd, alcohol, tobacco,
cigarettes},
timestamp = {2009.09.01},
url = {http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?catno=82-003-
X20030036846&lang=eng}
}

(...onto next reference after a line break...)

I simply need a data set with one reference per observation,
with predefined variable names (author, title, etc.).

I started by reading in this file so that each line is an observation,
and started parsing the text using SUBSTR, INDEXC, etc., from
there. I stopped because reading in multiple lines as one variable is
a bit complex, and I'd have to do it for many fields (author,
title(s), keywords, etc.). I'm familiar with RETAIN but don't know
how to store an arbitrary number of temp values (one for each line)
over several lines and then concatenate them.

I also investigated the many INPUT and INFILE options but can't use
named input without modifying the BibTeX file, and I'm unsure if
reading it in as a binary file would work although that seems to be
the way to go.

What would you suggest as a good method to read this in? I guess if
there's a simple way of using INFILE and INPUT that would be best, but
I'd also greatly appreciate advice on reading in the value of one
variable (say, 'abstract') when it occurs over many lines.

Thanks in advance!

Cheers,

Dave.

From: Dave Haans on 7 Apr 2010 15:35

P.S. Note that the BibTeX reference here got wrapped -- normally,
there are two spaces before each line other than the start @ and
ending } for the reference.

From: Dave Haans on 8 Apr 2010 13:48

Here's what I did to get a pretty good read of a BibTeX file into
a SAS data set. One requirement is that the BibTeX file have
non-wrapping lines, as much as possible. This eliminates the need to
read in a single field across observations. Using JabRef, an
open-source bibliography database, I was able to go to Preferences and
indicate which fields I didn't want to wrap when saving the file. This
worked well except for some abstracts which were entered with line
breaks.

Sample SAS code:

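/* Read in BibTeX file as series of lines. With DLM='09'x (tab), blanks
do not split the line in list input; a line containing a tab would be
cut at the tab. */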
data rawtemp;
length line $ 10000;
infile 'C:\bibliography.bib' lrecl=10000 dlm='09'x missover
encoding='utf-8';
input line;
run;

/* Parse lines from BibTeX file */
data rawentry; set rawtemp;
length bibtextype $ 13 bibtexkey $ 50 author $ 500 title $ 2000
journal $ 250 year $ 4 volume $ 5 number $ 20 pages $ 11
abstract $ 4000 keywords $ 1000 contractid $ 10 networkref $ 10
owner $ 25 rdc $ 1000 timestamp $ 10 url $ 500 documenturl $ 500
datasets $ 500 publisher $ 50 editor $ 500 booktitle $ 2000
city $ 25 department $ 50 university $ 50;
retain bibtextype bibtexkey author title journal year volume number
pages abstract keywords contractid networkref owner rdc timestamp
url documenturl datasets publisher editor booktitle city
department university;

/* Initialize fields and grab BibTeX reference type and BibTeX key */
if substr(line,1,1)='@' then do;

bibtextype=''; bibtexkey=''; author=''; title=''; journal=''; year='';
volume=''; number=''; pages='';
abstract=''; keywords=''; contractid=''; networkref=''; owner='';
rdc=''; timestamp=''; url='';
documenturl=''; datasets=''; publisher=''; editor=''; booktitle='';
city=''; department=''; university='';

bibtextype=lowcase(substr(line,2,indexc(line,'{')-2));
bibtexkey=substr(line,indexc(line,'{')+1,indexc(line,',')-indexc(line,'{')-1);
end;

/* Get author(s) */
if index(line,'author = {') and index(line,'},') then
author=substr(line,index(line,'author = {')+10,index(line,'},')-index(line,'author = {')-10);

/* Get title - Note: COMPRESS is used to remove curly braces from
around the title. The LEFT(line) =: test requires the line to start
with 'title = {', so 'booktitle = {' lines no longer overwrite title. */
if left(line) =: 'title = {' and index(line,'},') then
title=compress(substr(line,index(line,'title = {')+9,index(line,'},')-index(line,'title = {')-9),'{}');

/* Get journal */
if index(line,'journal = {') and index(line,'},') then
journal=substr(line,index(line,'journal = {')+11,index(line,'},')-index(line,'journal = {')-11);

/*...do same for all fields...*/

/* Output at end of each record */
if substr(line,1,1)='}' then output;
run;
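
For the abstracts that still contain line breaks, one possible way to
join a field across lines is to count curly braces and keep appending
lines until the braces balance again. The following is only a rough
sketch under some assumptions that are not part of the code above:
each field starts on its own line as name = {value, braces inside
values are balanced (no escaped braces), and the closing } of the
reference sits on a line by itself. The data set names (fields, refs)
and the variables field, value, depth and eq are made up for
illustration.

```sas
/* Sketch only: collect field values that may span several lines. */
data fields;
  infile 'C:\bibliography.bib' lrecl=32767 truncover encoding='utf-8';
  length bibtexkey $ 50 field $ 32 value $ 4000 line $ 32767;
  retain bibtexkey field value;
  keep bibtexkey field value;

  input;                        /* load the whole line into _INFILE_ */
  line = strip(_infile_);

  /* Start of a reference: remember the key, reset the brace depth */
  if line =: '@' then do;
    bibtexkey = scan(line, 2, '{,');
    depth = 0;
    return;
  end;

  if depth = 0 then do;
    /* Not inside a value: look for the start of a new field */
    eq = index(line, '= {');
    if eq < 2 then return;      /* blank line or closing } */
    field = lowcase(strip(substr(line, 1, eq - 1)));
    value = substr(line, eq + 2);
  end;
  /* Inside a value: append the continuation line if it still fits */
  else if length(value) + length(line) + 1 <= vlength(value) then
    value = catx(' ', value, line);

  /* Sum statement: depth is retained automatically and starts at 0 */
  depth + (countc(line, '{') - countc(line, '}'));

  /* Depth back at 0 means the value is complete */
  if depth = 0 then do;
    if substr(value, length(value)) = ',' then
      value = substrn(value, 1, length(value) - 1);
    value = substrn(value, 2, length(value) - 2);  /* drop outer { } */
    output;
  end;
run;

/* Pivot to one observation per reference. Assumes no field name    */
/* repeats within a reference and that keys are distinct.           */
proc transpose data=fields out=refs(drop=_name_);
  by bibtexkey notsorted;
  id field;
  var value;
run;
```

This writes one observation per field first and then pivots, so no
per-field SUBSTR code is needed. It does not capture bibtextype, it
only strips the outermost pair of braces (a double-braced title keeps
its inner pair), and continuation lines that would push a value past
4000 characters are skipped rather than appended.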
