From: Dave Haans on 7 Apr 2010 15:32

Hi all,

I'm having trouble reading in a BibTeX file, partially because I'm not sure which method would work best. BibTeX is simply a text list of bibliographic references in a standard format. Here's a sample listing (one of 1,000):

@ARTICLE{Hotton2004,
  author = {T. Hotton and D. Haans},
  title = {{Alcohol and Drug Use in Early Adolescence}},
  journal = {Health Reports},
  year = {2004},
  volume = {15},
  pages = {9 - 19},
  number = {3},
  abstract = {Objectives SNIP -- but the abstract covers many lines.},
  datasets = {NLSCY (National Longitudinal Survey of Children and Youth)},
  documenturl = {http://www.statcan.ca/bsolc/english/bsolc?catno=82-003-X20030036846},
  keywords = {adolescent behaviour, alcoholic intoxication, marijuana, behaviours influencing health, child and adolescent health, family relationships, social networks, drug use, drug prevalence, parenting, cannabis, marijuana, cocaine, inhalents, heroin, lsd, alcohol, tobacco, cigarettes},
  timestamp = {2009.09.01},
  url = {http://www.statcan.gc.ca/bsolc/olc-cel/olc-cel?catno=82-003-X20030036846&lang=eng}
}

(...onto next reference after a line break...)

I simply need a dataset with one reference per observation out of this, with pre-defined variable names (author, title, etc.).

I started by reading in this file so that each line is an observation, and began parsing the text from there using SUBSTR, INDEXC, etc. I stopped because reading in multiple lines as one variable is a bit complex, and I'd have to do it for many fields (author, title(s), keywords, etc.). I'm familiar with RETAIN but don't know how to store an arbitrary number of temp values (one for each line) over several lines, then concatenate those x temp values.

I also investigated the many INPUT and INFILE options, but I can't use named input without modifying the BibTeX file, and I'm unsure whether reading it in as a binary file would work, although that seems to be the way to go.

What would you suggest as a good method to read this in? I guess a simple way of using INFILE and INPUT would be best, but I'd also greatly appreciate advice on reading in the value of one variable (say, 'abstract') when it occurs over many lines.

Thanks in advance!

Cheers,
Dave.
From: Dave Haans on 7 Apr 2010 15:35

P.S. Note that the BibTeX reference here got wrapped -- normally, there are two spaces before each line other than the start @ and ending } for the reference.
From: Dave Haans on 8 Apr 2010 13:48

Here's what I did to read a BibTeX file into a SAS data set reasonably well.

One requirement is that the BibTeX file have non-wrapping lines, as much as possible. This eliminates the need to read in a single field across observations. Using JabRef, an open-source bibliography database, I was able to go to Preferences and indicate which fields I didn't want to wrap when saving the file. This worked well except for some abstracts which were entered with line breaks.

Sample SAS code:

  /* Read in BibTeX file as series of lines */
  /* (the tab delimiter keeps list input from splitting a line at blanks) */
  data rawtemp;
    length line $ 10000;
    infile 'C:\bibliography.bib' lrecl=10000 dlm='09'x missover encoding='utf-8';
    input line;
  run;

  /* Parse lines from BibTeX file */
  data rawentry;
    set rawtemp;
    length bibtextype $ 13 bibtexkey $ 50 author $ 500 title $ 2000
           journal $ 250 year $ 4 volume $ 5 number $ 20 pages $ 11
           abstract $ 4000 keywords $ 1000 contractid $ 10 networkref $ 10
           owner $ 25 rdc $ 1000 timestamp $ 10 url $ 500 documenturl $ 500
           datasets $ 500 publisher $ 50 editor $ 500 booktitle $ 2000
           city $ 25 department $ 50 university $ 50;
    retain bibtextype bibtexkey author title journal year volume number
           pages abstract keywords contractid networkref owner rdc timestamp
           url documenturl datasets publisher editor booktitle city
           department university;

    /* Initialize fields and grab BibTeX reference type and BibTeX key */
    if substr(line,1,1)='@' then do;
      bibtextype=''; bibtexkey=''; author=''; title=''; journal='';
      year=''; volume=''; number=''; pages=''; abstract=''; keywords='';
      contractid=''; networkref=''; owner=''; rdc=''; timestamp='';
      url=''; documenturl=''; datasets=''; publisher=''; editor='';
      booktitle=''; city=''; department=''; university='';
      bibtextype=lowcase(substr(line,2,indexc(line,'{')-2));
      bibtexkey=substr(line,indexc(line,'{')+1,indexc(line,',')-indexc(line,'{')-1);
    end;

    /* Get author(s) */
    if index(line,'author = {') and index(line,'},') then
      author=substr(line,index(line,'author = {')+10,
                    index(line,'},')-index(line,'author = {')-10);

    /* Get title - Note: COMPRESS is used to remove curly braces from
       around the title. The booktitle check stops a 'booktitle = {' line,
       which also contains 'title = {', from overwriting title. */
    if index(line,'title = {') and not index(line,'booktitle = {')
       and index(line,'},') then
      title=compress(substr(line,index(line,'title = {')+9,
                            index(line,'},')-index(line,'title = {')-9),'{}');

    /* Get journal */
    if index(line,'journal = {') and index(line,'},') then
      journal=substr(line,index(line,'journal = {')+11,
                     index(line,'},')-index(line,'journal = {')-11);

    /*...do same for all fields...*/

    /* Output at end of each record */
    if substr(line,1,1)='}' then output;
  run;
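For the abstracts that still contain line breaks, one option is to carry an open field across lines with RETAIN and close it once its braces balance. What follows is a minimal, untested sketch of that idea, not a replacement for the code above. It assumes every field starts on its own line as name = {..., that braces inside a value are balanced, and that the closing } of each reference sits alone on its own line. The data set name entries and the helper variables field, value, depth and re are made up for illustration, and only three fields are mapped.

```sas
/* Sketch only: join field values that span several lines.             */
/* Assumptions: each field starts on its own line as  name = {...  ,   */
/* braces inside a value are balanced, and the closing } of an entry   */
/* is alone on its line. Names entries/field/value/depth/re are        */
/* hypothetical.                                                       */
data entries (keep=bibtextype bibtexkey author title abstract);
  infile 'C:\bibliography.bib' lrecl=10000 encoding='utf-8';
  length line $ 10000 field $ 32 value $ 32767
         bibtextype $ 13 bibtexkey $ 50
         author $ 500 title $ 2000 abstract $ 4000;
  retain bibtextype bibtexkey author title abstract field value re;
  retain depth 0;              /* number of currently open braces */

  /* Compile the "name = {rest" pattern once */
  if _n_ = 1 then re = prxparse('/^(\w+)\s*=\s*\{(.*)$/');

  input;                       /* load the record into _INFILE_ */
  line = strip(_infile_);

  if depth > 0 then do;
    /* Inside an open field: append this line to the value.            */
    /* value is $32767 so that CATX has room for long abstracts.       */
    value = catx(' ', value, line);
    depth = depth + countc(line, '{') - countc(line, '}');
  end;
  else if substr(line, 1, 1) = '@' and indexc(line, '{') > 2 then do;
    /* Start of a reference: reset fields, take type and key */
    call missing(author, title, abstract, field, value);
    bibtextype = lowcase(substr(line, 2, indexc(line, '{') - 2));
    bibtexkey  = scan(substr(line, indexc(line, '{') + 1), 1, ',');
    return;
  end;
  else if line = '}' then do;
    /* End of a reference: write one observation */
    output;
    return;
  end;
  else if prxmatch(re, trim(line)) then do;
    /* Start of a field: remember its name and the text after the { */
    field = lowcase(prxposn(re, 1, line));
    value = prxposn(re, 2, line);
    depth = 1 + countc(value, '{') - countc(value, '}');
  end;
  else return;                 /* blank or unrecognised line */

  /* Field is complete once its braces balance */
  if depth <= 0 then do;
    depth = 0;
    /* Drop the closing brace and the optional trailing comma */
    value = prxchange('s/\}\s*,?\s*$//', 1, trim(value));
    select (field);
      when ('author')   author   = value;
      when ('title')    title    = compress(value, '{}');
      when ('abstract') abstract = value;
      otherwise;               /* add the remaining fields here */
    end;
  end;
run;
```

Because a field is closed by counting braces instead of looking for '},' on the same line, this would also pick up the last field of a reference, which has no trailing comma. Escaped braces inside a value, or a closing } that shares a line with the last field, would break the assumptions above.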