From: Jeremy Sanders on 21 Jul 2010 12:56

Brandon Harris wrote:
> I'm trying to read in and parse an ascii type file that contains
> information that can span several lines.
> Example:

What about something like this (you need re.MULTILINE):

In [16]: re.findall('^([^ ].*\n([ ].*\n)+)', a, re.MULTILINE)
Out[16]:
[('createNode animCurveTU -n "test:master_globalSmooth";\n setAttr ".tan" 9;\n setAttr -s 4 ".ktv[0:3]" 101 0 163 0 169 0 201 0;\n setAttr -s 4 ".kit[3]" 10;\n setAttr -s 4 ".kot[3]" 10;\n',
  ' setAttr -s 4 ".kot[3]" 10;\n'),
 ('createNode animCurveTU -n "test:master_res";\n setAttr ".tan" 9;\n setAttr ".ktv[0]" 103 0;\n setAttr ".kot[0]" 5;\n',
  ' setAttr ".kot[0]" 5;\n'),
 ('createNode animCurveTU -n "test:master_faceRig";\n setAttr ".tan" 9;\n setAttr ".ktv[0]" 103 0;\n',
  ' setAttr ".ktv[0]" 103 0;\n')]

This works if your blocks start without a space and subsequent lines start with a space.

Jeremy
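For reference, here is the same idea as a self-contained sketch (the sample string and the pattern name are illustrative, not from Jeremy's post; the inner group is made non-capturing so findall returns plain block strings rather than tuples):

import re

# Illustrative sample in the shape described above: a block header starts
# in column 0 and its continuation lines start with a space.
sample = (
    'createNode animCurveTU -n "test:master_res";\n'
    ' setAttr ".tan" 9;\n'
    ' setAttr ".ktv[0]" 103 0;\n'
    'createNode animCurveTU -n "test:master_faceRig";\n'
    ' setAttr ".tan" 9;\n'
)

# ^ anchors at every line start because of re.MULTILINE; group 1 is the
# header line plus all of its indented continuation lines.
pattern = re.compile(r'^([^ ].*\n(?:[ ].*\n)+)', re.MULTILINE)

for block in pattern.findall(sample):
    print(repr(block))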
From: Steven D'Aprano on 21 Jul 2010 20:25
On Wed, 21 Jul 2010 10:06:14 -0500, Brandon Harris wrote:

> what do you mean by slurp the entire file? I'm trying to use regular
> expressions because line by line parsing will be too slow. An example
> file would have somewhere in the realm of 6 million lines of code.

And you think trying to run a regex over all 6 million lines at once
will be faster? I think you're going to be horribly, horribly
disappointed.

And then on Wed, 21 Jul 2010 10:42:11 -0500, Brandon Harris wrote:

> I could make it that simple, but that is also incredibly slow and on a
> file with several million lines, it takes somewhere in the league of
> half an hour to grab all the data. I need this to grab data from many
> many files and return the data quickly.

What do you mean "grab" all the data? If all you mean is read the file,
then 30 minutes to read ~100MB of data is incredibly slow and you're
probably doing something wrong, or you're reading it over a broken link
with very high packet loss, or something.

If you mean read the data AND parse it, then whether that is "incredibly
slow" or "amazingly fast" depends entirely on how complicated your
parser needs to be.

If *all* you mean is "read the file and group the lines, for later
processing", then I would expect it to take well under a minute to group
millions of lines.

Here's a simulation I ran, using 2001000 lines of text based on the
examples you gave. It grabs the blocks, as required, but does no further
parsing of them.

def merge(lines):
    """Join multiple lines into a single block."""
    accumulator = []
    for line in lines:
        if line.lower().startswith('createnode'):
            if accumulator:
                yield ''.join(accumulator)
                accumulator = []
        accumulator.append(line)
    if accumulator:
        yield ''.join(accumulator)

def test():
    import time
    t = time.time()
    count = 0
    f = open('/steve/test.junk')
    for block in merge(f):
        # do some make-work
        n = sum([1 for c in block if c in '1234567890'])
        count += 1
    print "Processed %d blocks from 2M+ lines." % count
    print "Time taken:", time.time() - t, "seconds"

And the result on a low-end PC:

>>> test()
Processed 1000 blocks from 2M+ lines.
Time taken: 17.4497909546 seconds

--
Steven
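As a follow-on sketch (not part of Steven's post): once merge() has grouped the lines, each block could be picked apart with a couple of small regexes. The parse_block helper and both patterns below are guesses based on the sample lines quoted earlier in the thread, not a real Maya-ASCII grammar:

import re

# Illustrative patterns: node name from the createNode header, and the
# argument text of each setAttr statement in the block.
NODE_RE = re.compile(r'createNode\s+\S+\s+-n\s+"([^"]+)"')
ATTR_RE = re.compile(r'setAttr\s+(.*?);')

def parse_block(block):
    """Return (node name, list of setAttr argument strings) for one block."""
    name_match = NODE_RE.search(block)
    name = name_match.group(1) if name_match else None
    return name, ATTR_RE.findall(block)

# Usage with the merge() generator above (the file name is made up):
#
#     with open('scene.ma') as f:
#         for block in merge(f):
#             name, attrs = parse_block(block)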