parsing an Excel formula with the re module [Python]

Prev: embedded python on mac - linking problem
Next: TypeError

From: Chris Withers on 13 Jan 2010 04:44

John Machin wrote:
> The OP was planning to dig the formula text out using COM then parse the
> formula text looking for cell references and appeared to have a rather
> simplistic view of the ease of parsing Excel formula text -- that's why
> I pointed him at those facilities (existing, released, proven in the
> field) in xlwt.

Which bits of xlwt are you referring to? (at a guess, the stuff that
turns a piece of text into the correct formulae gubbinz when you write a
formula to a cell with xlwt?)

cheers,

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk

From: Gabriel Genellina on 13 Jan 2010 22:05

En Wed, 13 Jan 2010 05:15:52 -0300, Paul McGuire <ptmcg(a)austin.rr.com>
escribi�:
>> vsoler wrote:
>> > Hence, I need toparseExcel formulas. Can I do it by means only of re
>> > (regular expressions)?
>
> This might give the OP a running start:
>
> from pyparsing import (CaselessKeyword, Suppress, ...

Did you build those parsing rules just by common sense, or following some
actual specification?

--
Gabriel Genellina

From: John Machin on 14 Jan 2010 19:16

On Jan 13, 7:15 pm, Paul McGuire <pt...(a)austin.rr.com> wrote:
> On Jan 5, 1:49 pm, Tim Chase <python.l...(a)tim.thechases.com> wrote:
>
>
>
> > vsoler wrote:
> > > Hence, I need toparseExcel formulas. Can I do it by means only of re
> > > (regular expressions)?
>
> > > I know that for simple formulas such as "=3*A7+5" it is indeed
> > > possible. What about complex for formulas that include functions,
> > > sheet names and possibly other *.xls files?
>
> > Where things start getting ugly is when you have nested function
> > calls, such as
>
> > =if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14,
> > (Min(C1:C25)+3)*18,Max(B1:B25)))
>
> > Regular expressions don't do well with nested parens (especially
> > arbitrarily-nesting-depth such as are possible), so I'd suggest
> > going for a full-blown parsing solution like pyparsing.
>
> > If you have fair control over what can be contained in the
> > formulas and you know they won't contain nested parens/functions,
> > you might be able to formulate some sort of "kinda, sorta, maybe
> > parses some forms of formulas" regexp.
>
> > -tkc
>
> This might give the OP a running start:

Unfortunately "this" will blow up after only a few paces; see
below ...

>
> from pyparsing import (CaselessKeyword, Suppress, Word, alphas,
> alphanums, nums, Optional, Group, oneOf, Forward, Regex,
> operatorPrecedence, opAssoc, dblQuotedString)
>
> test1 = "=3*A7+5"
> test2 = "=3*Sheet1!$A$7+5"

test2a ="=3*'Sheet 1'!$A$7+5"
test2b ="=3*'O''Reilly''s sheet'!$A$7+5"

> test3 = "=if(Sum(A1:A25)>42,Min(B1:B25), " \
> "if(Sum(C1:C25)>3.14, (Min(C1:C25)+3)*18,Max(B1:B25)))"

Many functions can take a variable number of args and they are not
restricted to cell references e.g.

test3a = "=sum(a1:a25,10,min(b1,c2,d3))"

The arg separator is comma or semicolon depending on the locale ... a
parser should accept either.

> EQ,EXCL,LPAR,RPAR,COLON,COMMA,DOLLAR = map(Suppress, '=!():,$')
> sheetRef = Word(alphas, alphanums)
> colRef = Optional(DOLLAR) + Word(alphas,max=2)
> rowRef = Optional(DOLLAR) + Word(nums)
> cellRef = Group(Optional(sheetRef + EXCL)("sheet") + colRef("col") +
> rowRef("row"))
>
> cellRange = (Group(cellRef("start") + COLON + cellRef("end"))
> ("range")
> | cellRef )
>
> expr = Forward()
>
> COMPARISON_OP = oneOf("< = > >= <= != <>")
> condExpr = expr + COMPARISON_OP + expr
>
> ifFunc = (CaselessKeyword("if") +
> LPAR +
> Group(condExpr)("condition") +

that should be any expression; at run-time it expects a boolean (TRUE
or FALSE) or a number (0 means false, non-0 means true). Text causes a
#VALUE! error. Trying to subdivide expressions into conditional /
numeric /text just won't work.

> COMMA + expr("if_true") +
> COMMA + expr("if_false") + RPAR)
> statFunc = lambda name : CaselessKeyword(name) + LPAR + cellRange +
> RPAR
> sumFunc = statFunc("sum")
> minFunc = statFunc("min")
> maxFunc = statFunc("max")
> aveFunc = statFunc("ave")
> funcCall = ifFunc | sumFunc | minFunc | maxFunc | aveFunc
>
> multOp = oneOf("* /")
> addOp = oneOf("+ -")

needs power op "^"

> numericLiteral = Regex(r"\-?\d+(\.\d+)?")

Sorry, that "-" in there is a unary minus operator. What about 1e23 ?

> operand = numericLiteral | funcCall | cellRange | cellRef
> arithExpr = operatorPrecedence(operand,
> [
> (multOp, 2, opAssoc.LEFT),
> (addOp, 2, opAssoc.LEFT),
> ])
>
> textOperand = dblQuotedString | cellRef
> textExpr = operatorPrecedence(textOperand,
> [
> ('&', 2, opAssoc.LEFT),
> ])

Excel evaluates excessively permissively, and the punters are
definitely not known for self-restraint. The above just won't work:
2.3 & 4.5 produces text "2.34.5", while "2.3" + "4.5" produces number
6.8.

From: John Machin on 14 Jan 2010 19:23

On Jan 14, 2:05 pm, "Gabriel Genellina" <gagsl-...(a)yahoo.com.ar>
wrote:
> En Wed, 13 Jan 2010 05:15:52 -0300, Paul McGuire <pt...(a)austin.rr.com>
> escribió:
>
> >> vsoler wrote:
> >> > Hence, I need toparseExcel formulas. Can I do it by means only of re
> >> > (regular expressions)?
>
> > This might give the OP a running start:
>
> > from pyparsing import (CaselessKeyword, Suppress, ...
>
> Did you build those parsing rules just by common sense, or following some
> actual specification?

Leave your common sense with the barkeep when you enter the Excel
saloon; it is likely to be a hindrance. The specification is what
Excel does.

From: Paul McGuire on 14 Jan 2010 23:41

I never represented that this parser would handle any and all Excel
formulas! But I should hope the basic structure of a pyparsing
solution might help the OP add some of the other features you cited,
if necessary. It's actually pretty common to take an incremental
approach in making such a parser, and so here are some of the changes
that you would need to make based on the deficiencies you pointed out:

functions can have a variable number of arguments, of any kind of
expression
- statFunc = lambda name : CaselessKeyword(name) + LPAR + delimitedList
(expr) + RPAR

sheet name could also be a quoted string
- sheetRef = Word(alphas, alphanums) | QuotedString("'",escQuote="''")

add boolean literal support
- boolLiteral = oneOf("TRUE FALSE")
- operand = numericLiteral | funcCall | boolLiteral | cellRange |
cellRef

These small changes are enough to extend the parser to successfully
handle the test2a, 2b, and 3a cases. (I'll add this to the pyparsing
wiki examples, as it looks like it is a good start on a familiar but
complex expression.)

-- Paul

First | Prev | Next | Last
Pages: 1 2 3 4 5
Prev: embedded python on mac - linking problem
Next: TypeError