From: Steve Holden on 5 Jan 2010 21:06 Tim Chase wrote: > vsoler wrote: >> Hence, I need to parse Excel formulas. Can I do it by means only of re >> (regular expressions)? >> >> I know that for simple formulas such as "=3*A7+5" it is indeed >> possible. What about complex for formulas that include functions, >> sheet names and possibly other *.xls files? > > Where things start getting ugly is when you have nested function calls, > such as > > =if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14, > (Min(C1:C25)+3)*18,Max(B1:B25))) > > Regular expressions don't do well with nested parens (especially > arbitrarily-nesting-depth such as are possible), so I'd suggest going > for a full-blown parsing solution like pyparsing. > > If you have fair control over what can be contained in the formulas and > you know they won't contain nested parens/functions, you might be able > to formulate some sort of "kinda, sorta, maybe parses some forms of > formulas" regexp. > And don't forget about named ranges, which can reference cells without using anything but a plain identifier ... regards Steve -- Steve Holden +1 571 484 6266 +1 800 494 3119 PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/ Holden Web LLC http://www.holdenweb.com/ UPCOMING EVENTS: http://holdenweb.eventbrite.com/
From: John Machin on 6 Jan 2010 19:53 On Jan 6, 6:54 am, vsoler <vicente.so...(a)gmail.com> wrote: > On 5 ene, 20:21, vsoler <vicente.so...(a)gmail.com> wrote: > > > > > On 5 ene, 20:05, Mensanator <mensana...(a)aol.com> wrote: > > > > On Jan 5, 12:35 pm, MRAB <pyt...(a)mrabarnett.plus.com> wrote: > > > > > vsoler wrote: > > > > > Hello, > > > > > > I am acessing an Excel file by means of Win 32 COM technology. > > > > > For a given cell, I am able to read its formula. I want to make a map > > > > > of how cells reference one another, how different sheets reference one > > > > > another, how workbooks reference one another, etc. > > > > > > Hence, I need to parse Excel formulas. Can I do it by means only of re > > > > > (regular expressions)? > > > > > > I know that for simple formulas such as "=3*A7+5" it is indeed > > > > > possible. What about complex for formulas that include functions, > > > > > sheet names and possibly other *.xls files? > > > > > > For example "=Book1!A5+8" should be parsed into ["=","Book1", "!", > > > > > "A5","+","8"] > > > > > > Can anybody help? Any suggestions? > > > > > Do you mean "how" or do you really mean "whether", ie, get a list of the > > > > other cells that are referred to by a certain cell, for example, > > > > "=3*A7+5" should give ["A7"] and "=Book1!A5+8" should give ["Book1!A5] > > > > Ok, although "Book1" would be the default name of a workbook, with > > > default > > > worksheets labeled "Sheet1". "Sheet2", etc. > > > > If I had a worksheet named "Sheety" that wanted to reference a cell on > > > "Sheetx" > > > OF THE SAME WORKBOOK, it would be =Sheet2!A7. If the reference was to > > > a completely > > > different workbook (say Book1 with worksheets labeled "Sheet1", > > > "Sheet2") then > > > the cell might have =[Book1]Sheet1!A7. > > > > And don't forget the $'s! You may see =[Book1]Sheet1!$A$7. > > > Yes, Mensanator, but... what re should I use? I'm looking for the re > > statement. No doubt you can help! > > > Thank you. > > Let me give you an example: > > >>> import re > >>> re.split("([^0-9])", "123+456*/") > > [123, +, 456, *, , /, ] > > I find it excellent that one single statement is able to do a lexical > analysis of an expression! That is NOT lexical analysis. > > If the expression contains variables, such as A12 or B9, I can try > another re expression. Which one should I use? > > And if my expression contains parenthesis? And the sin() function? You need a proper lexical analysis, followed by a parser. What you are trying to do can NOT be accomplished in any generality with a single regex. The Excel formula syntax has several tricky bits. E.g. IIRC whether TAX09 is a (macro) name or a cell reference depends on what version of Excel you are targetting but if it appears like TAX09! A1:B2 then it's a sheet name. The xlwt package (of which I am the maintainer) has a lexer and parser for a largish subset of the syntax ... see http://pypi.python.org/pypi/xlwt
From: Chris Withers on 12 Jan 2010 02:26 John Machin wrote: > The xlwt package (of which I am the maintainer) has a lexer and parser > for a largish subset of the syntax ... see http://pypi.python.org/pypi/xlwt xlrd, no? Also worth pointing out that the topic of Python and Excel has its own web site: http://www.python-excel.org ....which links to a more specialist group. Chris -- Simplistix - Content Management, Batch Processing & Python Consulting - http://www.simplistix.co.uk
From: John Machin on 12 Jan 2010 07:35 On 12/01/2010 6:26 PM, Chris Withers wrote: > John Machin wrote: >> The xlwt package (of which I am the maintainer) has a lexer and parser >> for a largish subset of the syntax ... see >> http://pypi.python.org/pypi/xlwt > > xlrd, no? A facility in xlrd to decompile Excel formula bytecode into a text formula is currently *under discussion*. The OP was planning to dig the formula text out using COM then parse the formula text looking for cell references and appeared to have a rather simplistic view of the ease of parsing Excel formula text -- that's why I pointed him at those facilities (existing, released, proven in the field) in xlwt.
From: Paul McGuire on 13 Jan 2010 03:15
On Jan 5, 1:49 pm, Tim Chase <python.l...(a)tim.thechases.com> wrote: > vsoler wrote: > > Hence, I need toparseExcel formulas. Can I do it by means only of re > > (regular expressions)? > > > I know that for simple formulas such as "=3*A7+5" it is indeed > > possible. What about complex for formulas that include functions, > > sheet names and possibly other *.xls files? > > Where things start getting ugly is when you have nested function > calls, such as > > =if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14, > (Min(C1:C25)+3)*18,Max(B1:B25))) > > Regular expressions don't do well with nested parens (especially > arbitrarily-nesting-depth such as are possible), so I'd suggest > going for a full-blown parsing solution like pyparsing. > > If you have fair control over what can be contained in the > formulas and you know they won't contain nested parens/functions, > you might be able to formulate some sort of "kinda, sorta, maybe > parses some forms of formulas" regexp. > > -tkc This might give the OP a running start: from pyparsing import (CaselessKeyword, Suppress, Word, alphas, alphanums, nums, Optional, Group, oneOf, Forward, Regex, operatorPrecedence, opAssoc, dblQuotedString) test1 = "=3*A7+5" test2 = "=3*Sheet1!$A$7+5" test3 = "=if(Sum(A1:A25)>42,Min(B1:B25), " \ "if(Sum(C1:C25)>3.14, (Min(C1:C25)+3)*18,Max(B1:B25)))" EQ,EXCL,LPAR,RPAR,COLON,COMMA,DOLLAR = map(Suppress, '=!():,$') sheetRef = Word(alphas, alphanums) colRef = Optional(DOLLAR) + Word(alphas,max=2) rowRef = Optional(DOLLAR) + Word(nums) cellRef = Group(Optional(sheetRef + EXCL)("sheet") + colRef("col") + rowRef("row")) cellRange = (Group(cellRef("start") + COLON + cellRef("end")) ("range") | cellRef ) expr = Forward() COMPARISON_OP = oneOf("< = > >= <= != <>") condExpr = expr + COMPARISON_OP + expr ifFunc = (CaselessKeyword("if") + LPAR + Group(condExpr)("condition") + COMMA + expr("if_true") + COMMA + expr("if_false") + RPAR) statFunc = lambda name : CaselessKeyword(name) + LPAR + cellRange + RPAR sumFunc = statFunc("sum") minFunc = statFunc("min") maxFunc = statFunc("max") aveFunc = statFunc("ave") funcCall = ifFunc | sumFunc | minFunc | maxFunc | aveFunc multOp = oneOf("* /") addOp = oneOf("+ -") numericLiteral = Regex(r"\-?\d+(\.\d+)?") operand = numericLiteral | funcCall | cellRange | cellRef arithExpr = operatorPrecedence(operand, [ (multOp, 2, opAssoc.LEFT), (addOp, 2, opAssoc.LEFT), ]) textOperand = dblQuotedString | cellRef textExpr = operatorPrecedence(textOperand, [ ('&', 2, opAssoc.LEFT), ]) expr << (arithExpr | textExpr) import pprint for test in (test1,test2, test3): print test pprint.pprint( (EQ + expr).parseString(test).asList() ) Prints: =3*A7+5 [[['3', '*', ['A', '7']], '+', '5']] =3*Sheet1!$A$7+5 [[['3', '*', ['Sheet1', 'A', '7']], '+', '5']] =if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14, (Min(C1:C25)+3) *18,Max(B1:B25))) ['if', ['sum', [['A', '1'], ['A', '25']], '>', '42'], 'min', [['B', '1'], ['B', '25']], 'if', ['sum', [['C', '1'], ['C', '25']], '>', '3.14'], [['min', [['C', '1'], ['C', '25']], '+', '3'], '*', '18'], 'max', [['B', '1'], ['B', '25']]] -- Paul |