From: Collins on 7 Jun 2010 15:28

Hello List

I am relatively new to Ruby.  I have set myself the problem of writing
a lexical analyzer in Ruby to learn some of its capabilities.  I have
pasted the code for that class and for the calling test harness below.
I beg the list's indulgence in several ways:

1) Has this problem already been solved in a "gem"?  I'd love to see
how a more sophisticated Rubyist solves it.
2) There is object manipulation with which I'm still not comfortable.
In particular, the buffer manipulation code in the method analyze makes
me unhappy, and I'd be happy to receive instruction in a better way to
do it.
3) Every language has its idioms.  I'm not at all sure that I'm using
the best or most "Ruby-like" way of doing certain things.  Again I
welcome suggestions.

Thanks in advance

Collins

##### code snippet 1 ######

class Rule
  attr_reader :tokID, :re
  def initialize(_tokID, _re)
    @tokID = _tokID
    @re = _re
  end

  def to_s
    self.class.name + ": " + @tokID.to_s + "::= " + @re.to_s
  end
end

class Match
  attr_reader :rule, :lexeme
  def initialize(_r, _s)
    @rule = _r
    @lexeme = _s
  end

  def to_s
    self.class.name + ": " + @rule.to_s + "\nmatches: " + @lexeme.to_s
  end
end

class Lexer
  attr_reader :errString
  # keep a collection of regular expressions and values to return as token
  # types
  # then match text to the longest substring yet seen
  def initialize
    @rules = Array.new
    @buff = String.new
    @aFile = nil
    @errString = nil
  end

  def addToken(tokID, re)
    if re.class.name == "String"
      @rules << Rule.new(tokID, Regexp.new(re))
    elsif re.class.name == "Regexp"
      @rules << Rule.new(tokID, re)
    else
      print "unsupported type in addToken: ", re.class.name, "\n"
    end
  end

  def findMatch
    maxLexeme, maxMatch = String.new, nil
    matchCount, rule2 = 0, nil
    @rules.each { |rule|
      # loop invariant:
      #   maxLexeme contains the longest matching prefix of @buff found so far,
      #   matchCount contains the number of rules that have matched maxLexeme,
      #   maxMatch contains the proposed return value
      #   rule2 contains a subsequent rule that matches maxLexeme
      #
      # if rule matches from beginning of @buff AND
      #    does not match all of @buff AND
      #    match is longer than previous longest match
      # then update maxMatch and maxLexeme and matchCount and rule2
      #
      # but... we have to avoid matching and keep looking if we make it to the
      # end of @buff with a match active (it could still collect more
      # characters) OR if more than one match is still active.  If the end of
      # the @buff is also the end of the file then it's ok to match to the end
      #
      # TODO: think about prepending an anchor to the regexp to eliminate the
      # false matches (those not at the beginning of @buff)
      #
      md = rule.re.match(@buff)
      if !md.nil? && md.pre_match.length == 0
        if md[0].length == @buff.length && !@aFile.eof?
          # @buff is potentially ambiguous and there is more file to parse
          return nil
        elsif md[0].length > maxLexeme.length
          # either matching less than whole buffer or at eof AND
          # match is longer than any prior match
          matchCount, rule2 = 1, nil
          maxLexeme, maxMatch = md[0], Match.new(rule, md[0])
        elsif md[0].length == maxLexeme.length
          # a subsequent match of equal length has been found
          matchCount += 1
          rule2 = rule
        else
          # short match... skip
        end
      else
        # either rule did not match @buff OR
        # rule did not match the start of @buff
      end
    }
    if !maxMatch.nil? && matchCount == 1
      # return an unambiguous match
      return maxMatch
    elsif !maxMatch.nil? && matchCount > 1
      print "ambiguous: ", maxLexeme, " : ", maxMatch.rule.to_s, " : ",
            rule2.to_s, "\n"
      return nil
    else
      # no match was found
      return nil
    end
  end

  def analyze
    aMatch = findMatch
    if !aMatch.nil?
      # remove matched text from buff
      oldBuff = String.new(@buff)
      newBuff = @buff[aMatch.lexeme.length, @buff.length - 1]
      if oldBuff != aMatch.lexeme + newBuff
        puts oldBuff
        puts "compare failure!"
        puts aMatch.lexeme + newBuff
      end
      @buff = newBuff
    end
    return aMatch
  end

  def parseFile(_name)
    @fileName = _name
    @aFile = File.new(@fileName, "r")
    @aFile.each { |line|
      # add lines from file to @buff... after each addition yield as many
      # tokens as possible
      @buff += line
      # consume all the tokens from @buff that can be found... when no more
      # can be found analyze will return nil... so we'll get another line
      aMatch = analyze
      while !aMatch.nil?
        # deliver one <token, lexeme> pair at a time to caller...
        # by convention a nil tokID is one about which the caller does not
        # care to hear...
        yield aMatch.rule.tokID, aMatch.lexeme if !aMatch.rule.tokID.nil?
        aMatch = analyze
      end
    }
    # @buff contains the earliest unmatched text... if @buff is not empty
    # when we finish with the file, this is an error
    if !@buff.empty?
      @errString = "error: unmatched text:\n" + @buff[0, [80, @buff.length].min]
      return false
    else
      @errString = "no errors detected\n"
      return true
    end
  end
end

##### code snippet 2 ######

WhiteSpaceToken = 0
CommentToken = 1
QuotedStringToken = 2
WordToken = 3

require "lexer"
l = Lexer.new
l.addToken(nil, Regexp.new("\\s+", Regexp::MULTILINE))
l.addToken(nil, Regexp.new("#.*[\\n\\r]+"))
#l.addToken(QuotedStringToken, Regexp.new('["][^"]*["]', Regexp::MULTILINE))
l.addToken(QuotedStringToken, '["]((\\\")|[^\\\"])*"')
l.addToken(WordToken, Regexp.new("\\w+"))
foo = l.parseFile("testFile1") { |token, lexeme|
  print token.to_s + ":" + lexeme.to_s + "\n"
}
if foo
  print "pass!\n"
else
  print "fail: " + l.errString + "\n"
end
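As an aside, Ruby's standard library ships StringScanner (strscan), which makes a longest-match tokenizer like the one above quite compact. Below is a minimal sketch of the same "try every rule, keep the longest match" idea; the rule table and the names `RULES` and `tokenize` are illustrative, not from the post:

```ruby
# Longest-match tokenizer built on the stdlib StringScanner.
# The rule table below is illustrative.
require 'strscan'

RULES = [
  [:whitespace, /\s+/],
  [:comment,    /#.*/],
  [:quoted,     /"(?:\\.|[^"\\])*"/],
  [:word,       /\w+/]
]

def tokenize(text)
  s = StringScanner.new(text)
  tokens = []
  until s.eos?
    tok = lexeme = nil
    RULES.each do |id, re|
      m = s.check(re)                 # match at current position without advancing
      if m && (lexeme.nil? || m.length > lexeme.length)
        tok, lexeme = id, m           # keep the longest match seen so far
      end
    end
    raise "unmatched text: #{s.rest[0, 20].inspect}" unless lexeme
    s.pos += lexeme.length            # consume the winning lexeme
    tokens << [tok, lexeme] unless tok == :whitespace
  end
  tokens
end
```

Because StringScanner anchors `check` at the current scan position, there is no need for the `pre_match.length == 0` test used in findMatch.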
From: Robert Klemme on 7 Jun 2010 16:05

On 07.06.2010 21:28, Collins wrote:
> Hello List
>
> I am relatively new to Ruby.  I have set myself the problem of writing
> a lexical analyzer in Ruby to learn some of its capabilities.  I have
> pasted the code for that class and for the calling test harness
> below.  I beg the list's indulgence in several ways
>
> 1) Has this problem already been solved in a "gem"?  I'd love to see
> how a more sophisticated Rubyist solves it

There are certainly parser and lexer generators for Ruby.  I cannot
remember one off the top of my head but you'll likely find one in RAA:

http://raa.ruby-lang.org/search.rhtml?search=lexer
http://raa.ruby-lang.org/search.rhtml?search=parser

> 2) There is object manipulation with which I'm still not comfortable.
> In particular, the buffer manipulation code in the method analyze
> makes me unhappy and I'd be happy to receive instruction in a better
> way to do it
> 3) Every language has its idioms.  I'm not at all sure that I'm using
> the best or most "Ruby-like" way of doing certain things.  Again I
> welcome suggestions.
>
> Thanks in advance
>
> Collins
>
> ##### code snippet 1 ######
> class Rule
>   attr_reader :tokID, :re
>   def initialize(_tokID, _re)
>     @tokID = _tokID
>     @re = _re
>   end
>
>   def to_s
>     self.class.name + ": " + @tokID.to_s + "::= " + @re.to_s
>   end
> end

There are several things in the code above: we use tok_id instead of
tokID for method and instance variable names.  Only classes use
CamelCase.  Also, it is highly uncommon to start identifiers with an
underscore.

An alternative way to create the String would be

def to_s
  "#{self.class.name}: #{@tok_id}::= #{@re}"
end

String interpolation implicitly applies #to_s.
Finally, you can define that class in one line:

Rule = Struct.new :tok_id, :re

> class Match
>   attr_reader :rule, :lexeme
>   def initialize(_r, _s)
>     @rule = _r
>     @lexeme = _s
>   end
>
>   def to_s
>     self.class.name + ": " + @rule.to_s + "\nmatches: " + @lexeme.to_s
>   end
> end
>
> class Lexer
>   attr_reader :errString
>   # keep a collection of regular expressions and values to return as
>   # token types
>   # then match text to the longest substring yet seen
>   def initialize
>     @rules = Array.new

If you are lazy you can as well do

@rules = []

>     @buff = String.new

We usually simply do

@buff = ""

This also creates a new empty String and is easier to spot.

>     @aFile = nil
>     @errString = nil
>   end
>
>   def addToken(tokID, re)
>     if re.class.name == "String"
>       @rules << Rule.new(tokID, Regexp.new(re))
>     elsif re.class.name == "Regexp"
>       @rules << Rule.new(tokID, re)
>     else
>       print "unsupported type in addToken: ", re.class.name, "\n"
>     end
>   end

def add_token(tok_id, re)
  @rules << Rule.new(tok_id,
                     case re
                     when String
                       Regexp.new(re)
                     when Regexp
                       re
                     else
                       raise ArgumentError, "Neither String nor Regexp"
                     end)
end

"case" works by using the #=== operator, which happens to be defined for
classes as a kind_of? check.  It's also safer to work with class
instances than with class names.

In error cases we raise exceptions and let someone up the call chain
decide how to deal with the error.  In your case you just get output on
the console, which might not be appropriate in all cases.
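For illustration, the Struct and case/when suggestions above can be combined into one self-contained sketch; the helper name `coerce_to_regexp` is an assumption, not from the thread:

```ruby
# Struct gives us the Rule class in one line; case/when dispatches on class
# via #===.  coerce_to_regexp is a hypothetical helper name.
Rule = Struct.new(:tok_id, :re)

def coerce_to_regexp(re)
  case re                          # String === re, Regexp === re, i.e. kind_of?
  when String then Regexp.new(re)
  when Regexp then re
  else raise ArgumentError, "Neither String nor Regexp: #{re.class}"
  end
end
```

A caller would then write `@rules << Rule.new(tok_id, coerce_to_regexp(re))`, and an unsupported argument raises instead of printing to the console.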
> def findMatch
>   maxLexeme, maxMatch = String.new, nil
>   matchCount, rule2 = 0, nil
>   @rules.each { |rule|
>     # loop invariant:
>     #   maxLexeme contains the longest matching prefix of @buff found so far,
>     #   matchCount contains the number of rules that have matched maxLexeme,
>     #   maxMatch contains the proposed return value
>     #   rule2 contains a subsequent rule that matches maxLexeme
>     #
>     # if rule matches from beginning of @buff AND
>     #    does not match all of @buff AND
>     #    match is longer than previous longest match
>     # then update maxMatch and maxLexeme and matchCount and rule2
>     #
>     # but... we have to avoid matching and keep looking if we make it to the
>     # end of @buff with a match active (it could still collect more
>     # characters) OR if more than one match is still active.  If the end of
>     # the @buff is also the end of the file then it's ok to match to the end
>     #
>     # TODO: think about prepending an anchor to the regexp to eliminate the
>     # false matches (those not at the beginning of @buff)
>     #
>     md = rule.re.match(@buff)
>     if !md.nil? && md.pre_match.length == 0
>       if md[0].length == @buff.length && !@aFile.eof?
>         # @buff is potentially ambiguous and there is more file to parse
>         return nil
>       elsif md[0].length > maxLexeme.length
>         # either matching less than whole buffer or at eof AND
>         # match is longer than any prior match
>         matchCount, rule2 = 1, nil
>         maxLexeme, maxMatch = md[0], Match.new(rule, md[0])
>       elsif md[0].length == maxLexeme.length
>         # a subsequent match of equal length has been found
>         matchCount += 1
>         rule2 = rule
>       else
>         # short match... skip
>       end
>     else
>       # either rule did not match @buff OR
>       # rule did not match the start of @buff
>     end
>   }
>   if !maxMatch.nil? && matchCount == 1
>     # return an unambiguous match
>     return maxMatch
>   elsif !maxMatch.nil? && matchCount > 1
>     print "ambiguous: ", maxLexeme, " : ", maxMatch.rule.to_s, " : ",
>           rule2.to_s, "\n"
>     return nil
>   else
>     # no match was found
>     return nil
>   end
> end

Somehow this method seems a bit lengthy.  I did not look too deeply into
the details but I'd probably pick a different strategy.  First, I'd
anchor expressions (like you suggested in your comment).  Then I'd just
do

matches = {}
@rules.each do |rule|
  m = rule.re.match(@buff) and matches[rule] = m
end
matches

Now you know that

case matches.size
when 0
  # nothing matches any more, take last match and strip buffer
when 1
  # single match, remember as last match
else
  # many matches, continue
end

If you place that in a loop that adds a character to the buffer at a
time and then invokes find_match, you can do the evaluation like
indicated above.

> def analyze
>   aMatch = findMatch
>   if !aMatch.nil?
>     # remove matched text from buff
>     oldBuff = String.new(@buff)
>     newBuff = @buff[aMatch.lexeme.length, @buff.length - 1]
>     if oldBuff != aMatch.lexeme + newBuff
>       puts oldBuff
>       puts "compare failure!"
>       puts aMatch.lexeme + newBuff
>     end
>     @buff = newBuff
>   end
>   return aMatch
> end
>
> def parseFile(_name)
>   @fileName = _name
>   @aFile = File.new(@fileName, "r")
>   @aFile.each { |line|

Better use the block form of File.open() or use File.foreach.  That way
you can be sure that the file handle is always properly closed.  See

http://blog.rubybestpractices.com/posts/rklemme/001-Using_blocks_for_Robustness.html

I'd also choose a different reading strategy - one character at a time
or a fixed buffer width.  But I would not read lines.

>     # add lines from file to @buff... after each addition yield as many
>     # tokens as possible
>     @buff += line

@buff << line

is more efficient.
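On the anchoring idea raised above (the TODO in findMatch and the "I'd anchor expressions" remark): a pattern wrapped in \A(?:...) can only match at the very start of a string, which makes the `pre_match.length == 0` check unnecessary. A small sketch with a hypothetical `anchor` helper:

```ruby
# Hypothetical helper: rewrite a rule's pattern so it is anchored to the
# start of the string, preserving the original pattern's options.
def anchor(re)
  Regexp.new('\A(?:' + re.source + ')', re.options)
end
```

With anchored rules, `anchor(re).match(buff)` is nil unless the match starts at position zero, so false mid-buffer matches never need to be filtered out.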
>     # consume all the tokens from @buff that can be found... when no more
>     # can be found analyze will return nil... so we'll get another line
>     aMatch = analyze
>     while !aMatch.nil?
>       # deliver one <token, lexeme> pair at a time to caller...
>       # by convention a nil tokID is one about which the caller does not
>       # care to hear...
>       yield aMatch.rule.tokID, aMatch.lexeme if !aMatch.rule.tokID.nil?
>       aMatch = analyze
>     end
>   }
>   # @buff contains the earliest unmatched text... if @buff is not empty
>   # when we finish with the file, this is an error
>   if !@buff.empty?
>     @errString = "error: unmatched text:\n" + @buff[0, [80, @buff.length].min]
>     return false
>   else
>     @errString = "no errors detected\n"
>     return true
>   end
> end
> end
>
> ##### code snippet 2 ######
>
> WhiteSpaceToken = 0
> CommentToken = 1
> QuotedStringToken = 2
> WordToken = 3

I would rather use Symbols as token keys (names).  They are similarly
efficient but make the constant definitions superfluous.

> require "lexer"
> l = Lexer.new
> l.addToken(nil, Regexp.new("\\s+", Regexp::MULTILINE))
> l.addToken(nil, Regexp.new("#.*[\\n\\r]+"))
> #l.addToken(QuotedStringToken, Regexp.new('["][^"]*["]', Regexp::MULTILINE))
> l.addToken(QuotedStringToken, '["]((\\\")|[^\\\"])*"')
> l.addToken(WordToken, Regexp.new("\\w+"))
> foo = l.parseFile("testFile1") { |token, lexeme|
>   print token.to_s + ":" + lexeme.to_s + "\n"
> }
> if foo
>   print "pass!\n"
> else
>   print "fail: " + l.errString + "\n"
> end

There are of course completely different strategies to tackle this.
Lexers are usually built as a DFA or NFA (like every Regexp
implementation uses internally).  You would then feed a character at a
time to the FA and derive token types from states.

Also, another option would be to lump all tokens into a single regular
expression with group matches for every token type and analyze the
matching groups, e.g.

re = %r{
    (\s+)                  # white space
  | ("(?:\\.|[^"\\])*")    # quoted string
}x

File.read(file_name).scan(re) do |m|
  case
  when $1
    printf "whitespace %p\n", $1
  when $2
    printf "quote %p\n", $2
  else
    raise "Cannot be, scanner error"
  end
end

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
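The single-regexp scan idea above can also be written with named captures, which avoids keeping track of $1/$2 positions. A self-contained sketch with illustrative token names (`RE` and `tokens` are not from the thread):

```ruby
# One alternation per token type, each in a named group.  Extended (/x) mode
# allows whitespace and comments inside the pattern.
RE = %r{
    (?<space>\s+)                 # white space
  | (?<quote>"(?:\\.|[^"\\])*")   # quoted string, \-escapes allowed
  | (?<word>\w+)                  # bare word
}x

def tokens(text)
  result = []
  text.scan(RE) do
    m = Regexp.last_match          # MatchData for the current scan hit
    result << if m[:quote] then [:quote, m[:quote]]
              elsif m[:word] then [:word, m[:word]]
              else [:space, m[:space]]
              end
  end
  result
end
```

Exactly one named group is non-nil per match, so the branch order inside the block does not matter.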
From: Rein Henrichs on 7 Jun 2010 16:19

A work-in-progress Ruby Style Guide is available here:
http://github.com/chneukirchen/styleguide/blob/master/RUBY-STYLE

A few suggestions from your code samples (some are Ruby-specific, some
are general "clean code" suggestions):

1. Use snake_case for variable and method names, not camelCase.  Use
SHOUTING_SNAKE_CASE for constants, CamelCase for classes and modules.

2. Don't needlessly shorten variable names (like @tokID).  The goal is
not brevity but clarity.

3. Do not start variable names with underscores.

4. Use string interpolation ( "#{self.class.name}: #{@token_id}" )
rather than concatenation via +.

5. Use puts instead of print with an "\n"; see also #4.

6. Use literals where available: [] instead of Array.new, "" instead of
String.new, {} instead of Hash.new.

7. Take advantage of Ruby's sane conditional evaluation semantics.
Instead of `if !foo.nil?`, simply `if foo`.

8. Do not put extensive conditional logic in #each blocks.  Rather,
extract such logic into a method with an intention-revealing name.

9. Take advantage of Enumerable methods that are more specific to your
needs than #each.  For instance, the findMatch method might use #detect
instead of #each.

10. Variables do not need to be initialized.  You assign nil to a number
of variables and then later reassign them.  This is unnecessary.

11. Use do/end for multiline blocks and {} for single-line blocks.

12. Write query methods to wrap complex truthiness expressions in
intention-revealing methods.  For instance:

md = rule.re.match(@buff)
if !md.nil? && md.pre_match.length == 0

could be changed to:

# Define query method
class Rule
  def matches?(buffer)
    md = re.match(buffer)
    md && md.pre_match.empty?
  end
end

# Use query method
if rule.matches?(@buff)

Note that .empty? is often a more expressive replacement for
.length == 0.

13. Use a_string.dup instead of String.new(a_string).

14. Alternatively, use non-destructive methods to avoid mutable state
concerns.

15. Do not put a space between a method and its parenthesized arguments.

Good: foo(bar)
Bad:  foo (bar)
Also good (unless clarity suffers): foo bar

16. Do not use parallel assignment ( foo, bar = 1, 2 ) except where
doing so results in a clear improvement in the clarity (rather than
brevity) of your code.

Finally, there is prior art that may interest you.  Ruby parser/lexers
include:

racc
treetop
ripper
grammar

All are available as gems and are probably available on GitHub as well.

--
Rein Henrichs
http://puppetlabs.com
http://reinh.com
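The query-method and #detect suggestions above fit together naturally: once Rule can answer "do you match at the start of this buffer?", finding the first such rule is a one-liner. A sketch with illustrative names (`match_at_start`, `first_matching_rule` are assumptions):

```ruby
# A query method on Rule plus Enumerable#detect.  Struct's block form lets
# us define methods on the generated class.
Rule = Struct.new(:tok_id, :re) do
  # MatchData if the pattern matches at the very start of buffer, else nil.
  def match_at_start(buffer)
    md = re.match(buffer)
    md if md && md.pre_match.empty?
  end
end

RULES = [Rule.new(:word, /\w+/), Rule.new(:space, /\s+/)]

# First rule whose pattern matches at the buffer's start, or nil.
def first_matching_rule(buffer)
  RULES.detect { |r| r.match_at_start(buffer) }
end
```

Note this returns the first matching rule, not the longest match; a longest-match lexer would still need a max_by-style pass over all matching rules.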
From: Josh Cheek on 8 Jun 2010 04:16

[Note: parts of this message were removed to make it a legal post.]

> Finally, there is prior art that may interest you.  Ruby parser/lexers
> include:
>
> racc
> treetop
> ripper
> grammar

If you're interested in Treetop, here is a video of it being presented
at a conference (it has a nice live-coding section).  I tried playing
with it a while back, and with a lot of effort was able to get past the
basics.  But I think the wall I ultimately ran into has to do with
understanding what to do with the result once it is parsed.  Maybe I'll
try to get into a compiler class next semester.  Also, the docs could be
a lot better.
From: Brian Candler on 8 Jun 2010 04:29

Robert Klemme wrote:
> def add_token(tok_id, re)
>   @rules << Rule.new(tok_id,
>                      case re
>                      when String
>                        Regexp.new(re)
>                      when Regexp
>                        re
>                      else
>                        raise ArgumentError, "Neither String nor Regexp"
>                      end)
> end

or even just this:

def add_token(tok_id, re)
  @rules << Rule.new(tok_id, Regexp.new(re))
end

That's not exactly the same, since Regexp.new(a_regexp) does actually
create a new Regexp object, but AFAICT it's a functionally identical
regexp.

>> r1 = /foo/i
=> /foo/i
>> r2 = Regexp.new(r1)
=> /foo/i
>> r1 == r2
=> true
>> r1.equal?(r2)
=> false

--
Posted via http://www.ruby-forum.com/.