From: Cs Webgrl on 30 Jun 2010 09:32

Hello,

I am scraping quite a bit of data and I would like to make sure that I'm following best practices for string manipulation. I would like to take into account any speed and garbage collection issues.

Does anyone know of any posts, websites, books or other resources that provide "do this, not that" types of guidance?

For example, my understanding is that globbing everything into one line when manipulating a string is not the best use of resources.

not good

    "string+var".gsub('+','').strip.capitalize

better

    s = "string+var"
    s.gsub('+','')
    s.strip!
    s.capitalize
    s => 'String Var'

Are there resources that explain why one is better than the other and that also provide more best practices like this?

Thanks.
--
Posted via http://www.ruby-forum.com/.
From: Peter Hickman on 30 Jun 2010 09:45

Personally, I don't think doing things on one line is a sin in itself, only when it is overdone. What counts as overdone depends on your reading ability.

Splitting things onto individual lines allows you to insert logging at various points without fear of breaking the code, which the one-line approach does not. However, the multiline approach can make an insignificant part of the code take up lots of screen real estate, which can make the larger code harder to read.

For example

    x.downcase.gsub(/\s+/, ' ').strip.capitalize

is a fairly easy-to-read cleanup of a string, but if it goes multiline

    x.downcase!
    x.gsub!(/\s+/, ' ')
    x.strip!
    x.capitalize!

not only does it take up more of the screen, but it has also altered x, something that the single-line version did not.

Of course, if things get really silly you could just create a function and stuff all the code in there.
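
As a rough sketch of the "stuff it in a function" suggestion: the chain is the one from the post, while the method name clean_up and the sample input are made up for illustration. Because only non-bang methods are used, the caller's string is left alone.

```ruby
# Hypothetical helper name; the method chain is the one quoted in the post.
def clean_up(text)
  # Every call here is non-bang, so each step returns a new string
  # and the argument itself is never modified.
  text.downcase.gsub(/\s+/, ' ').strip.capitalize
end

original = "  hELLO    wORLD  "
cleaned  = clean_up(original)

puts cleaned.inspect   # => "Hello world"
puts original.inspect  # => "  hELLO    wORLD  "
```

Logging or extra steps can then be added inside the method without touching any call site.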
From: Brian Candler on 30 Jun 2010 09:49

Cs Webgrl wrote:
> better
> s = "string+var"
> s.gsub('+','')
> s.strip!
> s.capitalize
> s => 'String Var'

(You need gsub! and capitalize! of course)

> Are there resources that explain why one is better than the other that
> also provides more best practices like this?

Methods like capitalize! work on the existing string buffer in memory. The non-bang methods create a whole new string, which involves the work of copying it and then later garbage-collecting the original.

Most of the non-bang methods are implemented as a dup followed by calling the bang method on the copy. They're written in C, but are effectively like this:

    class String
      def capitalize
        copy = dup
        copy.capitalize!   # may return nil if nothing changed, so don't return it directly
        copy
      end

      def capitalize!
        # scan the string and modify it in place
      end
    end

Of course, in most apps the original chained code you wrote will be just fine, and it's easy to write and understand. If you will be processing files which are hundreds of megabytes long then it may be worthwhile rewriting to the second form.

Other thoughts:

* for large files, process them in chunks or lines rather than reading them all in at once

* use block form when opening a file, to ensure it's closed as soon as you've finished with it

    File.open("/path/to/file","rb") do |f|
      f.each_line do |line|
        ...
      end
    end
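
The difference between the two forms can be checked from Ruby itself. This is only a small sketch using object identity (equal?), with made-up sample strings; it says nothing about how large the speed difference is in practice.

```ruby
# .dup is used so the example still works if string literals are frozen.
s = "hello world".dup

copy = s.capitalize          # non-bang: returns a new String object
puts copy.equal?(s)          # => false (a different object)
puts s                       # => hello world (receiver untouched)

result = s.capitalize!       # bang: modifies the receiver in place
puts result.equal?(s)        # => true (the same object is returned)
puts s                       # => Hello world

# The conceptual model from the post: non-bang behaves like dup + bang.
t = "another string".dup
model = t.dup
model.capitalize!
puts model == t.capitalize   # => true
```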
From: Cs Webgrl on 30 Jun 2010 10:07

Thanks so much for the help and guidance.

Most of my data is parsed from mechanize and broken into smaller chunks that will be manipulated to get the final format. From my understanding, I should be OK.

I definitely agree that the conciseness of fewer lines of code is easier to read. I just wanted to make sure that I'm not compromising speed or garbage collection for readability with these types of methods.
From: Josh Cheek on 30 Jun 2010 17:17

On Wed, Jun 30, 2010 at 8:32 AM, Cs Webgrl wrote:
> Does anyone know of any posts, websites, books or other resources that
> provide "do this, not that" types of guidance?

I don't know about a specific site, but if you do not need to keep the value of string, then string << var is better than string + var, since it mutates string rather than creating a new object. I once read benchmarks about this, but I can't remember where I read them, and I can't seem to recreate them, so maybe I am wrong.

    # plus returns a new String
    string, var = 'abc', 'def'
    string + var   # => "abcdef"
    string         # => "abc"

    # << mutates the receiver
    string << var  # => "abcdef"
    string         # => "abcdef"

You can use s.delete('+') instead of s.gsub('+','') and it will be faster, prettier, and more expressive.

I expect the reason you heard that it is better to do it on multiple lines is that it then lets you use the bang methods, which, for whatever reason, will return nil if they don't mutate the object. In general, it is faster to say s.capitalize! than s.capitalize, because with the bang version we mutate s itself, whereas with the non-bang version we create a new object and modify that. When we are not interested in keeping the original value of s, creating all these objects adds up.

    # capitalize returns the capitalized version regardless of the original string,
    # so you can use it in the middle of a method chain
    'Abc'.capitalize   # => "Abc"
    'abc'.capitalize   # => "Abc"

    # don't use capitalize! in the middle of a method chain because it can return nil
    'Abc'.capitalize!  # => nil
    'abc'.capitalize!  # => "Abc"

    # capitalize creates a new string, so it is less efficient if you don't care about the original
    # it also does not modify the receiver, so you have to capture its result
    s = 'abc'
    s.capitalize       # => "Abc"
    s                  # => "abc"

    # capitalize! mutates the original string, so it is more efficient if you don't care about the original
    # it does modify the receiver, so you don't have to capture its result
    # in fact, _don't_ capture its result, because as shown above, the result could be nil
    s = 'abc'
    s.capitalize!      # => "Abc"
    s                  # => "Abc"
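
Putting these points together, here is a minimal sketch of an in-place cleanup that never relies on the return value of a bang method. The helper name clean_in_place! and the sample inputs are hypothetical; delete!, strip! and capitalize! are the standard String methods. It checks object identity only and makes no claim about timings.

```ruby
# .dup is used so the example still works if string literals are frozen.
string = "abc".dup
var    = "def"

joined = string + var         # + builds a new String
puts joined.equal?(string)    # => false
puts string                   # => abc

appended = string << var      # << appends to the receiver and returns it
puts appended.equal?(string)  # => true
puts string                   # => abcdef

# For removing a literal character, delete and gsub give the same result.
puts "string+var".delete('+') == "string+var".gsub('+', '')  # => true

# Hypothetical helper: clean a string in place. Each bang method returns nil
# when it had nothing to change, so the return values are deliberately ignored.
def clean_in_place!(s)
  s.delete!('+')
  s.strip!
  s.capitalize!
  s   # always hand back the receiver, never a possible nil
end

puts clean_in_place!(" string+var ".dup)  # => Stringvar
puts clean_in_place!("Already".dup)       # => Already
```

The second call shows why the helper returns s explicitly: none of the three bang methods changes "Already", so each of them returns nil.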