From: ccc31807 on 13 Aug 2010 13:08

During the discussion of the 9-11 mosque in NYC, several commentators mentioned Milestones by Sayed Qutb. I decided to read it to see what the fuss was about, and ended up with an ASCII text copy generated from a PDF original. I could have printed the text directly, but it was pretty mangled, and after attempting and failing to reformat the document using vi, I decided to write a simple Perl script to reformat it.

I wanted to do several things: join paragraphs together (every line in the file was terminated by a "\n"), separate paragraphs by a blank line (block style), remove repeated periods (dots), remove form feeds (which marked pages in the original), etc.

I first attempted to munge the file line by line, like this:

    #FIRST ATTEMPT
    open MS, '<', $file;
    open OUT, '>', $out;
    while (<MS>)
    {
        #do stuff
        print OUT;
    }
    close MS;
    close OUT;

It mostly worked, but I couldn't fine-tune it. I then attempted to munge two lines together, like this:

    #SECOND ATTEMPT
    open MS, '<', $file;
    open OUT, '>', $out;
    $line1 = <MS>;
    while (<MS>)
    {
        $line2 = $_;
        #do stuff
        print OUT;
        $line1 = $line2;    # current line becomes the previous line
    }
    close MS;
    close OUT;

This worked a little better, but it wasn't perfect. I then tried this and got perfect formatting:

    #THIRD ATTEMPT
    {
        local $/ = undef;
        open MS, '<', $file;
        $document = <MS>;
        close MS;
    }
    #series of transformations like this
    $document =~ s/\r//g;
    open OUT, '>', $out;
    print OUT $document;
    close OUT;

All of the work I have done in the past has munged the lines one by one, as in the first example. Occasionally, I have had to use the second style (e.g., where the formatting of each line depends on the content of the preceding line). I've never used the third style at all.

I liked the third way a lot. It seemed quick, easy, and worked perfectly. I was actually able to open the resulting document in Word, fancify it a little, and print a nice finished copy. However, I can't think of any actual uses of the third style in my day-to-day work.

My question is this: Is the third attempt, slurping the entire document into memory and transforming the text by regexes, very common, or is it considered a last resort when nothing else would work?

CC.
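The whole-document cleanup described in this post might look roughly like the sketch below. The poster's actual transformations are not shown in the thread, so the sub name `reformat_text`, the sample text, and the rules themselves are assumptions for illustration. In particular, it assumes that blank lines in the source mark paragraph boundaries.

```perl
use strict;
use warnings;

# Hypothetical cleanup routine: takes the whole document as one string
# and returns the reformatted text. The rules are illustrative guesses.
sub reformat_text {
    my ($doc) = @_;

    $doc =~ s/\r//g;                     # drop every carriage return
    $doc =~ s/\f//g;                     # drop form feeds (page markers)
    $doc =~ s/\.{2,}/./g;                # collapse runs of dots (also flattens ellipses)
    $doc =~ s/[ \t]+\n/\n/g;             # strip trailing blanks on each line
    $doc =~ s/(?<!\n)\n(?!\n|\z)/ /g;    # join single line breaks inside a paragraph
    $doc =~ s/\n{3,}/\n\n/g;             # exactly one blank line between paragraphs
    return $doc;
}

my $sample = "First line of a\r\nparagraph....\n\n\n\fSecond paragraph\nends here.\n";
print reformat_text($sample);
```

Because every rule sees the whole text, a pattern can span line boundaries, which is what makes the paragraph-joining step a one-liner here.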
From: Uri Guttman on 13 Aug 2010 13:29

>>>>> "c" == ccc31807 <cartercc(a)gmail.com> writes:

  c> This worked a little better, but it wasn't perfect. I then tried this
  c> and got perfect formatting:

  c> #THIRD ATTEMPT
  c> {
  c>     local $/ = undef;
  c>     open MS, '<', $file;
  c>     $document = <MS>;
  c>     close MS;
  c> }

  c> All of the work I have done in the past has munged the lines one by
  c> one, as in the first example. Occasionally, I have had to use the
  c> second style (e.g., where the formatting of each line depends on the
  c> content of the preceding line.) I've never used the third style at
  c> all.

it isn't as common as it should be IMNSHO. in the old days reading files line by line was almost required due to small memory machines. today, megabyte files can be slurped without fear at all but line by line is still taught as standard. it takes time to change views.

  c> I liked the third way a lot. It seemed quick, easy, and worked
  c> perfectly. I was actually able to open the resulting document in
  c> Word, fancify it a little, and print a nice finished copy. However,
  c> I can't think of any actual uses of the third style in my day to
  c> day work.

parsing and text munging is much easier when the entire file is in ram. there is no need to mix i/o with logic, the i/o is much faster, you can send/receive whole documents to servers (which could format things or whatever), etc. slurping whole files makes a lot of sense in many areas.

  c> My question is this: Is the third attempt, slurping the entire
  c> document into memory and transforming the text by regexs, very common,
  c> or is it considered a last resort when nothing else would work?

it is not a last resort by any imagination today. and use File::Slurp instead for both reading and writing the file. it is cleaner and faster than the methods you used.

uri

-- 
Uri Guttman  ------  uri(a)stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------
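The point about not mixing I/O with logic can be illustrated with a minimal core-Perl sketch. The helper names `slurp` and `spew` are hypothetical stand-ins written for this example; with the File::Slurp module recommended above, its exported `read_file` and `write_file` functions would take their place. The temporary file exists only to keep the sketch self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file into one scalar.
sub slurp {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    local $/;                        # undef record separator: read to EOF
    my $text = <$in>;
    close $in;
    return $text;
}

# Hypothetical helper: write a scalar out in one call.
sub spew {
    my ($path, $text) = @_;
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";
    return;
}

# Demonstration on a temporary file.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "one\r\ntwo\r\n";
close $fh;

my $text = slurp($path);             # all input happens here
$text =~ s/\r//g;                    # pure in-memory transformation
spew($path, $text);                  # all output happens here
```

With the reading and writing confined to two calls, the transformation step in the middle can be tested on plain strings without touching the filesystem.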
From: Peter J. Holzer on 13 Aug 2010 14:14

On 2010-08-13 17:08, ccc31807 <cartercc(a)gmail.com> wrote:

[ 3 ways of munging a text file: line by line, pairs of lines, and whole file at once ]

> I liked the third way a lot. It seemed quick, easy, and worked
> perfectly. I was actually able to open the resulting document in Word,
> fancify it a little, and print a nice finished copy. However, I can't
> think of any actual uses of the third style in my day to day work.
>
> My question is this: Is the third attempt, slurping the entire
> document into memory and transforming the text by regexs, very common,
> or is it considered a last resort when nothing else would work?

Uri would probably tell you that's what you always should do unless the file is too big to fit into memory (and you should use File::Slurp for it) :-).

I do whatever allows the most straightforward implementation. Very often that means reading the whole data into memory, although not necessarily as a single scalar.

hp
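One way to hold the whole data in memory without using a single scalar is to read it as a list of paragraphs. The sketch below uses Perl's paragraph mode (`$/ = ""`) for that; the sample text and the in-memory filehandle are assumptions made to keep it self-contained, and it is not taken from any poster's actual code.

```perl
use strict;
use warnings;

# Sample input held in a string; a real script would open a file instead.
my $input = "alpha\nbeta\n\n\ngamma\ndelta\n";
open my $in, '<', \$input or die "Cannot open in-memory handle: $!";

my @paragraphs;
{
    local $/ = "";                   # paragraph mode: records end at blank lines
    @paragraphs = <$in>;             # whole file in memory, one element per paragraph
}
close $in;

for my $para (@paragraphs) {
    $para =~ s/\s+\z//;              # trim the trailing newlines of the record
    $para =~ s/\n/ /g;               # join the lines within one paragraph
}

print join("\n\n", @paragraphs), "\n";
```

Each array element can then be processed on its own, which suits cases where the logic works per paragraph rather than per line or per document.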
From: Peter J. Holzer on 13 Aug 2010 14:42

On 2010-08-13 18:14, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote:

> Uri would probably tell you [...]

I didn't see Uri's answer before I posted this. I swear! :-)

hp

From: Uri Guttman on 13 Aug 2010 14:48

>>>>> "PJH" == Peter J Holzer <hjp-usenet2(a)hjp.at> writes:

  PJH> On 2010-08-13 18:14, Peter J. Holzer <hjp-usenet2(a)hjp.at> wrote:
  >> Uri would probably tell you [...]

  PJH> I didn't see Uri's answer before I posted this. I swear! :-)

great minds. :)

uri