From: Willem on 17 May 2010 13:41

Uri Guttman wrote:
)>>>>> "W" == Willem <willem(a)turtle.stack.nl> writes:
)
)  W> Have you also tried removing 10 lines from a million-line file ?
)  W> And for giggles, you could try a hand-rolled one that uses the functions
)  W> seek(), sysread() and truncate() to accomplish the job.
)
) ahem. that is what file::readbackward does!

I know.

) it may be possible to hand
) roll optimize it by removing some overhead, etc. but it was designed to
) be very fast. your earlier point about how much to remove or skip is the
) important one. truncating most of a large file will be slower but you
) still need to count lines from the end. since you don't need to read
) each line for this you could read large blocks, scan for newlines and
) count them and then truncate to the desired point. readbackwards has the
) overhead of splitting the blocks into lines and returning each one for
) counting. but you always need to read the part you are truncating if you
) are counting lines from the end.

I just tried a hand-rolled sysseek/sysread/truncate version, and either
that overhead is very significant, or my machine is a lot faster.

> time perl trunc.pl tenmillion.txt 10000

real    0m0.013s
user    0m0.011s
sys     0m0.003s

That's removing ten thousand lines from a ten-million-line file that is
over 600 MB.

I'll try a ReadBackwards version now.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !

From: Willem on 17 May 2010 13:49

Willem wrote:
)> time perl trunc.pl tenmillion.txt 10000
)
) real    0m0.013s
) user    0m0.011s
) sys     0m0.003s
)
) That's removing ten thousand lines from a ten-million-line file that is
) over 600 MB.
)
) I'll try a ReadBackwards version now.

> time perl readb.pl million.txt 10000

real    0m0.036s
user    0m0.035s
sys     0m0.002s

Almost a factor of 3:1. Removing a hundred thousand lines holds the same
3:1 pattern.

readb.pl:

use warnings;
use strict;
use File::ReadBackwards;

my $filename          = shift;
my $Lines_to_truncate = shift;

my $bw = File::ReadBackwards->new( $filename )
    or die "Could not read backwards in [$filename]: $!";

my $lines_from_end = 0;
until ( $bw->eof or $lines_from_end == $Lines_to_truncate ) {
    my $line = $bw->readline;    # step back one line
    $lines_from_end++;
}
truncate( $filename, $bw->tell );

trunc.pl:

use warnings;
use strict;

my $blocksize = 4096;
my ($file, $lines) = @ARGV;

open (my $fh, '+<', $file)
    or die "Failed to open '$file' for r/w: $!";
# seek to EOF to get the file size
my $pos = sysseek($fh, 0, 2)
    or die "Failed to seek to EOF of '$file': $!";

# walk backwards through the file one block at a time
while (($pos -= ($pos - 1) % $blocksize + 1) >= 0) {
    sysseek($fh, $pos, 0)
        or die "Failed to seek backwards in '$file': $!";
    sysread($fh, my $block, $blocksize)
        or die "Failed to read from '$file': $!";
    # scan this block for newlines, last to first
    my $spos = rindex($block, "\n");
    while ($spos >= 0) {
        if (--$lines < 0) {
            truncate($fh, $pos + $spos)
                or die "Failed to truncate '$file': $!";
            exit(0);
        }
        $spos = rindex($block, "\n", $spos - 1);
    }
}
die "Not enough lines in file";

SaSW, Willem

From: Ralph Malph on 17 May 2010 13:54

On 5/17/2010 12:43 PM, Ralph Malph wrote:
[snip]
> $ time cat puke | wc -l | xargs echo -10000 + | bc \
>     | xargs echo head puke -n | sh > top_n-10000
>
> real    0m0.312s
> user    0m0.090s
> sys     0m0.121s

A non-pipelined shell script does better:

$ time ./temp.sh > top_n-10000

real    0m0.266s
user    0m0.015s
sys     0m0.076s

------------ temp.sh ------------
lines=`wc -l < puke`
let num_lines=$(($lines-10000))
head puke -n $num_lines

From: Uri Guttman on 17 May 2010 13:55

>>>>> "W" == Willem <willem(a)turtle.stack.nl> writes:

  W> Uri Guttman wrote:
  W> ) it may be possible to hand
  W> ) roll optimize it by removing some overhead, etc. but it was designed to
  W> ) be very fast. your earlier point about how much to remove or skip is the
  W> ) important one. truncating most of a large file will be slower but you
  W> ) still need to count lines from the end. since you don't need to read
  W> ) each line for this you could read large blocks, scan for newlines and
  W> ) count them and then truncate to the desired point. readbackwards has the
  W> ) overhead of splitting the blocks into lines and returning each one for
  W> ) counting. but you always need to read the part you are truncating if you
  W> ) are counting lines from the end.

  W> I just tried a hand-rolled sysseek/sysread/truncate version, and either
  W> that overhead is very significant, or my machine is a lot faster.

  W> I'll try a ReadBackwards version now.

did you try my suggested algorithm? it isn't too much work reading large
blocks from the end, counting newlines and then doing a truncate at the
desired point. i see it at about 30 lines of code or so.

uri

--
Uri Guttman  ------  uri(a)stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review , Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------

From: Uri Guttman on 17 May 2010 13:58

>>>>> "W" == Willem <willem(a)turtle.stack.nl> writes:

  W> my $spos = rindex($block, "\n");

ahh, here is your bottleneck. use tr/// to count the newlines of each
block. if you haven't read enough then read another. you don't need to
use rindex for each newline. also when you find the block which has the
desired ending, you can use a forward regex or something else to find
the nth newline in one call. perl is slow doing ops in a loop but fast
doing loops internally. so always use perl ops which do more work for
you.

  W>     my $spos = rindex($block, "\n");
  W>     while ($spos >= 0) {
  W>         if (--$lines < 0) {
  W>             truncate($fh, $pos + $spos)
  W>                 or die "Failed to truncate '$file': $!";
  W>             exit(0);
  W>         }
  W>         $spos = rindex($block, "\n", $spos - 1);

that is a slow perl loop calling rindex over and over.

uri

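A rough sketch (not code from the thread) of what Uri describes: count each
block's newlines with a single tr///, skip whole blocks until the cut point
must lie in the current one, then locate the n-th newline there with one
forward match instead of a rindex loop. The script name trunc_tr.pl, the
4 KB block size, and the assumption that the file ends with a newline are
illustrative.

# trunc_tr.pl -- illustrative sketch, not from the thread
use strict;
use warnings;
use Fcntl qw(:seek);

my ($file, $lines) = @ARGV;    # e.g.: perl trunc_tr.pl tenmillion.txt 10000
my $blocksize = 4096;          # illustrative block size

open my $fh, '+<', $file or die "Failed to open '$file' for r/w: $!";
my $size = sysseek($fh, 0, SEEK_END) or die "Failed to seek to EOF: $!";

my $pos       = $size;         # start of the region not yet examined
my $remaining = $lines;        # newlines from the end still to skip over
                               # (assumes the file ends with a newline)

while ($pos > 0) {
    my $len = ($pos - 1) % $blocksize + 1;   # block ending at $pos, maybe partial
    $pos -= $len;
    sysseek($fh, $pos, SEEK_SET) or die "Failed to seek: $!";
    sysread($fh, my $block, $len) or die "Failed to read: $!";

    # count every newline in the block with one tr/// instead of a rindex loop
    my $n = ($block =~ tr/\n//);

    if ($n <= $remaining) {    # cut point lies further back; skip whole block
        $remaining -= $n;
        next;
    }

    # the cut point is in this block: keep everything through the
    # ($n - $remaining)-th newline, found with a single forward match
    my $keep = $n - $remaining;
    $block =~ /\A(?:[^\n]*\n){$keep}/ or die "Newline scan failed";
    truncate($fh, $pos + $+[0]) or die "Failed to truncate '$file': $!";
    exit 0;
}
die "Not enough lines in file\n";

Invoked the same way as the thread's scripts (perl trunc_tr.pl tenmillion.txt
10000), it does the same job as trunc.pl (keeping the trailing newline of the
last retained line) while pushing the per-newline work into two internal ops
per block; whether that actually beats the 0.013s figure above would need the
same timing test.
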