FAQ 5.4 How do I delete the last N lines from a file? [Perl]


From: PerlFAQ Server on 16 May 2010 00:00

This is an excerpt from the latest version perlfaq5.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .

--------------------------------------------------------------------

5.4: How do I delete the last N lines from a file?

(contributed by brian d foy)

The easiest conceptual solution is to count the lines in the file then
start at the beginning and print the number of lines (minus the last N)
to a new file.

Most often, the real question is how you can delete the last N lines
without making more than one pass over the file, or how to do it with a
lot of copying. The easy concept is the hard reality when you might have
millions of lines in your file.

One trick is to use "File::ReadBackwards", which starts at the end of
the file. That module provides an object that wraps the real filehandle
to make it easy for you to move around the file. Once you get to the
spot you need, you can get the actual filehandle and work with it as
normal. In this case, you get the file position at the end of the last
line you want to keep and truncate the file to that point:

    use File::ReadBackwards;

    my $filename = 'test.txt';
    my $Lines_to_truncate = 2;

    my $bw = File::ReadBackwards->new( $filename )
        or die "Could not read backwards in [$filename]: $!";

    my $lines_from_end = 0;
    until( $bw->eof or $lines_from_end == $Lines_to_truncate )
        {
        print "Got: ", $bw->readline;
        $lines_from_end++;
        }

    truncate( $filename, $bw->tell );

The "File::ReadBackwards" module also has the advantage of setting the
input record separator to a regular expression.

You can also use the "Tie::File" module which lets you access the lines
through a tied array. You can use normal array operations to modify your
file, including setting the last index and using "splice".
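
As a minimal sketch of the "Tie::File" approach, assuming a hypothetical helper named drop_last_lines (the name and the clamping behaviour are illustrative, not part of the FAQ text), setting the last index of the tied array shortens the file in place:

```perl
use strict;
use warnings;
use Tie::File;

# Hypothetical helper: remove the last $n lines of $filename in place.
# Returns the number of lines that remain.
sub drop_last_lines {
    my ( $filename, $n ) = @_;

    tie my @lines, 'Tie::File', $filename
        or die "Could not tie [$filename]: $!";

    # Clamp so that asking for more lines than exist just empties the file.
    my $keep = @lines > $n ? @lines - $n : 0;

    # Assigning to the last index shrinks the array, and with it the file.
    $#lines = $keep - 1;

    untie @lines;
    return $keep;
}
```

When $n is between 1 and the line count, "splice( @lines, -$n )" should have the same effect. Note that Tie::File has to read forward through the file to learn where the lines start, so this is convenient rather than fast on very large files.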

--------------------------------------------------------------------

The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
perlfaq.pod.

From: sln on 16 May 2010 20:26

On Sun, 16 May 2010 04:00:02 GMT, PerlFAQ Server <brian(a)theperlreview.com> wrote:

>This is an excerpt from the latest version perlfaq5.pod, which
>comes with the standard Perl distribution. These postings aim to
>reduce the number of repeated questions as well as allow the community
>to review and update the answers. The latest version of the complete
>perlfaq is at http://faq.perl.org .
>
>--------------------------------------------------------------------
>
>5.4: How do I delete the last N lines from a file?
>
> (contributed by brian d foy)
>
> The easiest conceptual solution is to count the lines in the file then
> start at the beginning and print the number of lines (minus the last N)
> to a new file.
>
> Most often, the real question is how you can delete the last N lines
> without making more than one pass over the file, or how to do it with a
> lot of copying. The easy concept is the hard reality when you might have
> millions of lines in your file.

I believe, "or how to do it with a lot of copying." was meant to be
"or how to do it without a lot of copying."

And, I'm not so sure you're not conflating "making more than one pass over the file"
with reading/writing the file more than one time.

>
> One trick is to use "File::ReadBackwards", which starts at the end of

Is this really a trick?

I can't remember if there is a truncate at file position primitive.
If I take a guess one way, I would say this approach would work as fast
as any:

create a line stack, the size of N
read each line, store line in stack, increment a counter
when the counter equals N, drop the oldest line into a new file, newest line to stack.
repeat until end of old file
close new file
delete old file
rename new file to old

voilà, truncation
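
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name drop_last_lines_by_copy and the ".tmp" scratch filename are hypothetical, and the separate "delete old file" step is folded into rename, which replaces the target on POSIX systems.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $old to a scratch file without its last $n
# lines, then rename the copy over the original.
sub drop_last_lines_by_copy {
    my ( $old, $n ) = @_;
    my $new = "$old.tmp";    # assumed scratch name; must not already be in use

    open my $in,  '<', $old or die "Could not read [$old]: $!";
    open my $out, '>', $new or die "Could not write [$new]: $!";

    my @held;                # the most recent lines, never more than $n after each pass
    while ( my $line = <$in> ) {
        push @held, $line;

        # Once more than $n lines are held, the oldest one cannot be among
        # the last $n lines of the file, so it is safe to write out.
        print {$out} shift @held if @held > $n;
    }

    # Whatever is still in @held is the tail of the file; it is dropped.
    close $in;
    close $out or die "Could not finish [$new]: $!";
    rename $new, $old or die "Could not rename [$new] to [$old]: $!";
    return;
}
```

This makes a single pass and keeps only N lines in memory, but it still reads and rewrites the whole file, so its cost grows with the file size rather than with N.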

-sln

From: Ralph Malph on 17 May 2010 12:43

On 5/16/2010 12:00 AM, PerlFAQ Server wrote:
> This is an excerpt from the latest version perlfaq5.pod, which
> comes with the standard Perl distribution. These postings aim to
> reduce the number of repeated questions as well as allow the community
> to review and update the answers. The latest version of the complete
> perlfaq is at http://faq.perl.org .
>
> --------------------------------------------------------------------
>
> 5.4: How do I delete the last N lines from a file?
>
> (contributed by brian d foy)
>
> The easiest conceptual solution is to count the lines in the file then
> start at the beginning and print the number of lines (minus the last N)
> to a new file.
>
> Most often, the real question is how you can delete the last N lines
> without making more than one pass over the file, or how to do it with a
> lot of copying. The easy concept is the hard reality when you might have
> millions of lines in your file.
>
> One trick is to use "File::ReadBackwards", which starts at the end of
> the file. That module provides an object that wraps the real filehandle
> to make it easy for you to move around the file. Once you get to the
> spot you need, you can get the actual filehandle and work with it as
> normal. In this case, you get the file position at the end of the last
> line you want to keep and truncate the file to that point:
>
> use File::ReadBackwards;
>
> my $filename = 'test.txt';
> my $Lines_to_truncate = 2;
>
> my $bw = File::ReadBackwards->new( $filename )
> or die "Could not read backwards in [$filename]: $!";
>
> my $lines_from_end = 0;
> until( $bw->eof or $lines_from_end == $Lines_to_truncate )
> {
> print "Got: ", $bw->readline;
> $lines_from_end++;
> }
>
> truncate( $filename, $bw->tell );
>
> The "File::ReadBackwards" module also has the advantage of setting the
> input record separator to a regular expression.
>
> You can also use the "Tie::File" module which lets you access the lines
> through a tied array. You can use normal array operations to modify your
> file, including setting the last index and using "splice".

Feeling bored, I compared the code in the faq with
some bash code that would achieve the same results.
I also ran some generic perl that did basically the same
thing as the shell script (code at bottom).
The test file was named 'puke'. Contents are the integers 0 through
999999, 1 million rows total. The test is to exclude the last 10000
lines. perl 5.10.1 on cygwin. Machine has 4GB RAM, dual core Intel.
Anyway, in this not really scientific test the faq method using
Uri's File::ReadBackwards module is the winner. I suppose this is the
expected result but I thought the shell code would be more
competitive.

$ time perl faq.pl > top_n-10000

real 0m0.219s
user 0m0.093s
sys 0m0.061s

$ time cat puke | wc -l | xargs echo -10000 + | bc \
| xargs echo head puke -n | sh > top_n-10000

real 0m0.312s
user 0m0.090s
sys 0m0.121s

$ time perl temp.pl > top_n-10000

real 0m0.858s
user 0m0.701s
sys 0m0.062s

-----------------
temp.pl
-----------------
use strict;
use warnings;

my $num_lines_exclude=10000;

open(FH, '<', "puke") or die $!;
my $line_count=0;
while(<FH>){
    $line_count++;
}
seek(FH, 0, 0);
my $lines_to_read=$line_count-$num_lines_exclude;
while($lines_to_read>0){
    my $line=<FH>;
    print $line;
    $lines_to_read--;
}

From: Willem on 17 May 2010 13:10

Ralph Malph wrote:
) Feeling bored I compared the code in the faq with
) some bash code that would achieve the same results.
) I also ran some generic perl that did basically the same
) thing as the shell script(code at bottom).
) The test file was named 'puke'. Contents are the integers 0 through
) 999999. 1 million rows total. The test is to excluded the last 10000
) lines. perl 5.10.1 on cygwin. machine has 4gb ram. dual core Intel.
) Anyway, in this not really scientific test the faq method using
) Uri's File::ReadBackwards module is the winner. I suppose this is the
) expected result but I thought the shell code would be more
) competitive.

Why ? AIUI, ReadBackwards never touches the beginning of the file, so that
should clearly lead to a lot less disk I/O.

I'm assuming the tests you ran may have had the file still in disk cache,
though, so that would make the difference a lot less significant, but
still ReadBackwards takes time proportional to the size of the removed bit,
while the rest take time proportional to the size of the whole file.

Have you also tried removing 10 lines from a million-line file ?
And for giggles, you could try a hand-rolled one that uses the functions
seek(), sysread() and truncate() to accomplish the job.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

From: Uri Guttman on 17 May 2010 13:34

>>>>> "W" == Willem <willem(a)turtle.stack.nl> writes:

W> Have you also tried removing 10 lines from a million-line file ?
W> And for giggles, you could try a hand-rolled one that uses the functions
W> seek(), sysread() and truncate() to accomplish the job.

ahem. that is what File::ReadBackwards does! it may be possible to hand
roll optimize it by removing some overhead, etc. but it was designed to
be very fast. your earlier point about how much to remove or skip is the
important one. truncating most of a large file will be slower but you
still need to count lines from the end. since you don't need to read
each line for this you could read large blocks, scan for newlines and
count them and then truncate to the desired point. File::ReadBackwards has the
overhead of splitting the blocks into lines and returning each one for
counting. but you always need to read the part you are truncating if you
are counting lines from the end.
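
A rough sketch of that block-scanning idea, using only seek, sysread and truncate style calls. Everything named here is an assumption for illustration: the helper truncate_last_lines, its optional block size argument, and the 64KB default are not from File::ReadBackwards or the FAQ.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: remove the last $n lines of $filename by scanning
# blocks from the end for newlines. Returns the new file size in bytes.
sub truncate_last_lines {
    my ( $filename, $n, $block_size ) = @_;
    $block_size ||= 64 * 1024;    # assumed default, not tuned
    return if $n <= 0;

    open my $fh, '+<', $filename or die "Could not open [$filename]: $!";
    binmode $fh;

    my $pos  = ( -s $fh ) || 0;   # bytes before $pos are not yet scanned
    my $want = $n;                # how many newlines to find from the end
    if ( $pos > 0 ) {
        sysseek( $fh, $pos - 1, SEEK_SET ) or die "Could not seek: $!";
        sysread( $fh, my $last, 1 ) or die "Could not read: $!";

        # A trailing newline ends the last line rather than starting a
        # new one, so one extra newline has to be skipped.
        $want++ if $last eq "\n";
    }

    my $cut  = 0;                 # fewer lines than asked for: empty the file
    my $seen = 0;
    BLOCK: while ( $pos > 0 ) {
        my $len = $pos < $block_size ? $pos : $block_size;
        $pos -= $len;
        sysseek( $fh, $pos, SEEK_SET ) or die "Could not seek: $!";
        my $got = sysread( $fh, my $buf, $len );
        die "Short read on [$filename]" unless defined $got && $got == $len;

        # Walk this block's newlines from right to left.
        my $i = $len;
        while ( ( $i = rindex( $buf, "\n", $i - 1 ) ) >= 0 ) {
            if ( ++$seen == $want ) {
                $cut = $pos + $i + 1;    # keep everything up to this newline
                last BLOCK;
            }
            last if $i == 0;             # nothing left to the left in this block
        }
    }

    truncate( $fh, $cut ) or die "Could not truncate [$filename]: $!";
    close $fh or die "Could not close [$filename]: $!";
    return $cut;
}
```

As with the module, only the part of the file being removed is read, so the work is proportional to the size of the truncated tail rather than the whole file.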

uri

--
Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
