From: Uri Guttman on 5 Jul 2010 15:28

>>>>> "bdf" == brian d foy <brian.d.foy(a)gmail.com> writes:

  bdf> In article <87mxu7j08z.fsf(a)quad.sysarch.com>, Uri Guttman
  bdf> <uri(a)StemSystems.com> wrote:

  >> other than file::slurp not being in core (and it should be! :), there is
  >> no reason to show the $/ = undef trick.

  bdf> That's a pretty big reason though.

the sysopen followed by sysread and -s is faster and less obscure, and you
already show that. the undef $/ trick is just poor coding imo. at the very
least, comment on the relative qualities of the methods.

my comments on the mmap are on point - it doesn't save ram and only wins
for random access.

tie::file is ok for some things, but for a simple read/modify/write it is
just as simple and faster to slurp/mung/write. you can work on an array in
both cases. one day i will release file::slurp with edit_file and
edit_file_lines, which will make that process even easier and faster.

uri

-- 
Uri Guttman  ------  uri(a)stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review, Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------
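A minimal sketch of the two whole-file reads being compared above: a sysread
sized with -s versus the undef $/ trick. This is illustrative code, not code
from the thread; the file name and sub names are made up.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = '/etc/passwd';    # any readable text file

    # sysread sized by -s: open, ask the size, pull it in with one call
    sub slurp_sysread {
        my ($path) = @_;
        open my $fh, '<', $path or die "can't open $path: $!";
        my $size = -s $fh;       # size in bytes of the open handle
        defined sysread( $fh, my $buf, $size )
            or die "sysread failed on $path: $!";
        return $buf;
    }

    # the $/ = undef trick: no input record separator, so <> returns it all
    sub slurp_undef_rs {
        my ($path) = @_;
        open my $fh, '<', $path or die "can't open $path: $!";
        local $/;                # undef $/ makes the whole file one "line"
        return <$fh>;
    }

    print length( slurp_sysread($file) ), " bytes via sysread\n";
    print length( slurp_undef_rs($file) ), " bytes via local \$/\n";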
From: Eric Pozharski on 6 Jul 2010 04:53

with <8739vy7wo4.fsf(a)quad.sysarch.com> Uri Guttman wrote:
>>>>>> "EP" == Eric Pozharski <whynot(a)pozharski.name> writes:
>
>   EP> Please reconsider your 'always slower':
>
> try the pass by scalar reference method of read_file.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw{ cmpthese timethese };
    use File::Slurp;

    my $fn = '/etc/passwd';
    cmpthese timethese -5, {
        code00 => sub { my $aa = read_file $fn; },
        code01 => sub { my $aa = read_file $fn, scalar_ref => 1; },
    };
    __END__
    Benchmark: running code00, code01 for at least 5 CPU seconds...
        code00:  6 wallclock secs ( 3.61 usr +  1.67 sys =  5.28 CPU) @ 33617.23/s (n=177499)
        code01:  6 wallclock secs ( 3.74 usr +  1.52 sys =  5.26 CPU) @ 33122.05/s (n=174222)
               Rate code01 code00
    code01 33122/s     --    -1%
    code00 33617/s     1%     --

What?  However... (s{/etc/passwd}{/boot/vmlinuz})

    Benchmark: running code00, code01 for at least 5 CPU seconds...
        code00:  6 wallclock secs ( 1.57 usr +  3.86 sys =  5.43 CPU) @ 222.65/s (n=1209)
        code01:  6 wallclock secs ( 0.23 usr +  5.04 sys =  5.27 CPU) @ 319.92/s (n=1686)
             Rate code00 code01
    code00  223/s     --   -30%
    code01  320/s    44%     --

That's pretty impressive.  Or not?  Look, if someone is going to play with
B<read_file>'s options, shouldn't he go with B<sysread> instead?  I can
hardly imagine someone trying to make B<read_file> as fast as possible
rather than making slurping itself fast.

> and check out the much more comprehensive benchmark script that comes
> with the module.

Yeah, cool stuff.  Although I wasn't told beforehand to make my terminal
250 columns wide, so it's still unreadable.

> and that was also redone in an unreleased version you can find on git
> at perlhunter.com/git.

Concentrate.  Talking about 'unreleased' is lame.

-- 
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom
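What the scalar_ref option is buying in the benchmark above: read_file hands
back a reference to the slurped contents instead of the string itself, so the
large-file case skips a copy on return. A brief usage sketch, assuming
File::Slurp is installed; the file name is illustrative.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Slurp qw( read_file );

    my $fn = '/etc/passwd';                     # any readable file

    my $copy = read_file($fn);                  # contents returned as a string
    my $ref  = read_file($fn, scalar_ref => 1); # reference, no copy on return

    print 'copied: ', length $copy, " bytes\n";
    print 'ref:    ', length $$ref, " bytes\n"; # dereference to reach the text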
From: brian d foy on 23 Jul 2010 14:56

In article <87mxu7j08z.fsf(a)quad.sysarch.com>, Uri Guttman
<uri(a)StemSystems.com> wrote:

> i disagree with that last point. mmap always needs virtual ram allocated
> for the entire file to be mapped. it only saves ram if you map part of
> the file into a smaller virtual window.

I haven't found that to be the case for program memory at least. If you
copy parts of the file you have to copy, but

> again, i disagree. you can easily benchmark slurping an array of lines
> and looping vs line by line reading.

Well, the tension there is the trade-off between space and memory. I could
make that more clear I guess.

I will look at some benchmarks, though, and see how that illuminates the
situation.
From: Uri Guttman on 23 Jul 2010 15:27

>>>>> "bdf" == brian d foy <brian.d.foy(a)gmail.com> writes:

  bdf> In article <87mxu7j08z.fsf(a)quad.sysarch.com>, Uri Guttman
  bdf> <uri(a)StemSystems.com> wrote:

  >> i disagree with that last point. mmap always needs virtual ram allocated
  >> for the entire file to be mapped. it only saves ram if you map part of
  >> the file into a smaller virtual window.

  bdf> I haven't found that to be the case for program memory at least. If
  bdf> you copy parts of the file you have to copy, but

mmap still needs address space in the program. it may be allocated with
malloc or even built in these days (haven't used it directly in decades! :).
real ram could be saved, but that is true for all virtual memory use. if you
seek into the mmap space and only read/write parts, the other sections won't
be touched. so the issue comes down to random access vs processing a whole
file. most uses of slurp are for processing a whole file, so i would lean in
that direction. someone sophisticated enough to use mmap directly for random
access should know the resource usage issues.

  >> again, i disagree. you can easily benchmark slurping an array of lines
  >> and looping vs line by line reading.

  bdf> Well, the tension there is the trade-off between space and memory. I
  bdf> could make that more clear I guess.

classic tradeoff. but these days almost all files you need to slurp are
small relative to ram (real and virtual) sizes. a 1 MB file is nothing on a
1 GB system, and few text files are even 1 MB. way back when, reading line
by line was almost required due to ram constraints, but ram sizes have far
outgrown the growth in file sizes. i just want to change the prevailing view
a bit. and as i have said, some things are only doable when you have the
full file in ram vs line by line.

  bdf> I will look at some benchmarks, though, and see how that illuminates
  bdf> the situation.

a simple one is slurping a simple config file and doing a basic parse on it
to make a hash. i have posted that code before. it would be easy to compare
that to a line by line version. the slurp version will blow it away, since
it reads the file in one call and parses it with one regex op, while the
line by line version has to read and parse each line individually - more
perl code and more perl guts code.

uri

-- 
Uri Guttman  ------  uri(a)stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review, Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------
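A hypothetical reconstruction of the comparison Uri describes, not his posted
code: parse a simple key=value config file into a hash, once by slurping and
once line by line. The config format, file name, and the Benchmark harness
are assumptions for illustration.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Slurp qw( read_file );
    use Benchmark qw( cmpthese );

    my $file = 'app.conf';    # assumed to exist, with lines like "name = value"

    # slurped: one read, one global regex pass over the whole text
    sub parse_slurp {
        my ($path) = @_;
        my $text = read_file($path);
        my %conf = $text =~ /^ \s* (\w+) \s* = \s* (.*?) \s* $/xmg;
        return \%conf;
    }

    # line by line: read, match, and store each line individually
    sub parse_lines {
        my ($path) = @_;
        open my $fh, '<', $path or die "can't open $path: $!";
        my %conf;
        while ( my $line = <$fh> ) {
            next unless $line =~ /^ \s* (\w+) \s* = \s* (.*?) \s* $/x;
            $conf{$1} = $2;
        }
        return \%conf;
    }

    cmpthese -3, {
        slurp => sub { parse_slurp($file) },
        lines => sub { parse_lines($file) },
    };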
From: Uri Guttman on 23 Jul 2010 18:15
>>>>> "TW" == Tim Watts <tw(a)dionic.net> writes: TW> Uri Guttman <uri(a)StemSystems.com> TW> wibbled on Sunday 04 July 2010 06:15 >> i disagree with that last point. mmap always needs virtual ram allocated >> for the entire file to be mapped. it only saves ram if you map part of >> the file into a smaller virtual window. the win of mmap is that it won't >> do the i/o until you touch a section. so if you want random access to >> sections of a file, mmap is a big win. if you are going to just process >> the whole file, there isn't any real win over File::Slurp TW> I think it is worth some clarification - at least under linux: TW> mmap requires virtual address space, not RAM per se, for the TW> initial mmap. TW> Obviously as soon as you try to read any part of the file, those TW> blocks must be paged in to actual RAM pages. TW> However, if you then ignore those pages and have not modified TW> them, the LRU recovery sweeper can just drop those pages. but a slurped file in virtual ram behaves the same way. it may be swapped in when you read in the file and process it but as soon as that is done, and you free the scalar in perl, perl can reuse the space. the virtual ram can't be given back to the os but the real ram is reused. TW> Compare to if you slurp the file into some virtual RAM that's been malloc'd: TW> The RAM pages are all dirty (because you copied data into them) - TW> so if the system needs to reduce the working page set, it will TW> have to page those out to swap rather than just dropping them - it TW> no longer has the knowledge that they are in practise backed by TW> the original file. that is true. the readonly aspect of a mmap slurp is a win. but given the small sizes of most files slurped it isn't that large a win. today we have 4k or larger page sizes and many files are smaller than that. ram and vram are cheap as hell so fighting for each byte is a long lost art that needs to die. :) uri -- Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com -- ----- Perl Code Review , Architecture, Development, Training, Support ------ --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com --------- |