From: Ilya Zakharevich on 23 Jul 2010 19:20

On 2010-07-23, Uri Guttman <uri(a)StemSystems.com> wrote:
> mmap still needs space in the program. it may be allocated with malloc
> or even builtin these days (haven't used it directly in decades! :). now
> real ram could be saved but that is true for all virtual memory use. if
> you seek into the mmap space and only read/write parts, then the other
> sections won't be touched. so the issue comes down to random access vs
> processing a whole file. most uses of slurp are for processing a whole
> file so i would lean in that direction. someone sophisticated enough to
> use mmap directly for random access should know the resource usage issues.

I do not see it mentioned in this discussion that (a good implementation
of) mmap() also semi-unmaps-when-needed. So as long as you have enough
*virtual* memory, mmap() behaves as a "smartish" intermediate ground
between reading-by-line and slurping.

And it "almost scales"; the limit is the virtual memory, so on 64-bit
systems it might even "absolutely scale".

Of course, this can severely limit the amount of free physical memory on
the computer, so it may harden the life of other programs, AND decrease
disk caching. However, if YOUR program is the only one on the CPU, and
THIS disk access is the only one in question, mmap() has a chance to be
a clear win...

Yours,
Ilya
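For readers who want to try the random-access case from Perl, here is a
minimal sketch assuming the CPAN module File::Map (the thread itself names
no particular module, and the file name, offset and length below are
placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Map qw(map_file);    # CPAN module; Sys::Mmap is an alternative

    my $file = 'big_data.bin';     # placeholder: some large file

    # Map the whole file read-only. This reserves virtual address space for
    # the entire file, but no page is read from disk until it is touched.
    map_file my $map, $file, '<';

    # Random access: only the pages covering this slice are faulted in;
    # untouched sections of the file never consume physical RAM.
    # (Offset and length are placeholders and must lie within the file.)
    my $chunk = substr $map, 1_000_000, 4096;

    # Because the mapped pages are clean (unmodified), the kernel can drop
    # them under memory pressure and re-read them from the file later.
    printf "read %d bytes from the middle of %s\n", length($chunk), $file;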
From: Peter J. Holzer on 25 Jul 2010 05:08

On 2010-07-23 22:15, Uri Guttman <uri(a)StemSystems.com> wrote:
>>>>>> "TW" == Tim Watts <tw(a)dionic.net> writes:
>
>  TW> Uri Guttman <uri(a)StemSystems.com>
>  TW> wibbled on Sunday 04 July 2010 06:15
>
>  >> i disagree with that last point. mmap always needs virtual ram allocated
>  >> for the entire file to be mapped. it only saves ram if you map part of
>  >> the file into a smaller virtual window. the win of mmap is that it won't
>  >> do the i/o until you touch a section. so if you want random access to
>  >> sections of a file, mmap is a big win. if you are going to just process
>  >> the whole file, there isn't any real win over File::Slurp
>
>  TW> I think it is worth some clarification - at least under linux:
>  TW> mmap requires virtual address space, not RAM per se, for the
>  TW> initial mmap.
>
>  TW> Obviously as soon as you try to read any part of the file, those
>  TW> blocks must be paged in to actual RAM pages.
>
>  TW> However, if you then ignore those pages and have not modified
>  TW> them, the LRU recovery sweeper can just drop those pages.
>
> but a slurped file in virtual ram behaves the same way. it may be
> swapped in when you read in the file and process it but as soon as that
> is done, and you free the scalar in perl, perl can reuse the space.

Well, *if* you free it. The nice thing about mmap is that RAM can be
reused even if you don't free it.

> the virtual ram can't be given back to the os

That depends on the malloc implementation. GNU malloc uses heap-based
allocation only for small chunks (less than 128 kB by default, I think),
but mmap-based allocation for larger chunks. So for a scalar larger than
128 kB, the space can and will be given back to the OS.

> but the real ram is reused.

>  TW> Compare to if you slurp the file into some virtual RAM that's been malloc'd:
>
>  TW> The RAM pages are all dirty (because you copied data into them) -
>  TW> so if the system needs to reduce the working page set, it will
>  TW> have to page those out to swap rather than just dropping them - it
>  TW> no longer has the knowledge that they are in practise backed by
>  TW> the original file.
>
> that is true. the readonly aspect of a mmap slurp is a win. but given
> the small sizes of most files slurped it isn't that large a win.

Yes. Mmap is only a win for large files. And I suspect "large" means
really large - somewhere on the same order as available RAM.

> today we have 4k or larger page sizes and many files are smaller than
> that. ram and vram are cheap as hell so fighting for each byte is a
> long lost art that needs to die. :)

I wish Perl would fight for each byte at the low level. The overhead for
each scalar, array element or hash element is enormous, and these really
add up if you have enough of them.

        hp
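A sketch of the slurp-then-free pattern being debated here, using
File::Slurp; the file name is a placeholder, and as noted above, whether
the freed space actually returns to the OS depends on the malloc
implementation in use:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Slurp qw(read_file);   # the module discussed in this thread

    my $file = 'large_input.txt';    # placeholder file name

    # Slurp: the whole file is copied into one scalar. Every page holding
    # $data is dirty, so under memory pressure it has to be written to swap
    # rather than simply dropped and re-read from the file (unlike mmap).
    my $data = read_file($file);

    printf "slurped %d bytes\n", length $data;

    # Freeing the scalar lets perl reuse the space. Whether it goes back to
    # the OS depends on the allocator: glibc malloc serves large requests
    # (above roughly 128 kB by default) via mmap(), which is unmapped again
    # when the memory is freed.
    undef $data;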
From: Peter J. Holzer on 25 Jul 2010 10:16

On 2010-07-25 13:35, Tim Watts <tw(a)dionic.net> wrote:
> Uri Guttman <uri(a)StemSystems.com>
> wibbled on Friday 23 July 2010 23:15
>> that is true. the readonly aspect of a mmap slurp is a win. but given
>> the small sizes of most files slurped it isn't that large a win.
>
> Yes that would be true of small files.
>
> But what if you're dealing with 1GB files or just multi MB files? This is
> extremely likely if you were processing video or scientific data (ignoring
> the fact that you probably wouldn't be using perl for either!)

Perl was used in the Human Genome Project.

        hp, who also routinely processes files in the range of a few GB.
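For files in the multi-GB range, the usual middle ground is neither
slurping nor mmap but plain line-by-line reading, where memory stays
bounded regardless of file size; a minimal sketch, with the file name and
the per-line work as placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = 'huge_dataset.txt';   # placeholder: a multi-GB text file

    open my $fh, '<', $file or die "cannot open $file: $!";

    # Memory use stays bounded by the longest line no matter how big the
    # file is, so this scales to files far larger than available RAM.
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /pattern/;   # placeholder per-line work
    }
    close $fh;

    print "$count matching lines\n";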
From: Uri Guttman on 25 Jul 2010 11:38

>>>>> "TW" == Tim Watts <tw(a)dionic.net> writes:

  TW> Uri Guttman <uri(a)StemSystems.com>
  TW> wibbled on Friday 23 July 2010 23:15

  >> that is true. the readonly aspect of a mmap slurp is a win. but given
  >> the small sizes of most files slurped it isn't that large a win. today
  >> we have 4k or larger page sizes and many files are smaller than
  >> that. ram and vram are cheap as hell so fighting for each byte is a long
  >> lost art that needs to die. :)

  TW> Yes that would be true of small files.

  TW> But what if you're dealing with 1GB files or just multi MB files?
  TW> This is extremely likely if you were processing video or
  TW> scientific data (ignoring the fact that you probably wouldn't be
  TW> using perl for either!)

and your point is? someone else already pointed out that perl was and is
used for genetic work. ever heard of bioperl? it is a very popular
package for biogenetics. look for the article about perl saving the
human genome project (it was written by the author of CGI.pm!).

of course those systems don't slurp in those enormous data files. but
they can always slurp in the smaller (for some definition of smaller)
config, control, and other files.

uri

--
Uri Guttman  ------  uri(a)stemsystems.com  --------  http://www.sysarch.com --
-----  Perl Code Review, Architecture, Development, Training, Support ------
---------  Gourmet Hot Cocoa Mix  ----  http://bestfriendscocoa.com ---------
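As an illustration of slurping the small stuff, one way to read a config
file whole with File::Slurp; the file name and the "key = value" syntax
are invented for the example:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Slurp qw(read_file);

    # A small config file fits in a page or two, so slurping it whole is
    # simpler than streaming it and costs essentially nothing.
    my @lines = read_file('app.conf');   # placeholder config file
    chomp @lines;

    # placeholder parsing: "key = value" lines; blank lines and '#' comments ignored
    my %config;
    for my $line (@lines) {
        next if $line =~ /^\s*(?:#|$)/;
        if (my ($key, $value) = $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/) {
            $config{$key} = $value;
        }
    }

    printf "loaded %d config keys\n", scalar keys %config;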
From: Uri Guttman on 25 Jul 2010 17:51
>>>>> "TW" == Tim Watts <tw(a)dionic.net> writes: TW> BTW - I am surprised the genome project was done in perl. I TW> *would* have thought, even from a perl fanboi perspective, that C TW> would have been somewhat faster and the amount of data would have TW> made it worth optimising the project even at the expense of TW> simplicity. I shall have to read up on that. the artical i referred to can likely be found. it wasn't that the whole project was done in perl. the issue was worldwide they ended up with about 14 different data formats and they couldn't share it with each other. so this one guy (as i said author of cgi.pm and several perl books) wrote modules to convert each format to/from a common format which allowed full sharing of data. that 'saved' the project from its babel hell. since then, perl is a major language used in biogen both for having bioperl and for its great string and regex support. c sucks for both of those and its faster run speed loses out to perl's much better development time. uri -- Uri Guttman ------ uri(a)stemsystems.com -------- http://www.sysarch.com -- ----- Perl Code Review , Architecture, Development, Training, Support ------ --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com --------- |