From: Spud on 25 Mar 2010 12:26

Can someone suggest a good approach for caching parts of a large data file?

I've got a number of very large (multi-gigabyte) files. Each file is
organized into pages of variable length. Pages can be small (< 1 KB) or
much larger (> 1 MB). The organization of the file is similar to a B-tree,
and so is the access pattern: you start at the root, skip to pages
scattered in different places in the file, and then when you reach a leaf
node you read sequentially, possibly over many megabytes.

I need some kind of caching system so I'm not hitting the disk on every
access. Ideally, I'd be able to use a kind of buffered RandomAccessFile
object which would magically cache chunks of the file appropriately.

I know this problem is similar to the one databases face: how to cache
pages optimally. There's a difference, though, because my pages are of
variable length, while pages in a database are usually of fixed length, so
a fixed pool of uniformly sized pages organized in an LRU cache won't work.

Anyone have any insights on what I should do?
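For illustration only, here is one rough sketch of the kind of cache the
question describes: an LRU map keyed by file offset that evicts by total
cached bytes rather than by page count, so variable-length pages fit
naturally. The byte[] page representation and the size limit are
assumptions made up for this example, not anything proposed in the thread.

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of an LRU cache for variable-length pages.  Eviction is driven
 * by the total number of cached bytes, not the number of entries, so a
 * mix of tiny and multi-megabyte pages can share the same pool.
 */
public class ByteBoundedLruCache {

    private final long maxBytes;
    private long currentBytes = 0;

    // accessOrder = true makes the LinkedHashMap iterate from
    // least-recently used to most-recently used.
    private final LinkedHashMap<Long, byte[]> pages =
            new LinkedHashMap<Long, byte[]>(16, 0.75f, true);

    public ByteBoundedLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Returns the cached page at this file offset, or null on a miss. */
    public synchronized byte[] get(long offset) {
        return pages.get(offset);   // also marks the entry as recently used
    }

    public synchronized void put(long offset, byte[] page) {
        byte[] old = pages.put(offset, page);
        if (old != null) {
            currentBytes -= old.length;
        }
        currentBytes += page.length;
        evictIfNeeded();
    }

    private void evictIfNeeded() {
        // Drop least-recently used pages until we are back under the limit.
        Iterator<Map.Entry<Long, byte[]>> it = pages.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<Long, byte[]> eldest = it.next();
            currentBytes -= eldest.getValue().length;
            it.remove();
        }
    }
}

A page larger than the whole budget simply falls out of the cache again,
which is usually what you want for the very large leaf reads.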
From: Knute Johnson on 25 Mar 2010 12:45

On 3/25/2010 9:26 AM, Spud wrote:
> Can someone suggest a good approach for caching parts of a large data file?
>
> I've got a number of very large (multi-gigabyte) files. Each file is
> organized into pages of variable length. Pages can be small (< 1 KB) or
> much larger (> 1 MB). The organization of the file is similar to a B-tree,
> and the access pattern is similar: you start at the root, skip to pages
> scattered in different places in the file, and then when you get to a
> leaf node you read sequentially, possibly over many megabytes.
>
> I need some kind of caching system so I'm not hitting the disk on every
> access. Ideally, I'd be able to use a kind of buffered RandomAccessFile
> object which would magically cache chunks of the file appropriately.
>
> I know this problem is similar to the problem databases have: how to
> cache pages optimally. There's a difference, though, because my pages
> are of variable length and pages in a database are usually of fixed
> length. So a fixed pool of uniformly-sized pages organized in an LRU
> cache won't work.
>
> Anyone have any insights on what I should do?

If you are sure that RandomAccessFile won't work, then I'd look at
FileChannel.map(). The time required to read a few MB off a disk these days
is not that much.

--
Knute Johnson
email s/nospam/knute2010/
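A minimal sketch of the FileChannel.map() suggestion (not Knute's own
code): it maps only the region holding one page read-only and copies it
out, letting the OS page cache do the buffering. The file name, offset,
and length parameters are placeholders for the example.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedPageReader {

    /** Reads one page of the file by memory-mapping just that region. */
    public static byte[] readPage(String fileName, long offset, int length)
            throws Exception {
        RandomAccessFile raf = new RandomAccessFile(fileName, "r");
        try {
            FileChannel channel = raf.getChannel();
            // A single MappedByteBuffer is limited to Integer.MAX_VALUE
            // bytes, so a multi-gigabyte file is mapped region by region.
            MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] page = new byte[length];
            buf.get(page);
            return page;
        } finally {
            raf.close();   // also closes the channel
        }
    }
}

In practice you would keep the RandomAccessFile (or FileChannel) open and
reuse it across reads rather than opening it per page.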
From: Lew on 25 Mar 2010 13:24

Spud wrote:
>> Can someone suggest a good approach for caching parts of a large data file?
>>
>> I've got a number of very large (multi-gigabyte) files. Each file is
>> organized into pages of variable length. Pages can be small (< 1 KB) or
>> much larger (> 1 MB). The organization of the file is similar to a B-tree,
>> and the access pattern is similar: you start at the root, skip to pages
>> scattered in different places in the file, and then when you get to a
>> leaf node you read sequentially, possibly over many megabytes.
>>
>> I need some kind of caching system so I'm not hitting the disk on every
>> access. Ideally, I'd be able to use a kind of buffered RandomAccessFile
>> object which would magically cache chunks of the file appropriately.

Knute Johnson wrote:
> If you are sure that RandomAccessFile won't work then I'd look at
> FileChannel.map(). The time required to read a few MB off a disk these
> days is not that much.

To speed up those large sequential reads, i.e., after finding the starting
point of one of those "pages", use a disk with a large onboard buffer
(common disks have buffers ranging from 2 MB to 8 MB) and a high rotation
speed (10K or 15K rpm as opposed to 7200 or 5400).
<http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate>

Bear in mind that the OS also caches disk reads. There are a lot of factors
influencing read speed: disk-to-buffer transfer rate, buffer-to-mainboard
transfer rate, the OS's disk-buffering strategy, use of the disk by other
processes (including by the OS itself for memory swap space), and more.
Anything you do by hand to try to improve the situation risks having the
opposite effect.

As with all attempts to optimize, remember the famous maxims:

1. Don't do it.
2. Don't do it yet.
3. Measure performance before and after, under realistic conditions.
4. Everything you think you know, or you assume, or you predict, is wrong.
5. It will work differently in the field.
6. Nothing can fix a slow algorithm.

--
Lew
From: Martin Gregorie on 25 Mar 2010 13:50

On Thu, 25 Mar 2010 10:24:38 -0700, Lew wrote:
> Spud wrote:
>>> Can someone suggest a good approach for caching parts of a large data
>>> file?

The fastest sequential read I've ever seen was from a RAID 5 disk set using
a hardware RAID controller. It did sequential reads at something like three
times the speed expected from a single disk of the same type, and would
write at about the same speed as a single disk. The speed gain seemed to be
because the controller was bright enough to do look-ahead caching.

We measured this performance by serially reading several thousand fairly
small records from a sequential file. FWIW this was measured on an IBM
AS/400, and IIRC it used 5-disk RAID sets. I expected good sequential read
performance because RAID 5 was the only storage method the AS/400 used.
However, I didn't expect the read speed to be 'that' good!

--
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |
From: Wojtek on 25 Mar 2010 14:55

Spud wrote:
> Anyone have any insights on what I should do?

Maybe do an initial pass through the file, following only the tree indexes,
and save each index plus its file offset into a TreeMap. Then use the
TreeMap to look up the index and use the corresponding file offset to read
the data.

It won't speed up the read itself, but it should speed up the tree
traversal. Plus you could cache the page as you do a read (into the class
holding the file offset). Mind your memory usage!

--
Wojtek :-)
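A rough sketch of Wojtek's TreeMap idea, not his code: one pass over the
tree pages collects key-to-offset pairs, after which lookups seek straight
to the data. The String key type and the fixed pageLength parameter are
assumptions invented for the example.

import java.io.RandomAccessFile;
import java.util.TreeMap;

public class OffsetIndex {

    // Maps each index key to the file offset of its page.
    private final TreeMap<String, Long> offsets = new TreeMap<String, Long>();

    /** Called during the initial pass over the tree pages. */
    public void add(String key, long fileOffset) {
        offsets.put(key, fileOffset);
    }

    /** Reads the page for the given key, or returns null if it is unknown. */
    public byte[] readPage(RandomAccessFile file, String key, int pageLength)
            throws Exception {
        Long offset = offsets.get(key);
        if (offset == null) {
            return null;
        }
        file.seek(offset);              // jump straight to the page
        byte[] page = new byte[pageLength];
        file.readFully(page);
        return page;
    }
}

Combined with a byte-bounded cache like the earlier sketch, the in-memory
index handles the "skip around the tree" part while the cache absorbs
repeated leaf reads.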