From: Spud on
Can someone suggest a good approach for caching parts of a large data file?

I've got a number of very large (multi-gigabyte) files. Each file is
organized into pages of variable length. Pages can be small (< 1k) or
much larger (> 1mb). The organization of the file is similar to a btree,
and the access pattern is similar: you start at the root, skip to pages
scattered in different places in the file, and then when you get to a
leaf node you read sequentially, possibly over many megabytes.

I need some kind of caching system so I'm not hitting the disk on every
access. Ideally, I'd be able to use a kind of buffered RandomAccessFile
object which would magically cache hunks of the file appropriately.

I know this problem is similar to the problem databases have: how to
cache pages optimally. There's a difference, though, because my pages
are of variable length and pages in a database are usually of fixed
length. So a fixed pool of uniformly-sized pages organized in an LRU
cache won't work.
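
The closest thing I can think of is an LRU keyed by page offset but
bounded by total bytes rather than by entry count. A rough, untested
sketch of what I mean:

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache of variable-length pages, bounded by total bytes
// rather than by number of entries.
class PageCache {
    private final long maxBytes;
    private long currentBytes;
    private final LinkedHashMap<Long, byte[]> pages =
        new LinkedHashMap<Long, byte[]>(16, 0.75f, true); // access order

    PageCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    byte[] get(long offset) {
        return pages.get(offset);
    }

    void put(long offset, byte[] page) {
        byte[] old = pages.put(offset, page);
        if (old != null)
            currentBytes -= old.length;   // replaced an existing page
        currentBytes += page.length;
        // evict least-recently-used pages until we're under budget
        Iterator<Map.Entry<Long, byte[]>> it = pages.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            currentBytes -= it.next().getValue().length;
            it.remove();
        }
    }
}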

Anyone have any insights on what I should do?
From: Knute Johnson on
On 3/25/2010 9:26 AM, Spud wrote:
> Can someone suggest a good approach for caching parts of a large data file?
>
> I've got a number of very large (multi-gigabyte) files. Each file is
> organized into pages of variable length. Pages can be small (< 1k) or
> much larger (> 1mb). The organization of the file is similar to a btree,
> and the access pattern is similar: you start at the root, skip to pages
> scattered in different places in the file, and then when you get to a
> leaf node you read sequentially, possibly over many megabytes.
>
> I need some kind of caching system so I'm not hitting the disk on every
> access. Ideally, I'd be able to use a kind of buffered RandomAccessFile
> object which would magically cache hunks of the file appropriately.
>
> I know this problem is similar to the problem databases have: how to
> cache pages optimally. There's a difference, though, because my pages
> are of variable length and pages in a database are usually of fixed
> length. So a fixed pool of uniformly-sized pages organized in an LRU
> cache won't work.
>
> Anyone have any insights on what I should do?

If you are sure that RandomAccessFile won't work then I'd look at
FileChannel.map(). The time required to read a few MB off a disk these
days is not that much.
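
Something along these lines, as a rough, untested sketch (the class and
method names are just made up). Each call maps only the region covering
one page and lets the OS page cache do the actual caching:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedPageReader {
    private final FileChannel channel;

    public MappedPageReader(String path) throws IOException {
        channel = new RandomAccessFile(path, "r").getChannel();
    }

    // Map just the region covering one page; a single mapping is
    // limited to 2GB anyway, so you couldn't map a whole
    // multi-gigabyte file in one go.
    public MappedByteBuffer page(long offset, int length) throws IOException {
        return channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
    }
}

You'd probably want to hang on to the most recently used
MappedByteBuffers rather than re-mapping on every access.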

--

Knute Johnson
email s/nospam/knute2010/

From: Lew on
Spud wrote:
>> Can someone suggest a good approach for caching parts of a large data file?
>
>> I've got a number of very large (multi-gigabyte) files. Each file is
>> organized into pages of variable length. Pages can be small (< 1k) or
>> much larger (> 1mb). The organization of the file is similar to a btree,
>> and the access pattern is similar: you start at the root, skip to pages
>> scattered in different places in the file, and then when you get to a
>> leaf node you read sequentially, possibly over many megabytes.
>
>> I need some kind of caching system so I'm not hitting the disk on every
>> access. Ideally, I'd be able to use a kind of buffered RandomAccessFile
>> object which would magically cache hunks of the file appropriately.
>

Knute Johnson wrote:
> If you are sure that RandomAccessFile won't work then I'd look at
> FileChannel.map().  The time required to read a few MB off a disk these
> days is not that much.
>

To speed up those large sequential reads, i.e., after finding the
starting point of one of those "pages", use a disk with a large
onboard buffer (common disks have buffers ranging from 2MB to 8MB) and
high rotation speed (10K or 15K rpm as opposed to 7200 or 5400).

<http://en.wikipedia.org/wiki/Hard_disk_drive#Data_transfer_rate>

Bear in mind that the OS also caches disk reads. There are a lot of
factors influencing read speed: disk-to-buffer transfer rate, buffer-
to-mainboard transfer rate, OS disk-buffering strategy, use of the
disk by other processes, including by the OS itself for memory swap
space, and more. Anything you do by hand to try to improve the
situation risks having the opposite effect.

As with all attempts to optimize, remember the famous maxims:
1. Don't do it.
2. Don't do it yet.
3. Measure performance before and after, under realistic conditions.
4. Everything you think you know, or you assume, or you predict, is
wrong.
5. It will work differently in the field.
6. Nothing can fix a slow algorithm.

--
Lew
From: Martin Gregorie on
On Thu, 25 Mar 2010 10:24:38 -0700, Lew wrote:

> Spud wrote:
>>> Can someone suggest a good approach for caching parts of a large data
>>> file?
>>
The fastest sequential read I've ever seen was from a RAID 5 disk set
using a hardware RAID controller. This did sequential reads at something
like 3 times the speed expected from a single disk of the same type and
would write at about the same speed as a single disk. The speed gain
seemed to be because the controller was bright enough to do look-ahead
caching. We measured this performance by serially reading several
thousand fairly small records from a sequential file.

FWIW this was measured on an IBM AS/400, and IIRC it used 5-disk RAID
sets. I expected good sequential reading performance because RAID 5 was
the only storage method the AS/400 used. However, I didn't expect the
read speed to be 'that' good!


--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |
From: Wojtek on
Spud wrote:
> Anyone have any insights on what I should do?

Maybe do an initial pass through the file, following only the tree
indexes, and save each key plus its file offset into a TreeMap. Then
use the TreeMap to look up a key and use the corresponding file
offset to read the data.

Won't speed up the read, but should speed up the tree following.

Plus you could cache the page as you do a read (into the class holding
the file offset).
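
A rough, untested sketch of what I mean (all the names are made up):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.TreeMap;

class PageInfo {
    final long offset;   // where the page starts in the file
    final int length;    // page length in bytes
    byte[] cached;       // filled in lazily on first read

    PageInfo(long offset, int length) {
        this.offset = offset;
        this.length = length;
    }
}

class PageIndex {
    private final TreeMap<String, PageInfo> index =
        new TreeMap<String, PageInfo>();
    private final RandomAccessFile file;

    PageIndex(RandomAccessFile file) {
        this.file = file;
    }

    // called during the initial pass over the tree
    void put(String key, long offset, int length) {
        index.put(key, new PageInfo(offset, length));
    }

    // later lookups go through the in-memory index
    byte[] read(String key) throws IOException {
        PageInfo info = index.get(key);
        if (info == null)
            return null;
        if (info.cached == null) {   // cache the page on first read
            byte[] buf = new byte[info.length];
            file.seek(info.offset);
            file.readFully(buf);
            info.cached = buf;
        }
        return info.cached;
    }
}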

Mind your memory usage!

--
Wojtek :-)

